GridSearchCV: Found input variables with inconsistent numbers of samples: [1176, 294]

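Hello everyone, I'd like some help with a problem that has been bothering me a lot. When I call GridSearchCV.fit(X_train, y_train), I get this error:  
ValueError: Found input variables with inconsistent numbers of samples: [1176, 294]  
This puzzles me, because I used the same method recently without any error, and I wrote the failing code the same way as last time. I tried the reshape and to_frame fixes suggested online, but neither helped and the same error is still raised. I hope someone can help me figure this out. Thank you very much! The failing code is below (the ipynb file is attached):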
import time

from sklearn.model_selection import GridSearchCV


def optimize_train_test_model(X_train, y_train, X_test, y_test, model_name, model, param_range):
    """
        Train and test a model.

        Parameters:
        model_name: name of the model
        model: the estimator to train
        param_range: parameter values to search over

        Returns:
        the fitted GridSearchCV object

    """
    print('Training {}'.format(model_name))

    # TODO
    # Build a GridSearchCV object named clf from the given model,
    # with cv=10, scoring='roc_auc', and param_range as the parameter grid
    clf = GridSearchCV(model, param_grid=param_range, cv=10, scoring='roc_auc')

    start = time.time()
    clf.fit(X_train, y_train)
    # timing
    end = time.time()
    duration = end - start

    # TODO
    # Evaluate the model: store its scores on the training and test sets
    # in train_score and test_score
    train_score = clf.best_score_
    test_score = clf.best_estimator_.score(X_test, y_test)

    print('Training AUC: {:.3f}'.format(train_score))
    print('Test AUC: {:.3f}'.format(test_score))
    print('Best parameters: {}'.format(clf.best_params_))

    print('Training time: {:.4f}s'.format(duration))
    print('###########################################')

    return clf
I then call this function, and that is where the error is raised (the function itself should be fine):
model_name_param_dict = {
                         'DT': (DecisionTreeClassifier(),
                                {'max_depth': parameters}),
                         }
# for model_name, (model, param) in model_name_param_dict.items():
#     print(model_name, model, param)
model_name = list(model_name_param_dict.keys())[0]
model, param = model_name_param_dict[model_name]
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(attrition_comb, y, test_size=0.2, random_state=1)

# # TODO
# # Call optimize_train_test_model to retrain, using the decision tree and parameters defined in model_name_param_dict above, and store the result in gscv
gscv = optimize_train_test_model(X_train, X_test, y_train, y_test, model_name = model_name, model = model, param_range = param )
Here is the error:
Training DT
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-89-79000dea2265> in <module>
     12 # # TODO
     13 # # Call optimize_train_test_model to retrain, using the decision tree and parameters defined in model_name_param_dict above, and store the result in gscv
---> 14 gscv = optimize_train_test_model(X_train, X_test, y_train, y_test, model_name = model_name, model = model, param_range = param )

<ipython-input-83-fb64dcf9be5d> in optimize_train_test_model(X_train, y_train, X_test, y_test, model_name, model, param_range)
     25 
     26     start = time.time()
---> 27     clf.fit(X_train, y_train)
     28     # timing
     29     end = time.time()

~\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    672             refit_metric = 'score'
    673 
--> 674         X, y, groups = indexable(X, y, groups)
    675         n_splits = cv.get_n_splits(X, y, groups)
    676 

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
    258         else:
    259             result.append(np.array(X))
--> 260     check_consistent_length(*result)
    261     return result
    262 

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    233     if len(uniques) > 1:
    234         raise ValueError("Found input variables with inconsistent numbers of"
--> 235                          " samples: %r" % [int(l) for l in lengths])
    236 
    237 

ValueError: Found input variables with inconsistent numbers of samples: [1176, 294]

柯蓝iop

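I ran into this problem today as well. It is probably an issue with the format of X_train, X_test, y_train, and y_test. Converting the data after splitting the dataset should fix it: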
X_train = X_train.iloc[:, 0].values
X_test = X_test.iloc[:, 0].values
y_train = y_train.iloc[:, 0].values
y_test = y_test.iloc[:, 0].values
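
One more thing worth checking, judging from the traceback in the question: the function is defined as optimize_train_test_model(X_train, y_train, X_test, y_test, ...), but it is called with the positional order X_train, X_test, y_train, y_test. If that is what happens, the parameter named y_train inside the function actually receives X_test, and clf.fit(X_train, y_train) sees 1176 rows against 294, which matches the numbers in the message. The sketch below only illustrates that binding. It is not the original notebook: the function body is a stub, the lists are placeholder data sized after the error message, and model=None and param_range={} are dummy values.

```python
import inspect


def optimize_train_test_model(X_train, y_train, X_test, y_test,
                              model_name, model, param_range):
    """Stub with the same parameter order as the function in the question."""
    # Lengths that clf.fit(X_train, y_train) would see in the real function.
    return len(X_train), len(y_train)


# Placeholder data with the split sizes from the error message (1176 / 294).
X_train, X_test = [[0.0]] * 1176, [[0.0]] * 294
y_train, y_test = [0] * 1176, [0] * 294

# Bind the arguments in the positional order used in the question:
# X_train, X_test, y_train, y_test.
sig = inspect.signature(optimize_train_test_model)
bound = sig.bind(X_train, X_test, y_train, y_test,
                 model_name="DT", model=None, param_range={})
as_called = {name: len(bound.arguments[name])
             for name in ("X_train", "y_train", "X_test", "y_test")}
print(as_called)  # the parameter y_train ends up holding X_test (294 rows)

# Passing by keyword makes the binding independent of argument order.
fit_lengths = optimize_train_test_model(
    X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
    model_name="DT", model=None, param_range={})
print(fit_lengths)
```

If this is indeed the cause, calling optimize_train_test_model(X_train, y_train, X_test, y_test, ...) or passing the four arrays by keyword should be enough, with no format conversion needed. A side note on the function itself: clf.best_estimator_.score(X_test, y_test) returns accuracy for a classifier, whereas clf.score(X_test, y_test) uses the scoring passed to GridSearchCV (roc_auc here), so the latter is probably what the "test AUC" printout intends.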
