How do the results of three regression models compare?
In the previous section I pasted baseline models for three kinds of regression prediction, along with the evaluation method. Today I actually run them and compare the results; nothing changes much, so treat this as a simple follow-up.
(I'm on Windows and haven't set up a Linux VM — no need to complain about that. Installing Linux would mean reinstalling all the packages from scratch, which is far too tedious for someone who mostly just calls libraries.)
1. First, the simple linear regression
# -*- coding: utf-8 -*-
# Imports
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the data
train = pd.read_csv("data/train1.csv")
test = pd.read_csv("data/test1.csv")
submit = pd.read_csv("data/sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Pull the target y out of the training set
y_train = train.pop('y')

# Fit a linear regression model
reg = LinearRegression()
reg.fit(train, y_train)
#y_pred = reg.predict(test)
# Clip negative predictions to 0
#y_pred = map(lambda x: x if x >= 0 else 0, y_pred)
# Write the predictions to my_linearRegression_prediction22.csv
#submit['y'] = y_pred
#submit.to_csv('data/my_linearRegression_prediction22.csv', index=False)
print reg.coef_

from sklearn import metrics
import numpy as np
rmse = np.sqrt(metrics.mean_squared_error(y_train, reg.predict(train)))
print 'linearRegression rmse is %f' % rmse
# linearRegression rmse is 38.920108
The RMSE comes out around 38.9 (see the previous section for how it is computed); of the three, plain linear regression is the worst.
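For convenience, the RMSE calculation each script repeats can be wrapped in a small helper. This is just a sketch — the name rmse_of is my own, not from the previous section:

import numpy as np
from sklearn import metrics

def rmse_of(model, X, y):
    # Root mean squared error of an already-fitted model on (X, y)
    return np.sqrt(metrics.mean_squared_error(y, model.predict(X)))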
2. Decision tree prediction
# -*- coding: utf-8 -*-
# Imports
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Read the data
train = pd.read_csv("data/train1.csv")
test = pd.read_csv("data/test1.csv")
submit = pd.read_csv("data/sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Pull the target y out of the training set
y_train = train.pop('y')

# Fit a depth-limited decision tree
reg = DecisionTreeRegressor(max_depth=5)
reg.fit(train, y_train)
#y_pred = reg.predict(test)
# Write the predictions to my_tdr_prediction22.csv
#submit['y'] = y_pred
#submit.to_csv('data/my_tdr_prediction22.csv', index=False)

from sklearn import metrics
import numpy as np
rmse = np.sqrt(metrics.mean_squared_error(y_train, reg.predict(train)))
print 'DecisionTreeRegressor rmse is %f' % rmse
# rmse is 27.865298
The RMSE is about 27.9, better than the linear model.
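The max_depth=5 above is a fixed choice. As a rough sketch (the depth values are my own picks, and train/y_train are assumed loaded as above), a small sweep shows training RMSE keeps falling as the tree gets deeper — which is exactly why a held-out set would give a fairer comparison than training RMSE:

import numpy as np
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor

for depth in (2, 5, 8, 12):
    # Deeper trees fit the training data more closely, so training RMSE only drops
    reg = DecisionTreeRegressor(max_depth=depth)
    reg.fit(train, y_train)
    r = np.sqrt(metrics.mean_squared_error(y_train, reg.predict(train)))
    print('max_depth=%d  training rmse %.3f' % (depth, r))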
3. XGBoost
# -*- coding: utf-8 -*-
# Run under Python 3, where xgboost is installed
# Imports
import pandas as pd
from xgboost import XGBRegressor

# Read the data
train = pd.read_csv("data/train1.csv")
test = pd.read_csv("data/test1.csv")
submit = pd.read_csv("data/sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Pull the target y out of the training set
y_train = train.pop('y')

# Fit an xgboost regressor with default parameters
reg = XGBRegressor()
reg.fit(train, y_train)
#y_pred = reg.predict(test)
# Write the predictions to my_XGB_prediction22.csv
#submit['y'] = y_pred
#submit.to_csv('data/my_XGB_prediction22.csv', index=False)

from sklearn import metrics
import numpy as np
rmse = np.sqrt(metrics.mean_squared_error(y_train, reg.predict(train)))
print(rmse)
# 18.5718185229
Note: this package is not available in my Python 2 setup, so this script runs under Python 3.x — you can see that print is now a function.
The RMSE is about 18.6, the best of the three models.
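Putting the three side by side, a minimal loop (assuming train and y_train are loaded as above, and running under Python 3 so the xgboost import works) reproduces the whole comparison in one place. Keep in mind these are training-set RMSEs, so the ranking partly reflects how flexibly each model can fit the training data:

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

models = [
    ('linearRegression', LinearRegression()),
    ('decisionTree', DecisionTreeRegressor(max_depth=5)),
    ('xgboost', XGBRegressor()),
]
for name, reg in models:
    # Fit each model and report its RMSE on the training set
    reg.fit(train, y_train)
    r = np.sqrt(metrics.mean_squared_error(y_train, reg.predict(train)))
    print('%s rmse %.3f' % (name, r))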
4. A feature-importance analysis with a tree ensemble
import pandas as pd

# Read the data
train = pd.read_csv("data/train1.csv")
test = pd.read_csv("data/test1.csv")
submit = pd.read_csv("data/sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Pull the target y out of the training set
y_train = train.pop('y')

from sklearn import metrics
from sklearn.ensemble import ExtraTreesRegressor  # regressor variant, since y is continuous
model = ExtraTreesRegressor()
model.fit(train, y_train)
# Display the relative importance of each attribute
print model.feature_importances_
The first variable has the lowest importance score, i.e. it carries the least information, so consider dropping it.
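To see which column that first score belongs to, pair the importances with the column names. A minimal sketch, assuming model and train from step 4 are still in scope (the Series-based sorting is my own addition):

import pandas as pd

imp = pd.Series(model.feature_importances_, index=train.columns)
print(imp.sort_values())
# The lowest-scoring column ('city' here, per step 5) is the candidate to drop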
5. Dropping the city variable
The score did not improve, so the original set of variables is kept; the final predictions are then produced with xgboost.
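A minimal sketch of that last step, under the same setup as the xgboost script in section 3 (the output filename reuses my_XGB_prediction22.csv from there):

import numpy as np
from sklearn import metrics
from xgboost import XGBRegressor

# Try fitting without the city column
reg = XGBRegressor()
reg.fit(train.drop('city', axis=1), y_train)
print(np.sqrt(metrics.mean_squared_error(
    y_train, reg.predict(train.drop('city', axis=1)))))

# The RMSE is no better, so refit on the full feature set and predict
reg = XGBRegressor()
reg.fit(train, y_train)
submit['y'] = reg.predict(test)
submit.to_csv('data/my_XGB_prediction22.csv', index=False)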