Which algorithms can identify whether an insurance user is fraudulent?
Posted by a forum user on 2022-05-03 06:06
1 answer in total

Helpful user, answered 2023-10-12 06:44
1. Data type conversion: since "gender" actually encodes male and female users, convert the "gender" variable to a factor, with level "F" replacing 1 and "M" replacing 2; also convert the "fraudRisk" variable to a factor.
Answer: first import the data into R and inspect its dimensions and structure:
> # Import the data
> ccFraud <- read.csv("ccFraud.csv")
> # Check the dimensions and structure
> str(ccFraud)
'data.frame':	10000000 obs. of  9 variables:
 $ custID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender      : int  1 2 2 1 1 2 1 1 2 1 ...
 $ state       : int  35 2 2 15 46 44 3 10 32 23 ...
 $ cardholder  : int  1 1 1 1 1 2 1 1 1 1 ...
 $ balance     : int  3000 0 0 0 0 5546 2000 6016 2428 0 ...
 $ numTrans    : int  4 9 27 12 11 21 41 20 4 18 ...
 $ numIntlTrans: int  14 0 9 0 16 0 0 3 10 56 ...
 $ creditLine  : int  2 18 16 5 7 13 1 6 22 5 ...
 $ fraudRisk   : int  0 0 0 0 0 0 0 0 0 0 ...
The ccFraud dataset has ten million rows and 9 columns, all integer variables. As the task requires, first convert the "gender" variable to a factor, with level "F" replacing 1 and "M" replacing 2. Code:
> ccFraud$gender <- factor(ifelse(ccFraud$gender == 1, 'F', 'M'))
> str(ccFraud)
'data.frame':	10000000 obs. of  9 variables:
 $ custID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender      : Factor w/ 2 levels "F","M": 1 2 2 1 1 2 1 1 2 1 ...
 $ state       : int  35 2 2 15 46 44 3 10 32 23 ...
 $ cardholder  : int  1 1 1 1 1 2 1 1 1 1 ...
 $ balance     : int  3000 0 0 0 0 5546 2000 6016 2428 0 ...
 $ numTrans    : int  4 9 27 12 11 21 41 20 4 18 ...
 $ numIntlTrans: int  14 0 9 0 16 0 0 3 10 56 ...
 $ creditLine  : int  2 18 16 5 7 13 1 6 22 5 ...
 $ fraudRisk   : int  0 0 0 0 0 0 0 0 0 0 ...
Convert the "fraudRisk" variable to a factor as well:
> ccFraud$fraudRisk <- as.factor(ccFraud$fraudRisk)
> str(ccFraud)
'data.frame':	10000000 obs. of  9 variables:
 $ custID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender      : Factor w/ 2 levels "F","M": 1 2 2 1 1 2 1 1 2 1 ...
 $ state       : int  35 2 2 15 46 44 3 10 32 23 ...
 $ cardholder  : int  1 1 1 1 1 2 1 1 1 1 ...
 $ balance     : int  3000 0 0 0 0 5546 2000 6016 2428 0 ...
 $ numTrans    : int  4 9 27 12 11 21 41 20 4 18 ...
 $ numIntlTrans: int  14 0 9 0 16 0 0 3 10 56 ...
 $ creditLine  : int  2 18 16 5 7 13 1 6 22 5 ...
 $ fraudRisk   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
2. Data exploration: check the frequency and proportion of 0 and 1 in the "fraudRisk" variable.
Answer: this one is easy; the table and prop.table functions do the job. Code:
> table(ccFraud$fraudRisk)

      0       1 
9403986  596014 
> prop.table(table(ccFraud$fraudRisk))

        0         1 
0.9403986 0.0596014 
3. Data partitioning: sample proportionally by the fraudRisk variable, with 80% of the data as the training set (train) and 20% as the test set (test).
Answer: since we need stratified sampling on the fraudRisk variable, we use the createDataPartition function from the caret package. Code:
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> idx <- createDataPartition(ccFraud$fraudRisk, p = 0.8, list = FALSE)
> train <- ccFraud[idx, ]
> test  <- ccFraud[-idx, ]
> prop.table(table(train$fraudRisk))

         0          1 
0.94039851 0.05960149 
> prop.table(table(test$fraudRisk))

         0          1 
0.94039897 0.05960103 
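If caret is not available, a stratified 80/20 split can also be done in base R by sampling row indices within each level of fraudRisk. A minimal sketch on a small made-up stand-in for ccFraud (the toy data frame and seed below are illustrative, not from the original answer):

```r
# Toy stand-in for the ccFraud data frame: 94% non-fraud, 6% fraud
ccFraud <- data.frame(custID    = 1:1000,
                      fraudRisk = factor(rep(c(0, 1), times = c(940, 60))))

set.seed(123)
# Sample 80% of the row indices within each fraudRisk level, then combine
idx <- unlist(lapply(split(seq_len(nrow(ccFraud)), ccFraud$fraudRisk),
                     function(i) sample(i, size = floor(0.8 * length(i)))))
train <- ccFraud[idx, ]
test  <- ccFraud[-idx, ]

# Class proportions are preserved in both partitions
prop.table(table(train$fraudRisk))  # 0: 0.94, 1: 0.06
prop.table(table(test$fraudRisk))   # 0: 0.94, 1: 0.06
```

Sampling within each class level is exactly what makes the split stratified; a plain `sample()` over all rows would only preserve the class ratio on average.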
4. Model building: build predictive models for the data with at least three common classification algorithms (e.g. K-nearest neighbors, decision tree, random forest).
Answer: because the dataset is large and students have reported slow runtimes, we use the MicrosoftML package to fit the models. For a quick start with MRS (Microsoft R Server), see the earlier article: https://ask.hellobi.com/blog/xiejiabiao/8559
> # Model 1: a fast decision-tree model with rxFastTrees() from the MicrosoftML package
> (a <- Sys.time())  # time before training
[1] "2017-09-03 23:32:04 CST"
> treeModel <- rxFastTrees(fraudRisk ~ gender + cardholder + balance + numTrans +
+                          numIntlTrans + creditLine, data = train)
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Processed 8000001 instances
Binning and forming Feature objects
Reserved memory for tree learner: 79664 bytes
Starting to train ...
Not training a calibrator because it is not needed.
Elapsed time: 00:01:04.6222538
> (b <- Sys.time())  # time after training
[1] "2017-09-03 23:33:09 CST"
> b - a  # training time
Time difference of 1.086313 mins
> # Model 2: a fast random forest model with rxFastForest()
> (a <- Sys.time())  # time before training
[1] "2017-09-03 23:33:31 CST"
> forestModel <- rxFastForest(fraudRisk ~ gender + cardholder + balance + numTrans +
+                             numIntlTrans + creditLine, data = train)
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Processed 8000001 instances
Binning and forming Feature objects
Reserved memory for tree learner: 79664 bytes
Starting to train ...
Training calibrator.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:01:25.4585776
> (b <- Sys.time())  # time after training
[1] "2017-09-03 23:34:57 CST"
> b - a  # training time
Time difference of 1.433823 mins
> # Model 3: a fast logistic regression model with rxLogisticRegression()
> (a <- Sys.time())  # time before training
[1] "2017-09-03 23:34:57 CST"
> logitModel <- rxLogisticRegression(fraudRisk ~ gender + cardholder + balance + numTrans +
+                                    numIntlTrans + creditLine, data = train)
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Beginning optimization
num vars: 8
improvement criterion: Mean Improvement
L1 regularization selected 8 of 8 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:19.5887244
Elapsed time: 00:00:00.0383181
> (b <- Sys.time())  # time after training
[1] "2017-09-03 23:35:17 CST"
> b - a  # training time
Time difference of 20.27396 secs
Logistic regression trains fastest, at about 20.3 seconds, followed by the decision tree at about 1.09 minutes; the random forest is slowest, at about 1.43 minutes.
5. Model evaluation: use the models built above (at least the three from step 4) to predict on the training and test sets, evaluate their performance, and choose the best model as the business prediction model going forward. (Hint: build confusion matrices.)
Answer: for each of the three models above, we predict on the train and test sets and evaluate the results.
> # Predict with the decision-tree model and compute the error rate
> treePred_tr <- rxPredict(treeModel, data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:52.1015119
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk, treePred_tr$PredictedLabel)
> t
   
          0       1
  0 7446742   76447
  1  253008  223804
> (paste0(round((sum(t) - sum(diag(t))) / sum(t), 3) * 100, "%"))  # decision-tree error rate on train
[1] "4.1%"
> treePred_te <- rxPredict(treeModel, data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:13.4980323
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk, treePred_te$PredictedLabel)
> t1
   
          0       1
  0 1861406   19391
  1   63176   56026
> (paste0(round((sum(t1) - sum(diag(t1))) / sum(t1), 3) * 100, "%"))  # decision-tree error rate on test
[1] "4.1%"
> # Predict with the random forest model and compute the error rate
> forestPred_tr <- rxPredict(forestModel, data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:56.2862657
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk, forestPred_tr$PredictedLabel)
> t
   
          0       1
  0 7508808   14381
  1  373777  103035
> (paste0(round((sum(t) - sum(diag(t))) / sum(t), 3) * 100, "%"))  # random forest error rate on train
[1] "4.9%"
> forestPred_te <- rxPredict(forestModel, data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:14.0430130
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk, forestPred_te$PredictedLabel)
> t1
   
          0       1
  0 1877117    3680
  1   93419   25783
> (paste0(round((sum(t1) - sum(diag(t1))) / sum(t1), 3) * 100, "%"))  # random forest error rate on test
[1] "4.9%"
> # Predict with the logistic regression model and compute the error rate
> logitPred_tr <- rxPredict(logitModel, data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:08.1674394
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk, logitPred_tr$PredictedLabel)
> t
   
          0       1
  0 7444156   79033
  1  250679  226133
> (paste0(round((sum(t) - sum(diag(t))) / sum(t), 3) * 100, "%"))  # logistic regression error rate on train
[1] "4.1%"
> logitPred_te <- rxPredict(logitModel, data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:02.0736547
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk, logitPred_te$PredictedLabel)
> t1
   
          0       1
  0 1860885   19912
  1   62428   56774
> (paste0(round((sum(t1) - sum(diag(t1))) / sum(t1), 3) * 100, "%"))  # logistic regression error rate on test
[1] "4.1%"
Judging by the confusion matrices, the decision tree and logistic regression both misclassify roughly 4.1% of the test set, while the random forest misclassifies about 4.9%; since logistic regression also trains fastest by a wide margin, it is the best choice for this data.
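Because about 94% of the records are non-fraud, overall error rate alone can hide poor fraud detection: a model that predicts 0 for everyone would already be about 94% accurate. It may be worth also computing recall and precision for the fraud class. A minimal base-R sketch, using the logistic regression test-set confusion matrix from the output above:

```r
# Confusion matrix for the logistic regression model on the test set
# (rows = actual fraudRisk, columns = predicted label), copied from the output above
t1 <- matrix(c(1860885, 19912,
               62428,   56774),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(t1)) / sum(t1)        # overall share of correct predictions
recall    <- t1["1", "1"] / sum(t1["1", ])  # share of true fraud cases caught
precision <- t1["1", "1"] / sum(t1[, "1"])  # share of fraud alerts that are correct

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
# accuracy 0.959, recall 0.476, precision 0.740
```

Recall below 0.5 means more than half of the fraud cases are missed despite the high accuracy, so adjusting the decision threshold or weighting the fraud class could be worth exploring before deployment.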