Overview
In a previous post I covered how to apply Grid Search tuning to a lightGBM classifier on a big-data platform. This time we stick with the lightGBM classifier and look at how two other common tuning methods, Random Search and Bayesian Optimization, can be used on Spark.
1. Load the required packages
```python
import numpy as np
import pyspark

spark = pyspark.sql.SparkSession.builder.appName("spark lightgbm") \
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
    .config("spark.cores.max", "20") \
    .config("spark.driver.memory", "6G") \
    .config("spark.executor.memory", "6G") \
    .config("spark.executor.cores", "6") \
    .getOrCreate()

import mmlspark
from mmlspark.lightgbm import LightGBMClassifier
from pyspark.ml.feature import VectorAssembler
from hyperopt import fmin, rand, tpe, hp, space_eval, Trials
```
Here hyperopt is a powerful tuning library that implements both random search and Bayesian optimization. If hyperopt is not installed yet, install it on the master node first:
```
pip3 install hyperopt
```
2. Load the data
```python
# NOTE: the original post omits the CSV paths; "..." below is a placeholder.
df_train = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load("...")  # path to the training set

df_val = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load("...")  # path to the validation set

df_test = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load("...")  # path to the test set
```
Assemble the features:
```python
feature_cols = list(df_train.columns)
feature_cols.remove("label")
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_train = assembler.transform(df_train)
df_val = assembler.transform(df_val)
df_test = assembler.transform(df_test)
```
3. Define the search space and the objective function
Define the parameter search space, here searching over lambdaL1 and lambdaL2 as an example:
```python
space = {
    'objective': "binary",
    'boostingType': 'gbdt',
    'isUnbalance': True,
    'featuresCol': 'features',
    'labelCol': 'label',
    'maxBin': 60,
    'baggingFreq': 1,
    'baggingSeed': 696,
    'earlyStoppingRound': 20,
    'learningRate': 0.1,
    'maxDepth': 3,
    'numLeaves': 128,
    'baggingFraction': 0.7,
    'featureFraction': 0.7,
    'minSumHessianInLeaf': 0.001,
    'numIterations': 800,
    'verbosity': 1,
    'lambdaL1': hp.uniform('lambdaL1', 0.0, 200.0),
    'lambdaL2': hp.uniform('lambdaL2', 0.0, 200.0),
}
```
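As an optional sanity check, you can draw a single random point from this space to see exactly what hyperopt will hand to the objective function (`hyperopt.pyll.stochastic.sample` is part of hyperopt itself; the sampled values shown in the comment are just illustrative):

```python
# Draw one random configuration from the space defined above.
from hyperopt.pyll.stochastic import sample

print(sample(space))
# All fixed entries come back unchanged; lambdaL1/lambdaL2 get random draws
# from [0, 200], e.g. {'lambdaL1': 132.7, 'lambdaL2': 41.05, ...}
```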
Define a function to compute the KS statistic:
```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def ks_score(y, y_pred, threshold=None):
    """Compute the KS statistic."""
    fpr, tpr, thresholds = roc_curve(y, y_pred)
    if threshold is None:
        # KS is the maximum gap between the TPR and FPR curves
        score = np.max(np.abs(tpr - fpr))
    else:
        # evaluate the gap at the cut-off closest to the given threshold
        idx = np.digitize(threshold, thresholds) - 1
        score = np.abs(tpr[idx] - fpr[idx])
    return score
```
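A quick check on a toy example (these are sklearn's standard roc_curve example numbers, not data from this post):

```python
# Toy check: KS should be 0.5 for this tiny example.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
print(ks_score(y_true, y_prob))  # 0.5
```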
Wrap training and evaluation:
```python
def train_evaluate(df_train, df_val, df_test, params):
    # Pass the sampled parameters as constructor kwargs; a plain string-keyed
    # dict is not a valid pyspark param map for fit() and would be ignored.
    lgb = LightGBMClassifier(**params)
    model = lgb.fit(df_train)
    val_preds = model.transform(df_val)
    test_preds = model.transform(df_test)
    # probability[0] is P(class 0), hence the labels are flipped below
    val_prob_list = [row.probability[0] for row in val_preds.select('probability').collect()]
    val_label_list = [row.label for row in val_preds.select('label').collect()]
    test_prob_list = [row.probability[0] for row in test_preds.select('probability').collect()]
    test_label_list = [row.label for row in test_preds.select('label').collect()]
    val_ks = ks_score(1 - np.array(val_label_list), np.array(val_prob_list))
    test_ks = ks_score(1 - np.array(test_label_list), np.array(test_prob_list))
    print("val ks = %f" % (val_ks))
    print("test ks = %f" % (test_ks))
    print("-------------------------------")
    return val_ks
```
Define the objective (cost) function:
```python
def objective(params):
    return -train_evaluate(df_train, df_val, df_test, params)
```
One thing to be careful about when defining the objective: hyperopt minimizes the objective function. If the objective is a loss, there is no need for a minus sign in front of train_evaluate; but if we want to maximize a metric such as AUC or KS, we must negate train_evaluate, so that minimizing the negated value yields the maximum we are after.
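To make the sign convention concrete, here is a minimal self-contained sketch with a toy function (nothing to do with the model above): to maximize f(x) = -(x - 3)^2 + 10 using fmin, we minimize its negation, and the best x found should land near 3.

```python
from hyperopt import fmin, tpe, hp

# fmin minimizes, so we negate the function we actually want to maximize.
best = fmin(fn=lambda x: -(-(x - 3) ** 2 + 10),
            space=hp.uniform('x', 0, 6),
            algo=tpe.suggest,
            max_evals=50)
print(best)  # {'x': ...}, with x close to 3, the maximizer of the original f
```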
4. Random search and Bayesian optimization tuning
Random search:
```python
rand_trials = Trials()
rand_best = fmin(fn=objective,
                 space=space,
                 algo=rand.suggest,
                 trials=rand_trials,
                 max_evals=5)
space_eval(space, rand_best)
```
The output looks like this:
```
val ks = 0.380907
test ks = 0.365298
-------------------------------
val ks = 0.372782
test ks = 0.367067
-------------------------------
val ks = 0.370621
test ks = 0.368570
-------------------------------
val ks = 0.383066
test ks = 0.370889
-------------------------------
val ks = 0.378644
test ks = 0.367473
-------------------------------
100%|██████████| 5/5 [12:04<00:00, 144.83s/it, best loss: -0.38306588065580294]
{'baggingFraction': 0.7,
 'baggingFreq': 1,
 'baggingSeed': 696,
 'boostingType': 'gbdt',
 'featureFraction': 0.7,
 'featuresCol': 'features',
 'isUnbalance': True,
 'labelCol': 'label',
 'lambdaL1': 73.35909229652955,
 'lambdaL2': 198.46023902528685,
 'learningRate': 0.1,
 'maxBin': 60,
 'maxDepth': 3,
 'minSumHessianInLeaf': 0.001,
 'numLeaves': 128,
 'objective': 'binary',
 'verbosity': 1}
```
In other words, with random search the best KS reaches 0.38306588065580294. The space_eval function recovers the best parameter set, while the Trials object records the entire search history.
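For example, once the search finishes you can pull the history straight out of rand_trials (these are standard hyperopt Trials attributes; the loss values here are the negated KS returned by our objective):

```python
# Each evaluated point and its loss (loss = -val_ks, per our objective).
print(rand_trials.losses())              # one loss per evaluation
print(rand_trials.best_trial['result'])  # record of the best run
for trial in rand_trials.trials:
    print(trial['misc']['vals'], trial['result']['loss'])
```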
To use Bayesian optimization instead, the code becomes:
```python
tpe_trials = Trials()
tpe_best = fmin(fn=objective,
                space=space,
                algo=tpe.suggest,
                trials=tpe_trials,
                max_evals=5)
space_eval(space, tpe_best)
```
Output:
```
val ks = 0.392768
test ks = 0.373026
-------------------------------
val ks = 0.383776
test ks = 0.367786
-------------------------------
val ks = 0.385806
test ks = 0.370686
-------------------------------
val ks = 0.375033
test ks = 0.370436
-------------------------------
val ks = 0.388076
test ks = 0.371288
-------------------------------
100%|██████████| 5/5 [10:51<00:00, 130.20s/it, best loss: -0.39276827239870765]
{'baggingFraction': 0.7,
 'baggingFreq': 1,
 'baggingSeed': 696,
 'boostingType': 'gbdt',
 'featureFraction': 0.7,
 'featuresCol': 'features',
 'isUnbalance': True,
 'labelCol': 'label',
 'lambdaL1': 52.782517457829606,
 'lambdaL2': 20.799935111367063,
 'learningRate': 0.1,
 'maxBin': 60,
 'maxDepth': 3,
 'minSumHessianInLeaf': 0.001,
 'numLeaves': 128,
 'objective': 'binary',
 'verbosity': 1}
```
With Bayesian optimization the KS reaches 0.39276827239870765. As we can see, Bayesian optimization does outperform random search here.
Once we have the best parameters, we can train the final model, as sketched below.
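A sketch of that final step, reusing the names from the code above (space_eval turns hyperopt's raw output back into a full parameter dict):

```python
# Recover the winning parameter set and refit on the training data.
best_params = space_eval(space, tpe_best)
final_model = LightGBMClassifier(**best_params).fit(df_train)
final_preds = final_model.transform(df_test)
```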
Unfortunately, the lightGBM implementation in mmlspark is, as of now, most likely not reproducible: it is missing a seed to control one source of randomness. Among the parameters accepted by the LightGBMClassifier class, there is only a baggingSeed, which controls which rows are sampled when baggingFraction is applied; featureFraction clearly also needs a seed to control which features are sampled each round, but there is none. Native lightGBM's seed parameter, by contrast, controls both sources of randomness. I have reported this on GitHub; if you are interested, leave a comment on the issue (lightgbm run twice with the same parameters, but got different result in validation #564) to push Microsoft to fix it, or contribute the code yourself. I will keep following this issue.
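For comparison, this is what the dedicated seeds look like in native (non-Spark) lightGBM; the parameter names below are from the LightGBM docs, and feature_fraction_seed is precisely the knob mmlspark's LightGBMClassifier is missing:

```python
import lightgbm as lgb

params = {
    'objective': 'binary',
    'bagging_fraction': 0.7,
    'bagging_freq': 1,
    'feature_fraction': 0.7,
    'seed': 42,                   # master seed, from which the others are derived
    'bagging_seed': 42,           # controls which rows are sampled
    'feature_fraction_seed': 42,  # controls which features are sampled
}
# train_set would be an lgb.Dataset built from your data:
# model = lgb.train(params, train_set)
```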
References:
- Parameter Tuning with Hyperopt
- Introduction: Bayesian Optimization using Hyperopt
- An Introductory Example of Bayesian Optimization in Python with Hyperopt
- Hyperopt tutorial for Optimizing Neural Networks' Hyperparameters
- Hyperopt