Overview
In a previous article, I documented how to tune a LightGBM classifier with Grid Search on a big-data platform. This time we stay with the LightGBM classifier and look at how two other common tuning methods, Random Search and Bayesian Optimization, can be used on the Spark platform.
1. Load the required packages
import numpy as np
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("spark lightgbm") \
    .master("spark://***.***.***.***:7077") \
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
    .config("spark.cores.max", "20") \
    .config("spark.driver.memory", "6G") \
    .config("spark.executor.memory", "6G") \
    .config("spark.executor.cores", "6") \
    .getOrCreate()
import mmlspark
from mmlspark.lightgbm import LightGBMClassifier
from pyspark.ml.feature import VectorAssembler
from hyperopt import fmin, rand, tpe, hp, space_eval, Trials
Here hyperopt is a powerful hyperparameter-tuning library that implements both random search and Bayesian optimization. If hyperopt is not installed yet, first install it on the master node:
pip3 install hyperopt
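To verify the installation, a minimal sanity check (a toy objective, not part of the workflow below) is to minimize a simple quadratic:

from hyperopt import fmin, tpe, hp

# The minimum of (x - 3)^2 is at x = 3; fmin should land nearby.
best = fmin(fn=lambda x: (x - 3) ** 2,
            space=hp.uniform('x', -10, 10),
            algo=tpe.suggest,
            max_evals=50)
print(best)  # e.g. {'x': 3.02...}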
2. Load the data
df_train = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load("hdfs://***.***.***.***:39000/yangbingjiao/训练集特征.csv")
df_val = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load("hdfs://***.***.***.***:39000/yangbingjiao/验证集特征.csv")
df_test = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load("hdfs://***.***.***.***:39000/yangbingjiao/测试集特征.csv")
Assemble the features:
feature_cols = list(df_train.columns)
feature_cols.remove("label")
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_train = assembler.transform(df_train)
df_val = assembler.transform(df_val)
df_test = assembler.transform(df_test)
3. Define the search space and the objective function
Define the parameter search space, here tuning lambdaL1 and lambdaL2 as an example:
space = {
    'objective': "binary",
    'boostingType': 'gbdt',
    'isUnbalance': True,
    'featuresCol': 'features',
    'labelCol': 'label',
    'maxBin': 60,
    'baggingFreq': 1,
    'baggingSeed': 696,
    'earlyStoppingRound': 20,
    'learningRate': 0.1,
    'maxDepth': 3,
    'numLeaves': 128,
    'baggingFraction': 0.7,
    'featureFraction': 0.7,
    'minSumHessianInLeaf': 0.001,
    'numIterations': 800,
    'verbosity': 1,
    'lambdaL1': hp.uniform('lambdaL1', 0.0, 200.0),
    'lambdaL2': hp.uniform('lambdaL2', 0.0, 200.0),
}
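hyperopt provides other priors besides hp.uniform; for example (a hypothetical sketch, these entries are not used in this article), discrete and log-scale spaces can be written as:

# Hypothetical alternative search-space entries, for illustration only.
alt_space = {
    'maxDepth': hp.choice('maxDepth', [3, 5, 7]),              # pick from a discrete set
    'learningRate': hp.loguniform('learningRate',
                                  np.log(0.01), np.log(0.3)),  # log-uniform in [0.01, 0.3]
    'numLeaves': hp.quniform('numLeaves', 16, 256, 16),        # quantized uniform (returns floats)
}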
Define a function to compute the KS statistic:
import numpy as np
from sklearn.metrics import roc_curve

def ks_score(y, y_pred, threshold=None):
    """Compute the KS statistic: the maximum gap between TPR and FPR."""
    fpr, tpr, thresholds = roc_curve(y, y_pred)
    if threshold is None:
        score = np.max(np.abs(tpr - fpr))
    else:
        # Evaluate the TPR-FPR gap at a specific score threshold instead.
        idx = np.digitize(threshold, thresholds) - 1
        score = np.abs(tpr[idx] - fpr[idx])
    return score
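As a quick sanity check (toy data, for illustration only), perfectly separated scores should give KS = 1:

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.8, 0.9])
print(ks_score(y_true, y_score))  # 1.0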
Train a model and record the evaluation:
def train_evaluate(df_train, df_val, df_test, params):
    # Pass the sampled hyperparameters as constructor keyword arguments.
    lgb = LightGBMClassifier(**params)
    model = lgb.fit(df_train)
    val_preds = model.transform(df_val)
    test_preds = model.transform(df_test)
    # probability[0] is the predicted probability of class 0, so the labels
    # are flipped (1 - label) before computing KS below.
    val_prob_list = [row.probability[0] for row in val_preds.select('probability').collect()]
    val_label_list = [row.label for row in val_preds.select('label').collect()]
    test_prob_list = [row.probability[0] for row in test_preds.select('probability').collect()]
    test_label_list = [row.label for row in test_preds.select('label').collect()]
    val_ks = ks_score(1 - np.array(val_label_list), np.array(val_prob_list))
    test_ks = ks_score(1 - np.array(test_label_list), np.array(test_prob_list))
    print("val ks = %f" % val_ks)
    print("test ks = %f" % test_ks)
    print("-------------------------------")
    return val_ks
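Each select(...).collect() above launches a separate Spark job; as a small optional optimization (an equivalent sketch), both columns can be collected in a single pass:

rows = val_preds.select('probability', 'label').collect()
val_prob_list = [row.probability[0] for row in rows]
val_label_list = [row.label for row in rows]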
Define the objective (cost) function:
def objective(params):
    return -train_evaluate(df_train, df_val, df_test, params)
Be careful when defining the objective function: hyperopt minimizes it. If the objective is a loss, no minus sign is needed in front of train_evaluate; but if we want to maximize a metric such as AUC or KS, we must negate train_evaluate, so that minimizing the objective maximizes the metric and we get the result we want.
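hyperopt also accepts an objective that returns a dict with an explicit loss and status, which makes the sign convention more visible (an equivalent sketch of the same objective):

from hyperopt import STATUS_OK

def objective(params):
    val_ks = train_evaluate(df_train, df_val, df_test, params)
    # Negating turns "maximize KS" into "minimize -KS" for fmin.
    return {'loss': -val_ks, 'status': STATUS_OK}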
4. Tuning with Random Search and Bayesian Optimization
Random search:
rand_trials = Trials()
rand_best = fmin(fn=objective,
                 space=space,
                 algo=rand.suggest,
                 trials=rand_trials,
                 max_evals=5)
space_eval(space, rand_best)
The output:
val ks = 0.380907
test ks = 0.365298
-------------------------------
val ks = 0.372782
test ks = 0.367067
-------------------------------
val ks = 0.370621
test ks = 0.368570
-------------------------------
val ks = 0.383066
test ks = 0.370889
-------------------------------
val ks = 0.378644
test ks = 0.367473
-------------------------------
100%|██████████| 5/5 [12:04<00:00, 144.83s/it, best loss: -0.38306588065580294]
{'baggingFraction': 0.7,
'baggingFreq': 1,
'baggingSeed': 696,
'boostingType': 'gbdt',
'featureFraction': 0.7,
'featuresCol': 'features',
'isUnbalance': True,
'labelCol': 'label',
'lambdaL1': 73.35909229652955,
'lambdaL2': 198.46023902528685,
'learningRate': 0.1,
'maxBin': 60,
'maxDepth': 3,
'minSumHessianInLeaf': 0.001,
'numLeaves': 128,
'objective': 'binary',
'verbosity': 1}
In other words, random search reached a maximum KS of 0.38306588065580294. The space_eval function resolves the best trial back into a full parameter set, while the Trials object records the entire optimization history.
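For example (a short sketch against the Trials API), the recorded history can be inspected after the run:

print(rand_trials.losses())                      # loss of every evaluation, in order
print(rand_trials.best_trial['result']['loss'])  # best loss, i.e. -max KS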
To use Bayesian optimization (the TPE algorithm) instead, the code becomes:
tpe_trials = Trials()
tpe_best = fmin(fn=objective,
                space=space,
                algo=tpe.suggest,
                trials=tpe_trials,
                max_evals=5)
space_eval(space, tpe_best)
The output:
val ks = 0.392768
test ks = 0.373026
-------------------------------
val ks = 0.383776
test ks = 0.367786
-------------------------------
val ks = 0.385806
test ks = 0.370686
-------------------------------
val ks = 0.375033
test ks = 0.370436
-------------------------------
val ks = 0.388076
test ks = 0.371288
-------------------------------
100%|██████████| 5/5 [10:51<00:00, 130.20s/it, best loss: -0.39276827239870765]
{'baggingFraction': 0.7,
'baggingFreq': 1,
'baggingSeed': 696,
'boostingType': 'gbdt',
'featureFraction': 0.7,
'featuresCol': 'features',
'isUnbalance': True,
'labelCol': 'label',
'lambdaL1': 52.782517457829606,
'lambdaL2': 20.799935111367063,
'learningRate': 0.1,
'maxBin': 60,
'maxDepth': 3,
'minSumHessianInLeaf': 0.001,
'numLeaves': 128,
'objective': 'binary',
'verbosity': 1}
With Bayesian optimization, KS reaches 0.39276827239870765. As we can see, Bayesian optimization does better than random search here.
Once we have the best parameters, we can train the final model.
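A minimal sketch of that final step, reusing the TPE results from above:

best_params = space_eval(space, tpe_best)  # resolve the hp.* entries to concrete values
final_model = LightGBMClassifier(**best_params).fit(df_train)
final_preds = final_model.transform(df_test)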
Unfortunately, the LightGBM implementation in mmlspark is currently, most likely, not reproducible: it is missing a seed to control randomness. Among the parameters the LightGBMClassifier class accepts, there is only a baggingSeed, which controls which samples are drawn when baggingFraction is applied; featureFraction clearly also needs a parameter to control which features are sampled each round. Native LightGBM's seed parameter does control both sources of randomness. I have reported this issue on GitHub; if you are interested, you can leave a comment on that page, lightgbm run twice with the same parameters, but got different result in validation #564, to push Microsoft to fix it, or contribute the code yourself. I will keep following this issue.
This article drew on the following references:
Parameter Tuning with Hyperopt
Introduction: Bayesian Optimization using Hyperopt
An Introductory Example of Bayesian Optimization in Python with Hyperopt
Hyperopt tutorial for Optimizing Neural Networks’ Hyperparameters
Hyperopt