Bayesian Hyperparameter Tuning for CatBoost

Overview

Previously we documented a CatBoost training example; this time we walk through a CatBoost hyperparameter tuning example, using Bayesian optimization, an approach that is popular in industry.

1. Import dependencies and load data

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, CatBoost, Pool, cv
from bayes_opt import BayesianOptimization

# Load the training, validation, and test sets
data_train = pd.read_csv('data/训练集.csv')
data_val = pd.read_csv('data/验证集.csv')
data_test = pd.read_csv('data/测试集.csv')

2. Load the feature list and prepare the data

# One feature name per line; with header=None and index_col=0
# the names become the DataFrame index
name_list = pd.read_csv('特征列表_20190705.txt', header=None, index_col=0)
# Transposing turns the index into columns, so list() yields the names
my_feature_names = list(name_list.transpose())
len(my_feature_names)
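Since the names end up as the index, an equivalent and arguably clearer way to build the same list (a sketch, assuming the file really does hold one feature name per line) is:

# The feature names are the index of name_list; read them off directly
my_feature_names = name_list.index.tolist()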

data_train_X = data_train[my_feature_names]
data_val_X = data_val[my_feature_names]
data_test_X = data_test[my_feature_names]

data_train_y = data_train['label']
data_val_y = data_val['label']
data_test_y = data_test['label']

3. Bayesian hyperparameter tuning

# Objective function: train CatBoost with the given hyperparameters
# and return the validation AUC for the optimizer to maximize
def cat_train(bagging_temperature, reg_lambda, learning_rate):
    params = {
        'iterations':800,
        'depth':3,
        'bagging_temperature':bagging_temperature,
        'reg_lambda':reg_lambda,
        'learning_rate':learning_rate,
        'loss_function':'Logloss',
        'eval_metric':'AUC',
        'random_seed':696,
        'verbose':30
    }

    model = CatBoost(params)
    # Evaluate on the validation set, with AUC as the metric
    model.fit(data_train_X, data_train_y, eval_set=(data_val_X, data_val_y), plot=False, early_stopping_rounds=20)
    
    print(params)
    score_max = model.best_score_.get('validation').get('AUC')
    return score_max

cat_opt = BayesianOptimization(cat_train,
                               {
                                   'bagging_temperature': (1, 50),
                                   'reg_lambda': (1, 200),
                                   'learning_rate': (0.05, 0.2)
                               })

cat_opt.maximize(n_iter=15, init_points=5)

Once the search finishes, you can train the final model with the best parameters it found.
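With recent versions of bayes_opt, the best score and parameters are exposed via cat_opt.max; a minimal sketch of retraining with them (final_params and final_model are our own names):

# Best AUC found and the hyperparameters that produced it
best_params = cat_opt.max['params']
print(cat_opt.max['target'], best_params)

# Retrain a final model with the winning hyperparameters
final_params = {
    'iterations': 800,
    'depth': 3,
    'bagging_temperature': best_params['bagging_temperature'],
    'reg_lambda': best_params['reg_lambda'],
    'learning_rate': best_params['learning_rate'],
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'random_seed': 696,
    'verbose': 30
}
final_model = CatBoost(final_params)
final_model.fit(data_train_X, data_train_y,
                eval_set=(data_val_X, data_val_y),
                early_stopping_rounds=20)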

For the Bayesian tuning part, we referred to the following articles:
Bayesian methods of hyperparameter optimization

Hyperparameter Optimization using bayesian optimization

as well as the GitHub source code: BayesianOptimization

Installing and Getting Started with CatBoost (Python) on Ubuntu 16.04

Overview

CatBoost is said to be a GBDT algorithm that is faster and more accurate than XGBoost and LightGBM. This post records a small pitfall hit during installation, plus a first usage example.

1. Installation

First install the dependencies six and NumPy (this assumes you already have Python 3.6 or later):

pip install six

The NumPy bundled with Ubuntu 16.04 is fairly old, so specify a NumPy version of 1.16.0 or above:

pip install numpy==1.16.0

Otherwise you will hit an error like this:

numpy.ufunc size changed, may indicate binary incompatibility. expected 216, got 192

Recently, when using TensorFlow 2.0 with numpy 1.17.1, the same error appeared, so installing numpy 1.16.0 also fixes that TensorFlow 2.0 problem.

Then install CatBoost with pip or conda; I usually use pip:

pip install catboost

Install the visualization widget support:

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
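
To confirm the installation worked, you can print the package version:

import catboost
print(catboost.__version__)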

2. Binary classification demo

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool

### Load the training, validation, and test set data
df_train = pd.read_csv('train_dataset.csv')
df_val = pd.read_csv('validation_dataset.csv')
df_test = pd.read_csv('test_dataset.csv')

### Build the feature list (drop the label column so it is not used as a feature)
features_list = list(df_train.columns)
features_list.remove('label')

### Prepare the data
train_x = df_train[features_list]
train_y = list(df_train.label)
val_x = df_val[features_list]
val_y = list(df_val.label)
test_x = df_test[features_list]
test_y = list(df_test.label)

At this point the features of all three datasets are still pandas DataFrames. With xgboost you would have to convert them to DMatrix first, but CatBoost can train on DataFrames directly, which is very convenient.
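If you later need to mark categorical feature columns explicitly, you can still wrap the DataFrames in CatBoost's Pool; a minimal sketch (this data is assumed to have no categorical columns):

from catboost import Pool

# Pool bundles features and labels; pass cat_features=[...] to mark
# categorical columns by index or name when the data has any
train_pool = Pool(train_x, label=train_y)
val_pool = Pool(val_x, label=val_y)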

### Set the parameters
model = CatBoostClassifier(iterations=800, # number of trees, i.e. boosting rounds
                           depth=3, # tree depth
                           learning_rate=1, # learning rate
                           loss_function='Logloss', # loss function
                           eval_metric='AUC', # evaluation metric
                           random_seed=696, # random seed
                           reg_lambda=3, # L2 regularization coefficient
                           #bootstrap_type='Bayesian',
                           verbose=True)

### Train the model, evaluating on the validation set
model.fit(train_x, train_y, eval_set=(val_x, val_y), early_stopping_rounds=10)

The output looks like this:

0:  test: 0.6155599 best: 0.6155599 (0) total: 124ms    remaining: 1m 38s
1:  test: 0.6441688 best: 0.6441688 (1) total: 188ms    remaining: 1m 15s
2:  test: 0.6531472 best: 0.6531472 (2) total: 252ms    remaining: 1m 6s
3:  test: 0.6632070 best: 0.6632070 (3) total: 315ms    remaining: 1m 2s
4:  test: 0.6675112 best: 0.6675112 (4) total: 389ms    remaining: 1m 1s
5:  test: 0.6728938 best: 0.6728938 (5) total: 454ms    remaining: 1m
6:  test: 0.6770872 best: 0.6770872 (6) total: 512ms    remaining: 58s
7:  test: 0.6779621 best: 0.6779621 (7) total: 574ms    remaining: 56.9s
8:  test: 0.6794292 best: 0.6794292 (8) total: 636ms    remaining: 55.9s
9:  test: 0.6799766 best: 0.6799766 (9) total: 695ms    remaining: 54.9s
···
70: test: 0.7060854 best: 0.7060854 (70)    total: 4.54s    remaining: 46.6s
71: test: 0.7066276 best: 0.7066276 (71)    total: 4.58s    remaining: 46.4s
72: test: 0.7071572 best: 0.7071572 (72)    total: 4.63s    remaining: 46.1s
73: test: 0.7066621 best: 0.7071572 (72)    total: 4.68s    remaining: 45.9s
74: test: 0.7058151 best: 0.7071572 (72)    total: 4.74s    remaining: 45.8s
75: test: 0.7057014 best: 0.7071572 (72)    total: 4.78s    remaining: 45.5s
76: test: 0.7056642 best: 0.7071572 (72)    total: 4.82s    remaining: 45.3s
77: test: 0.7054756 best: 0.7071572 (72)    total: 4.86s    remaining: 45s
78: test: 0.7064983 best: 0.7071572 (72)    total: 4.91s    remaining: 44.8s
79: test: 0.7060492 best: 0.7071572 (72)    total: 4.96s    remaining: 44.6s
80: test: 0.7057876 best: 0.7071572 (72)    total: 5.02s    remaining: 44.6s
81: test: 0.7058538 best: 0.7071572 (72)    total: 5.09s    remaining: 44.6s
82: test: 0.7063121 best: 0.7071572 (72)    total: 5.16s    remaining: 44.6s
Stopped by overfitting detector  (10 iterations wait)

bestTest = 0.7071571623
bestIteration = 72

Shrink model to first 73 iterations.

Each round prints its result along with the best iteration so far. For a 180000 × 500 training set, a single training run takes only about 30 s on CPU.

We can then score the test set:

preds_proba = model.predict_proba(test_x)

For each sample we get two probabilities: the first column is the negative-class probability and the second is the positive-class probability.
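In practice you usually keep only the positive-class column; a small sketch, using scikit-learn's roc_auc_score purely for illustration:

from sklearn.metrics import roc_auc_score

# Column 1 holds the probability of the positive class (label 1)
pos_proba = preds_proba[:, 1]
print(roc_auc_score(test_y, pos_proba))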
We can also inspect the feature importances:

model.feature_importances_
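
The raw array is hard to read on its own; a sketch that pairs each score with its feature name and sorts in descending order:

# Pair importances with feature names and show the top 10
importances = sorted(zip(features_list, model.feature_importances_),
                     key=lambda t: t[1], reverse=True)
for name, score in importances[:10]:
    print(f'{name}: {score:.3f}')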

To save the model:

model.save_model('catboost版本模型.model')

And to load it back:

# Load into a fresh model instance rather than overwriting the trained one
my_model = CatBoostClassifier()
my_model.load_model('catboost版本模型.model')
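
As a quick sanity check (our own addition), the reloaded model should reproduce the original predictions:

import numpy as np

# The saved-and-reloaded model should give identical probabilities
assert np.allclose(model.predict_proba(test_x),
                   my_model.predict_proba(test_x))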

This post mainly followed the official documentation: CatBoost