Overview
CatBoost is said to be a GBDT algorithm that is faster and more accurate than XGBoost and LightGBM. This post records a small pitfall I hit during installation, plus a basic usage example.
1. Installation
First install the dependencies, six and NumPy (this assumes you already have Python 3.6 or later):
```bash
pip install six
```
The NumPy version that ships with Ubuntu 16.04 is fairly old, so pin NumPy to 1.16.0 or later:
```bash
pip install numpy==1.16.0
```
Otherwise you will get an error like this:
```
numpy.ufunc size changed, may indicate binary incompatibility. expected 216, got 192
```
I recently ran into the same error with TensorFlow 2.0, where the installed numpy version was 1.17.1, so installing numpy 1.16.0 also fixed that TensorFlow 2.0 error.
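To confirm which NumPy version is actually active, a quick check from Python:

```python
import numpy
print(numpy.__version__)  # should be 1.16.0 or later
```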
Then install CatBoost with either pip or conda; I usually use pip:
```bash
pip install catboost
```
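You can verify the installation from Python:

```python
import catboost
print(catboost.__version__)  # prints the installed CatBoost version
```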
Install the visualization tools:
```bash
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
```
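With these widgets enabled, CatBoost can render a live training chart inside a Jupyter notebook. A minimal sketch, reusing the model and data defined in the demo below, is to pass `plot=True` to `fit`:

```python
# Inside a Jupyter notebook: plot=True draws an interactive chart of the
# training and validation metrics as the boosting rounds progress.
model.fit(train_x, train_y, eval_set=(val_x, val_y), plot=True)
```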
2. Binary Classification Demo
Below is a demo of a binary classification task.
```python
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool

### Load the training, validation and test sets
df_train = pd.read_csv('train_dataset.csv')
df_val = pd.read_csv('validation_dataset.csv')
df_test = pd.read_csv('test_dataset.csv')

### Build the feature list (drop the label column so it is not used as a feature)
features_list = [col for col in df_train.columns if col != 'label']

### Split features and labels
train_x = df_train[features_list]
train_y = list(df_train.label)
val_x = df_val[features_list]
val_y = list(df_val.label)
test_x = df_test[features_list]
test_y = list(df_test.label)
```
At this point the features of all three datasets are plain DataFrames. With xgboost you would still have to convert them into DMatrix form, whereas CatBoost can train directly on DataFrame data, which is very convenient.
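If you do want an explicit data container (for example, to declare categorical feature columns), CatBoost provides `Pool`, which is already imported above. A minimal sketch; the column names here are hypothetical:

```python
from catboost import Pool

# cat_features lists the categorical columns by name (or index);
# 'city' and 'device_type' are made-up examples.
train_pool = Pool(train_x, label=train_y, cat_features=['city', 'device_type'])
val_pool = Pool(val_x, label=val_y, cat_features=['city', 'device_type'])
```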
```python
### Set the parameters
model = CatBoostClassifier(iterations=800,            # number of trees, i.e. boosting rounds
                           depth=3,                   # tree depth
                           learning_rate=1,           # learning rate
                           loss_function='Logloss',   # loss function
                           eval_metric='AUC',         # evaluation metric
                           random_seed=696,           # random seed
                           reg_lambda=3,              # L2 regularization coefficient
                           #bootstrap_type='Bayesian',
                           verbose=True)

### Train the model
model.fit(train_x, train_y,
          eval_set=(val_x, val_y),
          early_stopping_rounds=10)
```
The output looks like this:
```
0:	test: 0.6155599	best: 0.6155599 (0)	total: 124ms	remaining: 1m 38s
1:	test: 0.6441688	best: 0.6441688 (1)	total: 188ms	remaining: 1m 15s
2:	test: 0.6531472	best: 0.6531472 (2)	total: 252ms	remaining: 1m 6s
3:	test: 0.6632070	best: 0.6632070 (3)	total: 315ms	remaining: 1m 2s
4:	test: 0.6675112	best: 0.6675112 (4)	total: 389ms	remaining: 1m 1s
5:	test: 0.6728938	best: 0.6728938 (5)	total: 454ms	remaining: 1m
6:	test: 0.6770872	best: 0.6770872 (6)	total: 512ms	remaining: 58s
7:	test: 0.6779621	best: 0.6779621 (7)	total: 574ms	remaining: 56.9s
8:	test: 0.6794292	best: 0.6794292 (8)	total: 636ms	remaining: 55.9s
9:	test: 0.6799766	best: 0.6799766 (9)	total: 695ms	remaining: 54.9s
···
70:	test: 0.7060854	best: 0.7060854 (70)	total: 4.54s	remaining: 46.6s
71:	test: 0.7066276	best: 0.7066276 (71)	total: 4.58s	remaining: 46.4s
72:	test: 0.7071572	best: 0.7071572 (72)	total: 4.63s	remaining: 46.1s
73:	test: 0.7066621	best: 0.7071572 (72)	total: 4.68s	remaining: 45.9s
74:	test: 0.7058151	best: 0.7071572 (72)	total: 4.74s	remaining: 45.8s
75:	test: 0.7057014	best: 0.7071572 (72)	total: 4.78s	remaining: 45.5s
76:	test: 0.7056642	best: 0.7071572 (72)	total: 4.82s	remaining: 45.3s
77:	test: 0.7054756	best: 0.7071572 (72)	total: 4.86s	remaining: 45s
78:	test: 0.7064983	best: 0.7071572 (72)	total: 4.91s	remaining: 44.8s
79:	test: 0.7060492	best: 0.7071572 (72)	total: 4.96s	remaining: 44.6s
80:	test: 0.7057876	best: 0.7071572 (72)	total: 5.02s	remaining: 44.6s
81:	test: 0.7058538	best: 0.7071572 (72)	total: 5.09s	remaining: 44.6s
82:	test: 0.7063121	best: 0.7071572 (72)	total: 5.16s	remaining: 44.6s
Stopped by overfitting detector  (10 iterations wait)

bestTest = 0.7071571623
bestIteration = 72

Shrink model to first 73 iterations.
```
It prints each round's result and the best round so far. On a 180000 × 500 training set, one training run takes only about 30s on CPU.
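After early stopping, the best round and score can also be read back from the model object:

```python
print(model.get_best_iteration())  # e.g. 72 in the run above
print(model.get_best_score())      # per-dataset metrics, e.g. {'validation': {'AUC': 0.7071...}}
```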
We can then predict on the test set:
```python
preds_proba = model.predict_proba(test_x)
```
For each sample this yields two probabilities: the probability of the negative class and of the positive class, respectively.
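So to get just the positive-class probability, or hard 0/1 labels, one can do for example:

```python
pos_proba = preds_proba[:, 1]        # probability of the positive class (second column)
preds_label = model.predict(test_x)  # hard 0/1 class predictions
```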
We can also look at the feature importances:
```python
model.feature_importances_
```
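For readability, the importances can be paired with the feature names and sorted, for example:

```python
import pandas as pd

# Pair each feature with its importance and sort in descending order.
importance_df = pd.DataFrame({
    'feature': features_list,
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)
print(importance_df.head(10))
```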
To save the model:
```python
model.save_model('catboost版本模型.model')
```
And to load it back:
```python
my_model = CatBoostClassifier()
my_model.load_model('catboost版本模型.model')
```
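As a quick sanity check (reusing the test set from above), the reloaded model should reproduce the original predictions:

```python
import numpy as np

# The reloaded model should give identical probabilities to the original.
assert np.allclose(model.predict_proba(test_x), my_model.predict_proba(test_x))
```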
This post is mainly based on the official documentation: CatBoost.