Overview

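CatBoost is said to be a GBDT algorithm that is faster and more accurate than XGBoost and LightGBM. This post records one small pitfall I hit during installation, plus a first usage example.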

1. Installation

First install the dependencies, six and NumPy (assuming Python 3.6 or later is already installed):

pip install six

The NumPy that ships with Ubuntu 16.04 is fairly old, so require NumPy 1.16.0 or later:

pip install "numpy>=1.16.0"

Otherwise you get the following error:

numpy.ufunc size changed, may indicate binary incompatibility. expected 216, got 192
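To see which NumPy version is actually active in the current environment, a small check along these lines may help. It is only a sketch: the helper names `parse_version` and `numpy_is_new_enough` are made up for illustration, and the 1.16.0 threshold is simply the one suggested above.

```python
import re

MIN_NUMPY = (1, 16, 0)  # minimum version suggested in this post


def parse_version(version):
    """Turn a string such as '1.16.0' or '1.17.0rc1' into a tuple of three ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def numpy_is_new_enough(version, minimum=MIN_NUMPY):
    """Return True when `version` is at least `minimum`."""
    return parse_version(version) >= minimum


try:
    import numpy
    status = "OK" if numpy_is_new_enough(numpy.__version__) else "too old"
    print(numpy.__version__, status)
except ImportError:
    print("NumPy is not installed")
```

If this prints "too old", upgrading NumPy before installing CatBoost should avoid the binary incompatibility error.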

Then install CatBoost with pip or conda; I usually use pip:

pip install catboost

Install the visualization tools:

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

2. Binary classification demo

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool

### Load the training, validation, and test datasets
df_train = pd.read_csv('train_dataset.csv')
df_val = pd.read_csv('validation_dataset.csv')
df_test = pd.read_csv('test_dataset.csv')

### Build the feature list (every column except the label)
features_list = [col for col in df_train.columns if col != 'label']

### Prepare the data
train_x = df_train[features_list]
train_y = list(df_train.label)
val_x = df_val[features_list]
val_y = list(df_val.label)
test_x = df_test[features_list]
test_y = list(df_test.label)

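At this point the features of all three datasets are DataFrames. With xgboost you would still have to convert them to DMatrix, whereas CatBoost can train directly on DataFrame data, which is very convenient.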

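### Set the parameters
model = CatBoostClassifier(iterations=800, # number of trees, i.e. boosting rounds
                           depth=3, # tree depth
                           learning_rate=1, # learning rate
                           loss_function='Logloss', # loss function
                           eval_metric='AUC', # evaluation metric
                           random_seed=696, # random seed
                           reg_lambda=3, # L2 regularization coefficient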
                           #bootstrap_type='Bayesian',
                           verbose=True)

### Train the model
model.fit(train_x, train_y, eval_set=(val_x, val_y), early_stopping_rounds=10)

The output is as follows:

0:  test: 0.6155599 best: 0.6155599 (0) total: 124ms    remaining: 1m 38s
1:  test: 0.6441688 best: 0.6441688 (1) total: 188ms    remaining: 1m 15s
2:  test: 0.6531472 best: 0.6531472 (2) total: 252ms    remaining: 1m 6s
3:  test: 0.6632070 best: 0.6632070 (3) total: 315ms    remaining: 1m 2s
4:  test: 0.6675112 best: 0.6675112 (4) total: 389ms    remaining: 1m 1s
5:  test: 0.6728938 best: 0.6728938 (5) total: 454ms    remaining: 1m
6:  test: 0.6770872 best: 0.6770872 (6) total: 512ms    remaining: 58s
7:  test: 0.6779621 best: 0.6779621 (7) total: 574ms    remaining: 56.9s
8:  test: 0.6794292 best: 0.6794292 (8) total: 636ms    remaining: 55.9s
9:  test: 0.6799766 best: 0.6799766 (9) total: 695ms    remaining: 54.9s
···
70: test: 0.7060854 best: 0.7060854 (70)    total: 4.54s    remaining: 46.6s
71: test: 0.7066276 best: 0.7066276 (71)    total: 4.58s    remaining: 46.4s
72: test: 0.7071572 best: 0.7071572 (72)    total: 4.63s    remaining: 46.1s
73: test: 0.7066621 best: 0.7071572 (72)    total: 4.68s    remaining: 45.9s
74: test: 0.7058151 best: 0.7071572 (72)    total: 4.74s    remaining: 45.8s
75: test: 0.7057014 best: 0.7071572 (72)    total: 4.78s    remaining: 45.5s
76: test: 0.7056642 best: 0.7071572 (72)    total: 4.82s    remaining: 45.3s
77: test: 0.7054756 best: 0.7071572 (72)    total: 4.86s    remaining: 45s
78: test: 0.7064983 best: 0.7071572 (72)    total: 4.91s    remaining: 44.8s
79: test: 0.7060492 best: 0.7071572 (72)    total: 4.96s    remaining: 44.6s
80: test: 0.7057876 best: 0.7071572 (72)    total: 5.02s    remaining: 44.6s
81: test: 0.7058538 best: 0.7071572 (72)    total: 5.09s    remaining: 44.6s
82: test: 0.7063121 best: 0.7071572 (72)    total: 5.16s    remaining: 44.6s
Stopped by overfitting detector  (10 iterations wait)

bestTest = 0.7071571623
bestIteration = 72

Shrink model to first 73 iterations.

Each line shows the result for that round and the best round so far. On a 180,000 x 500 training set, one training run on CPU takes only about 30s.
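The stopping point in the log can be reproduced with a few lines of plain Python. This is a simplified illustration of the "wait N rounds without improvement" rule implied by early_stopping_rounds=10, not CatBoost's actual implementation; the function name `simulate_early_stopping` is made up, and only the tail of the log (iterations 70 to 82) is used because the middle of the log is elided.

```python
def simulate_early_stopping(scores, wait, first_iteration=0):
    """Return (best_iteration, stop_iteration) for a higher-is-better metric.

    Training stops at the first iteration that lies `wait` rounds past the
    best one. stop_iteration is None if that never happens.
    """
    best_score = float("-inf")
    best_iteration = None
    for offset, score in enumerate(scores):
        iteration = first_iteration + offset
        if score > best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= wait:
            return best_iteration, iteration
    return best_iteration, None


# Validation AUC for iterations 70-82, copied from the log above.
auc_tail = [0.7060854, 0.7066276, 0.7071572, 0.7066621, 0.7058151,
            0.7057014, 0.7056642, 0.7054756, 0.7064983, 0.7060492,
            0.7057876, 0.7058538, 0.7063121]

best, stopped = simulate_early_stopping(auc_tail, wait=10, first_iteration=70)
trees_kept = best + 1  # iterations are 0-based, so the best model has best + 1 trees
print(best, stopped, trees_kept)  # 72 82 73
```

The result matches the log: the best iteration is 72, training stops at iteration 82 after 10 rounds without improvement, and the model is shrunk to its first 73 iterations.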

We can then predict on the test set:

preds_proba = model.predict_proba(test_x)

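Each sample gets two probabilities. The columns follow the order of model.classes_, so with 0/1 labels the first is the probability of the negative class and the second is the probability of the positive class.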
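In practice, usually only the positive-class column is needed. A minimal sketch, assuming 0/1 labels and a 0.5 cut-off (both are assumptions here, and the probability values are invented for illustration):

```python
import numpy as np

# Hypothetical predict_proba output for three samples.
# Columns: [P(label=0), P(label=1)].
example_proba = np.array([[0.9, 0.1],
                          [0.4, 0.6],
                          [0.5, 0.5]])

positive_proba = example_proba[:, 1]  # probability of the positive class
threshold = 0.5                       # assumed cut-off; tune it for your task
predicted_label = (positive_proba > threshold).astype(int)

print(predicted_label.tolist())  # [0, 1, 0]
```

With the model from this post, `preds_proba[:, 1]` would play the role of `positive_proba`.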
We can also look at the feature importances:

model.feature_importances_
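The attribute returns a bare array of scores, so pairing it with the feature names makes it easier to read. The helper below is a sketch; `rank_features` is a made-up name, and the example names and scores are invented.

```python
def rank_features(names, importances, top_n=None):
    """Pair feature names with importances, sorted from most to least important."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Hypothetical names and scores, for illustration only.
example_names = ["age", "income", "clicks"]
example_importances = [12.5, 60.0, 27.5]

for name, score in rank_features(example_names, example_importances):
    print(f"{name}: {score:.1f}")

# With the fitted model from this post, the call would look like:
# rank_features(model.feature_names_, model.feature_importances_, top_n=20)
```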

To save the model:

model.save_model('catboost版本模型.model')

To load the model:

my_model = CatBoostClassifier()
my_model.load_model('catboost版本模型.model')

This post mainly draws on the official documentation: CatBoost