You are viewing: articles published by Young

Connecting to PostgreSQL from Python 3

Overview

Data left over from the project's previous owners is scattered across several databases: MySQL and MongoDB, as well as Cassandra and PostgreSQL. To connect to PostgreSQL from a Python 3 Jupyter notebook you need to install psycopg2, whereas in the Python 2 setup psycopg2 was already integrated.

1. Install python3-psycopg2 and libpq-dev

First install these two packages on Linux:

sudo apt-get install python3-psycopg2
sudo apt-get install libpq-dev

Then install psycopg2:

pip3 install psycopg2

Otherwise you will get an error like this:

ERROR: Command errored out with exit status 1:
     command: /home/yangbingjiao/anaconda3/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-wxj7s4lx/psycopg2/setup.py'"'"'; __file__='"'"'/tmp/pip-install-wxj7s4lx/psycopg2/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /tmp/pip-install-wxj7s4lx/psycopg2/
    Complete output (23 lines):
    running egg_info
    creating pip-egg-info/psycopg2.egg-info
    writing pip-egg-info/psycopg2.egg-info/PKG-INFO
    writing dependency_links to pip-egg-info/psycopg2.egg-info/dependency_links.txt
    writing top-level names to pip-egg-info/psycopg2.egg-info/top_level.txt
    writing manifest file 'pip-egg-info/psycopg2.egg-info/SOURCES.txt'

    Error: pg_config executable not found.

    pg_config is required to build psycopg2 from source.  Please add the directory
    containing pg_config to the $PATH or specify the full executable path with the
    option:

        python setup.py build_ext --pg-config /path/to/pg_config build ...

    or with the pg_config option in 'setup.cfg'.

    If you prefer to avoid building psycopg2 from source, please install the PyPI
    'psycopg2-binary' package instead.

    For further information please check the 'doc/src/install.rst' file (also at
    <http://initd.org/psycopg/docs/install.html>).

    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

2. Connecting to the PostgreSQL database

import psycopg2
import pandas as pd

# connect to the database (replace the credentials with your own)
conn = psycopg2.connect(database="database", user="user", password="password", host="192.168.1.230", port="5432")
cur = conn.cursor()
sql = """
SELECT *
FROM schema.table
WHERE create_time >= '2019-09-23' AND create_time < '2019-09-30'
ORDER BY create_time
LIMIT 10
"""
cur.execute(sql)
rows = cur.fetchall()
df = pd.DataFrame(rows)
cur.close()
conn.close()
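Note that the DataFrame built this way has numeric column labels, because fetchall() returns bare tuples. The real column names are available on cur.description, where the first field of each entry is the column name. A minimal sketch of the pattern, using hypothetical rows and description data in place of a live connection:

```python
import pandas as pd

# stand-ins for cur.fetchall() and cur.description from a live psycopg2 cursor
rows = [(1, '2019-09-23'), (2, '2019-09-24')]
description = [('id',), ('create_time',)]  # first field of each entry is the column name

columns = [col[0] for col in description]
df = pd.DataFrame(rows, columns=columns)
print(list(df.columns))  # -> ['id', 'create_time']
```

With a real cursor, `columns = [col[0] for col in cur.description]` right after `cur.execute(sql)` gives the same result; `pandas.read_sql(sql, conn)` wraps this whole pattern in a single call.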

This post mainly referenced the following article:
Connecting to PostgreSQL (10.5) from Python 3

Installing Jupyter on an Alibaba Cloud Ubuntu 16.04 Server

Overview

Jupyter hardly needs an introduction anymore. This year the company's domestic business has stabilized and I can hand it over to others; my main focus is now on the Southeast Asia business. For offline model training, I therefore need a Jupyter environment in the cloud. This time, I'll install it with Anaconda.

1. Install Anaconda

First, find the Linux Python 3.7 build of Anaconda on the Anaconda download page and copy its link address. Then log in to the cloud server over ssh and run the following command to download it:

wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh

After the download finishes, install it:

bash Anaconda3-2019.07-Linux-x86_64.sh

Answer yes and press Enter through the prompts; the installer puts everything under your home directory.

2. Start Jupyter

Once Anaconda is installed, Jupyter is already included, so next let's start it.

jupyter notebook

You'll hit the following error:

Jupyter Notebook won't start due to ports being already in use.

This happens because the server already has a Jupyter instance occupying port 8888, so the one I installed can't start. The existing Jupyter lives under another user's home directory, and dealing with its permissions would be even more annoying, so I simply run my Jupyter on a different port.
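To confirm which port is actually occupied before picking a new one, a quick TCP connect test is enough. A small sketch (the helper name is mine, not part of Jupyter):

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    # a successful TCP connect means something is already listening on the port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# e.g. port_in_use(8888) is True while the other user's Jupyter is running
```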

2.1 Configure the Jupyter port

Run the following command in the shell:

jupyter notebook --generate-config

Then open the config file:

vim  ~/.jupyter/jupyter_notebook_config.py

In vim, press Shift+G to jump straight to the end of the file, and add these lines:

c.NotebookApp.ip = '*'  # which IPs may access this server; '*' means any IP
c.NotebookApp.open_browser = False # don't open a local browser on startup
c.NotebookApp.port = 8080 # port to listen on

We set the port to 8080 to keep it apart from the other Jupyter; save the file and quit.

Then run it again:

jupyter notebook

Now it starts normally, with output like:

[I 17:49:13.763 NotebookApp] The Jupyter Notebook is running at:
[I 17:49:13.763 NotebookApp] http://****.**.id:8080/?token=555aa7ce6bbddb*********c1c716b3e4744626d
[I 17:49:13.764 NotebookApp]  or http://127.0.0.1:8080/?token=555aa7ce6bbddb*********c1c716b3e4744626d

The string after token= is the password; note it down.
Then open http://<server-ip>:8080 in a local browser; when prompted for a password, enter the token saved above.
One problem remains: the terminal session is fragile, and once it drops, Jupyter dies with it. Launch Jupyter inside screen and you can safely close the window:

screen jupyter notebook 

After that, we can conveniently use the remote server for offline model training from our local machine.

This post mainly referenced the following articles, thanks:
Running jupyter notebook on a remote server
Jupyter Notebook won't start due to ports being already in use

Bayesian Hyperparameter Tuning for CatBoost

Overview

We previously wrote up a CatBoost training example; this time we add a CatBoost tuning example, using the Bayesian optimization approach that is quite popular in industry.

1. Import dependencies and load the data

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, CatBoost, Pool, cv
from bayes_opt import BayesianOptimization

data_train = pd.read_csv('data/训练集.csv')
data_val = pd.read_csv('data/验证集.csv')
data_test = pd.read_csv('data/测试集.csv')

2. Load the feature list and prepare the data

name_list = pd.read_csv('特征列表_20190705.txt', header=None, index_col=0)
my_feature_names = list(name_list.transpose())
len(my_feature_names)

data_train_X = data_train[my_feature_names]
data_val_X = data_val[my_feature_names]
data_test_X = data_test[my_feature_names]

data_train_y = data_train['label']
data_val_y  = data_val['label']
data_test_y  = data_test['label']

3. Bayesian optimization

def cat_train(bagging_temperature, reg_lambda, learning_rate):
    params = {
        'iterations':800,
        'depth':3,
        'bagging_temperature':bagging_temperature,
        'reg_lambda':reg_lambda,
        'learning_rate':learning_rate,
        'loss_function':'Logloss',
        'eval_metric':'AUC',
        'random_seed':696,
        'verbose':30
    }

    model = CatBoost(params)
    # evaluate on the validation set; the metric is AUC
    model.fit(data_train_X, data_train_y, eval_set=(data_val_X, data_val_y), plot=False, early_stopping_rounds=20) 
    
    print(params)
    score_max = model.best_score_.get('validation').get('AUC')
    return score_max

cat_opt = BayesianOptimization(cat_train, 
                           {
                              'bagging_temperature': (1, 50),  
                              'reg_lambda': (1, 200),
                              'learning_rate':(0.05, 0.2)
                            })

cat_opt.maximize(n_iter=15, init_points=5)
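After maximize() finishes, the best trial is exposed as cat_opt.max, a dict with 'target' (the best AUC) and 'params' keys, while cat_opt.res holds all recorded trials. Conceptually this is just an argmax over the trials; a dependency-free sketch with made-up trial records:

```python
# hypothetical trial records, shaped like the entries bayes_opt keeps in cat_opt.res
trials = [
    {"target": 0.712, "params": {"bagging_temperature": 12.0, "reg_lambda": 90.0, "learning_rate": 0.11}},
    {"target": 0.739, "params": {"bagging_temperature": 3.5, "reg_lambda": 140.0, "learning_rate": 0.07}},
]

best = max(trials, key=lambda t: t["target"])  # what cat_opt.max gives you directly
print(best["target"])  # -> 0.739
```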

With the best parameters in hand, train the final model using them.

For the Bayesian tuning part, we referenced the following articles:
Bayesian methods of hyperparameter optimization

Hyperparameter Optimization using bayesian optimization

as well as the GitHub source: BayesianOptimization

Installing and Getting Started with CatBoost (Python) on Ubuntu 16.04

Overview

CatBoost is said to be a GBDT algorithm that is faster and more accurate than XGBoost and LightGBM. This post records a small pitfall hit during installation, plus a first-use example.

1. Installation

First install the dependencies six and NumPy (assuming you already have Python 3.6 or later):

pip install six

The NumPy that ships with Ubuntu 16.04 is fairly old, so pin NumPy to 1.16.0 or newer:

pip install numpy==1.16.0

Otherwise you'll see an error like:

numpy.ufunc size changed, may indicate binary incompatibility. expected 216, got 192

Recently, using TensorFlow 2.0 with numpy 1.17.1, the same error appeared, so installing numpy 1.16.0 also fixes that TensorFlow 2.0 error.
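Before reinstalling, it's worth checking whether the installed NumPy already meets the pinned minimum by comparing numpy.__version__ numerically (plain string comparison would get "1.9.2" vs "1.16.0" wrong). The helper below is my own sketch, not part of NumPy:

```python
def version_at_least(version, minimum):
    # compare dotted version strings numerically, e.g. "1.16.0" vs "1.9.2"
    parse = lambda v: tuple(int(part) for part in v.split(".")[:3])
    return parse(version) >= parse(minimum)

# in practice: version_at_least(numpy.__version__, "1.16.0")
print(version_at_least("1.17.1", "1.16.0"))  # -> True
print(version_at_least("1.9.2", "1.16.0"))   # -> False
```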

Then install CatBoost with pip or conda; I prefer pip:

pip install catboost

Install the visualization widgets:

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

2. Binary classification demo

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool

### load the training, validation, and test sets
df_train = pd.read_csv('train_dataset.csv')
df_val = pd.read_csv('validation_dataset.csv')
df_test = pd.read_csv('test_dataset.csv')

### get the feature list (drop the label column so it is not used as a feature)
features_list = [col for col in df_train.columns if col != 'label']

### prepare the data
train_x = df_train[features_list]
train_y = list(df_train.label)
val_x = df_val[features_list]
val_y = list(df_val.label)
test_x = df_test[features_list]
test_y = list(df_test.label)

At this point the features of all three datasets are still DataFrames. With xgboost you would have to convert them to DMatrix, but CatBoost can train directly on DataFrame data, which is very convenient.

### set the parameters
model = CatBoostClassifier(iterations=800, # number of trees (boosting rounds)
                           depth=3, # tree depth
                           learning_rate=1, # learning rate
                           loss_function='Logloss', # loss function
                           eval_metric='AUC', # evaluation metric
                           random_seed=696, # random seed
                           reg_lambda=3, # L2 regularization coefficient
                           #bootstrap_type='Bayesian',
                           verbose=True)

### train the model
model.fit(train_x, train_y, eval_set=(val_x, val_y), early_stopping_rounds=10)

The output looks like:

0:  test: 0.6155599 best: 0.6155599 (0) total: 124ms    remaining: 1m 38s
1:  test: 0.6441688 best: 0.6441688 (1) total: 188ms    remaining: 1m 15s
2:  test: 0.6531472 best: 0.6531472 (2) total: 252ms    remaining: 1m 6s
3:  test: 0.6632070 best: 0.6632070 (3) total: 315ms    remaining: 1m 2s
4:  test: 0.6675112 best: 0.6675112 (4) total: 389ms    remaining: 1m 1s
5:  test: 0.6728938 best: 0.6728938 (5) total: 454ms    remaining: 1m
6:  test: 0.6770872 best: 0.6770872 (6) total: 512ms    remaining: 58s
7:  test: 0.6779621 best: 0.6779621 (7) total: 574ms    remaining: 56.9s
8:  test: 0.6794292 best: 0.6794292 (8) total: 636ms    remaining: 55.9s
9:  test: 0.6799766 best: 0.6799766 (9) total: 695ms    remaining: 54.9s
···
70: test: 0.7060854 best: 0.7060854 (70)    total: 4.54s    remaining: 46.6s
71: test: 0.7066276 best: 0.7066276 (71)    total: 4.58s    remaining: 46.4s
72: test: 0.7071572 best: 0.7071572 (72)    total: 4.63s    remaining: 46.1s
73: test: 0.7066621 best: 0.7071572 (72)    total: 4.68s    remaining: 45.9s
74: test: 0.7058151 best: 0.7071572 (72)    total: 4.74s    remaining: 45.8s
75: test: 0.7057014 best: 0.7071572 (72)    total: 4.78s    remaining: 45.5s
76: test: 0.7056642 best: 0.7071572 (72)    total: 4.82s    remaining: 45.3s
77: test: 0.7054756 best: 0.7071572 (72)    total: 4.86s    remaining: 45s
78: test: 0.7064983 best: 0.7071572 (72)    total: 4.91s    remaining: 44.8s
79: test: 0.7060492 best: 0.7071572 (72)    total: 4.96s    remaining: 44.6s
80: test: 0.7057876 best: 0.7071572 (72)    total: 5.02s    remaining: 44.6s
81: test: 0.7058538 best: 0.7071572 (72)    total: 5.09s    remaining: 44.6s
82: test: 0.7063121 best: 0.7071572 (72)    total: 5.16s    remaining: 44.6s
Stopped by overfitting detector  (10 iterations wait)

bestTest = 0.7071571623
bestIteration = 72

Shrink model to first 73 iterations.

Each round's score and the best round so far are printed. On a 180000*500 training set, a single CPU training run takes only about 30s.

We can then score the test set:

preds_proba = model.predict_proba(test_x)

For each sample you get two probabilities: the negative-class and the positive-class probability, in that order.
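To turn those probabilities into 0/1 labels, threshold the positive-class column (the second one). A small sketch with made-up probabilities standing in for model.predict_proba(test_x):

```python
# each row is [P(negative), P(positive)], as returned by predict_proba
preds_proba = [[0.8, 0.2], [0.3, 0.7], [0.45, 0.55]]

threshold = 0.5
labels = [int(p[1] >= threshold) for p in preds_proba]
print(labels)  # -> [0, 1, 1]
```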
We can also inspect the feature importances:

model.feature_importances_
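feature_importances_ comes back as a bare array in the same order as the training columns, so it is handy to zip it with the feature names and sort. A sketch with made-up values standing in for features_list and the trained model's attribute:

```python
# stand-ins for features_list and model.feature_importances_
feature_names = ["age", "income", "tenure"]
importances = [12.5, 60.1, 27.4]

# pair each feature with its importance and sort, most important first
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, score)
```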

To save the model:

model.save_model('catboost版本模型.model')

To load it back:

my_model = CatBoostClassifier()
my_model.load_model('catboost版本模型.model')

This post mainly referenced the official documentation: CatBoost

Installing the R Version of LightGBM on Ubuntu 16.04

Overview

First, many thanks to Xie Ruopeng for the LightGBM installation tutorial and tuning scripts. This afternoon I tried installing on my own 6 GB Ubuntu 16.04 machine, but the make -j step compiling the C++ boosting library kept aborting with out-of-virtual-memory errors; apparently my machine is too low-spec. I could only test on the Bastion3 server instead.

1. Installing LightGBM

First install git:

sudo apt-get install git

Clone the LightGBM source from GitHub:

git clone --recursive https://github.com/Microsoft/LightGBM

Then run the following steps:

cd LightGBM  
mkdir build  
cd build

Next, install two build dependencies:

sudo apt-get install cmake
sudo apt-get install make

Continue with:

cmake ..
make -j

That completes the build.

This step is memory-hungry; even my 6 GB was not enough. A real pitfall.

2. Installing the R packages

Roughly the following installs are needed:

install.packages("readr")
install.packages("GA")
install.packages("dplyr")
install.packages("parallelMap")
install.packages("jsonlite")
install.packages("mlr")

A tip: before installing the mlr package, you must install two system dependencies with:

sudo apt-get install libxml2
sudo apt-get install libxml2-dev

Otherwise you'll get:

ERROR: dependency ‘XML’ is not available for package ‘mlr’

Then build and install the LightGBM R package:

cd LightGBM/R-package
Rscript build_package.R
sudo R CMD INSTALL lightgbm_2.1.0.tar.gz --no-multiarch

With that, the R version of LightGBM is installed.

This post referenced the following articles, thanks!
Case study: lightgbm algorithm optimization for an imbalanced binary classification problem (with code)
Installing and using libxml2 on Ubuntu 14.04