XGBoost Basics: Notes

I. What XGBoost is (in one sentence)

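XGBoost (eXtreme Gradient Boosting) is a high-performance gradient boosting library. It provides tree-based boosters (GBDT and DART) plus a linear booster that is now mostly of historical interest. It offers sparsity-aware training, regularization, early stopping, distributed and multi-GPU training, and categorical feature support, and it is well suited to classification, regression, and ranking on structured (tabular) data. See the official documentation for the entry point and navigation. (docs.xgboost.com.cn)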

II. How it works (the key points you need to know)

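Objective function and regularization
In each boosting round, XGBoost approximates the loss with a second-order Taylor expansion and chooses the split with the highest gain. L1/L2 penalties on leaf weights, together with structural constraints (gamma as the minimum gain required to split, min_child_weight as the minimum Hessian sum in a child), curb overfitting. This is a major reason XGBoost tends to be more robust than a plain GBDT. The paper and its discussion of regularization give the details. (arXiv)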
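The gain criterion above can be sketched in a few lines. This is a minimal illustration of the split-gain formula from the XGBoost paper with L2 regularization only (the L1 term alpha is left out for brevity); the function names `leaf_score` and `split_gain` and the toy data are assumptions made for this sketch, not library APIs.

```python
def leaf_score(G, H, reg_lambda=1.0):
    """Structure score G^2 / (H + lambda) for a leaf with gradient sum G and Hessian sum H."""
    return G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children, minus the gamma penalty."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Toy example with squared error: g_i = pred_i - y_i, h_i = 1
y = [1.0, 1.2, 3.0, 3.4]
pred = [2.0, 2.0, 2.0, 2.0]
g = [p - t for p, t in zip(pred, y)]
h = [1.0] * len(y)

# Candidate split: first two samples go left, last two go right
gain = split_gain(sum(g[:2]), sum(h[:2]), sum(g[2:]), sum(h[2:]),
                  reg_lambda=1.0, gamma=0.0)
print("split gain:", gain)
```

A split is kept only if its gain is positive after subtracting gamma, which is why a larger gamma yields more conservative trees.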
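Missing values and sparsity awareness
During training, XGBoost learns a default direction for missing values at every split. At prediction time, samples with a missing value follow that direction automatically, so mean or mode imputation is not required beforehand. This comes from its sparsity-aware split-finding algorithm. (arXiv)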
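Approximate/histogram splitting and speed
There are three tree construction methods: exact, approx, and hist. In practice hist (histogram-based) is almost always the first choice; on large data or on a GPU, hist with device="cuda" is likewise recommended. (docs.xgboost.com.cn)
Data containers: DMatrix/QuantileDMatrix
DMatrix is XGBoost's core data structure. With histogram-based methods, QuantileDMatrix can also be used to reduce memory use and speed up training, especially on a GPU. (docs.xgboost.com.cn)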
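Early stopping and callbacks
With early_stopping_rounds or an explicit callback (xgboost.callback.EarlyStopping), training stops automatically once the validation metric stops improving, and the best iteration is recorded. (docs.xgboost.com.cn)
Native categorical feature support
Set the dtype of the pandas column to category and pass enable_categorical=True to the model; tedious one-hot preprocessing is no longer needed. Dedicated parameters such as max_cat_to_onehot are also available. (docs.xgboost.com.cn)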
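Imbalanced classes
Set scale_pos_weight ≈ #neg / #pos, or use sample weights, to improve learning when a binary classification problem is heavily imbalanced. (xgboosting.com)
GPU/multi-GPU
Current documentation (2.0 and later) recommends device="cuda" combined with tree_method="hist", rather than the older tree_method="gpu_hist" as the sole switch. Multi-GPU and distributed training are supported through Dask and Spark. (docs.xgboost.com.cn)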
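Saving and loading models
Since 2.1, models and hyperparameters are stored as UBJSON by default (JSON is also supported), which is more reliable across platforms and versions; avoid pickle where possible. (docs.xgboost.com.cn)
Interpretation and outputs
Use predict(..., pred_contribs=True) to obtain SHAP values (per-feature contributions), or importance_type (gain/cover/weight/...) to obtain feature importances. (docs.xgboost.com.cn)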

III. Parameter overview (practical groups)

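Only the key parameters and their intuitive effect are listed here; for the full list and exact semantics, defer to the official parameters page. (docs.xgboost.com.cn)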

Learning process: n_estimators / num_boost_round (number of boosting rounds), learning_rate (shrinks each step to limit overfitting), early_stopping_rounds. (docs.xgboost.com.cn)
Tree complexity: max_depth, min_child_weight, gamma.
Sampling/randomization: subsample (row sampling), colsample_bytree/colsample_bylevel/colsample_bynode (column sampling).
Regularization: lambda (L2), alpha (L1). (Tutorialspoint)
Categorical features: enable_categorical, max_cat_to_onehot, max_cat_threshold. (docs.xgboost.com.cn)
Tree method/device: tree_method="hist", device="cuda" (omit it to run on CPU). (docs.xgboost.com.cn)
Imbalance: scale_pos_weight. (xgboosting.com)
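Several of the parameters above have different names in the scikit-learn wrapper and the native API (for example reg_lambda vs. lambda, reg_alpha vs. alpha, learning_rate vs. eta, random_state vs. seed), and the number of rounds is an argument of xgb.train rather than an entry in the params dict. A minimal sketch of that translation follows; the helper `to_native_params` and the fallback of 100 rounds are assumptions made for illustration, not part of XGBoost.

```python
# scikit-learn wrapper name -> native parameter name
SKLEARN_TO_NATIVE = {
    "learning_rate": "eta",
    "reg_lambda": "lambda",
    "reg_alpha": "alpha",
    "random_state": "seed",
}


def to_native_params(sk_params, default_rounds=100):
    """Split scikit-learn style kwargs into (native params dict, num_boost_round)."""
    params = dict(sk_params)
    # n_estimators is passed to xgb.train as num_boost_round, not inside params
    num_boost_round = params.pop("n_estimators", default_rounds)
    native = {SKLEARN_TO_NATIVE.get(key, key): value for key, value in params.items()}
    return native, num_boost_round


native, rounds = to_native_params({
    "n_estimators": 500,
    "learning_rate": 0.05,
    "max_depth": 6,
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
})
print(native, rounds)
```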

IV. Hands-on practice (Python)

1) Binary classification template (with early stopping and imbalance handling)

import numpy as np, pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# 1. Build an imbalanced dataset
X, y = make_classification(n_samples=10000, n_features=30,
                           n_informative=10, weights=[0.95, 0.05],
                           random_state=42)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Estimate the class ratio and use it as scale_pos_weight
ratio = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=2,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0, reg_alpha=0.0,
    tree_method="hist",
    # device="cuda",  # uncomment if a GPU is available
    scale_pos_weight=ratio,
    eval_metric="auc",
    # Constructor argument since 1.6; the fit() argument was removed in 2.0
    early_stopping_rounds=50,
    enable_categorical=False,
    random_state=42
)

# 3. Early stopping: stop once validation AUC has not improved for 50 rounds
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=50
)

pred = model.predict_proba(X_val)[:, 1]
print("Val AUC:", roc_auc_score(y_val, pred))

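The code above uses the scikit-learn API; early_stopping_rounds registers an EarlyStopping callback internally and records the best iteration automatically. (docs.xgboost.com.cn)
scale_pos_weight ≈ #neg/#pos is a common and practical rule-of-thumb default. (xgboosting.com)
Default to tree_method="hist"; add device="cuda" when a GPU is needed. (docs.xgboost.com.cn)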

2) Regression template (native API with DMatrix and early stopping)

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np

X, y = fetch_california_housing(return_X_y=True, as_frame=False)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# DMatrix is XGBoost's efficient low-level data structure
dtr = xgb.DMatrix(X_tr, label=y_tr)
dva = xgb.DMatrix(X_va, label=y_va)

params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "learning_rate": 0.05,
    "max_depth": 8,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "tree_method": "hist",
    # "device": "cuda",  # uncomment to train on a GPU
}

evals = [(dtr, "train"), (dva, "valid")]
bst = xgb.train(
    params, dtr,
    num_boost_round=5000,
    evals=evals,
    early_stopping_rounds=100,  # early stopping
    verbose_eval=100
)

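# Predict with the trees up to the best iteration found by early stopping
pred = bst.predict(dva, iteration_range=(0, bst.best_iteration + 1))
# squared=False was removed from recent scikit-learn; take the square root explicitly
print("RMSE:", np.sqrt(mean_squared_error(y_va, pred)))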

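DMatrix/QuantileDMatrix serve as the common data container for training, prediction, and explanation; with histogram-based methods, consider QuantileDMatrix first when the data is large or GPU memory is tight. (docs.xgboost.com.cn)
Early stopping can also be passed explicitly via callbacks=[xgb.callback.EarlyStopping(rounds=..., save_best=True)]. (docs.xgboost.com.cn)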

3) Categorical features in one go

import pandas as pd
from xgboost import XGBClassifier

# Assumes df is an existing pandas.DataFrame with a "label" column
# Mark the categorical columns as category dtype
for col in ["city", "device_type"]:
    df[col] = df[col].astype("category")

X = df.drop(columns=["label"])
y = df["label"]

clf = XGBClassifier(
    enable_categorical=True,
    tree_method="hist",
    # device="cuda",
)
clf.fit(X, y)

Setting the category dtype plus enable_categorical=True is enough; no manual one-hot encoding is needed. max_cat_to_onehot and max_cat_threshold can be used to control the behavior. (docs.xgboost.com.cn)

4) Cross-validation and the best number of rounds

import xgboost as xgb
# Assumes X, y (features and binary labels) are already defined, e.g. in template 1

dtrain = xgb.DMatrix(X, label=y)
params = {"objective":"binary:logistic", "eval_metric":"auc", "tree_method":"hist"}
cv = xgb.cv(
    params, dtrain,
    num_boost_round=2000,
    nfold=5,
    stratified=True,
    early_stopping_rounds=50,
    verbose_eval=50,
    shuffle=True
)
best_round = len(cv)

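xgb.cv supports early stopping and returns a metric log with per-round means and standard deviations, which makes it easy to pick the best num_boost_round. (docs.xgboost.com.cn)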

5) Feature importance and SHAP (contributions)

import xgboost as xgb

# scikit-learn API (model and X_val come from template 1)
imps = model.feature_importances_              # default importance
# Native booster
booster = model.get_booster()
gain_imp  = booster.get_score(importance_type="gain")
cover_imp = booster.get_score(importance_type="cover")

# SHAP values (native API); the last column is the bias term
dval = xgb.DMatrix(X_val)
shap_values = booster.predict(dval, pred_contribs=True)

importance_type supports weight/gain/cover/total_gain/total_cover; pred_contribs=True returns a matrix of SHAP values, which is well suited to explaining the contributions for an individual sample. (docs.xgboost.com.cn)

6) Saving/loading models (JSON/UBJSON)

# Save as JSON; the file extension selects the format (.json or .ubj)
bst.save_model("model.json")
bst_loaded = xgb.Booster()
bst_loaded.load_model("model.json")

JSON has been supported since 1.0, and UBJSON has been the default since 2.1; avoid pickle as a long-term archive format. (docs.xgboost.com.cn)

7) GPU acceleration (single/multi-GPU)

import xgboost as xgb
params = {"tree_method": "hist", "device": "cuda"}  # train on the GPU
dtrain = xgb.QuantileDMatrix(X_train, label=y_train)
bst = xgb.train(params, dtrain, num_boost_round=200)

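Recent versions switch to the GPU with device="cuda"; pairing it with QuantileDMatrix and the histogram method is recommended. Dask and Spark are also supported for distributed multi-GPU training. (docs.xgboost.com.cn)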

V. Common pitfalls and lessons learned

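Reflexive one-hot encoding: XGBoost supports categorical features natively, so do not expand high-cardinality columns until memory blows up; try enable_categorical first. (docs.xgboost.com.cn)
Overfitting: reduce max_depth, increase min_child_weight/gamma, add sampling (subsample, colsample_*), and turn on early stopping. (Tutorialspoint)
Class imbalance: use scale_pos_weight or sample weights rather than relying on threshold tuning alone. (xgboosting.com)
Missing-value preprocessing: manual imputation is usually unnecessary; XGBoost learns a default direction for missing values. (arXiv)
Old-style GPU parameters: older tutorials enable the GPU with tree_method="gpu_hist" alone; since 2.0 the recommended form is device="cuda" together with tree_method="hist". (docs.xgboost.com.cn)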

VI. Tuning cheat sheet (starting ranges)

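Rule-of-thumb starting points, not the only answer: first use early stopping to find the best number of rounds, then fine-tune tree depth, sampling, and regularization.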

max_depth: 4–8; min_child_weight: 1–5; gamma: 0–5
subsample/colsample_bytree: 0.6–0.9
learning_rate: 0.03–0.1 (paired with a large n_estimators plus early stopping)
Imbalanced binary classification: start from scale_pos_weight ≈ #neg/#pos, then refine. (xgboosting.com)
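The ranges above can seed a simple random search. The sketch below only draws candidate configurations from those ranges; uniform sampling and the helper name `sample_params` are assumptions made here, and each candidate would still need to be scored, for example with xgb.cv and early stopping.

```python
import random


def sample_params(rng):
    """Draw one candidate configuration from the starting ranges listed above."""
    return {
        "max_depth": rng.randint(4, 8),            # inclusive on both ends
        "min_child_weight": rng.randint(1, 5),
        "gamma": rng.uniform(0.0, 5.0),
        "subsample": rng.uniform(0.6, 0.9),
        "colsample_bytree": rng.uniform(0.6, 0.9),
        "learning_rate": rng.uniform(0.03, 0.1),
    }


rng = random.Random(42)
candidates = [sample_params(rng) for _ in range(5)]
for params in candidates:
    print(params)
```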

VII. Advanced topics (recommended reading)

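Tree methods: differences, use cases, and limitations of exact/approx/hist. (docs.xgboost.com.cn)
Official Python examples: from accessing evaluation metrics and column sampling to GLM and Gamma regression. (docs.xgboost.com.cn)
Paper: the KDD 2016 paper, covering sparsity awareness and the weighted quantile sketch (approximate splitting). (arXiv)
Prediction options: pred_contribs, pred_leaf, strict_shape, and others. (docs.xgboost.com.cn)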

References and further reading

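Official home page and overview, parameters, Python package/callbacks, DMatrix/categorical features, tree methods, GPU, model IO, prediction options. (docs.xgboost.com.cn)
Paper (KDD 2016 and arXiv). (kdd.org)
Rule-of-thumb setting for scale_pos_weight. (xgboosting.com)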
