1. Principle
output = base rate + shap(AGE) + shap(SEX) + shap(BP) + shap(BMI)
0.4 = 0.1 + 0.4 + (-0.3) + (0.1) + (0.1)
A SHAP value of 0 means the feature had no effect on this prediction; a positive value means it pushed the prediction up, and a negative value means it pushed it down. The larger the absolute value, the stronger the feature's influence on the prediction.
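A minimal sketch of this additive decomposition with the shap library, using made-up toy data in place of a real AGE/SEX/BP/BMI dataset:

```python
import numpy as np
import pandas as pd
import shap
import xgboost

# Toy stand-in for the AGE/SEX/BP/BMI example above (values are made up)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "AGE": rng.integers(20, 80, n),
    "SEX": rng.integers(0, 2, n),
    "BP":  rng.normal(120, 15, n),
    "BMI": rng.normal(25, 4, n),
})
y = 0.01 * X["AGE"] + 0.02 * X["BMI"] - 0.05 * X["SEX"] + rng.normal(0, 0.1, n)

model = xgboost.XGBRegressor(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # one row of SHAP values per sample

# For any sample: prediction == base value + sum of that sample's SHAP values
i = 0
print(explainer.expected_value + shap_values[i].sum())
print(model.predict(X.iloc[[i]])[0])
```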
2. Usage (code sketch after the list below)
1. Single sample: the SHAP values of its different features
2. Single feature: the relationship between the feature value and its SHAP value
3. Two features: the relationship between the feature values and the SHAP values
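A sketch of the three uses above, continuing from the explainer, shap_values and X computed in the previous snippet:

```python
import shap

shap.initjs()

# 1. Single sample: contribution of each feature to one prediction (force plot)
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])

# 2. Single feature: feature value vs. its SHAP value across all samples
shap.dependence_plot("BMI", shap_values, X, interaction_index=None)

# 3. Two features: color the same plot by a second feature to see the interaction
shap.dependence_plot("BMI", shap_values, X, interaction_index="AGE")
```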
Feature -> one-hot/multi-hot -> embedding
Purpose: one-hot/multi-hot representations are too sparse, which makes training difficult
Example: the feature representation and embedding layer in the DIN paper, https://arxiv.org/pdf/1706.06978.pdf
https://zhuanlan.zhihu.com/p/111296130
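A minimal embedding-lookup sketch in PyTorch (hypothetical vocabulary size and dimension; DIN itself pools the behavior sequence with attention rather than a plain mean):

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary size and embedding dimension
num_items, emb_dim = 10000, 16
embedding = nn.Embedding(num_items, emb_dim)

# One-hot case: a single item id per sample -> one dense vector per sample
item_ids = torch.tensor([3, 42, 7])                      # shape (batch,)
dense = embedding(item_ids)                              # shape (batch, 16)

# Multi-hot case (e.g. a user behavior sequence): several ids per sample,
# pooled into a single dense vector (DIN replaces this mean with attention)
behavior = torch.tensor([[3, 42, 7, 8], [5, 5, 9, 1]])   # shape (batch, seq_len)
pooled = embedding(behavior).mean(dim=1)                 # shape (batch, 16)
print(dense.shape, pooled.shape)
```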
Mean imputation: fill missing values with the column mean
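A mean-imputation sketch with scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each missing value with the mean of its column
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)   # the NaNs become 4.0 and 2.5
```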
Detecting outliers (sketch after this list)
Value range check
Sigma rule (e.g., 3-sigma)
KNN
Box plot (IQR)
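A sketch of the sigma rule and the box-plot (IQR) rule on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, 200)
x[10] = 120.0                                   # inject one obvious outlier

# Sigma rule: flag points more than 3 standard deviations from the mean
mu, sigma = x.mean(), x.std()
sigma_mask = np.abs(x - mu) > 3 * sigma

# Box-plot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
box_mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print(np.where(sigma_mask)[0], np.where(box_mask)[0])   # both include index 10
```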
Handling outliers (sketch after this list)
Remove them
Replace with the mean
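A sketch of both handling options on a toy column, flagging outliers with a 2-sigma cut for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bp": [120, 118, 125, 130, 400, 122]})   # 400 is an outlier

# Flag outliers (here: more than 2 standard deviations from the mean)
mask = np.abs(df["bp"] - df["bp"].mean()) > 2 * df["bp"].std()

# Option 1: drop the flagged rows
df_dropped = df[~mask]

# Option 2: replace the flagged values with the mean of the remaining values
df_replaced = df.copy()
df_replaced.loc[mask, "bp"] = df.loc[~mask, "bp"].mean()
print(df_dropped, df_replaced, sep="\n")
```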
Feature types: numerical features, text features, categorical features (sketch after this list)
1. Use the numeric value directly
2. Discretization
Bucketing / binning
1. One-hot encoding
2. Embedding
3. Others
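A pandas sketch of bucketing a numerical feature and one-hot encoding a categorical one (toy data):

```python
import pandas as pd

df = pd.DataFrame({"age":  [18, 25, 37, 52, 70],
                   "city": ["beijing", "shanghai", "beijing", "shenzhen", "shanghai"]})

# Numerical feature: discretize into buckets
df["age_bucket"] = pd.cut(df["age"], bins=[0, 20, 40, 60, 100],
                          labels=["<20", "20-40", "40-60", "60+"])

# Categorical feature: one-hot encode
encoded = pd.get_dummies(df[["age_bucket", "city"]])
print(encoded)
```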
CatBoost (handles categorical features natively; sketch below)
https://blog.csdn.net/Datawhale/article/details/120582526
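A CatBoost sketch on made-up data; the categorical column is passed as-is via cat_features instead of being one-hot encoded or embedded by hand:

```python
from catboost import CatBoostClassifier, Pool

# Made-up rows: column 0 is a raw categorical feature, no manual encoding needed
train_data = [["beijing",  25, 1.2],
              ["shanghai", 37, 0.7],
              ["shenzhen", 52, 3.1],
              ["beijing",  18, 0.4]]
labels = [1, 0, 1, 0]

pool = Pool(train_data, label=labels, cat_features=[0])
model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(pool)
print(model.predict([["shanghai", 30, 1.0]]))
```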
Feature selection methods roughly fall into 3 kinds: filter, wrapper, and embedded.
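One scikit-learn sketch per kind, using a built-in dataset for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of any model (ANOVA F-test)
filter_sel = SelectKBest(f_classif, k=10).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest features (recursive elimination)
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: selection happens inside training (L1-regularized coefficients)
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)).fit(X, y)

print(filter_sel.get_support().sum(),
      wrapper_sel.get_support().sum(),
      embedded_sel.get_support().sum())
```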
Features with sparse data are features that have mostly zero values. This is different from features with missing data.
Common ways to deal with sparse features include:
Removing features from the model
Making the features dense
Using models that are robust to sparse features
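A sketch of the first two options on a random sparse matrix (the removal option here simply drops all-zero columns via a variance threshold, and TruncatedSVD stands in for "making the features dense"):

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import VarianceThreshold

# Hypothetical sparse matrix, e.g. the output of one-hot / multi-hot encoding
X = sparse.random(1000, 500, density=0.002, random_state=0, format="csr")

# Option 1: drop features that are (almost) always zero
X_reduced = VarianceThreshold(threshold=1e-4).fit_transform(X)

# Option 2: make the features dense with a low-rank projection
X_dense = TruncatedSVD(n_components=32, random_state=0).fit_transform(X)

print(X.shape, X_reduced.shape, X_dense.shape)
```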