1. Principle
output = base rate + shap(AGE) + shap(SEX) + shap(BP) + shap(BMI)
0.4 = 0.1 + 0.4 + (-0.3) + (0.1) + (0.1)
A SHAP value of 0 means the feature had no effect on this prediction; a positive value means it pushed the prediction up, and a negative value means it pushed it down. The larger the absolute value, the stronger the feature's influence on the prediction.
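A minimal sketch of this additive decomposition with the shap library, using made-up toy data in place of a real AGE/SEX/BP/BMI dataset:

```python
import numpy as np
import pandas as pd
import shap
import xgboost

# Toy stand-in for the AGE/SEX/BP/BMI example above (values are made up)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "AGE": rng.integers(20, 80, n),
    "SEX": rng.integers(0, 2, n),
    "BP":  rng.normal(120, 15, n),
    "BMI": rng.normal(25, 4, n),
})
y = 0.01 * X["AGE"] + 0.02 * X["BMI"] - 0.05 * X["SEX"] + rng.normal(0, 0.1, n)

model = xgboost.XGBRegressor(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # one row of SHAP values per sample

# For any sample: prediction == base value + sum of that sample's SHAP values
i = 0
print(explainer.expected_value + shap_values[i].sum())
print(model.predict(X.iloc[[i]])[0])
```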
2. Usage (code sketch after the list below)
1. Single sample: the SHAP values of its different features
2. Single feature: the relationship between the feature value and its SHAP value
3. Two features: the relationship between the feature values and the SHAP values
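A sketch of the three uses above, continuing from the explainer, shap_values and X computed in the previous snippet:

```python
import shap

shap.initjs()

# 1. Single sample: contribution of each feature to one prediction (force plot)
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])

# 2. Single feature: feature value vs. its SHAP value across all samples
shap.dependence_plot("BMI", shap_values, X, interaction_index=None)

# 3. Two features: color the same plot by a second feature to see the interaction
shap.dependence_plot("BMI", shap_values, X, interaction_index="AGE")
```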
Feature -> one-hot/multi-hot -> embedding
Purpose: one-hot/multi-hot representations are too sparse, which makes training difficult
Example: the feature representation and embedding layer in the DIN paper, https://arxiv.org/pdf/1706.06978.pdf
https://zhuanlan.zhihu.com/p/111296130
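A minimal embedding-lookup sketch in PyTorch (hypothetical vocabulary size and dimension; DIN itself pools the behavior sequence with attention rather than a plain mean):

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary size and embedding dimension
num_items, emb_dim = 10000, 16
embedding = nn.Embedding(num_items, emb_dim)

# One-hot case: a single item id per sample -> one dense vector per sample
item_ids = torch.tensor([3, 42, 7])                      # shape (batch,)
dense = embedding(item_ids)                              # shape (batch, 16)

# Multi-hot case (e.g. a user behavior sequence): several ids per sample,
# pooled into a single dense vector (DIN replaces this mean with attention)
behavior = torch.tensor([[3, 42, 7, 8], [5, 5, 9, 1]])   # shape (batch, seq_len)
pooled = embedding(behavior).mean(dim=1)                 # shape (batch, 16)
print(dense.shape, pooled.shape)
```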
Mean imputation: fill missing values with the column mean
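A mean-imputation sketch with scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each missing value with the mean of its column
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)   # the NaNs become 4.0 and 2.5
```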
Detecting outliers (sketch after this list)
Value range check
Sigma rule (e.g., 3-sigma)
KNN
Box plot (IQR)
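A sketch of the sigma rule and the box-plot (IQR) rule on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, 200)
x[10] = 120.0                                   # inject one obvious outlier

# Sigma rule: flag points more than 3 standard deviations from the mean
mu, sigma = x.mean(), x.std()
sigma_mask = np.abs(x - mu) > 3 * sigma

# Box-plot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
box_mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print(np.where(sigma_mask)[0], np.where(box_mask)[0])   # both include index 10
```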
Handling outliers (sketch after this list)
Remove them
Replace with the mean
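A sketch of both handling options on a toy column, flagging outliers with a 2-sigma cut for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bp": [120, 118, 125, 130, 400, 122]})   # 400 is an outlier

# Flag outliers (here: more than 2 standard deviations from the mean)
mask = np.abs(df["bp"] - df["bp"].mean()) > 2 * df["bp"].std()

# Option 1: drop the flagged rows
df_dropped = df[~mask]

# Option 2: replace the flagged values with the mean of the remaining values
df_replaced = df.copy()
df_replaced.loc[mask, "bp"] = df.loc[~mask, "bp"].mean()
print(df_dropped, df_replaced, sep="\n")
```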
Feature types: numerical features, text features, categorical features (sketch after this list)
1. Use the numeric value directly
2. Discretization
Bucketing / binning
1. One-hot encoding
2. Embedding
3. Others
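A pandas sketch of bucketing a numerical feature and one-hot encoding a categorical one (toy data):

```python
import pandas as pd

df = pd.DataFrame({"age":  [18, 25, 37, 52, 70],
                   "city": ["beijing", "shanghai", "beijing", "shenzhen", "shanghai"]})

# Numerical feature: discretize into buckets
df["age_bucket"] = pd.cut(df["age"], bins=[0, 20, 40, 60, 100],
                          labels=["<20", "20-40", "40-60", "60+"])

# Categorical feature: one-hot encode
encoded = pd.get_dummies(df[["age_bucket", "city"]])
print(encoded)
```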
CatBoost (handles categorical features natively; sketch below)
https://blog.csdn.net/Datawhale/article/details/120582526
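A CatBoost sketch on made-up data; the categorical column is passed as-is via cat_features instead of being one-hot encoded or embedded by hand:

```python
from catboost import CatBoostClassifier, Pool

# Made-up rows: column 0 is a raw categorical feature, no manual encoding needed
train_data = [["beijing",  25, 1.2],
              ["shanghai", 37, 0.7],
              ["shenzhen", 52, 3.1],
              ["beijing",  18, 0.4]]
labels = [1, 0, 1, 0]

pool = Pool(train_data, label=labels, cat_features=[0])
model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(pool)
print(model.predict([["shanghai", 30, 1.0]]))
```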
Feature selection methods roughly fall into 3 kinds: filter, wrapper, and embedded.
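One scikit-learn sketch per kind, using a built-in dataset for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of any model (ANOVA F-test)
filter_sel = SelectKBest(f_classif, k=10).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest features (recursive elimination)
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: selection happens inside training (L1-regularized coefficients)
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)).fit(X, y)

print(filter_sel.get_support().sum(),
      wrapper_sel.get_support().sum(),
      embedded_sel.get_support().sum())
```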
Features with sparse data are features that have mostly zero values. This is different from features with missing data.
Common ways to deal with sparse features include:
Removing features from the model
Making the features dense
Using models that are robust to sparse features
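A sketch of the first two options on a random sparse matrix (the removal option here simply drops all-zero columns via a variance threshold, and TruncatedSVD stands in for "making the features dense"):

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import VarianceThreshold

# Hypothetical sparse matrix, e.g. the output of one-hot / multi-hot encoding
X = sparse.random(1000, 500, density=0.002, random_state=0, format="csr")

# Option 1: drop features that are (almost) always zero
X_reduced = VarianceThreshold(threshold=1e-4).fit_transform(X)

# Option 2: make the features dense with a low-rank projection
X_dense = TruncatedSVD(n_components=32, random_state=0).fit_transform(X)

print(X.shape, X_reduced.shape, X_dense.shape)
```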