In total, the advantages of our proposed FM are:
1) FMs allow parameter estimation under very sparse data where SVMs fail.
2) FMs have linear complexity, can be optimized in the primal and do not rely on support vectors like SVMs.
3) FMs are a general predictor that can work with any real valued feature vector.

II. PREDICTION UNDER SPARSITY

III. FACTORIZATION MACHINES (FM)

A. Factorization Machine Model

1) 模型:

$\hat{y}(x):=w_0+\sum_{i=1}^nw_ix_i+\sum_{i=1}^n\sum_{j=i+1}^n \bbox[border: 2px solid red]{w_{i,j}}x_ix_j$

$x_i$表示第$i$个特征，但是针对上式，一个很大的问题，用户交互矩阵往往是比较稀疏的，这样就会导致对$w_{i,j}$的估算存在很大的问题。举个例子，假如想要估计Alice(A)和Star Trek(ST)的交互参数$w_{A,ST}$，由于训练集中没有实例同时满足$x_A$和$x_{ST}$非零，这会造成$w_{A,ST}=0$。因此这里使用了矩阵分解的思想：

$\textbf{W} = \textbf{V}\textbf{V}^T,\textbf{V}=\begin{pmatrix} \textbf{v}_1 \\ \textbf{v}_2\\...\\\textbf{v}_n \end{pmatrix} \in {R}^{n\times k} \\\bbox[border: 2px solid red]{w_{i,j}=<\textbf{v}_i,\textbf{v}_j>=\sum_{f=1}^k v_{i,f}\cdot v_{j,f}} \\\hat{y}(x):=w_0+\sum_{i=1}^nw_ix_i+\sum_{i=1}^n\sum_{j=i+1}^n \bbox[border: 2px solid red]{<\textbf{v}_i,\textbf{v}_j>}x_ix_j$

2) 提升效率:

直接计算上面的公式求解$\hat{y}(x)$的时间复杂度为$O ( k n^2 ) $，因为所有的特征交叉都需要计算。但是可以通过公式变换，将时间复杂度减少到$O(kn)$，如下公式推导

$\begin{align*} \\\sum_{i=1}^n\sum_{j=i+1}^n <\textbf{v}_i,\textbf{v}_j>x_ix_j&=\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n <\textbf{v}_i,\textbf{v}_j>x_ix_j-\frac{1}{2}\sum_{i=1}^n\ <\textbf{v}_i,\textbf{v}_i>x_ix_i \\&=... \\&=\frac{1}{2}\sum_{f=1}^k((\sum_{i=1}^nv_{i,f}x_i)^2-\sum_{i=1}^nv_{i,f}^2x_i^2) \end{align*}$

B. Factorization Machines as Predictors

FM can be applied to a variety of prediction tasks. Among them are: Regression，Binary classification，Ranking

C. Learning Factorization Machines

the model parameters of FMs can be learned efficiently by gradient descent methods – e.g. stochastic gradient descent (SGD).The gradient of the FM model is:

$\\\begin{equation} \frac{\partial \hat{y}(x)}{\partial \theta}=\left\{ \begin{array}{rcl} 1& & {if \ \theta \ is \ w_0 }\\ x_i & & {if \ \theta \ is \ w_i}\\ x_i\sum_{j=1}^nv_{j,f}x_j-v_{i,f}x_i^2 & & {if \ \theta \ is \ v_{i,f}} \end{array} \right. \end{equation}$

D. d-way Factorization Machine

The 2-way FM described so far can easily be generalized to a d-way FM:

$\hat{y}(x):=w_0+\sum_{i=1}^nw_ix_i+\sum_{l=2}^d\sum_{i_l=1}^n ...\sum_{i_l=i_{l-1}+1}^n (\prod \limits_{j=1}^lx_{i_j})(\sum_{f=1}^{k_l}\prod \limits_{j=1}^lv_{i_j,f}^{(l)})$

直接计算上式的时间复杂度为$O(k_dn^d)$，利用类似上面的公式变形也可以将其降低为$O(k_d n )$

E. Summary

FMs model all possible interactions between values in the feature vector $x$ using factorized interactions instead of full parametrized ones. This has two main advantages:

1) The interactions between values can be estimated even under high sparsity. Especially, it is possible to generalize to unobserved interactions.
2) The number of parameters as well as the time for prediction and learning is linear.

IV. FMS VS. SVMS

V. FMS VS. OTHER FACTORIZATION MODELS

参考

https://blog.csdn.net/qq_26822029/article/details/103993243

机器学习模型结构

2021-09-30 0f5a5f655c5ed5274d387449e02b35aa 99+ 2 m 0.3 k

特征稀疏

What are sparse features?

Features with sparse data are features that have mostly zero values. This is different from features with missing data.

Why is machine learning difficult with sparse features?

Common problems with sparse features include:

If the model has many sparse features, it will increase the space and time complexity of models. Linear regression models will fit more coefficients, and tree-based models will have greater depth to account for all features.
Model algorithms and diagnostic measures might behave in unknown ways if the features have sparse data. Kuss [2002] shows that goodness-of-fit tests are flawed when the data is sparse.
If there are too many features, models fit the noise in the training data. This is called overfitting. When models overfit, they are unable to generalize to newer data when they are put in production. This negatively impacts the predictive power of models.
Some models may underestimate the importance of sparse features and given preference to denser features even though the sparse features may have predictive power. Tree-based models are notorious for behaving like this. For example, random forests overpredict the importance of features that have more categories than those features that have fewer categories.

Methods for dealing with sparse features

Removing features from the model
Make the features dense
Using models that are robust to sparse features

参考

https://www.kdnuggets.com/2021/01/sparse-features-machine-learning-models.html#:~:text=%20Methods%20for%20dealing%20with%20sparse%20features%20,that%20are%20robust%20to%20sparse%20features%20More%20

机器学习特征工程

特征稀疏

2021-09-23 88139e860b94cc8e241f92b9e138f7dd 99+ 11 m 1.7 k

决策树

1.概述

基本上决策树都是二叉树

两幅图意思一样

2.决策树 vs 逻辑回归

最大的差异上图就可以看出，左边为逻辑回归的决策面，右边为决策树的决策面

3.构建算法

常见方法的有ID3，C4.5，CART，总结如下

3.1 ID3

假设训练数据集为 $D$,∣$D∣$表示其大小。设有$K$个分类$ C_1,C_2,…,C_K$。设特征集为$\textbf{A}$,假设某个特征$ A$ 有$ n$ 个不同的取值 $\{a_1,a_2,…,a_n\}$,根据特征$A$的取值将 $D$ 划分成$n$个子集。记子集 $D_i$中属于类$ C_k$的样本集合为 $D_{ik}$。

数据集$D$的经验熵$H(D)$:

$H(D) = - \sum_{j=1}^K \frac {|C_j|}{|D|} \log \frac {|C_j|}{|D|}$

特征$A$对数据集$ D$的经验条件熵$H(D|A)$

$\begin{equation*} \begin{split} H(D|A) &= \sum_{i=1}^n \frac {|D_i|}{|D|} H(D_i) \\ & = - \sum_{i=1}^n \frac {|D_i|}{|D|}\sum_{j=1}^K \frac {|D_{ij}|}{|D_i|} \log \frac {|D_{ij}|}{|D_i|} \\ \end{split} \end{equation*}$

信息增益$g(D,A)$

$g(D,A)=H(D)-H(D|A)$

算法流程:

若$D$中所有实例都属于同一类 $C_k$,则 $T$ 为单节点树,并将类 $C_k$作为该节点的类标记,返回$T$.
若$\textbf{A}=\phi$,则$T$为单节点树,并将$D$中实例最大的类$C_k$作为该节点的类标记,返回$T$.
否则,按照信息增益的算法,计算每个特征对$D$的信息增益,取信息增益最大的特征 $A_g$.
如果$A_g< \varepsilon$,则置 $T$为单节点树,并将$D$中实例最大的类$C_k$作为该节点的类标记,返回$T$.
否则,对$A_g$的每一可能值 $a_i$,依$A_g=a_i$将$D$分成若干非空子集$D_i$
以$D_i$为训练集,以$\textbf{A}- A_g $为特征集,递归地调用步骤1到步骤5,得到子树 $T_i$,全部 $T_i$构成$T$,返回$T$.

3.2 C4.5

C4.5算法流程与ID3相类似，只不过将信息增益改为信息增益比，以解决偏向取值较多的属性的问题，另外它可以处理连续型属性。

分裂信息 $SplitInformation(D,A)$

$SplitInformation(D,A) = -\sum_{i=1}^n \frac {|D_i|}{|D|} \log \frac {|D_i|}{|D|}$

信息增益比 $GainRatio(D, A)$

$GainRatio(D, A) = \frac {g(D, A)} {SplitInformation(D, A)}$

3.3 CART

3.3.1 CART分类树

CART分类树算法使用基尼系数来代替信息增益（比），基尼系数代表了模型的不纯度，基尼系数越小，不纯度越低，特征越好。这和信息增益（比）相反。

对于样本$D$，个数为$|D|$，假设$K$个类别，第$k$个类别的数量为$|C_k|$，则样本$D$的基尼系数表达式：

$Gini(D)=1-\sum_{k=1}^{K}(\frac{|C_k|}{|D|})^2$

对于样本$D$，个数为$|D|$，根据特征$A$的某个值$a$，把$D$分成$|D_1|$和$|D_2|$，则在特征$A=a$的条件下，样本$D$的基尼系数表达式为：

$Gini(D,A)=\frac{|D_1|}{|D|}Gini(D_1)+\frac{|D_2|}{|D|}Gini(D_2)$

算法流程：算法输入训练集$D$，基尼系数的阈值，样本个数阈值。输出的是决策树$T$。

(1)、对于当前节点的数据集为$D$，如果样本个数小于阈值或没有特征，则当前节点停止递归，返回决策树。

(2)、计算样本集$D$的基尼系数，如果基尼系数小于阈值，则当前节点停止递归，返回决策树。

(3)、计算当前节点所有特征的各个特征值对数据集$D$的基尼系数

(4)、在计算出来的所有基尼系数中，选择基尼系数最小的特征$A$和对应的特征值$a$，并把数据集划分成两部分$D_1$和$D_2$，同时建立当前节点的左右节点，左节点的数据集$D$为$D_1$，右节点的数据集$D$为$D_2$。

(5)、对左右的子节点递归的调用1-4步，生成决策树。

3.3.2 CART回归树

对回归树用平方误差最小化准则

算法流程：输入为训练数据$D$，输出为回归树$f(x)$

(1) 选择最优的切分变量$j$和切分点$s$，遍历$j$，对固定的$j$遍历$s$，求解

$\min \limits_{j,s} \ [\min \limits_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min \limits_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2]$

(2) 用选定的$(j,s)$划分区域并决定输出值

$R_1(j,s)=\{x|x^{(j)}\le s\},R_2(j,s)=\{x|x^{(j)}> s\} \\\hat{c}_m=\frac{1}{N_m}\sum_{x_i\in R_m(j,s)}y_i,m=1,2$

(3) 继续对两个子区域调用步骤（1）（2），直至满足停止条件

(4) 将输入空间划分成$M$个区域$R_1,R_2,…,R_M$，生成回归树

$f(x)=\sum_{m=1}^{M}\hat{c}_mI(x\in R_m)$

3.4 多变量决策树

无论ID3，C4.5，CART都是选择一个最优的特征做分类决策，但大多数，分类决策不是由某一个特征决定，而是一组特征。这样得到的决策树更加准确，这种决策树叫多变量决策树(multi-variate decision tree)。在选择最优特征的时，多变量决策树不是选择某一个最优特征，而是选择一个最优的特征线性组合做决策。

代表算法OC1。

4.剪枝

剪枝(pruning)是解决决策树过拟合的主要手段，通过剪枝可以大大提升决策树的泛化能力。通常，剪枝处理可分为：预剪枝，后剪枝。

预剪枝：通过启发式方法，在生成决策树过程中对划分进行考察，若当前结点的划分影响决策树的泛化性能，则停止划分，并将其标记为叶节点
后剪枝：对已有的决策树，自底向上的对非叶结点进行考察，若该结点对应的子树替换为叶结点能提升决策树的泛化能力，则将改子树替换为叶结点

参考

https://zhuanlan.zhihu.com/p/89607509

https://www.cnblogs.com/wxquare/p/5379970.html

https://blog.csdn.net/qq_43468807/article/details/105969232

http://leijun00.github.io/2014/09/decision-tree/

https://zhuanlan.zhihu.com/p/89607509

https://www.cnblogs.com/wxquare/p/5379970.html

https://blog.csdn.net/qq_43468807/article/details/105969232

http://leijun00.github.io/2014/09/decision-tree/

https://www.cnblogs.com/wqbin/p/11689709.html

https://www.cnblogs.com/keye/p/10564914.html

https://www.cnblogs.com/keye/p/10601501.html

https://cloud.tencent.com/developer/article/1486712

https://blog.csdn.net/weixin_44132485/article/details/106502422

机器学习模型结构

决策树

2021-09-08 0eb98971f8b7965b646875b43430b474 99+ 6 m 1.0 k

降维

1.意义

1.高纬空间样本具有稀疏性，容易欠拟合

2.可视化

3.维度过大导致训练时间长，预测慢

2.分类

大致分为线形降维度和非线性降维，线形降维包括PCA，LDA等，非线性降维包括LLE，t-SNE，auto encoder等。

3.PCA系列

3.1 PCA

假设矩阵$x\in \mathbb{R}^{ n}$，假设有$M$个样本，将原始数据按列组成$M$ 行$ n $列矩阵$ X\in \mathbb{R}^{M\times n}$，PCA的使用过程为：

1.计算协方差矩阵$G_t \in \mathbb{R}^{n \times n}$

$G_t=\frac{1}{M}\sum_{j=1}^{M}(X-\overline{X})^T(X-\overline{X})$

注意，其中$\overline{X}\in \mathbb{R}^{n}$为列的均值，$X-\overline{X}$表示将$ X$ 的每一列进行零均值化，即减去这一列的均值

2.求出协方差矩阵的特征值及对应的特征向量

$(U,\sum,V)=SVD(G_t)$

3.将特征向量按对应的特征值大小排列，取前 $k$ 列组成矩阵 $P\in \mathbb{R}^{n \times k} $

$P=U(:,1:k)$

4.实现数据降维

$y_{[1\times k]}=x_{[1\times n ]}P_{[n\times k]}$

局限：

a. PCA只能针对1D的向量，对于2D的矩阵而言，比如图片数据，需要先flatten成向量

3.2 2DPCA

将2D的矩阵flatten成向量其实丢失了行列的位置信息，为了直接在2D的矩阵上实现降维，提出了2DPCA。

假设有原始矩阵$A\in\mathbb{R}^{m \times n }$，使用过程如下：

1.计算协方差矩阵

$G_t=\frac{1}{M}\sum_{j=1}^{M}(A_j-\overline{A})^{T}(A_j-\overline{A}) \\\overline{A}=\frac{1}{M}\sum_{j=1}^{M}A_j$

2.求出协方差矩阵的特征值及对应的特征向量

$(U,\sum,V)=SVD(G_t)$

3.将特征向量按对应的特征值大小排列，取前 $k$ 列组成矩阵

$X=U(:,1:k)$

4.实现数据降维

$Y_{[m\times k]}=A_{[m\times n ]}X_{[n\times k]}$

3.3 （2D）2PCA

作者证明了2DPCA只是在行上工作，然后提出了Alternative 2DPCA可以工作在列上，最后将其结合得到（2D）2PCA，使其可以同时在行列工作

假设有原始矩阵$A\in\mathbb{R}^{m \times n }$，使用过程如下：

1.计算协方差矩阵

$G_x=\frac{1}{M}\sum_{k=1}^{M}\sum_{i=1}^{m}(A_k^{(i)}-\overline{A}^{(i)})^{T}(A_k^{(i)}-\overline{A}^{(i)})$

其中$A_k=[(A_k^{(1)})^{T} \ (A_k^{(2)})^{T} \ …\ (A_k^{(m)})^{T}]^{T},\overline{A}=[(\overline{A}^{(1)})^{T} \ (\overline{A}^{(2)})^{T} \ …\ (\overline{A}^{(m)})^{T}]^{T}, \ A_k^{(i)}, \overline{A}^{(i)}$表示$A_k,\overline{A}$的第$i$行

$\\G_z=\frac{1}{M}\sum_{k=1}^{M}\sum_{j=1}^{n}(A_k^{(j)}-\overline{A}^{(j)})(A_k^{(j)}-\overline{A}^{(j)})^{T}$

其中$A_k=[(A_k^{(1)}) \ (A_k^{(2)}) \ …\ (A_k^{(n)})],\overline{A}=[(\overline{A}^{(1)}) \ (\overline{A}^{(2)}) \ …\ (\overline{A}^{(n)})],A_k^{(j)},\overline{A}^{(j)}$表示$A_k,\overline{A}$的第$j$列

2.求出协方差矩阵的特征值及对应的特征向量

$(U_x,\sum x,V_x)=SVD(G_x) \\(U_z,\sum z,V_z)=SVD(G_z)$

3.将特征向量按对应的特征值大小排列

$X=U_x(:,1:k) \in \mathbb{R}^{n\times k} \\Z=U_z(:,1:q) \in \mathbb{R}^{m\times q}$

4.实现数据降维

$C_{[q\times k]}=Z^{T}_{[q\times m]}A_{[m\times n ]}X_{[n\times k]}$

4.t-SNE

4.1 原理

可以参考 https://blog.csdn.net/scott198510/article/details/76099700

4.2代码

from sklearn.manifold import TSNE
from matplotlib import pylab
import  torch
import  pandas as pd


embdding=torch.load(path1)
words = pd.read_csv(path2)

words_embedded = TSNE(n_components=2).fit_transform(embdding)

pylab.figure(figsize=(20, 20))
for i, label in enumerate(words):
  x, y = words_embedded[i, :]
  pylab.scatter(x, y)
  pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                 ha='right', va='bottom')
pylab.show()

5.auto encoder

AutoEncoder在优化过程中无需使用样本的label，通过最小化重构误差希望学习到样本的抽象特征表示z，这种无监督的优化方式大大提升了模型的通用性。

参考

https://zhuanlan.zhihu.com/p/68903857

https://blog.csdn.net/scott198510/article/details/76099700

机器学习模型结构

降维

2021-09-07 a163153037c01b910d58c60152e6ada9 99+ 3 m 0.5 k

正则化

是机器学习中对原始损失函数引入额外信息，以便防止过拟合和提高模型泛化性能的一类方法的统称。

1.L1正则（Lasso回归）

L1正则化可以使得参数稀疏化，即得到的参数是一个稀疏矩阵，可以用于特征选择。

$\begin{align*} L_{L1}(w)&=L(w)+\lambda\Vert w \Vert_1=L(w)+\lambda\sum_{i=1}^{N}|w_i| \\ \frac{\partial L_{L1}}{\partial w_i}&=\frac{\partial L}{\partial w_i}+\lambda \ sgn(w_i) \\w_i &\rightarrow w_i-\eta(\frac{\partial L}{\partial w_i}+\lambda \ sgn(w_i)) \rightarrow w_i-\eta\lambda \ sgn(w_i)-\eta\frac{\partial L}{\partial w_i} \end{align*}$

L1是每次减去一个常数的收敛，所以L1更容易收敛到0。

2.L2正则（Ridge回归）

L2正则化使得参数平滑。

$\begin{align*} L_{L2}(w)&=L(w)+\lambda\Vert w \Vert_2^2=L(w)+\lambda\sum_{i=1}^{N}w_i^2 \\ \frac{\partial L_{L2}}{\partial w_i}&=\frac{\partial L}{\partial w_i}+2\lambda w_i \\w_i& \rightarrow w_i-\eta(\frac{\partial L}{\partial w_i}+2\lambda w_i) \rightarrow(1-2\eta\lambda)w_i-\eta \frac{\partial L}{\partial w_i} \end{align*}$

L2是每次乘上一个小于1的倍数进行收敛，所以L2使得参数平滑。

3.dropout

使用：在训练时，每个神经单元以概率$p$被保留(Dropout丢弃率为$1−p$)；在预测阶段，每个神经单元都是存在的。

原理：神经网络通过Dropout层以一定比例随即的丢弃神经元，使得每次训练的网络模型都不相同，多个Epoch下来相当于训练了多个模型，同时每一个模型都参与了对最终结果的投票，从而提高了模型的泛化能力，类似bagging。

参考

https://www.cnblogs.com/zingp/p/11631913.html

https://blog.csdn.net/b876144622/article/details/81276818

https://www.zhihu.com/question/37096933/answer/70494622

机器学习训练技巧

正则化

2021-09-07 a1181f0da285c1067e9bf5d83c659fb8 99+ a minute 0.2 k

损失函数

损失函数一般都要用可导函数，因为常用的优化算法，比如梯度下降，牛顿法，都需要导数。

1.回归损失

1.1 Mean Squared Error

$L=\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^2$

1.2 Mean Absolute Error

$L=\frac{1}{N}\sum_{i=1}^{N}|y_i-\hat{y}_i|$

1.3 Huber Loss ( Smooth Mean Absolute Error Loss )

$L=\frac{1}{N}\sum_{i=1}^{N}[\mathbb{1}_{|y_i-\hat{y_i}|\le \delta}\frac{(y_i-\hat{y_i})^2}{2}+\mathbb{1}_{|y_i-\hat{y_i}|> \delta} \ (\delta \cdot |y_i-\hat{y_i}|-\frac{1}{2}\delta^{2})]$

2.分类损失

2.1 Cross-entropy loss

https://zhuanlan.zhihu.com/p/100921909

K是种类数量
y是标签
p是神经网络的输出，也就是指类别是i的概率

2.2 Hinge loss

$L=max(0,1-y\cdot \hat{y})$

SVM模型的损失函数本质上就是 Hinge Loss + L2 正则化

参考

https://blog.csdn.net/m0_38007695/article/details/114983574

机器学习 loss

损失函数

CRF

HMM

一.算法流程

二.距离度量选择

三.K值的选择

四.优缺点

1.回顾线性回归和LR

2.介绍svm

I. INTRODUCTION

II. PREDICTION UNDER SPARSITY

III. FACTORIZATION MACHINES (FM)

A. Factorization Machine Model

B. Factorization Machines as Predictors

C. Learning Factorization Machines

D. d-way Factorization Machine

E. Summary

IV. FMS VS. SVMS

V. FMS VS. OTHER FACTORIZATION MODELS

参考

What are sparse features?

Why is machine learning difficult with sparse features?

Methods for dealing with sparse features

参考

1.概述

2.决策树 vs 逻辑回归

3.构建算法

3.1 ID3

3.2 C4.5

3.3 CART

3.3.1 CART分类树

3.3.2 CART回归树

3.4 多变量决策树

4.剪枝

参考

1.意义

2.分类

3.PCA系列

3.1 PCA

3.2 2DPCA

3.3 （2D）2PCA

4.t-SNE

4.1 原理

4.2代码

5.auto encoder

参考

1.L1正则（Lasso回归）

2.L2正则（Ridge回归）

3.dropout

参考

1.回归损失

1.1 Mean Squared Error

1.2 Mean Absolute Error

1.3 Huber Loss ( Smooth Mean Absolute Error Loss )

2.分类损失

2.1 Cross-entropy loss

2.2 Hinge loss

参考

Recents

Categories

Archives

Tags

Subscribe for updates