BM25
An improvement over TF-IDF:
improvement to the TF part
improvement to the IDF part
https://zhuanlan.zhihu.com/p/435367182
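To make the two changes concrete, the standard BM25 score in its common textbook form (not taken from the link above; k_1 and b are the usual tuning parameters):

score(D, Q) = \sum_{q \in Q} IDF(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \cdot |D| / avgdl\right)}
IDF(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)

The TF part saturates with term frequency and is normalized by document length; the IDF part adds 0.5 smoothing so the weight stays stable for very rare or very common terms.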
1. Application in pre-training
Pre-trained Models for Natural Language Processing: A Survey
https://arxiv.org/pdf/2003.08271v4.pdf
2. Application in fine-tuning
SimCSE, ConSERT
Feature -> one-hot / multi-hot -> embedding
Purpose: one-hot / multi-hot representations of features are too sparse, which hurts training.
Example: the feature representation and embedding layer in the DIN paper, https://arxiv.org/pdf/1706.06978.pdf
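A toy TensorFlow sketch of the lookup (vocabulary size, embedding dimension, and sum pooling for the multi-hot case are illustrative assumptions, not from the DIN paper):

import tensorflow as tf

vocab_size, embed_dim = 10000, 16
embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)

# one-hot style feature: a single id per example (e.g. item category id)
item_ids = tf.constant([3, 42, 7])                         # shape (batch,)
item_emb = embedding(item_ids)                             # shape (batch, 16), dense

# multi-hot style feature: a padded id list per example (e.g. the user behavior sequence in DIN);
# the looked-up embeddings are sum-pooled into one vector (padding positions would be masked in practice)
hist_ids = tf.constant([[1, 5, 9], [2, 0, 0], [4, 8, 0]])  # 0 = padding id
hist_emb = tf.reduce_sum(embedding(hist_ids), axis=1)      # shape (batch, 16)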
Informer
Paper: https://arxiv.org/abs/2012.07436
Code (GitHub): https://github.com/zhouhaoyi/Informer2020
https://zhuanlan.zhihu.com/p/363084133
https://blog.csdn.net/fluentn/article/details/115392229
https://blog.csdn.net/weixin_42838061/article/details/117361871
There are several severe issues with the Transformer that prevent it from being directly applied to LSTF (long sequence time-series forecasting): quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture.
Informer has three distinctive characteristics: 1. the ProbSparse self-attention mechanism; 2. self-attention distilling; 3. the generative-style decoder.
Query Sparsity Measurement
Some previous attempts have revealed that the distribution of self-attention probabilities is potentially sparse: the self-attention scores form a long-tail distribution (see the figure above). The next step is to pick out the queries that do not fit this sparsity, i.e. those whose attention distribution is close to uniform; the KL divergence between a query's attention distribution and the uniform distribution measures the distance between the two, which gives the i-th query's sparsity measurement.
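The i-th query's sparsity measurement in the paper is the log-sum-exp of its scaled dot products with all keys minus their arithmetic mean; a large value means the attention distribution is far from uniform, i.e. a dominant ("active") query:

M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\,q_i k_j^\top / \sqrt{d}} \;-\; \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}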
However, using the formula above directly has several problems: 1. it requires calculating every dot-product pair; 2. the LogSumExp operation has a potential numerical stability issue. An approximate computation is therefore proposed.
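The approximation replaces the log-sum-exp with a max, and in practice both terms are evaluated only on a randomly sampled subset of keys, so not every dot product is needed:

\bar{M}(q_i, K) = \max_j \frac{q_i k_j^\top}{\sqrt{d}} \;-\; \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}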
ProbSparse Self-attention
The masked version can be achieved by applying a positional mask in step 6 and using cumsum() in place of mean() in step 7. In practice, we can use sum() as a simpler implementation of mean().
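A minimal single-head PyTorch sketch of ProbSparse self-attention (batch dimension omitted; the sampling factor and the mean(V) fallback for non-selected queries follow my reading of the paper and the official repo, so treat names and defaults as assumptions):

import math
import torch

def probsparse_attention(Q, K, V, factor=5):
    # Q: (L_Q, d), K: (L_K, d), V: (L_K, d) -- single head, no batch, for readability
    L_Q, d = Q.shape
    L_K = K.shape[0]
    u = max(1, min(L_Q, int(factor * math.ceil(math.log(L_Q)))))         # "active" queries to keep
    sample_k = max(1, min(L_K, int(factor * math.ceil(math.log(L_K)))))  # keys sampled for the measurement

    # 1. max-mean sparsity measurement, estimated on a random subset of keys
    idx = torch.randint(0, L_K, (sample_k,))
    S = Q @ K[idx].T / math.sqrt(d)                             # (L_Q, sample_k)
    M = S.max(dim=-1).values - S.mean(dim=-1)

    # 2. full attention only for the top-u queries; the other ("lazy") queries output mean(V)
    top = M.topk(u).indices
    out = V.mean(dim=0).expand(L_Q, -1).clone()
    attn = torch.softmax(Q[top] @ K.T / math.sqrt(d), dim=-1)   # (u, L_K)
    out[top] = attn @ V
    return out

# e.g. out = probsparse_attention(torch.randn(96, 64), torch.randn(96, 64), torch.randn(96, 64))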
Self-attention Distilling
To enhance the robustness of the distilling operation, we build replicas of the main stack with halving inputs, and progressively decrease the number of self-attention distilling layers by dropping one layer at a time, like a pyramid (Fig. 2). Note: the encoder has two stacks, a main stack and a secondary stack (see Fig. 2); the secondary stack's input is half of the main stack's.
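Between attention blocks, the distilling step is roughly Conv1d -> ELU -> max-pool with stride 2, which halves the sequence length each time. A PyTorch sketch along the lines of the official repo's ConvLayer (exact kernel sizes and the normalization choice are assumptions):

import torch.nn as nn

class DistillingLayer(nn.Module):
    # Conv1d -> BatchNorm -> ELU -> MaxPool(stride 2): roughly halves the sequence length
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)           # (batch, ~seq_len // 2, d_model)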
Generative Style Decoder
Difference from the original Transformer: mainly in prediction. The original decoder predicts step by step (autoregressively), while here the whole output comes from one forward procedure. How is this achieved? The key is how the decoder input is constructed.
A fully connected layer produces the final output, and its output size d_y depends on whether we are performing univariate or multivariate forecasting: d_y = 1 for univariate, d_y > 1 for multivariate.
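A small sketch of how the decoder input can be built for one-shot prediction (the lengths, feature count, and the stand-in decoder output below are illustrative assumptions): a slice of the known history acts as the start token and zeros stand in for the L_pred future steps.

import torch
import torch.nn as nn

batch, L_token, L_pred, n_features, d_model = 8, 48, 24, 7, 512   # illustrative sizes
x_enc = torch.randn(batch, 96, n_features)            # the known input sequence

x_token = x_enc[:, -L_token:, :]                      # last L_token known steps act as the "start token"
x_zeros = torch.zeros(batch, L_pred, n_features)      # zero placeholders for the horizon to predict
x_dec = torch.cat([x_token, x_zeros], dim=1)          # decoder input, consumed in one forward pass

# final fully connected layer: d_y = 1 (univariate) or d_y > 1 (multivariate)
d_y = 1
proj = nn.Linear(d_model, d_y)
dec_hidden = torch.randn(batch, L_token + L_pred, d_model)   # stand-in for the decoder output
y_hat = proj(dec_hidden)[:, -L_pred:, :]              # keep only the predicted horizon, (batch, L_pred, d_y)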
TF 1.x: 1. session.run; 2. officially recommended: tf.data.Dataset + tf.estimator.Estimator
TF 2.x: officially recommended: tf.data.Dataset + tf.keras (a small sketch follows the links below)
https://blog.csdn.net/qq_38978225/article/details/108942427
https://blog.csdn.net/keeppractice/article/details/105934521
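A minimal TF 2.x sketch of the recommended tf.data.Dataset + tf.keras pipeline; the toy data and tiny model are illustrative assumptions, just to show the Dataset -> Keras wiring:

import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 8).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(1000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(dataset, epochs=3)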