start=>start: Start
io=>inputoutput: Input text
cond=>condition: Condition
sub=>subroutine: Subroutine
end=>end: End
op1=>operation: Input text
op2=>operation: tokenize
op3=>operation: Word-vector matrix (pretrained or randomly initialized)
op4=>operation: token embedding
op1->op2->op3->op4

NLP Text Representation
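The flow above (input text, tokenize, word-vector matrix, token embedding) can be sketched in a few lines. This is a minimal illustration only: the whitespace tokenizer, the helper names and the embedding dimension of 4 are assumptions made for the example, not part of the original notes.

```python
import random

def tokenize(text):
    # Hypothetical whitespace tokenizer; real pipelines usually use subword tokenizers.
    return text.lower().split()

def build_vocab(tokens):
    # Map each distinct token to an integer id, in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def init_embedding_matrix(vocab_size, dim, seed=0):
    # Randomly initialized word-vector matrix of shape (vocab_size, dim).
    # A pretrained matrix could be loaded here instead.
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(text, vocab, matrix):
    # Token embedding: one row lookup in the matrix per token id.
    return [matrix[vocab[tok]] for tok in tokenize(text)]

text = "the cat sat on the mat"
vocab = build_vocab(tokenize(text))
matrix = init_embedding_matrix(len(vocab), dim=4)
embedded = embed(text, vocab, matrix)
```

Both occurrences of "the" map to the same row, which is the point of a lookup table: the embedding depends only on the token id.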

token embedding

2021-11-22

Semi-supervised Learning and Self-Supervised Learning

https://blog.csdn.net/qq_44015059/article/details/106448533

Machine Learning

Semi-supervised and self-supervised learning

2021-11-21

Colab

No GPU? Borrow one for free from a big company!

https://blog.csdn.net/zhang_li_ke/article/details/89704682

https://www.jianshu.com/p/a42d69568966

https://blog.csdn.net/oldmao_2001/article/details/90737735?

Machine Learning

Google Colab

2021-11-21

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Paper: https://arxiv.org/abs/2012.07436

Code (GitHub): https://github.com/zhouhaoyi/Informer2020

https://zhuanlan.zhihu.com/p/363084133

https://blog.csdn.net/fluentn/article/details/115392229

https://blog.csdn.net/weixin_42838061/article/details/117361871

Abstract

There are several severe issues with Transformer that prevent it from being directly applicable to LSTF (long sequence time-series forecasting), including quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture.

Informer has three distinctive characteristics: (1) the ProbSparse self-attention mechanism, (2) self-attention distilling, and (3) the generative-style decoder.

2 Preliminary

3 Methodology

3.1 The Uniform Input Representation

3.2 Efficient Self-attention Mechanism

Query Sparsity Measurement

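Some previous attempts have revealed that the distribution of self-attention probability has potential sparsity. The "sparsity" self-attention score forms a long-tail distribution (see the figure above). The next step is to identify the queries that do not show this sparsity, that is, queries whose attention distribution is close to the uniform distribution. The KL divergence is used to measure the distance between a query's attention distribution and the uniform distribution, which gives the i-th query's sparsity measurement.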
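For reference, the measurement can be written out as follows. This is reconstructed from the paper's definitions as I understand them; the symbol u for the uniform distribution is my own notation, chosen to avoid a clash with the query q_i. L_K is the number of keys and d the key dimension.

```latex
p(k_j \mid q_i) = \frac{\exp\left(q_i k_j^\top / \sqrt{d}\right)}{\sum_{l=1}^{L_K} \exp\left(q_i k_l^\top / \sqrt{d}\right)}, \qquad u(k_j) = \frac{1}{L_K}

KL(u \,\|\, p) = \ln \sum_{l=1}^{L_K} e^{q_i k_l^\top / \sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}} - \ln L_K

M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{q_i k_j^\top / \sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}
```

M(q_i, K) is the KL divergence with the constant ln L_K dropped: a LogSumExp of the scaled scores minus their arithmetic mean. A larger value means the query's attention is further from uniform.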

However, using the formula above directly has two problems: (1) it requires calculating every dot-product pair; (2) the LogSumExp operation has a potential numerical stability issue. An approximate calculation is therefore proposed.
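As a rough illustration of the approximation, the sketch below compares the exact measurement (LogSumExp minus mean) with a max-mean version that replaces the LogSumExp by the maximum score, which is how I read the paper's approximation. The function names are hypothetical, and the sketch scores the query against all keys, whereas the paper estimates the measurement from a random sample of keys.

```python
import math
import random

def scaled_scores(q, keys):
    # Scaled dot products q . k_j / sqrt(d) for one query against all keys.
    d = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]

def exact_measure(scores):
    # M = LogSumExp(scores) - mean(scores); the max is subtracted for stability.
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return lse - sum(scores) / len(scores)

def approx_measure(scores):
    # Max-mean approximation: the LogSumExp is replaced by the maximum score.
    return max(scores) - sum(scores) / len(scores)

rng = random.Random(0)
d, L_K = 8, 16
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L_K)]
q = [rng.gauss(0, 1) for _ in range(d)]

scores = scaled_scores(q, keys)
M = exact_measure(scores)
M_bar = approx_measure(scores)
```

Because the LogSumExp always lies between the maximum and the maximum plus ln L_K, the approximation never exceeds the exact value and is at most ln L_K below it, so ranking queries by either measure tends to agree.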

ProbSparse Self-attention

The masked version can be achieved by applying a positional mask in step 6 and using cumsum() in place of mean() in step 7. In practice, sum() can be used as a simpler implementation of mean().
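A simplified sketch of the non-masked case may help here. It assumes the usual reading of ProbSparse attention: only the u queries with the largest sparsity measurement attend to the keys, and every other query outputs the mean of V. The function name and the choice of u are hypothetical, the measurement is computed on all keys rather than on a sample, and the masked variant with cumsum() is not covered.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def probsparse_attention(Q, K, V, u):
    d = len(Q[0])
    scores = [[dot(q, k) / math.sqrt(d) for k in K] for q in Q]
    # Max-mean sparsity measurement per query (all keys used in this sketch).
    measure = [max(row) - sum(row) / len(row) for row in scores]
    order = sorted(range(len(Q)), key=lambda i: measure[i], reverse=True)
    top = set(order[:u])
    # Queries outside the top-u fall back to the mean of V.
    mean_v = [sum(col) / len(V) for col in zip(*V)]
    out = []
    for i, row in enumerate(scores):
        if i in top:
            w = softmax(row)
            out.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
        else:
            out.append(list(mean_v))
    return out, top

rng = random.Random(0)
L, d, u = 8, 4, 3
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
out, top = probsparse_attention(Q, K, V, u)
```

Note that this sketch still builds the full score matrix, so it shows the selection logic only, not the complexity saving.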

3.3 Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation

To enhance the robustness of the distilling operation, we build replicas of the main stack with halving inputs, and progressively decrease the number of self-attention distilling layers by dropping one layer at a time, like a pyramid in Fig. (2). Note: the encoder has two stacks, a main stack and a secondary stack (see Fig. 2); the input of the secondary stack is half the input of the main stack.
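The length bookkeeping behind the pyramid can be sketched as follows. This assumes that each distilling layer halves the sequence length and that a stack with n attention layers has n - 1 distilling layers between them. The pairwise max-pooling is a simplification of my own; the convolution and activation in the distilling step are left out, and the layer counts 3 and 2 and the length 96 are example values.

```python
def max_pool_stride2(seq):
    # Keep the larger value of each adjacent pair; this halves the length.
    return [max(seq[i], seq[i + 1]) for i in range(0, len(seq) - 1, 2)]

def stack_output_length(input_len, n_attention_layers):
    # One distilling layer sits between consecutive attention layers,
    # and each one halves the sequence length.
    length = input_len
    for _ in range(n_attention_layers - 1):
        length //= 2
    return length

L = 96
main_len = stack_output_length(L, 3)          # 96 -> 48 -> 24
replica_len = stack_output_length(L // 2, 2)  # 48 -> 24
pooled = max_pool_stride2(list(range(L)))
```

Dropping one layer while halving the input keeps the output lengths of the two stacks equal, which is what allows their outputs to be combined.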

Self-attention Distilling

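3.4 Decoder: Generating Long Sequential Outputs Through One Forward Procedure

The difference from the original Transformer lies mainly in prediction: the original decodes step by step, whereas Informer generates the whole output in one forward procedure. How is this achieved? The key is how the decoder input is constructed.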
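A minimal sketch of that input construction, assuming the scheme described in the paper: a start token taken from the end of the known input sequence is concatenated with a zero placeholder that covers the positions to be predicted. The function and parameter names are illustrative, and a univariate series is used to keep the example short.

```python
def build_decoder_input(history, label_len, pred_len):
    # Start token: the last label_len known values of the input sequence.
    start_token = history[-label_len:]
    # Placeholder: zeros for the pred_len target positions to be predicted.
    placeholder = [0.0] * pred_len
    return start_token + placeholder

history = [float(x) for x in range(1, 11)]  # known values 1.0 .. 10.0
decoder_input = build_decoder_input(history, label_len=4, pred_len=3)
```

Since every target position is already present in the input as a placeholder, the decoder can fill them all in a single forward pass instead of feeding each prediction back in.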

A fully connected layer acquires the final output, and its output size d_y depends on whether we are performing univariate or multivariate forecasting.

d_y = 1 for univariate forecasting; d_y > 1 for multivariate forecasting.

Time-Series Forecasting

Informer

2021-11-16