Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
paper: https://arxiv.org/abs/2012.07436
code (GitHub): https://github.com/zhouhaoyi/Informer2020
https://zhuanlan.zhihu.com/p/363084133
https://blog.csdn.net/fluentn/article/details/115392229
https://blog.csdn.net/weixin_42838061/article/details/117361871
Abstract
There are several severe issues with the Transformer that prevent it from being directly applicable to LSTF (long sequence time-series forecasting), including quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture.
Informer has three distinctive characteristics: 1. a ProbSparse self-attention mechanism; 2. self-attention distilling; 3. a generative-style decoder.
2 Preliminary
3 Methodology
3.1 The Uniform Input Representation
3.2 Efficient Self-attention Mechanism
Query Sparsity Measurement
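Previous work has shown that the self-attention probability distribution is potentially sparse: the self-attention scores form a long-tail distribution (see the figure above). The next step is to identify the queries that do not exhibit this sparsity, i.e. those whose attention distribution is close to uniform. The KL divergence is used to measure the distance between a query's attention distribution and the uniform distribution, which yields the i-th query's sparsity measurement.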
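The measurement itself did not survive in these notes. Reconstructed from the paper's definitions (p is the attention distribution of query q_i over the keys, q is the uniform distribution 1/L_K, d is the key dimension), it should read roughly as follows; treat it as a reconstruction rather than a verbatim copy.

```latex
% KL divergence between the uniform distribution q and the attention distribution p
KL(q \,\|\, p) = \ln \sum_{l=1}^{L_K} e^{\frac{q_i k_l^\top}{\sqrt{d}}}
               - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}
               - \ln L_K

% Dropping the constant \ln L_K gives the i-th query's sparsity measurement
M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\frac{q_i k_j^\top}{\sqrt{d}}}
          - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}
```

The first term is the LogSumExp of the query's scores over all keys and the second is their arithmetic mean. A larger M means the query's attention is further from uniform, so the query is more likely to hold the dominant dot-product pairs.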
However, using this measurement directly has two problems: 1. it requires calculating every dot-product pair; 2. the LogSumExp operation has a potential numerical stability issue. An approximate measurement is therefore proposed.
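To make the two measurements concrete, here is a minimal pure-Python sketch. It assumes the exact measurement is LogSumExp minus the mean of the scaled scores, and that the approximation replaces LogSumExp with the maximum (max minus mean), as in the paper. The function names and the toy keys and queries are made up for illustration. For clarity the sketch scores each query against all keys; the paper evaluates the approximation on a random sample of keys to avoid computing every dot product.

```python
import math

def scaled_scores(q, keys):
    """Scaled dot products q . k_j / sqrt(d) for one query against the given keys."""
    d = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]

def exact_measure(q, keys):
    """LogSumExp of the scores minus their mean (LogSumExp stabilised by subtracting the max)."""
    s = scaled_scores(q, keys)
    m = max(s)
    lse = m + math.log(sum(math.exp(x - m) for x in s))
    return lse - sum(s) / len(s)

def max_mean_measure(q, keys):
    """Approximation: max of the scores minus their mean, so no exponentials are needed."""
    s = scaled_scores(q, keys)
    return max(s) - sum(s) / len(s)

# Toy data (hypothetical): 4 keys of dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
lazy_query = [0.0, 0.0]     # attends uniformly to every key
active_query = [6.0, 0.0]   # strongly prefers the first key

for name, q in [("lazy", lazy_query), ("active", active_query)]:
    print(name, round(exact_measure(q, keys), 4), round(max_mean_measure(q, keys), 4))
```

Both measurements rank the active query above the lazy one, which is all the query selection needs: only the ordering of the queries matters, not the absolute values.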
ProbSparse Self-attention
The masked version can be achieved by applying a positional mask in step 6 and using cumsum() in place of mean() in step 7. In practice, sum() can be used as a simpler implementation of mean().
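The following is a rough single-head sketch of the idea in pure Python, not the official implementation. It assumes the procedure is: score every query with the max-mean measurement on a random sample of keys, keep the top u = c * ln(L_Q) queries, compute softmax attention only for those, and give the remaining queries mean(V) (or the cumulative sum of V in the masked case). The function name, the sampling factor default `c=1.0`, and the toy data are illustrative assumptions. As a simplification, the mean in the measurement is taken over the sampled keys.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probsparse_attention(Q, K, V, c=1.0, masked=False, seed=0):
    """Toy ProbSparse attention on lists of vectors; returns (outputs, selected query indices)."""
    rng = random.Random(seed)
    L_Q, L_K, d, dv = len(Q), len(K), len(Q[0]), len(V[0])
    if masked and L_Q != L_K:
        raise ValueError("the masked variant assumes self-attention with L_Q == L_K")
    scale = 1.0 / math.sqrt(d)
    n_sample = min(L_K, max(1, math.ceil(c * math.log(L_K))))  # keys sampled for scoring
    u = min(L_Q, max(1, math.ceil(c * math.log(L_Q))))          # queries that get full attention

    # Max-mean sparsity measurement, evaluated on a random subset of keys.
    sample_idx = rng.sample(range(L_K), n_sample)
    measure = []
    for q in Q:
        s = [dot(q, K[j]) * scale for j in sample_idx]
        measure.append(max(s) - sum(s) / len(s))
    top = sorted(range(L_Q), key=lambda i: measure[i], reverse=True)[:u]

    # Default output for the non-selected ("lazy") queries.
    if masked:
        # Cumulative sum of V: position i only aggregates values 0..i.
        out, running = [], [0.0] * dv
        for v in V:
            running = [r + x for r, x in zip(running, v)]
            out.append(list(running))
    else:
        mean_v = [sum(v[t] for v in V) / L_K for t in range(dv)]
        out = [list(mean_v) for _ in range(L_Q)]

    # Ordinary softmax attention, but only for the selected queries.
    for i in top:
        scores = [dot(Q[i], k) * scale for k in K]
        if masked:
            scores = scores[: i + 1]  # positional mask: drop future keys
        w = softmax(scores)
        out[i] = [sum(w[j] * V[j][t] for j in range(len(w))) for t in range(dv)]
    return out, sorted(top)

# Toy usage with random data (hypothetical sizes).
data_rng = random.Random(1)
L, d = 8, 4
Q = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
V = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
out, top = probsparse_attention(Q, K, V)
print("queries given full attention:", top)
```

Only u of the L_Q queries pay for a full pass over the keys, which is where the reduction from quadratic cost comes from. The masked branch mirrors the note above: a positional mask on the selected queries and a running sum of V for the rest.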
3.3 Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation
To enhance the robustness of the distilling operation, replicas of the main stack are built with halved inputs, and the number of self-attention distilling layers is progressively decreased by dropping one layer at a time, like a pyramid (Fig. 2). Note: the encoder shown in Fig. 2 has two stacks, a main stack and a secondary stack; the input of the secondary stack is half of the main stack's input.
Self-attention Distilling
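The distilling formula is missing from these notes. In the paper, each distilling step applies a width-3 1-D convolution over the time dimension, an ELU activation, and max-pooling with stride 2, which halves the sequence length. The sketch below is a single-channel simplification of that step in pure Python. The kernel weights, the pooling window of 3 with padding 1, the circular padding, and the choice of the later half of the input for the secondary stack are assumptions made for illustration.

```python
import math

def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs, smooth saturation for negative ones."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def conv1d_circular(x, kernel):
    """Width-3 convolution over time with circular padding, so the length is preserved."""
    n = len(x)
    return [kernel[0] * x[(t - 1) % n] + kernel[1] * x[t] + kernel[2] * x[(t + 1) % n]
            for t in range(n)]

def max_pool_stride2(x, size=3, pad=1):
    """Max-pooling with stride 2; the padding is -inf so it never wins the max."""
    padded = [float("-inf")] * pad + list(x) + [float("-inf")] * pad
    return [max(padded[s:s + size]) for s in range(0, len(padded) - size + 1, 2)]

def distill(x, kernel=(0.25, 0.5, 0.25)):
    """One distilling step: Conv1d -> ELU -> MaxPool (stride 2). Kernel weights are hypothetical."""
    return max_pool_stride2([elu(v) for v in conv1d_circular(x, kernel)])

# Toy single-channel sequence of length 96.
x = [math.sin(0.3 * t) for t in range(96)]

main = distill(distill(x))            # main stack: two distilling steps, 96 -> 48 -> 24
replica = distill(x[len(x) // 2:])    # secondary stack on half the input: one step, 48 -> 24
print(len(main), len(replica))
```

Because the secondary stack starts from half the input and has one fewer distilling step, both stacks end at the same length, so their outputs can be concatenated into the encoder's final feature map.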
3.4 Decoder: Generating Long Sequential Outputs Through One Forward Procedure
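Difference from the original Transformer: the main change is in prediction. The original Transformer decodes step by step, whereas Informer produces the whole output in one forward procedure. How is this achieved? The key is how the decoder input is constructed.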
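As described in the paper, the decoder input is the concatenation of a start token (a slice of the known sequence just before the prediction window) and a zero placeholder for the target positions. A minimal sketch under that assumption follows. The function name is hypothetical; `label_len` and `pred_len` are illustrative names for the start-token length and the prediction length.

```python
def build_decoder_input(history, label_len, pred_len):
    """Concatenate the last `label_len` known steps with `pred_len` all-zero placeholder steps."""
    if label_len > len(history):
        raise ValueError("label_len cannot exceed the length of the known sequence")
    d = len(history[0])
    # Start token: the most recent label_len steps of the known sequence.
    start_token = [list(row) for row in history[len(history) - label_len:]]
    # Placeholder: zeros standing in for the values to be predicted.
    placeholder = [[0.0] * d for _ in range(pred_len)]
    return start_token + placeholder

# Toy usage: 96 known steps with 7 features, 48-step start token, 24-step forecast.
history = [[float(t + f) for f in range(7)] for t in range(96)]
dec_in = build_decoder_input(history, label_len=48, pred_len=24)
print(len(dec_in), len(dec_in[0]))
```

Since all target positions are present in the input at once, a single decoder pass can fill in every placeholder, rather than feeding each prediction back in to produce the next one.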
A fully connected layer produces the final output, and its output size d_y depends on whether the forecast is univariate or multivariate:
d_y = 1 for univariate forecasting, d_y > 1 for multivariate forecasting.