token embedding

https://www.cnblogs.com/d0main/p/10447853.html

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

paper: https://arxiv.org/abs/2012.07436

code (GitHub): https://github.com/zhouhaoyi/Informer2020

https://zhuanlan.zhihu.com/p/363084133

https://blog.csdn.net/fluentn/article/details/115392229

https://blog.csdn.net/weixin_42838061/article/details/117361871

Abstract

Several severe issues prevent the Transformer from being directly applicable to LSTF (long sequence time-series forecasting): quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture.

Informer has three distinctive characteristics: 1. the ProbSparse self-attention mechanism 2. self-attention distilling 3. the generative-style decoder

2 Preliminary

3 Methodology

3.1 The Uniform Input Representation

3.2 Efficient Self-attention Mechanism

Query Sparsity Measurement

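Some previous attempts have revealed that the distribution of self-attention probability has potential sparsity. The “sparse” self-attention scores form a long-tail distribution (see the figure above). The next step is to identify the queries that do not exhibit this sparsity, that is, those whose attention distribution is close to uniform. The KL divergence measures the distance between a query's attention distribution and the uniform distribution, which gives the i-th query’s sparsity measurement.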
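For reference, the measurement described here can be written out as follows. This is a transcription from the paper as I recall it, with q_i the i-th query, k_j the j-th key, L_K the number of keys and d the key dimension; the constant ln L_K from the KL divergence is dropped.

```latex
M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\frac{q_i k_j^{\top}}{\sqrt{d}}}
          - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}
```

The first term is the Log-Sum-Exp of the query over all keys and the second is their arithmetic mean. A larger M(q_i, K) means the query's attention distribution is further from uniform.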

However, using the formula above directly has several problems: 1. it requires calculating every dot-product pair 2. the LogSumExp operation has a potential numerical stability issue. An approximate calculation is therefore proposed.
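A minimal sketch of the two measurements in plain Python, assuming the exact form is LogSumExp minus mean and the approximation is the paper's max-mean measurement. The function names are my own and the code is for illustration only, not taken from the official repository.

```python
import math

def scaled_scores(q, K):
    """Scaled dot products q·k_j / sqrt(d) of one query against every key."""
    d = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]

def sparsity_exact(q, K):
    """LogSumExp minus arithmetic mean (max is subtracted for stability)."""
    s = scaled_scores(q, K)
    m = max(s)
    lse = m + math.log(sum(math.exp(x - m) for x in s))
    return lse - sum(s) / len(s)

def sparsity_max_mean(q, K):
    """Approximation: replace the LogSumExp term with the maximum score."""
    s = scaled_scores(q, K)
    return max(s) - sum(s) / len(s)

K = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
q_flat = [0.0, 0.0]    # attends uniformly to all keys
q_peaked = [5.0, 0.0]  # strongly prefers the first key

print(sparsity_max_mean(q_flat, K), sparsity_max_mean(q_peaked, K))
```

The uniform query scores 0 under the max-mean measurement, while the peaked query scores much higher, so ranking queries by this value separates "active" queries from "lazy" ones without evaluating an exponential.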

ProbSparse Self-attention

The masked version can be achieved by applying a positional mask on step 6 and using cumsum() in mean() of step 7. In practice, we can use sum() as a simpler implementation of mean().
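A rough, unmasked sketch of the idea in plain Python: only the top-u queries under the max-mean measurement get real softmax attention, and every other query falls back to mean(V). For clarity this sketch computes all dot products, whereas the paper samples a subset of keys to estimate the measurement and sets u proportional to ln L_Q. The function and variable names are my own, not the official implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probsparse_attention(Q, K, V, u):
    """Q: L_Q x d, K: L_K x d, V: L_K x d_v (lists of lists); u: active queries."""
    scale = math.sqrt(len(Q[0]))
    # Full score matrix (simplification: the paper samples keys here).
    scores = [[dot(q, k) / scale for k in K] for q in Q]
    # Max-mean sparsity measurement for every query.
    m_bar = [max(row) - sum(row) / len(row) for row in scores]
    # Indices of the u queries with the largest measurement.
    top = sorted(range(len(Q)), key=lambda i: m_bar[i], reverse=True)[:u]
    # "Lazy" queries default to the mean of V.
    d_v = len(V[0])
    mean_v = [sum(v[c] for v in V) / len(V) for c in range(d_v)]
    out = [list(mean_v) for _ in Q]
    # Active queries get ordinary scaled dot-product attention.
    for i in top:
        w = softmax(scores[i])
        out[i] = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
    return out, top

Q = [[0.0, 0.0], [5.0, 0.0]]
K = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [2.0], [3.0]]
out, top = probsparse_attention(Q, K, V, u=1)
print(top, out)
```

Here the second query is selected as active, and the first query simply receives mean(V).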

3.3 Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation

To enhance the robustness of the distilling operation, we build replicas of the main stack with halving inputs, and progressively decrease the number of self-attention distilling layers by dropping one layer at a time, like a pyramid in Fig. (2). Note: the encoder has two stacks, a main stack and a secondary stack (see Fig. 2); the secondary stack's input is half of the main stack's.
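A small sketch of why the pyramid lines up, assuming each distilling step is a max-pool with stride 2 that halves the sequence length. The kernel size 3 and padding 1 are my assumption, and the convolution and ELU are omitted because they do not change the length. All names are illustrative.

```python
def max_pool_1d(seq, kernel=3, stride=2, padding=1):
    """Max-pool a 1-D list; stride 2 roughly halves its length."""
    pad = [float("-inf")] * padding
    padded = pad + list(seq) + pad
    return [max(padded[s:s + kernel])
            for s in range(0, len(padded) - kernel + 1, stride)]

def stack_output_length(input_len, n_attention_layers):
    """A stack with n attention layers has n - 1 distilling layers between them."""
    seq = [0.0] * input_len
    for _ in range(n_attention_layers - 1):
        seq = max_pool_1d(seq)
    return len(seq)

main_len = stack_output_length(96, 3)       # full input, 2 distilling layers
secondary_len = stack_output_length(48, 2)  # half input, 1 distilling layer
print(main_len, secondary_len)              # 24 24
```

Because the secondary stack starts from half the input and drops one distilling layer, both stacks end at the same length, so their outputs can be concatenated.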

Self-attention Distilling
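As I recall it from the paper, the distilling step from layer j to layer j+1 is as below, where [·]_AB denotes the attention block, Conv1d is a 1-D convolution over the time dimension and MaxPool uses stride 2:

```latex
X_{j+1}^{t} = \mathrm{MaxPool}\left( \mathrm{ELU}\left( \mathrm{Conv1d}\left( [X_{j}^{t}]_{AB} \right) \right) \right)
```

Each distilling layer therefore halves the sequence length passed to the next attention block.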

3.4 Decoder: Generating Long Sequential Outputs Through One Forward Procedure

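Difference from the original Transformer: it lies mainly in prediction. The original predicts step by step, whereas Informer predicts in one forward procedure. How is this achieved? The key is how the decoder's input is constructed.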
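A minimal sketch of that input construction, assuming the decoder input is the concatenation of a start token (the last known steps of the input sequence) and zero placeholders for the steps to be predicted. The names build_decoder_input, label_len and pred_len are chosen for illustration only.

```python
def build_decoder_input(history, label_len, pred_len):
    """Concat(X_token, X_0): known start token followed by zero placeholders.

    history: list of time steps, each a list of d feature values.
    label_len: number of trailing known steps used as the start token.
    pred_len: number of future steps to predict in one forward pass.
    """
    if not 0 < label_len <= len(history):
        raise ValueError("label_len must be between 1 and len(history)")
    d = len(history[0])
    start_token = [list(step) for step in history[-label_len:]]
    placeholder = [[0.0] * d for _ in range(pred_len)]
    return start_token + placeholder

history = [[1.0], [2.0], [3.0], [4.0], [5.0]]
dec_in = build_decoder_input(history, label_len=2, pred_len=3)
print(dec_in)  # [[4.0], [5.0], [0.0], [0.0], [0.0]]
```

The decoder fills in all placeholder positions in a single forward pass, instead of feeding each predicted step back in as the next input.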

A fully connected layer acquires the final output, and its output size d_y depends on whether we are performing univariate or multivariate forecasting.

d_y = 1 for univariate forecasting, d_y > 1 for multivariate forecasting

