Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

paper: https://arxiv.org/abs/2012.07436

code (GitHub): https://github.com/zhouhaoyi/Informer2020

https://zhuanlan.zhihu.com/p/363084133

https://blog.csdn.net/fluentn/article/details/115392229

https://blog.csdn.net/weixin_42838061/article/details/117361871

Abstract

There are several severe issues with the Transformer that prevent it from being directly applied to LSTF (long sequence time-series forecasting), including quadratic time complexity, high memory usage, and the inherent limitations of the encoder-decoder architecture.

Informer has three distinctive characteristics: 1. the ProbSparse self-attention mechanism; 2. self-attention distilling; 3. the generative-style decoder.

2 Preliminary

3 Methodology

3.1 The Uniform Input Representation

3.2 Efficient Self-attention Mechanism

Query Sparsity Measurement

Some previous attempts have revealed that the distribution of self-attention probabilities is potentially sparse: the self-attention scores form a long-tail distribution (see the figure above). The next step is to pick out the queries that do not conform to this sparsity, i.e., whose attention distribution is close to uniform. The KL divergence between a query's attention distribution and the uniform distribution is used as the distance, which yields the i-th query's sparsity measurement $M(\textbf{q}_i, \textbf{K}) = \ln \sum_{j=1}^{L_K} e^{\textbf{q}_i \textbf{k}_j^{\top}/\sqrt{d}} - \frac{1}{L_K}\sum_{j=1}^{L_K} \textbf{q}_i \textbf{k}_j^{\top}/\sqrt{d}$.

However, directly using the expression above has two problems: 1. it requires calculating every dot-product pair; 2. the LogSumExp operation has a potential numerical stability issue. The paper therefore proposes the approximate (max-mean) measurement $\bar{M}(\textbf{q}_i, \textbf{K}) = \max_j\{\textbf{q}_i \textbf{k}_j^{\top}/\sqrt{d}\} - \frac{1}{L_K}\sum_{j=1}^{L_K} \textbf{q}_i \textbf{k}_j^{\top}/\sqrt{d}$, computed on a randomly sampled subset of dot-product pairs.
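
A minimal PyTorch sketch of this max-mean measurement (the function name and tensor shapes are my own; the official code additionally subsamples the keys rather than scoring every pair, which this sketch does not do):

```python
import torch

def query_sparsity(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Q: (B, L_Q, d), K: (B, L_K, d) -> per-query sparsity scores of shape (B, L_Q)."""
    d = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d ** 0.5   # (B, L_Q, L_K)
    # max - mean replaces the LogSumExp term, avoiding its numerical issues
    return scores.max(dim=-1).values - scores.mean(dim=-1)
```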

ProbSparse Self-attention

The masked version can be achieved by applying a positional mask in step 6 and using cumsum() in place of mean() in step 7. In practice, sum() can be used as a simpler implementation of mean().
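
A simplified single-head sketch of ProbSparse self-attention under the assumptions L_Q == L_K and no key subsampling (so it is illustrative, not the authors' implementation): only the top-u queries by the max-mean measurement receive full attention, while the remaining "lazy" queries get mean(V), or the running cumsum(V) average in the masked case described above.

```python
import math
import torch

def probsparse_attention(Q, K, V, factor=5, masked=False):
    """Q, K, V: (B, L, d). Returns the attention context of shape (B, L, d)."""
    B, L, d = Q.shape
    raw = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d)      # (B, L, L)

    # max-mean sparsity measurement (computed on unmasked scores for simplicity)
    M = raw.max(dim=-1).values - raw.mean(dim=-1)                  # (B, L)
    u = min(L, factor * max(1, math.ceil(math.log(L))))
    top = M.topk(u, dim=-1).indices                                # active queries, (B, u)

    # default ("lazy") outputs: mean of V, or its causal running mean when masked
    if masked:
        denom = torch.arange(1, L + 1, device=V.device, dtype=V.dtype).view(1, L, 1)
        context = V.cumsum(dim=1) / denom
    else:
        context = V.mean(dim=1, keepdim=True).expand(B, L, d).clone()

    scores = raw
    if masked:  # causal mask for the queries that attend fully
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=Q.device), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))

    b = torch.arange(B, device=Q.device).unsqueeze(-1)             # (B, 1)
    attn = torch.softmax(scores[b, top], dim=-1)                   # (B, u, L)
    context[b, top] = torch.matmul(attn, V)                        # overwrite active rows
    return context
```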

3.3 Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation

To enhance the robustness of the distilling operation, replicas of the main stack are built with halved inputs, and the number of self-attention distilling layers is progressively decreased by dropping one layer at a time, like a pyramid, as in Fig. 2. Note: the encoder contains two stacks, a main stack and a secondary stack; as shown in Fig. 2, the secondary stack's input is half the length of the main stack's.

Self-attention Distilling
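
The distilling step between two attention blocks applies $\mathrm{MaxPool}(\mathrm{ELU}(\mathrm{Conv1d}(\cdot)))$ with stride-2 pooling, halving the temporal dimension before the next block. A minimal sketch (the class name and the BatchNorm placement are my assumptions, not taken verbatim from the repo):

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Halves the sequence length between encoder attention blocks:
    x_{j+1} = MaxPool(ELU(Conv1d(x_j)))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, d_model) -> (B, ~L/2, d_model); Conv1d expects (B, C, L)
        x = x.transpose(1, 2)
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)
```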

3.4 Decoder: Generating Long Sequential Outputs Through One Forward Procedure

The difference from the original Transformer lies mainly in prediction: the original decodes step by step, whereas here all outputs are produced in one forward procedure. How is this achieved? The key is how the decoder's input is constructed.
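
Concretely, the decoder input concatenates a start token (the last label_len observed steps) with zero placeholders for the pred_len target steps, so that all predictions come out of a single forward pass. A small sketch of that construction (function and argument names are my own, chosen to mirror common usage):

```python
import torch

def build_decoder_input(history: torch.Tensor, label_len: int, pred_len: int) -> torch.Tensor:
    """history: (B, L, d) of known values. Returns (B, label_len + pred_len, d):
    the last label_len observed steps followed by zeros standing in for the
    pred_len steps to be generated in one forward pass."""
    start_token = history[:, -label_len:, :]
    placeholder = torch.zeros(history.size(0), pred_len, history.size(-1),
                              dtype=history.dtype, device=history.device)
    return torch.cat([start_token, placeholder], dim=1)
```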

A fully connected layer produces the final output, and its output size $d_y$ depends on whether univariate or multivariate forecasting is performed:

$d_y = 1$ for univariate forecasting; $d_y > 1$ for multivariate forecasting.

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

paper: https://arxiv.org/abs/1907.00235

The authors use the Transformer for time-series forecasting, identify two problems, and propose improvements. The first problem is locality-agnosticism: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context. The second is the memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length L, making it infeasible to model long time series directly. To address these two problems, the authors propose convolutional self-attention and the LogSparse Transformer.

3. Background

Problem Definition

The goal is to model the conditional distribution of future values given the past observations and covariates, where $\Phi$ denotes the model parameters, $\textbf{X}$ the auxiliary inputs (inputs other than the observed values), and $\textbf{Z}_{i,t}$ the value of series $i$ at time $t$.

To simplify the expression, they define $\textbf{y}_t = [\textbf{z}_{t-1} \circ \textbf{x}_t]$ (the previous observation concatenated with the current covariates) and $\textbf{Y}_t = [\textbf{y}_1, \dots, \textbf{y}_t]^{\top}$.

The objective then becomes $\textbf{z}_t \sim f(\textbf{Y}_t)$.
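
Reconstructed from the notation above (my reading of the paper's setup, not a verbatim quote), the forecasting objective factorizes as:

$$ p\bigl(\textbf{z}_{i,t_0+1:t_0+\tau} \mid \textbf{z}_{i,1:t_0}, \textbf{x}_{i,1:t_0+\tau}; \Phi\bigr) = \prod_{t=t_0+1}^{t_0+\tau} p\bigl(\textbf{z}_{i,t} \mid \textbf{z}_{i,1:t-1}, \textbf{x}_{i,1:t}; \Phi\bigr), \qquad \textbf{z}_t \sim f(\textbf{Y}_t) $$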

Transformer

Here $h$ indexes an attention head and $M$ denotes the mask matrix.
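
For reference, a generic sketch of one masked attention head in the canonical Transformer matching that notation (this is the standard formulation, not code from the paper):

```python
import math
import torch

def masked_attention_head(Q_h, K_h, V_h, M):
    """One head: softmax(Q_h K_h^T / sqrt(d_k)) V_h, with the mask matrix M
    (True = position may be attended) blocking future positions."""
    d_k = Q_h.size(-1)
    scores = torch.matmul(Q_h, K_h.transpose(-2, -1)) / math.sqrt(d_k)  # (B, L, L)
    scores = scores.masked_fill(~M, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), V_h)

# causal mask example: step t attends only to steps <= t
L = 8
M = torch.tril(torch.ones(L, L, dtype=torch.bool))
```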

4. Methodology

4.1 Enhancing the locality of Transformer

The idea of the improvement is shown in the figure above. The original Transformer measures point-wise similarity between time steps, so a single anomalous point can bias the attention. The improvement makes the similarity local-context based: a convolution first produces a local representation, and the Q-K similarity is then computed on top of it. When the convolution kernel size is 1, this degenerates to the original Transformer.
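
A minimal sketch of the causal-convolution query/key generation (module and parameter names are mine; the kernel size k is the paper's hyperparameter, values are still produced point-wise as in the canonical design, and k = 1 recovers ordinary point-wise projections):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvQK(nn.Module):
    """Produce queries and keys with a causal Conv1d of kernel size k so each
    query/key embeds local context instead of a single time step."""
    def __init__(self, d_model: int, k: int = 3):
        super().__init__()
        self.k = k
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size=k)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size=k)

    def forward(self, x: torch.Tensor):
        # x: (B, L, d_model); left-pad by k-1 so the convolution never sees the future
        xp = F.pad(x.transpose(1, 2), (self.k - 1, 0))
        Q = self.q_conv(xp).transpose(1, 2)   # (B, L, d_model)
        K = self.k_conv(xp).transpose(1, 2)
        return Q, K
```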

4.2 Breaking the memory bottleneck of Transformer

The original Transformer requires $O(L^2)$ space: each cell in a layer costs $O(L)$, all cells in a layer cost $O(L^2)$, and stacking $h$ layers with $h$ a constant keeps the total at $O(L^2)$, as in figure (a). The authors propose the LogSparse Transformer, shown in figure (b), with space complexity $O(L(\log L)^2)$: each cell in a layer attends to only $\log L$ cells, all cells in a layer cost $L\log L$, and stacking $\log L$ layers (needed so that every cell can still receive information from every other cell) gives $O(L(\log L)^2)$.

For the LogSparse Transformer, the selection rule (which cells each position attends to) is:
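
The rule itself was an equation/figure in the original post; my reading of it in code form: position t attends to itself and to positions at exponentially increasing distances into the past, i.e. only O(log t) cells per layer:

```python
def logsparse_indices(t: int) -> list:
    """Cells that position t (1-indexed) attends to under LogSparse attention:
    itself plus t - 2^0, t - 2^1, ..., while the index stays >= 1."""
    idx = {t}
    step = 1
    while t - step >= 1:
        idx.add(t - step)
        step *= 2
    return sorted(idx)

# e.g. logsparse_indices(16) -> [8, 12, 14, 15, 16]
```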

Figures (c) and (d) are improvements over (b).

5. Experiments

Evaluation metric: the p-quantile loss $R_p$ with $p\in(0,1)$.
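
A sketch of the p-quantile loss in the form I believe the paper uses (the standard normalized pinball loss; treat the exact normalization as an assumption):

```python
import torch

def quantile_loss(z: torch.Tensor, z_hat: torch.Tensor, p: float) -> torch.Tensor:
    """R_p = 2 * sum(P_p(z, z_hat)) / sum(|z|), where
    P_p(z, z_hat) = p * (z - z_hat) if z > z_hat else (1 - p) * (z_hat - z)."""
    diff = z - z_hat
    pinball = torch.where(diff > 0, p * diff, (p - 1) * diff)
    return 2 * pinball.sum() / z.abs().sum()
```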

References

https://zhuanlan.zhihu.com/p/412800154

