Transformer-XL Attentive Language Models Beyond a Fixed-Length Context

https://arxiv.org/abs/1901.02860

Transformers have the potential to learn longer-term dependencies, but in the language-modeling setting they are limited to a fixed-length context (memory and computation constraints keep that length from being very large). The paper proposes a novel neural architecture, Transformer-XL, that enables learning dependencies beyond a fixed length without disrupting temporal coherence.

3 Model

3.1 Vanilla Transformer Language Models

Problem: in order to apply the Transformer or self-attention to language modeling, the central question is how to train a Transformer to effectively encode an arbitrarily long context into a fixed-size representation. The usual approach is the vanilla model: the long text is split into fixed-length segments and each segment is processed independently, as shown in the figure above.

During training, a fixed-length context has two critical limitations. First, the largest possible dependency length is upper-bounded by the segment length. Second, simply chunking a sequence into fixed-length segments leads to the context fragmentation problem.

During evaluation, the vanilla model consumes a segment of the same length as in training at each step, but makes a prediction only at the last position and then slides the window one position to the right. As shown in Fig. 1b, this procedure ensures that each prediction utilizes the longest possible context exposed during training, and it also relieves the context fragmentation issue encountered in training. However, this evaluation procedure is extremely expensive.

3.2 Segment-Level Recurrence with State Reuse

The paper introduces a recurrence mechanism into the Transformer architecture.

Notation: let two consecutive length-$L$ segments be $s_\tau$ and $s_{\tau+1}$, and let $h_\tau^{n} \in \mathbb{R}^{L \times d}$ denote the $n$-th layer hidden state sequence produced for segment $s_\tau$.

SG(·) stands for stop-gradient, and $[h_u \circ h_v]$ denotes concatenation of the two hidden sequences along the length dimension.
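The recurrence, as written in Section 3.2 of the paper: the cached states of the previous segment extend the context for the keys and values, while the queries come only from the current segment.

$$\tilde{h}_{\tau+1}^{n-1} = \left[\mathrm{SG}\!\left(h_{\tau}^{n-1}\right) \circ h_{\tau+1}^{n-1}\right]$$

$$q_{\tau+1}^{n},\ k_{\tau+1}^{n},\ v_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_{q}^{\top},\ \tilde{h}_{\tau+1}^{n-1} W_{k}^{\top},\ \tilde{h}_{\tau+1}^{n-1} W_{v}^{\top}$$

$$h_{\tau+1}^{n} = \text{Transformer-Layer}\!\left(q_{\tau+1}^{n},\, k_{\tau+1}^{n},\, v_{\tau+1}^{n}\right)$$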

The detailed process is illustrated in the figure below.

During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment, as shown in Fig. 2a.

During evaluation, the representations from the previous segments can be reused instead of being recomputed from scratch as in the vanilla model, which makes evaluation much faster.
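A minimal PyTorch sketch of the state-reuse idea (hypothetical helper functions, not the authors' code): the previous segment's hidden states are detached (stop-gradient) and concatenated in front of the current segment along the length dimension before computing keys and values.

```python
import torch

def extend_context(prev_mem, cur_hidden):
    """Keys/values attend over [SG(memory) ∘ current]; queries use only the current segment."""
    return torch.cat([prev_mem.detach(), cur_hidden], dim=1)   # [batch, mem_len + seg_len, d]

def update_memory(prev_mem, cur_hidden, mem_len):
    """Cache the most recent `mem_len` hidden states for the next segment, outside the graph."""
    with torch.no_grad():
        cat = torch.cat([prev_mem, cur_hidden], dim=1)
        return cat[:, -mem_len:].detach()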

3.3 Relative Positional Encodings

How can we keep the positional information coherent when we reuse the states? If we simply kept the original absolute positional encoding, we would get

$$h_{\tau+1} = f\!\left(h_\tau,\ E_{s_{\tau+1}} + U_{1:L}\right), \qquad h_\tau = f\!\left(h_{\tau-1},\ E_{s_\tau} + U_{1:L}\right),$$

where $E_{s_\tau}$ is the word embedding sequence of segment $s_\tau$ and $U_{1:L}$ the absolute positional encodings.

The problem with this scheme: both segments are associated with the same positional encoding $U_{1:L}$, so the model has no way to distinguish the position of a token in $s_\tau$ from the token at the same offset in $s_{\tau+1}$.

To solve this problem, the paper introduces relative positional encodings: instead of injecting absolute positions into the input, the relative distance $i-j$ between a query position $i$ and a key position $j$ is injected directly into the attention score.

In the standard Transformer with absolute positional encodings, the attention score between query position $i$ and key position $j$ expands into four terms (reproduced below).

The paper instead proposes a relative reparameterization: the absolute position embedding $U_j$ is replaced by a relative sinusoidal encoding $R_{i-j}$, the query-side position terms are replaced by two learnable vectors $u$ and $v$, and the key projection is split into a content matrix $W_{k,E}$ and a position matrix $W_{k,R}$.
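Side by side, as given in the paper:

$$A_{i,j}^{\mathrm{abs}} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_k U_j}_{(b)} + \underbrace{U_i^{\top} W_q^{\top} W_k E_{x_j}}_{(c)} + \underbrace{U_i^{\top} W_q^{\top} W_k U_j}_{(d)}$$

$$A_{i,j}^{\mathrm{rel}} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^{\top} W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^{\top} W_{k,R} R_{i-j}}_{(d)}$$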

3.4 Complete Computation Procedure
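For reference, the per-layer computation summarized in the paper (single attention head, layers $n = 1, \dots, N$, with $h_\tau^{0} := E_{s_\tau}$ the word embedding sequence and $m_\tau^{n-1}$ the cached memory from the previous segment):

$$\tilde{h}_\tau^{n-1} = \left[\mathrm{SG}(m_\tau^{n-1}) \circ h_\tau^{n-1}\right]$$

$$q_\tau^{n},\ k_\tau^{n},\ v_\tau^{n} = h_\tau^{n-1} {W_q^{n}}^{\top},\ \tilde{h}_\tau^{n-1} {W_{k,E}^{n}}^{\top},\ \tilde{h}_\tau^{n-1} {W_v^{n}}^{\top}$$

$$A_{\tau,i,j}^{n} = {q_{\tau,i}^{n}}^{\top} k_{\tau,j}^{n} + {q_{\tau,i}^{n}}^{\top} W_{k,R}^{n} R_{i-j} + u^{\top} k_{\tau,j}^{n} + v^{\top} W_{k,R}^{n} R_{i-j}$$

$$a_\tau^{n} = \text{Masked-Softmax}(A_\tau^{n})\, v_\tau^{n}$$

$$o_\tau^{n} = \text{LayerNorm}\!\left(\text{Linear}(a_\tau^{n}) + h_\tau^{n-1}\right)$$

$$h_\tau^{n} = \text{Positionwise-Feed-Forward}(o_\tau^{n})$$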

Summary of attention mechanisms

https://easyai.tech/ai-definition/attention/#type

1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Proposes two attention modes: hard attention and soft attention.

2. Effective Approaches to Attention-based Neural Machine Translation

Proposes two refined attention variants: global attention and local attention.

3. Attention Is All You Need

Proposes self-attention as the sole building block (the Transformer).

4. Hierarchical Attention Networks for Document Classification

Proposes hierarchical attention for document classification.

5. Attention-over-Attention Neural Networks for Reading Comprehension

Proposes the attention-over-attention mechanism.

6. Convolutional Sequence to Sequence Learning

The paper also adopts multi-step attention.

Transformer (Attention Is All You Need)

1. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

2. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

1 Positional Encoding

in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. There are many choices of positional encodings, learned and fixed [9]. In this work, we use sine and cosine functions of different frequencies:
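The sinusoids used in the paper, where $pos$ is the position and $i$ indexes the dimension:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$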

For more details, see https://wmathor.com/index.php/archives/1453/

2 Attention

In the figure, different colors correspond to different heads, and the color intensity indicates how strongly the words are related.

Different heads capture different kinds of relations; a single head captures, for one such relation, the degree of association between each pair of tokens.

1 Scaled Dot-Product Attention

$d_{k}$: the dimension of the keys

Why scale? We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients; dividing by $\sqrt{d_k}$ counteracts this effect.
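The full scaled dot-product attention, as defined in the paper:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$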

Mask

Masks fall into two categories, the attention mask and the padding mask, explained below.

1. Attention Mask

ensures that the predictions for position i can depend only on the known outputs at positions less than i.

The mask is applied before the softmax: the upper-triangular part of the score matrix (future positions) is masked out, i.e. set to $-\infty$.

2. Padding Mask

The information at padding positions is meaningless, so those positions must be masked out as well.

The padding-mask process is illustrated in the figure below; a code sketch covering both mask types follows.
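A minimal PyTorch sketch showing where both masks enter the computation; the function and argument names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=True, key_padding_mask=None):
    """q, k, v: [batch, len, d_k]; key_padding_mask: [batch, len_k], True at PAD positions."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5                  # [batch, len_q, len_k]
    if causal:
        # Attention mask: forbid attending to future positions (upper triangle).
        len_q, len_k = scores.shape[-2], scores.shape[-1]
        future = torch.triu(torch.ones(len_q, len_k, device=scores.device), diagonal=1).bool()
        scores = scores.masked_fill(future, float('-inf'))
    if key_padding_mask is not None:
        # Padding mask: forbid attending to PAD tokens among the keys.
        scores = scores.masked_fill(key_padding_mask[:, None, :], float('-inf'))
    attn = F.softmax(scores, dim=-1)
    return attn @ v, attn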

2 Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
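A compact sketch of the head-splitting logic (illustrative class, assuming `d_model` is divisible by `n_heads`; masking omitted for brevity):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project to h heads, run scaled dot-product attention per head, concatenate, project back."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head, self.n_heads = d_model // n_heads, n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        b = query.size(0)
        def split(x):  # [b, len, d_model] -> [b, heads, len, d_head]
            return x.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(query)), split(self.w_k(key)), split(self.w_v(value))
        attn = (q @ k.transpose(-2, -1) / self.d_head ** 0.5).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_head)
        return self.w_o(out)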

3 Applications of Attention in our Model

1. Encoder-decoder attention layers

Structure: the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

Purpose: this allows every position in the decoder to attend over all positions in the input sequence.

2. Self-attention layers in the encoder

Structure: the keys, values, and queries all come from the same place, namely the output of the previous encoder layer.

Purpose: each position in the encoder can attend to all positions in the previous layer of the encoder.

3. Self-attention layers in the decoder

Structure: the keys, values, and queries come from the same place, namely the output of the previous decoder layer.

Purpose: allow each position in the decoder to attend to all positions in the decoder up to and including that position.

3 Encoder and Decoder Stacks

1 encoder

1) Input embedding and positional encoding

2) Multi-head attention

3) Residual connection and layer normalization

4) Feed-forward network

5) Residual connection and layer normalization

Here $X_{\text{hidden}} \in \mathbb{R}^{\text{batch\_size} \times \text{seq\_len} \times \text{embed\_dim}}$ denotes the hidden representation flowing through these steps. A short sketch of one encoder block follows.
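A minimal sketch of one encoder block covering steps 2) to 5), using PyTorch's built-in `nn.MultiheadAttention` (requires `batch_first=True`, PyTorch 1.9+) for brevity; module and parameter names are illustrative, not from the paper's code.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention -> residual + LayerNorm -> feed-forward -> residual + LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + attn_out)       # residual connection + LayerNorm
        x = self.norm2(x + self.ffn(x))    # residual connection + LayerNorm
        return x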

2 decoder

First, looking at the decoder structure from a high level, its sub-layers from bottom to top are (a short sketch follows the list):

  • Masked Multi-Head Self-Attention
  • Multi-Head Encoder-Decoder Attention
  • FeedForward Network
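A matching sketch of one decoder block with these three sub-layers, again with illustrative names and PyTorch built-ins; the causal mask for the masked self-attention is assumed to be built as in the mask sketch above.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention, encoder-decoder attention, feed-forward; each with residual + LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask=None):
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)   # masked self-attention
        x = self.norm1(x + a)
        a, _ = self.cross_attn(x, enc_out, enc_out)             # queries from decoder, K/V from encoder
        x = self.norm2(x + a)
        return self.norm3(x + self.ffn(x))                      # feed-forward + residual + LayerNorm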

4 Common Questions

1 Parallelization

During training, both the encoder and the decoder run in parallel over all positions (the decoder sees the shifted ground-truth outputs). At inference time the encoder is still parallel, but the decoder is not: it generates tokens autoregressively, one position at a time.

https://zhuanlan.zhihu.com/p/368592551

2 Difference between self-attention and ordinary attention

It comes down to whether the queries and the keys/values come from the same place: in self-attention they come from the same sequence, whereas in ordinary (cross-)attention the queries come from one sequence and the keys/values from another, e.g. decoder queries attending to encoder outputs.

3 Why Self-Attention

Motivating our use of self-attention we consider three desiderata.

1. One is the total computational complexity per layer.

2. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

3. The third is the path length between long-range dependencies in the network.

References

https://arxiv.org/abs/1706.03762

An excellent detailed walkthrough: https://jalammar.github.io/illustrated-transformer/

