XLNet: Generalized Autoregressive Pretraining for Language Understanding

1 Key Changes

Relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.

The authors propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, (2) overcomes the limitations of BERT thanks to its autoregressive formulation, and (3) integrates ideas from Transformer-XL into pretraining.

Example: given the sentence [New, York, is, a, city], select the two tokens [New, York] as the prediction targets and maximize $\log p(\text{New York} \mid \text{is a city})$.

In this case, BERT and XLNet respectively reduce to the following objectives:
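Restated from the XLNet paper (the original equation is not reproduced above):

$$\mathcal{J}_{\text{BERT}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city})$$

$$\mathcal{J}_{\text{XLNet}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New, is a city})$$

XLNet captures the dependency between New and York, which BERT's independence assumption drops.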

2 Problems with Existing PTMs

1 AR language modeling

Given a sequence $\textbf{x}=[x_1,\dots,x_T]$, AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization:
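Restated from the XLNet paper (the original equation image is not shown here):

$$\max_{\theta}\ \log p_{\theta}(\textbf{x}) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \textbf{x}_{<t}) = \sum_{t=1}^{T} \log \frac{\exp\big(h_{\theta}(\textbf{x}_{1:t-1})^{\top} e(x_t)\big)}{\sum_{x'} \exp\big(h_{\theta}(\textbf{x}_{1:t-1})^{\top} e(x')\big)}$$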

where $h_{\theta}(\textbf{x}_{1:t-1})$ is the context-aware representation produced by the model and $e(x_t)$ is the embedding of $x_t$.

2 AE language modeling

For an AE model such as BERT, a corrupted version $\hat{\textbf{x}}$ of $\textbf{x}$ is first constructed by masking some tokens, the masked tokens being $\overline{\textbf{x}}$; the training objective is then to reconstruct $\overline{\textbf{x}}$ from $\hat{\textbf{x}}$:
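Restated from the XLNet paper:

$$\max_{\theta}\ \log p_{\theta}(\overline{\textbf{x}} \mid \hat{\textbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{\textbf{x}}) = \sum_{t=1}^{T} m_t \log \frac{\exp\big(H_{\theta}(\hat{\textbf{x}})_t^{\top} e(x_t)\big)}{\sum_{x'} \exp\big(H_{\theta}(\hat{\textbf{x}})_t^{\top} e(x')\big)}$$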

where $m_t=1$ indicates that $x_t$ is masked. The AR model can only see earlier positions at step $t$, hence the notation $h_{\theta}(\textbf{x}_{1:t-1})$; the AE model can attend to every token of the sentence at once, hence the notation $H_{\theta}(\hat{\textbf{x}})_t$.

The advantages and disadvantages of the two approaches are:

3 Comparison

1. The AE objective uses $\approx$ because the masked tokens are only assumed to be mutually independent given the unmasked tokens, which is not strictly true.

2. The special [MASK] token appears during AE pretraining but never during downstream fine-tuning, creating a pretrain-finetune discrepancy. AR language models do not have this problem.

3. AR language models can only condition on context in one direction, whereas AE models can use bidirectional context.

3 Changes in Detail

3.1 Permutation Language Modeling

We propose the permutation language modeling objective that not only retains the benefits of AR models but also allows models to capture bidirectional contexts.

For a sequence of length $T$, there are $T!$ possible factorization orders. Note that the actual input order never changes, since during fine-tuning the model only ever sees text in its natural order. The permutation is implemented purely through the attention mask: tokens that do not precede the target in the sampled order are masked out so that they cannot influence the prediction of $x_i$; in effect, the tokens that do precede it are treated as its context.

As an example (see the figure below), the input sequence is $\{x_1,x_2,x_3,x_4\}$, which has $4! = 24$ possible orders, of which the figure shows four. Suppose we predict $x_3$. The first order is $x_3 \rightarrow x_2 \rightarrow x_4 \rightarrow x_1$: nothing precedes $x_3$, so it attends only to mem; in terms of the actual input (still $x_1 \rightarrow x_2 \rightarrow x_3 \rightarrow x_4$), every input token is masked out and $x_3$ is predicted from mem alone. The second order is $x_2 \rightarrow x_4 \rightarrow x_3 \rightarrow x_1$: $x_2$ and $x_4$ precede $x_3$, so their representations are attended to; in terms of the actual input, $x_1$ and $x_3$ are masked out, leaving $x_2$ and $x_4$, and $x_3$ is predicted from mem, $x_2$, and $x_4$.

The permutation language modeling objective adjusts the model parameters to maximize the following expected likelihood:
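Restated from the XLNet paper:

$$\max_{\theta}\ \mathbb{E}_{\textbf{z}\sim\mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_{\theta}(x_{z_t} \mid \textbf{x}_{\textbf{z}_{<t}})\right]$$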

where $\textbf{z}$ is a random factorization order, $\mathcal{Z}_T$ is the set of all permutations of length $T$, and $z_t$ and $\textbf{z}_{<t}$ denote the $t$-th element of a permutation and the $t-1$ elements preceding it, respectively.
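To make the attention-mask view concrete, here is a minimal NumPy sketch (not taken from the official XLNet code; the function name `permutation_mask` is made up for illustration) that builds the content-stream visibility mask for one sampled factorization order:

```python
import numpy as np

def permutation_mask(perm):
    """Content-stream attention mask for one sampled factorization order.

    perm[k] is the position (in natural order) that comes k-th in the sampled
    order. mask[i, j] == True means position i may attend to position j.
    """
    T = len(perm)
    rank = np.empty(T, dtype=int)
    rank[perm] = np.arange(T)  # rank[pos] = place of pos in the sampled order
    # Content stream: i may attend to j iff j does not come after i in the
    # sampled order (i itself included). The query stream would use a strict <.
    return rank[None, :] <= rank[:, None]

# Natural input x1..x4, sampled order x2 -> x4 -> x3 -> x1 (0-indexed: 1, 3, 2, 0).
# Row 2 of the mask shows that when predicting x3, only x2 and x4 (plus x3's own
# position in the content stream) are visible; memory would always be visible.
print(permutation_mask([1, 3, 2, 0]).astype(int))
```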

3.2 Two-Stream Self-Attention

Target-Aware Representations

If the next-token distribution $p_{\theta}(X_{z_t}\mid\textbf{x}_{\textbf{z}_{<t}})$ is modeled with the standard Softmax parameterization used so far, we get:
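Restated from the XLNet paper:

$$p_{\theta}(X_{z_t}=x \mid \textbf{x}_{\textbf{z}_{<t}}) = \frac{\exp\big(e(x)^{\top} h_{\theta}(\textbf{x}_{\textbf{z}_{<t}})\big)}{\sum_{x'} \exp\big(e(x')^{\top} h_{\theta}(\textbf{x}_{\textbf{z}_{<t}})\big)}$$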

The problem with this parameterization is that it does not depend on the target position $z_t$, which causes ambiguity in target prediction. To see this, assume two different permutations $\textbf{z}^{(1)}$ and $\textbf{z}^{(2)}$ satisfying the following relation:
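In symbols, restated from the paper:

$$\textbf{z}^{(1)}_{<t} = \textbf{z}^{(2)}_{<t} = \textbf{z}_{<t}, \qquad z_t^{(1)} = i \neq j = z_t^{(2)}$$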

One can then derive:
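Substituting into the position-independent parameterization above:

$$p_{\theta}(X_i = x \mid \textbf{x}_{\textbf{z}_{<t}}) = p_{\theta}(X_j = x \mid \textbf{x}_{\textbf{z}_{<t}}) = \frac{\exp\big(e(x)^{\top} h_{\theta}(\textbf{x}_{\textbf{z}_{<t}})\big)}{\sum_{x'} \exp\big(e(x')^{\top} h_{\theta}(\textbf{x}_{\textbf{z}_{<t}})\big)}$$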

However, $p_{\theta}(X_{z_t^{(1)}}=x\mid\textbf{x}_{\textbf{z}_{<t}^{(1)}})$ and $p_{\theta}(X_{z_t^{(2)}}=x\mid\textbf{x}_{\textbf{z}_{<t}^{(2)}})$ should differ, because the target positions are different.

To solve this problem, the paper proposes target-aware representations, which simply also take the position of the target token into account:
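Restated from the XLNet paper (its Eq. (4), referenced again below):

$$p_{\theta}(X_{z_t}=x \mid \textbf{x}_{\textbf{z}_{<t}}) = \frac{\exp\big(e(x)^{\top} g_{\theta}(\textbf{x}_{\textbf{z}_{<t}}, z_t)\big)}{\sum_{x'} \exp\big(e(x')^{\top} g_{\theta}(\textbf{x}_{\textbf{z}_{<t}}, z_t)\big)}$$

where $g_{\theta}(\textbf{x}_{\textbf{z}_{<t}}, z_t)$ additionally takes the target position $z_t$ as input.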

Two-Stream Self-Attention

The contradiction: to predict $x_{z_t}$, the representation $g_{\theta}(\textbf{x}_{\textbf{z}_{<t}}, z_t)$ should use only the position $z_t$ and not the content $x_{z_t}$; but to predict later tokens $x_{z_j}$ ($j>t$), the same representation should also encode the content $x_{z_t}$ to provide full context.

To resolve such a contradiction, we propose to use two sets of hidden representations instead of one:

With self-attention layers indexed by $m=1,2,\dots,M$ and initialization $g_i^{(0)}=w$ (a trainable vector) and $h_i^{(0)}=e(x_i)$, two-stream self-attention can be written as:
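Restated from the XLNet paper:

$$g_{z_t}^{(m)} \leftarrow \text{Attention}\big(Q=g_{z_t}^{(m-1)},\ KV=\textbf{h}_{\textbf{z}_{<t}}^{(m-1)};\ \theta\big) \quad \text{(query stream: uses } z_t \text{ but cannot see } x_{z_t}\text{)}$$

$$h_{z_t}^{(m)} \leftarrow \text{Attention}\big(Q=h_{z_t}^{(m-1)},\ KV=\textbf{h}_{\textbf{z}_{\le t}}^{(m-1)};\ \theta\big) \quad \text{(content stream: uses both } z_t \text{ and } x_{z_t}\text{)}$$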

An example is shown in the figure below.

During pretraining, the last-layer query representation $g_{z_t}^{(M)}$ is used to compute Eq. (4); during fine-tuning, the query stream is simply dropped and the content stream is used as a normal Transformer(-XL).

3.3 Partial Prediction

Because there are many possible orders, the full objective is expensive, so only part of each permutation is predicted. Split $\textbf{z}$ into $\textbf{z}_{\le c}$ and $\textbf{z}_{>c}$ at a cutting point $c$ and predict only the tokens after the cut, since the tokens at the end of the order have the richest available context. A hyperparameter $K$ controls $c$ so that roughly $\frac{1}{K}$ of the tokens are predicted ($\frac{|\textbf{z}|-c}{|\textbf{z}|}\approx\frac{1}{K}$). The objective becomes:
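Restated from the XLNet paper:

$$\max_{\theta}\ \mathbb{E}_{\textbf{z}\sim\mathcal{Z}_T}\Big[\log p_{\theta}(\textbf{x}_{\textbf{z}_{>c}} \mid \textbf{x}_{\textbf{z}_{\le c}})\Big] = \mathbb{E}_{\textbf{z}\sim\mathcal{Z}_T}\Big[\sum_{t=c+1}^{|\textbf{z}|} \log p_{\theta}(x_{z_t} \mid \textbf{x}_{\textbf{z}_{<t}})\Big]$$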

3.4 Incorporating Ideas from Transformer-XL

We integrate two important techniques in Transformer-XL, namely the relative positional encoding scheme and the segment recurrence mechanism

relative positional encoding

recurrence mechanism

3.5 Modeling Multiple Segments

the input to our model is the same as BERT: [CLS, A, SEP, B, SEP], where “SEP” and “CLS” are two special symbols and “A” and “B” are the two segments. Although we follow the two-segment data format, XLNet-Large does not use the objective of next sentence prediction

Whereas BERT adds an absolute segment embedding to the input, XLNet uses Relative Segment Encodings.
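How this works, restated from the XLNet paper: given positions $i$ and $j$, a segment encoding $s_{ij}=s_+$ is used if they belong to the same segment and $s_{ij}=s_-$ otherwise, where $s_+$ and $s_-$ are learnable parameters of each attention head. When $i$ attends to $j$, an extra attention weight

$$a_{ij} = (\textbf{q}_i + \textbf{b})^{\top} s_{ij}$$

is computed, where $\textbf{q}_i$ is the query vector and $\textbf{b}$ is a learnable head-specific bias, and $a_{ij}$ is added to the normal attention score. Only whether two positions are in the same segment is encoded, not which segment they come from.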

There are two benefits of using relative segment encodings. First, the inductive bias of relative encodings improves generalization [9]. Second, it opens the possibility of finetuning on tasks that have more than two input segments, which is not possible using absolute segment encodings.

An open question: for more than two segments, e.g. three segments, does the input format become [CLS, A, SEP, B, SEP, C, SEP]?

References

https://zhuanlan.zhihu.com/p/107350079

https://blog.csdn.net/weixin_37947156/article/details/93035607

https://www.cnblogs.com/nsw0419/p/12892241.html

https://www.cnblogs.com/mantch/archive/2019/09/30/11611554.html

https://cloud.tencent.com/developer/article/1492776

https://zhuanlan.zhihu.com/p/96023284

https://arxiv.org/pdf/1906.08237.pdf


GPT

The GPT trilogy marked the rise and eventual dominance of the "pretrain + fine-tune" paradigm in NLP.

The original papers are:

《Improving Language Understanding by Generative Pre-Training》

《Language Models are Unsupervised Multitask Learners》

《Language Models are Few-Shot Learners》

1.GPT1

(figure: GPT-1 model architecture and task-specific input transformations)

The overall structure of the model is shown in the figure above. Using the model involves two steps: first, pretraining, where a high-capacity language model is learned from a large amount of unlabeled text; second, fine-tuning, where labeled data is used to adapt the model to a specific task.

1.1 Unsupervised pre-training

The basic building block is the Transformer decoder with its encoder-decoder attention sub-layer removed. These blocks are stacked to form the language model, and the output is passed through a softmax layer to obtain the distribution over the target token:
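Restated from the GPT-1 paper:

$$h_0 = U W_e + W_p$$

$$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$

$$P(u) = \text{softmax}(h_n W_e^{\top})$$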

where $U=\{u_{-k},\dots,u_{-1}\}$ is the one-hot context vector of the $k$ tokens preceding the predicted token $u$, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.

Given an unsupervised corpus $\mathcal{U}$, a standard language modeling objective is used to maximize the following likelihood:
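Restated from the GPT-1 paper:

$$L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k},\dots,u_{i-1}; \Theta)$$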

where $k$ is the size of the context window.

1.2 Supervised fine-tuning

For a labeled dataset $\mathcal{C}$, each example consists of a sequence of input tokens $(x^1,x^2,\dots,x^m)$ together with a label $y$. The inputs are passed through the pretrained model, and an added linear output layer predicts $y$:
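Restated from the GPT-1 paper:

$$P(y \mid x^1,\dots,x^m) = \text{softmax}(h_l^m W_y)$$

$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1,\dots,x^m)$$

where $h_l^m$ is the final transformer block's activation at the last token.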

where $W_y$ are the parameters of the added fully connected output layer.

The authors found that using the language modeling objective as an auxiliary loss during fine-tuning has two benefits:

  1. It improves the generalization of the supervised model;
  2. It speeds up convergence.

Therefore, the final loss used for the downstream supervised model is:
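Restated from the GPT-1 paper, with a weighting hyperparameter $\lambda$:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$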

1.3 Task-specific input transformations

All input text is surrounded by start and end tokens $(s)$ and $(e)$.

Classification

Classification proceeds as in 1.2 above, with the input represented as $[(s); Context; (e)]$.

Textual entailment

The premise and hypothesis are concatenated, separated by a delimiter token: $[(s); premise; (\$); hypothesis; (e)]$.

Similarity

Since text similarity does not depend on the order of the two texts being compared, both orderings are fed through the model, as shown in the figure above.

Question answering and commonsense reasoning

Given a document $z$, a question $q$, and a set of candidate answers $\{a_k\}$, each candidate is encoded as $[(s); z; q; (\$); a_k; (e)]$; the candidate sequences are processed independently and their outputs combined, as shown in the figure above.

2.GPT2

In short: multi-task pretraining + a huge dataset + a huge model. A very large dataset covering most NLP tasks is used to pretrain a very large model in a multi-task fashion, so that it can reach SOTA on many NLP tasks without any downstream fine-tuning. As an analogy, take a college entrance exam: intelligence and brain capacity correspond to parameter count, different students correspond to models of different sizes, exam papers correspond to datasets, and different subjects correspond to different tasks. GPT-2 is like a top student with large capacity who has practiced a huge number of problems across all subjects, so it does well on the multi-subject "downstream exam" directly.

How does GPT-2 differ from GPT-1?

  1. GPT-2 drops fine-tuning: instead of building a separately fine-tuned model for each task, the model infers from the input which task it should perform. It is like a widely read person who can answer any type of question off the cuff.

  2. A huge dataset: WebText, which received only light cleaning; the experiments indicate the model is still underfitting it.

  3. More parameters: GPT-2 stacks 48 Transformer layers with a hidden size of 1600, for about 1.5 billion parameters; for comparison, BERT has roughly 0.3 billion. Training at this scale is, of course, also a matter of budget.

  4. Transformer tweaks: layer normalization is moved to the input of each sub-block, and an additional layer normalization is added after the final self-attention block, as shown in the figure below.

  5. Input representation: GPT-2 uses BPE subword units as its input.

  6. Others: the vocabulary grows to 50,257 tokens; the maximum context size increases from 512 (GPT-1) to 1024 tokens; the batch size increases to 512.

Does GPT-2 really take bare text as input, with no prompt at all?

Not quite; prompts are used as well. For example, with the prompt TL;DR:, GPT-2 knows it should summarize: the input is simply the text followed by TL;DR:, and the model's continuation is taken as the summary.

3.GPT3

GPT-3 is an even larger model with 175 billion parameters, about 100 times more than GPT-2; it really feels like the era of raw compute. It is far out of reach for individual users, so it is not covered in depth here.

References

https://zhuanlan.zhihu.com/p/146719974

https://zhuanlan.zhihu.com/p/125139937

https://www.cnblogs.com/yifanrensheng/p/13167796.html#_label1_0

https://www.jianshu.com/p/96c5d5d5c468

https://blog.csdn.net/qq_35128926/article/details/111399679

https://zhuanlan.zhihu.com/p/96791725

https://terrifyzhao.github.io/2019/02/18/GPT2.0%E8%AE%BA%E6%96%87%E8%A7%A3%E8%AF%BB.html

https://zhuanlan.zhihu.com/p/56865533


ELMo (Deep contextualized word representations)

ELMo introduces deep contextualized word representations that model (1) complex characteristics of word use, such as syntax and semantics, and (2) how these uses vary across linguistic contexts. The representations come from a deep bidirectional language model pretrained on a large corpus. They can easily be plugged into existing models and achieved SOTA on six NLP tasks. The authors also show that exposing the deep internals of the pretrained network is crucial, as it lets downstream models mix different types of semi-supervision signals.

3 ELMo: Embeddings from Language Models

The overall structure is shown above: two unidirectional multi-layer LSTM stacks, the left one running forward and the right one backward.

3.1 Bidirectional language models (pretraining)

Assume a sentence has $N$ tokens $(t_1,t_2,\dots,t_N)$. The forward language model factorizes the sentence probability as:
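Restated from the ELMo paper:

$$p(t_1,t_2,\dots,t_N) = \prod_{k=1}^{N} p(t_k \mid t_1,t_2,\dots,t_{k-1})$$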

The backward language model factorizes it in the opposite direction:
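Likewise from the paper:

$$p(t_1,t_2,\dots,t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1},t_{k+2},\dots,t_N)$$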

Combining the two gives the bidirectional language model, whose training objective (in log form) jointly maximizes:
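Restated from the ELMo paper:

$$\sum_{k=1}^{N}\Big(\log p(t_k \mid t_1,\dots,t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1},\dots,t_N;\ \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s)\Big)$$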

where $\Theta_x$ denotes the token-representation parameters, $\Theta_s$ the softmax-layer parameters, $\overrightarrow{\Theta}_{LSTM}$ the forward language model parameters, and $\overleftarrow{\Theta}_{LSTM}$ the backward language model parameters.

3.2 ELMo (how the word representation is formed)

After pretraining the $L$-layer deep bidirectional language model, each token $t_k$ has $2L+1$ associated representations, collected in the set:
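Restated from the ELMo paper:

$$R_k = \{x_k^{LM}, \overrightarrow{h^{LM}_{k,j}}, \overleftarrow{h^{LM}_{k,j}} \mid j=1,\dots,L\} = \{h_{k,j}^{LM} \mid j=0,\dots,L\}$$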

Note that $h_{k,0}^{LM}=x_{k}^{LM}$ and $h_{k,j}^{LM}=[\overrightarrow{h^{LM}_{k,j}};\overleftarrow{h^{LM}_{k,j}}]$, where $x_{k}^{LM}$ is the token representation and $\overrightarrow{h^{LM}_{k,j}},\overleftarrow{h^{LM}_{k,j}}$ are the hidden states of the forward and backward language models at layer $j$.

For a downstream task, the $2L+1$ representations are collapsed into a single vector $ELMo_k^{task}$. The simplest choice is to keep only the top layer:
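That is, following the paper:

$$ELMo_k^{task} = E(R_k) = h_{k,L}^{LM}$$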

A more general approach is to learn a task-specific linear combination of all layers, as in the figure below; the formula is:
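Restated from the ELMo paper:

$$ELMo_k^{task} = E(R_k;\Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}$$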

where $\gamma^{task}$ is a scalar that scales the whole vector and $s_{j}^{task}$ are softmax-normalized layer weights; both are learned on the downstream task.

3.3 Using biLMs for supervised NLP tasks (fine-tuning)

In the downstream task model, each token has a context-independent static representation $x_k$ and a contextual representation $h_k$.

For some tasks, $x_k$ is concatenated with $ELMo_k^{task}$ and used as the downstream input features: $[x_k;ELMo_k^{task}]$.

For some tasks, concatenating $h_k$ with $ELMo_k^{task}$ further improves results: $[h_k;ELMo_k^{task}]$.

References

https://blog.csdn.net/linchuhai/article/details/97170541

https://zhuanlan.zhihu.com/p/63115885

https://zhuanlan.zhihu.com/p/88993965

https://arxiv.org/abs/1802.05365


Pre-trained Models for Natural Language Processing: A Survey

The original survey is very rich; these notes will be expanded gradually.

Abstract

The survey starts from language representation learning, then comprehensively explains the principles, architectures, and downstream tasks of pre-trained models (PTMs), and finally lists future research directions. It is meant as a guide for newcomers to NLP and PTMs.

1.Introduction

With the development of deep learning, many techniques have been applied to NLP, such as CNNs, RNNs, GNNs, and attention.

Despite these successes on NLP tasks, the performance gains are less pronounced than in CV. The main reason is that datasets for most NLP tasks are fairly small (machine translation being an exception), while deep networks have huge numbers of parameters; without enough data to support training, they overfit.

Recently, a large body of work has shown that pre-trained models (PTMs) trained on large corpora can learn universal language representations, which benefits downstream NLP tasks and avoids training new models from scratch. With growing compute, the emergence of deep models (e.g. the Transformer), and steadily improving training techniques, PTM architectures have moved from shallow to deep. First-generation PTMs aimed at non-contextual word embeddings: since downstream tasks only need the trained embedding matrix rather than the model itself, these models are very shallow by today's standards, e.g. Skip-Gram and GloVe. Although such pretrained embeddings capture word semantics, they are context-free and fail to capture higher-level concepts in context, so they fall short on phenomena such as polysemy, syntactic structure, semantic roles, and anaphora. Second-generation PTMs focus on contextual word embeddings, e.g. BERT and GPT; these encoders are still needed at downstream time to produce in-context representations of words.

2.Background

2.1 Language Representation Learning

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. And each dimension of the vector has no corresponding sense, while the whole represents a concrete concept.

Non-contextual Embeddings

This step maps each token, e.g. $x$ in the figure, to a vector $e_x \in \mathbb{R}^{D_e}$, where $D_e$ is the embedding dimension. The mapping is a lookup in an offline-trained embedding matrix $E\in \mathbb{R}^{D_e\times |\mathcal{V}|}$, where $\mathcal{V}$ is the vocabulary.

There are two main problems with this. First, the embeddings are static: context is ignored, so polysemy cannot be handled. Second, out-of-vocabulary words; many techniques mitigate this, e.g. character-level models, or subword approaches such as BPE and CharCNN.

Contextual Embeddings

To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts:
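Restated from the survey:

$$[\textbf{h}_1, \textbf{h}_2, \dots, \textbf{h}_T] = f_{enc}(x_1, x_2, \dots, x_T)$$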

where $f_{enc}(\cdot)$ is a deep encoder and $\textbf{h}_t$ is called the contextual embedding or dynamical embedding of $x_t$.

2.2 Neural Contextual Encoders

Contextual encoders fall into two classes: sequence models and non-sequence models.

2.2.1 sequence models

Sequence models fall into two classes, convolutional models and recurrent models, as shown in the figure above.

Convolutional

Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors by convolution operations

Recurrent

Recurrent models capture the contextual representations of words with short memory, such as LSTMs and GRUs . In practice, bi-directional LSTMs or GRUs are used to collect information from both sides of a word, but its performance is often affected by the long-term dependency problem.

2.2.2 non-sequence models

Transformer: models the relation between every pair of words in the sequence.

2.2.3 Analysis

Sequence models:

1.Sequence models learn the contextual representation of the word with locality bias and are hard to capture the long-range interactions between words.

2.Nevertheless, sequence models are usually easy to train and get good results for various NLP tasks.

fully-connected self-attention model:

1.can directly model the dependency between every two words in a sequence, which is more powerful and suitable to model long range dependency of language

2.However, due to its heavy structure and less model bias, the Transformer usually requires a large training corpus and is easy to overfit on small or modestly-sized datasets

Conclusion: the Transformer has become the mainstream architecture of PTMs due to its powerful capacity.

2.3 Why Pre-training?

  1. Pre-training on the huge text corpus can learn universal language representations and help with the downstream tasks.
  2. Pre-training provides a better model initialization, which usually leads to a better generalization performance and speeds up convergence on the target task.
  3. Pre-training can be regarded as a kind of regularization to avoid overfitting on small data

3 Overview of PTMs

3.1 Pre-training Tasks

Pre-training tasks are crucial for learning universal language representations. They should generally be challenging and have abundant training data. This section groups pre-training tasks into three categories: supervised learning, unsupervised learning, and self-supervised learning.

Self-Supervised learning: is a blend of supervised learning and unsupervised learning. The learning paradigm of SSL is entirely the same as supervised learning, but the labels of training data are generated automatically. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the rest words.

Next, the commonly used self-supervised pre-training tasks are introduced.

3.1.1 Language Modeling (LM)

3.1.2 Masked Language Modeling (MLM)

3.1.3 Permuted Language Modeling (PLM)

3.1.4 Denoising Autoencoder (DAE)

3.1.5 Contrastive Learning (CTL)

NSP also belongs to CTL.

https://zhuanlan.zhihu.com/p/360892229

3.1.6 Others

3.2 Taxonomy of PTMs

The authors classify existing PTMs along four axes: Representation Type, Architectures, Pre-Training Task Types, and Extensions; the resulting taxonomy is shown above. The figure is slightly inconsistent with the text (perhaps an oversight by the authors?): it shows five categories, with Tuning Strategies added, and the Representation Type axis appears in the figure as "Contextual".

3.3 Model Analysis

4 Extensions of PTMs

4.1 Knowledge-Enriched PTMs

4.2 Multilingual and Language-Specific PTMs

4.3 Multi-Modal PTMs

4.4 Domain-Specific and Task-Specific PTMs

4.5 Model Compression

5 Adapting PTMs to Downstream Tasks

Although PTMs capture a lot of general knowledge, transferring it effectively to downstream tasks remains a challenge.

5.1 Transfer Learning

Transfer learning adapts the knowledge from a source task (or domain) to a target task (or domain), as illustrated in the figure below.

5.2 How to Transfer?

5.2.1 Choosing appropriate pre-training task, model architecture and corpus

5.2.2 Choosing appropriate layers

Which layers should feed the downstream task?

(The selected PTM layers act as model 1, followed by the downstream task model 2.)

Different layers of a deep model capture different kinds of knowledge, e.g. part-of-speech tagging, syntactic parsing, long-range dependencies, semantic roles, coreference. For RNN-based models, studies show that different layers of a multi-layer LSTM encoder suit different tasks; for Transformer-based models, basic syntactic information appears in the lower layers while high-level semantic information appears in the deeper layers.

Let $\textbf{H}^{l}$ ($1\le l\le L$) denote the representation from the $l$-th layer of the PTM and $g(\cdot)$ the task-specific model. There are several ways to choose the representation:

a) Embedding Only

Use only the pre-trained static embeddings, i.e. $g(\textbf{H}^{1})$.

b) Top Layer

Take the top-layer representation and feed it into the task-specific model, i.e. $g(\textbf{H}^{L})$.

c) All Layers

Feed in the representations from all layers and let the model automatically select the most suitable mixture (as ELMo does) before passing it to the task-specific model; the formula is:
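Restated from the survey:

$$\textbf{r}_t = \gamma \sum_{l=1}^{L} \alpha_l\, \textbf{h}_t^{(l)}$$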

where $\alpha_l$ is the softmax-normalized weight for layer $l$ and $\gamma$ is a scalar that scales the vectors output by the pre-trained model.

5.2.3 To tune or not to tune?

总共有两种常用的模型迁移方式:feature extraction (where the pre-trained parameters are frozen), and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).

The question is whether the parameters of the selected PTM layers (model 1) are frozen or not; the downstream model (model 2) is always trained.

Question to verify: does BERT fine-tune only the top layer?

5.3 Fine-Tuning Strategies

Two-stage fine-tuning

The first stage fine-tunes on an intermediate task, the second on the target task.

Multi-task fine-tuning

multi-task learning and pre-training are complementary technologies.

Fine-tuning with extra adaptation modules

The main drawback of fine-tuning is its parameter inefficiency: every downstream task has its own fine-tuned parameters. Therefore, a better solution is to inject some fine-tunable adaptation modules into PTMs while the original parameters are fixed.

Others

self-ensemble, self-distillation, gradual unfreezing, sequential unfreezing

References

https://arxiv.org/pdf/2003.08271v4.pdf


BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding)

https://arxiv.org/abs/1810.04805

1 Architecture

The overall architecture is shown above; the basic unit is the encoder part of the Transformer. In the authors' words: "BERT's model architecture is a multi-layer bidirectional Transformer encoder."

2 Input/Output Representations

[CLS] marks the start of the input; [SEP] marks the end of a sentence and separates the two sentences.

The token embedding is the word representation, the position embedding encodes position information, and the segment embedding distinguishes sentences A and B; the final input vector is the sum of the three. Compared with the original Transformer, the segment embedding is the extra component.

A concrete example: https://www.cnblogs.com/d0main/p/10447853.html

3 Pre-training tasks

1 Masked LM

Standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself". In order to train a deep bidirectional representation, BERT uses the masked language model (MLM).

The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time.
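A toy Python sketch of this corruption procedure (not the official BERT implementation; the function name `mask_tokens` and the independent per-token sampling are simplifications for illustration):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy sketch of BERT-style MLM corruption.

    Returns the corrupted token list and the (position, original_token)
    prediction targets.
    """
    tokens = list(tokens)
    targets = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:      # pick roughly 15% of positions
            targets.append((i, tok))
            r = random.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                tokens[i] = "[MASK]"
            elif r < 0.9:                     # 10%: replace with a random token
                tokens[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return tokens, targets

corrupted, targets = mask_tokens(
    ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"],
    vocab=["my", "dog", "is", "hairy", "cute", "the"],
)
print(corrupted, targets)
```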

2 Next Sentence Prediction (NSP)

In order to train a model that understands relationships between sentences, BERT adds the NSP task:

When choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).

4 Fine-tuning BERT

For each task, we simply plug in the task specific inputs and outputs into BERT and finetune all the parameters end-to-end.

Input: a sentence pair or a single sentence, depending on the task.

输出:At the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

5 Common Questions

1 Why is BERT bidirectional while GPT is unidirectional?

1. Different architectures

BERT uses the Transformer encoder, so encoding a token attends to both its left and right context; GPT uses the Transformer decoder, which can only attend to the preceding context.

2. Different pre-training tasks

2 Why is BERT's input length fixed?

Because BERT is based on the Transformer encoder, all positions are processed in parallel rather than step by step, so the sequence length must be fixed in advance.

BERT's input/output length is max_length: longer inputs are truncated and shorter ones are padded; max_length can be at most 512.

3 Why does BERT need extra position information?

Because the tokens are processed in parallel rather than sequentially, there is no inherent notion of position, so position embeddings must be added.


