Pre-Training with Whole Word Masking for Chinese BERT

BERT-wwm-ext

wwm: whole word masking

ext: we also use extended training data (marked with ext in the model name)

Pre-training

1 Changing the masking strategy

Whole Word Masking (wwm)

cws: Chinese Word Segmentation

Comparison of four masking strategies

References

Pre-Training with Whole Word Masking for Chinese BERT

https://arxiv.org/abs/1906.08101v3

Revisiting Pre-trained Models for Chinese Natural Language Processing

https://arxiv.org/abs/2004.13922

GitHub: https://hub.fastgit.org/ymcui/Chinese-BERT-wwm

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

There are three main contributions that ALBERT makes over the design choices of BERT:

1 Factorized embedding parameterization

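The original embedding layer is a single matrix $M_{emb[V\times H]}$. It is now factorized into two matrices, $M_{emb1[V\times E]}$ and $M_{emb2[E\times H]}$, so the parameter count drops from VH to VE+EH. (This parameter reduction is significant when H >> E.)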
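To make the saving concrete, here is a minimal sketch that counts embedding parameters before and after factorization. The sizes V=30000, H=768 and E=128 are illustrative assumptions, not values taken from these notes.

```python
def full_embedding_params(V, H):
    """Parameters of a single V x H embedding matrix."""
    return V * H


def factorized_embedding_params(V, H, E):
    """Parameters of a V x E matrix followed by an E x H projection."""
    return V * E + E * H


# Illustrative sizes (assumed for this example only).
V, H, E = 30000, 768, 128
full = full_embedding_params(V, H)            # 23,040,000
fact = factorized_embedding_params(V, H, E)   # 3,938,304
print(full, fact, round(full / fact, 2))
```

With these assumed sizes the embedding block shrinks by a factor of roughly 5.9, and the gap widens as H grows relative to E.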

2 Cross-layer parameter sharing

The default decision for ALBERT is to share all parameters across layers (attention and FFN).

3 Inter-sentence coherence loss

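The original NSP task is replaced by SOP (sentence-order prediction). Positive examples are built the same way as in NSP (two consecutive segments in their original order), while negative examples are the same two segments with their order swapped.

References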
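A small sketch of how SOP training pairs could be built from two consecutive segments, under the description above. The function name, the 50/50 split and the label convention (1 = original order, 0 = swapped) are assumptions made for illustration.

```python
import random


def make_sop_example(seg_a, seg_b, rng):
    """seg_a and seg_b are consecutive segments of one document, in order.

    Returns (first, second, label): label 1 keeps the original order,
    label 0 is the same pair with the order swapped.
    """
    if rng.random() < 0.5:
        return seg_a, seg_b, 1
    return seg_b, seg_a, 0


rng = random.Random(0)
example = make_sop_example(["it", "rained"], ["we", "stayed", "in"], rng)
print(example)
```

Unlike an NSP negative, the SOP negative never draws text from another document, so the model cannot solve the task from topic cues alone.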

https://zhuanlan.zhihu.com/p/88099919

https://blog.csdn.net/weixin_37947156/article/details/101529943

https://openreview.net/pdf?id=H1eA7AEtvS

Word-level Chinese BERT

1 Is Word Segmentation Necessary for Deep Learning of Chinese Representations?

We find that char-based (character-level) models consistently outperform word-based (word-level) models.

We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting.

2 Tencent Chinese word-level model

The word-level model performs worse than the character-level model on public datasets.

References

https://arxiv.org/pdf/1905.05526.pdf

https://www.jiqizhixin.com/articles/2019-06-27-17

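RoBERTa: A Robustly Optimized BERT Pretraining Approach

1. Comparison with BERT

Architecturally it is identical to the original BERT. The main changes, analyzed in section 2 below, concern the masking strategy, the model input format and NSP, the batch size, and the text encoding.

2. Analysis of the changes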

2.1 Static vs. Dynamic Masking

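static masking: the original BERT uses static masking, i.e. the data is masked once, ahead of time, when the pretraining data is created.

dynamic masking: the random mask is generated each time a training example is fed to the model.

Comparison of results:

Conclusion: dynamic masking is comparable to or slightly better than static masking.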
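The difference between the two strategies can be sketched as below. This is a simplified illustration: it only replaces tokens with [MASK] (no 80/10/10 rule), and the helper names are hypothetical.

```python
import random

MASK = "[MASK]"


def random_mask(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens where each position is masked with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def static_epochs(tokens, n_epochs, seed=0):
    """Static masking: mask once during preprocessing, reuse the result every epoch."""
    masked_once = random_mask(tokens, random.Random(seed))
    return [masked_once for _ in range(n_epochs)]


def dynamic_epochs(tokens, n_epochs, seed=0):
    """Dynamic masking: draw a fresh mask each time the example is fed to the model."""
    rng = random.Random(seed)
    return [random_mask(tokens, rng) for _ in range(n_epochs)]
```

With static masking the model sees the same masked positions in every epoch; with dynamic masking the positions change from epoch to epoch.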

2.2 Model Input Format and Next Sentence Prediction

The authors ran comparison experiments, with the following results:

Conclusions:

Model Input Format:

1. Using individual sentences hurts performance on downstream tasks.

Next Sentence Prediction:

1. Removing the NSP loss matches or slightly improves downstream task performance.

2.3 Training with large batches

2.4 Text Encoding

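Uses byte-level BPE (BBPE) instead of WordPiece.

3 FAQ

1 Why does the RoBERTa tokenizer have no token_type_ids?
RoBERTa dropped NSP, so it needs no segment embedding and therefore no token_type_ids. In practice, though, the Chinese models do produce token_type_ids while the English ones do not; the Chinese RoBERTa-wwm checkpoints in the Chinese-BERT-wwm repository linked below are meant to be loaded with the BERT tokenizer and model classes, which would explain the difference. Without token_type_ids, how are two sentences told apart? The separator token (sep) is still there; only the segment embedding is gone.

2 Usage pitfalls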

https://blog.csdn.net/zwqjoy/article/details/107533184

https://hub.fastgit.org/ymcui/Chinese-BERT-wwm

References

https://zhuanlan.zhihu.com/p/103205929

https://zhuanlan.zhihu.com/p/143064748

https://blog.csdn.net/zwqjoy/article/details/107533184

https://hub.fastgit.org/ymcui/Chinese-BERT-wwm

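GPT

The GPT trilogy marks the rise and the heyday of the "pre-training + fine-tuning" era in NLP.

The original papers are:

"Improving Language Understanding by Generative Pre-Training"

"Language Models are Unsupervised Multitask Learners"

"Language Models are Few-Shot Learners"

1. GPT1

[Figure: overall structure of GPT1 and its task-specific input transformations]

The overall structure of the model is shown in the figure above. Usage has two steps: the first is pre-training, in which a high-capacity language model is learned from a large corpus; the second is fine-tuning, in which labeled data is used to fit the model to a specific task.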

1.1 Unsupervised pre-training

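The authors take the Transformer decoder block with its Encoder-Decoder Attention layer removed as the basic unit, stack several of these blocks to form the body of the language model, and pass the output through a softmax layer to obtain the output distribution over the target token:

$$h_0 = U W_e + W_p$$

$$h_l = \text{transformer\_block}(h_{l-1}), \quad \forall l \in [1, n]$$

$$P(u) = \text{softmax}(h_n W_e^T)$$

where $U=\{u_{-k},…,u_{-1}\}$ is the sequence of one-hot encodings of the $k$ tokens preceding the predicted token $u$, $n$ is the number of layers, $W_e$ is the token embedding matrix and $W_p$ is the position embedding matrix.

Given an unsupervised corpus $\mathcal{U}$ of tokens $u_i$, a standard language modeling objective is used to maximize the following likelihood:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, …, u_{i-1}; \Theta)$$

where $k$ is the size of the context window and $\Theta$ denotes the model parameters.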

1.2 Supervised fine-tuning

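For a labeled dataset $\mathcal{C}$, each example consists of a sequence of input tokens and a label, $(x^1,x^2,…,x^m,y)$. The inputs are passed through the pre-trained model, and the activation of the final ($n$-th) layer at the last token, $h_n^m$, is fed into a fully connected layer to predict $y$:

$$P(y \mid x^1, …, x^m) = \text{softmax}(h_n^m W_y)$$

where $W_y$ holds the parameters of the fully connected layer. The supervised objective to maximize is:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, …, x^m)$$

The authors found that using language modeling as an auxiliary objective during supervised fine-tuning has two benefits:

  1. it improves the generalization of the supervised model;
  2. it accelerates convergence.

So the final objective used for the downstream supervised model is:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

where $L_1$ is the language modeling objective from 1.1 and $\lambda$ is a weighting coefficient.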
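As a rough numerical illustration of how the supervised objective and the auxiliary language modeling objective combine, here is a toy sketch in plain Python. The inputs (per-token probabilities, a two-dimensional hidden vector, the per-class weight vectors and the value of `lam`) are made-up assumptions, not values from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lm_log_likelihood(token_probs):
    """Language modeling term: sum of log P(u_i | context) over observed tokens."""
    return sum(math.log(p) for p in token_probs)


def classification_log_likelihood(h_last, class_weights, label):
    """Supervised term for one example: log softmax(h W_y)[label].

    class_weights holds one weight vector per class (the columns of W_y).
    """
    logits = [sum(h * w for h, w in zip(h_last, col)) for col in class_weights]
    return math.log(softmax(logits)[label])


def combined_objective(l2, l1, lam):
    """Supervised objective plus lam times the auxiliary LM objective."""
    return l2 + lam * l1


l1 = lm_log_likelihood([0.5, 0.25])
l2 = classification_log_likelihood([0.3, -0.1], [[1.0, 0.0], [0.0, 1.0]], label=0)
print(combined_objective(l2, l1, lam=0.5))
```

Both terms are log-likelihoods, so training maximizes the combined value (or equivalently minimizes its negative).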

1.3 Task-specific input transformations

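All input texts are wrapped with a start token and an end token, $(s)$ and $(e)$.

Classification

Classification proceeds as in 1.2 above; the input is represented as $[(s); Context; (e)]$.

Textual entailment

The input is concatenated as $[(s); premise; (\$); hypothesis; (e)]$, where $(\$)$ is a delimiter token.

Similarity

Since text similarity does not depend on the order of the two texts being compared, both orderings are taken into account, as shown in the figure above.

Question answering and commonsense reasoning

Let the document be $z$, the question $q$ and the set of candidate answers $\{a_k\}$. The input is represented as $[(s); z; q; (\$); a_k; (e)]$, one sequence per candidate answer, and the sequences are combined as shown in the figure above.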
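The four input formats above can be sketched as simple list manipulations. The string tokens `(s)`, `($)` and `(e)` stand in for the special tokens, and the function names are hypothetical.

```python
START, DELIM, END = "(s)", "($)", "(e)"


def classification_input(context):
    """[(s); context; (e)]"""
    return [START] + context + [END]


def pair_input(first, second):
    """[(s); first; ($); second; (e)], used for entailment as premise/hypothesis."""
    return [START] + first + [DELIM] + second + [END]


def similarity_inputs(text_a, text_b):
    """Similarity has no inherent order, so both orderings are produced."""
    return [pair_input(text_a, text_b), pair_input(text_b, text_a)]


def qa_inputs(document, question, answers):
    """One sequence [(s); z; q; ($); a_k; (e)] per candidate answer a_k."""
    return [[START] + document + question + [DELIM] + a + [END] for a in answers]


print(qa_inputs(["doc"], ["why", "?"], [["because"], ["no", "idea"]]))
```

Each sequence produced for similarity or question answering is run through the model separately; how the per-sequence outputs are then combined is shown in the figure rather than in this sketch.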

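2. GPT2

In a nutshell: multi-task pre-training + a very large dataset + a very large model. A very large dataset covers most NLP tasks, and a very large model is pre-trained on it in a multi-task fashion, so that it reaches SOTA on several NLP tasks without any fine-tuning on downstream tasks. Take the college entrance exam (gaokao) as an analogy: a person's intelligence and brain capacity correspond to the parameter count; because individuals differ, different students are models with different parameter counts; the exam papers are the datasets, and the different subjects are the different tasks. GPT2 is like a top student: very high intelligence and brain capacity, plus a huge amount of practice across subjects, so it scores well on the gaokao, which is a multi-task downstream task.

How does GPT2 differ from GPT1?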

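  1. GPT2 drops fine-tuning: tasks are no longer modeled with separate fine-tuning; the model works out by itself which task it is being asked to do. It is like a very well-read person who can answer whatever type of question you ask off the cuff; GPT2 is exactly such a well-read model.

  2. Very large dataset: WebText. The dataset received some simple cleaning, and the experiments indicate that the model is still underfitting it.

  3. More parameters: GPT2 stacks 48 Transformer layers with a hidden size of 1600, for a total of 1.5 billion parameters. To put 1.5 billion in perspective, BERT has only about 0.3 billion~ Of course, not everyone can reach this parameter count; it depends on how much money you have~

  4. Adjusted Transformer: layer normalization is moved to the input of each sub-block, and an additional layer normalization is added after the final Transformer block, as shown in the figure below.

  5. Input representation: GPT2 uses BPE subwords as its input.

  6. Others: the vocabulary is expanded to 50257; the maximum context size grows from GPT's 512 to 1024 tokens; the batch size is increased to 512.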
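The 1.5 billion figure in item 3 can be sanity-checked with a back-of-the-envelope count. This sketch assumes the common approximation of about 12·d² weights per Transformer layer (4·d² for attention, 8·d² for a feed-forward block with inner size 4d), ignores biases and layer-norm parameters, and assumes the output layer shares the token embedding; none of these assumptions come from the notes above.

```python
def approx_transformer_params(n_layers, d_model, vocab_size, n_positions):
    """Rough parameter count for a decoder-only Transformer (weights only)."""
    attention = 4 * d_model ** 2        # query, key, value and output projections
    feed_forward = 8 * d_model ** 2     # d -> 4d -> d
    embeddings = vocab_size * d_model + n_positions * d_model
    return n_layers * (attention + feed_forward) + embeddings


# Values quoted in the list above: 48 layers, hidden size 1600,
# vocabulary 50257, context size 1024.
gpt2_params = approx_transformer_params(48, 1600, 50257, 1024)
print(gpt2_params)  # 1556609600, i.e. about 1.5 billion
```

The estimate lands at roughly 1.56 billion, consistent with the quoted 1.5 billion.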

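Is GPT2's input just raw text, with no prompt at all?

Of course not. It also takes a prompt, for example $TL;DR:$, which tells GPT2 that the task is summarization; the input format is $text + TL;DR:$, and then you just wait for the output~

3. GPT3

GPT3 is a very large-scale model with 175 billion parameters, more than 100 times the size of GPT2; it really feels like we have entered the era of compute. It is too far out of reach for individual users, so I will not dig into it here.

References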

https://zhuanlan.zhihu.com/p/146719974

https://zhuanlan.zhihu.com/p/125139937

https://www.cnblogs.com/yifanrensheng/p/13167796.html#_label1_0

https://www.jianshu.com/p/96c5d5d5c468

https://blog.csdn.net/qq_35128926/article/details/111399679

https://zhuanlan.zhihu.com/p/96791725

https://terrifyzhao.github.io/2019/02/18/GPT2.0%E8%AE%BA%E6%96%87%E8%A7%A3%E8%AF%BB.html

https://zhuanlan.zhihu.com/p/56865533

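Pre-trained Models for Natural Language Processing: A Survey

The original paper is very rich in content; these notes will be updated gradually as I study it.

Abstract

This survey starts from language representation learning, then gives a comprehensive account of the principles, architectures and downstream tasks of pre-trained models, and finally lists future directions for PTMs. Its aim is to serve as a guide for newcomers to NLP and to PTMs, which is touching.

1. Introduction

With the development of deep learning, many deep learning techniques have been applied to NLP, such as CNNs, RNNs, GNNs and attention.

Although NLP tasks have seen great success, the performance improvement may be less pronounced than in CV. This is mainly because the datasets for NLP tasks are very small (except for machine translation), whereas deep networks have a very large number of parameters; without enough data to support training, the network overfits.

Recently, a large body of work has shown that pre-trained models (PTMs) can learn universal language representations from large corpora, which benefits downstream NLP tasks and avoids training a new model from scratch. With the growth of compute, the emergence of deep models (e.g. the Transformer) and the steady improvement of training techniques, PTM architectures have evolved from shallow to deep. First-generation PTMs aim at non-contextual word embeddings. Because downstream tasks do not need the models themselves, only the trained embedding matrix, these models are very shallow by the standards of today's compute, e.g. Skip-Gram and GloVe. Although these pre-trained embeddings capture the semantics of words, they are context-free and cannot capture higher-level meaning in context, so they break down on some tasks, e.g. polysemy, syntactic structure, semantic roles and anaphora. Second-generation PTMs focus on contextual word embeddings, e.g. BERT and GPT. These learned encoders are still needed by downstream tasks to represent words in context.

2. Background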

2.1 Language Representation Learning

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. And each dimension of the vector has no corresponding sense, while the whole represents a concrete concept.

Non-contextual Embeddings

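This step maps each discrete token, such as $x$ in the figure, to a vector $e_x \in \mathbb{R}^{D_e}$, where $D_e$ is the embedding dimension. Vectorization is a lookup in an embedding matrix $E\in \mathbb{R}^{D_e\times |\mathcal{V}|}$ trained offline, where $\mathcal{V}$ is the vocabulary.

This process has two main problems. First, the embedding is static: it ignores context and cannot handle polysemous words. Second, there is the OOV problem. Many approaches mitigate it, for example character-level or subword representations such as CharCNN and BPE.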
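The lookup and its two limitations can be illustrated with a tiny hand-written table. The vocabulary, the `[UNK]` fallback and the numbers are invented for this sketch.

```python
# Toy vocabulary; index 0 is a hypothetical fallback for unknown words.
vocab = {"[UNK]": 0, "bank": 1, "river": 2}

# E has shape D_e x |V|: one row per embedding dimension,
# one column per vocabulary entry (D_e = 2 here).
E = [
    [0.0, 0.1, 0.4],
    [0.0, 0.2, 0.5],
]


def embed(token):
    """Look up the column of E that belongs to the token."""
    idx = vocab.get(token, vocab["[UNK]"])
    return [row[idx] for row in E]


# Static: "bank" gets the same vector in "river bank" and "bank account".
print(embed("bank"))
# OOV: any word outside the vocabulary collapses to the [UNK] vector.
print(embed("transformer"))
```

Both calls ignore the surrounding sentence entirely, which is exactly what contextual embeddings are meant to fix.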

Contextual Embeddings

To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts:

$$[\textbf{h}_1, \textbf{h}_2, …, \textbf{h}_T] = f_{enc}(x_1, x_2, …, x_T)$$

where $f_{enc}(\cdot)$ is a deep encoder and $\textbf{h}_t$ is the contextual embedding (or dynamical embedding) of token $x_t$.

2.2 Neural Contextual Encoders

They fall into two categories: sequence models and non-sequence models.

2.2.1 sequence models

Sequence models are divided into two types, convolutional models and recurrent models; see the figure above.

Convolutional

Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors by convolution operations

Recurrent

Recurrent models capture the contextual representations of words with short memory, such as LSTMs and GRUs. In practice, bi-directional LSTMs or GRUs are used to collect information from both sides of a word, but their performance is often affected by the long-term dependency problem.

2.2.2 non-sequence models

transformer: model the relation of every two words

2.2.3 Analysis

Sequence models:

1. Sequence models learn the contextual representation of a word with a locality bias and find it hard to capture long-range interactions between words.

2. Nevertheless, sequence models are usually easy to train and get good results on various NLP tasks.

Fully-connected self-attention model:

1. It can directly model the dependency between every two words in a sequence, which is more powerful and better suited to modeling the long-range dependencies of language.

2. However, due to its heavy structure and weaker model bias, the Transformer usually requires a large training corpus and easily overfits on small or modestly sized datasets.

Conclusion: the Transformer has become the mainstream architecture of PTMs due to its powerful capacity.

2.3 Why Pre-training?

  1. Pre-training on the huge text corpus can learn universal language representations and help with the downstream tasks.
  2. Pre-training provides a better model initialization, which usually leads to a better generalization performance and speeds up convergence on the target task.
  3. Pre-training can be regarded as a kind of regularization to avoid overfitting on small data

3 Overview of PTMs

3.1 Pre-training Tasks

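Pre-training tasks are crucial for learning universal language representations. In general, these tasks should be challenging and have large amounts of training data. In this section the pre-training tasks are divided into three categories: supervised learning, unsupervised learning and self-supervised learning.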

Self-Supervised learning: is a blend of supervised learning and unsupervised learning. The learning paradigm of SSL is entirely the same as supervised learning, but the labels of training data are generated automatically. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the rest words.

Next, the commonly used pre-training tasks based on self-supervised learning are introduced.

3.1.1 Language Modeling (LM)

3.1.2 Masked Language Modeling (MLM)

3.1.3 Permuted Language Modeling (PLM)

3.1.4 Denoising Autoencoder (DAE)

3.1.5 Contrastive Learning (CTL)

NSP also belongs to CTL.

https://zhuanlan.zhihu.com/p/360892229

3.1.6 Others

3.2 Taxonomy of PTMs

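The authors classify existing PTMs from four perspectives: Representation Type, Architectures, Pre-Training Task Types and Extensions. The resulting taxonomy is shown above. The figure and the text are slightly inconsistent; perhaps the authors overlooked it? The figure has five categories, with an extra Tuning Strategies, and Representation Type appears in the figure as Contextual?.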

3.3 Model Analysis

4 Extensions of PTMs

4.1 Knowledge-Enriched PTMs

4.2 Multilingual and Language-Specific PTMs

4.3 Multi-Modal PTMs

4.4 Domain-Specific and Task-Specific PTMs

4.5 Model Compression

5 Adapting PTMs to Downstream Tasks

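Although PTMs learn a great deal of general knowledge, applying that knowledge effectively to downstream tasks is a challenge.

5.1 Transfer Learning

Transfer learning is to adapt the knowledge from a source task (or domain) to a target task (or domain), as shown in the figure below.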

5.2 How to Transfer?

5.2.1 Choosing appropriate pre-training task, model architecture and corpus

5.2.2 Choosing appropriate layers

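Which layers take part in the downstream task?

The selected layers (model1) + the downstream task model (model2)

Different layers of a deep model capture different kinds of knowledge, e.g. part-of-speech tagging, parsing, long-term dependencies, semantic roles and coreference. For RNN-based models, studies show that different layers of a multi-layer LSTM encoder perform differently on different tasks. For Transformer-based models, basic syntactic understanding appears in the shallow layers of the network, whereas high-level semantic understanding appears in the deep layers.

Let $\textbf{H}^{l}$ $(1 \le l \le L)$ denote the representation at the $l$-th layer of the PTM and $g(\cdot)$ the task-specific model. The representation can be chosen in the following ways:

a) Embedding Only

Choose only the pre-trained static embeddings, i.e. $g(\textbf{H}^{1})$.

b) Top Layer

Take the representation of the top layer and feed it into the task-specific model, i.e. $g(\textbf{H}^{L})$.

c) All Layers

Feed in the representations of all layers and let the model automatically pick the most suitable ones, then attach the task-specific model, as in ELMo:

$$\textbf{r}_t = \gamma \sum_{l=1}^{L} \alpha_l \textbf{h}_t^{(l)}$$

where $\textbf{h}_t^{(l)}$ is the representation of position $t$ in $\textbf{H}^{l}$, $\alpha_l$ is the softmax-normalized weight for layer $l$ and $\gamma$ is a scalar to scale the vectors output by the pre-trained model. The mixed representation is fed into the task-specific model as $g(\textbf{r}_t)$.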
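The "All Layers" option amounts to a learned weighted average over layers. A minimal sketch for a single token position follows; the unnormalized `layer_scores` and the default `gamma` are placeholders that would be learned in practice, and the function name is hypothetical.

```python
import math


def scalar_mix(layer_vectors, layer_scores, gamma=1.0):
    """Mix per-layer vectors for one position.

    layer_vectors: list of L vectors, one per layer, all of the same length.
    layer_scores:  list of L unnormalized scores; softmax turns them into alpha_l.
    gamma:         scalar that rescales the mixed vector.
    """
    m = max(layer_scores)
    exps = [math.exp(s - m) for s in layer_scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [
        gamma * sum(a * vec[d] for a, vec in zip(alphas, layer_vectors))
        for d in range(dim)
    ]


# Two layers with equal scores: the result is gamma times their average.
print(scalar_mix([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], gamma=2.0))
```

Option b) corresponds to putting all the weight on the last layer, so this mixing strictly generalizes it.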

5.2.3 To tune or not to tune?

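There are two common ways of model transfer: feature extraction (where the pre-trained parameters are frozen), and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).

The question is whether the parameters of the selected layers (model1) are frozen; model2 always has to be trained.

Open question: does BERT fine-tune only the top layer?

5.3 Fine-Tuning Strategies

Two-stage fine-tuning

The first stage fine-tunes on an intermediate task; the second stage fine-tunes on the target task.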

Multi-task fine-tuning

multi-task learning and pre-training are complementary technologies.

Fine-tuning with extra adaptation modules

The main drawback of fine-tuning is its parameter inefficiency: every downstream task has its own fine-tuned parameters. Therefore, a better solution is to inject some fine-tunable adaptation modules into PTMs while the original parameters are fixed.

Others

self-ensemble, self-distillation, gradual unfreezing, sequential unfreezing

References

https://arxiv.org/pdf/2003.08271v4.pdf

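BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding)

https://arxiv.org/abs/1810.04805

1 Architecture

The overall structure is shown in the figure above; the basic unit is the encoder part of the Transformer. The authors describe it as: BERT’s model architecture is a multi-layer bidirectional Transformer encoder.

2 Input/Output Representations

[CLS] marks the start of the sequence; [SEP] marks the end of a sentence and separates two sentences.

Token Embedding is the word-vector representation, Position Embedding carries the position information, and Segment Embedding indicates whether a token belongs to sentence A or sentence B. The final input vector is the sum of the three. Compared with the original Transformer, there is one extra embedding, the Segment Embedding.

A concrete example: https://www.cnblogs.com/d0main/p/10447853.html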
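The sum of the three embeddings described above can be sketched with toy lookup tables. The table contents, the two-dimensional vectors and the function names are invented for illustration; real models use learned matrices with hundreds of dimensions.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def bert_input_embeddings(token_ids, segment_ids, token_emb, segment_emb, position_emb):
    """Input vector at each position = token + segment + position embedding."""
    return [
        add_vectors(token_emb[tok], segment_emb[seg], position_emb[pos])
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]


# Toy tables (assumed values).
token_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
segment_emb = {0: [0.1, 0.1], 1: [0.2, 0.2]}   # sentence A / sentence B
position_emb = [[0.0, 0.0], [0.5, 0.5]]        # one vector per position

print(bert_input_embeddings([0, 1], [0, 1], token_emb, segment_emb, position_emb))
```

The same token id yields a different input vector when its position or its segment changes, which is how the encoder receives order and sentence-membership information.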

3 Pre-training tasks

1 Masked LM

Standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”. In order to train a deep bidirectional representation, BERT uses MLM.

The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time.

2 Next Sentence Prediction (NSP)

NSP is used in order to train a model that understands sentence relationships.

When choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
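The 80/10/10 rule can be sketched as follows. One simplification to note: this version selects each position independently with probability 0.15 instead of picking exactly 15% of the positions, and the function name and label format (None for positions that are not predicted) are assumptions.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels) for masked language modeling.

    labels[i] is the original token at positions chosen for prediction, else None.
    """
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                        # position not chosen for prediction
        labels[i] = tok                     # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
sentence = ["my", "dog", "is", "hairy"] * 5
inputs, labels = mask_tokens(sentence, ["apple", "dog", "run"], rng)
print(inputs)
print(labels)
```

Keeping some chosen tokens unchanged or randomly replaced means the model cannot rely on [MASK] appearing at every position it has to predict, which matters because [MASK] never appears during fine-tuning.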

4 Fine-tuning BERT

For each task, we simply plug in the task specific inputs and outputs into BERT and finetune all the parameters end-to-end.

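Input: either a sentence pair or a single sentence, depending on the task.

Output: at the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

5 FAQ

1 Why is BERT bidirectional while GPT is unidirectional?

1. Different architectures

BERT uses the Transformer encoder, so encoding a token draws on the tokens on both sides of it; GPT uses the Transformer decoder, which can only draw on the preceding context.

2. Different pre-training tasks

2 Why is BERT's length fixed?

Because BERT is based on the Transformer encoder, tokens at different positions are processed in parallel, so the length has to be fixed in advance and cannot vary.

BERT's input and output length is max_length: longer inputs are truncated and shorter ones are padded. max_length can be at most 512, the number of positions covered by BERT's learned position embeddings.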
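The structural difference in point 1 comes down to which positions each token may attend to. A minimal sketch of the two attention masks, where entry [i][j] is 1 if position i may attend to position j (function names are hypothetical):

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position sees every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i sees only positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for row in causal_mask(4):
    print(row)
```

The causal mask is lower triangular, so a token never sees what comes after it; this matches a left-to-right language modeling objective, while the full mask matches MLM.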
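Truncation and padding to a fixed max_length can be sketched as below. This is a simplification: the helper name, the pad id of 0 and the plain right-truncation are assumptions, and real tokenizers also take care to keep special tokens such as [SEP] when truncating.

```python
MAX_POSITIONS = 512  # upper bound on max_length stated above


def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both of length max_length."""
    if max_length > MAX_POSITIONS:
        raise ValueError("max_length cannot exceed %d" % MAX_POSITIONS)
    ids = list(token_ids[:max_length])              # truncate if too long
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 0 marks padding positions
    return ids + [pad_id] * n_pad, attention_mask


print(pad_or_truncate([101, 7, 8, 102], 6))
print(pad_or_truncate(list(range(10)), 4))
```

The attention mask lets the model ignore padded positions, so padding changes the input length without changing what the real tokens attend to.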

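3 Why does BERT need added position information?

Because tokens are processed in parallel rather than one after another, there is no inherent position information, so a position embedding has to be added.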