ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

A Chinese PTM that incorporates glyph and pinyin information.

1 Model structure

The change is in BERT's input.

Before: char embedding + position embedding + segment embedding -> Now: fusion embedding + position embedding (the segment embedding is omitted).

Char embedding + glyph embedding + pinyin embedding -> fusion embedding (see the sketch below).
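A minimal sketch of the fusion step, assuming the three per-character embeddings are concatenated and projected back to the hidden size by a fully connected layer; module names and the plain embedding tables standing in for the glyph/pinyin encoders are illustrative, not the repository's actual classes.

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """Illustrative ChineseBERT-style fusion of char/glyph/pinyin embeddings."""
    def __init__(self, vocab_size, hidden_size=768):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden_size)
        # In the paper the glyph embedding comes from character images and the
        # pinyin embedding from a small CNN over pinyin letters; simple lookup
        # tables stand in for both here.
        self.glyph_emb = nn.Embedding(vocab_size, hidden_size)
        self.pinyin_emb = nn.Embedding(vocab_size, hidden_size)
        # Concatenate the three views and project back to hidden_size.
        self.fusion = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, input_ids):
        cat = torch.cat(
            [self.char_emb(input_ids), self.glyph_emb(input_ids), self.pinyin_emb(input_ids)],
            dim=-1,
        )
        # Fusion embedding; the position embedding is added afterwards.
        return self.fusion(cat)

emb = FusionEmbedding(vocab_size=21128)
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 768])
```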

2 Pre-training tasks

Whole Word Masking (WWM) and Char Masking (CM)
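A minimal sketch of the difference between the two masking granularities, assuming word segmentation is already available; the helpers are illustrative and omit the 80/10/10 replacement rule.

```python
import random

def char_masking(chars, mask_prob=0.15, mask_token="[MASK]"):
    # Char Masking (CM): each character is masked independently.
    return [mask_token if random.random() < mask_prob else c for c in chars]

def whole_word_masking(words, mask_prob=0.15, mask_token="[MASK]"):
    # Whole Word Masking (WWM): if a word is chosen, all of its characters are masked.
    out = []
    for w in words:
        if random.random() < mask_prob:
            out.extend([mask_token] * len(w))
        else:
            out.extend(list(w))
    return out

# "我 喜欢 猫" segmented by an external CWS tool (assumed given here).
print(char_masking(list("我喜欢猫")))
print(whole_word_masking(["我", "喜欢", "猫"]))
```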

3 Usage

>>> from datasets.bert_dataset import BertDataset
>>> from models.modeling_glycebert import GlyceBertModel

>>> tokenizer = BertDataset([CHINESEBERT_PATH])
>>> chinese_bert = GlyceBertModel.from_pretrained([CHINESEBERT_PATH])
>>> sentence = '我喜欢猫'

>>> input_ids, pinyin_ids = tokenizer.tokenize_sentence(sentence)
>>> length = input_ids.shape[0]
>>> input_ids = input_ids.view(1, length)
>>> pinyin_ids = pinyin_ids.view(1, length, 8)
>>> output_hidden = chinese_bert.forward(input_ids, pinyin_ids)[0]
>>> print(output_hidden)
tensor([[[ 0.0287, -0.0126, 0.0389, ..., 0.0228, -0.0677, -0.1519],
[ 0.0144, -0.2494, -0.1853, ..., 0.0673, 0.0424, -0.1074],
[ 0.0839, -0.2989, -0.2421, ..., 0.0454, -0.1474, -0.1736],
[-0.0499, -0.2983, -0.1604, ..., -0.0550, -0.1863, 0.0226],
[ 0.1428, -0.0682, -0.1310, ..., -0.1126, 0.0440, -0.1782],
[ 0.0287, -0.0126, 0.0389, ..., 0.0228, -0.0677, -0.1519]]],
grad_fn=<NativeLayerNormBackward>)

References

https://github.com/ShannonAI/ChineseBert

https://arxiv.org/pdf/2106.16038.pdf

Fine-tuning

1 Which layers to use for the downstream task

Selected layers (model 1) + downstream task model (model 2).

Different layers of a deep model capture different kinds of knowledge, e.g. part-of-speech tagging, syntactic parsing, long-range dependencies, semantic roles, and coreference. For RNN-based models, studies show that different layers of a multi-layer LSTM encoder perform differently on different tasks. For Transformer-based models, basic syntactic understanding emerges in the shallow layers, while high-level semantic understanding emerges in the deeper layers.

Let $\textbf{H}^{l}$ $(1 \le l \le L)$ denote the representation of the $l$-th layer of the PTM, and $g(\cdot)$ the task-specific model. There are several ways to select the representation:

a) Embedding Only

Choose only the pre-trained static embeddings, i.e. $g(\textbf{H}^{1})$.

b) Top Layer

Take the top layer's representation and attach the task-specific model, i.e. $g(\textbf{H}^{L})$.

c) All Layers

Feed in the representations of all layers, let the model automatically choose the most suitable ones, and then attach the task-specific model, as ELMo does:

$$g\left(\gamma \sum_{l=1}^{L} \alpha_l \textbf{H}^{l}\right)$$

where $\alpha_l$ is the softmax-normalized weight for layer $l$ and $\gamma$ is a scalar that scales the vectors output by the pre-trained model.
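A minimal sketch of this layer-weighted combination (an ELMo-style scalar mix; variable names are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-weighted sum of all layer representations, scaled by gamma."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # logits for alpha
        self.gamma = nn.Parameter(torch.ones(1))               # scaling scalar

    def forward(self, hidden_states):
        # hidden_states: list of L tensors, each (batch, seq_len, hidden)
        alpha = torch.softmax(self.weights, dim=0)
        mixed = sum(a * h for a, h in zip(alpha, hidden_states))
        return self.gamma * mixed

layers = [torch.randn(2, 5, 768) for _ in range(12)]
print(ScalarMix(12)(layers).shape)  # torch.Size([2, 5, 768])
```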

2 Whether to freeze the parameters

There are two common ways to transfer the model: feature extraction (where the pre-trained parameters are frozen) and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).
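A minimal sketch of the two modes with a HuggingFace BERT backbone; it assumes the transformers library and the public bert-base-chinese checkpoint, and the classifier head is illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")
classifier = nn.Linear(bert.config.hidden_size, 2)  # downstream task head

FEATURE_EXTRACTION = True
if FEATURE_EXTRACTION:
    # Feature extraction: freeze every pre-trained parameter, train only the head.
    for p in bert.parameters():
        p.requires_grad = False
    trainable = list(classifier.parameters())
else:
    # Fine-tuning: update the pre-trained parameters together with the head.
    trainable = list(bert.parameters()) + list(classifier.parameters())

optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```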

3 Fine-Tuning Strategies

Two-stage fine-tuning

Stage one fine-tunes on an intermediate task; stage two fine-tunes on the target task.

Multi-task fine-tuning

Multi-task learning and pre-training are complementary technologies.

Fine-tuning with extra adaptation modules

The main drawback of fine-tuning is its parameter inefficiency: every downstream task has its own fine-tuned parameters. Therefore, a better solution is to inject some fine-tunable adaptation modules into PTMs while the original parameters are fixed.
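A minimal sketch of such an adapter (bottleneck down-projection, nonlinearity, up-projection, residual connection); the sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted into a frozen PTM; only this part is trained."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection keeps the frozen model's output as the starting point.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

x = torch.randn(2, 5, 768)
print(Adapter()(x).shape)  # torch.Size([2, 5, 768])
```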

Others

Self-ensemble, self-distillation, gradual unfreezing, sequential unfreezing.
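A minimal sketch of gradual unfreezing with a BERT encoder, assuming the transformers library; the one-layer-per-epoch schedule is illustrative.

```python
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")

# Start with all pre-trained parameters frozen.
for p in bert.parameters():
    p.requires_grad = False

def unfreeze_top_layers(model, n):
    """Unfreeze the top n transformer layers of the encoder."""
    for layer in model.encoder.layer[-n:]:
        for p in layer.parameters():
            p.requires_grad = True

for epoch in range(1, 4):
    unfreeze_top_layers(bert, epoch)  # epoch 1: top layer, epoch 2: top two, ...
    # ... run one training epoch here ...
```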

References

https://arxiv.org/pdf/2003.08271v4.pdf


Pre-Training with Whole Word Masking for Chinese BERT

BERT-wwm-ext

wwm:whole word mask

ext: extended training data is also used (marked with ext in the model name).

Pre-training

1 Changing the masking strategy

Whole Word Masking (wwm)

CWS: Chinese Word Segmentation

The paper compares four masking strategies (see the table in the paper).

References

Pre-Training with Whole Word Masking for Chinese BERT

https://arxiv.org/abs/1906.08101v3

Revisiting Pre-trained Models for Chinese Natural Language Processing

https://arxiv.org/abs/2004.13922

github:https://hub.fastgit.org/ymcui/Chinese-BERT-wwm


ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

There are three main contributions that ALBERT makes over the design choices of BERT:

1 Factorized embedding parameterization

The embedding layer was originally a single matrix $M_{emb}\in\mathbb{R}^{V\times H}$; it is now factorized into two matrices $M_{emb1}\in\mathbb{R}^{V\times E}$ and $M_{emb2}\in\mathbb{R}^{E\times H}$, reducing the parameter count from $VH$ to $VE+EH$ (this parameter reduction is significant when $H \gg E$).
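A quick count to show the reduction, using the ALBERT-base sizes ($V=30000$, $E=128$, $H=768$):

```python
V, E, H = 30000, 128, 768

bert_style   = V * H          # one V x H embedding matrix
albert_style = V * E + E * H  # factorized: V x E followed by E x H

print(bert_style)    # 23040000
print(albert_style)  # 3938304
```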

2 Cross-layer parameter sharing

The default decision for ALBERT is to share all parameters across layers (attention and FFN).
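A minimal sketch of cross-layer sharing: one transformer layer is instantiated and applied L times. It uses PyTorch's built-in encoder layer purely for illustration (needs a reasonably recent PyTorch for batch_first).

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One set of layer parameters reused for all num_layers passes (ALBERT-style)."""
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # same attention/FFN weights at every depth
        return x

print(SharedEncoder()(torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5, 768])
```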

3 Inter-sentence coherence loss

The original NSP is replaced with SOP (sentence-order prediction). Positive examples are constructed exactly as in NSP, but negative examples are the same two consecutive sentences with their order swapped.
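A minimal sketch of how SOP training pairs could be built from consecutive sentences; this illustrative helper is not ALBERT's actual data pipeline.

```python
import random

def make_sop_example(sent_a, sent_b):
    """sent_a and sent_b are consecutive sentences from the same document."""
    if random.random() < 0.5:
        return (sent_a, sent_b), 1   # positive: original order (same as NSP positives)
    else:
        return (sent_b, sent_a), 0   # negative: the same two sentences, order swapped

print(make_sop_example("我喜欢猫", "它们很可爱"))
```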

References

https://zhuanlan.zhihu.com/p/88099919

https://blog.csdn.net/weixin_37947156/article/details/101529943

https://openreview.net/pdf?id=H1eA7AEtvS


Word-granularity Chinese BERT

1 Is Word Segmentation Necessary for Deep Learning of Chinese Representations?

We find that char-based (character-granularity) models consistently outperform word-based (word-granularity) models.

We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting.

2 Tencent's Chinese word-level model

The word-level model performs worse than the char-level model on public datasets.

References

https://arxiv.org/pdf/1905.05526.pdf

https://www.jiqizhixin.com/articles/2019-06-27-17


RoBERTa: A Robustly Optimized BERT Pretraining Approach

1. Comparison with BERT

Architecturally there is no difference from the original BERT; the main changes are the following:

2. Analysis of the changes

2.1 Static vs. Dynamic Masking

Static masking: the original BERT masks the data ahead of time, when the pretraining data is created.

Dynamic masking: random masking is applied each time a training example is fed to the model.
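A minimal sketch of the difference: static masking bakes the mask into the dataset once, while dynamic masking re-draws it at every fetch. The helpers are illustrative (character-level, without the 80/10/10 replacement rule).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    return [mask_token if random.random() < mask_prob else t for t in tokens]

corpus = [list("我喜欢猫"), list("今天天气不错")]

# Static masking: mask once when the pretraining data is created.
static_data = [mask_tokens(toks) for toks in corpus]

# Dynamic masking: mask at fetch time, so every epoch sees a different pattern.
def get_example(i):
    return mask_tokens(corpus[i])

print(static_data[0], static_data[0])   # identical every time
print(get_example(0), get_example(0))   # usually different
```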

Result comparison: see the table in the paper.

Conclusion: dynamic masking comes out ahead.

2.2 Model Input Format and Next Sentence Prediction

Comparison experiments were run; the results (table in the paper) show:

Conclusions:

Model Input Format:

1. Using individual sentences hurts performance on downstream tasks.

Next Sentence Prediction:

1. Removing the NSP loss matches or slightly improves downstream task performance.

2.3 Training with large batches

2.4 Text Encoding

Uses byte-level BPE (BBPE) instead of WordPiece.

3 Common questions

1 Why doesn't the roberta tokenizer have token_type_ids?
RoBERTa removed NSP, so it needs no segment embedding and therefore no token_type_ids. In practice, however, the Chinese checkpoints do have token_type_ids while the English ones do not. Without token_type_ids, how are two sentences distinguished? The separator (sep token) is still there; only the segment embedding is gone.
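A quick check with HuggingFace transformers, assuming the public checkpoints hfl/chinese-roberta-wwm-ext and roberta-base can be downloaded:

```python
from transformers import AutoTokenizer

# The Chinese RoBERTa-wwm checkpoints are used through BERT tokenizer/model classes,
# so they still produce token_type_ids; the English RoBERTa tokenizer does not.
zh_tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
en_tok = AutoTokenizer.from_pretrained("roberta-base")

zh = zh_tok("我喜欢猫", "它们很可爱")
en = en_tok("I like cats", "They are cute")

print("token_type_ids" in zh)  # True
print("token_type_ids" in en)  # False
```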

2 Pitfalls in practice

https://blog.csdn.net/zwqjoy/article/details/107533184

https://hub.fastgit.org/ymcui/Chinese-BERT-wwm

References

https://zhuanlan.zhihu.com/p/103205929

https://zhuanlan.zhihu.com/p/143064748

https://blog.csdn.net/zwqjoy/article/details/107533184

https://hub.fastgit.org/ymcui/Chinese-BERT-wwm


