ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

A Chinese PTM that incorporates glyph and pinyin information.

1 Model structure

The change is in BERT's input.

Before: char embedding + position embedding + segment embedding -> Now: fusion embedding + position embedding (the segment embedding is omitted).

Char embedding + glyph embedding + pinyin embedding -> fusion embedding (see the sketch below).
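A minimal sketch of the fusion step, assuming the three per-character embeddings are concatenated and projected back to the hidden size by a fully connected layer; module names and the plain embedding tables standing in for the glyph/pinyin encoders are illustrative, not the repository's actual classes.

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """Illustrative ChineseBERT-style fusion of char/glyph/pinyin embeddings."""
    def __init__(self, vocab_size, hidden_size=768):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden_size)
        # In the paper the glyph embedding comes from character images and the
        # pinyin embedding from a small CNN over pinyin letters; simple lookup
        # tables stand in for both here.
        self.glyph_emb = nn.Embedding(vocab_size, hidden_size)
        self.pinyin_emb = nn.Embedding(vocab_size, hidden_size)
        # Concatenate the three views and project back to hidden_size.
        self.fusion = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, input_ids):
        cat = torch.cat(
            [self.char_emb(input_ids), self.glyph_emb(input_ids), self.pinyin_emb(input_ids)],
            dim=-1,
        )
        # Fusion embedding; the position embedding is added afterwards.
        return self.fusion(cat)

emb = FusionEmbedding(vocab_size=21128)
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 768])
```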

2 Pre-training tasks

Whole Word Masking (WWM) and Char Masking (CM)
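A minimal sketch of the difference between the two masking granularities, assuming word segmentation is already available; the helpers are illustrative and omit the 80/10/10 replacement rule.

```python
import random

def char_masking(chars, mask_prob=0.15, mask_token="[MASK]"):
    # Char Masking (CM): each character is masked independently.
    return [mask_token if random.random() < mask_prob else c for c in chars]

def whole_word_masking(words, mask_prob=0.15, mask_token="[MASK]"):
    # Whole Word Masking (WWM): if a word is chosen, all of its characters are masked.
    out = []
    for w in words:
        if random.random() < mask_prob:
            out.extend([mask_token] * len(w))
        else:
            out.extend(list(w))
    return out

# "我 喜欢 猫" segmented by an external CWS tool (assumed given here).
print(char_masking(list("我喜欢猫")))
print(whole_word_masking(["我", "喜欢", "猫"]))
```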

3 Usage

>>> from datasets.bert_dataset import BertDataset
>>> from models.modeling_glycebert import GlyceBertModel

>>> tokenizer = BertDataset([CHINESEBERT_PATH])
>>> chinese_bert = GlyceBertModel.from_pretrained([CHINESEBERT_PATH])
>>> sentence = '我喜欢猫'

>>> input_ids, pinyin_ids = tokenizer.tokenize_sentence(sentence)
>>> length = input_ids.shape[0]
>>> input_ids = input_ids.view(1, length)
>>> pinyin_ids = pinyin_ids.view(1, length, 8)
>>> output_hidden = chinese_bert.forward(input_ids, pinyin_ids)[0]
>>> print(output_hidden)
tensor([[[ 0.0287, -0.0126, 0.0389, ..., 0.0228, -0.0677, -0.1519],
[ 0.0144, -0.2494, -0.1853, ..., 0.0673, 0.0424, -0.1074],
[ 0.0839, -0.2989, -0.2421, ..., 0.0454, -0.1474, -0.1736],
[-0.0499, -0.2983, -0.1604, ..., -0.0550, -0.1863, 0.0226],
[ 0.1428, -0.0682, -0.1310, ..., -0.1126, 0.0440, -0.1782],
[ 0.0287, -0.0126, 0.0389, ..., 0.0228, -0.0677, -0.1519]]],
grad_fn=<NativeLayerNormBackward>)

References

https://github.com/ShannonAI/ChineseBert

https://arxiv.org/pdf/2106.16038.pdf

Fine-tuning

1 Which layers to use for the downstream task

Selected layers (model 1) + downstream task model (model 2).

Different layers of a deep model capture different kinds of knowledge, e.g. part-of-speech tagging, syntactic parsing, long-range dependencies, semantic roles, and coreference. For RNN-based models, studies show that different layers of a multi-layer LSTM encoder perform differently on different tasks. For Transformer-based models, basic syntactic understanding emerges in the shallow layers, while high-level semantic understanding emerges in the deeper layers.

Let $\textbf{H}^{l}$ $(1 \le l \le L)$ denote the representation of the $l$-th layer of the PTM, and $g(\cdot)$ the task-specific model. There are several ways to select the representation:

a) Embedding Only

Choose only the pre-trained static embeddings, i.e. $g(\textbf{H}^{1})$.

b) Top Layer

Take the top layer's representation and attach the task-specific model, i.e. $g(\textbf{H}^{L})$.

c) All Layers

Feed in the representations of all layers, let the model automatically choose the most suitable ones, and then attach the task-specific model, as ELMo does:

$$g\left(\gamma \sum_{l=1}^{L} \alpha_l \textbf{H}^{l}\right)$$

where $\alpha_l$ is the softmax-normalized weight for layer $l$ and $\gamma$ is a scalar that scales the vectors output by the pre-trained model.
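A minimal sketch of this layer-weighted combination (an ELMo-style scalar mix; variable names are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-weighted sum of all layer representations, scaled by gamma."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # logits for alpha
        self.gamma = nn.Parameter(torch.ones(1))               # scaling scalar

    def forward(self, hidden_states):
        # hidden_states: list of L tensors, each (batch, seq_len, hidden)
        alpha = torch.softmax(self.weights, dim=0)
        mixed = sum(a * h for a, h in zip(alpha, hidden_states))
        return self.gamma * mixed

layers = [torch.randn(2, 5, 768) for _ in range(12)]
print(ScalarMix(12)(layers).shape)  # torch.Size([2, 5, 768])
```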

2 Whether to freeze the parameters

There are two common ways to transfer the model: feature extraction (where the pre-trained parameters are frozen) and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).
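A minimal sketch of the two modes with a HuggingFace BERT backbone; it assumes the transformers library and the public bert-base-chinese checkpoint, and the classifier head is illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")
classifier = nn.Linear(bert.config.hidden_size, 2)  # downstream task head

FEATURE_EXTRACTION = True
if FEATURE_EXTRACTION:
    # Feature extraction: freeze every pre-trained parameter, train only the head.
    for p in bert.parameters():
        p.requires_grad = False
    trainable = list(classifier.parameters())
else:
    # Fine-tuning: update the pre-trained parameters together with the head.
    trainable = list(bert.parameters()) + list(classifier.parameters())

optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```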

3 Fine-Tuning Strategies

Two-stage fine-tuning

Stage one fine-tunes on an intermediate task; stage two fine-tunes on the target task.

Multi-task fine-tuning

Multi-task learning and pre-training are complementary technologies.

Fine-tuning with extra adaptation modules

The main drawback of fine-tuning is its parameter inefficiency: every downstream task has its own fine-tuned parameters. Therefore, a better solution is to inject some fine-tunable adaptation modules into PTMs while the original parameters are fixed.
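A minimal sketch of such an adapter (bottleneck down-projection, nonlinearity, up-projection, residual connection); the sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted into a frozen PTM; only this part is trained."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection keeps the frozen model's output as the starting point.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

x = torch.randn(2, 5, 768)
print(Adapter()(x).shape)  # torch.Size([2, 5, 768])
```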

Others

Self-ensemble, self-distillation, gradual unfreezing, sequential unfreezing.
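A minimal sketch of gradual unfreezing with a BERT encoder, assuming the transformers library; the one-layer-per-epoch schedule is illustrative.

```python
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")

# Start with all pre-trained parameters frozen.
for p in bert.parameters():
    p.requires_grad = False

def unfreeze_top_layers(model, n):
    """Unfreeze the top n transformer layers of the encoder."""
    for layer in model.encoder.layer[-n:]:
        for p in layer.parameters():
            p.requires_grad = True

for epoch in range(1, 4):
    unfreeze_top_layers(bert, epoch)  # epoch 1: top layer, epoch 2: top two, ...
    # ... run one training epoch here ...
```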

References

https://arxiv.org/pdf/2003.08271v4.pdf


Pre-Training with Whole Word Masking for Chinese BERT

BERT-wwm-ext

wwm:whole word mask

ext: extended training data is also used (marked with ext in the model name).

Pre-training

1 Changing the masking strategy

Whole Word Masking (wwm)

CWS: Chinese Word Segmentation

The paper compares four masking strategies (see the table in the paper).

References

Pre-Training with Whole Word Masking for Chinese BERT

https://arxiv.org/abs/1906.08101v3

Revisiting Pre-trained Models for Chinese Natural Language Processing

https://arxiv.org/abs/2004.13922

github:https://hub.fastgit.org/ymcui/Chinese-BERT-wwm


ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

There are three main contributions that ALBERT makes over the design choices of BERT:

1 Factorized embedding parameterization

The embedding layer was originally a single matrix $M_{emb}\in\mathbb{R}^{V\times H}$; it is now factorized into two matrices $M_{emb1}\in\mathbb{R}^{V\times E}$ and $M_{emb2}\in\mathbb{R}^{E\times H}$, reducing the parameter count from $VH$ to $VE+EH$ (this parameter reduction is significant when $H \gg E$).
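A quick count to show the reduction, using the ALBERT-base sizes ($V=30000$, $E=128$, $H=768$):

```python
V, E, H = 30000, 128, 768

bert_style   = V * H          # one V x H embedding matrix
albert_style = V * E + E * H  # factorized: V x E followed by E x H

print(bert_style)    # 23040000
print(albert_style)  # 3938304
```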

2 Cross-layer parameter sharing

The default decision for ALBERT is to share all parameters across layers (attention and FFN).
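A minimal sketch of cross-layer sharing: one transformer layer is instantiated and applied L times. It uses PyTorch's built-in encoder layer purely for illustration (needs a reasonably recent PyTorch for batch_first).

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One set of layer parameters reused for all num_layers passes (ALBERT-style)."""
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # same attention/FFN weights at every depth
        return x

print(SharedEncoder()(torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5, 768])
```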

3 Inter-sentence coherence loss

The original NSP is replaced with SOP (sentence-order prediction). Positive examples are constructed exactly as in NSP, but negative examples are the same two consecutive sentences with their order swapped.
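A minimal sketch of how SOP training pairs could be built from consecutive sentences; this illustrative helper is not ALBERT's actual data pipeline.

```python
import random

def make_sop_example(sent_a, sent_b):
    """sent_a and sent_b are consecutive sentences from the same document."""
    if random.random() < 0.5:
        return (sent_a, sent_b), 1   # positive: original order (same as NSP positives)
    else:
        return (sent_b, sent_a), 0   # negative: the same two sentences, order swapped

print(make_sop_example("我喜欢猫", "它们很可爱"))
```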

References

https://zhuanlan.zhihu.com/p/88099919

https://blog.csdn.net/weixin_37947156/article/details/101529943

https://openreview.net/pdf?id=H1eA7AEtvS


Word-granularity Chinese BERT

1 Is Word Segmentation Necessary for Deep Learning of Chinese Representations?

We find that char-based (character-granularity) models consistently outperform word-based (word-granularity) models.

We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting.

2 Tencent's Chinese word-level model

The word-level model performs worse than the char-level model on public datasets.

References

https://arxiv.org/pdf/1905.05526.pdf

https://www.jiqizhixin.com/articles/2019-06-27-17


RoBERTa: A Robustly Optimized BERT Pretraining Approach

1. Comparison with BERT

Architecturally there is no difference from the original BERT; the main changes are the following:

2. Analysis of the changes

2.1 Static vs. Dynamic Masking

Static masking: the original BERT masks the data ahead of time, when the pretraining data is created.

Dynamic masking: random masking is applied each time a training example is fed to the model.
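A minimal sketch of the difference: static masking bakes the mask into the dataset once, while dynamic masking re-draws it at every fetch. The helpers are illustrative (character-level, without the 80/10/10 replacement rule).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    return [mask_token if random.random() < mask_prob else t for t in tokens]

corpus = [list("我喜欢猫"), list("今天天气不错")]

# Static masking: mask once when the pretraining data is created.
static_data = [mask_tokens(toks) for toks in corpus]

# Dynamic masking: mask at fetch time, so every epoch sees a different pattern.
def get_example(i):
    return mask_tokens(corpus[i])

print(static_data[0], static_data[0])   # identical every time
print(get_example(0), get_example(0))   # usually different
```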

Result comparison: see the table in the paper.

Conclusion: dynamic masking comes out ahead.

2.2 Model Input Format and Next Sentence Prediction

Comparison experiments were run; the results (table in the paper) show:

Conclusions:

Model Input Format:

1. Using individual sentences hurts performance on downstream tasks.

Next Sentence Prediction:

1. Removing the NSP loss matches or slightly improves downstream task performance.

2.3 Training with large batches

2.4 Text Encoding

Uses byte-level BPE (BBPE) instead of WordPiece.

3 Common questions

1 Why doesn't the roberta tokenizer have token_type_ids?
RoBERTa removed NSP, so it needs no segment embedding and therefore no token_type_ids. In practice, however, the Chinese checkpoints do have token_type_ids while the English ones do not. Without token_type_ids, how are two sentences distinguished? The separator (sep token) is still there; only the segment embedding is gone.
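A quick check with HuggingFace transformers, assuming the public checkpoints hfl/chinese-roberta-wwm-ext and roberta-base can be downloaded:

```python
from transformers import AutoTokenizer

# The Chinese RoBERTa-wwm checkpoints are used through BERT tokenizer/model classes,
# so they still produce token_type_ids; the English RoBERTa tokenizer does not.
zh_tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
en_tok = AutoTokenizer.from_pretrained("roberta-base")

zh = zh_tok("我喜欢猫", "它们很可爱")
en = en_tok("I like cats", "They are cute")

print("token_type_ids" in zh)  # True
print("token_type_ids" in en)  # False
```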

2 Pitfalls in practice

https://blog.csdn.net/zwqjoy/article/details/107533184

https://hub.fastgit.org/ymcui/Chinese-BERT-wwm

References

https://zhuanlan.zhihu.com/p/103205929

https://zhuanlan.zhihu.com/p/143064748

https://blog.csdn.net/zwqjoy/article/details/107533184

https://hub.fastgit.org/ymcui/Chinese-BERT-wwm


