BERT Series: Sentence Embedding Generation

https://zhuanlan.zhihu.com/p/444346578

1 Sentence-BERT

STS task. The data falls into three groups: unlabeled STS data, labeled STS data, and labeled NLI data.

The unsupervised and supervised settings use the same objectives (the paper describes three losses); what differs is the training data.

Unsupervised setting: labeled NLI data only; supervised setting: labeled STS data.
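
A minimal sketch of the NLI classification objective (softmax over $u$, $v$, $|u-v|$), assuming the pooled sentence embeddings $u$ and $v$ have already been computed; the class name, hidden size, and label count are illustrative, not taken from the paper's code.

```python
import torch
from torch import nn

class SoftmaxObjective(nn.Module):
    """Sentence-BERT-style classification head for NLI: softmax over (u, v, |u - v|)."""
    def __init__(self, dim, num_labels=3):
        super().__init__()
        self.classifier = nn.Linear(3 * dim, num_labels)

    def forward(self, u, v):
        # u, v: pooled (e.g. mean-pooled) sentence embeddings, shape (batch, dim)
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)  # logits for entailment / neutral / contradiction
```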

2 SimCSE

STS task. The data: unlabeled STS data and labeled STS data.

The unsupervised and supervised variants differ in how the training samples are constructed.

Unsupervised: positive and negative pairs come from data augmentation of the unlabeled STS data (two dropout-masked passes of the same sentence form a positive pair); supervised: positive and negative pairs come from the labeled data.
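
A minimal sketch of the in-batch NT-Xent objective shared by both variants; in the unsupervised case `z1` and `z2` would come from two dropout-masked forward passes of the same sentences. The function name and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.05):
    """In-batch contrastive loss: z1[i] and z2[i] form the positive pair."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature                  # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)              # positives lie on the diagonal
```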

3 ConSERT

STS task. The data: unlabeled STS data, labeled STS data, plus the NLI dataset (labeled).

Similarities

Same as SimCSE: both introduce a contrastive objective during fine-tuning.

Differences

1 Unsupervised

Same loss as SimCSE (NT-Xent); the difference is how the unlabeled STS data is augmented.

2 Supervised

They differ in both the loss and the data source.

SimCSE: the loss is NT-Xent and the data source is labeled STS data.

ConSERT: the loss is NT-Xent + an additional supervised loss (e.g. cross entropy); the data sources are unlabeled STS data and the labeled NLI dataset. Here "+" means fusion, and the paper proposes three fusion schemes (see the sketch below).
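
A minimal sketch of the "joint" fusion scheme, i.e. a weighted sum of the two objectives; the function name and the weighting factor are assumptions, and the paper's other two schemes run the objectives in separate training stages instead.

```python
def consert_joint_loss(cross_entropy_loss, nt_xent_loss, lam=1.0):
    """'Joint' fusion: optimize the supervised and contrastive losses together.

    cross_entropy_loss: supervised loss on the labeled NLI batch
    nt_xent_loss: contrastive loss on the augmented views
    lam: hypothetical weighting factor (not a value from the paper)
    """
    return cross_entropy_loss + lam * nt_xent_loss
```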

Enhanced-RCNN: An Efficient Method for Learning Sentence Similarity

Key traits: not a pre-trained model, and relatively few parameters.

1 Input Encoding

Two encodings are produced: an RNN encoding and an RCNN encoding.

1 BiGRU

$\textbf{a}=\{a_1,a_2,\dots,a_{l_a}\}$, where $\textbf{a}$ is sentence 1 and $l_a$ is its length.

This yields the RNN encoding; $\overline{\textbf{p}}_i$ denotes either $\overline{\textbf{a}}_i$ or $\overline{\textbf{b}}_i$ (see the sketch below).
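
A minimal sketch of the BiGRU step, assuming the word embeddings have already been looked up; the hidden size and names are illustrative.

```python
import torch
from torch import nn

class RnnEncoder(nn.Module):
    """BiGRU over word embeddings; forward/backward states are concatenated per position."""
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, x):            # x: (batch, seq_len, emb_dim)
        out, _ = self.bigru(x)       # out: (batch, seq_len, 2 * hidden_dim) -> p_bar
        return out
```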

2 CNN

On top of the BiGRU encoding, a CNN performs a second round of encoding.

The structure (figure in the paper) follows the "network in network" idea; k is the kernel size of the convolution, e.g. k=1 means a $1 \times 1$ kernel.

For each CNN unit, the computation follows the equations in the paper.

This yields the RCNN encoding $\widetilde{\textbf{p}}_i$.
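
A minimal sketch of one CNN unit applied on top of the BiGRU output; the exact "network in network" design (number of branches, kernel sizes, pooling) is specified in the paper, so the choices below are assumptions.

```python
import torch
from torch import nn

class CnnUnit(nn.Module):
    """1x1 ("network in network") convolution followed by a k-sized convolution."""
    def __init__(self, in_dim, out_dim, k=3):
        super().__init__()
        self.reduce = nn.Conv1d(in_dim, out_dim, kernel_size=1)               # 1 x 1 conv
        self.conv = nn.Conv1d(out_dim, out_dim, kernel_size=k, padding=k // 2)

    def forward(self, p_bar):                 # p_bar: (batch, seq_len, in_dim)
        h = p_bar.transpose(1, 2)             # -> (batch, in_dim, seq_len)
        h = torch.relu(self.reduce(h))
        h = torch.relu(self.conv(h))          # RCNN encoding p_tilde
        return h.transpose(1, 2)              # -> (batch, seq_len, out_dim)
```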

2 Interactive Sentence Representation

1 Soft-attention Alignment

Attention weights (equation in the paper).

Attention-weighted RNN encoding (equation in the paper).

2 Interaction Modeling

$\overline{\textbf{p}}$ is the RNN encoding,

$\hat{\textbf{p}}$ is the attention-weighted RNN encoding,

$\widetilde{\textbf{p}}$ is the RCNN encoding.

Combining them gives the interactive sentence representations $\textbf{o}_a$ and $\textbf{o}_b$.
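
A sketch of ESIM-style soft alignment plus one simple way to compose the three encodings into $\textbf{o}_a$, $\textbf{o}_b$; the paper's exact attention form and composition may differ, so treat the feature concatenation below as an assumption.

```python
import torch
import torch.nn.functional as F

def soft_align(a_bar, b_bar):
    """Soft-attention alignment between the two RNN encodings (ESIM-style sketch)."""
    e = torch.bmm(a_bar, b_bar.transpose(1, 2))            # (batch, la, lb) scores
    a_hat = torch.bmm(F.softmax(e, dim=2), b_bar)          # b aligned to each a_i
    b_hat = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), a_bar)
    return a_hat, b_hat

def interact(p_bar, p_hat, p_tilde):
    """One simple composition of the RNN, attention-weighted, and RCNN encodings."""
    return torch.cat([p_bar, p_hat, p_bar - p_hat, p_bar * p_hat, p_tilde], dim=-1)
```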

3 Similarity Modeling

1 Fusion Layer

g is a gating function (a generic sketch is given below).
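
A generic gated-fusion sketch; the paper's exact gating g may differ, and the dimensions and names here are assumptions.

```python
import torch
from torch import nn

class GatedFusion(nn.Module):
    """Blend two representations with a learned sigmoid gate g."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x, y):
        g = torch.sigmoid(self.gate(torch.cat([x, y], dim=-1)))
        return g * x + (1 - g) * y
```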

2 Label Prediction

A fully connected layer produces the prediction.

4 Loss

Cross entropy.

References

https://sci-hub.st/10.1145/3366423.3379998

https://zhuanlan.zhihu.com/p/138061003

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

0 Difference from pre-train + fine-tune

Prompting feels like a special kind of fine-tuning: you still pre-train first, then do prompt tuning.

Goal: prompting narrows the gap between pre-training and fine-tuning.

1 How it works

Three steps.

1 Prompt Addition

$x' = f_{prompt}(x)$, where x is the input text.

  1. Apply a template, which is a textual string that has two slots: an input slot [X] for input x and an answer slot
    [Z] for an intermediate generated answer text z that will later be mapped into y.
  2. Fill slot [X] with the input text x.

2 Answer Search

f: fills the slot [Z] in the prompt $x'$ with a potential answer z.

Z: the set of permissible values for z.

3 Answer Mapping

Because the $\hat{z}$ above is not yet $\hat{y}$. For example, in sentiment analysis, "excellent", "fabulous", "wonderful" → positive.

go from the highest-scoring answer $\hat{z}$ to the highest-scoring output $\hat{y}$

4 Example: text sentiment classification

Standard fine-tuning:

"I love this movie." → positive

With prompting:

1 $x=$ "I love this movie." → template: "[X] Overall, it was a [Z] movie." → $x'$ = "I love this movie. Overall, it was a [Z] movie."

2 Next comes answer search: the LM looks for the text $\hat{z}$ that scores highest when filled in at [Z] (e.g. "excellent", "great", "wonderful").

3 Finally, answer mapping. The text the LM fills in is often not the final form the task needs (the task label is positive, while the LM produced "excellent", "great", "wonderful"), so it is mapped to the final output $\hat{y}$.
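
A minimal sketch of the three steps using the Hugging Face fill-mask pipeline; the model name, template, and answer-to-label mapping are illustrative choices, not prescribed by the survey.

```python
from transformers import pipeline

# Step 1: prompt addition -- wrap the input in a template with a [MASK] slot
x = "I love this movie."
prompt = f"{x} Overall, it was a [MASK] movie."

# Step 2: answer search -- let a masked LM score candidate fillers for the slot
unmasker = pipeline("fill-mask", model="bert-base-uncased")
candidates = unmasker(prompt, top_k=5)          # list of dicts with "token_str", "score", ...

# Step 3: answer mapping -- map the highest-scoring answer word to a task label
answer_to_label = {"great": "positive", "good": "positive", "wonderful": "positive",
                   "bad": "negative", "terrible": "negative"}
best = next((c["token_str"] for c in candidates if c["token_str"] in answer_to_label), None)
print(best, "->", answer_to_label.get(best, "unknown"))
```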

2 Taxonomy of prompting methods

3 Prompt Engineering

1 One must first consider the prompt shape;

2 then decide whether to take a manual or automated approach to create prompts of the desired shape.

1 Prompt Shape

Prompt shape mainly refers to the position and number of the [X] and [Z] slots.

If [Z] sits in the middle of the text, the prompt is usually called a cloze prompt; if it sits at the end, it is called a prefix prompt.

Which one to use in practice depends mainly on the task form and the model type. Cloze prompts closely match how Masked Language Models are trained, so they suit tasks solved with an MLM; prefix prompts suit generation tasks, or tasks solved with an autoregressive LM; full-text-reconstruction models are more general, so either kind of prompt works. Also, for text-pair classification, the prompt template usually reserves two input slots, [X1] and [X2].

2 create prompts

1 Manual Template Engineering

2 Automated Template Learning

1 Discrete Prompts

The prompt operates on the actual text (discrete tokens).

D1: Prompt Mining

D2: Prompt Paraphrasing

D3: Gradient-based Search

D4: Prompt Generation

D5: Prompt Scoring

2 Continuous Prompts

The prompt operates directly in the model's embedding space.

C1: Prefix Tuning

C2: Tuning Initialized with Discrete Prompts

C3: Hard-Soft Prompt Hybrid Tuning

4 Answer Engineering

Two dimensions must be considered when performing answer engineering: (1) deciding the answer shape, and (2) choosing an answer design method.

1 Answer Shape

How does this differ from prompt shape? Prompt shape is about where the slots sit in the template; answer shape is about the granularity of the answer z (a token, a span, or a full sentence).

2 Answer Space Design Methods

1 Manual design
2 Automatic search (discrete or continuous answer search)

5 Multi-Prompt Learning

The discussion so far assumed a single prompt; this section covers methods that use multiple prompts.

6 Training Strategies for Prompting Methods

1 Training Settings

full-data

few-shot / zero-shot

2 Parameter Update Methods

References

https://arxiv.org/abs/2107.13586

Dr. Pengfei Liu (刘鹏飞): https://zhuanlan.zhihu.com/p/395115779

https://zhuanlan.zhihu.com/p/399295895

https://zhuanlan.zhihu.com/p/440169921


  

Generalizing from a Few Examples: A Survey on Few-Shot Learning

paper: https://arxiv.org/abs/1904.05046

git: https://github.com/tata1661/FSL-Mate/tree/master/FewShotPapers#Applications

The survey organizes FSL work by application; the NLP-related papers include:

  1. High-risk learning: Acquiring new word vectors from tiny data, in EMNLP, 2017. A. Herbelot and M. Baroni. paper
  2. MetaEXP: Interactive explanation and exploration of large knowledge graphs, in TheWebConf, 2018. F. Behrens, S. Bischoff, P. Ladenburger, J. Rückin, L. Seidel, F. Stolp, M. Vaichenker, A. Ziegler, D. Mottin, F. Aghaei, E. Müller, M. Preusse, N. Müller, and M. Hunger. paper code
  3. Few-shot representation learning for out-of-vocabulary words, in ACL, 2019. Z. Hu, T. Chen, K.-W. Chang, and Y. Sun. paper
  4. Learning to customize model structures for few-shot dialogue generation tasks, in ACL, 2020. Y. Song, Z. Liu, W. Bi, R. Yan, and M. Zhang. paper
  5. Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network, in ACL, 2020. Y. Hou, W. Che, Y. Lai, Z. Zhou, Y. Liu, H. Liu, and T. Liu. paper
  6. Meta-reinforced multi-domain state generator for dialogue systems, in ACL, 2020. Y. Huang, J. Feng, M. Hu, X. Wu, X. Du, and S. Ma. paper
  7. Few-shot knowledge graph completion, in AAAI, 2020. C. Zhang, H. Yao, C. Huang, M. Jiang, Z. Li, and N. V. Chawla. paper
  8. Universal natural language processing with limited annotations: Try few-shot textual entailment as a start, in EMNLP, 2020. W. Yin, N. F. Rajani, D. Radev, R. Socher, and C. Xiong. paper code
  9. Simple and effective few-shot named entity recognition with structured nearest neighbor learning, in EMNLP, 2020. Y. Yang, and A. Katiyar. paper code
  10. Discriminative nearest neighbor few-shot intent detection by transferring natural language inference, in EMNLP, 2020. J. Zhang, K. Hashimoto, W. Liu, C. Wu, Y. Wan, P. Yu, R. Socher, and C. Xiong. paper code
  11. Few-shot learning for opinion summarization, in EMNLP, 2020. A. Bražinskas, M. Lapata, and I. Titov. paper code
  12. Adaptive attentional network for few-shot knowledge graph completion, in EMNLP, 2020. J. Sheng, S. Guo, Z. Chen, J. Yue, L. Wang, T. Liu, and H. Xu. paper code
  13. Few-shot complex knowledge base question answering via meta reinforcement learning, in EMNLP, 2020. Y. Hua, Y. Li, G. Haffari, G. Qi, and T. Wu. paper code
  14. Self-supervised meta-learning for few-shot natural language classification tasks, in EMNLP, 2020. T. Bansal, R. Jha, T. Munkhdalai, and A. McCallum. paper code
  15. Uncertainty-aware self-training for few-shot text classification, in NeurIPS, 2020. S. Mukherjee, and A. Awadallah. paper code
  16. Learning to extrapolate knowledge: Transductive few-shot out-of-graph link prediction, in NeurIPS, 2020:. J. Baek, D. B. Lee, and S. J. Hwang. paper code
  17. MetaNER: Named entity recognition with meta-learning, in TheWebConf, 2020. J. Li, S. Shang, and L. Shao. paper
  18. Conditionally adaptive multi-task learning: Improving transfer learning in NLP using fewer parameters & less data, in ICLR, 2021. J. Pilault, A. E. hattami, and C. Pal. paper code
  19. Revisiting few-sample BERT fine-tuning, in ICLR, 2021. T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi. paper code
  20. Few-shot conversational dense retrieval, in SIGIR, 2021. S. Yu, Z. Liu, C. Xiong, T. Feng, and Z. Liu. paper code
  21. Relational learning with gated and attentive neighbor aggregator for few-shot knowledge graph completion, in SIGIR, 2021. G. Niu, Y. Li, C. Tang, R. Geng, J. Dai, Q. Liu, H. Wang, J. Sun, F. Huang, and L. Si. paper
  22. Few-shot language coordination by modeling theory of mind, in ICML, 2021. H. Zhu, G. Neubig, and Y. Bisk. paper code
  23. Graph-evolving meta-learning for low-resource medical dialogue generation, in AAAI, 2021. S. Lin, P. Zhou, X. Liang, J. Tang, R. Zhao, Z. Chen, and L. Lin. paper
  24. KEML: A knowledge-enriched meta-learning framework for lexical relation classification, in AAAI, 2021. C. Wang, M. Qiu, J. Huang, and X. He. paper
  25. Few-shot learning for multi-label intent detection, in AAAI, 2021. Y. Hou, Y. Lai, Y. Wu, W. Che, and T. Liu. paper code
  26. SALNet: Semi-supervised few-shot text classification with attention-based lexicon construction, in AAAI, 2021. J.-H. Lee, S.-K. Ko, and Y.-S. Han. paper
  27. Learning from my friends: Few-shot personalized conversation systems via social networks, in AAAI, 2021. Z. Tian, W. Bi, Z. Zhang, D. Lee, Y. Song, and N. L. Zhang. paper code
  28. Relative and absolute location embedding for few-shot node classification on graph, in AAAI, 2021. Z. Liu, Y. Fang, C. Liu, and S. C.H. Hoi. paper
  29. Few-shot question answering by pretraining span selection, in ACL-IJCNLP, 2021. O. Ram, Y. Kirstain, J. Berant, A. Globerson, and O. Levy. paper code
  30. A closer look at few-shot crosslingual transfer: The choice of shots matters, in ACL-IJCNLP, 2021. M. Zhao, Y. Zhu, E. Shareghi, I. Vulic, R. Reichart, A. Korhonen, and H. Schütze. paper code
  31. Learning from miscellaneous other-classwords for few-shot named entity recognition, in ACL-IJCNLP, 2021. M. Tong, S. Wang, B. Xu, Y. Cao, M. Liu, L. Hou, and J. Li. paper code
  32. Distinct label representations for few-shot text classification, in ACL-IJCNLP, 2021. S. Ohashi, J. Takayama, T. Kajiwara, and Y. Arase. paper code
  33. Entity concept-enhanced few-shot relation extraction, in ACL-IJCNLP, 2021. S. Yang, Y. Zhang, G. Niu, Q. Zhao, and S. Pu. paper code
  34. On training instance selection for few-shot neural text generation, in ACL-IJCNLP, 2021. E. Chang, X. Shen, H.-S. Yeh, and V. Demberg. paper code
  35. Unsupervised neural machine translation for low-resource domains via meta-learning, in ACL-IJCNLP, 2021. C. Park, Y. Tae, T. Kim, S. Yang, M. A. Khan, L. Park, and J. Choo. paper code
  36. Meta-learning with variational semantic memory for word sense disambiguation, in ACL-IJCNLP, 2021. Y. Du, N. Holla, X. Zhen, C. Snoek, and E. Shutova. paper code
  37. Multi-label few-shot learning for aspect category detection, in ACL-IJCNLP, 2021. M. Hu, S. Z. H. Guo, C. Xue, H. Gao, T. Gao, R. Cheng, and Z. Su. paper
  38. TextSETTR: Few-shot text style extraction and tunable targeted restyling, in ACL-IJCNLP, 2021. P. Rileya, N. Constantb, M. Guob, G. Kumarc, D. Uthusb, and Z. Parekh. paper
  39. Few-shot text ranking with meta adapted synthetic weak supervision, in ACL-IJCNLP, 2021. S. Sun, Y. Qian, Z. Liu, C. Xiong, K. Zhang, J. Bao, Z. Liu, and P. Bennett. paper code
  40. PROTAUGMENT: Intent detection meta-learning through unsupervised diverse paraphrasing, in ACL-IJCNLP, 2021. T. Dopierre, C. Gravier, and W. Logerais. paper code
  41. AUGNLG: Few-shot natural language generation using self-trained data augmentation, in ACL-IJCNLP, 2021. X. Xu, G. Wang, Y.-B. Kim, and S. Lee. paper code
  42. Meta self-training for few-shot neural sequence labeling, in KDD, 2021. Y. Wang, S. Mukherjee, H. Chu, Y. Tu, M. Wu, J. Gao, and A. H. Awadallah. paper code
  43. Knowledge-enhanced domain adaptation in few-shot relation classification, in KDD, 2021. J. Zhang, J. Zhu, Y. Yang, W. Shi, C. Zhang, and H. Wang. paper code
  44. Few-shot text classification with triplet networks, data augmentation, and curriculum learning, in NAACL-HLT, 2021. J. Wei, C. Huang, S. Vosoughi, Y. Cheng, and S. Xu. paper code
  45. Few-shot intent classification and slot filling with retrieved examples, in NAACL-HLT, 2021. D. Yu, L. He, Y. Zhang, X. Du, P. Pasupat, and Q. Li. paper
  46. Non-parametric few-shot learning for word sense disambiguation, in NAACL-HLT, 2021. H. Chen, M. Xia, and D. Chen. paper code
  47. Towards few-shot fact-checking via perplexity, in NAACL-HLT, 2021. N. Lee, Y. Bang, A. Madotto, and P. Fung. paper
  48. ConVEx: Data-efficient and few-shot slot labeling, in NAACL-HLT, 2021. M. Henderson, and I. Vulic. paper
  49. Few-shot text generation with natural language instructions, in EMNLP, 2021. T. Schick, and H. Schütze. paper
  50. Towards realistic few-shot relation extraction, in EMNLP, 2021. S. Brody, S. Wu, and A. Benton. paper code
  51. Few-shot emotion recognition in conversation with sequential prototypical networks, in EMNLP, 2021. G. Guibon, M. Labeau, H. Flamein, L. Lefeuvre, and C. Clavel. paper code
  52. Learning prototype representations across few-shot tasks for event detection, in EMNLP, 2021. V. Lai, F. Dernoncourt, and T. H. Nguyen. paper
  53. Exploring task difficulty for few-shot relation extraction, in EMNLP, 2021. J. Han, B. Cheng, and W. Lu. paper code
  54. Honey or poison? Solving the trigger curse in few-shot event detection via causal intervention, in EMNLP, 2021. J. Chen, H. Lin, X. Han, and L. Sun. paper code
  55. Nearest neighbour few-shot learning for cross-lingual classification, in EMNLP, 2021. M. S. Bari, B. Haider, and S. Mansour. paper
  56. Knowledge-aware meta-learning for low-resource text classification, in EMNLP, 2021. H. Yao, Y. Wu, M. Al-Shedivat, and E. P. Xing. paper code
  57. Few-shot named entity recognition: An empirical baseline study, in EMNLP, 2021. J. Huang, C. Li, K. Subudhi, D. Jose, S. Balakrishnan, W. Chen, B. Peng, J. Gao, and J. Han. paper
  58. MetaTS: Meta teacher-student network for multilingual sequence labeling with minimal supervision, in EMNLP, 2021. Z. Li, D. Zhang, T. Cao, Y. Wei, Y. Song, and B. Yin. paper
  59. Meta-LMTC: Meta-learning for large-scale multi-label text classification, in EMNLP, 2021. R. Wang, X. Su, S. Long, X. Dai, S. Huang, and J. Chen. paper

huggingface

NLP helper: Hugging Face's transformers library.

git: https://github.com/huggingface/transformers

paper: https://arxiv.org/abs/1910.03771v5

Overall structure

A quick tutorial:

https://blog.csdn.net/weixin_44614687/article/details/106800244

from_pretrained

Under the hood it calls load_state_dict.

Some weights of the model checkpoint at ../../../../test/data/chinese-roberta-wwm-ext were not used when initializing listnet_bert: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing listnet_bert from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing listnet_bert from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of listnet_bert were not initialized from the model checkpoint at ../../../../test/data/chinese-roberta-wwm-ext and are newly initialized: ['Linear2.weight', 'Linear1.weight', 'Linear1.bias', 'Linear2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Two parts: (1) some weights in the loaded pre-trained checkpoint are not used; (2) some weights in your own model are newly initialized instead of being loaded.
Seeing this warning during fine-tuning is normal.
It should not appear at predict time (once you load your own fine-tuned checkpoint).

About the model

BertModel -> our model

1 Load a model from transformers

from transformers import BertPreTrainedModel, BertModel,AutoTokenizer,AutoConfig

2 Build your own architecture on top of the model from step 1, as in the sketch below.
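
A minimal sketch of wrapping BertModel inside a custom head, mirroring the listnet_bert example whose Linear1/Linear2 weights show up as "newly initialized" in the warning above; the layer sizes, pooling choice, and the checkpoint id are assumptions.

```python
import torch
from torch import nn
from transformers import BertPreTrainedModel, BertModel

class ListNetBert(BertPreTrainedModel):
    """BertModel backbone plus two freshly initialized linear layers."""
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.Linear1 = nn.Linear(config.hidden_size, config.hidden_size)
        self.Linear2 = nn.Linear(config.hidden_size, 1)
        self.init_weights()  # initialize the layers not covered by the checkpoint

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        pooled = outputs.pooler_output            # [CLS] pooled output
        return self.Linear2(torch.relu(self.Linear1(pooled)))

# from_pretrained loads the BERT weights and leaves Linear1/Linear2 randomly initialized,
# which is exactly what the warning above reports.
model = ListNetBert.from_pretrained("hfl/chinese-roberta-wwm-ext")  # illustrative checkpoint
```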

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

https://arxiv.org/abs/1901.02860

Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the language-modeling setting (memory and computation constraints keep the context length small). The paper proposes a novel architecture, Transformer-XL, that enables learning dependencies beyond a fixed length without disrupting temporal coherence.

3 Model

3.1 Vanilla Transformer Language Models

Problem: to apply the Transformer or self-attention to language modeling, the central issue is how to train the model to effectively encode an arbitrarily long context into a fixed-size representation. The usual workaround is the vanilla model: split the long text into fixed-length segments and process each one independently (Fig. 1a in the paper).

During training, there are two critical limitations of using a fixed-length context. First, the largest possible dependency length is upper-bounded by the segment length. Second, simply chunking a sequence into fixed-length segments leads to the context fragmentation problem.

During evaluation, as shown in Fig. 1b, the segment is slid forward one position at a time, which ensures that each prediction utilizes the longest possible context exposed during training and also relieves the context fragmentation issue encountered in training. However, this evaluation procedure is extremely expensive.

3.2 Segment-Level Recurrence with State Reuse

The paper introduces a recurrence mechanism into the Transformer architecture.

Variable definitions: see the paper.

The recurrence (update) equations: see the paper.

SG(·) stands for stop-gradient, and $\circ$ denotes concatenation of the two hidden-state sequences along the length dimension.

The process is illustrated in Fig. 2 of the paper.

During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment, as shown in Fig. 2a.

During evaluation, the representations from the previous segments can be reused instead of being computed from scratch as in the vanilla model.
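
A minimal sketch of the state-reuse step, assuming per-layer hidden states of shape (batch, length, d); SG() corresponds to `.detach()`, and the memory-length handling is simplified. Names are illustrative.

```python
import torch

def extend_context(prev_hidden, curr_hidden, mem_len):
    """Cache the previous segment's states (stop-gradient) and prepend them to the
    current segment; queries come from curr_hidden, keys/values from the result."""
    mem = prev_hidden.detach()                       # SG(): no gradients flow into the cache
    extended = torch.cat([mem, curr_hidden], dim=1)  # concat along the length dimension
    # keep at most mem_len cached steps plus the current segment
    return extended[:, -(mem_len + curr_hidden.size(1)):]
```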

3.3 Relative Positional Encodings

How can we keep the positional information coherent when we reuse the states? If the original absolute positional encoding is kept, the attention score expands as shown in the paper.

This is problematic: consecutive segments would share the same absolute position indices, so the model cannot tell a position in the current segment apart from the same position in the cached one.

To solve this, the paper introduces relative positional information. The attention-score decompositions for the standard Transformer and for the proposed relative parameterization are reproduced below.
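
Reconstructed from the paper (notation as in the original: $E_{x_i}$ is the word embedding, $U_j$ the absolute positional encoding, $R_{i-j}$ the relative encoding, and $u$, $v$ are learned global vectors):

$$A^{\mathrm{abs}}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_k E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_k U_j + U_i^{\top} W_q^{\top} W_k E_{x_j} + U_i^{\top} W_q^{\top} W_k U_j$$

$$A^{\mathrm{rel}}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}$$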

3.4 Full algorithm (the complete procedure is given in the paper)

Bridging the Gap Between Relevance Matching and Semantic Matching for Short Text Similarity Modeling

https://cs.uwaterloo.ca/~jimmylin/publications/Rao_etal_EMNLP2019.pdf

2 HCAN: Hybrid Co-Attention Network

Three major components: (1) a hybrid encoder, (2) a relevance matching module, and (3) a semantic matching module.

2.1 Hybrid Encoders

A hybrid encoder module that explores three types of encoders: deep, wide, and contextual.

Query and context words: $\{w_1^q,w_2^q,\dots,w_n^q\}$ and $\{w_1^c,w_2^c,\dots,w_m^c\}$, with embedding representations $\textbf{Q}\in \mathbb{R}^{n\times L}$ and $\textbf{C}\in \mathbb{R}^{m\times L}$.

Deep Encoder

$\textbf{U}$ stands for either $\textbf{Q}$ or $\textbf{C}$.

Wide Encoder

Unlike the deep encoder that stacks multiple convolutional layers hierarchically, the wide encoder organizes convolutional layers in parallel, with each convolutional layer having a different window size k
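
A minimal sketch of the wide encoder: parallel 1-D convolutions with different window sizes k over the same input; the kernel sizes and filter count are assumptions, not values from the paper.

```python
import torch
from torch import nn

class WideEncoder(nn.Module):
    """Parallel convolutions, one per window size k, applied to the same input U."""
    def __init__(self, emb_dim, num_filters=128, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, U):                        # U: (batch, seq_len, emb_dim)
        h = U.transpose(1, 2)                    # -> (batch, emb_dim, seq_len)
        # one output per window size, unlike the deep encoder's stacked layers
        return [torch.relu(conv(h)).transpose(1, 2) for conv in self.convs]
```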

Contextual Encoder

2.2 Relevance Matching

2.3 Semantic Matching

2.4 Final Classification

