SimCSE: Simple Contrastive Learning of Sentence Embeddings

https://arxiv.org/pdf/2104.08821.pdf

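1. Background

1 Target

Given $D=\{(x_i,x_i^{+})\}_{i=1}^{m}$, where $x_i$ and $x_i^{+}$ are semantically related, while $x_i$ and $x_j^{+}$ ($j \neq i$) are not semantically related.

$x \rightarrow h$: an encoder maps each sentence $x$ to its representation $h$.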

Contrastive learning aims to learn effective representation by pulling semantically close neighbors together and pushing apart non-neighbors

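$$\ell_i=-\log\frac{e^{sim(h_i,h_i^{+})/\tau}}{\sum_{j=1}^{N}e^{sim(h_i,h_j^{+})/\tau}}$$

Here $sim(\cdot)$ is cosine similarity, $\tau$ is a temperature, and $N$ is the mini-batch size. The numerator is the positive pair; the denominator is the negatives (it also contains the one positive pair, which seems negligible).

Does the denominator include the numerator's term? Judging from the code, yes.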
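To make the point about the denominator concrete, here is a minimal pure-Python sketch of the cross-entropy form of this loss for a single row of similarities. The function name `info_nce_row` and the similarity values are made up for illustration; this is not the SimCSE reference code.

```python
import math

def info_nce_row(sims, pos_index, temperature=0.05):
    """Cross-entropy over one row of similarities.

    sims: similarities between one query and all N in-batch candidates.
    pos_index: position of the positive candidate in that row.
    """
    logits = [s / temperature for s in sims]
    # Log-sum-exp over ALL candidates, the positive one included.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # -log(exp(pos) / sum(exp(all))) = log_denominator - pos
    return log_denominator - logits[pos_index]

# Made-up cosine similarities; index 0 is the positive pair.
sims = [0.9, 0.1, -0.2]
loss_value = info_nce_row(sims, pos_index=0)
print(loss_value)
```

Because the positive term stays in the denominator, the ratio inside the log never exceeds 1 and the loss is never negative; if the positive were excluded, the loss could drop below zero.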

loss

https://www.jianshu.com/p/d73e499ec859

import torch
import torch.nn.functional as F

def loss(self, y_pred, y_true, lamda=0.05):
    '''
    Given a query q1 and a ranked candidate list [d1, d2, d3, ..., dn]:
    loss = -log( exp(sim(q1,d1)/t) / sum_i exp(sim(q1,di)/t) ), i = 1, ..., n
    (cross_entropy keeps the positive term in the denominator)

    [q1,q2] [[d11,d12,d13],[d21,d22,d23]]
    similarities = [[sim(q1,d11), sim(q1,d12), sim(q1,d13)],
                    [sim(q2,d21), sim(q2,d22), sim(q2,d23)]]  y_true = [y1, y2]

    loss = F.cross_entropy(similarities, y_true)
    ref : https://www.jianshu.com/p/d73e499ec859
    '''
    # idxs = torch.arange(0, y_pred.shape[0])
    # y_true = idxs + 1 - idxs % 2 * 2
    # Reshape the flat similarity scores to (num_queries, num_candidates).
    y_pred = y_pred.reshape(-1, y_true.shape[1])

    # y_true = [0] * y_pred.shape[0]
    # similarities = F.cosine_similarity(y_pred.unsqueeze(1), y_pred.unsqueeze(0), dim=2)
    # similarities = similarities - torch.eye(y_pred.shape[0]) * 1e12
    # Temperature scaling: lamda plays the role of t in the formula above.
    y_pred = y_pred / lamda
    # y_true is one-hot per row; cross_entropy expects class indices.
    y_true = torch.argmax(y_true, dim=1)
    loss = F.cross_entropy(y_pred, y_true)
    return loss

2 Evaluation metrics for representations

Alignment: calculates the expected distance between embeddings of the paired instances (the paired instances are the positive pairs).

Uniformity: measures how well the embeddings are uniformly distributed.
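The two metrics can be sketched in a few lines of plain Python. This assumes the common formulation with squared Euclidean distance on normalized embeddings and a Gaussian potential with $t=2$; the toy vectors and the function names are made up for illustration.

```python
import math
from itertools import combinations

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """Mean squared distance between normalized positive pairs (lower is better)."""
    return sum(sq_dist(normalize(x), normalize(y)) for x, y in pairs) / len(pairs)

def uniformity(vectors, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is better)."""
    unit = [normalize(v) for v in vectors]
    potentials = [math.exp(-t * sq_dist(u, v)) for u, v in combinations(unit, 2)]
    return math.log(sum(potentials) / len(potentials))

# Toy data: two positive pairs, one well-spread set and one collapsed set.
pairs = [([1.0, 0.0], [0.9, 0.1]), ([0.0, 1.0], [0.1, 0.9])]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0], [0.99, 0.01], [0.98, 0.02], [0.97, 0.03]]

print(alignment(pairs), uniformity(spread), uniformity(collapsed))
```

The collapsed set scores close to 0 on uniformity, while the evenly spread set scores clearly lower, which is the behavior the metric is meant to reward.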

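2. Architecture

2.1 Unsupervised

$x_i \rightarrow h_i^{z_i},\ x_i \rightarrow h_i^{z_i'}$

$z$ is a random mask for dropout; the loss is

$$\ell_i=-\log\frac{e^{sim(h_i^{z_i},h_i^{z_i'})/\tau}}{\sum_{j=1}^{N}e^{sim(h_i^{z_i},h_j^{z_j'})/\tau}}$$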
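The idea of getting two views of the same input from two independent dropout masks can be illustrated with a toy sketch. Note the simplification: SimCSE relies on the dropout inside the Transformer layers, whereas this hypothetical `dropout` helper is applied directly to a made-up vector.

```python
import random

def dropout(vector, p, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vector]

rng = random.Random(0)
# Made-up hidden vector standing in for one sentence.
h = [0.5, -1.2, 0.3, 0.8, -0.1, 1.5, -0.7, 0.2]

view_a = dropout(h, p=0.1, rng=rng)  # plays the role of h^{z}
view_b = dropout(h, p=0.1, rng=rng)  # plays the role of h^{z'}
print(view_a)
print(view_b)
```

The two calls draw independent masks, so the same input yields two slightly different vectors that are treated as a positive pair.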

2.2 Supervised

Introduce a labeled dataset from a non-target task, for example NLI: triples $(x_i,x_i^{+},x_i^{-})$, where $x_i$ is the premise, and $x_i^{+}$ and $x_i^{-}$ are the entailment and contradiction hypotheses.

$(h_i,h_j^{+})$ with $j \neq i$ are normal negatives; $(h_i,h_j^{-})$ are hard negatives.
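Written out, the supervised objective extends the in-batch loss by adding the hard negatives to the denominator. The form below is how I understand the paper's formulation, with $sim(\cdot)$ as cosine similarity, $\tau$ as the temperature and $N$ as the mini-batch size:

```latex
\ell_i = -\log
\frac{e^{sim(h_i,h_i^{+})/\tau}}
     {\sum_{j=1}^{N}\left(e^{sim(h_i,h_j^{+})/\tau} + e^{sim(h_i,h_j^{-})/\tau}\right)}
```

The first term in the sum covers the positive ($j=i$) and the normal negatives ($j \neq i$); the second term covers the hard negatives.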

ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

https://arxiv.org/abs/2105.11741

https://tech.meituan.com/2021/06/03/acl-2021-consert-bert.html

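1. Background

First, sentence vectors derived directly from BERT (without fine-tuning, averaging all token vectors) exhibit a "collapse" phenomenon: all sentences tend to be encoded into a small region of the space, as shown in the figure. To address this, contrastive learning is incorporated into the fine-tuning process, using unlabeled data to improve the model.

2. Method

Given a BERT-like pretrained language model $\textbf{M}$ and an unlabeled text corpus $\mathcal{D}$ collected from the target-domain data distribution, we want to fine-tune $\textbf{M}$ on $\mathcal{D}$ with a self-supervised task so that the fine-tuned model performs best on the target task (text semantic matching).

2.1 Overall framework

The overall model structure is shown in the figure above. It consists of three main parts: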

A data augmentation module that generates different views for input samples at the token embedding layer.

A shared BERT encoder that computes sentence representations for each input text. During training, we use the average pooling of the token embeddings at the last layer to obtain sentence representations.

A contrastive loss layer on top of the BERT encoder. It maximizes the agreement between one representation and its corresponding version that is augmented from the same sentence while keeping it distant from other sentence representations in the same batch.

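For any input sentence $x$, we obtain two augmented embeddings $e_i=T_1(x),e_j=T_2(x)$, with $e_i,e_j\in \mathbb{R}^{L\times d}$, which the shared BERT encoder encodes into $r_i,r_j$. Here $T_1,T_2$ are different data augmentation strategies, $L$ is the length of sentence $x$, and $d$ is the hidden size. At each training step, $N$ samples are drawn at random from $\mathcal{D}$ as a mini-batch, yielding $2N$ augmented samples, and the loss is built with NT-Xent:

$$\mathcal{L}_{i,j}=-\log\frac{\exp(sim(r_i,r_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]}\exp(sim(r_i,r_k)/\tau)}$$

where $sim(.)$ is cosine similarity, $\tau$ is the temperature (a hyperparameter, set to 0.1 in the experiments), and $\mathbb{1}$ is an indicator that equals 0 when $k=i$. The numerator is the positive pair; the denominator runs over all other samples (nearly all of them negatives, so it can be read as the negatives). Lowering the loss therefore means making the numerator larger and the denominator smaller, that is, raising the similarity of positive pairs and lowering the similarity of negative pairs.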
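A minimal pure-Python sketch of NT-Xent for a single anchor, assuming cosine similarity and $\tau=0.1$ as stated above. The representations are made-up 2-d vectors and the helper names are hypothetical; a real implementation would work on batched tensors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(reps, i, j, tau=0.1):
    """NT-Xent loss for anchor i whose positive is j, over all 2N representations."""
    numerator = math.exp(cosine(reps[i], reps[j]) / tau)
    # The indicator 1[k != i] removes only the anchor itself from the sum.
    denominator = sum(math.exp(cosine(reps[i], reps[k]) / tau)
                      for k in range(len(reps)) if k != i)
    return -math.log(numerator / denominator)

# N = 2 sentences give 2N = 4 representations; (0, 1) and (2, 3) are positive pairs.
reps = [[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0], [-0.1, 0.9]]
loss_pos = nt_xent(reps, 0, 1)    # anchor paired with its true positive
loss_wrong = nt_xent(reps, 0, 2)  # anchor paired with an unrelated sample
print(loss_pos, loss_wrong)
```

The loss is small when the designated positive really is the closest sample and large otherwise, which matches the reading of the numerator and denominator given above.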

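2.2 Data augmentation strategies

Methods that explicitly generate augmented samples include back-translation, synonym replacement, and paraphrasing. However, these methods do not necessarily guarantee semantic consistency, so the authors consider methods that generate augmented samples implicitly at the Embedding layer.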

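  • Adversarial Attack: generates an adversarial perturbation through gradient backpropagation and adds it to the original Embedding matrix to obtain the augmented sample. Because generating the perturbation requires backpropagated gradients, this augmentation method only applies to supervised training.

  • Token Shuffling: shuffles the word order of the input sample. The Transformer architecture has no built-in notion of "position"; the model perceives token positions only through the Position Ids in the Embedding. In practice, therefore, it is enough to shuffle the Position Ids.

  • Cutoff: further divided into two kinds:

    • Token Cutoff: randomly select tokens and set the entire Embedding row of each selected token to zero.
    • Feature Cutoff: randomly select Embedding features and set the entire column of each selected feature dimension to zero.
  • Dropout: every element of the Embedding is set to zero with a certain probability. Unlike Cutoff, there is no row-wise or column-wise constraint.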
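The embedding-level strategies above (except the adversarial attack, which needs gradients) can be sketched on a plain $L \times d$ list-of-lists matrix. The function names and the rates used here are placeholders chosen for illustration, not values from the paper.

```python
import random

def token_cutoff(emb, rng, rate=0.15):
    """Zero out whole rows (tokens) of an L x d embedding matrix."""
    return [[0.0] * len(row) if rng.random() < rate else list(row) for row in emb]

def feature_cutoff(emb, rng, rate=0.2):
    """Zero out whole columns (feature dimensions)."""
    dropped = {c for c in range(len(emb[0])) if rng.random() < rate}
    return [[0.0 if c in dropped else x for c, x in enumerate(row)] for row in emb]

def element_dropout(emb, rng, rate=0.2):
    """Zero out individual elements with no row or column constraint."""
    return [[0.0 if rng.random() < rate else x for x in row] for row in emb]

def shuffle_position_ids(length, rng):
    """Token shuffling: permute the position ids and leave the tokens in place."""
    ids = list(range(length))
    rng.shuffle(ids)
    return ids

rng = random.Random(0)
# Toy 5 x 4 embedding matrix with no zero entries.
emb = [[float(r * 4 + c + 1) for c in range(4)] for r in range(5)]

print(token_cutoff(emb, rng))
print(feature_cutoff(emb, rng))
print(element_dropout(emb, rng))
print(shuffle_position_ids(len(emb), rng))
```

Each function returns a new matrix (or id list) and leaves the input untouched, so two different strategies can be applied to the same sentence to produce the two views $T_1(x)$ and $T_2(x)$.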

2.3 Incorporating supervision signals

Besides unsupervised training, the authors give three strategies for further incorporating supervision signals, taking the NLI task as an example:

Joint training (joint):

$\mathcal{L}_{joint}=\mathcal{L}_{ce}+\alpha\mathcal{L}_{con}$

Supervised training then unsupervised transfer (sup-unsup):

first train the model with $\mathcal{L}_{ce}$ on the NLI dataset, then use $\mathcal{L}_{con}$ to fine-tune it on the target dataset.

Joint training then unsupervised transfer (joint-unsup):

first train the model with $\mathcal{L}_{joint}$ on the NLI dataset, then use $\mathcal{L}_{con}$ to fine-tune it on the target dataset.

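3. Qualitative analysis

The authors further found that the collapse of BERT sentence representations is related to high-frequency words in the sentence. Specifically, when the sentence vector is computed by averaging token vectors, the vectors of high-frequency words dominate it, making it hard for the sentence vector to reflect its original meaning. Removing a number of high-frequency words when computing the sentence vector alleviates the collapse to some extent (as the blue curve in Figure 2 shows).

4. Experimental results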

4.1 Unsupervised Results

4.2 Supervised Results

word2vec

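1. Principle

Two training models

  • If a single word is used as input to predict its surrounding context, the model is called the "Skip-gram model".
  • If the context of a word is used as input to predict the word itself, it is the "CBOW model".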
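The difference between the two models is easiest to see in the training pairs they consume. The sketch below only builds (input, target) pairs for a window of size 1; the helper names are made up for illustration, and no actual training takes place.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window=1):
    """Skip-gram: the center word is the input, each context word a target."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]

def cbow_pairs(tokens, window=1):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context_of(tokens, i, window), center)
            for i, center in enumerate(tokens)]

tokens = ["Jane", "wants", "to", "go"]
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```

Skip-gram yields one pair per (center, context word) combination, while CBOW yields one pair per position with the whole context as input.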

Training tricks

hierarchical softmax and negative sampling

2. Code

Training code

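from gensim.models.word2vec import Word2Vec
import pandas as pd
import jieba

# data_path and model_path are placeholders for your own file locations.

### train
data = pd.read_csv(data_path)
# Assumption: the raw text sits in the first column. Word2Vec expects each
# sentence as a list of tokens, so segment the text with jieba first.
sentences = [jieba.lcut(str(text)) for text in data.iloc[:, 0]]
model = Word2Vec()
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=5)
model.save(model_path)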

Word-vector matrix

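from gensim import models

if __name__ == '__main__':
    # model_path: path to a pretrained model stored in binary word2vec format
    model = models.KeyedVectors.load_word2vec_format(model_path, binary=True)
    print(model.vectors)  # word-vector matrix, shape (779845, 400)
    print("\n")
    print(model.index_to_key)  # vocabulary, ordered by row index
    print("\n")
    print(model["的"])  # vector of a single word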
array([[-1.3980628e+00, -4.6281612e-01,  5.8368486e-01, ...,
         5.3952241e-01,  4.4697687e-01,  1.3505782e+00],
       [ 4.9143720e-01, -1.4818899e-01, -2.8366420e-01, ...,
         1.1110669e+00,  2.1992767e-01,  7.0457202e-01],
       [-8.5650706e-01,  8.2832746e-02, -8.4218192e-01, ...,
         2.1654253e+00,  6.4846051e-01, -5.7714492e-01],
       ...,
       [ 7.5072781e-03, -1.3543828e-02,  2.3101490e-02, ...,
         4.2363801e-03, -5.6749382e-03,  6.3404259e-03],
       [-2.6244391e-04, -3.0459568e-02,  5.9752418e-03, ...,
         1.7844304e-02, -4.7109672e-04,  7.7916058e-03],
       [ 7.2062697e-04, -6.5988898e-03,  1.1346856e-02, ...,
        -3.7340564e-03, -1.8825980e-02,  2.7245486e-03]], dtype=float32)

[',', '的', '。', '、', '0', '1', '在', '”', '2', '了', '“', '和', '是', '5', ...]

array([ 4.9143720e-01, -1.4818899e-01, -2.8366420e-01, -3.6405793e-01, 1.0851435e-01, 4.9507666e-02, -7.1219063e-01, -5.4614645e-01, -1.3581418e+00, 3.0274218e-01, 6.1700332e-01, 3.5553512e-01, 1.6602433e+00, 7.5298291e-01, -1.4151905e-01, -2.1077128e-01, -2.6325354e-01, 1.6108564e+00, -4.6750236e-01, -1.6261842e+00, 1.3063166e-01, 8.0702168e-01, 4.0011466e-01, 1.2198541e+00, -6.2879241e-01, ... 2.1928079e-01, 7.1725255e-01, -2.3430648e-01, -1.2066336e+00, 9.7590965e-01, -1.5906478e-01, -3.5802779e-01, -3.8005975e-01, 1.9056025e-01, 1.1110669e+00, 2.1992767e-01, 7.0457202e-01], dtype=float32)

References

https://zhuanlan.zhihu.com/p/26306795

https://arxiv.org/abs/1301.3781v3

https://arxiv.org/abs/1405.4053

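In NLP, what is the difference between using pretrained word vectors and randomly initialized word vectors?

When training data is scarce, you can use word vectors that someone else has already pretrained as they are, fine-tune the pretrained vectors on your own training data, or train the word vectors together with the whole model. In most cases, though, pretrained word vectors are not updated during training.

When training data is plentiful and you want the word vectors to better capture the semantics of your own data, use randomly initialized word vectors. Randomly initialized word vectors must of course be updated continuously while the network trains, just like the network's weight parameters.

Examples:

1. A direct illustration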

import torch
from torch import nn

### random initialization: each call draws new weights, trainable by default
embeds = nn.Embedding(2, 5)
print(embeds.weight)
embeds = nn.Embedding(2, 5)
print(embeds.weight)
### from pretrained weights: frozen by default (freeze=True)
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
print(embedding.weight)
Parameter containing:
tensor([[-0.1754, 1.6604, -1.5025, -1.0980, -0.4718],
[-1.1276, 0.1408, -1.0746, -1.2768, -0.6789]], requires_grad=True)
Parameter containing:
tensor([[-0.7366, 0.0607, 0.6151, 0.2282, 0.3878],
[-1.1365, 0.1844, -1.1191, -0.8787, -0.5121]], requires_grad=True)
Parameter containing:
tensor([[1.0000, 2.3000, 3.0000],
[4.0000, 5.1000, 6.3000]])

2.n-gram

self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
self.embedding_ngram2 = nn.Embedding(config.n_gram_vocab, config.embed)
self.embedding_ngram3 = nn.Embedding(config.n_gram_vocab, config.embed)

Reference

https://www.zhihu.com/question/337950427

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

paper: https://arxiv.org/abs/1908.10084

git: https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications

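1. Contributions

Based on BERT, the model is trained with a siamese or triplet structure so that it produces sentence embeddings usable in a low-dimensional space. For text matching, sentence embeddings can be computed offline and matching is then done online on those embeddings, which gives fast and accurate matching.

2. Architecture

The paper proposes three architectures and objective functions; the authors did not draw a figure for the triplet structure.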

1.Classification Objective Function

2.Regression Objective Function

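3. Triplet Objective Function

$$\max(||s_a-s_p||-||s_a-s_n||+\sigma,0)$$

$||.||$ computes a vector distance, $s_a$ is the anchor sample, $s_p$ is the positive sample, $s_n$ is the negative sample, and the margin $\sigma$ ensures that the positive sample is at least $\sigma$ closer to the anchor than the negative sample.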
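A small sketch of the triplet objective, assuming Euclidean distance and a margin of 1. The vectors are made up and the function names are only illustrative.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(||a - p|| - ||a - n|| + margin, 0)"""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# Made-up sentence embeddings: anchor, positive, negative.
s_a, s_p, s_n = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]

print(triplet_loss(s_a, s_p, s_n))  # 0.0: positive already closer by more than the margin
print(triplet_loss(s_a, s_n, s_p))  # 5.0: roles swapped, so the constraint is violated
```

The loss is zero as soon as the positive is closer to the anchor than the negative by at least the margin, so only violating triplets contribute gradients.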

For pooling, the paper proposes three strategies:

1. Using the output of the CLS-token
2. Computing the mean of all output vectors (MEAN_strategy)
3. Computing a max-over-time of the output vectors (MAX_strategy)

The default configuration is MEAN.
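The three pooling strategies reduce a sequence of token vectors to one sentence vector. Here is a toy sketch on a list of lists; it assumes row 0 is the CLS token and ignores padding masks, which a real implementation would need to handle.

```python
def cls_pooling(token_vectors):
    """Use the vector of the first ([CLS]) token."""
    return list(token_vectors[0])

def mean_pooling(token_vectors):
    """Average each dimension over all tokens."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def max_pooling(token_vectors):
    """Take the maximum of each dimension over all tokens (max-over-time)."""
    return [max(col) for col in zip(*token_vectors)]

# Toy encoder outputs for 3 tokens with 2 dimensions each; row 0 is [CLS].
token_vectors = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

print(cls_pooling(token_vectors))   # [1.0, 4.0]
print(mean_pooling(token_vectors))  # [2.0, 2.0]
print(max_pooling(token_vectors))   # [3.0, 4.0]
```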

3. Experimental results

3.1 Unsupervised STS

3.2 Supervised STS

3.3 Argument Facet Similarity

3.4 Wikipedia Sections Distinction

We use the Triplet Objective

4. Code

# Local copy of the library; with the pip package the import is
# `from sentence_transformers import SentenceTransformer, util`.
from sentence_bert.sentence_transformers import SentenceTransformer, util

### load model (model_path is a placeholder for a local model directory)
model = SentenceTransformer(model_path)

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

# Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

# Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores) - 1):
    for j in range(i + 1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

# Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))
The new movie is awesome 		 The new movie is so great 		 Score: 0.9283
The cat sits outside The cat plays in the garden Score: 0.6855
I love pasta Do you like pizza? Score: 0.5420
I love pasta The new movie is awesome Score: 0.2629
I love pasta The new movie is so great Score: 0.2268
The new movie is awesome Do you like pizza? Score: 0.1885
A man is playing guitar A woman watches TV Score: 0.1759
The new movie is so great Do you like pizza? Score: 0.1615
The cat plays in the garden A woman watches TV Score: 0.1521
The cat sits outside The new movie is awesome Score: 0.1475

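Text Representation

A text representation can be a single number (almost nobody does this) or a vector (the current mainstream); I am curious whether anyone uses higher-dimensional tensor representations. What follows is based on vector representations.

1. Word representation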

1.1 one hot

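For example, given the following samples:

Jane wants to go to Shenzhen.

Bob wants to go to Shanghai.

From the words that appear in these two documents, build the following dictionary:

Vocabulary= [Jane, wants, to, go, Shenzhen, Bob, Shanghai]

Then "wants" can be represented as

[0,1,0,0,0,0,0]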
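The one-hot lookup for this vocabulary can be sketched as follows; `one_hot` is a hypothetical helper written for this example.

```python
vocabulary = ["Jane", "wants", "to", "go", "Shenzhen", "Bob", "Shanghai"]

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("wants", vocabulary))  # [0, 1, 0, 0, 0, 0, 0]
```

Every vector has the same length as the vocabulary and any two different words are orthogonal, so one-hot vectors carry no notion of similarity between words.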

1.2 word embedding

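A word-embedding model takes the positional relationships between words into account. By training on a large corpus, each word is mapped into a high-dimensional vector space such that semantically similar words are also close to each other in that space. For example:

The table above is the word-vector matrix, where rows are different features and columns are different words. Man can be represented as

[-1,0.01,0.03,0.09]

Property: $emb_{Man}-emb_{Woman}\approx emb_{King}-emb_{Queen}$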
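The analogy property can be checked numerically on toy vectors. Only the Man vector below comes from the text; the other three are made-up values chosen for illustration, so the result shows the mechanics of the check rather than a property of any trained model.

```python
import math

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

emb = {
    "Man":   [-1.0, 0.01, 0.03, 0.09],   # from the text
    "Woman": [1.0, 0.02, 0.02, 0.01],    # illustrative values
    "King":  [-0.95, 0.93, 0.70, 0.02],  # illustrative values
    "Queen": [0.97, 0.95, 0.69, 0.01],   # illustrative values
}

diff_gender_1 = subtract(emb["Man"], emb["Woman"])
diff_gender_2 = subtract(emb["King"], emb["Queen"])
print(cosine(diff_gender_1, diff_gender_2))
```

A cosine close to 1 means the two difference vectors point in nearly the same direction, which is what the approximate equality above expresses.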

Common methods for building a word-vector matrix include word2vec and GloVe.

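2. Sentence representation

2.1 Bag-of-words model

The bag-of-words model ignores the contextual relationships between words in a text and considers only the weight of each word, which depends on how often the word appears in the text.

Example sentences:

Jane wants to go to Shenzhen.

Bob wants to go to Shanghai.

From the words that appear in these two documents, build the following dictionary:

Vocabulary= [Jane, wants, to, go, Shenzhen, Bob, Shanghai]

The two example sentences can then be represented by the following two vectors, whose values are the number of times each word occurs:

[1,1,2,1,1,0,0]
[0,1,2,1,0,1,1]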
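These count vectors can be reproduced with a short sketch. It assumes whitespace tokenization and strips the trailing period; `bag_of_words` is a hypothetical helper.

```python
from collections import Counter

vocabulary = ["Jane", "wants", "to", "go", "Shenzhen", "Bob", "Shanghai"]

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.rstrip(".").split())
    return [counts[word] for word in vocab]

print(bag_of_words("Jane wants to go to Shenzhen.", vocabulary))  # [1, 1, 2, 1, 1, 0, 0]
print(bag_of_words("Bob wants to go to Shanghai.", vocabulary))   # [0, 1, 2, 1, 0, 1, 1]
```

Word order is lost: any permutation of the same words produces the same vector.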

2.2 Sentence Embedding

2.2.1 Evaluation toolkit

SentEval is a popular toolkit to evaluate the quality of sentence embeddings.

2.2.2 Common methods

Sentence-BERT

BERT-flow

https://zhuanlan.zhihu.com/p/444346578

References

https://zhuanlan.zhihu.com/p/353187575

https://www.jianshu.com/p/0587bc01e414

https://www.cnblogs.com/chenyusheng0803/p/10978883.html

