The DSSM Two-Tower Model Family

A brief introduction to Microsoft's DSSM, CNN-DSSM, and LSTM-DSSM.

The original papers are:

《Learning Deep Structured Semantic Models for Web Search using Clickthrough Data》

《A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval》

《Semantic Modelling with Long-Short-Term Memory for Information Retrieval》

First, why is it called a two-tower model? The query tower is served online, while the doc tower computes embeddings offline; those embeddings are indexed and pushed to the online system.

Note that in DSSM the query and the different docs share parameters; see https://flashgene.com/archives/72820.html

1. DSSM

1.1 Overall Model Structure

The overall structure of the model is shown in the figure above, where $Q$ is the query and $D_i$ are the candidate documents.

The initial bag-of-words representation of a text is $x$. Because this representation is too high-dimensional and hard to train, the dimensionality is reduced, which is where word hashing comes in.

Word hashing is simply char-n-gram tokenization followed by a vector representation (still a bag-of-words vector over letter n-grams rather than a dense vector), as shown below.

One concern is whether different words may end up with the same representation (collisions). The authors ran an experiment on this; the results are as follows.

For a 500K-word vocabulary, letter trigrams compress the representation to 30K dimensions with only 22 collisions, a collision rate of 0.0044%, while the dimensionality shrinks to about 6% of the original. This is very effective.
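
As a concrete illustration, here is a minimal Python sketch of letter-trigram word hashing; the trigram vocabulary `vocab` and the helper names are made up for the example, not taken from the paper.

```python
# A minimal sketch (not the paper's exact implementation): represent a word
# as a bag of letter trigrams after padding it with boundary marks "#".
from collections import Counter

def letter_trigrams(word):
    """Return the letter trigrams of a word, e.g. "good" -> #go, goo, ood, od#."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_hash(word, trigram_index):
    """Bag-of-trigrams vector over a fixed trigram vocabulary (assumed given)."""
    vec = [0] * len(trigram_index)
    for tri, cnt in Counter(letter_trigrams(word)).items():
        if tri in trigram_index:
            vec[trigram_index[tri]] += cnt
    return vec

# toy usage with a hypothetical trigram vocabulary
vocab = {t: i for i, t in enumerate(["#go", "goo", "ood", "od#", "#ca", "cat", "at#"])}
print(word_hash("good", vocab))  # [1, 1, 1, 1, 0, 0, 0]
```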

This is followed by several non-linear mapping layers, each a fully connected layer, giving

The last non-linear layer produces the semantic feature $y$:

The relevance of $Q$ and $D$ is measured by cosine similarity:

The final probability output is

where $\gamma$ is a smoothing factor.
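
For reference, the relevance score and the posterior probability defined in the paper are (reconstructed here from the standard DSSM formulation; worth checking against the original):

$$R(Q,D)=\cos(y_Q,y_D)=\frac{y_Q^{\top} y_D}{\lVert y_Q\rVert\,\lVert y_D\rVert},\qquad P(D\mid Q)=\frac{\exp\bigl(\gamma R(Q,D)\bigr)}{\sum_{D'\in \mathbf{D}}\exp\bigl(\gamma R(Q,D')\bigr)}$$

where $\mathbf{D}$ is the set of candidate documents to be ranked.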

1.2 Training

Sample construction: each positive pair $(Q,D^+)$ is paired with 4 randomly sampled negative documents $(Q,D_j^-),\ j=1,\dots,4$.

The loss function is:

where $\Lambda$ denotes the model parameters.
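
Concretely, the paper minimizes the negative log-likelihood of the clicked documents (reconstructed here from the definitions above; verify against the original):

$$L(\Lambda)=-\log \prod_{(Q,D^{+})} P\bigl(D^{+}\mid Q\bigr)$$

where each softmax $P(D^{+}\mid Q)$ is computed over the candidate set $\{D^{+}\}\cup\{D_j^{-}\mid j=1,\dots,4\}$.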

2. CNN-DSSM

2.1 CLSM Architecture

The model consists of several components:
(1) a word-n-gram layer obtained by running a contextual sliding window over the input word sequence;
(2) a letter-trigram layer that transforms each word-trigram into a letter-trigram representation vector;
(3) a convolutional layer that extracts contextual features for each word from its neighboring words within a window;
(4) a max-pooling layer that discovers and combines salient word-n-gram features to form a fixed-length sentence-level feature vector;
(5) a semantic layer that extracts a high-level semantic feature vector for the input word sequence.

2.2 Letter-trigram based Word-n-gram Representation

On top of DSSM's letter-trigram representation, CLSM adds word-n-grams: a sliding window is run over the raw input text, and the $t$-th word-n-gram can be written as:

where $n=2d+1$ and $f_t$ is the letter-trigram vector of the $t$-th word. A letter-trigram vector has dimension $30K$, so a word-n-gram vector has dimension $n\times 30K$.

For example, as in the figure above, for the input text $(s) \ online \ auto\ body \ (s)$ with window size $n=3$, the word-trigrams are $(s)\ online \ auto$, $online \ auto \ body$, $auto\ body \ (s)$, so

$l_1=[f^T((s)),f^T(online ),f^T(auto)]^T,\\l_2=[f^T(online ),f^T(auto),f^T(body)]^T,\\l_3=[f^T(auto),f^T(body),f^T((s))]^T$

2.3 Modeling Word-n-gram-Level Contextual Features at the Convolutional Layer

The contextual feature vector $h_t$ can be written as:

where $W_c$ is the feature transformation matrix, i.e., the convolution matrix, which is shared across all word-n-grams. One might ask how this differs from an ordinary fully connected layer: the point is precisely that the same $W_c$ is applied to every sliding window, which is exactly what a 1-D convolution over the word sequence does.
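
A minimal numpy sketch of this view (my own illustration, not code from the paper): applying one shared matrix to every concatenated n-gram window is the same computation as a 1-D convolution with kernel width $n$.

```python
# Toy example: shared-weight window transform = 1-D convolution.
import numpy as np

rng = np.random.default_rng(0)
T, dim, n, out_dim = 5, 4, 3, 6          # sequence length, per-word dim, window size, feature dim
f = rng.normal(size=(T, dim))            # toy stand-ins for the letter-trigram vectors f_1..f_T
W_c = rng.normal(size=(out_dim, n * dim))

padded = np.vstack([np.zeros((1, dim)), f, np.zeros((1, dim))])  # pad so every t has a full window
H = []
for t in range(T):
    l_t = padded[t:t + n].reshape(-1)    # concatenated window l_t, shape (n * dim,)
    H.append(np.tanh(W_c @ l_t))         # h_t = tanh(W_c l_t)
H = np.stack(H)                          # (T, out_dim): one contextual feature vector per position
print(H.shape)
```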

The figure below shows an experiment by the authors.

2.4 Modeling Sentence-Level Semantic Features Using Max Pooling

After obtaining the local contextual feature vectors, we need to combine them into a sentence-level feature vector. Some words in a sentence are unimportant and can be ignored, while important ones should be kept. To achieve this, max pooling is used, described as follows:

where $v(i)$ is the $i$-th element of the pooling-layer output $v$, $K$ is the dimensionality of $v$ (the same as that of $h_t$), and $h_t(i)$ is the $i$-th element of the $t$-th local contextual feature vector. An example is shown below.
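
A standalone toy example of this step, $v(i)=\max_t h_t(i)$ (my own illustration):

```python
# Element-wise max over the time axis turns a variable-length sequence of
# contextual vectors h_1..h_T into one fixed-length sentence vector v.
import numpy as np

H = np.array([[0.1, -0.5, 0.9],
              [0.7,  0.2, -0.3],
              [0.4,  0.8,  0.0]])   # h_1..h_3, each of dimension K = 3
v = H.max(axis=0)
print(v)                            # [0.7 0.8 0.9]
```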

2.5 Latent Semantic Vector Representations

The latent semantic vector representation $y$ is computed as follows:

2.6 Using the CLSM for IR

This part is the same as in DSSM.

2.7 Loss Function

3. LSTM-DSSM

CNN-DSSM can only capture local text information. An LSTM captures information over long sequences better than a CNN, so an LSTM is used to improve DSSM.

3.1 Model Structure

The overall structure is shown in the figure below; note that the red parts mark the direction in which residuals are passed.

The LSTM cell in the figure is a variant of the standard LSTM with peephole connections; its structure is shown below.

References

https://www.cnblogs.com/guoyaohua/p/9229190.html

Entropy, KL Divergence, Cross-Entropy, and JS Divergence

GANs rely on the KL and JS divergences, so this post is a warm-up.

1. Entropy

The information content (self-information) of an event is:

Entropy is the expectation of the information content:

2. Cross-Entropy

The cross-entropy is:

3. KL Divergence

For two separate probability distributions over the same random variable, the KL divergence (Kullback-Leibler divergence) measures how different they are. In machine learning losses, we can take $P$ to be the true data distribution and $Q$ the distribution predicted by the model, and use the KL divergence to measure the gap between them. The KL divergence equals the cross-entropy minus the entropy.
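
The formulas referenced above were images in the original post; the standard definitions are:

$$I(x)=-\log P(x),\qquad H(P)=-\sum_x P(x)\log P(x),\qquad H(P,Q)=-\sum_x P(x)\log Q(x)$$

$$D_{KL}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}=H(P,Q)-H(P)$$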

The closer the distributions $P$ and $Q$ are, the smaller $D_{KL}(P\|Q)$ is.

On the relationship between KL divergence and cross-entropy, see:

https://blog.csdn.net/Dby_freedom/article/details/83374650

KL divergence has two main properties:

(1) Asymmetry

Although the KL divergence intuitively looks like a distance, it is not a true metric because it is not symmetric: $D_{KL}(P\|Q)\neq D_{KL}(Q\|P)$.

(2) Non-negativity

That is, $D_{KL}(P\|Q) \geq 0$.

4. JS Divergence

The JS divergence also measures the similarity of two probability distributions, and it fixes the asymmetry of the KL divergence.

It differs from the KL divergence in two main ways:

(1) Value range

The JS divergence takes values in $[0,1]$ (with log base 2): it is 0 when the two distributions are identical and 1 when their supports are disjoint.

(2) Symmetry

That is, $JS(P\|Q)=JS(Q\|P)$, which is evident from its definition.
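
A minimal numpy check of both properties (my own example, not from the references):

```python
# KL is asymmetric; JS is symmetric and bounded in [0, 1] with log base 2.
import numpy as np

def kl(p, q):
    """D_KL(P || Q); assumes p and q are strictly positive and sum to 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2(p / q)))

def js(p, q):
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.4, 0.4, 0.2], [0.1, 0.6, 0.3]
print(kl(p, q), kl(q, p))   # different values -> asymmetric
print(js(p, q), js(q, p))   # identical values in [0, 1] -> symmetric
```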

References

https://www.cnblogs.com/Mrfanl/p/11938139.html

https://zhuanlan.zhihu.com/p/346518942

https://www.w3cschool.cn/article/83016451.html

Pre-trained Models for Natural Language Processing: A Survey

The original survey is very rich; I will keep updating these notes as I work through it.

Abstract

The survey starts from language representation learning, then gives a comprehensive account of the principles and architectures of pre-trained models (PTMs) and their downstream tasks, and finally lists future research directions for PTMs. Its stated goal is to serve as a guide for newcomers to NLP and to PTMs, which is much appreciated.

1.Introduction

With the development of deep learning, many deep learning techniques have been applied to NLP, such as CNNs, RNNs, GNNs, and attention.

Although deep models have achieved great success on NLP tasks, the gains are less dramatic than in CV. The main reason is that datasets for most NLP tasks are quite small (machine translation excepted), while deep networks have very many parameters; without enough data, training leads to overfitting.

Recently, a large body of work has shown that pre-trained models (PTMs) trained on large corpora can learn universal language representations, which benefit downstream NLP tasks and avoid training new models from scratch. With growing compute, the emergence of deeper models (e.g., the Transformer), and steadily improving training techniques, PTM architectures have evolved from shallow to deep. First-generation PTMs target non-contextual word embeddings. Since downstream tasks only need the trained embedding matrix rather than the models themselves, these models are very shallow by today's standards, e.g., Skip-Gram and GloVe. Although such pre-trained embeddings capture word semantics, they are context-free and cannot capture higher-level concepts in context, so they fail on phenomena such as polysemy, syntactic structure, semantic roles, and anaphora. Second-generation PTMs focus on contextual word embeddings, e.g., BERT and GPT. These encoders are still needed by downstream tasks to represent words in context.

2.Background

2.1 Language Representation Learning

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. And each dimension of the vector has no corresponding sense, while the whole represents a concrete concept.

Non-contextual Embeddings

This step maps each token, e.g., $x$ in the figure, to a vector representation $e_x \in \mathbb{R}^{D_e}$, where $D_e$ is the embedding dimension. The mapping is a lookup into an embedding matrix $E\in \mathbb{R}^{D_e\times |\mathcal{V}|}$ trained offline, where $\mathcal{V}$ is the vocabulary.

This approach has two main problems. First, the embeddings are static: they ignore context, so polysemy cannot be handled. Second, there is the out-of-vocabulary (OOV) problem; many techniques mitigate it, e.g., character-level models such as CharCNN, or subword methods such as BPE.
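
A toy sketch of what a non-contextual lookup means in practice (names and sizes are illustrative, not from the survey):

```python
# The embedding matrix E (D_e x |V|) is trained offline; usage is a plain index lookup,
# so a word gets the same vector in every context.
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}                       # V
E = np.random.default_rng(0).normal(size=(4, len(vocab)))    # D_e = 4

def embed(token):
    return E[:, vocab[token]]   # column lookup, shape (D_e,); raises KeyError on OOV tokens

print(embed("cat"))
```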

Contextual Embeddings

To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts:

where $f_{enc}(\cdot)$ is a deep encoder and $\textbf{h}_t$ is the contextual (or dynamical) embedding.

2.2 Neural Contextual Encoders

Neural contextual encoders fall into two classes: sequence models and non-sequence models.

2.2.1 sequence models

Sequence models come in two kinds, convolutional models and recurrent models, as shown in the figure above.

Convolutional

Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors by convolution operations

Recurrent

Recurrent models capture the contextual representations of words with short memory, such as LSTMs and GRUs . In practice, bi-directional LSTMs or GRUs are used to collect information from both sides of a word, but its performance is often affected by the long-term dependency problem.

2.2.2 non-sequence models

transformer: model the relation of every two words

2.2.3 Analysis

Sequence models:

1.Sequence models learn the contextual representation of the word with locality bias and are hard to capture the long-range interactions between words.

2.Nevertheless, sequence models are usually easy to train and get good results for various NLP tasks.

fully-connected self-attention model:

1.can directly model the dependency between every two words in a sequence, which is more powerful and suitable to model long range dependency of language

2.However, due to its heavy structure and less model bias, the Transformer usually requires a large training corpus and is easy to overfit on small or modestly-sized datasets

Conclusion: the Transformer has become the mainstream architecture of PTMs due to its powerful capacity.

2.3 Why Pre-training?

  1. Pre-training on the huge text corpus can learn universal language representations and help with the downstream tasks.
  2. Pre-training provides a better model initialization, which usually leads to a better generalization performance and speeds up convergence on the target task.
  3. Pre-training can be regarded as a kind of regularization to avoid overfitting on small data.

3 Overview of PTMs

3.1 Pre-training Tasks

Pre-training tasks are crucial for learning universal language representations. In general, these tasks should be challenging and have abundant training data. In this section, pre-training tasks are grouped into three categories: supervised learning, unsupervised learning, and self-supervised learning.

Self-Supervised learning: is a blend of supervised learning and unsupervised learning. The learning paradigm of SSL is entirely the same as supervised learning, but the labels of training data are generated automatically. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the rest words.

The commonly used self-supervised pre-training tasks are introduced below.

3.1.1 Language Modeling (LM)

3.1.2 Masked Language Modeling (MLM)

3.1.3 Permuted Language Modeling (PLM)

3.1.4 Denoising Autoencoder (DAE)

3.1.5 Contrastive Learning (CTL)

NSP (next sentence prediction) also belongs to CTL.

https://zhuanlan.zhihu.com/p/360892229

3.1.6 Others

3.2 Taxonomy of PTMs

The authors classify existing PTMs from four perspectives, namely representation type, architectures, pre-training task types, and extensions; the resulting taxonomy is shown above. The figure is slightly inconsistent with this list, perhaps an oversight by the authors: the figure contains five categories, adding Tuning Strategies, and the representation-type axis is labeled "Contextual?" in the figure.

3.3 Model Analysis

4 Extensions of PTMs

4.1 Knowledge-Enriched PTMs

4.2 Multilingual and Language-Specific PTMs

4.3 Multi-Modal PTMs

4.4 Domain-Specific and Task-Specific PTMs

4.5 Model Compression

5 Adapting PTMs to Downstream Tasks

Although PTMs learn a great deal of general knowledge, effectively transferring that knowledge to downstream tasks remains a challenge.

5.1 Transfer Learning

Transfer learning is to adapt the knowledge from a source task (or domain) to a target task (or domain), as shown in the figure below.

5.2 How to Transfer?

5.2.1 Choosing appropriate pre-training task, model architecture and corpus

5.2.2 Choosing appropriate layers

Which layers should feed the downstream task?

The selected PTM layers (model 1) are combined with the downstream task model (model 2).

Different layers of a deep model capture different kinds of knowledge, e.g., part-of-speech tagging, syntactic parsing, long-range dependencies, semantic roles, and coreference. For RNN-based models, studies show that different layers of a multi-layer LSTM encoder suit different tasks. For Transformer-based models, basic syntactic information appears in the lower layers, while high-level semantic information appears in the deeper layers.

Let $\textbf{H}^{(l)}\ (1\le l\le L)$ denote the representation of the $l$-th layer of the PTM and $g(\cdot)$ the task-specific model. There are several ways to select the representation:

a) Embedding Only

Choose only the pre-trained static embeddings, i.e., $g(\textbf{H}^{(1)})$.

b) Top Layer

Feed the top-layer representation into the task-specific model, i.e., $g(\textbf{H}^{(L)})$.

c) All Layers

Feed the representations of all layers and let the model automatically pick the most suitable ones before the task-specific model, as ELMo does:

where $\alpha_l$ is the softmax-normalized weight for layer $l$ and $\gamma$ is a scalar used to scale the vectors output by the pre-trained model.
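
The ELMo-style mixture referred to above has the form (reconstructed from the definitions of $\alpha_l$ and $\gamma$; verify against the survey):

$$\textbf{r}_t=\gamma\sum_{l=1}^{L}\alpha_l\,\textbf{h}_t^{(l)},\qquad \text{with } g(\textbf{r}_t) \text{ fed to the task-specific model.}$$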

5.2.3 To tune or not to tune?

总共有两种常用的模型迁移方式:feature extraction (where the pre-trained parameters are frozen), and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).

The question is whether the parameters of the selected PTM layers (model 1) are frozen; the task-specific model (model 2) is always trained.

Does BERT fine-tune only the top layer? (an open question in my notes)

5.3 Fine-Tuning Strategies

Two-stage fine-tuning

The first stage fine-tunes on an intermediate task, the second stage on the target task.

Multi-task fine-tuning

multi-task learning and pre-training are complementary technologies.

Fine-tuning with extra adaptation modules

The main drawback of fine-tuning is its parameter inefficiency: every downstream task has its own fine-tuned parameters. Therefore, a better solution is to inject some fine-tunable adaptation modules into PTMs while the original parameters are fixed.
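
A minimal PyTorch-style sketch of such a bottleneck adapter (my own illustration, not code from the survey): a small module with a down- and an up-projection is inserted and trained while the pre-trained weights stay frozen.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter added to a frozen PTM layer; only these weights are fine-tuned."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # residual connection preserves the original PTM representation
        return h + self.up(self.act(self.down(h)))

h = torch.randn(2, 10, 768)        # hidden states from a frozen PTM (batch, seq, hidden)
adapter = Adapter(hidden_size=768)
print(adapter(h).shape)            # torch.Size([2, 10, 768])
```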

Others

self-ensemble, self-distillation, gradual unfreezing, sequential unfreezing

References

https://arxiv.org/pdf/2003.08271v4.pdf


Common LeetCode Patterns

1. Common Algorithms

Divide and conquer, dynamic programming, backtracking, branch and bound, greedy algorithms

2. Clever Use of Data Structures

Plain stacks, monotonic stacks

Queues

Tries

3. Techniques

Two pointers, sliding window, binary search, sorting, fast and slow pointers, modular arithmetic, bit manipulation, recursion, trading space for time (hash tables), DFS/BFS

References

https://zhuanlan.zhihu.com/p/358653377

https://zhuanlan.zhihu.com/p/341176507


A Comprehensive Survey on Graph Neural Networks

there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms.

1. Introduction

Although deep learning techniques capture the hidden patterns of Euclidean data well, many applications are built on graphs, which are non-Euclidean data. The complexity of graph data poses significant challenges to existing techniques: graphs can be irregular, a graph may have a varying number of unordered nodes, and different nodes may have different numbers of neighbors. As a result, basic operations such as convolution do not transfer directly to the graph domain. In addition, a key assumption of many existing machine learning algorithms is that instances are independent of one another, whereas graphs contain rich link information that encodes the interdependence among nodes. Many graph neural network techniques have been developed to address these issues, graph convolution being one example. The figure below compares a traditional 2-D convolution with a graph convolution; the key difference lies in the neighborhood, which is ordered and of fixed size in one case, and unordered and of variable size in the other.

2. Background and Definitions

A. Background

Graph neural networks vs network embedding

The main distinction between GNNs and network embedding is that GNNs are a group of neural network models which are designed for various tasks while network embedding covers various kinds of methods targeting the same task.

Graph neural networks vs graph kernel methods

The difference is that this mapping function of graph kernel methods is deterministic rather than learnable.

GNNs are much more efficient than graph kernel methods.

B. Definitions

The table above lists the notation used in the paper.

1. Graph

$G=(V,E)$ denotes a graph. $N(v)=\{u\in V\mid(v,u)\in E\}$ is the neighborhood of node $v$. $\textbf{A}$ is the adjacency matrix: $A_{ij}=1$ means $e_{ij}\in E$ and $A_{ij}=0$ means $e_{ij} \notin E$. $\textbf{X} \in \mathbb{R}^{n \times d}$ is the node feature matrix and $\textbf{X}^{e} \in \mathbb{R}^{m \times c}$ is the edge feature matrix.

2. Directed Graph

A graph is undirected if and only if the adjacency matrix is symmetric.

3. Spatial-Temporal Graph

A spatial-temporal graph is an attributed graph where the node attributes change dynamically over time.

$G^{(t)}=(V,E,\textbf{X}^{(t)}),\textbf{X}^{(t)} \in \mathbb{R}^{n \times d}$

3. Taxonomy and Frameworks

3.1 GNN Taxonomy

The authors divide GNNs into four categories: RecGNNs, ConvGNNs, GAEs, and STGNNs.

RecGNNs(Recurrent graph neural networks)

RecGNNs aim to learn node representations with recurrent neural architectures. They assume a node in a graph constantly exchanges information message with its neighbors until a stable equilibrium is reached.

ConvGNNs(Convolutional graph neural networks )

The main idea is to generate a node $v$'s representation by aggregating its own features $\textbf{x}_v$ and its neighbors' features $\textbf{x}_u,u\in N(v)$. Different from RecGNNs, ConvGNNs stack multiple graph convolutional layers to extract high-level node representations.

GAEs(Graph autoencoders)

are unsupervised learning frameworks which encode nodes/graphs into a latent vector space and reconstruct graph data from the encoded information. GAEs are used to learn network embeddings and graph generative distributions.

STGNNs(Spatial-temporal graph neural networks)

aim to learn hidden patterns from spatial-temporal graphs. The key idea of STGNNs is to consider spatial dependency and temporal dependency at the same time.

3.2 Frameworks

With the graph structure and node content information as inputs, the outputs of GNNs can focus on different graph analytics tasks with one of the following mechanisms:

Node-level outputs relate to node regression and node classification tasks.

Edge-level outputs relate to the edge classification and link prediction tasks.

Graph-level outputs relate to the graph classification task.

Training Frameworks:

1.Semi-supervised learning for node-level classification

2.Supervised learning for graph-level classification

3.Unsupervised learning for graph embedding

4.RecGNNs

RecGNNs apply the same set of parameters recurrently over the nodes of a graph to extract high-level node representations. Several RecGNN architectures are introduced below.

GNN*

Based on an information diffusion mechanism, GNN* updates nodes’ states by exchanging neighborhood information recurrently until a stable equilibrium is reached.

A node's hidden state is recurrently updated by

$\textbf{h}_v^{(0)}$ is initialized randomly. $f(\cdot)$ is a parametric function and must be a contraction mapping, which shrinks the distance between two points after projecting them into a latent space.

Training alternates between two steps, updating node representations and updating parameters, until the loss converges. When a convergence criterion is satisfied, the last-step node hidden states are forwarded to a readout layer.
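
The recurrent update referred to above is, as given in the survey (reconstructed here from memory, so verify against its equation (1)):

$$\textbf{h}_v^{(t)}=\sum_{u\in N(v)} f\bigl(\textbf{x}_v,\textbf{x}^{e}_{(v,u)},\textbf{x}_u,\textbf{h}_u^{(t-1)}\bigr)$$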

GraphESN

GraphESN uses echo state networks (ESNs) to improve the training efficiency of GNN*. GraphESN consists of an encoder and an output layer. The encoder is randomly initialized and requires no training: it implements a contractive state transition function to recurrently update node states until the global graph state converges. Afterward, the output layer is trained by taking the fixed node states as inputs.

Gated Graph Neural Networks (GGNNs)

The advantage is that the model no longer needs to constrain its parameters to ensure convergence. However, the downside of training by BPTT is that it sacrifices efficiency in both time and memory.

GGNN

GGNN uses a GRU as the recurrent function:

where $\textbf{h}_v^{(0)}=\textbf{x}_v$.

GGNN uses the back-propagation through time (BPTT) algorithm to learn the model parameters.

It is therefore unsuitable for large graphs.

SSE

proposes a learning algorithm that is more scalable to large graphs

where $\alpha$ is a hyperparameter and $\sigma(\cdot)$ is the sigmoid function.

5.ConvGNNs

The main difference between ConvGNNs and RecGNNs is illustrated in the figure above.

ConvGNNs fall into two categories, spectral-based and spatial-based. Spectral-based approaches define graph convolutions by introducing filters from the perspective of graph signal processing [82], where the graph convolutional operation is interpreted as removing noise from graph signals. Spatial-based approaches inherit ideas from RecGNNs and define graph convolutions by information propagation. Spatial-based methods have developed rapidly in recent years due to their attractive efficiency, flexibility, and generality.

5.1 Spectral-based ConvGNNs

5.2 Spatial-based ConvGNNs

A few basic architectures are listed below.

NN4G

where $f(\cdot)$ is an activation function and $\textbf{h}_{v}^{(0)}=0$. In matrix form this can be written as:

DCNN

regards graph convolutions as a diffusion process.

where $f(\cdot)$ is an activation function and $\textbf{P}\in\mathbb{R}^{n\times n}$ is the probability transition matrix, $\textbf{P} = \textbf{D}^{-1}\textbf{A}$.

DCNN concatenates $\textbf{H}^{(1)},\textbf{H}^{(2)},\dots,\textbf{H}^{(K)}$ together as the final model outputs.
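
The diffusion step itself, reconstructed from the survey's description (verify against the paper), is

$$\textbf{H}^{(k)}=f\bigl(\textbf{W}^{(k)}\odot \textbf{P}^{k}\textbf{X}\bigr),\quad k=1,2,\dots,K$$

where $\odot$ denotes element-wise multiplication.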

PGC-DGCNN

MPNN

5.3 Graph Pooling Modules

After a GNN generates node features, we can use them for the final task. But using all these features directly can be computationally challenging, thus, a down-sampling strategy is needed. Depending on the objective and the role it plays in the network, different names are given to this strategy: (1) the pooling operation aims to reduce the size of parameters by down-sampling the nodes to generate smaller representations and thus avoid overfitting, permutation invariance, and computational complexity issues; (2) the readout operation is mainly used to generate graph-level representation based on node representations. Their mechanism is very similar. In this chapter, we use pooling to refer to all kinds of down-sampling strategies applied to GNNs.

mean/max/sum pooling is the most primitive and effective way

$K$ is the index of the last graph convolutional layer.
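
A toy numpy sketch of these readout operations over the final-layer node representations (illustrative data, not from the paper):

```python
# Pool the node representations H_K (one row per node) into one graph-level vector.
import numpy as np

H_K = np.array([[0.2, 1.0],
                [0.4, -0.5],
                [0.9, 0.3]])       # node representations from the last graph conv layer

g_mean = H_K.mean(axis=0)          # mean readout
g_max = H_K.max(axis=0)            # max readout
g_sum = H_K.sum(axis=0)            # sum readout
print(g_mean, g_max, g_sum)
```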

some works [17], [27], [46] also use attention mechanisms to enhance the mean/sum pooling.

[101] propose the Set2Set method to generate a memory that increases with the size of the input.

Other options include SortPooling and DiffPool.

6.GAEs

7.STGNNs

8.APPLICATIONS

References

https://arxiv.org/abs/1901.00596v4


KG-BERT: BERT for Knowledge Graph Completion

Original paper: https://arxiv.org/pdf/1909.03193.pdf

1. Background

Knowledge graphs are generally incomplete. In the figure above, the black arrows denote existing relations and the red dashed lines denote missing ones. Knowledge graph completion infers the missing relations from those already in the graph. Motivated by BERT's success in NLP, the authors transfer it to knowledge graph completion.

2. Architecture

The authors design two fine-tuning variants of KG-BERT that can be applied to different knowledge graph completion tasks.

2.1 Illustrations of fine-tuning KG-BERT for predicting the plausibility of a triple

The input consists of three parts: $Head$, $Relation$, and $Tail$. For example, $Head$ can be "Steven Paul Jobs was an American business magnate, entrepreneur and investor." or "Steve Jobs", $Relation$ can be "founded", and $Tail$ can be "Apple Inc. is an American multinational technology company headquartered in Cupertino, California." or "Apple Inc.". Entities and the relation are separated by $[SEP]$. The input representation is the sum of three embeddings: token, segment, and position embeddings. For segments, entity tokens use segment embedding $e_A$ while relation tokens use $e_B$. For positions, different tokens at the same position share the same position embedding.

For an input triple $\tau=(h,r,t)$, the scoring function is:

The loss is the cross-entropy between $S$ and the labels $y$:

where $y_{\tau}\in \{0,1\}$ is the label.
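
Reconstructed from the paper's description (worth verifying against the original): with $C$ the final hidden vector of $[CLS]$ and $W\in\mathbb{R}^{2\times H}$ the classification weights,

$$s_{\tau}=f(h,r,t)=\sigma\bigl(C W^{\top}\bigr),\qquad \mathcal{L}=-\sum_{\tau\in\mathbb{D}^{+}\cup\mathbb{D}^{-}}\bigl(y_{\tau}\log s_{\tau 0}+(1-y_{\tau})\log s_{\tau 1}\bigr)$$

where $\mathbb{D}^{+}$ and $\mathbb{D}^{-}$ are the positive and negative triple sets.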

Negative samples are constructed by randomly replacing the $Head$ or the $Tail$ of a positive triple with some other entity, as follows:

where $E$ is the set of entities.

2.2 Illustrations of fine-tuning KG-BERT for predicting the relation between two entities

The authors find that predicting the relation directly from the two entities works better than using the two entities plus a random relation (personally, I think a random relation is an inherently wrong feature, so it would obviously hurt the prediction). The differences from the architecture in 2.1 are: 1. the input changes from three parts (two entities and a relation) to two parts (the two entities); 2. the output changes from binary classification to multi-class classification.

The scoring function is:

The loss is the cross-entropy between $S'$ and $y'$:

3. Experiments

setting: We choose pre-trained BERT-Base model with 12 layers, 12 self-attention heads and H = 768 as the initialization of KG-BERT, then fine tune KG-BERT with Adam implemented in BERT.

References

https://github.com/yao8839836/kg-bert

https://zhuanlan.zhihu.com/p/355391327

word2vec

1. Principle

Two training models

  • Using a center word as input to predict its surrounding context is the Skip-gram model.
  • Using a word's context as input to predict the word itself is the CBOW model.

Training tricks

hierarchical softmax and negative sampling
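
In gensim these choices map directly to constructor arguments: `sg` selects Skip-gram vs. CBOW, while `hs` and `negative` select the training trick. A small illustrative configuration (parameter values are arbitrary examples):

```python
# sg=1 -> Skip-gram, sg=0 -> CBOW; hs=1 -> hierarchical softmax, negative>0 -> negative sampling.
from gensim.models import Word2Vec

sentences = [["我", "爱", "自然", "语言", "处理"], ["word2vec", "训练", "示例"]]

skipgram_ns = Word2Vec(sentences, sg=1, hs=0, negative=5, vector_size=100, window=5, min_count=1)
cbow_hs = Word2Vec(sentences, sg=0, hs=1, negative=0, vector_size=100, window=5, min_count=1)
print(skipgram_ns.wv["word2vec"].shape)   # (100,)
```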

2. Code

Training code

from gensim.models.word2vec import Word2Vec
import pandas as pd
import jieba


### train
# data_path / model_path are placeholders; this assumes the CSV has a "text" column of raw sentences
data = pd.read_csv(data_path)
sentences = [list(jieba.cut(doc)) for doc in data["text"].tolist()]
model = Word2Vec()
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=5)
# save in word2vec C binary format so it can be reloaded with load_word2vec_format below
model.wv.save_word2vec_format(model_path, binary=True)

The word vector matrix

from gensim import models

if __name__ == '__main__':
    # model_path is a placeholder for the saved word2vec binary file
    model = models.KeyedVectors.load_word2vec_format(model_path, binary=True)
    print(model.vectors)        # shape: (779845, 400)
    print(model.index_to_key)
    print(model["的"])
array([[-1.3980628e+00, -4.6281612e-01,  5.8368486e-01, ...,         5.3952241e-01,  4.4697687e-01,  1.3505782e+00],       [ 4.9143720e-01, -1.4818899e-01, -2.8366420e-01, ...,         1.1110669e+00,  2.1992767e-01,  7.0457202e-01],       [-8.5650706e-01,  8.2832746e-02, -8.4218192e-01, ...,         2.1654253e+00,  6.4846051e-01, -5.7714492e-01],       ...,       [ 7.5072781e-03, -1.3543828e-02,  2.3101490e-02, ...,         4.2363801e-03, -5.6749382e-03,  6.3404259e-03],       [-2.6244391e-04, -3.0459568e-02,  5.9752418e-03, ...,         1.7844304e-02, -4.7109672e-04,  7.7916058e-03],       [ 7.2062697e-04, -6.5988898e-03,  1.1346856e-02, ...,        -3.7340564e-03, -1.8825980e-02,  2.7245486e-03]], dtype=float32)

[',', '的', '。', '、', '0', '1', '在', '”', '2', '了', '“', '和', '是', '5', ...]

array([ 4.9143720e-01, -1.4818899e-01, -2.8366420e-01, -3.6405793e-01, 1.0851435e-01, 4.9507666e-02, -7.1219063e-01, -5.4614645e-01, -1.3581418e+00, 3.0274218e-01, 6.1700332e-01, 3.5553512e-01, 1.6602433e+00, 7.5298291e-01, -1.4151905e-01, -2.1077128e-01, -2.6325354e-01, 1.6108564e+00, -4.6750236e-01, -1.6261842e+00, 1.3063166e-01, 8.0702168e-01, 4.0011466e-01, 1.2198541e+00, -6.2879241e-01, ... 2.1928079e-01, 7.1725255e-01, -2.3430648e-01, -1.2066336e+00, 9.7590965e-01, -1.5906478e-01, -3.5802779e-01, -3.8005975e-01, 1.9056025e-01, 1.1110669e+00, 2.1992767e-01, 7.0457202e-01], dtype=float32)

References

https://zhuanlan.zhihu.com/p/26306795

https://arxiv.org/abs/1301.3781v3

https://arxiv.org/abs/1405.4053

Neural Graph Matching Networks for Chinese Short Text Matching

https://aclanthology.org/2020.acl-main.547.pdf

1. Abstract

Chinese short text matching is usually done at the word level rather than the character level. However, word segmentation can be erroneous, ambiguous, or inconsistent, which hurts the final matching performance. For example, as shown in the figure below, the character sequence "南京市长江大桥" can express different meanings under different segmentations.

To address this, the authors propose a graph-neural-network-based method for Chinese short text matching. Instead of segmenting each sentence into a single word sequence, all possible segmentation paths are retained to form a word lattice (segment1, segment2, segment3), as shown in the figure above.

2. Problem Definition

The two Chinese short texts to be matched are $S_a=\left \{ C_1^a,C_2^a,\dots,C_{t_a}^a \right \}$ and $S_b=\left \{ C_1^b,C_2^b,\dots,C_{t_b}^b \right \}$, where $C_i^a$ is the $i$-th character of sentence $a$, $C_j^b$ is the $j$-th character of sentence $b$, and $t_a$, $t_b$ are the lengths of the two sentences. $f(S_a,S_b)$ is the target function whose output is the matching degree of the two texts. The word lattice graph is $G=(\nu,\xi)$, where the node set $\nu$ contains all candidate character subsequences (words) and $\xi$ is the edge set: if two vertices $v_i$ and $v_j$ in $\nu$ are adjacent, there is an edge $e_{ij}$. $N_{fw}(v_i)$ denotes the set of all nodes reachable from $v_i$ in the forward direction, and $N_{bw}(v_i)$ the set of all nodes reachable in the backward direction. The lattice graph of sentence $a$ is $G^a(\nu_a,\xi_a)$ and that of sentence $b$ is $G^b(\nu_b,\xi_b)$.

3. Model Architecture

The model has three components: 1. language node representation; 2. neural graph matching; 3. a relevance classifier.

3.1 Language Node Representation

This part is based on BERT. BERT tokenizes at the character level, giving $\left \{ [CLS],C_1^a,C_2^a,\dots,C_{t_a}^a,[SEP],C_1^b,C_2^b,\dots,C_{t_b}^b,[SEP] \right \}$, as shown in the figure above. BERT outputs an embedding for each character: $\left \{\textbf{C}^{CLS},\textbf{C}_1^a,\textbf{C}_2^a,\dots,\textbf{C}_{t_a}^a,\textbf{C}^{SEP},\textbf{C}_1^b,\textbf{C}_2^b,\dots,\textbf{C}_{t_b}^b,\textbf{C}^{SEP} \right \}$.

3.2 Neural Graph Matching

Initialization: suppose node $v_i$ spans $n_i$ consecutive characters starting at position $s_i$, i.e., $\left \{C_{s_i},C_{s_i+1},\dots,C_{s_i+n_i-1} \right \}$, where $v_i$ may be a node of sentence $a$ or $b$. Then $V_i=\sum_{k=0}^{n_i-1}\textbf{U}_{s_i+k}\odot\textbf{C}_{s_i+k}$, where $\odot$ denotes element-wise multiplication. The feature-wise score vector is $\textbf{U}_{s_i+k}=\mathrm{softmax}(FFN(\textbf{C}_{s_i+k}))$, where the $FFN$ has two layers. Let $h_i$ denote the node's vector representation and initialize $h_i^0=V_i$.

Message propagation: at iteration $l$, the messages for a node $v_i$ in $G_a$ consist of the following four parts:

where $\alpha_{ij},\alpha_{ik},\alpha_{im},\alpha_{iq}$ are attention coefficients and $W^{fw},W^{bw}$ are the attention parameters.

Two kinds of messages are then defined: $m_i^{self}\triangleq[m_i^{fw},m_i^{bw}]$ and $m_i^{cross}\triangleq[m_i^{b1},m_i^{b2}]$.

Representation updating: given the two kinds of messages, node $v_i$'s representation is updated as follows.

where $w_k^{cos}$ are parameters and $d_k$ is the multi-perspective cosine distance, which measures the distance between the two kinds of messages, with $k \in \left \{ 1,2,3,\dots,P\right\}$ and $P$ the number of perspectives.

where $\textbf{d}_i\triangleq[d_1,d_2,\dots,d_P]$ and the $FFN$ has two layers.
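
The multi-perspective cosine distance mentioned above has the form (reconstructed from the definitions of $w_k^{cos}$, $m_i^{self}$, and $m_i^{cross}$; verify against the paper):

$$d_k=\cos\bigl(w_k^{cos}\odot m_i^{self},\ w_k^{cos}\odot m_i^{cross}\bigr),\qquad k=1,\dots,P$$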

Graph-level sentence representation

After $L$ iterations (layers), $h_i^L$ is the final representation of node $v_i$ ($h_i^L$ includes not only the information from its reachable nodes but also the information from pairwise comparison with all nodes in the other graph).

Finally, the graph-level representations of the two sentences are

3.3 Classifier

Given $g^a$ and $g^b$, the similarity of the two sentences is produced by a classifier:

where $P \in [0,1]$.

4. Experimental Results

What is the difference between the lattice and the JIEBA+PKU setting?

JIEBA+PKU is a small lattice graph generated by merging two word segmentation results

lattice: the overall lattice, presumably containing all possible segmentations

The two perform similarly because, compared with the tiny graph, the overall lattice has more noisy nodes (i.e., invalid words for the corresponding sentence).

References

https://blog.csdn.net/qq_43390809/article/details/114077216

