Transformer-XL Attentive Language Models Beyond a Fixed-Length Context

https://arxiv.org/abs/1901.02860

Transformers have the potential to learn longer-term dependency, but in the language modeling setting they are limited to a fixed-length context (memory and computation constraints keep the context length from growing very large). The paper proposes a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence.

3 Model

3.1 Vanilla Transformer Language Models

Problem: in order to apply the Transformer or self-attention to language modeling, the central question is how to train a Transformer to effectively encode an arbitrarily long context into a fixed-size representation. The usual workaround is the vanilla model: split the long text into fixed-length segments and train on each segment independently, as shown in Fig. 1a.

During training, there are two critical limitations of using a fixed-length context. First, the largest possible dependency length is upper bounded by the segment length. Second, simply chunking a sequence into fixed-length segments leads to the context fragmentation problem.

During evaluation, the vanilla model consumes a segment of the same length as in training at each step, but makes only one prediction at the last position and then slides the segment right by one position. As shown in Fig. 1b, this procedure ensures that each prediction utilizes the longest possible context exposed during training, and it also relieves the context fragmentation issue encountered in training. However, this evaluation procedure is extremely expensive.

3.2 Segment-Level Recurrence with State Reuse

Transformer-XL introduces a recurrence mechanism into the Transformer architecture.

Variable definitions: let $s_\tau = [x_{\tau,1}, \dots, x_{\tau,L}]$ and $s_{\tau+1} = [x_{\tau+1,1}, \dots, x_{\tau+1,L}]$ be two consecutive segments of length $L$, and let $\mathbf{h}_\tau^n \in \mathbb{R}^{L \times d}$ denote the $n$-th layer hidden state sequence produced for segment $s_\tau$, where $d$ is the hidden dimension.

The transformation that produces the $n$-th layer states for segment $s_{\tau+1}$ is given by the equations below, where SG($\cdot$) stands for stop-gradient and $\circ$ denotes concatenation of two hidden sequences along the length dimension.
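Transcribing the segment-level recurrence from Section 3.2 of the paper:

$$
\begin{aligned}
\tilde{\mathbf{h}}_{\tau+1}^{n-1} &= \left[\operatorname{SG}(\mathbf{h}_{\tau}^{n-1}) \circ \mathbf{h}_{\tau+1}^{n-1}\right],\\
\mathbf{q}_{\tau+1}^{n},\ \mathbf{k}_{\tau+1}^{n},\ \mathbf{v}_{\tau+1}^{n} &= \mathbf{h}_{\tau+1}^{n-1}\mathbf{W}_{q}^{\top},\ \tilde{\mathbf{h}}_{\tau+1}^{n-1}\mathbf{W}_{k}^{\top},\ \tilde{\mathbf{h}}_{\tau+1}^{n-1}\mathbf{W}_{v}^{\top},\\
\mathbf{h}_{\tau+1}^{n} &= \text{Transformer-Layer}\left(\mathbf{q}_{\tau+1}^{n},\ \mathbf{k}_{\tau+1}^{n},\ \mathbf{v}_{\tau+1}^{n}\right).
\end{aligned}
$$

Note that the queries are computed from the current segment only, while the keys and values are computed over the extended context.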

The concrete procedure is illustrated in the figure below.

During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment, as shown in Fig. 2a.

During evaluation, the representations from the previous segments can be reused instead of being computed from scratch as in the vanilla model.
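A minimal runnable sketch of this state reuse with a single attention layer (the layer sizes and tensor names below are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

d_model, n_heads, seg_len = 64, 4, 8
attn = nn.MultiheadAttention(d_model, n_heads)   # expects (length, batch, d_model)

h_prev = torch.randn(seg_len, 1, d_model)        # cached hidden states of segment tau
h_curr = torch.randn(seg_len, 1, d_model)        # hidden states of segment tau+1

mem = h_prev.detach()                            # SG(.): stop gradient through the cache
h_tilde = torch.cat([mem, h_curr], dim=0)        # [. o .]: extended context along the length dim

# queries come from the current segment only; keys/values cover the extended context
out, _ = attn(query=h_curr, key=h_tilde, value=h_tilde)
print(out.shape)                                 # torch.Size([8, 1, 64])
```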

3.3 Relative Positional Encodings

How can we keep the positional information coherent when we reuse the states? If we simply keep the original absolute positional encoding scheme, the hidden state sequences would be computed as follows:
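Transcribing the paper's formulation, where $\mathbf{E}_{s_\tau} \in \mathbb{R}^{L\times d}$ is the word embedding sequence of $s_\tau$, $\mathbf{U}_{1:L}$ the absolute positional encoding, and $f$ the transformation function:

$$
\mathbf{h}_{\tau+1} = f\left(\mathbf{h}_{\tau},\ \mathbf{E}_{s_{\tau+1}} + \mathbf{U}_{1:L}\right), \qquad
\mathbf{h}_{\tau} = f\left(\mathbf{h}_{\tau-1},\ \mathbf{E}_{s_{\tau}} + \mathbf{U}_{1:L}\right).
$$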

This scheme is problematic: both $\mathbf{E}_{s_\tau}$ and $\mathbf{E}_{s_{\tau+1}}$ are associated with the same positional encoding $\mathbf{U}_{1:L}$, so the model has no information to distinguish the positional difference between $x_{\tau,j}$ and $x_{\tau+1,j}$ for any $j$.

To solve this problem, the paper encodes relative positional information instead.

In the standard Transformer with absolute positional encodings, the attention score between query position $i$ and key position $j$ decomposes into four terms:
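From the paper, expanding $\mathbf{q}_i^{\top}\mathbf{k}_j$, where $\mathbf{E}_{x_i}$ is the word embedding of token $x_i$ and $\mathbf{U}_i$ its absolute positional encoding:

$$
\begin{aligned}
A_{i,j}^{\mathrm{abs}} ={}& \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{E}_{x_j}}_{(a)} + \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{U}_j}_{(b)} \\
&+ \underbrace{\mathbf{U}_i^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{E}_{x_j}}_{(c)} + \underbrace{\mathbf{U}_i^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{U}_j}_{(d)}
\end{aligned}
$$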

Transformer-XL re-parameterizes the four terms using relative positional encodings:
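The paper's relative re-parameterization:

$$
\begin{aligned}
A_{i,j}^{\mathrm{rel}} ={}& \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,E}\mathbf{E}_{x_j}}_{(a)} + \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,R}\mathbf{R}_{i-j}}_{(b)} \\
&+ \underbrace{u^{\top}\mathbf{W}_{k,E}\mathbf{E}_{x_j}}_{(c)} + \underbrace{v^{\top}\mathbf{W}_{k,R}\mathbf{R}_{i-j}}_{(d)}
\end{aligned}
$$

Compared with $A^{\mathrm{abs}}$: the absolute encoding $\mathbf{U}_j$ is replaced by the relative sinusoid encoding $\mathbf{R}_{i-j}$ in terms (b) and (d); the query-side positional terms $\mathbf{U}_i^{\top}\mathbf{W}_q^{\top}$ are replaced by two trainable vectors $u, v \in \mathbb{R}^{d}$ in terms (c) and (d); and the key projection is split into $\mathbf{W}_{k,E}$ for content-based keys and $\mathbf{W}_{k,R}$ for location-based keys.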

3.4 Complete Algorithm

Bridging the Gap Between Relevance Matching and Semantic Matching for Short Text Similarity Modeling

https://cs.uwaterloo.ca/~jimmylin/publications/Rao_etal_EMNLP2019.pdf

2 HCAN: Hybrid Co-Attention Network

three major components: (1) a hybrid encoder (2) a relevance matching module (3) a semantic matching module

2.1 Hybrid Encoders

The hybrid encoder module explores three types of encoders: deep, wide, and contextual.

Given query words $\{w_1^q, w_2^q, \ldots, w_n^q\}$ and context words $\{w_1^c, w_2^c, \ldots, w_m^c\}$, their embedding representations are $\mathbf{Q}\in \mathbb{R}^{n\times L}$ and $\mathbf{C}\in \mathbb{R}^{m\times L}$, where $L$ is the embedding dimension.

Deep Encoder

Here $\mathbf{U}$ denotes either $\mathbf{Q}$ or $\mathbf{C}$; the deep encoder stacks multiple convolutional layers, each taking the output of the previous layer as input.

Wide Encoder

Unlike the deep encoder, which stacks multiple convolutional layers hierarchically, the wide encoder organizes convolutional layers in parallel, with each convolutional layer having a different window size $k$ (a sketch contrasting the two is given below).
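A minimal sketch of the two encoder variants, assuming PyTorch and illustrative layer sizes (the actual HCAN hyperparameters may differ):

```python
import torch
import torch.nn as nn

class DeepEncoder(nn.Module):
    """Stacks several convolutional layers hierarchically, all with the same window size k."""
    def __init__(self, emb_dim=300, num_filters=128, k=3, num_layers=4):
        super().__init__()
        layers, in_ch = [], emb_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_ch, num_filters, kernel_size=k, padding=k // 2), nn.ReLU()]
            in_ch = num_filters
        self.convs = nn.Sequential(*layers)

    def forward(self, x):                      # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, channels, seq_len)
        return self.convs(x).transpose(1, 2)   # (batch, seq_len, num_filters)

class WideEncoder(nn.Module):
    """Runs convolutional layers in parallel, each with a different window size."""
    def __init__(self, emb_dim=300, num_filters=128, window_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, kernel_size=k, padding=k // 2)
            for k in window_sizes
        )

    def forward(self, x):                      # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)
        # one output per window size; output lengths may differ slightly depending on padding
        return [torch.relu(conv(x)).transpose(1, 2) for conv in self.convs]
```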

Contextual Encoder

2.2 Relevance Matching

2.3 Semantic Matching

2.4 Final Classification

NER

Named Entity Recognition (NER)

The task aims to extract named entities from text, such as person names, place names, and organization names.

Categories

Text data annotation

Why annotate? Put simply, the annotations are the labels used for supervised training.

https://blog.csdn.net/scgaliguodong123_/article/details/121303421

An example:

BIO three-way sequence labeling scheme (B-begin, I-inside, O-outside)
B-X marks the beginning of an entity of type X, where X is one of PER (person), ORG (organization), or LOC (location)
I-X marks the middle or end of an entity of type X
O marks tokens that do not belong to any entity type
Example (character-level tagging of the Chinese sentence "我是李果冻,我爱中国,我来自四川。"):

我 O
是 O
李 B-PER
果 I-PER
冻 I-PER
, O
我 O
爱 O
中 B-ORG
国 I-ORG
, O
我 O
来 O
自 O
四 B-LOC
川 I-LOC
。 O
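As a complement, a small self-contained sketch (plain Python, with a hypothetical helper name `bio_to_spans`) that reads a BIO-tagged character sequence back into entity spans, applied to the example above:

```python
def bio_to_spans(tokens, tags):
    """Convert parallel lists of tokens and BIO tags into (entity_text, type, start, end) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:                                  # close any open entity
                spans.append(("".join(tokens[start:i]), etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                                               # extend the current entity
        else:                                                      # "O" or an inconsistent I- tag
            if start is not None:
                spans.append(("".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
    if start is not None:                                          # flush an entity ending at EOS
        spans.append(("".join(tokens[start:]), etype, start, len(tokens)))
    return spans

tokens = list("我是李果冻,我爱中国,我来自四川。")
tags = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG",
        "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('李果冻', 'PER', 2, 5), ('中国', 'ORG', 8, 10), ('四川', 'LOC', 14, 16)]
```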

References

https://www.cnblogs.com/huangyc/p/10064853.html

https://www.cnblogs.com/YoungF/p/13488220.html

https://tech.meituan.com/2020/07/23/ner-in-meituan-nlp.html

https://zhuanlan.zhihu.com/p/156914795

https://blog.csdn.net/scgaliguodong123_/article/details/121303421

Answer Selection

1 Problem definition

Given a question and a set of candidate sentences, the task is to identify candidate sentences that contain the correct answer to the question. From the definition, the problem can be formulated as a ranking problem, where the goal is to give better rank to the candidate sentences that are relevant to the question.
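Purely as an illustration of this ranking formulation (not any specific model from the references below), the following sketch scores each candidate with a naive word-overlap function and sorts by score; in practice the scoring function would be a neural ranker such as a BERT-based model:

```python
def overlap_score(question: str, candidate: str) -> float:
    """Naive relevance score: fraction of question words that appear in the candidate."""
    q, c = set(question.lower().split()), set(candidate.lower().split())
    return len(q & c) / (len(q) or 1)

def rank_candidates(question, candidates):
    """Return candidates sorted from most to least relevant."""
    scored = [(cand, overlap_score(question, cand)) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

question = "who wrote the origin of species"
candidates = [
    "Charles Darwin wrote On the Origin of Species in 1859.",
    "The book was an immediate bestseller.",
]
print(rank_candidates(question, candidates))
```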

2 Blog post:

https://zhuanlan.zhihu.com/p/39920446

3 A Review on Deep Learning Techniques Applied to Answer Selection

https://aclanthology.org/C18-1181.pdf

4 BAS: An Answer Selection Method Using BERT Language Model

https://arxiv.org/ftp/arxiv/papers/1911/1911.01528.pdf

Weight Initialization

Why shouldn't the initial weights all be zero (or all the same value)?

If we initialize the weights of a neural network to zero or to identical values, all neurons in the same layer will produce the same output and receive the same gradient during backpropagation. Every neuron in the layer then stays identical, which is equivalent to having only a single neuron.

Three commonly used weight initialization methods

Random initialization, Xavier initialization, and He initialization; a short sketch of the three is given below.
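A minimal sketch of the three schemes using PyTorch's built-in initializers on a fully connected layer (the layer sizes are arbitrary):

```python
import torch.nn as nn

linear = nn.Linear(256, 128)

# 1. Plain random initialization: small values drawn from a normal distribution
nn.init.normal_(linear.weight, mean=0.0, std=0.01)

# 2. Xavier (Glorot) initialization: variance scaled by fan_in + fan_out,
#    intended for tanh/sigmoid activations
nn.init.xavier_uniform_(linear.weight)

# 3. He (Kaiming) initialization: variance scaled by fan_in,
#    intended for ReLU activations
nn.init.kaiming_normal_(linear.weight, nonlinearity='relu')

# Biases are commonly set to zero
nn.init.zeros_(linear.bias)
```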

References

https://mdnice.com/writing/6fe7dfe1954945d180d6b36562658af8

https://m.ofweek.com/ai/2021-06/ART-201700-11000-30502442.html

https://blog.csdn.net/qq_15505637/article/details/79362970

