2022-01-19 dfd456161c6a73a2b627dc04717f15ad 99+ fast 0.1 k

TextGCN Graph Convolutional Networks for Text Classification

1. Build a single text graph for the corpus based on word co-occurrence and document-word relations.

2. Then learn a Text Graph Convolutional Network (Text GCN) for the corpus. The Text GCN is initialized with one-hot representations for words and documents; it then jointly learns the embeddings for both words and documents, supervised by the known class labels of the documents.
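As a rough illustration of step 1, the sketch below builds one graph whose nodes are documents and words. It uses raw counts as edge weights, which is a simplification: the Text GCN paper weights document-word edges with TF-IDF and word-word edges with PMI. The function name `build_text_graph` and the `window` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_text_graph(docs, window=3):
    """Return (nodes, edges) for a corpus given as lists of tokens.

    Edge weights are raw counts (a simplification of TF-IDF / PMI).
    """
    vocab = sorted({w for doc in docs for w in doc})
    nodes = [f"doc_{i}" for i in range(len(docs))] + vocab
    edges = Counter()
    for i, doc in enumerate(docs):
        # document-word edges: term frequency of the word in the document
        for w, tf in Counter(doc).items():
            edges[(f"doc_{i}", w)] += tf
        # word-word edges: co-occurrence inside a sliding window
        for start in range(max(1, len(doc) - window + 1)):
            span = sorted(set(doc[start:start + window]))
            for a, b in combinations(span, 2):
                edges[(a, b)] += 1
    return nodes, dict(edges)


docs = [["graph", "networks", "for", "text"], ["text", "classification"]]
nodes, edges = build_text_graph(docs, window=2)
```

The resulting edge dictionary can be turned into the adjacency matrix that a GCN layer consumes; the one-hot initialization mentioned in step 2 corresponds to an identity feature matrix over `nodes`.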

TextGCN

2021-09-17 497ee909ce15293540c44e2680c35f52 99+ a minute 0.1 k

HIERARCHICAL TRANSFORMERS FOR LONG DOCUMENT CLASSIFICATION

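The original BERT accepts at most 512 input tokens. To let BERT handle very long texts, the authors propose two strategies at the fine-tuning stage to work around this limit: BERT+LSTM or BERT+Transformer.

Core steps: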

1. Split the input sequence into segments of a fixed size with overlap.

2. For each of these segments, we obtain H or P from the BERT model.

3. We then stack these segment-level representations into a sequence, which serves as input to a small (100-dimensional) LSTM layer. In the second variant, the LSTM recurrent layer is replaced by a small Transformer model.

4. Finally, we use two fully connected layers with ReLU (30-dimensional) and softmax (the same dimensionality as the number of classes) activations to obtain the final predictions.
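Step 1 can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `split_with_overlap` and its parameters `size` and `overlap` are hypothetical.

```python
def split_with_overlap(tokens, size, overlap):
    """Split a token list into fixed-size segments sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + size])
        # stop as soon as the current segment reaches the end of the input
        if start + size >= len(tokens):
            break
    return segments


segments = split_with_overlap(list(range(10)), size=4, overlap=2)
```

Each segment would then be encoded by BERT separately, and the per-segment vectors stacked in order before being passed to the LSTM or Transformer layer.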

NLP Text Classification

Long text

2021-08-23 4e2b48a121bb0f97a9742bd9e3cf6e2c 99+ 6 m 1.0 k

TextCNN TextRNN TextRCNN

1.TextCNN (Convolutional Neural Networks for Sentence Classification)

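Paper: https://arxiv.org/abs/1408.5882

Hyperparameter-tuning paper: https://arxiv.org/abs/1510.03820

The overall structure of the model is shown in the figure above. A feature map is the output obtained by convolving the input (here, the sentence's embedding matrix) with a filter, and a filter is a convolution kernel.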

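Input representation:

Let the input text have length $n$ (shorter texts are padded). Each word is represented by a $k$-dimensional vector, $X_i \in \mathbb{R}^{k}$, so a sentence can be written as

$X_{1:n}=X_1 \oplus X_2\oplus \dots \oplus X_n$

where $\oplus$ denotes vector concatenation and $X_{1:n} \in \mathbb{R}^{nk\times 1}$.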

Convolution:

For a sliding window $X_{i:i+h-1}=\{X_i,X_{i+1},\dots,X_{i+h-1}\}$, applying a convolution kernel $W_j$ gives

$c_{i,j}=f(W_j\cdot X_{i:i+h-1}+b)$

where $f=\tanh(\cdot)$, $W_j\in \mathbb{R}^{1\times hk}$, and $c_{i,j}$ is a scalar.

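Let the number of convolution channels be $m$. In NLP the convolution stride is typically $1$, so the complete feature matrix after the convolution layer is

$C=[[c_{1,1},c_{2,1},\dots,c_{n-h+1,1}]^T,[c_{1,2},c_{2,2},\dots,c_{n-h+1,2}]^T,\dots,[c_{1,m},c_{2,m},\dots,c_{n-h+1,m}]^T]$

where $C \in \mathbb{R}^{(n-h+1)\times m}$.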

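Max pooling:

$\hat{C}=[\max_i c_{i,1},\dots,\max_i c_{i,m}],\ \hat{C}\in \mathbb{R}^{m}$

Fully connected layer:

Finally, $\hat{C}$ is fed into a fully connected layer, which can then be used for classification or regression.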
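The convolution and max-pooling steps can be traced on a toy input. The sketch below is plain Python written for illustration (the helper names `conv_feature` and `text_cnn_features` are hypothetical); it assumes stride 1, a single window size $h$, and $f=\tanh$.

```python
import math


def conv_feature(X, W, b, h):
    """Return [c_1, ..., c_{n-h+1}] with c_i = tanh(W . X_{i:i+h-1} + b)."""
    n = len(X)
    feats = []
    for i in range(n - h + 1):
        # flatten the h word vectors of the window into one h*k vector
        window = [v for row in X[i:i + h] for v in row]
        feats.append(math.tanh(sum(w * v for w, v in zip(W, window)) + b))
    return feats


def text_cnn_features(X, filters, h):
    """Max-pool each filter's feature column, giving one value per filter."""
    return [max(conv_feature(X, W, b, h)) for W, b in filters]


# n = 4 words, k = 2 dimensions, h = 2, m = 2 filters
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [([1.0, 0.0, 0.0, 1.0], 0.0), ([0.0, 0.0, 0.0, 0.0], 0.5)]
C_hat = text_cnn_features(X, filters, h=2)
```

Here each filter produces $n-h+1=3$ values, and `C_hat` has one entry per filter, matching $\hat{C}\in \mathbb{R}^{m}$.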

2.TextRNN (Recurrent Neural Network for Text Classification with Multi-Task Learning)

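Paper: https://www.ijcai.org/Proceedings/16/Papers/408.pdf

The setting of the paper is the one stated in its title: recurrent neural networks for text classification with multi-task learning. The paper proposes three architectures, shown in the figure above; the RNN unit in the figure is an LSTM.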

Model-I: Uniform-Layer Architecture

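For task $m$, the input $\hat{X}_t^{(m)}$ consists of two parts

$\hat{X}_t^{(m)}=X_{t}^{(m)}\oplus X_{t}^{(s)}$

where $X_{t}^{(m)}$ is the task-specific word embedding, $X_{t}^{(s)}$ is the shared word embedding, and $\oplus$ denotes vector concatenation.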

Model-II: Coupled-Layer Architecture

$\hat{c}_t=\tanh(W_cX_t+U_ch_{t-1}) \ \text{(original)} \\\downarrow \\\hat{c}_t^{(m)}=\tanh(W_c^{(m)}X_t+\sum_{i\in\{m,n\}}g^{(i\longrightarrow m)}U_c^{(i\longrightarrow m)}h_{t-1}^{(i)}) \ \text{(now)} \\g^{(i\longrightarrow m)}=\sigma(W_{g}^{(m)}X_t+U_g^{(i)}h_{t-1}^{(i)})$

Model-III: Shared-Layer Architecture

$\hat{c}_t=\tanh(W_cX_t+U_ch_{t-1}) \ \text{(original)} \\\downarrow \\\hat{c}_t^{(m)}=\tanh(W_c^{(m)}X_t+g^{(m)}U_c^{(m)}h_{t-1}^{(m)}+g^{(s\longrightarrow m)}U_c^{(s)}h_{t}^{(s)}) \ \text{(now)} \\g^{(m)}=\sigma(W_{g}^{(m)}X_t+U_g^{(m)}h_{t-1}^{(m)}), \ g^{(s\longrightarrow m)}=\sigma(W_{g}^{(m)}X_t+U_g^{(s\longrightarrow m)}h_{t}^{(s)}), \ h_t^{(s)}=\overrightarrow{h_t^{(s)}}\oplus\overleftarrow{h_t^{(s)}}$

3.TextRCNN(Recurrent Convolutional Neural Networks for Text Classification)

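Paper: https://www.deeplearningitalia.com/wp-content/uploads/2018/03/Recurrent-Convolutional-Neural-Networks-for-Text-Classification.pdf

The overall structure is shown in the figure above. Why is it called RCNN? A typical CNN consists of convolutional layers followed by pooling layers; here the convolutional layer is replaced by a bidirectional RNN, so the result is a bidirectional RNN followed by a pooling layer. In the authors' words: From the perspective of convolutional neural networks, the recurrent structure we previously mentioned is the convolutional layer.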

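Word representation

A word $w_i$ is represented by the triple

$x_i=[c_l(w_i);e(w_i);c_r(w_i)]$

where $e(w_i)$ is the word embedding of $w_i$, $c_l(w_i)$ is the vector representation of the context to the left of $w_i$ in the sentence, and $c_r(w_i)$ is the vector representation of the context to its right. They are defined as follows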

$c_l(w_i)=f(W^{(l)}c_l(w_{i-1})+W^{(sl)}e(w_{i-1})) \\c_r(w_i)=f(W^{(r)}c_r(w_{i+1})+W^{(sr)}e(w_{i+1}))$

Then $x_i$ is passed through a fully connected layer to obtain $y_i^{(2)}$, which is a latent semantic vector

$y_i^{(2)}=tanh(W^{(2)}x_i+b^{(2)})$

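Sentence representation

Once all word representations have been computed, the sentence representation is obtained by element-wise max-pooling over the positions

$y^{(3)}=\mathop{\max}_{i=1}^{n}y_i^{(2)}$

followed by a fully connected layer and softmax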

$y^{(4)}=W^{(4)}y^{(3)}+b^{(4)} \\p=softmax(y^{(4)})$
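The left and right context recurrences of the word representation can be sketched in plain Python. This is only an illustration under stated assumptions: the boundary contexts are initialized to zero vectors, $f$ is taken to be $\tanh$, and the helper names `matvec` and `rcnn_word_features` are hypothetical. The later layers ($y^{(2)}$ to $y^{(4)}$) are omitted.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rcnn_word_features(E, W_l, W_sl, W_r, W_sr):
    """Return x_i = [c_l(w_i); e(w_i); c_r(w_i)] for every position i."""
    n, c = len(E), len(W_l)
    c_l = [[0.0] * c for _ in range(n)]  # assumed zero context at the left edge
    c_r = [[0.0] * c for _ in range(n)]  # assumed zero context at the right edge
    for i in range(1, n):  # left context scans left to right
        a, b = matvec(W_l, c_l[i - 1]), matvec(W_sl, E[i - 1])
        c_l[i] = [math.tanh(p + q) for p, q in zip(a, b)]
    for i in range(n - 2, -1, -1):  # right context scans right to left
        a, b = matvec(W_r, c_r[i + 1]), matvec(W_sr, E[i + 1])
        c_r[i] = [math.tanh(p + q) for p, q in zip(a, b)]
    return [c_l[i] + E[i] + c_r[i] for i in range(n)]


# n = 3 words, embedding size 2, context size 1
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = rcnn_word_features(E, [[0.5]], [[1.0, 0.0]], [[0.5]], [[0.0, 1.0]])
```

Each `x[i]` has length context + embedding + context, and the two loops make explicit that the left context depends on position $i-1$ while the right context depends on position $i+1$.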

Reference

https://www.cnblogs.com/wangduo/p/6773601.html

NLP Text Classification

Text classification

2021-07-19 4aed935a7f1d83aa4dfd2aabd7f6373b 99+ 9 m 1.4 k

fasttext

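1. Text classification

1.1 n-gram

Because bag of words ignores word order, bag of n-grams is introduced. For English, character n-grams are taken inside a word and used for word vectors, while word n-grams span neighbouring words and are used for classification. For Chinese, n-grams can be built at word granularity or at character granularity.

As an example, take sentence A, "今天天气真不错" ("The weather is really nice today"). Using word granularity, it is first segmented into ["今天", "天气", "真", "不错"].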

uni-gram: 今天, 天气, 真, 不错

2-gram: 今天/天气, 天气/真, 真/不错

3-gram: 今天/天气/真, 天气/真/不错
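The n-gram lists for the segmented sentence can be generated with a one-line helper. The name `word_ngrams` is hypothetical, and the "/" separator simply mirrors the notation used in this post.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined with '/'."""
    return ["/".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["今天", "天气", "真", "不错"]
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
```

A sentence with $T$ tokens yields $T-n+1$ n-grams, so the number of distinct n-grams over a corpus grows much faster than the vocabulary.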

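Because there are far more n-grams than words, storing every n-gram is impractical. FastText therefore uses the hashing trick, as shown in the figure below:

Hashing keeps lookups at O(1) while bounding memory at O(buckets * dim). The potential drawback is hash collisions: different n-grams may share the same embedding. If buckets is large enough, the effect is small.

The code is as follows: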

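import os
import pickle as pkl

from tqdm import tqdm

# Constants used below; the values are assumed here, adjust them to your setup.
MAX_VOCAB_SIZE = 10000
UNK, PAD = '<UNK>', '<PAD>'


def build_dataset(config, use_word):
    # build_vocab is a helper defined elsewhere (not shown in this post).
    if use_word:
        tokenizer = lambda x: x.split(' ')  # word-level
    else:
        tokenizer = lambda x: [y for y in x]  # char-level
    # Reuse the cached vocabulary if it exists, otherwise build and cache it.
    if os.path.exists(config.vocab_path):
        with open(config.vocab_path, 'rb') as f:
            vocab = pkl.load(f)
    else:
        vocab = build_vocab(config.train_path, tokenizer=tokenizer, max_size=MAX_VOCAB_SIZE, min_freq=1)
        with open(config.vocab_path, 'wb') as f:
            pkl.dump(vocab, f)
    print(f"Vocab size: {len(vocab)}")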

    def biGramHash(sequence, t, buckets):
        t1 = sequence[t - 1] if t - 1 >= 0 else 0
        return (t1 * 14918087) % buckets

    def triGramHash(sequence, t, buckets):
        t1 = sequence[t - 1] if t - 1 >= 0 else 0
        t2 = sequence[t - 2] if t - 2 >= 0 else 0
        return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets

    def load_dataset(path, pad_size=32):
        contents = []
        with open(path, 'r', encoding='UTF-8') as f:
            for line in tqdm(f):
                lin = line.strip()
                if not lin:
                    continue
                content, label = lin.split('\t')
                words_line = []
                token = tokenizer(content)
                seq_len = len(token)
                if pad_size:
                    if len(token) < pad_size:
                        token.extend([PAD] * (pad_size - len(token)))
                    else:
                        token = token[:pad_size]
                        seq_len = pad_size
                # word to id
                for word in token:
                    words_line.append(vocab.get(word, vocab.get(UNK)))

                # fasttext ngram
                buckets = config.n_gram_vocab
                bigram = []
                trigram = []
                # ------ngram------
                for i in range(pad_size):
                    bigram.append(biGramHash(words_line, i, buckets))
                    trigram.append(triGramHash(words_line, i, buckets))
                # -----------------
                contents.append((words_line, int(label), seq_len, bigram, trigram))
        return contents  # [(word_ids, label, seq_len, bigram_ids, trigram_ids), ...]

    train = load_dataset(config.train_path, config.pad_size)
    dev = load_dataset(config.dev_path, config.pad_size)
    test = load_dataset(config.test_path, config.pad_size)
    return vocab, train, dev, test
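The two hash helpers can be tried in isolation. The sketch below restates them at module level, unchanged, and feeds them a made-up sequence of token ids; the ids and bucket counts are illustrative only.

```python
def biGramHash(sequence, t, buckets):
    # bucket derived from the id of the token preceding position t
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets


def triGramHash(sequence, t, buckets):
    # bucket derived from the ids of the two tokens preceding position t
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    t2 = sequence[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets


ids = [3, 5, 7]  # hypothetical token ids
bigram_bucket = biGramHash(ids, 1, 100)
trigram_bucket = triGramHash(ids, 2, 100)

# With very few buckets, different ids collide and share one embedding row.
collision = biGramHash([1, 0], 1, 10) == biGramHash([11, 0], 1, 10)
```

With a small `buckets` value collisions are frequent, which is the trade-off described above: memory stays bounded, but unrelated n-grams may end up with the same embedding.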

1.2 Network structure

fasttext

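The model structure is very similar to the CBOW model of word2vec.

Input layer: as an example, for the input text "今天天气真不错", the word-level 2-grams give

$x_2=\begin{bmatrix} emb_{今天/天气},\ emb_{天气/真},\ emb_{真/不错} \end{bmatrix}$, where $emb$ is the embedding matrix.

The inputs $x_{1},x_{2},\dots,x_{N}$ are finally passed to the hidden layer as $mean(\begin{bmatrix}x_1 \\ x_2 \\ \dots \\ x_N \end{bmatrix})$, where $mean$ averages over each column of $x$.

Hidden layer: a linear layer with ReLU as the activation function

Output layer: a simple linear layer

Code: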

import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.embedding_ngram2 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.embedding_ngram3 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.dropout = nn.Dropout(config.dropout)
        self.fc1 = nn.Linear(config.embed * 3, config.hidden_size)
        # self.dropout2 = nn.Dropout(config.dropout)
        self.fc2 = nn.Linear(config.hidden_size, config.num_classes)

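    def forward(self, x):
        # x[0], x[2], x[3]: token ids, bigram bucket ids, trigram bucket ids
        out_word = self.embedding(x[0])
        out_bigram = self.embedding_ngram2(x[2])
        out_trigram = self.embedding_ngram3(x[3])
        # concatenate along the feature axis: (batch, seq_len, embed * 3)
        out = torch.cat((out_word, out_bigram, out_trigram), -1)

        # average over the sequence axis: (batch, embed * 3)
        out = out.mean(dim=1)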
        out = self.dropout(out)
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out
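To see why `fc1` takes `config.embed * 3` inputs, the concatenate-then-average step of `forward` can be mimicked on a single sample with plain lists. The helper names and the numbers below are made up for illustration.

```python
def concat_features(word, bigram, trigram):
    """Concatenate the three embeddings position by position (last axis)."""
    return [w + b + t for w, b, t in zip(word, bigram, trigram)]


def mean_pool(seq):
    """Average a list of equal-length vectors over the sequence axis."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


# seq_len = 2, embed = 2, so every position has 3 * embed = 6 features
word = [[1.0, 2.0], [3.0, 4.0]]
bigram = [[0.0, 0.0], [2.0, 2.0]]
trigram = [[1.0, 1.0], [1.0, 1.0]]
pooled = mean_pool(concat_features(word, bigram, trigram))
```

The pooled vector has length `3 * embed` regardless of the sequence length, which is what lets a fixed-size linear layer follow it.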

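1.3 Hierarchical softmax

For classification, the network output has to go through softmax to become a probability distribution before the cross-entropy loss can be computed.

The plain softmax is relatively expensive, with time complexity $O(Kd)$; hierarchical softmax brings this down to $O(d\log K)$, where $K$ is the number of classes and $d$ is the vector dimension.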

1.3.1 Plain softmax

Let the output be $Y_{pred}=[y_1,y_2,\dots,y_K]$; then $P_{y_i}$ is

$P_{y_i}=\frac{e^{y_i}}{\sum_{j=1}^Ke^{y_j}}$

Each $y_i$ is computed from a $d$-dimensional vector, so the formula shows that the cost is $O(Kd)$.
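A minimal sketch of the plain softmax makes the $O(Kd)$ cost visible: each of the $K$ logits needs a $d$-term dot product before normalization. The names `logits_from_hidden` and `softmax` and the numbers are illustrative; subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import math


def logits_from_hidden(W, h):
    """K dot products of length d: this is the O(Kd) part."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def softmax(logits):
    """Turn logits into probabilities; the max is subtracted for stability."""
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 classes, d = 2
h = [0.5, -0.5]
probs = softmax(logits_from_hidden(W, h))
```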

1.3.2 分层softmax

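For Huffman trees, see https://zhuanlan.zhihu.com/p/154356949

Why a Huffman tree, and why would an ordinary tree not do?

The core idea of hierarchical softmax is to build a Huffman tree from the training samples, as follows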

fasttext

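The tree is built from how often each class appears in the samples: the more frequent a class, the closer its node is to the root. The $K$ classes form the leaf nodes, and the $K-1$ internal nodes carry the parameters. The nodes and edges traversed from the root to a leaf $y_i$ form a path; let $L_{y_i}$ be the path length and $n(y_i,j)$ the $j$-th node on the path, so that

$P_{y_i}=\prod \limits_{j=1}^{L_{y_i}}P_{(n(y_{i},j),left\ or\ right)} \\=\prod \limits_{j=1}^{L_{y_i}}\sigma(f(n(y_i,j+1)==LC(n(y_i,j)))\,{\theta_{n(y_i,j)}^T} Y)$

where $n(y_i,1)$ is the root, $n(y_i,L_{y_i}+1)$ is the leaf $y_i$ itself, $LC(n(y_i,j))$ is the left child of $n(y_i,j)$, $Y$ is the vector fed into the output layer, $\sigma$ is the sigmoid function, and $f(m)=\begin{cases} 1 & \text{if } m \text{ is true} \\ -1 & \text{otherwise} \end{cases}$

The formula shows that the time complexity drops to $O(d\log K)$.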

Taking $y_2$ in the figure as an example:

$P_{y_2}=P_{(n(y_{2},1),left)}\cdot P_{(n(y_{2},2),left)}\cdot P_{(n(y_{2},3),right)} \\=\sigma({\theta_{n(y_2,1)}^T} Y)\cdot \sigma({\theta_{n(y_2,2)}^T} Y) \cdot \sigma({-\theta_{n(y_2,3)}^T} Y)$

Walking from the root to the leaf $y_2$ therefore amounts to performing three logistic regressions.
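The product of sigmoids can be checked numerically on a small tree. The tree below is hypothetical (the figure is not reproduced here); it is only chosen so that the path to `y2` is left, left, right as in the example. Each path entry is `(internal_node, sign)` with sign $+1$ for a left branch and $-1$ for a right branch, and the parameter values are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaf_probability(path, theta, Y):
    """Multiply one logistic-regression probability per internal node."""
    p = 1.0
    for node, sign in path:
        score = sum(t * y for t, y in zip(theta[node], Y))
        p *= sigmoid(sign * score)
    return p


# K = 4 leaves, K - 1 = 3 internal nodes (hypothetical tree)
paths = {
    "y1": [("n1", 1), ("n2", 1), ("n3", 1)],
    "y2": [("n1", 1), ("n2", 1), ("n3", -1)],
    "y3": [("n1", 1), ("n2", -1)],
    "y4": [("n1", -1)],
}
theta = {"n1": [0.2, -0.1], "n2": [0.4, 0.3], "n3": [-0.5, 0.1]}
Y = [1.0, 2.0]
probs = {leaf: leaf_probability(path, theta, Y) for leaf, path in paths.items()}
```

Because $\sigma(x)+\sigma(-x)=1$ at every internal node, the leaf probabilities sum to 1 without an explicit normalization over all $K$ classes, and frequent classes placed near the root need fewer multiplications.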