import os
import pickle

import joblib
import pandas as pd


class TrieNode(object):
    def __init__(self):
        """Initialize the trie node."""
        self.data = {}        # children keyed by letter (e.g. pinyin characters)
        self.data1 = {}       # Chinese words ending at this node, mapped to their frequency
        self.is_word = False  # marks whether this node ends a complete key


class Trie(object):
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, word1):
        """Inserts the key `word` into the trie and counts the associated Chinese word `word1`.

        :type word: str
        :rtype: None
        """
        node = self.root
        for letter in word:
            child = node.data.get(letter)
            if not child:
                node.data[letter] = TrieNode()
            node = node.data[letter]
        node.is_word = True
        if word1 not in node.data1:
            node.data1[word1] = 1
        else:
            node.data1[word1] += 1

    def search(self, word):
        """Returns whether the word is in the trie.

        :type word: str
        :rtype: bool
        """
        node = self.root
        for letter in word:
            node = node.data.get(letter)
            if not node:
                return False
        return node.is_word

    def starts_with(self, prefix):
        """Returns whether any word in the trie starts with the given prefix.

        :type prefix: str
        :rtype: bool
        """
        node = self.root
        for letter in prefix:
            node = node.data.get(letter)
            if not node:
                return False
        return True

    def get_start(self, prefix):
        """Returns the words that start with `prefix`.

        :param prefix:
        :return: words (list)
        """
        def _get_key(pre, pre_node):
            words_list = []
            if pre_node.is_word:
                words_list.append([pre, pre_node.data1])
            for x in pre_node.data.keys():
                words_list.extend(_get_key(pre + str(x), pre_node.data.get(x)))
            return words_list

        words = []
        if not self.starts_with(prefix):
            return words
        # if self.search(prefix):
        #     words.append(prefix)
        #     return words
        node = self.root
        for letter in prefix:
            node = node.data.get(letter)
        return _get_key(prefix, node)

    def find_result(self, string):
        """Returns the most frequent word stored under the key `string`."""
        result = self.get_start(string)
        result = sort_by_value(result[0][1])
        result.reverse()
        return result[0]


def sort_by_value(d):
    return sorted(d.items(), key=lambda k: k[1])  # k[1] is the dictionary value


def build_tree(data, save_path):
    trie = Trie()
    for element in data.values:
        trie.insert(element[0], element[1])
    joblib.dump(trie, save_path)
    return
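A minimal usage sketch of the trie above. It assumes the input is a pandas DataFrame whose first column holds the keys (e.g. pinyin) and whose second column holds the associated Chinese words; the column names and sample rows below are made up for illustration.

```python
# Hypothetical sample data: the real input format is assumed, not documented above.
df = pd.DataFrame([['nihao', '你好'], ['nihao', '尼耗'], ['nihao', '你好']],
                  columns=['pinyin', 'word'])

build_tree(df, 'trie.pkl')          # build the trie and persist it with joblib
trie = joblib.load('trie.pkl')      # reload it later

print(trie.get_start('ni'))         # -> [['nihao', {'你好': 2, '尼耗': 1}]]
print(trie.find_result('nihao'))    # -> ('你好', 2), the most frequent candidate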
In the general case, tags fall into two broad categories: base tags $B$ and add tags $P$. A base tag either $KEEP$s or $DELETE$s the current token; an add tag inserts a phrase, drawn from a vocabulary $V$, in front of the token. In practice the two are combined into a single tag $^{P}B$, so the total number of tags is roughly the number of base tags times the number of phrases, i.e. $2|V|$. Task-specific tags can also be introduced; for sentence fusion, for example, a $SWAP$ tag can be added, as shown in the figure below.
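To make the $2|V|$ count concrete, here is a small illustrative sketch (not from the paper's code) that enumerates the combined tags from the two base tags and a toy phrase vocabulary; the phrase strings are invented for the example.

```python
# Toy example: base tags B combined with an added-phrase vocabulary V.
base_tags = ['KEEP', 'DELETE']
phrase_vocabulary = ['', 'and', 'what is', 'because']  # '' means "add nothing"

combined_tags = [b if p == '' else '{}|{}'.format(b, p)
                 for b in base_tags
                 for p in phrase_vocabulary]

print(combined_tags)
# ['KEEP', 'KEEP|and', 'KEEP|what is', 'KEEP|because',
#  'DELETE', 'DELETE|and', 'DELETE|what is', 'DELETE|because']
print(len(combined_tags))  # len(base_tags) * len(phrase_vocabulary), i.e. about 2|V|
```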
4.1.1 Constructing the vocabulary $V$
Construction objectives:
minimize the vocabulary size;
maximize the proportion of target words covered.
Limiting the number of phrases in the vocabulary reduces the number of output decisions the model has to make; maximizing the coverage of target words prevents the model from adding irrelevant words.
Construction procedure:
Using the $LCS$ algorithm (longest common subsequence; note that this is not the same as the longest common substring), find the longest common subsequence of the input and output sequences. The output tokens that are left over, i.e. the tokens that need to be $add$ed, are collected into the vocabulary $V$. The phrases in the vocabulary are sorted by frequency, and the $l$ most frequent ones are kept. A rough sketch of this procedure is given below.
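The sketch below shows one way this could be done at the token level; it is not the paper's exact implementation, and the function names are mine. Candidate phrases are the maximal runs of target tokens that are not part of the LCS; keeping the $l$ most frequent runs gives the vocabulary $V$.

```python
from collections import Counter

def lcs_tokens(a, b):
    """Standard DP for the longest common subsequence of two token lists.

    Returns the set of positions in `b` that belong to the LCS.
    """
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover which target positions are part of the LCS.
    in_lcs = set()
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            in_lcs.add(j - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return in_lcs

def build_phrase_vocabulary(pairs, l):
    """Collect added phrases from (source_tokens, target_tokens) pairs, keep the l most frequent."""
    counter = Counter()
    for source_tokens, target_tokens in pairs:
        in_lcs = lcs_tokens(source_tokens, target_tokens)
        phrase = []
        for j, token in enumerate(target_tokens):
            if j in in_lcs:
                if phrase:
                    counter[' '.join(phrase)] += 1
                    phrase = []
            else:
                phrase.append(token)
        if phrase:
            counter[' '.join(phrase)] += 1
    return [p for p, _ in counter.most_common(l)]
```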
def _compute_single_tag(
        self, source_token, target_token_idx, target_tokens):
    """Computes a single tag.

    The tag may match multiple target tokens (via tag.added_phrase) so we
    return the next unmatched target token.

    Args:
        source_token: The token to be tagged.
        target_token_idx: Index of the current target tag.
        target_tokens: List of all target tokens.

    Returns:
        A tuple with (1) the computed tag and (2) the next target_token_idx.
    """
    source_token = source_token.lower()
    target_token = target_tokens[target_token_idx].lower()
    if source_token == target_token:
        return tagging.Tag('KEEP'), target_token_idx + 1

    # source_token != target_token: try to build an added phrase.
    added_phrase = ''
    for num_added_tokens in range(1, self._max_added_phrase_length + 1):
        if target_token not in self._token_vocabulary:
            break
        added_phrase += (' ' if added_phrase else '') + target_token
        next_target_token_idx = target_token_idx + num_added_tokens
        if next_target_token_idx >= len(target_tokens):
            break
        target_token = target_tokens[next_target_token_idx].lower()
        if (source_token == target_token and
                added_phrase in self._phrase_vocabulary):
            return tagging.Tag('KEEP|' + added_phrase), next_target_token_idx + 1
    return tagging.Tag('DELETE'), target_token_idx
def _compute_tags_fixed_order(self, source_tokens, target_tokens):
    """Computes tags when the order of sources is fixed.

    Args:
        source_tokens: List of source tokens.
        target_tokens: List of tokens to be obtained via edit operations.

    Returns:
        List of tagging.Tag objects. If the source couldn't be converted into
        the target via tagging, returns an empty list.
    """
    tags = [tagging.Tag('DELETE') for _ in source_tokens]
    # Indices of the tokens currently being processed.
    source_token_idx = 0
    target_token_idx = 0
    while target_token_idx < len(target_tokens):
        tags[source_token_idx], target_token_idx = self._compute_single_tag(
            source_tokens[source_token_idx], target_token_idx, target_tokens)
        ########################################################################
        # If we're adding a phrase and the previous source token(s) were
        # deleted, we could add the phrase before a previously deleted token
        # and still get the same realized output. For example:
        #     [DELETE, DELETE, KEEP|"what is"]
        # and
        #     [DELETE|"what is", DELETE, KEEP]
        # would yield the same realized output. Experimentally, we noticed that
        # the model works better / the learning task becomes easier when
        # phrases are always added before the first deleted token. Also note
        # that in the current implementation, this way of moving the added
        # phrase backward is the only way a DELETE tag can have an added
        # phrase, so sequences like [DELETE|"What", DELETE|"is"] will never be
        # created.
        if tags[source_token_idx].added_phrase:
            # The learning task becomes easier when phrases are always added
            # before the first deleted token.
            first_deletion_idx = self._find_first_deletion_idx(
                source_token_idx, tags)
            if first_deletion_idx != source_token_idx:
                tags[first_deletion_idx].added_phrase = (
                    tags[source_token_idx].added_phrase)
                tags[source_token_idx].added_phrase = ''
        ########################################################################
        source_token_idx += 1
        if source_token_idx >= len(tags):
            break

    # If all target tokens have been consumed, we have found a conversion and
    # can return the tags. Note that if there are remaining source tokens,
    # they are already marked deleted when initializing the tag list.
    if target_token_idx >= len(target_tokens):  # all target tokens consumed
        return tags
    return []  # TODO
The original implementation simply gives up (returns an empty list) when target tokens remain after all source tokens have been processed. The modified version below (fix by lavine) first tries to attach the leftover target tokens to the last tag:

def _compute_tags_fixed_order(self, source_tokens, target_tokens):
    """Computes tags when the order of sources is fixed.

    Args:
        source_tokens: List of source tokens.
        target_tokens: List of tokens to be obtained via edit operations.

    Returns:
        List of tagging.Tag objects. If the source couldn't be converted into
        the target via tagging, returns an empty list.
    """
    tags = [tagging.Tag('DELETE') for _ in source_tokens]
    # Indices of the tokens currently being processed.
    source_token_idx = 0
    target_token_idx = 0
    while target_token_idx < len(target_tokens):
        tags[source_token_idx], target_token_idx = self._compute_single_tag(
            source_tokens[source_token_idx], target_token_idx, target_tokens)
        ########################################################################
        # If we're adding a phrase and the previous source token(s) were
        # deleted, we could add the phrase before a previously deleted token
        # and still get the same realized output (see the comment in the
        # original listing above). Phrases are therefore always moved back to
        # the first deleted token.
        ########################################################################
        if tags[source_token_idx].added_phrase:
            # The learning task becomes easier when phrases are always added
            # before the first deleted token.
            first_deletion_idx = self._find_first_deletion_idx(
                source_token_idx, tags)
            if first_deletion_idx != source_token_idx:
                tags[first_deletion_idx].added_phrase = (
                    tags[source_token_idx].added_phrase)
                tags[source_token_idx].added_phrase = ''
        source_token_idx += 1
        if source_token_idx >= len(tags):
            break

    # If all target tokens have been consumed, we have found a conversion and
    # can return the tags. Note that if there are remaining source tokens,
    # they are already marked deleted when initializing the tag list.
    if target_token_idx >= len(target_tokens):  # all target tokens consumed
        return tags

    #### Fix by lavine
    ### Strategy 1: if the remaining target tokens form a phrase that is in
    ### the phrase vocabulary, attach it to the last tag instead of giving up.
    added_phrase = "".join(target_tokens[target_token_idx:])
    if added_phrase in self._phrase_vocabulary:
        tags[-1] = tagging.Tag('DELETE|' + added_phrase)
        print(''.join(source_tokens))
        print(''.join(target_tokens))
        print(str([str(tag) for tag in tags] if tags is not None else None))
        return tags
    ### Strategy 2: otherwise fall back to returning an empty list.
    return []  # TODO
4.3 Model architecture
The model consists of two main parts: 1. encoder: generates activation vectors for each element in the input sequence; 2. decoder: converts encoder activations into tag labels.
To capture dependencies between the predicted tags, the decoder is a Transformer decoder with unidirectional (autoregressive) connections, denoted $LASERTAGGER_{AR}$; this encoder-decoder combination feels a bit like pairing BERT with GPT. The decoder communicates with the encoder in two ways: (i) through full attention over the sequence of encoder activations, and (ii) by directly consuming the encoder activation at the current step.
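As a rough illustration of the simpler feed-forward variant ($LASERTAGGER_{FF}$, compared in the speed results below), the decoder can be thought of as a per-token classification head on top of the encoder activations. This is only a sketch of the idea, not the original implementation, and the class name is made up.

```python
import torch.nn as nn

class FeedForwardTagger(nn.Module):
    """Per-token tag classifier over encoder activations (illustrative only)."""

    def __init__(self, hidden_size, num_tags):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, encoder_activations):
        # encoder_activations: [batch, seq_len, hidden_size], e.g. from a BERT encoder
        return self.classifier(encoder_activations)  # [batch, seq_len, num_tags]
```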
Exact score: percentage of exactly correctly predicted fusions (similar to accuracy).
SARI: average F1 scores of the added, kept, and deleted n-grams.
2 Split and Rephrase: SARI
3 Abstractive Summarization: ROUGE-L
4 Grammatical Error Correction (GEC): precision, recall, and $F_{0.5}$
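For reference, $F_{0.5}$ is the $F_\beta$ score with $\beta = 0.5$, which weights precision more heavily than recall:

$$F_{0.5} = \frac{(1 + 0.5^2)\, P \cdot R}{0.5^2 \cdot P + R}$$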
7. Experimental results
Baseline: based on a Transformer where both the encoder and decoder replicate the $BERT_{base}$ architecture.
Speed:
1. $LASERTAGGER_{AR}$ is already 10x faster than the comparable-in-accuracy $SEQ2SEQ_{BERT}$ baseline. This difference is due to the former model using a 1-layer decoder (instead of 12 layers) and no encoder-decoder cross attention.
2. $LASERTAGGER_{FF}$ is more than 100x faster.
1. Using the output of the CLS token;
2. computing the mean of all output vectors (MEAN strategy);
3. computing a max-over-time of the output vectors (MAX strategy).
The default configuration is MEAN.
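The three strategies can be sketched roughly as follows. This is illustrative code, not the exact sentence-transformers internals; the function and tensor names are assumptions.

```python
import torch

def pool(token_embeddings, attention_mask, strategy='MEAN'):
    # token_embeddings: [batch, seq_len, hidden], attention_mask: [batch, seq_len]
    mask = attention_mask.unsqueeze(-1).float()          # [batch, seq_len, 1]
    if strategy == 'CLS':
        return token_embeddings[:, 0]                    # output of the [CLS] token
    if strategy == 'MEAN':
        summed = (token_embeddings * mask).sum(dim=1)    # ignore padded positions
        return summed / mask.sum(dim=1).clamp(min=1e-9)
    if strategy == 'MAX':
        masked = token_embeddings.masked_fill(mask == 0, -1e9)
        return masked.max(dim=1).values                  # max-over-time
    raise ValueError(strategy)
```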
from sentence_bert.sentence_transformers import SentenceTransformer, util
### Load the model
model = SentenceTransformer(model_path)

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

# Encode all sentences (needed so that `embeddings` is defined below)
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

# Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores) - 1):
    for j in range(i + 1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

# Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))
The new movie is awesome 	 The new movie is so great 	 Score: 0.9283
The cat sits outside 	 The cat plays in the garden 	 Score: 0.6855
I love pasta 	 Do you like pizza? 	 Score: 0.5420
I love pasta 	 The new movie is awesome 	 Score: 0.2629
I love pasta 	 The new movie is so great 	 Score: 0.2268
The new movie is awesome 	 Do you like pizza? 	 Score: 0.1885
A man is playing guitar 	 A woman watches TV 	 Score: 0.1759
The new movie is so great 	 Do you like pizza? 	 Score: 0.1615
The cat plays in the garden 	 A woman watches TV 	 Score: 0.1521
The cat sits outside 	 The new movie is awesome 	 Score: 0.1475
class Solution:
    def uniquePaths(self, m: int, n: int) -> int:
        # First row and first column have exactly one path each.
        dp = [[1] * n] + [[1] + [0] * (n - 1) for _ in range(m - 1)]
        # print(dp)
        for i in range(1, m):
            for j in range(1, n):
                dp[i][j] = dp[i - 1][j] + dp[i][j - 1]
        return dp[-1][-1]
For example, problem 5 (Longest Palindromic Substring): $dp[i][j] = dp[i+1][j-1] \ \text{and} \ s[i] == s[j]$.
class Solution:
    def longestPalindrome(self, s: str) -> str:
        n = len(s)
        if n < 2:
            return s

        max_len = 1
        begin = 0
        # dp[i][j] indicates whether s[i..j] is a palindrome
        dp = [[False] * n for _ in range(n)]
        for i in range(n):
            dp[i][i] = True

        # Enumerate substring lengths first
        for L in range(2, n + 1):
            # Enumerate the left boundary; its upper bound can be loose
            for i in range(n):
                # L and i determine the right boundary: j - i + 1 = L, so j = L + i - 1
                j = L + i - 1
                # If the right boundary is out of range, exit the inner loop
                if j >= n:
                    break

                if s[i] != s[j]:
                    dp[i][j] = False
                else:
                    if j - i < 3:
                        dp[i][j] = True
                    else:
                        dp[i][j] = dp[i + 1][j - 1]

                # When dp[i][j] is True, s[i..j] is a palindrome; record its length and start
                if dp[i][j] and j - i + 1 > max_len:
                    max_len = j - i + 1
                    begin = i
        return s[begin:begin + max_len]
Usually, the state of a DP problem can be represented with a limited number of variables, such as dp[i], dp[i][j], or dp[i][j][k].
However, sometimes the state contains multiple sub-statuses that cannot be captured this way. In that case, we can use state-compression techniques to represent the DP state, as in the sketch below.
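For example, in the Traveling Salesman Problem the set of visited cities can be compressed into a bitmask, so the whole state fits in dp[mask][i]. A minimal illustrative sketch (my own example, not tied to a specific problem above):

```python
def tsp(dist):
    """Shortest tour visiting every city once and returning to city 0.

    dp[mask][i] = shortest path that visits exactly the cities in `mask`
    (a bitmask) and currently ends at city i.
    """
    n = len(dist)
    INF = float('inf')
    dp = [[INF] * n for _ in range(1 << n)]
    dp[1][0] = 0                                  # start at city 0, only city 0 visited
    for mask in range(1 << n):
        for i in range(n):
            if dp[mask][i] == INF or not (mask >> i) & 1:
                continue
            for j in range(n):
                if (mask >> j) & 1:               # city j already visited
                    continue
                nxt = mask | (1 << j)
                dp[nxt][j] = min(dp[nxt][j], dp[mask][i] + dist[i][j])
    full = (1 << n) - 1
    return min(dp[full][i] + dist[i][0] for i in range(n))  # close the tour

# Example: tsp([[0, 1, 2], [1, 0, 1], [2, 1, 0]]) -> 4
```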