FELIX: Flexible Text Editing Through Tagging and Insertion

Google's follow-up to LaserTagger: another text-editing paper.

In contrast to conventional sequence-to-sequence (seq2seq) models, FELIX is efficient in low-resource settings and fast at inference time, while being capable of modeling flexible input-output transformations. We achieve this by decomposing the text-editing task into two sub-tasks: tagging to decide on the subset of input tokens and their order in the output text and insertion to in-fill the missing tokens in the output not present in the input.

1 Introduction

In particular, we have designed FELIX with the following requirements in mind: sample efficiency, fast inference time, and flexible text editing.

2 Model description

FELIX decomposes the conditional probability of generating an output sequence $y$ from an input
$x$ as follows:
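Reconstructed from the surrounding description (the exact notation is in the paper), the factorization is approximately:

$$
p(\textbf{y} \mid \textbf{x}) \approx p_{\mathrm{ins}}(\textbf{y} \mid \textbf{y}^m)\; p_{\mathrm{tag}}(\textbf{y}^m \mid \textbf{x}),
$$

where $\textbf{y}^m$ is the intermediate sequence of kept, reordered, and masked tokens produced by the tagging model and consumed by the insertion model.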

2.1 Tagging Model

The tagging model is trained to optimize both the tagging loss and the pointing loss:
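In its simplest form (a sketch of what this implies; the paper may weight the two terms differently):

$$
\mathcal{L}_{\mathrm{tagging}} = \mathcal{L}_{\mathrm{tag}} + \mathcal{L}_{\mathrm{pointer}}.
$$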

Tagging:

The tag sequence $\textbf{y}^t$ consists of three tags: $KEEP$, $DELETE$, and $INSERT$ ($INS$).

Tags are predicted by applying a single feedforward layer $f$ to the output of the encoder $\textbf{h}^L$ (the source sentence is first encoded using a 12-layer BERT-base model). $\textbf{y}^t_i=argmax(f(\textbf{h}^L_i))$
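A minimal numpy sketch of this tagging head; the shapes and variable names here are illustrative assumptions, not the authors' code:

import numpy as np

def tag_tokens(h_L, W, b):
    """Applies a single feedforward layer f to each encoder output h^L_i and
    takes the argmax, i.e. y^t_i = argmax(f(h^L_i)).

    h_L: [seq_len, hidden] encoder outputs; W: [hidden, num_tags]; b: [num_tags].
    """
    logits = h_L @ W + b
    return logits.argmax(axis=-1)

# Toy usage: 5 tokens, hidden size 8, 3 tags (KEEP / DELETE / INSERT).
rng = np.random.default_rng(0)
h_L = rng.normal(size=(5, 8))
W, b = rng.normal(size=(8, 3)), np.zeros(3)
print(tag_tokens(h_L, W, b))  # array of 5 tag ids, one per token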

Pointing:

Given a sequence $\textbf{x}$ and the predicted tags $\textbf{y}^t$, the re-ordering model generates a permutation $\pi$ so that from $\pi$ and $\textbf{y}^t$ we can reconstruct the insertion model input $\textbf{y}^m$. Thus we have:
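Reconstructed from this description (so treat the exact notation as approximate), the tagging model factorizes per-position tag and pointer predictions:

$$
p_{\mathrm{tag}}(\textbf{y}^m \mid \textbf{x}) = \prod_i p(\textbf{y}^t_i \mid \textbf{x})\;\prod_i p\big(\pi(i) \mid \textbf{x}, \textbf{y}^t\big).
$$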

Our implementation is based on a pointer network; the output of this model is a series of predicted pointers (source token → next target token).

The input to the pointer layer at position $i$ combines the encoder output $\textbf{h}_i^L$ with two embeddings, where $e(\textbf{y}_i^t)$ is the embedding of the predicted tag and $e(\textbf{p}_i)$ is the positional embedding:
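Reconstructed from these definitions (treat the exact form as approximate), it is the concatenation

$$
\textbf{h}^{L+1}_i = \big[\,\textbf{h}^L_i \,;\, e(\textbf{y}^t_i) \,;\, e(\textbf{p}_i)\,\big].
$$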

The pointer network attends over all hidden states, with $\textbf{h}_i^{L+1}$ acting as the query $Q$ and $\textbf{h}_{\pi(i)}^{L+1}$ as the key $K$:
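Sketched as a standard scaled dot-product attention (the paper's exact attention variant may differ):

$$
p\big(\pi(i)=j \mid i\big) = \operatorname{softmax}_j\!\left(\frac{(\textbf{h}^{L+1}_i W_Q)\,(\textbf{h}^{L+1}_j W_K)^{\top}}{\sqrt{d}}\right).
$$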

When realizing the pointers, a constrained beam search is used so that the predicted pointers yield a valid ordering (no token is visited twice).
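A minimal Python sketch of such a constrained search over pointer scores (illustrative only, not the authors' implementation): each hypothesis extends from its current position and is forbidden from revisiting a position.

import numpy as np

def realize_pointers(scores, beam_size=4):
    """scores[i, j]: pointer score for moving from position i to position j.
    Returns the highest-scoring visiting order that starts at position 0 and
    never visits a position twice (a constrained beam search)."""
    n = scores.shape[0]
    beams = [([0], 0.0)]                      # (visited order, accumulated score)
    for _ in range(n - 1):
        candidates = []
        for order, total in beams:
            cur = order[-1]
            for j in range(n):
                if j in order:                # constraint: no position visited twice
                    continue
                candidates.append((order + [j], total + scores[cur, j]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]

# Toy usage: 4 positions with random pointer scores.
rng = np.random.default_rng(0)
print(realize_pointers(rng.normal(size=(4, 4))))  # a visiting order over positions 0..3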

2.2 Insertion Model

To represent masked token spans we consider two options: masking and infilling. In the former case the tagging model predicts how many tokens need to be inserted by specializing the $INSERT$ tag into $INS_k$, where $k$ means the span is replaced by $k$ $MASK$ tokens. In the infilling case the tagging model predicts a generic $INS$ tag.

Note that we preserve the deleted span in the input to the insertion model by enclosing it between $[REPL]$ and $[/REPL]$ tags.
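A small sketch of how the insertion-model input can be assembled from source tokens and predicted tags in the masking variant; the helper below is hypothetical and follows the description above only approximately (in particular, where exactly the $[REPL]$ span sits relative to the masks):

def build_insertion_input(tokens, tags):
    """tokens: source tokens; tags: one of 'KEEP', 'DELETE', or 'INS_k' per token.
    Returns the insertion-model input for the masking variant."""
    out, deleted = [], []
    for token, tag in zip(tokens, tags):
        if tag == 'DELETE':
            deleted.append(token)          # remember the span for [REPL] ... [/REPL]
            continue
        if deleted:                        # close a deleted span before the next kept token
            out += ['[REPL]'] + deleted + ['[/REPL]']
            deleted = []
        if tag.startswith('INS_'):         # INS_k: keep the token and open k mask slots after it
            k = int(tag.split('_')[1])
            out += [token] + ['[MASK]'] * k
        else:                              # KEEP
            out.append(token)
    if deleted:
        out += ['[REPL]'] + deleted + ['[/REPL]']
    return out

# Toy usage.
print(build_insertion_input(
    ['The', 'big', 'very', 'loud', 'cat'],
    ['KEEP', 'INS_1', 'DELETE', 'DELETE', 'KEEP']))
# ['The', 'big', '[MASK]', '[REPL]', 'very', 'loud', '[/REPL]', 'cat']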

Our insertion model is also based on a 12-layer BERT-base, so we can directly take advantage of BERT-style pretrained checkpoints.

References

https://aclanthology.org/2020.findings-emnlp.111.pdf

LASERTAGGER

一. Abstract

For some text generation tasks, the input and output texts overlap heavily. Generating the output from scratch with an encoder-decoder model is wasteful and unnecessary in such cases, and it causes two problems: (1) hallucination (the model making things up); (2) repeated spans (partially duplicated fragments).

With this in mind, the authors propose the LaserTagger model: using a few common operations (keep token, delete token, add token), each token of the input sequence is tagged, turning the text generation task into a sequence labeling task.

This brings the following advantages over encoder-decoder models: (1) faster inference; (2) better performance than the seq2seq baseline on small datasets, and parity with it on large datasets. (The reason is the heavy input/output overlap: LaserTagger's candidate vocabulary stays small, since overlapping tokens only need a KEEP tag, whereas a conventional encoder-decoder must still choose each overlapping word from its full output vocabulary.)

二. Main Contributions

1. Automatically extract the tokens/phrases that need to be added, given the input and output texts.

2. Given the input text, the output text, and the tag set, label the training input sequence with tags.

3. Two variants are proposed: $LASERTAGGER_{AR}$ (BERT + Transformer decoder) and $LASERTAGGER_{FF}$ (BERT + dense layer + softmax).

三. Overall Pipeline

There are really just two steps: (1) encode the input text into edit tags, and (2) decode (realize) the tags back into text.

四. Text Tagging

4.1 Building the Tag Set (i.e. the Label Set)

In general there are two kinds of tags: base tags $B$ and add tags $P$. A base tag either $KEEP$s or $DELETE$s the current token; an add tag adds a phrase from the vocabulary $V$ in front of the token. In practice the two are combined into a single tag $^{P}B$, so the total number of tags is roughly the number of base tags times the number of phrases, i.e. $2|V|$. Task-specific tags can also be introduced; for sentence fusion, for example, a $SWAP$ tag is added.
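A tiny sketch of what the combined tag set looks like in practice (it matches the label_map output shown in 4.1.2 below; the BASE|phrase string format is how the repo renders combined tags):

def build_tag_set(phrase_vocabulary):
    """Combines base tags B = {KEEP, DELETE} with the add-phrase vocabulary P,
    giving roughly 2|V| combined tags of the form BASE|phrase."""
    base_tags = ['KEEP', 'DELETE']
    tags = list(base_tags)                                   # plain KEEP / DELETE
    for phrase in phrase_vocabulary:
        tags += [f'{base}|{phrase}' for base in base_tags]   # KEEP|phrase, DELETE|phrase
    return tags

print(build_tag_set(['址', '单位']))
# ['KEEP', 'DELETE', 'KEEP|址', 'DELETE|址', 'KEEP|单位', 'DELETE|单位']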

4.1.1 Building the Vocabulary $V$

Objectives:

  1. 最小化词汇表规模;
  2. 最大化目标词语的比例

Limiting the number of phrases in the vocabulary reduces the number of output decisions; maximizing the coverage of target words prevents the model from adding invalid words.

Construction process:

The $LCS$ algorithm (longest common subsequence; note this is not the same as the longest common substring) is used to find the longest common subsequence of the input and output. The target tokens left over after removing this subsequence are the tokens that need to be $add$ed; they are collected into the vocabulary $V$, the phrases are sorted by frequency, and the $l$ most frequent ones are kept.

For example: source is "12345678" and target is "1264591".

The longest common subsequence is ['1', '2', '4', '5'].

The tokens that need to be $add$ed are ['6', '91'].

Source code:

from typing import Sequence, Text

import utils  # from the lasertagger repo; provides get_token_list


def _lcs_table(source, target):
    """Returns the Longest Common Subsequence dynamic programming table."""
    rows = len(source)
    cols = len(target)
    lcs_table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if source[i - 1] == target[j - 1]:
                lcs_table[i][j] = lcs_table[i - 1][j - 1] + 1
            else:
                lcs_table[i][j] = max(lcs_table[i - 1][j], lcs_table[i][j - 1])
    return lcs_table


def _backtrack(table, source, target, i, j):
    """Backtracks the Longest Common Subsequence table to reconstruct the LCS.

    Args:
        table: Precomputed LCS table.
        source: List of source tokens.
        target: List of target tokens.
        i: Current row index.
        j: Current column index.

    Returns:
        List of tokens corresponding to LCS.
    """
    if i == 0 or j == 0:
        return []
    if source[i - 1] == target[j - 1]:
        # Append the aligned token to output.
        return _backtrack(table, source, target, i - 1, j - 1) + [target[j - 1]]
    if table[i][j - 1] > table[i - 1][j]:
        return _backtrack(table, source, target, i, j - 1)
    else:
        return _backtrack(table, source, target, i - 1, j)


def _compute_lcs(source, target):
    # e.g. s1 = [1,3,4,5,6,7,7,8], s2 = [3,5,7,4,8,6,7,8,2] -> returns 3,5,7,7,8
    table = _lcs_table(source, target)
    return _backtrack(table, source, target, len(source), len(target))


def _get_added_phrases(source: Text, target: Text) -> Sequence[Text]:
    """Computes the phrases that need to be added to the source to get the target."""
    sep = ''  # no separator between characters; use ' ' for whitespace-tokenized text
    source_tokens = utils.get_token_list(source.lower())
    target_tokens = utils.get_token_list(target.lower())
    # Compute the Longest Common Subsequence: these tokens are kept.
    kept_tokens = _compute_lcs(source_tokens, target_tokens)
    added_phrases = []
    kept_idx = 0
    phrase = []
    for token in target_tokens:
        if kept_idx < len(kept_tokens) and token == kept_tokens[kept_idx]:
            kept_idx += 1
            if phrase:
                added_phrases.append(sep.join(phrase))
                phrase = []
        else:
            phrase.append(token)
    if phrase:
        added_phrases.append(sep.join(phrase))
    return added_phrases
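To tie this back to the worked example above, the LCS helper can be exercised directly on character lists (sidestepping utils.get_token_list, which lives in the repo):

source_tokens = list('12345678')
target_tokens = list('1264591')
print(_compute_lcs(source_tokens, target_tokens))  # ['1', '2', '4', '5']
# Running the added-phrase loop above on the same lists yields ['6', '91'].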

The vocabulary is written to the file label_map.txt.log. On my own dataset it looks like this:

Idx Frequency  Coverage (%)   Phrase
1 19 94.22 址
2 15 95.27 单位
3 8 95.76 地
4 6 96.17 执勤

4.1.2 The Tag Set

On my own dataset, the candidate tags are as follows:

KEEP
DELETE
KEEP|址
DELETE|址
KEEP|单位
DELETE|单位
KEEP|地
DELETE|地
KEEP|执勤
DELETE|执勤

4.2 Converting Training Targets into Tags

The paper gives pseudocode for this conversion. It is a greedy strategy: iterate over the target $t$, first trying to match the current target token against the current source token; on a match, emit $KEEP$. Otherwise, grow a candidate added phrase $p = t(i_t : i_t + j - 1)$ by advancing $j$, and check whether $t(i_t + j) == s(i_s)$ and $p \in V$; if so, the phrase can be attached to a $KEEP$ tag.

Source code:

It differs slightly from the pseudocode; the difference is marked between the ##### lines.

def _compute_single_tag(
        self, source_token, target_token_idx, target_tokens):
    """Computes a single tag.

    The tag may match multiple target tokens (via tag.added_phrase) so we return
    the next unmatched target token.

    Args:
        source_token: The token to be tagged.
        target_token_idx: Index of the current target tag.
        target_tokens: List of all target tokens.

    Returns:
        A tuple with (1) the computed tag and (2) the next target_token_idx.
    """
    source_token = source_token.lower()
    target_token = target_tokens[target_token_idx].lower()
    if source_token == target_token:
        return tagging.Tag('KEEP'), target_token_idx + 1
    # source_token != target_token: try to grow an added phrase.
    added_phrase = ''
    for num_added_tokens in range(1, self._max_added_phrase_length + 1):
        if target_token not in self._token_vocabulary:
            break
        added_phrase += (' ' if added_phrase else '') + target_token
        next_target_token_idx = target_token_idx + num_added_tokens
        if next_target_token_idx >= len(target_tokens):
            break
        target_token = target_tokens[next_target_token_idx].lower()
        if (source_token == target_token and
                added_phrase in self._phrase_vocabulary):
            return tagging.Tag('KEEP|' + added_phrase), next_target_token_idx + 1
    return tagging.Tag('DELETE'), target_token_idx


def _compute_tags_fixed_order(self, source_tokens, target_tokens):
    """Computes tags when the order of sources is fixed.

    Args:
        source_tokens: List of source tokens.
        target_tokens: List of tokens to be obtained via edit operations.

    Returns:
        List of tagging.Tag objects. If the source couldn't be converted into the
        target via tagging, returns an empty list.
    """
    tags = [tagging.Tag('DELETE') for _ in source_tokens]
    # Indices of the tokens currently being processed.
    source_token_idx = 0
    target_token_idx = 0
    while target_token_idx < len(target_tokens):
        tags[source_token_idx], target_token_idx = self._compute_single_tag(
            source_tokens[source_token_idx], target_token_idx, target_tokens)
        ####################################################################################
        # If we're adding a phrase and the previous source token(s) were deleted,
        # we could add the phrase before a previously deleted token and still get
        # the same realized output. For example:
        #   [DELETE, DELETE, KEEP|"what is"]
        # and
        #   [DELETE|"what is", DELETE, KEEP]
        # would yield the same realized output. Experimentally, we noticed that
        # the model works better / the learning task becomes easier when phrases
        # are always added before the first deleted token. Also note that in the
        # current implementation, this way of moving the added phrase backward is
        # the only way a DELETE tag can have an added phrase, so sequences like
        # [DELETE|"What", DELETE|"is"] will never be created.
        if tags[source_token_idx].added_phrase:
            first_deletion_idx = self._find_first_deletion_idx(
                source_token_idx, tags)
            if first_deletion_idx != source_token_idx:
                tags[first_deletion_idx].added_phrase = (
                    tags[source_token_idx].added_phrase)
                tags[source_token_idx].added_phrase = ''
        ####################################################################################
        source_token_idx += 1
        if source_token_idx >= len(tags):
            break

    # If all target tokens have been consumed, we have found a conversion and
    # can return the tags. Note that if there are remaining source tokens, they
    # are already marked deleted when initializing the tag list.
    if target_token_idx >= len(target_tokens):
        return tags
    return []  # TODO: no conversion found (see the limitation discussed below).

Limitation

Some pairs cannot be converted at all. For example:

source: 证件有效期截止日期, target: 证件日期格式

No tag sequence is produced.

An extra strategy can be added to handle such cases:

def _compute_tags_fixed_order(self, source_tokens, target_tokens):
    """Computes tags when the order of sources is fixed.

    Args:
        source_tokens: List of source tokens.
        target_tokens: List of tokens to be obtained via edit operations.

    Returns:
        List of tagging.Tag objects. If the source couldn't be converted into the
        target via tagging, returns an empty list.
    """
    tags = [tagging.Tag('DELETE') for _ in source_tokens]
    # Indices of the tokens currently being processed.
    source_token_idx = 0
    target_token_idx = 0
    while target_token_idx < len(target_tokens):
        tags[source_token_idx], target_token_idx = self._compute_single_tag(
            source_tokens[source_token_idx], target_token_idx, target_tokens)
        ####################################################################################
        # Same phrase-moving logic as in the original implementation above:
        # added phrases are moved back to the first deleted token.
        if tags[source_token_idx].added_phrase:
            first_deletion_idx = self._find_first_deletion_idx(
                source_token_idx, tags)
            if first_deletion_idx != source_token_idx:
                tags[first_deletion_idx].added_phrase = (
                    tags[source_token_idx].added_phrase)
                tags[source_token_idx].added_phrase = ''
        ####################################################################################
        source_token_idx += 1
        if source_token_idx >= len(tags):
            break

    # If all target tokens have been consumed, we have found a conversion and
    # can return the tags. Note that if there are remaining source tokens, they
    # are already marked deleted when initializing the tag list.
    if target_token_idx >= len(target_tokens):
        return tags

    #### Fix by lavine ####
    # Strategy 1: if the leftover target tokens form a phrase that exists in the
    # vocabulary, attach it to the last source token as DELETE|phrase.
    added_phrase = ''.join(target_tokens[target_token_idx:])
    if added_phrase in self._phrase_vocabulary:
        tags[-1] = tagging.Tag('DELETE|' + added_phrase)
        print(''.join(source_tokens))
        print(''.join(target_tokens))
        print(str([str(tag) for tag in tags] if tags is not None else None))
        return tags
    # Strategy 2: otherwise give up, as before.
    return []

4.3 Model Architecture

The model has two parts: (1) the encoder, which generates activation vectors for each element of the input sequence, and (2) the decoder, which converts the encoder activations into tag labels.

4.3.1 Encoder

Since $BERT$ achieves state-of-the-art results on sentence encoding tasks, it is used as the encoder. The authors choose $BERT_{base}$, which has 12 self-attention layers.

4.3.2 Decoder

For tagging tasks, the original $BERT$ paper uses a very simple decoder: a single feed-forward layer. That combination is called $LASERTAGGER_{FF}$ here. Its drawback is that the predicted tags are independent of one another, ignoring dependencies between neighboring tags.

To model those dependencies, the decoder is replaced with an autoregressive (left-to-right) Transformer decoder, giving $LASERTAGGER_{AR}$; the pairing of this encoder and decoder feels a bit like combining BERT with GPT. The decoder communicates with the encoder in two ways: (i) through full attention over the sequence of encoder activations, and (ii) by directly consuming the encoder activation at the current step.

4.4 Loss

If the sentence length is $n$ and there are $m$ possible tags, the loss is the sum of $n$ independent $m$-way classification losses.
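In symbols, this is just a per-token cross-entropy summed over positions (a sketch of what the sentence above states):

$$
\mathcal{L} = -\sum_{i=1}^{n} \log p\big(y_i \mid \textbf{x}\big), \qquad p(y_i \mid \textbf{x}) = \operatorname{softmax}\big(f(\textbf{h}_i)\big)_{y_i}.
$$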

五. Realization

For the basic tags ($KEEP$, $DELETE$, $ADD$), realization is a direct conversion from the input tokens and their tags; special tags require task-specific rules, maintained as needed.
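A minimal realization sketch for the basic tags; it assumes the BASE|phrase tag format from section 4.1 and places the added phrase before the current token, which matches the description there but is not the repo's realizer:

def realize(tokens, tags, sep=' '):
    """Realizes (token, tag) pairs into output text.
    Each tag is 'KEEP', 'DELETE', or 'KEEP|phrase' / 'DELETE|phrase', where the
    added phrase is emitted before the current token (see 4.1)."""
    out = []
    for token, tag in zip(tokens, tags):
        base, _, phrase = tag.partition('|')
        if phrase:
            out.append(phrase)     # added phrase goes in front of the token
        if base == 'KEEP':
            out.append(token)      # DELETE simply drops the token
    return sep.join(out)

print(realize(['Turing', 'was', 'born', 'in', '1912', '.', 'Turing', 'died', 'in', '1954', '.'],
              ['KEEP', 'KEEP', 'KEEP', 'KEEP', 'KEEP', 'DELETE|and', 'DELETE', 'KEEP', 'KEEP', 'KEEP', 'KEEP']))
# 'Turing was born in 1912 and died in 1954 .'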

六. Evaluation Metrics

Different tasks use different evaluation metrics.

1 Sentence Fusion

Exact score: the percentage of exactly correctly predicted fusions (similar to accuracy).

SARI: the average F1 score over added, kept, and deleted n-grams.

2 Split and Rephrase

SARI

3 Abstractive Summarization

ROUGE-L

4 Grammatical Error Correction (GEC)

Precision and recall, $F_{0.5}$

七. Experimental Results

Baseline: a Transformer in which both the encoder and decoder replicate the $BERT_{base}$ architecture.

Speed: (1) $LASERTAGGER_{AR}$ is already 10x faster than the comparable-in-accuracy $SEQ2SEQ_{BERT}$ baseline; the difference comes from the former using a 1-layer decoder (instead of 12 layers) and no encoder-decoder cross attention. (2) $LASERTAGGER_{FF}$ is more than 100x faster.

See the paper for the remaining results.

References

https://arxiv.org/pdf/1909.01187.pdf

https://github.com/google-research/lasertagger

https://zhuanlan.zhihu.com/p/348109034

