2021-11-26

NER

Named Entity Recognition (NER)

It aims to extract named entities from text, such as person names, place names, and organization names.

Text data annotation

Why annotate? Put plainly, annotations are the labels the model learns from.

https://blog.csdn.net/scgaliguodong123_/article/details/121303421

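An example:

BIO: a three-tag sequence labeling scheme (B-begin, I-inside, O-outside)
B-X marks the beginning of entity X, where X is PER (person), ORG (organization), or LOC (location)
I-X marks the middle or end of entity X
O marks a token that belongs to no entity type
Sample ("I am Li Guodong, I love China, I come from Sichuan."):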

 我 O
 是 O
 李 B-PER
 果 I-PER
 冻 I-PER
 ， O
 我 O
 爱 O
 中 B-ORG
 国 I-ORG
 ， O
 我 O
 来 O
 自 O
 四 B-LOC
 川 I-LOC
 。 O
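
To make the tagging scheme concrete, here is a minimal sketch of how BIO tags can be decoded back into entities. The function name `decode_bio` and its handling of stray `I-` tags (they are dropped) are assumptions made for illustration, not part of any particular NER library.

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, entity_type) pairs from parallel token/tag lists."""
    entities = []
    current_chars, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity that is still open.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the open entity of the same type.
            current_chars.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity, closes it.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


tokens = list("我是李果冻，我爱中国，我来自四川。")
tags = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "O", "O",
        "B-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]
print(decode_bio(tokens, tags))
# [('李果冻', 'PER'), ('中国', 'ORG'), ('四川', 'LOC')]
```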

References

https://www.cnblogs.com/huangyc/p/10064853.html

https://www.cnblogs.com/YoungF/p/13488220.html

https://tech.meituan.com/2020/07/23/ner-in-meituan-nlp.html

https://zhuanlan.zhihu.com/p/156914795

https://blog.csdn.net/scgaliguodong123_/article/details/121303421


2021-11-26

BM25

An improvement over TF-IDF

Improvement to the TF part

Improvement to the IDF part
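
The two improvements can be sketched as follows, assuming the commonly cited Okapi BM25 form: the TF part saturates through `k1` and is normalized by document length through `b`, and the IDF part uses a smoothed form that stays non-negative. The defaults `k1=1.5` and `b=0.75`, the exact IDF variant, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized doc in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF part: smoothed so that it never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # TF part: saturates with k1 and is length-normalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [["today", "weather", "nice"], ["weather", "not", "bad"], ["i", "love", "nlp"]]
scores = bm25_scores(["weather", "nice"], docs)
print(scores)
```

The first document matches both query terms, the second matches one, and the third matches none, so the scores come out in that order.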


2021-11-22

Applications of contrastive learning in NLP

https://zhuanlan.zhihu.com/p/435367182

1. Applied to pre-training

Pre-trained Models for Natural Language Processing: A Survey

https://arxiv.org/pdf/2003.08271v4.pdf

2. Applied to fine-tuning

SimCSE, ConSERT
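
Methods such as SimCSE train sentence embeddings with an in-batch contrastive objective. The following is a minimal NumPy sketch of that kind of loss (InfoNCE over cosine similarities), not the authors' implementation; the function name, the random toy data, and the temperature value are assumptions.

```python
import numpy as np


def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: row i of `positives` is the positive for row i of `anchors`."""
    # L2-normalize so that the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (the diagonal).
    return -np.mean(np.diag(log_probs))


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned = info_nce_loss(x, x + 0.01 * rng.normal(size=x.shape))
shuffled = info_nce_loss(x, np.roll(x, 1, axis=0))
print(aligned, shuffled)
```

When each positive is a slightly perturbed copy of its anchor the loss is small; when the positives are shifted so that they no longer line up with their anchors, the loss is much larger.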


2021-11-22

token embedding

https://www.cnblogs.com/d0main/p/10447853.html

Flowchart: input text -> tokenize -> word-vector matrix (pre-trained or randomly initialized) -> token embedding
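
The pipeline from input text to token embedding can be sketched in a few lines. The toy vocabulary, the whitespace tokenizer, and the embedding size are assumptions for illustration; a real setup would take the vocabulary from the tokenizer and might load the matrix from pre-trained vectors.

```python
import numpy as np

# Hypothetical toy vocabulary; id 0 is reserved for unknown tokens.
vocab = {"[UNK]": 0, "today": 1, "weather": 2, "nice": 3}
embedding_dim = 4

# Word-vector matrix: randomly initialized here (it could also be pre-trained).
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))


def token_embedding(text):
    """input text -> tokenize -> look up rows of the word-vector matrix."""
    tokens = text.lower().split()                        # whitespace tokenizer (assumption)
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return embedding_matrix[ids]                         # shape (num_tokens, embedding_dim)


vectors = token_embedding("Today weather nice indeed")
print(vectors.shape)  # (4, 4)
```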


2021-11-16

The full Seq2Seq toolkit in TensorFlow

https://zhuanlan.zhihu.com/p/47929039


2021-11-08

text edit

https://blog.csdn.net/qq_27590277/article/details/118534238

https://thinkwee.top/2021/05/11/text-edit-generation/

https://zhuanlan.zhihu.com/p/144995580


2021-11-04

Pre-Training with Whole Word Masking for Chinese BERT

BERT-wwm-ext

wwm: whole word masking

ext: "we also use extended training data (mark with ext in the model name)"

Pre-training

1 Changing the masking strategy

Whole Word Masking (wwm)

cws: Chinese Word Segmentation

Comparison of four masking strategies
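
The idea behind whole word masking can be sketched as follows: the text is first segmented into words (the CWS step), and when a word is selected, all of its characters are masked together. This is a simplified illustration that always substitutes `[MASK]`; the function name, the masking probability, and the example sentence are assumptions.

```python
import random


def whole_word_mask(words, mask_prob=0.15, seed=None):
    """Mask every character of each selected word; `words` is the CWS output."""
    rng = random.Random(seed)
    masked = []
    for word in words:
        if rng.random() < mask_prob:
            # All characters of the selected word are masked together.
            masked.extend(["[MASK]"] * len(word))
        else:
            masked.extend(list(word))
    return masked


words = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词"]
print(whole_word_mask(words, mask_prob=0.3, seed=0))
```

Character-level masking, by contrast, would pick individual characters independently, so a word such as "模型" could end up only half masked.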

References

Pre-Training with Whole Word Masking for Chinese BERT

https://arxiv.org/abs/1906.08101v3

Revisiting Pre-trained Models for Chinese Natural Language Processing

https://arxiv.org/abs/2004.13922

GitHub: https://hub.fastgit.org/ymcui/Chinese-BERT-wwm


2021-11-04

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

There are three main contributions that ALBERT makes over the design choices of BERT:

1 Factorized embedding parameterization

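The original embedding layer is a single matrix $M_{emb[V\times H]}$; it is now factorized into two matrices, $M_{emb1[V\times E]}$ and $M_{emb2[E\times H]}$, which reduces the parameter count from $VH$ to $VE+EH$ ("This parameter reduction is significant when H >> E.").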
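
A quick arithmetic check shows how large the saving can be. The sizes below (V = 30000, H = 768, E = 128) are illustrative assumptions, not figures quoted from the paper in this note.

```python
def embedding_params(V, H, E=None):
    """Parameter count of the embedding layer, optionally factorized through size E."""
    if E is None:
        return V * H            # a single V x H matrix
    return V * E + E * H        # a V x E matrix followed by an E x H projection


V, H, E = 30000, 768, 128       # illustrative sizes (assumption)
print(embedding_params(V, H))       # 23040000
print(embedding_params(V, H, E))    # 3938304
```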

2 Cross-layer parameter sharing

The default decision for ALBERT is to share all parameters across layers (attention and FFN).

3 Inter-sentence coherence loss

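The original NSP (next sentence prediction) task is replaced with SOP (sentence order prediction). Positive examples are built the same way as in NSP, while negative examples are the same two segments with their order swapped.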
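
A minimal sketch of how SOP training pairs could be built from two consecutive segments. The function name and the label convention (1 for original order, 0 for swapped) are assumptions for illustration.

```python
def make_sop_examples(segment_a, segment_b):
    """Return (first, second, label) triples; label 1 = original order, 0 = swapped."""
    positive = (segment_a, segment_b, 1)   # consecutive segments, in order
    negative = (segment_b, segment_a, 0)   # the same two segments, order swapped
    return [positive, negative]


examples = make_sop_examples("I went to the store.", "I bought some milk.")
for first, second, label in examples:
    print(label, first, second)
```

Because both examples contain the same two segments, the model cannot rely on topic differences and has to learn ordering and coherence instead.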

References

https://zhuanlan.zhihu.com/p/88099919

https://blog.csdn.net/weixin_37947156/article/details/101529943

https://openreview.net/pdf?id=H1eA7AEtvS


2021-10-26

Word-granularity BERT for Chinese

1 Is Word Segmentation Necessary for Deep Learning of Chinese Representations?

we find that char-based (character-granularity) models consistently outperform word-based (word-granularity) models.

We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting.

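2 Tencent's Chinese word-level model

On public datasets, the word-level model performs worse than the character-level model.

References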

https://arxiv.org/pdf/1905.05526.pdf

https://www.jiqizhixin.com/articles/2019-06-27-17


2021-10-26

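Text matching

1. Unsupervised

1.1 Edit distance

Definition

Edit distance, also known as Levenshtein distance, measures the similarity of two strings by the minimum number of basic operations needed to turn string A into string B.

The basic operations are insertion, deletion, and substitution.

Insertion: A is "AS" and B is "ASD"; turning A into B requires inserting one character, "D".

Deletion: A is "ASD" and B is "AS"; turning A into B requires deleting one character, "D".

Substitution: A is "ASX" and B is "ASD"; turning A into B requires changing the character "X" into "D".

Code

The implementation uses dynamic programming, with the recurrence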

$lev_{a,b}(i,j)=\begin{cases} \max(i,j) & \text{if } \min(i,j)=0 \\ \min\left\{ \begin{aligned} & lev_{a,b}(i-1,j)+1 \\ & lev_{a,b}(i,j-1)+1 \\ & lev_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)} \end{aligned} \right. & \text{otherwise} \end{cases}$

$i$ and $j$ are indices into string $a$ and string $b$ respectively, and $lev_{a,b}(i,j)$ is the edit distance from the prefix $a[:i]$ to the prefix $b[:j]$.

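import numpy as np


def lev(str_a, str_b):
    """
    Edit distance (ED), used to measure the similarity between words.
    The comparison is case-insensitive.
    :param str_a: source string
    :param str_b: target string
    :return: Levenshtein distance between str_a and str_b
    """
    str_a = str_a.lower()
    str_b = str_b.lower()
    matrix_ed = np.zeros((len(str_a) + 1, len(str_b) + 1), dtype=int)
    # first row and first column: distance to or from the empty prefix
    matrix_ed[0] = np.arange(len(str_b) + 1)
    matrix_ed[:, 0] = np.arange(len(str_a) + 1)
    for i in range(1, len(str_a) + 1):
        for j in range(1, len(str_b) + 1):
            # delete a_i
            dist_1 = matrix_ed[i - 1, j] + 1
            # insert b_j
            dist_2 = matrix_ed[i, j - 1] + 1
            # substitute a_i with b_j (no cost if they are equal)
            dist_3 = matrix_ed[i - 1, j - 1] + (1 if str_a[i - 1] != str_b[j - 1] else 0)
            # take the minimum distance
            matrix_ed[i, j] = np.min([dist_1, dist_2, dist_3])
    print(matrix_ed)
    return matrix_ed[-1, -1]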
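
As a cross-check on the recurrence, here is a dependency-free sketch that keeps only two rows of the table at a time. It is a variant written for illustration (the name `edit_distance` is an assumption), and unlike the function above it does not lower-case its inputs. The three calls reuse the insertion, deletion, and substitution examples from the definition.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))              # lev(0, j) = j
    for i, ch_a in enumerate(a, start=1):
        curr = [i]                              # lev(i, 0) = i
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                    # delete ch_a
                curr[j - 1] + 1,                # insert ch_b
                prev[j - 1] + (ch_a != ch_b),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AS", "ASD"))    # 1 (insert "D")
print(edit_distance("ASD", "AS"))    # 1 (delete "D")
print(edit_distance("ASX", "ASD"))   # 1 (substitute "X" with "D")
```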

1.2 TF-IDF

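Principle

(1) TF

Computed per doc.

The larger the value, the more common the term is within the doc.

$TF_{term}=\frac{\text{number of times the term occurs in the doc}}{\text{total number of words in the doc}}$

(2) IDF

Computed over all docs.

The larger the value, the rarer the term is across the doc collection.

$IDF_{term}=\log\left(\frac{\text{total number of docs}}{\text{number of docs containing the term}+1}\right)$

(3) TF-IDF

Combining the two: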

$TF-IDF_{term}=TF_{term}*IDF_{term}$

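(4) TF-IDF VEC

Take sentence A: "今天天气真好" ("The weather is really nice today"). Segmenting A gives ["今天", "天气", "真好"], and the vocabulary is ["今天", "天气", "真好", "不错呀"].

$VEC_{A}=[TF-IDF_{今天},TF-IDF_{天气},TF-IDF_{真好},0]$

(5) Computing the text similarity of two sentences

Suppose the vocabulary is ["今天", "天气", "真好", "不错呀"]. Sentence A, "今天天气真好", is segmented into ["今天", "天气", "真好"]; sentence B, "天气不错呀" ("The weather is not bad"), is segmented into ["天气", "不错呀"].

Following (4), build the TF-IDF vector $VEC_{A}$ for sentence A and $VEC_B$ for sentence B, then take the cosine similarity of the two vectors as the text similarity.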

Code

import jieba
import joblib  # standalone package; sklearn.externals.joblib is no longer available
import numpy as np
import pandas as pd

# Placeholder paths (assumptions): point these at your own corpus CSV and model file.
corpus_path = "corpus.csv"
model_path = "tf_idf_model.pkl"


class TF_IDF_Model(object):
    def __init__(self, corpus_list):
        # corpus_list: list of documents, each already segmented into a list of words
        self.documents_list = corpus_list
        self.documents_number = len(corpus_list)
        self.get_idf()

    def get_idf(self):
        # df[word] = number of documents that contain the word
        df = {}
        self.idf = {}
        for document in self.documents_list:
            for word in set(document):
                df[word] = df.get(word, 0) + 1
        for key, value in df.items():
            # +1 in the denominator smooths the document frequency
            self.idf[key] = np.log10(self.documents_number / (value + 1))
        # word -> position in the TF-IDF vector
        self.word_index = {word: i for i, word in enumerate(self.idf)}

    def get_tf(self, document):
        # segment the raw text, then compute each word's relative frequency
        document = list(jieba.cut(document))
        temp = {}
        for word in document:
            temp[word] = temp.get(word, 0) + 1 / len(document)
        return temp

    def tf_idf_vec(self, text):
        tf = self.get_tf(text)
        vec = [0] * len(self.idf)
        for word, value in tf.items():
            # words outside the training vocabulary are ignored
            if word in self.word_index:
                vec[self.word_index[word]] = value * self.idf[word]
        return vec

    def cal_similarty(self, sentence1, sentence2):
        vec1 = self.tf_idf_vec(sentence1)
        vec2 = self.tf_idf_vec(sentence2)
        denominator = np.linalg.norm(vec1) * np.linalg.norm(vec2)
        # cosine similarity; defined as 0 when either vector is all zeros
        if denominator == 0:
            return 0.0
        similarty = np.dot(vec1, vec2) / denominator
        return similarty


def train_model():
    # build the corpus: one document per row of the "name" column
    corpus = pd.read_csv(corpus_path)
    corpus_list = corpus["name"].tolist()
    corpus_list = [list(jieba.cut(str(doc))) for doc in corpus_list]
    tf_idf_model = TF_IDF_Model(corpus_list)
    joblib.dump(tf_idf_model, model_path)


def load_model(path):
    tf_idf_model = joblib.load(path)
    return tf_idf_model


if __name__ == '__main__':
    # train and save the model, then load it back
    train_model()
    tf_idf_model = load_model(model_path)
    sentence1 = "XXXX"
    sentence2 = "XXXX"
    print(tf_idf_model.get_tf(sentence1))
    print(tf_idf_model.idf)
    print(tf_idf_model.tf_idf_vec(sentence1))
    print(tf_idf_model.cal_similarty(sentence1, sentence2))
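
For readers who want to try the idea without jieba or a corpus file, here is a small self-contained sketch on pre-segmented sentences. It uses the same TF and IDF definitions as above (with the natural log); the toy corpus and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def build_idf(corpus):
    """IDF per term, using log(N / (df + 1))."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / (count + 1)) for term, count in df.items()}


def tfidf_vector(tokens, idf, vocab):
    """Dense TF-IDF vector over `vocab`; out-of-vocabulary tokens are ignored."""
    counts = Counter(tokens)
    return [counts[t] / len(tokens) * idf[t] if t in counts else 0.0 for t in vocab]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Guard against all-zero vectors (e.g. every token is out of vocabulary).
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy pre-segmented corpus (assumption).
corpus = [["今天", "天气", "真好"], ["天气", "不错呀"],
          ["我", "爱", "自然语言处理"], ["今天", "加班"]]
idf = build_idf(corpus)
vocab = list(idf)

vec_a = tfidf_vector(["今天", "天气", "真好"], idf, vocab)
vec_b = tfidf_vector(["天气", "不错呀"], idf, vocab)
vec_c = tfidf_vector(["我", "爱"], idf, vocab)
print(cosine(vec_a, vec_b), cosine(vec_a, vec_c))
```

Sentences A and B share the word "天气", so their similarity is strictly between 0 and 1, while A and the third sentence share nothing and score 0.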

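2. Supervised

Representation-based matching: a deep learning model encodes the Query and the Doc separately, and the similarity between the two vectors serves as the semantic matching score. Microsoft's DSSM [26] and its extensions belong to this family. Meituan Search borrows DSSM's two-tower structure: the left tower takes Query information and the right tower takes POI and category information, producing high-order text-relevance and category-relevance features for the Query and Doc, which worked well when used in the ranking model. Other representative representation-based models include SimNet [27] from Baidu and the multi-view recurrent matching model MV-LSTM [28] from the Chinese Academy of Sciences.

Interaction-based matching: rather than learning separate semantic vectors for the Query and the Doc, this approach lets the Query and Doc interact early, at the bottom layers of the network, to obtain a better text representation; an MLP then produces the semantic matching score. Representative models include ARC-II [29], a convolutional matching model from Huawei, and MatchPyramid [30], a hierarchical matching model built on a matching matrix, from the Chinese Academy of Sciences.

The advantage of representation-based matching is that Doc vectors can be precomputed offline, so online prediction only needs to compute the Query vector. The drawback is that the Query and Doc never interact while the model is learning, so fine-grained matching signals between them are not fully used. Interaction-based matching lets the Query and Doc interact fully during training and matches semantics better, but it is more expensive to deploy online.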
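
The offline/online split of representation-based matching can be sketched as follows. The bag-of-characters encoder is only a toy stand-in for a trained tower, and the class name, function names, and example documents are all assumptions for illustration.

```python
import numpy as np


class TwoTowerIndex:
    """Representation-based matching: doc vectors are computed once, offline."""

    def __init__(self, docs, encode):
        self.docs = docs
        self.encode = encode
        # Offline step: encode and L2-normalize every doc.
        self.doc_matrix = np.stack([self._unit(encode(d)) for d in docs])

    @staticmethod
    def _unit(vec):
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def search(self, query):
        # Online step: only the query is encoded; scoring is one matrix product.
        scores = self.doc_matrix @ self._unit(self.encode(query))
        best = int(np.argmax(scores))
        return self.docs[best], float(scores[best])


def make_char_encoder(texts):
    """Toy stand-in for a trained tower: bag-of-characters counts (assumption)."""
    chars = sorted({c for t in texts for c in t})
    char_to_id = {c: i for i, c in enumerate(chars)}

    def encode(text):
        vec = np.zeros(len(char_to_id))
        for c in text:
            if c in char_to_id:
                vec[char_to_id[c]] += 1
        return vec

    return encode


docs = ["sichuan hotpot", "beijing roast duck", "cantonese dim sum"]
doc_index = TwoTowerIndex(docs, make_char_encoder(docs))
best_doc, best_score = doc_index.search("roast duck")
print(best_doc, best_score)
```

An interaction-based model could not be served this way, because its score depends on the Query and Doc being fed through the network together, which is where the higher online cost comes from.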

Matching differs from ranking: matching is one-to-one, while ranking is one-to-many.

2.1 Representation-based

https://zhuanlan.zhihu.com/p/138864580

https://blog.csdn.net/qq_27590277/article/details/121391770

2.2 Interaction-based

https://blog.csdn.net/guofei_fly/article/details/107501276

