Other than semantic and relevance matching, more complex factors/trade-offs, e.g., user personalization [2, 3, 10] and retrieval efficiency [5], need to be considered when applying deep models to a large-scale online retrieval system.

2.2 Deep Retrieval in Industry Search

Representation-based models with an ANN (approximate nearest neighbor) algorithm have become the mainstream way to deploy neural retrieval models efficiently in industry.

3 MODEL

The overall architecture is as follows:

3.1 Problem Formulation

$\mathcal{U}=\{u_1,…,u_u,…,u_N\}$ denotes the $N$ users, $\mathcal{Q}=\{q_1,…,q_u,…,q_N\}$ the $N$ queries corresponding to those users, and $\mathcal{I}=\{i_1,…,i_u,…,i_M\}$ the $M$ items. The historical behaviors of user $u$ are split by time into three parts: 1. real-time sequences, before the current time step, $\mathcal{R}^u=\{i_1^u,…,i_t^u,…,i_T^u\}$; 2. short-term sequences, before $\mathcal{R}$ and within ten days, $\mathcal{S}^u=\{i_1^u,…,i_t^u,…,i_T^u\}$; 3. long-term sequences, before $\mathcal{S}$ and within one month, $\mathcal{L}^u=\{i_1^u,…,i_t^u,…,i_T^u\}$, where $T$ is the sequence length. The task can be defined as:

$z=\mathcal{F}(\phi(q_u,\mathcal{R}^u,\mathcal{S}^u,\mathcal{L}^u),\varphi(i))$

where $\mathcal{F}(\cdot)$, $\phi(\cdot)$, and $\varphi(\cdot)$ denote the scoring function, the query and behaviors encoder, and the item encoder, respectively.

3.2 User Tower

3.2.1 Multi-Granular Semantic Unit

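This unit mines the semantics of the query. The raw input contains the current query and the historical queries.

The paper does not explain why the unit is designed this way; it reads like a conclusion from engineering experiments. One open question: would it not be better to mine the query semantics directly with a deep language model such as BERT?

The query is represented as $q_u=\{w_1^u,…,w_n^u\}$, e.g. {红色, 连衣裙} ("red", "dress"); each word is $w_u=\{c_1^u,…,c_m^u\}$, e.g. {红, 色}; and the history query is $q_{his}=\{q_1^u,…,q_k^u\}$, e.g. {绿色, 半身裙, 黄色, 长裙} ("green", "skirt", "yellow", "long dress"), where $w_n \in \mathbb{R}^{1\times d}$, $c_m \in \mathbb{R}^{1\times d}$, and $q_k \in \mathbb{R}^{1\times d}$.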

$q_{1\_gram}=mean\_pooling(c_1,...,c_m) \\q_{2\_gram}=mean\_pooling(c_1c_2,...,c_{m-1}c_m) \\q_{seq}=mean\_pooling(w_1,...,w_n) \\q_{seq\_seq}=mean\_pooling(T_{rm}(w_1,...,w_n)) \\q_{his\_seq}=softmax(q_{seq}\cdot(q_{his})^{T})q_{his} \\q_{mix}=q_{1\_gram}+q_{2\_gram}+q_{seq}+q_{seq\_seq}+q_{his\_seq} \\Q_{mgs}=concat(q_{1\_gram},q_{2\_gram},q_{seq},q_{seq\_seq},q_{his\_seq},q_{mix})$

where $T_{rm}$, $mean\_pooling$, and $concat$ denote the Transformer, average, and vertical concatenation operations, respectively.
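The pooling, history attention, and concatenation steps above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, embeddings are plain lists of floats, and the Transformer is replaced by an identity placeholder `trm`.

```python
import math

def mean_pooling(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def history_attention(q_seq, q_his):
    """q_his_seq = softmax(q_seq . q_his^T) q_his for one query vector."""
    weights = softmax([dot(q_seq, h) for h in q_his])
    d = len(q_seq)
    return [sum(w * h[j] for w, h in zip(weights, q_his)) for j in range(d)]

def multi_granular_unit(unigrams, bigrams, words, q_his, trm=lambda ws: ws):
    """Return Q_mgs as a list of six d-dimensional rows.

    `trm` stands in for the Transformer encoder; the identity default is a
    placeholder assumption, not the model described in the paper.
    """
    q_1gram = mean_pooling(unigrams)
    q_2gram = mean_pooling(bigrams)
    q_seq = mean_pooling(words)
    q_seq_seq = mean_pooling(trm(words))
    q_his_seq = history_attention(q_seq, q_his)
    parts = [q_1gram, q_2gram, q_seq, q_seq_seq, q_his_seq]
    q_mix = [sum(col) for col in zip(*parts)]  # element-wise sum of the five
    return parts + [q_mix]  # vertical concatenation: shape 6 x d
```

Stacking the six vectors as rows gives the query attention six different granularities to attend from.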

3.2.2 User Behaviors Attention

$e_i^f=W_f\cdot x_i^{f} \in \mathbb{R}^{1\times d_f} \\i_t^u=concat(\{e_i^f\ | \ f \in \mathcal{F} \})$

where $W_f$ is the embedding matrix, $x_i^{f}$ is a one-hot vector, and $\mathcal{F}$ is the set of side information (e.g., leaf category, first-level category, brand, and shop).

real-time sequences

User’s click_item

$\mathcal{R}_{lstm}^u=LSTM(\mathcal{R}^u)=\{h_1^{u},...,h_t^{u},...,h_T^{u} \} \\\mathcal{R}_{self\_att}^u=multihead\_selfattention(\mathcal{R}_{lstm}^u)=\{h_1^{u},...,h_t^{u},...,h_T^{u} \} \\\mathcal{R}_{zero\_att}^u=\{0,h_1^{u},...,h_t^{u},...,h_T^{u} \} \ \# add \ a \ zero \ vector \ at \ the \ first \ position \ of \ \mathcal{R}_{self\_att}^u \\H_{real}=softmax(Q_{mgs}\cdot\mathcal{R}_{zero\_att}^T)\cdot\mathcal{R}_{zero\_att}^T$
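The last two steps (prepending a zero vector, then attending from $Q_{mgs}$) can be sketched as follows. This is a hedged illustration: `zero_attention` is a hypothetical name, the LSTM and self-attention stages are assumed to have already produced `hidden_states`, and the attention weights are multiplied by the rows of the sequence, which is the shape-consistent reading of the formula.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_attention(Q_mgs, hidden_states):
    """Attend from each row of Q_mgs over [0, h_1, ..., h_T].

    Q_mgs: list of query rows, each of length d.
    hidden_states: list of T vectors of length d.
    Returns one d-dimensional output row per query row.
    """
    d = len(hidden_states[0])
    keys = [[0.0] * d] + hidden_states  # prepend the zero vector
    outputs = []
    for q in Q_mgs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)]
        )
    return outputs
```

The zero vector always scores 0, so when every real behavior scores strongly negative against the query, most of the attention mass lands on it and the output shrinks toward zero.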

short-term sequences

User’s click_item

$\mathcal{S}_{self\_att}^u=multihead\_selfattention(\mathcal{S}^u)=\{h_1^{u},...,h_t^{u},...,h_T^{u} \} \\\mathcal{S}_{zero\_att}^u=\{0,h_1^{u},...,h_t^{u},...,h_T^{u} \} \\H_{short}=softmax(Q_{mgs}\cdot\mathcal{S}_{zero\_att}^T)\cdot\mathcal{S}_{zero\_att}^T$

long-term sequence

$\mathcal{L}^u$ consists of four parts, $\mathcal{L}^{u}_{item},\mathcal{L}^{u}_{shop},\mathcal{L}^{u}_{leaf},\mathcal{L}^{u}_{brand}$, and each part covers three actions: click, buy, and collect.

$\mathcal{L}_{click\_item},\mathcal{L}_{buy\_item},\mathcal{L}_{collect\_item} \rightarrow L^{T}_{item} \\H_{a\_item}=softmax(Q_{mgs}\cdot L^{T}_{item})\cdot L^{T}_{item} \\H_{long}=H_{a\_item}+H_{a\_shop}+H_{a\_leaf}+H_{a\_brand}$

3.2.3 Fusion of Semantics and Personalization

$H_{qu}=Self\_Att^{first}([[cls],Q_{mgs},H_{real},H_{short},H_{long}]) \in \mathbb{R}^{1\times d}$

3.3 Item Tower

For the item tower, we experimentally use item ID and title to obtain the item representation $H_{item}$. Given the representation of item $i$'s ID, $e_i \in \mathbb{R}^{1\times d}$, and its title segmentation result $T_i=\{w_1^{i},…,w_N^{i}\}$, the item representation is:

$H_{item}=e+tanh(W_t\cdot\frac{\sum_{i=1}^Nw_i}{N})$

where $W_t$ is the transformation matrix. We empirically find that applying LSTM [12] or Transformer [27] to capture the context of the title is not as effective as simple mean-pooling since the title is stacked by keywords and lacks grammatical structure.
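A minimal sketch of the item tower formula, assuming plain Python lists for the embeddings and a square $d \times d$ matrix for $W_t$ (the name `item_tower` and the toy shapes are illustrative assumptions):

```python
import math

def item_tower(e_id, title_words, W_t):
    """H_item = e + tanh(W_t . mean(title word embeddings)).

    e_id: item-ID embedding of length d.
    title_words: list of N title-word embeddings, each of length d.
    W_t: d x d transformation matrix given as a list of rows.
    """
    n = len(title_words)
    mean_title = [sum(col) / n for col in zip(*title_words)]
    projected = [sum(w * x for w, x in zip(row, mean_title)) for row in W_t]
    return [e + math.tanh(p) for e, p in zip(e_id, projected)]
```

Mean-pooling ignores word order, which matches the observation that titles are stacks of keywords rather than grammatical sentences.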

3.4 Loss Function

The softmax cross-entropy loss is adapted as the training objective:

$\hat{y}(i^+|q_u)=\frac{exp(\mathcal{F}(q_u,i^{+}))}{\sum_{i^{'}\in I }exp(\mathcal{F}(q_u,i^{'}))} \\L(\nabla )=-\sum_{i\in I}y_ilog(\hat{y_i})$

where $\mathcal{F}$, $I$, $i^+$, and $q_u$ denote the inner product, the full item pool, the item tower's representation $H_{item}$, and the user tower's representation $H_{qu}$, respectively.

3.4.1 Smoothing Noisy Training Data

The softmax function with the temperature parameter $\tau$ is defined as follows:

$\hat{y}(i^+|q_u)=\frac{exp(\mathcal{F}(q_u,i^{+})/\tau)}{\sum_{i^{'}\in I }exp(\mathcal{F}(q_u,i^{'})/\tau)}$

If $\tau \to 0$, the fitted distribution is close to a one-hot distribution; if $\tau \to \infty$, the fitted distribution is close to a uniform distribution.
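The two limits are easy to check numerically. The sketch below is an illustration only: the function names are hypothetical, and the scores stand in for the inner products $\mathcal{F}(q_u, i)$ over a tiny candidate set rather than the full item pool.

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax over scores divided by a temperature tau > 0."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, positive_index, tau=1.0):
    """Negative log-probability of the positive item under the softmax."""
    probs = softmax_with_temperature(scores, tau)
    return -math.log(probs[positive_index])

scores = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(scores, 0.01)   # small tau: near one-hot
flat = softmax_with_temperature(scores, 1000.0)  # large tau: near uniform
```

With these toy scores, `sharp` puts almost all mass on the first item while `flat` stays within a fraction of a percent of 1/3 per item.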

3.4.2 Generating Relevance-improving Hard Negative Samples

We first select the negative items $i^-$ that have the top-$N$ inner product scores with $q_u$ to form the hard sample set $I_{hard}$:

$I_{mix}=\alpha i^++(1-\alpha)I_{hard}$

where $\alpha\in \mathbb{R}^{N\times1}$ is sampled from the uniform distribution $U(a, b)$ ($0 \le a < b \le 1$).

$\hat{y}(i^+|q_u)=\frac{exp(\mathcal{F}(q_u,i^{+})/\tau)}{\sum_{i^{'}\in (I\cup I_{mix}) }exp(\mathcal{F}(q_u,i^{'})/\tau)}$
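A rough sketch of the selection and interpolation steps, assuming item vectors are plain lists and drawing one $\alpha$ per hard sample. The helper names and the `rng` parameter are hypothetical, introduced only for illustration.

```python
import random

def top_n_hard_negatives(q_u, negatives, n):
    """Pick the n negatives with the largest inner product with q_u."""
    def score(item):
        return sum(a * b for a, b in zip(q_u, item))
    return sorted(negatives, key=score, reverse=True)[:n]

def mix_hard_negatives(i_pos, hard_negatives, a, b, rng=random):
    """I_mix = alpha * i_pos + (1 - alpha) * I_hard, one alpha per hard sample."""
    mixed = []
    for neg in hard_negatives:
        alpha = rng.uniform(a, b)  # alpha ~ U(a, b)
        mixed.append([alpha * p + (1.0 - alpha) * x for p, x in zip(i_pos, neg)])
    return mixed
```

A larger $\alpha$ moves the synthetic sample closer to the positive item in embedding space, which makes it a harder negative.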

Search systems, retrieval, vectors

Embedding based Product Retrieval in Taobao Search

2021-08-21

Meituan Ranking

https://tech.meituan.com/2022/08/11/coarse-ranking-exploration-practice.html

https://tech.meituan.com/2021/07/08/multi-business-modeling.html

https://tech.meituan.com/2021/11/19/exploration-and-practice-of-multi-business-commodities-ranking-in-meituan-search.html

https://tech.meituan.com/2020/07/09/bert-in-meituan-search.html

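Exploration and Practice of BERT in Meituan Search Core Ranking

Model level

The overall architecture is as follows:

1 BERT pre-training

2 Multi-task learning

Scenario layer: split by business scenario, with a separate network structure designed for each scenario.

3 Joint training

The two tasks are:

1 Relevance task: relevance + NER (multi-task learning to strengthen relevance)

2 Ranking task

It is not clear from the article how the two tasks are actually trained jointly.

Previously a two-stage fine-tune was used: 1. the relevance task first, 2. then the ranking task.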

Search systems, ranking

Meituan Ranking

2021-08-18

The DSSM Two-Tower Model Family

A brief introduction to Microsoft's DSSM, CNN-DSSM, and LSTM-DSSM.

The original papers are:

"Learning Deep Structured Semantic Models for Web Search using Clickthrough Data"

"A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval"

"Semantic Modelling with Long-Short-Term Memory for Information Retrieval"

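First, why is it called a two-tower model? The query tower does online serving, while the doc tower computes embeddings offline and builds an index that is then pushed online.

Note that in DSSM the query and the different docs share parameters; see https://flashgene.com/archives/72820.html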
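The offline/online split can be sketched as follows. Everything here is a toy assumption: `encode` is a hashed bag-of-characters stand-in rather than a trained tower, and the "index" is a plain list scanned by brute force instead of an ANN structure. A single `encode` is used for both sides to mirror the shared-parameter note.

```python
def encode(text, dim=8):
    """Toy stand-in encoder: hashed bag of characters (NOT a trained tower)."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

def build_index(docs):
    """Offline step: embed every document once and store the vectors."""
    return [(doc, encode(doc)) for doc in docs]

def search(query, index, k=2):
    """Online step: embed only the query, then score it against the index."""
    q = encode(query)
    scored = [(sum(a * b for a, b in zip(q, v)), doc) for doc, v in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

At serving time only `encode(query)` runs through the model; all document vectors were computed ahead of time, which is what keeps the online cost low.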

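I. DSSM

1.1 Overall model structure

The overall structure of the model is shown in the figure above, where $Q$ is the query and $D_i$ is a document.

The initial bag-of-words representation of the text is $x$. Its dimensionality leads to too many parameters, which makes training difficult, so word hashing is proposed to reduce the dimension: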

$l_1=W_1x$

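Word hashing simply splits each word into character n-grams and then represents it as a vector (still a bag-of-n-grams vector here, not a dense vector), as shown below.

One concern is whether different words could end up with the same vector representation. The authors ran an experiment on this, with the following results.

For a vocabulary of 500K words, using letter 3-grams compresses the vocabulary to 30K, with only 22 collisions. The collision rate is 0.0044% and the dimensionality is reduced to 6% of the original, which is very effective.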
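A small sketch of letter-trigram word hashing, following the `#good#` example from the DSSM paper. The function names and the tiny `trigram_index` used in the usage lines are illustrative assumptions.

```python
def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary marks."""
    marked = "#" + word + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def word_hash(text, trigram_index):
    """Bag-of-letter-trigrams count vector for a whitespace-tokenized text."""
    vec = [0] * len(trigram_index)
    for word in text.lower().split():
        for tri in letter_trigrams(word):
            if tri in trigram_index:
                vec[trigram_index[tri]] += 1
    return vec

trigrams = letter_trigrams("good")
index = {tri: pos for pos, tri in enumerate(trigrams)}
vector = word_hash("good good", index)
```

Two different words collide only if they produce exactly the same multiset of trigrams, which is why so few collisions are reported.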

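This is followed by several nonlinear projection layers, each a fully connected layer:

$l_i=f(W_il_{i-1}+b_{i}),i=2,...,N-1$

The last nonlinear projection layer yields the semantic feature $y$: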

$y=f(W_Nl_{N-1}+b_N)\\ f(x)=tanh(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$

Cosine similarity measures the relevance between $Q$ and $D$:

$R(Q,D)=cosine(y_Q,y_D)=\frac{y_Q^Ty_D}{||y_Q||||y_D||}$

The final probability output is

$P(D|Q)=\frac{e^{\gamma R(Q,D)}}{\sum_{D^{'}\in \textbf{D}}e^{\gamma R(Q,D^{'})}}$

where $\gamma$ is the smoothing factor.
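The relevance score and the posterior can be sketched directly from the two formulas. This assumes the semantic vectors are already computed and passed in as plain lists; the function names are illustrative.

```python
import math

def cosine(y_q, y_d):
    """R(Q, D): cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(y_q, y_d))
    norm_q = math.sqrt(sum(a * a for a in y_q))
    norm_d = math.sqrt(sum(b * b for b in y_d))
    return dot / (norm_q * norm_d)

def posterior(y_q, doc_vectors, gamma):
    """P(D | Q) for each candidate document via a gamma-scaled softmax."""
    scores = [gamma * cosine(y_q, y_d) for y_d in doc_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because cosine similarity is bounded in $[-1, 1]$, a larger $\gamma$ is what lets the softmax become peaked; with $\gamma = 0$ every candidate gets the same probability.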

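1.2 Training

Training set construction: each positive pair $(Q,D^+)$ is matched with 4 randomly sampled negatives $(Q,D_j^-;j=1,..,4)$.

The loss function is:

$L(\Lambda)=-log \prod \limits_{(Q,D^+)}P(D^+|Q)$

where $\Lambda$ denotes the model parameters.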
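A hedged sketch of the sampling and the loss, working directly on relevance scores $R(Q,D)$ so that it stays independent of any particular encoder. The helper names and the `rng` parameter are hypothetical.

```python
import math
import random

def sample_negatives(positive, corpus, k=4, rng=random):
    """Draw k distinct random documents other than the clicked one."""
    pool = [d for d in corpus if d != positive]
    return rng.sample(pool, k)

def pair_loss(r_pos, r_negs, gamma):
    """-log P(D+ | Q) from relevance scores R(Q, D+) and R(Q, D-_j)."""
    scores = [gamma * r for r in [r_pos] + r_negs]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[0]

def total_loss(batch, gamma):
    """-log of the product over pairs, i.e. the sum of per-pair losses."""
    return sum(pair_loss(r_pos, r_negs, gamma) for r_pos, r_negs in batch)
```

Taking the log turns the product over training pairs into a sum, so each $(Q, D^+)$ pair contributes an independent softmax cross-entropy term over its five candidates.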

II. CNN-DSSM

2.1 CLSM结构

The model consists of the following parts: (1) a word-n-gram layer obtained by running a contextual sliding window over the input word sequence (2) a letter-trigram layer that transforms each word-trigram into a letter-trigram representation vector (3) a convolutional layer that extracts contextual features for each word with its neighboring words defined by a window (4) a max-pooling layer that discovers and combines salient word-n-gram features to form a fixed-length sentence-level feature vector (5) a semantic layer that extracts a high-level semantic feature vector for the input word sequence.

2.2 Letter-trigram based Word-n-gram Representation

On top of DSSM's letter-trigram, a word-n-gram layer is added. A word-n-gram is a sliding window over the raw input text; the $t$-th word-n-gram can be written as:

$l_t=[f^T_{t-d},...,f^T_{t},...,f^T_{t+d}]^T,\ t=1,2,...,T$

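where $n=2d+1$ and $f_t$ is the letter-trigram representation of the $t$-th word. A letter-trigram vector has $30K$ dimensions, so a word-n-gram has $n\times30K$ dimensions.

For example, as in the figure above, with the input text $(s) \ online \ auto\ body \ (s)$ and a sliding window of size $n=3$, we get $(s)\ online \ auto$, $online \ auto \ body$, and $auto\ body \ (s)$, so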

$l_1=[f^T((s)),f^T(online ),f^T(auto)]^T,\\l_2=[f^T(online ),f^T(auto),f^T(body)]^T,\\l_3=[f^T(auto),f^T(body),f^T((s))]^T$
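The sliding window in this example can be reproduced with a short sketch. The names are illustrative; `pad` plays the role of the $(s)$ boundary marker, and `f` is any function mapping a word to its letter-trigram vector.

```python
def word_ngram_windows(words, n=3, pad="<s>"):
    """Slide a window of n = 2d + 1 words over a padded word sequence."""
    d = (n - 1) // 2
    padded = [pad] * d + list(words) + [pad] * d
    return [padded[t:t + n] for t in range(len(words))]

def concat_window(window, f):
    """l_t: concatenate the letter-trigram vectors f(w) of the window's words."""
    l_t = []
    for w in window:
        l_t.extend(f(w))
    return l_t

windows = word_ngram_windows(["online", "auto", "body"], n=3)
```

There is one window per input word, so a $T$-word text yields $T$ vectors $l_1, \dots, l_T$, each $n$ times as long as a single letter-trigram vector.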

2.3 Modeling Word-n-gram-Level Contextual Features at the Convolutional Layer

The contextual feature vector $h_t$ can be written as:

$h_t=tanh(W_c\cdot l_t),\ t=1,...,T$

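where $W_c$ is the feature transformation matrix, i.e. the convolution matrix, shared across all word n-grams. Readers may well wonder: isn't this just a fully connected layer, and what does it have to do with convolution? I was puzzled too. One way to read it: because the same $W_c$ is applied to every sliding window $l_t$, the layer is a 1-D convolution over the word sequence with kernel width $n$.

The figure below shows an experiment by the authors.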

2.4 Modeling Sentence-Level Semantic Features Using Max Pooling

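After obtaining the local contextual feature vectors, we need to combine them into a sentence-level feature vector. Some words in a sentence are unimportant and can be ignored, while others are important and must be kept. To achieve this, max pooling is used: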

$v(i)= \mathop{\max}_{t=1,..,T} \{h_t(i)\},\ i=1,...,K$

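where $v(i)$ is the $i$-th element of the pooling layer output $v$, $K$ is the dimension of $v$ (the same as that of $h_t$), and $h_t(i)$ is the $i$-th element of the $t$-th local contextual feature vector. An example follows.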
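The convolutional layer and the max pooling can be sketched together in pure Python. This is an illustration with toy shapes: `W_c` is a $K \times (n \cdot 30K)$ matrix in the model but a tiny list of rows here, and the function names are hypothetical.

```python
import math

def conv_layer(windows, W_c):
    """h_t = tanh(W_c . l_t), with the same W_c applied to every window."""
    return [
        [math.tanh(sum(w * x for w, x in zip(row, l_t))) for row in W_c]
        for l_t in windows
    ]

def max_pooling(h):
    """v(i) = max over t of h_t(i): keep the strongest response per dimension."""
    return [max(col) for col in zip(*h)]

l = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # three toy window vectors l_t
W_c = [[1.0, 0.0], [0.0, -1.0]]           # K = 2 output features
v = max_pooling(conv_layer(l, W_c))
```

The output length is $K$ no matter how many windows the text has, which is how variable-length input becomes a fixed-length sentence vector.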

2.5 Latent Semantic Vector Representations

The semantic vector $y$ is given by

$y=tanh(W_s\cdot v)$

2.6 Using the CLSM for IR

Same as in DSSM:

$R(Q,D)=cosine(y_Q,y_D)=\frac{y_Q^Ty_D}{||y_Q||||y_D||} \\P(D|Q)=\frac{e^{\gamma R(Q,D)}}{\sum_{D^{'}\in \textbf{D}}e^{\gamma R(Q,D^{'})}}$

2.7 Loss function

$L(\Lambda)=-log \prod \limits_{(Q,D^+)}P(D^+|Q)$

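III. LSTM-DSSM

CNN-DSSM only captures local text information, while an LSTM captures long-range sequence information better than a CNN, so an LSTM is used to improve DSSM.

3.1 Model structure

The overall structure is shown in the figure below; note that the red parts mark the direction in which the residual (error) is propagated.

The LSTM cell in the figure is a variant of the LSTM, one with peephole connections. Its structure is as follows.

References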

https://www.cnblogs.com/guoyaohua/p/9229190.html

Search systems, ranking

DSSM

2021-08-06

Search Systems

https://zhuanlan.zhihu.com/p/112719984

https://zhuanlan.zhihu.com/p/382001982

1. Offline

Fetch items -> process items -> 1. build the index 2. attribute store

2. Online

query -> query understanding -> retrieval -> ranking -> results

Search Systems

1. Text rewriting

1.1 Query correction

1.2 Query alignment

1.3 Query expansion

2. Term analysis

References


1 INTRODUCTION

2 RELATED WORK

2.1 Deep Matching in Search