Lucene is an open-source full-text retrieval toolkit (it provides inverted-index capabilities); in plain terms, it is just a jar library.
https://zhuanlan.zhihu.com/p/71629409?fileGuid=It0Qkg2AiecFMx62
Solr is a top-level open-source project under Apache, developed in Java; it is a full-text search server built on Lucene. Solr offers a richer query language than Lucene, is configurable and extensible, and optimizes both indexing and search performance.
https://arxiv.org/pdf/2106.09297.pdf
http://xtf615.com/2021/10/07/taobao-ebr/
This framework is the mainstream architecture of search systems: matching/retrieval, pre-ranking, ranking, and re-ranking.
Deep text-matching models fall into two categories: representation-based learning and interaction-based learning.
Other than semantic and relevance matching, more complex factors/trade-offs, e.g., user personalization [2, 3, 10] and retrieval efficiency [5], need to be considered when applying deep models to a large-scale online retrieval system.
Representation-based models paired with an ANN (approximate nearest neighbor) algorithm have become the mainstream approach for efficiently deploying neural retrieval models in industry.
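A minimal sketch of this deployment pattern, assuming a generic two-tower setup: queries and items are encoded independently, item embeddings are precomputed offline, and top-k items are fetched by inner product. The brute-force search below is a stand-in for a real ANN index (e.g., HNSW or IVF-PQ); the embeddings are random placeholders, not the paper's encoders.

```python
# Representation-based retrieval sketch: precomputed item embeddings,
# brute-force top-k by inner product as a stand-in for an ANN index.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # embedding dimension (assumed)
item_pool = rng.normal(size=(1000, d))   # offline item-tower embeddings

def retrieve_top_k(query_emb: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k items with the highest inner-product score."""
    scores = item_pool @ query_emb       # one score per candidate item
    return np.argsort(-scores)[:k]       # descending order, keep top-k

query_emb = rng.normal(size=d)           # stand-in for the query-tower output
top_items = retrieve_top_k(query_emb, k=10)
```

In production the `item_pool` side is rebuilt in batch and served from an ANN index, while only the query tower runs online.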
The overall architecture is as follows:
Let $\mathcal{U}=\{u_1,\dots,u_u,\dots,u_N\}$ denote a set of $N$ users, $\mathcal{Q}=\{q_1,\dots,q_u,\dots,q_N\}$ the corresponding $N$ queries, and $\mathcal{I}=\{i_1,\dots,i_i,\dots,i_M\}$ a set of $M$ items. User $u$'s historical behaviors are split by time into three parts: 1. real-time sequences, before the current time step, $\mathcal{R}^u=\{i_1^u,\dots,i_t^u,\dots,i_T^u\}$; 2. short-term sequences, before $\mathcal{R}^u$ and within ten days, $\mathcal{S}^u=\{i_1^u,\dots,i_t^u,\dots,i_T^u\}$; 3. long-term sequences, before $\mathcal{S}^u$ and within one month, $\mathcal{L}^u=\{i_1^u,\dots,i_t^u,\dots,i_T^u\}$, where $T$ is the sequence length. The task can then be defined as:

$$z = \mathcal{F}\big(\phi(q_u, \mathcal{R}^u, \mathcal{S}^u, \mathcal{L}^u),\ \varphi(i)\big)$$
where $\mathcal{F}(\cdot),\phi(\cdot),\varphi(\cdot)$ denote the scoring function, the query-and-behaviors encoder, and the item encoder, respectively.
This part mines the semantics of the query; the raw input consists of the current query and historical queries.
The paper does not explain why it is designed this way; it reads like the outcome of engineering experiments. One open question: would it not work just as well to mine query semantics directly with a deep language model such as BERT?
The query is represented as $q_u=\{w_1^u,\dots,w_n^u\}$, e.g., {红色 (red), 连衣裙 (dress)}; each word is represented by its characters, $w_n=\{c_1^u,\dots,c_m^u\}$, e.g., {红, 色}; the history queries are $q_{his}=\{q_1^u,\dots,q_k^u\}$, e.g., {绿色 (green), 半身裙 (skirt), 黄色 (yellow), 长裙 (long dress)}, where $w_n \in \mathbb{R}^{1\times d}$, $c_m \in \mathbb{R}^{1\times d}$, $q_k \in \mathbb{R}^{1\times d}$.
where $Trm$, $mean\_pooling$, and $concat$ denote the Transformer, average pooling, and vertical concatenation operations, respectively.
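The multi-granular plumbing above can be sketched as follows. This is only the mean-pooling/concatenation skeleton: the $Trm$ (Transformer) blocks are replaced by identity/mean-pooling placeholders, and all embeddings are random stand-ins, so this illustrates the shapes and fusion, not the paper's trained encoders.

```python
# Sketch of multi-granular query representation: char-grained, word-grained,
# and pooled history-query embeddings, vertically concatenated.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # embedding dimension (assumed)
n_words, m_chars, k_hist = 2, 2, 4

w = rng.normal(size=(n_words, d))            # word embeddings w_1..w_n
c = rng.normal(size=(n_words, m_chars, d))   # char embeddings per word
q_his = rng.normal(size=(k_hist, d))         # history query embeddings q_1..q_k

q_char = c.mean(axis=1)                      # char-grained: pool chars per word -> (n, d)
q_seg = w                                    # word-grained (segmentation)      -> (n, d)
q_hist = q_his.mean(axis=0, keepdims=True)   # pooled history                   -> (1, d)

# vertical concatenation of all granularities into one representation matrix
q_mgs = np.concatenate([q_char, q_seg, q_hist], axis=0)
```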
where $W_f$ is an embedding matrix, $x_i^{f}$ is a one-hot vector, and $\mathcal{F}$ is the set of side information (e.g., leaf category, first-level category, brand, and shop).
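A small sketch of this side-information lookup: each feature $f \in \mathcal{F}$ has its own embedding matrix $W_f$, and the one-hot vector $x_i^f$ selects one row. The vocabulary sizes are illustrative, and summing the per-feature embeddings is an assumption about the fusion step.

```python
# Side-information embedding lookup: one embedding matrix W_f per feature,
# indexed by the feature's ID (equivalent to multiplying by a one-hot vector).
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # embedding dimension (assumed)
vocab = {"leaf_cat": 50, "first_cat": 10, "brand": 100, "shop": 200}
W = {f: rng.normal(size=(v, d)) for f, v in vocab.items()}  # W_f per feature

def side_info_embedding(feature_ids: dict) -> np.ndarray:
    """Sum the selected rows of each W_f (assumed fusion: elementwise sum)."""
    return sum(W[f][idx] for f, idx in feature_ids.items())

e_side = side_info_embedding({"leaf_cat": 3, "first_cat": 1, "brand": 42, "shop": 7})
```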
real-time sequences: the user's clicked items
short-term sequences: the user's clicked items
long-term sequences
$\mathcal{L}^u$ consists of four parts, $\mathcal{L}^{u}_{item},\mathcal{L}^{u}_{shop},\mathcal{L}^{u}_{leaf},\mathcal{L}^{u}_{brand}$, each of which covers three actions: click, buy, and collect.
For the item tower, we experimentally use item ID and title to obtain the item representation $H_{item}$. Given the representation of item $i$'s ID, $e_i \in \mathbb{R}^{1\times d}$, and its title segmentation result $T_i=\{w_1^{i},\dots,w_N^{i}\}$
where $W_t$ is the transformation matrix. We empirically find that applying LSTM [12] or Transformer [27] to capture the context of the title is not as effective as simple mean-pooling since the title is stacked by keywords and lacks grammatical structure.
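The item tower described above can be sketched as below: the title-word embeddings are mean-pooled, transformed by $W_t$, and fused with the item-ID embedding. The `tanh` nonlinearity and the additive fusion are assumptions about the exact form; the embeddings are random placeholders.

```python
# Item-tower sketch: item-ID embedding fused with mean-pooled title words.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # embedding dimension (assumed)
e_i = rng.normal(size=d)                 # item ID embedding e_i
T_i = rng.normal(size=(6, d))            # title word embeddings w_1..w_N
W_t = rng.normal(size=(d, d))            # transformation matrix W_t

title_vec = T_i.mean(axis=0)             # simple mean-pooling over title words
H_item = e_i + np.tanh(W_t @ title_vec)  # assumed fusion: ID + transformed title
```

Per the observation above, mean-pooling is preferred here precisely because titles are keyword-stacked and lack grammatical structure, so sequence models add cost without gain.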
adopt the softmax cross-entropy loss as the training objective
where $\mathcal{F}$, $I$, $i^+$, and $q_u$ denote the inner product, the full item pool, the item tower's representation $H_{item}$, and the user tower's representation $H_{qu}$, respectively.
the softmax function with the temperature parameter $\tau$ is defined as follows
If $\tau \to 0$, the fitted distribution approaches a one-hot distribution; if $\tau \to \infty$, it approaches a uniform distribution.
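The two limiting behaviors can be checked numerically with a minimal temperature-scaled softmax (a generic sketch, not the paper's code):

```python
# Temperature-scaled softmax: small tau sharpens toward one-hot,
# large tau flattens toward uniform.
import numpy as np

def softmax_with_temperature(logits: np.ndarray, tau: float) -> np.ndarray:
    z = logits / tau
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
sharp = softmax_with_temperature(logits, tau=0.05)   # nearly one-hot
flat = softmax_with_temperature(logits, tau=100.0)   # nearly uniform
```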
We first select the negative items $i^-$ that have the top-$N$ inner-product scores with $q_u$ to form the hard sample set $I_{hard}$.
where $\alpha \in \mathbb{R}^{N\times 1}$ is sampled from the uniform distribution $U(a, b)$ with $0 \le a < b \le 1$.
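The hard-negative generation described above can be sketched as follows: select the top-$N$ negatives by inner product with $q_u$, then interpolate each with the positive item using $\alpha \sim U(a, b)$. The embeddings, pool size, and the values of $a$ and $b$ are illustrative assumptions.

```python
# Hard-negative generation sketch: top-N selection plus mixup-style
# interpolation between the positive item and each hard negative.
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 5
q_u = rng.normal(size=d)                  # user tower representation
i_pos = rng.normal(size=d)                # positive item embedding i+
negatives = rng.normal(size=(100, d))     # candidate negative embeddings

# 1) select top-N negatives by inner-product score with q_u -> I_hard
scores = negatives @ q_u
I_hard = negatives[np.argsort(-scores)[:N]]

# 2) interpolate: alpha * i+ + (1 - alpha) * i-, alpha ~ U(a, b), alpha in R^{N x 1}
a, b = 0.4, 0.6                           # assumed bounds, 0 <= a < b <= 1
alpha = rng.uniform(a, b, size=(N, 1))
I_mix = alpha * i_pos + (1 - alpha) * I_hard
```

The interpolation makes the synthesized negatives lie closer to the positive in embedding space, which is what makes them "hard".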