Lucene is an open-source full-text retrieval toolkit (it provides inverted-index capability); in plain terms, it is just a jar package.
https://zhuanlan.zhihu.com/p/71629409?fileGuid=It0Qkg2AiecFMx62
Solr is a top-level Apache open-source project written in Java; it is a full-text search server built on top of Lucene. Solr offers a richer query language than Lucene, is configurable and extensible, and optimizes indexing and search performance.
https://arxiv.org/pdf/2106.09297.pdf
http://xtf615.com/2021/10/07/taobao-ebr/
The mainstream architecture of a search system is the pipeline: matching/retrieval, coarse ranking, fine ranking, and re-ranking.
Neural matching models fall into two categories: representation-based learning and interaction-based learning.
Other than semantic and relevance matching, more complex factors/trade-offs, e.g., user personalization [2, 3, 10] and retrieval efficiency [5], need to be considered when applying deep models to a large-scale online retrieval system.
Representation-based models paired with an ANN (approximate nearest neighbor) algorithm have become the mainstream way to efficiently deploy neural retrieval models in industry.
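As a rough, minimal sketch (not from the paper) of how a representation-based two-tower model is served with an ANN index: assuming item embeddings are precomputed offline by the item tower, retrieval reduces to an inner-product top-k search. Faiss is used here for illustration; `IndexFlatIP` is exact search, while a production system would typically use an approximate index (IVF, HNSW).

```python
import numpy as np
import faiss  # vector search library used here for illustration

d = 128  # embedding dimension (illustrative)
item_emb = np.random.randn(100_000, d).astype("float32")  # placeholder item-tower outputs
query_emb = np.random.randn(1, d).astype("float32")       # placeholder user/query-tower output

# Offline: build an inner-product index over all item embeddings.
index = faiss.IndexFlatIP(d)
index.add(item_emb)

# Online: retrieve the top-k items with the largest inner product to the query embedding.
scores, ids = index.search(query_emb, 10)
print(ids[0], scores[0])
```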
The overall architecture is as follows:
$U=\{u_1,\dots,u_u,\dots,u_N\}$ denotes $N$ users, $Q=\{q_1,\dots,q_u,\dots,q_N\}$ the corresponding $N$ queries, and $I=\{i_1,\dots,i_M\}$ the $M$ items. User $u$'s historical behaviors are split by time into three parts: 1. the real-time sequence, before the current time step, $R^u=\{i_1^u,\dots,i_t^u,\dots,i_T^u\}$; 2. the short-term sequence, before $R$ and within ten days, $S^u=\{i_1^u,\dots,i_t^u,\dots,i_T^u\}$; 3. the long-term sequence, before $S$ and within one month, $L^u=\{i_1^u,\dots,i_t^u,\dots,i_T^u\}$, where $T$ is the sequence length. The task can be defined as computing the score

$$z=\mathcal{F}\big(\phi(q_u, R^u, S^u, L^u),\ \varphi(i)\big)$$
where $\mathcal{F}(\cdot)$, $\phi(\cdot)$, and $\varphi(\cdot)$ denote the scoring function, the query-and-behaviors encoder, and the item encoder, respectively.
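A schematic sketch of this formulation; the function bodies are placeholders of my own, not the paper's encoders, and only show how the two towers and the inner-product scoring fit together.

```python
import numpy as np

d = 64  # embedding dimension (illustrative)

def user_tower(query_emb, realtime_seq, short_seq, long_seq):
    """phi(q_u, R^u, S^u, L^u): stand-in for the query-and-behaviors encoder;
    here just an average of the inputs, not the paper's architecture."""
    return np.mean([query_emb, realtime_seq.mean(0), short_seq.mean(0), long_seq.mean(0)], axis=0)

def item_tower(item_id_emb, title_word_embs):
    """varphi(i): stand-in for the item encoder (ID embedding + pooled title)."""
    return item_id_emb + title_word_embs.mean(0)

def score(h_qu, h_item):
    """F: the scoring function, taken to be the inner product."""
    return float(h_qu @ h_item)

# toy usage
q, R, S, L = np.random.randn(d), np.random.randn(5, d), np.random.randn(8, d), np.random.randn(20, d)
i_id, i_title = np.random.randn(d), np.random.randn(6, d)
print(score(user_tower(q, R, S, L), item_tower(i_id, i_title)))
```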
Mining the semantics of the query: the raw input consists of the current query and the historical queries.
The paper does not explain why it is designed this way; it looks like the outcome of engineering experiments. One open question: would it not work to mine the query semantics directly with a deep language model such as BERT?
The query is represented as $q_u=\{w_1^u,\dots,w_n^u\}$, e.g. {红色, 连衣裙} ("red", "dress"); each word is $w^u=\{c_1^u,\dots,c_m^u\}$, e.g. {红, 色}; the history queries are $q_{his}=\{q_1^u,\dots,q_k^u\}$, e.g. {绿色, 半身裙, 黄色, 长裙} ("green", "skirt", "yellow", "long dress"), where $w_n\in\mathbb{R}^{1\times d}$, $c_m\in\mathbb{R}^{1\times d}$, $q_k\in\mathbb{R}^{1\times d}$.
$$q_{1\_gram}=\mathrm{mean\_pooling}(c_1,\dots,c_m)$$
$$q_{2\_gram}=\mathrm{mean\_pooling}(c_1c_2,\dots,c_{m-1}c_m)$$
$$q_{seg}=\mathrm{mean\_pooling}(w_1,\dots,w_n)$$
$$q_{seg\_seq}=\mathrm{mean\_pooling}\big(\mathrm{Trm}(w_1,\dots,w_n)\big)$$
$$q_{his\_seq}=\mathrm{softmax}\big(q_{seg}\cdot q_{his}^{T}\big)\cdot q_{his}$$
$$q_{mix}=q_{1\_gram}+q_{2\_gram}+q_{seg}+q_{seg\_seq}+q_{his\_seq}$$
$$Q_{mgs}=\mathrm{concat}(q_{1\_gram},q_{2\_gram},q_{seg},q_{seg\_seq},q_{his\_seq},q_{mix})$$

where $\mathrm{Trm}$, $\mathrm{mean\_pooling}$, and $\mathrm{concat}$ denote the Transformer, average, and vertical concatenation operations, respectively.
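A rough numpy sketch of the multi-grained semantic unit above. The shapes and the 2-gram construction are my own reading of the formulas (adjacent character embeddings are simply summed as a stand-in for a learned 2-gram embedding), and the Transformer is replaced by an identity placeholder; this is not the paper's code.

```python
import numpy as np

d, m, n, k = 32, 4, 2, 3       # dim, #characters, #words, #history queries (toy sizes)
c = np.random.randn(m, d)       # character embeddings c_1..c_m
w = np.random.randn(n, d)       # word (segment) embeddings w_1..w_n
q_his = np.random.randn(k, d)   # historical query embeddings

def mean_pooling(x):
    return x.mean(axis=0, keepdims=True)               # (1, d)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

q_1gram   = mean_pooling(c)                            # character unigrams
q_2gram   = mean_pooling(c[:-1] + c[1:])               # adjacent-character 2-grams (sum as stand-in)
q_seg     = mean_pooling(w)                            # word segments
q_seg_seq = mean_pooling(w)                            # placeholder for mean_pooling(Trm(w_1..w_n))
q_his_seq = softmax(q_seg @ q_his.T) @ q_his           # attention of q_seg over historical queries
q_mix     = q_1gram + q_2gram + q_seg + q_seg_seq + q_his_seq
Q_mgs     = np.concatenate([q_1gram, q_2gram, q_seg, q_seg_seq, q_his_seq, q_mix], axis=0)
print(Q_mgs.shape)   # (6, d): vertical concatenation
```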
Items in the user behavior sequences are embedded together with their side information: $W_f$ is the embedding matrix, $x_{f_i}$ the one-hot vector, and $F$ the set of side-information fields (e.g., leaf category, first-level category, brand, and shop).
real-time sequences
User’s click_item
$$R^u_{lstm}=\mathrm{LSTM}(R^u)=\{h_1^u,\dots,h_t^u,\dots,h_T^u\}$$
$$R^u_{self\_att}=\mathrm{multihead\_selfattention}(R^u_{lstm})=\{h_1^u,\dots,h_t^u,\dots,h_T^u\}$$
$$R^u_{zero\_att}=\{0,h_1^u,\dots,h_t^u,\dots,h_T^u\}\quad\text{(a zero vector is added at the first position of } R^u_{self\_att}\text{)}$$
$$H_{real}=\mathrm{softmax}\big(Q_{mgs}\cdot (R^u_{zero\_att})^{T}\big)\cdot R^u_{zero\_att}$$

(A small sketch of this zero-attention pattern is given after the item tower below.)

short-term sequences
User’s click_item
$$S^u_{self\_att}=\mathrm{multihead\_selfattention}(S^u)=\{h_1^u,\dots,h_t^u,\dots,h_T^u\}$$
$$S^u_{zero\_att}=\{0,h_1^u,\dots,h_t^u,\dots,h_T^u\}$$
$$H_{short}=\mathrm{softmax}\big(Q_{mgs}\cdot (S^u_{zero\_att})^{T}\big)\cdot S^u_{zero\_att}$$

long-term sequence
$L^u$ consists of four parts: $L^u_{item}$, $L^u_{shop}$, $L^u_{leaf}$, and $L^u_{brand}$; each part covers three actions: click, buy, and collect.
$$\{L_{click\_item},\ L_{buy\_item},\ L_{collect\_item}\}\rightarrow L_{item}$$
$$H_{a\_item}=\mathrm{softmax}\big(Q_{mgs}\cdot L_{item}^{T}\big)\cdot L_{item}$$
$$H_{long}=H_{a\_item}+H_{a\_shop}+H_{a\_leaf}+H_{a\_brand}$$

For the item tower, we experimentally use the item ID and title to obtain the item representation $H_{item}$. Given the representation of item $i$'s ID, $e_i\in\mathbb{R}^{1\times d}$, and its title segmentation result $T_i=\{w_1^i,\dots,w_N^i\}$:
$$H_{item}=e_i+\tanh\Big(W_t\cdot\frac{\sum_{j=1}^{N} w_j^i}{N}\Big)$$

where $W_t$ is the transformation matrix. We empirically find that applying an LSTM [12] or a Transformer [27] to capture the context of the title is not as effective as simple mean-pooling, since the title is stacked with keywords and lacks grammatical structure.
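The three behavior-sequence aggregations above ($H_{real}$, $H_{short}$, $H_{long}$) all follow the same pattern: attend from $Q_{mgs}$ over the sequence hidden states, with a zero vector prepended so the model can attend to "nothing" when the behaviors are irrelevant to the query. A minimal numpy sketch of that pattern (my own paraphrase, not the paper's code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def zero_attention(Q_mgs, seq_hidden):
    """Attend from each row of Q_mgs over the behavior sequence, with a zero
    vector at position 0 so attention weight can be spent on an empty slot."""
    d = seq_hidden.shape[1]
    seq_zero = np.vstack([np.zeros((1, d)), seq_hidden])   # {0, h_1, ..., h_T}
    attn = softmax(Q_mgs @ seq_zero.T)                      # (6, T+1) attention weights
    return attn @ seq_zero                                  # (6, d) aggregated behaviors

# toy usage: Q_mgs from the query encoder, hidden states from the LSTM/self-attention stack
Q_mgs = np.random.randn(6, 32)
H_real = zero_attention(Q_mgs, np.random.randn(10, 32))
print(H_real.shape)   # (6, 32)
```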
We adopt the softmax cross-entropy loss as the training objective.
$$\hat{y}(i^{+}\mid q_u)=\frac{\exp\big(\mathcal{F}(q_u,i^{+})\big)}{\sum_{i'\in I}\exp\big(\mathcal{F}(q_u,i')\big)}$$
$$L=-\sum_{i\in I} y_i\log(\hat{y}_i)$$

where $\mathcal{F}$, $I$, $i^{+}$, and $q_u$ denote the inner product, the full item pool, the item tower's representation $H_{item}$, and the user tower's representation $H_{q_u}$, respectively.
The softmax function with the temperature parameter $\tau$ is defined as follows:
$$\hat{y}(i^{+}\mid q_u)=\frac{\exp\big(\mathcal{F}(q_u,i^{+})/\tau\big)}{\sum_{i'\in I}\exp\big(\mathcal{F}(q_u,i')/\tau\big)}$$

If $\tau\to 0$, the fitted distribution is close to a one-hot distribution; if $\tau\to\infty$, it is close to a uniform distribution.
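A quick numerical illustration of the temperature effect, using toy inner-product scores (not values from the paper):

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    z = scores / tau
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.5])              # toy inner-product scores F(q_u, i')
print(softmax_with_temperature(scores, 0.05))    # tau -> 0: nearly one-hot on the top item
print(softmax_with_temperature(scores, 1.0))     # moderate temperature
print(softmax_with_temperature(scores, 100.0))   # tau -> inf: nearly uniform
```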
We first select the negative items $i^{-}$ that have the top-$N$ inner-product scores with $q_u$ to form the hard sample set $I_{hard}$.
$$I_{mix}=\alpha\, i^{+}+(1-\alpha)\, I_{hard}$$

where $\alpha\in\mathbb{R}^{N\times 1}$ is sampled from the uniform distribution $U(a,b)$ $(0\le a<b\le 1)$.
$$\hat{y}(i^{+}\mid q_u)=\frac{\exp\big(\mathcal{F}(q_u,i^{+})/\tau\big)}{\sum_{i'\in I\cup I_{mix}}\exp\big(\mathcal{F}(q_u,i')/\tau\big)}$$
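A hedged sketch of the hard-negative mixing step, following the formulas above; this is my own numpy paraphrase, and the values of $N$, $a$, $b$, the toy item pool, and the function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def generate_mixed_negatives(q_u, item_pool, i_pos, N=5, a=0.4, b=0.6):
    """Pick the top-N hardest negatives by inner product with q_u, then
    interpolate them with the positive: I_mix = alpha * i+ + (1 - alpha) * I_hard."""
    scores = item_pool @ q_u                       # F(q_u, i') for every pool item
    I_hard = item_pool[np.argsort(-scores)[:N]]    # top-N hard negatives, (N, d)
    alpha = np.random.uniform(a, b, size=(N, 1))   # alpha in R^{N x 1}, sampled from U(a, b)
    return alpha * i_pos + (1 - alpha) * I_hard

def mixed_softmax_loss(q_u, i_pos, item_pool, I_mix, tau=0.1):
    """-log y_hat(i+ | q_u) with the denominator taken over I plus I_mix."""
    candidates = np.vstack([item_pool, I_mix, i_pos[None, :]])  # i+ belongs to the pool I
    logits = (candidates @ q_u) / tau
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())      # stable log-sum-exp
    return -((i_pos @ q_u) / tau - log_denominator)

d = 32
q_u, i_pos = np.random.randn(d), np.random.randn(d)
item_pool = np.random.randn(1000, d)               # toy stand-in for the full item pool I
I_mix = generate_mixed_negatives(q_u, item_pool, i_pos)
print(mixed_softmax_loss(q_u, i_pos, item_pool, I_mix))
```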