A Deep Look into Neural Ranking Models for Information Retrieval

https://par.nsf.gov/servlets/purl/10277191

3. A Unified Model Formulation

A generalized LTR problem is to find the optimal ranking function f* by minimizing a loss function over some labeled dataset:
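In the paper's notation, this is roughly

$$ f^{*} = \arg\min_{f} \sum_{(s,\, t,\, y) \in D} L(f;\, s,\, t,\, y) $$

where D is the labeled dataset and L is the loss function.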

f is the ranking function, s is the query, t is the candidate set, and y is the label set, where labels represent grades.

Without loss of generality, the ranking function f can be further abstracted by the following unified formulation:
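$$ f(s, t) = g\big(\psi(s),\; \phi(t),\; \eta(s, t)\big) $$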

ψ and ϕ are representation functions which extract features from s and t respectively, η is the interaction function which extracts features from the (s, t) pair, and g is the evaluation function which computes the relevance score based on those feature representations.

4. Model Architecture

4.1. Symmetric vs. Asymmetric Architectures

Symmetric Architecture: The inputs s and t are assumed to be homogeneous, so that a symmetric network structure can be applied over the inputs.

Asymmetric Architecture: The inputs s and t are assumed to be heterogeneous, so that asymmetric network structures should be applied over the inputs.

4.2. Representation-focused vs. Interaction-focused Architectures

Representation-focused Architecture: The underlying assumption of this type of architecture is that relevance depends on the compositional meaning of the input texts. Therefore, models in this category usually define complex representation functions ϕ and ψ (e.g., deep neural networks), but no interaction function η.

Interaction-focused Architecture: The underlying assumption of this type of architecture is that relevance is, in essence, about the relation between the input texts, so it is more effective to learn directly from interactions rather than from individual representations. Models in this category thus define the interaction function η rather than the representation functions ϕ and ψ.

Hybrid Architecture: In order to take advantage of both representation-focused and interaction-focused architectures, a natural approach is to adopt a hybrid architecture for feature learning. There are two major hybrid strategies for integrating the two architectures: the combined strategy and the coupled strategy.

4.3. Single-granularity vs. Multi-granularity Architecture

Single-granularity Architecture: The underlying assumption of the single-granularity architecture is that relevance can be evaluated based on the high-level features extracted by ϕ, ψ, and η from the single-form text inputs.

Multi-granularity Architecture: The underlying assumption of the multi-granularity architecture is that relevance estimation requires multiple granularities of features, either from different levels of feature abstraction or based on different types of language units of the inputs.

5. Model Learning

5.1. Learning objective

Similar to other LTR algorithms, the learning objective of neural ranking models can be broadly categorized into three groups: pointwise, pairwise, and listwise.

5.1.1. Pointwise Ranking Objective

1 Loss

The idea of pointwise ranking objectives is to simplify a ranking problem into a set of classification or regression problems.

a. Cross Entropy

For example, one of the most popular pointwise loss functions used in neural ranking models is Cross Entropy:
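For a binary relevance label y ∈ {0, 1} and a score f(s, t) interpreted as a probability of relevance, the standard form is

$$ L(f;\, s,\, t,\, y) = -\, y \log f(s, t) - (1 - y) \log\big(1 - f(s, t)\big) $$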

b. Mean Squared Error

There are other pointwise loss functions such as Mean Squared Error for numerical labels, but they are more commonly used in recommendation tasks.
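For a single (s, t) pair with a numerical label y, MSE is simply

$$ L(f;\, s,\, t,\, y) = \big(y - f(s, t)\big)^{2} $$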

2 Pros and cons

a. Advantages

First, it is simple and easy to scale. Second, the outputs have real meaning and value in practice. For instance, in sponsored search, a model learned with a cross entropy loss and click-through rates can directly predict the probability of a user clicking on a search ad, which in some application scenarios matters more than producing a good result list.

b. Disadvantages

Less effective: because pointwise loss functions consider no document preference or order information, they do not guarantee to produce the best ranking list when the model loss reaches its global minimum.

5.1.2. Pairwise Ranking Objective

1 Loss

Pairwise ranking objectives focus on optimizing the relative preferences between documents rather than their absolute labels.

a. Hinge loss
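For a query s with a more relevant document t⁺ and a less relevant document t⁻, the pairwise hinge loss is

$$ L(f;\, s,\, t^{+},\, t^{-}) = \max\big(0,\; 1 - f(s, t^{+}) + f(s, t^{-})\big) $$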

b. Cross entropy

RankNet
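RankNet models the probability that t_i should be ranked above t_j with a sigmoid over the score difference, and minimizes the cross entropy against the ground-truth preference probability:

$$ P_{ij} = \sigma\big(f(s, t_i) - f(s, t_j)\big), \qquad L = -\,\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij}) $$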

2 Pros and cons

a. Advantages

Effective in many tasks.

b. Disadvantages

Pairwise methods do not always improve the final ranking metrics, for two reasons: (1) it is impossible to develop a ranking model that can correctly predict document preferences in all cases; and (2) in the computation of most existing ranking metrics, not all document pairs are equally important.

5.1.3. Listwise Ranking Objective

1 Loss

Listwise loss functions compute the ranking loss over each query together with its whole candidate document list.

a. ListMLE

https://blog.csdn.net/qq_36478718/article/details/122598406
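ListMLE minimizes the negative log-likelihood of the ground-truth permutation π over the n candidate documents under the Plackett–Luce model:

$$ L(f;\, s,\, T,\, \pi) = -\log \prod_{i=1}^{n} \frac{\exp\big(f(s, t_{\pi(i)})\big)}{\sum_{k=i}^{n} \exp\big(f(s, t_{\pi(k)})\big)} $$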

b. Attention Rank function

https://arxiv.org/abs/1804.05936

c. Softmax-based listwise

https://arxiv.org/pdf/1811.04415.pdf
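The softmax-based listwise loss is essentially a cross entropy between the relevance labels and a softmax distribution over the scores of all candidate documents, roughly

$$ L(f;\, s,\, T,\, y) = -\sum_{j} y_j \log \frac{\exp\big(f(s, t_j)\big)}{\sum_{k} \exp\big(f(s, t_k)\big)} $$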

2 Pros and cons

a. Advantages

Listwise ranking objectives are generally more effective than pairwise ranking objectives.

b. Disadvantages

Their high computational cost often limits their application; they are best suited to the re-ranking phase over a small set of candidate documents.

5.1.4. Multi-task Learning Objective

The optimization of neural ranking models may include the learning of multiple ranking or non-ranking objectives at the same time.

5.2. Training Strategies

1 Supervised learning

2 Weakly supervised learning

3 Semi-supervised learning

6. Model Comparison

This section compares the performance of common models on different applications.

1 Ad-hoc Retrieval

https://blog.csdn.net/qq_44092699/article/details/106335971

Ad-hoc information retrieval refers to the task of returning information resources related to a user query formulated in natural language.

2 QA

Hive MetaStore

1 Description

Hive MetaStore - A central repository that stores the structure information of all the tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.

2 Hive metadata storage (three Metastore configuration modes)

Embedded, Local, Remote. Embedded mode uses an embedded Derby database and allows only one session at a time; Local mode runs the metastore inside the Hive client process and keeps the metadata in an external RDBMS such as MySQL; Remote mode runs the metastore as a standalone Thrift service that clients connect to (configured via the hive.metastore.uris property).
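As a sketch of remote mode, a Java client can read table metadata through the metastore's Thrift API; the host, database, and table names here are made up for illustration:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class MetastoreDemo {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Remote mode: point the client at the standalone metastore Thrift service
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        // Table metadata (schema, SerDe, HDFS location) comes from the central repository
        System.out.println(client.getTable("default", "clicks").getSd().getLocation());
        client.close();
    }
}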

https://blog.csdn.net/epitomizelu/article/details/117091656

https://zhuanlan.zhihu.com/p/473378621

https://blog.csdn.net/qq_40990732/article/details/80914873

3 Introduction to the Hive metadata database

https://blog.csdn.net/victorzzzz/article/details/81874674

Table API and SQL

https://blog.csdn.net/weixin_45366499/article/details/115449175

0 Principles

1 Dynamic tables

Tables in Flink are dynamic tables.

Static tables: Hive, MySQL, etc.

Dynamic tables: continuously updated as new data arrives.

2 Continuous queries

1 Overview

Apache Flink has two relational APIs for unified stream and batch processing: the Table API and SQL.

The Table API is a query API for Scala and Java that lets you compose relational operators such as selection, filtering, and joins in a very intuitive way.

// $ refers to org.apache.flink.table.api.Expressions.$
Table aliceClickTable = eventTable
    .where($("user").isEqual("alice"))
    .select($("url"), $("user"));

SQL support is standard SQL implemented on top of Apache Calcite:

Table urlCountTable = tableEnv.sqlQuery(
    "SELECT user, COUNT(url) FROM EventTable GROUP BY user"
);

2 Framework

The table environment is distinct from the stream execution environment.
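As a minimal sketch of how the two relate, a table environment is created on top of a stream execution environment (standard Flink API):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

// Stream execution environment: runs DataStream programs
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Table environment: wraps the stream environment for Table API / SQL programs
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);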

3 Converting between streams and tables

stream ↔ table

// tableEnv is the table environment
// Convert the DataStream eventStream into a Table
Table eventTable = tableEnv.fromDataStream(eventStream);

// Convert the Table visitTable into a DataStream and print it
tableEnv.toDataStream(visitTable).print();

4 Connecting to external systems

A connector can be specified with the WITH clause when creating a table.
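A sketch of such a DDL statement; the table name, columns, and file path are made up for illustration, while 'connector', 'path', and 'format' are real filesystem-connector options:

tableEnv.executeSql(
    "CREATE TABLE ClickTable (" +
    "  `user` STRING," +
    "  url STRING" +
    ") WITH (" +
    "  'connector' = 'filesystem'," +  // built-in filesystem connector
    "  'path' = 'input/clicks.csv'," + // hypothetical input path
    "  'format' = 'csv'" +
    ")"
);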

5 Client

./bin/sql-client.sh

6 Time attributes

Event time and processing time.

  1. Defined in the CREATE TABLE DDL (see the sketch below)
  2. Defined when converting a DataStream to a table
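A sketch of method 1: the DDL declares ts as the event-time attribute and generates watermarks delayed by 5 seconds (table name, columns, and path are illustrative):

tableEnv.executeSql(
    "CREATE TABLE EventTable (" +
    "  `user` STRING," +
    "  url STRING," +
    "  ts TIMESTAMP(3)," +
    "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" + // event time + watermark
    ") WITH (" +
    "  'connector' = 'filesystem'," +
    "  'path' = 'input/events.csv'," +
    "  'format' = 'csv'" +
    ")"
);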

7 Windows

Fault Tolerance

In a distributed architecture, when one node fails, the other nodes are largely unaffected: we only need to restart the application and resume processing from the state at some earlier point in time. That sounds simple, but in real-time stream processing we must guarantee not only that the job can restart and keep running after a failure, but also the correctness of the results, the speed of failure recovery, and a low impact on processing performance, which requires a much more delicate architectural design.
Flink has a complete fault tolerance mechanism to guarantee recovery after failures, the most important part of which is the checkpoint. Chapter 9 introduced the basic concept and purpose of checkpoints; here we dig deeper into how checkpoints work and into Flink's fault tolerance mechanism.
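A minimal sketch of turning checkpoints on; the interval is illustrative, and exactly-once is Flink's default guarantee:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Trigger a checkpoint every 1000 ms
env.enableCheckpointing(1000L);
// Explicitly ask for exactly-once checkpointing semantics
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);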

Stateful Programming

0 State management mechanism

1 Classification of operator tasks

1 Stateless

2 Stateful

2 Classification of state

Flink has two kinds of state: managed state (Managed State) and raw state (Raw State). Managed state is used in most cases; raw state is used only when managed state cannot meet some special requirement, which is rare.

Managed state is further divided into operator state (Operator State) and keyed state (Keyed State).

1 Keyed state (see the sketch below)

2 Operator state

3 Broadcast state (Broadcast State)

A special kind of operator state.
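A minimal keyed-state sketch, assuming a hypothetical Event POJO with a public user field: each key (user) gets its own ValueState counter, which Flink manages and checkpoints. It would be used as stream.keyBy(event -> event.user).flatMap(new CountFunction()).

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountFunction extends RichFlatMapFunction<Event, Tuple2<String, Long>> {
    // Per-key state handle; the stored value is scoped to the current key
    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Event event, Collector<Tuple2<String, Long>> out) throws Exception {
        Long count = countState.value(); // null the first time a key is seen
        count = (count == null) ? 1L : count + 1;
        countState.update(count);
        out.collect(Tuple2.of(event.user, count));
    }
}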

3 State persistence

State is persisted so that it can be restored when the job restarts after a failure.

Flink persists state by writing it to a checkpoint or a savepoint in an external storage system; the storage medium is typically a distributed file system.

4 State backends (State Backends)

In Flink, the storage, access, and maintenance of state are handled by a pluggable component called the state backend. The state backend is responsible for two things: managing local state, and writing checkpoints to remote persistent storage.

Flink CEP

0 Overview

Combinations of multiple related events like this are called "complex events." It is hard to handle complex events directly with SQL or the DataStream API, because doing so involves a strict order of events and sometimes time constraints as well. We could bring out the big gun and fall back on the low-level process functions, which can indeed meet such requirements; but for very complicated event combinations we may need to keep a lot of state and timers and write all kinds of if/else branching logic, so the complexity becomes very high and the code easily loses readability. How, then, should this kind of complex event be handled? Flink provides a dedicated library for handling complex events, CEP, which lets us solve these thorny problems much more easily. It plays an important role in enterprises' real-time risk control.

Complex Event Processing (CEP): the Flink library dedicated to handling complex events.

1 How it works

Under the hood, CEP is a state machine.

Complex events can be handled by designing a state machine, but writing one by hand is error-prone; CEP encapsulates the state machine for us, so users only need to write the top-level matching logic.

2 Core steps

To summarize, complex event processing (CEP) consists of three steps:
(1) Define a matching rule (pattern).
(2) Apply the pattern to the event stream to detect the complex events that satisfy the rule.
(3) Process the detected complex events and emit the results.
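A sketch of the three steps with flink-cep, assuming a hypothetical LoginEvent POJO with public userId and status fields and an input stream loginStream; the pattern detects three consecutive login failures within 10 seconds.

import java.util.List;
import java.util.Map;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// (1) Define the matching rule
Pattern<LoginEvent, LoginEvent> pattern = Pattern.<LoginEvent>begin("fail")
        .where(new SimpleCondition<LoginEvent>() {
            @Override
            public boolean filter(LoginEvent event) {
                return "fail".equals(event.status);
            }
        })
        .times(3).consecutive()
        .within(Time.seconds(10));

// (2) Apply the pattern to the keyed event stream
PatternStream<LoginEvent> patternStream =
        CEP.pattern(loginStream.keyBy(event -> event.userId), pattern);

// (3) Process the detected complex events
patternStream.select(new PatternSelectFunction<LoginEvent, String>() {
    @Override
    public String select(Map<String, List<LoginEvent>> match) {
        // "fail" maps to the three events matched by that pattern step
        return match.get("fail").get(0).userId + " failed to log in 3 times in a row";
    }
}).print();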

