HIERARCHICAL TRANSFORMERS FOR LONG DOCUMENT CLASSIFICATION

The original BERT model accepts at most 512 tokens of input. To let BERT handle much longer documents, the authors propose two strategies at the fine-tuning stage to work around this limit: BERT + LSTM and BERT + Transformer.

Core steps:

1. Split the input sequence into segments of a fixed size with overlap (see the splitting sketch after this list).

2. For each of these segments, we obtain a segment-level representation H or P from the BERT model (the [CLS] hidden vector or the posterior class probabilities, respectively).

3. We then stack these segment-level representations into a sequence, which serves as input to a small (100-dimensional) LSTM layer. Alternatively, the LSTM recurrent layer is replaced with a small Transformer model (sketches of both variants follow this list).

4. Finally, we use two fully connected layers with ReLU (30-dimensional) and softmax (the same dimensionality as the number of classes) activations to obtain the final predictions.
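
Step 1 can be implemented as a simple sliding window over the token ids. The sketch below is only illustrative; the segment length of 200 and overlap of 50 are assumed values, not taken from the steps above.

```python
# Minimal sketch of step 1: split a long token sequence into fixed-size,
# overlapping segments. seg_len and overlap are illustrative (assumed) values.

def split_into_segments(token_ids, seg_len=200, overlap=50):
    """Return segments of at most `seg_len` tokens, where consecutive
    segments share `overlap` tokens."""
    stride = seg_len - overlap
    segments = []
    for start in range(0, max(len(token_ids) - overlap, 1), stride):
        segments.append(token_ids[start:start + seg_len])
    return segments

# Example: a 512-token "document" becomes overlapping 200-token segments.
print([len(s) for s in split_into_segments(list(range(512)))])
# -> [200, 200, 200, 62]  (each segment overlaps the previous one by 50 tokens)
```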
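Steps 2-4 (the BERT + LSTM variant) can be sketched roughly as follows, assuming PyTorch and HuggingFace transformers. The 100-dimensional LSTM and 30-dimensional ReLU layer follow the steps above; the checkpoint name, the class name, and the choice of the pooled [CLS] output as the segment representation H are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class RobertClassifier(nn.Module):
    """Hypothetical sketch: BERT per segment, LSTM over segments, two FC layers."""

    def __init__(self, num_classes, lstm_dim=100, fc_dim=30):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size          # 768 for bert-base
        self.lstm = nn.LSTM(hidden, lstm_dim, batch_first=True)
        self.fc1 = nn.Linear(lstm_dim, fc_dim)
        self.fc2 = nn.Linear(fc_dim, num_classes)

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (batch, num_segments, seg_len)
        b, n, l = input_ids.shape
        flat_ids = input_ids.view(b * n, l)
        flat_mask = attention_mask.view(b * n, l)
        # Step 2: one vector per segment; here the pooled [CLS] output is
        # used as the segment representation (an assumed choice).
        seg_repr = self.bert(input_ids=flat_ids,
                             attention_mask=flat_mask).pooler_output
        seg_seq = seg_repr.view(b, n, -1)              # step 3: stack segments
        _, (h_n, _) = self.lstm(seg_seq)               # small LSTM over segments
        doc_repr = h_n[-1]                             # last hidden state
        # Step 4: two fully connected layers (ReLU, then softmax over classes).
        logits = self.fc2(torch.relu(self.fc1(doc_repr)))
        return torch.softmax(logits, dim=-1)
```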
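For the alternative mentioned in step 3, the recurrent layer is swapped for a small Transformer encoder over the stacked segment vectors. The depth, head count, feed-forward size, and mean pooling over segments in this sketch are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SegmentTransformerHead(nn.Module):
    """Hypothetical sketch: small Transformer over segment representations."""

    def __init__(self, seg_dim=768, num_classes=2, fc_dim=30):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=seg_dim, nhead=8,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc1 = nn.Linear(seg_dim, fc_dim)
        self.fc2 = nn.Linear(fc_dim, num_classes)

    def forward(self, seg_seq):
        # seg_seq: (batch, num_segments, seg_dim) stacked BERT segment vectors
        enc = self.encoder(seg_seq)
        doc_repr = enc.mean(dim=1)       # average over segments (assumed pooling)
        logits = self.fc2(torch.relu(self.fc1(doc_repr)))
        return torch.softmax(logits, dim=-1)
```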

