HIERARCHICAL TRANSFORMERS FOR LONG DOCUMENT CLASSIFICATION
The original BERT accepts at most 512 input tokens. To let BERT handle much longer documents, the authors propose two fine-tuning-stage strategies to work around this limit: BERT+LSTM and BERT+Transformer, i.e. a small recurrent or Transformer layer stacked on top of segment-level BERT representations.
Key steps (a code sketch follows the list):
1. Split the input sequence into segments of a fixed size with overlap.
2. For each segment, obtain its representation from the BERT model, either the pooled activations H or the posterior class probabilities P.
3. Stack these segment-level representations into a sequence, which serves as input to a small (100-dimensional) LSTM layer; the paper also explores replacing the LSTM with a small Transformer model.
4. Finally, apply two fully connected layers, one with ReLU activation (30-dimensional) and one with softmax activation (dimensionality equal to the number of classes), to obtain the final predictions.
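A minimal PyTorch sketch of the BERT+LSTM variant described above, assuming the pooled [CLS] output is used as the segment representation H. The segment length (200), overlap (50), and the Hugging Face checkpoint name "bert-base-uncased" are illustrative assumptions, not values taken from these notes.

```python
import torch
import torch.nn as nn
from transformers import BertModel


def split_into_segments(token_ids, seg_len=200, overlap=50):
    """Step 1: split a long token-id list into fixed-size overlapping segments."""
    stride = seg_len - overlap
    return [token_ids[start:start + seg_len]
            for start in range(0, max(len(token_ids) - overlap, 1), stride)]


class HierarchicalBertLstm(nn.Module):
    """Steps 2-4: BERT per segment -> LSTM over segments -> ReLU FC -> softmax."""

    def __init__(self, num_classes, lstm_dim=100, fc_dim=30):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_dim, batch_first=True)
        self.fc1 = nn.Linear(lstm_dim, fc_dim)
        self.fc2 = nn.Linear(fc_dim, num_classes)

    def forward(self, segment_input_ids, segment_attention_mask):
        # segment_input_ids: (num_segments, seg_len) for a single document
        outputs = self.bert(input_ids=segment_input_ids,
                            attention_mask=segment_attention_mask)
        # Step 2: pooled [CLS] representation H per segment: (num_segments, hidden)
        seg_repr = outputs.pooler_output
        # Step 3: stack segment representations into a sequence for the LSTM
        _, (h_n, _) = self.lstm(seg_repr.unsqueeze(0))
        doc_repr = h_n[-1]  # final LSTM state as the document vector
        # Step 4: ReLU FC (30-d) then softmax over the classes
        logits = self.fc2(torch.relu(self.fc1(doc_repr)))
        return torch.softmax(logits, dim=-1)
```

For the BERT+Transformer variant, the `nn.LSTM` above would be swapped for a small `nn.TransformerEncoder` over the same stacked segment representations.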