ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS
There are three main contributions that ALBERT makes over the design choices of BERT:
1 Factorized embedding parameterization
Originally the embedding layer is a single matrix $M_{emb} \in \mathbb{R}^{V\times H}$; ALBERT factorizes it into two matrices $M_{emb1} \in \mathbb{R}^{V\times E}$ and $M_{emb2} \in \mathbb{R}^{E\times H}$, reducing the embedding parameter count from $V\times H$ to $V\times E + E\times H$. This parameter reduction is significant when $H \gg E$.
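A minimal sketch of this factorization, assuming PyTorch; the class name `FactorizedEmbedding` and the sizes (V=30000, E=128, H=768) are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Illustrative factorized embedding: V x E lookup followed by E x H projection."""
    def __init__(self, vocab_size: int, embed_size: int, hidden_size: int):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_size)          # V x E parameters
        self.proj = nn.Linear(embed_size, hidden_size, bias=False)    # E x H parameters

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Look up the small E-dim embedding, then project it up to the H-dim hidden size.
        return self.proj(self.word_emb(token_ids))

# Parameter count comparison: V*H vs. V*E + E*H (example sizes, not the paper's exact configs)
V, E, H = 30000, 128, 768
print("single V x H matrix:", V * H)            # 23,040,000
print("factorized V x E + E x H:", V * E + E * H)  # 3,938,304
```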
2 Cross-layer parameter sharing
The default choice for ALBERT is to share all parameters across layers (both the attention and FFN parameters).
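A minimal sketch of cross-layer parameter sharing, again assuming PyTorch; `SharedLayerEncoder` is a hypothetical name, and `nn.TransformerEncoderLayer` stands in for the actual ALBERT layer. One layer's parameters are reused for every pass instead of stacking N independent copies:

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Illustrative encoder that applies one shared layer num_layers times."""
    def __init__(self, hidden_size: int = 768, num_heads: int = 12,
                 ffn_size: int = 3072, num_layers: int = 12):
        super().__init__()
        # A single layer holds the attention + FFN parameters shared by all "layers".
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same parameters are applied num_layers times.
        for _ in range(self.num_layers):
            x = self.shared_layer(x)
        return x

x = torch.randn(2, 16, 768)          # (batch, seq_len, hidden)
print(SharedLayerEncoder()(x).shape)  # torch.Size([2, 16, 768])
```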
3 Inter-sentence coherence loss
The original NSP objective is replaced with SOP (sentence-order prediction). Positive examples are constructed exactly as in NSP (two consecutive segments from the same document), while negative examples use the same two segments with their order swapped.
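A small sketch of how SOP examples can be constructed; this is a hypothetical helper for illustration, not the authors' data pipeline:

```python
import random
from typing import Tuple

def make_sop_example(seg_a: str, seg_b: str) -> Tuple[str, str, int]:
    """seg_a and seg_b are two consecutive segments from the same document.

    Positive (label 1): segments kept in their original order, as in NSP.
    Negative (label 0): the same two segments with their order swapped.
    """
    if random.random() < 0.5:
        return seg_a, seg_b, 1   # original order -> positive
    return seg_b, seg_a, 0       # swapped order  -> negative

print(make_sop_example("He went to the store.", "He bought milk."))
```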
References
https://zhuanlan.zhihu.com/p/88099919
https://blog.csdn.net/weixin_37947156/article/details/101529943