There are three main contributions that ALBERT makes over the design choices of BERT; two of them are:
1 Factorized embedding parameterization
Originally the embedding layer is a single matrix $M_{emb} \in \mathbb{R}^{V\times H}$; ALBERT factorizes it into two matrices $M_{emb1} \in \mathbb{R}^{V\times E}$ and $M_{emb2} \in \mathbb{R}^{E\times H}$, so the embedding parameter count drops from $VH$ to $VE+EH$. This reduction is significant when $H \gg E$ (see the sketch after this list).
2 Cross-layer parameter sharing
The default decision for ALBERT is to share all parameters across layers (both the attention and FFN parameters).
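A minimal PyTorch sketch of the two ideas. The module names, sizes, and the use of `nn.TransformerEncoderLayer` are illustrative assumptions, not ALBERT's actual implementation:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """A V x H lookup replaced by a V x E lookup plus an E x H projection."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_size)   # V x E parameters
        self.project = nn.Linear(embed_size, hidden_size)    # E x H parameters

    def forward(self, token_ids):
        return self.project(self.lookup(token_ids))

class SharedEncoder(nn.Module):
    """One Transformer layer whose weights are reused at every depth."""
    def __init__(self, hidden_size, num_heads, num_layers):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)   # same attention/FFN parameters at every layer
        return x
```

With illustrative sizes V = 30000, H = 768, E = 128, the embedding table shrinks from 30000 × 768 ≈ 23.0M parameters to 30000 × 128 + 128 × 768 ≈ 3.9M.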
The authors apply the Transformer to time-series forecasting, identify two problems, and propose a fix for each. The first is locality-agnosticism: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context. The second is a memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length L, making it infeasible to model long time series directly. To address these two problems, the authors propose convolutional self-attention and the LogSparse Transformer.
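A rough single-head sketch of both ideas. The class and function names are made up, and the mask below is a simplified log-sparse pattern rather than the paper's exact masking scheme:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def log_sparse_mask(length, device=None):
    # True = masked. Position i attends to itself and to i-1, i-2, i-4, i-8, ...
    # so each row keeps O(log L) cells instead of O(L).
    mask = torch.ones(length, length, dtype=torch.bool, device=device)
    for i in range(length):
        mask[i, i] = False
        step = 1
        while i - step >= 0:
            mask[i, i - step] = False
            step *= 2
    return mask

class ConvSelfAttention(nn.Module):
    """Queries/keys come from a causal 1-D convolution (kernel_size > 1), so each
    attention score sees a local window of the series rather than a single point."""
    def __init__(self, d_model, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                      # left-pad to keep causality
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (batch, length, d_model)
        xc = F.pad(x.transpose(1, 2), (self.pad, 0))    # pad the time axis on the left
        q = self.q_conv(xc).transpose(1, 2)             # (batch, length, d_model)
        k = self.k_conv(xc).transpose(1, 2)
        v = self.v_proj(x)
        scores = q @ k.transpose(1, 2) / math.sqrt(x.size(-1))
        scores = scores.masked_fill(log_sparse_mask(x.size(1), x.device), float("-inf"))
        return torch.softmax(scores, dim=-1) @ v
```

For example, `ConvSelfAttention(32)(torch.randn(2, 64, 32))` returns a tensor of the same shape as its input.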
1 Is Word Segmentation Necessary for Deep Learning of Chinese Representations?
We find that char-based (character-level) models consistently outperform word-based (word-level) models.
We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting.
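A toy illustration of the OOV argument. The vocabularies and the sentence are made up, and jieba's segmentation of the sentence may differ from what the comments show:

```python
import jieba

word_vocab = {"自然语言", "处理", "深度", "学习"}   # toy word-level vocabulary
char_vocab = set("".join(word_vocab))              # characters occurring in it

sentence = "自然语言理解"                           # contains a word unseen at training time
word_tokens = [w if w in word_vocab else "<UNK>" for w in jieba.cut(sentence)]
char_tokens = [c if c in char_vocab else "<UNK>" for c in sentence]

print(word_tokens)  # e.g. ['自然语言', '<UNK>']: the unseen word collapses to a single UNK
print(char_tokens)  # ['自', '然', '语', '言', '理', '<UNK>']: most characters are still in-vocabulary
```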
```python
import jieba
import numpy as np


class TfIdf:
    # The class name and __init__ are assumptions added so the methods run:
    # they supply the documents_list / documents_number attributes used below.
    def __init__(self, documents):
        # Store documents pre-tokenized, since get_idf iterates over words directly.
        self.documents_list = [list(jieba.cut(doc)) for doc in documents]
        self.documents_number = len(self.documents_list)
        self.idf = {}

    def get_idf(self):
        """Compute the inverse document frequency of every word in the corpus."""
        df = {}
        self.idf = {}
        tf = []
        for document in self.documents_list:
            temp = {}
            for word in document:
                # Term frequency of `word` within this document.
                temp[word] = temp.get(word, 0) + 1 / len(document)
            tf.append(temp)
            for key in temp.keys():
                # Document frequency: number of documents containing `key`.
                df[key] = df.get(key, 0) + 1
        for key, value in df.items():
            self.idf[key] = np.log10(self.documents_number / (value + 1))

    def get_tf(self, document):
        """Term frequencies of a raw (unsegmented) piece of text."""
        document = list(jieba.cut(document))
        temp = {}
        for word in document:
            temp[word] = temp.get(word, 0) + 1 / len(document)
        return temp

    def tf_idf_vec(self, text):
        """TF-IDF vector of `text`, indexed by the corpus vocabulary (self.idf keys)."""
        tf = self.get_tf(text)
        word = list(self.idf.keys())
        vec = [0] * len(self.idf)
        text = list(jieba.cut(text))
        for ele in text:
            if ele in word:
                vec[word.index(ele)] = tf[ele] * self.idf[ele]
        return vec
```
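A quick usage sketch, assuming the `TfIdf` wrapper above and a few made-up documents:

```python
docs = ["今天天气很好", "今天股市大涨", "天气预报说明天有雨"]
model = TfIdf(docs)
model.get_idf()                            # build the IDF table over the corpus
print(model.tf_idf_vec("今天天气怎么样"))  # TF-IDF vector aligned with model.idf keys
```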