Large Language Models: RoBERTa — A Robustly Optimized BERT Approach





The appearance of the BERT model led to significant progress in NLP. Building its architecture on the Transformer, BERT achieves state-of-the-art results on a variety of tasks: language modeling, next sentence prediction, question answering, NER tagging, etc.

Despite the excellent performance of BERT, researchers continued experimenting with its configuration in hopes of achieving even better metrics. Fortunately, they succeeded and presented a new model called RoBERTa (Robustly Optimized BERT Approach).

Throughout this article, we will refer to the official RoBERTa paper, which contains in-depth information about the model. In simple terms, RoBERTa consists of several independent improvements over the original BERT model; all other principles, including the architecture, stay the same. Each of these improvements is covered and explained below.

From BERT's architecture we remember that during pretraining BERT performs masked language modeling by trying to predict a certain percentage of masked tokens. The problem with the original implementation is that the masks are generated once during data preprocessing, so the tokens chosen for masking in a given text sequence are often the same across different training batches and epochs.
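To make the issue concrete, here is a minimal Python sketch of this static masking scheme. The function name and the 15% masking probability follow the BERT setup, but the snippet is an illustration under simplifying assumptions rather than the actual BERT preprocessing code (the real procedure also replaces some selected tokens with random tokens or leaves them unchanged instead of always inserting [MASK]).

```python
import random

def make_static_mask(tokens, mask_prob=0.15, seed=0):
    # Choose the positions to mask once, mimicking BERT's static masking.
    # Simplified: real BERT preprocessing also swaps 10% of the chosen
    # tokens for random tokens and leaves 10% unchanged.
    rng = random.Random(seed)
    n_masked = max(1, int(len(tokens) * mask_prob))
    return set(rng.sample(range(len(tokens)), n_masked))

tokens = "the quick brown fox jumps over the lazy dog".split()

# The mask is computed once during preprocessing ...
masked_positions = make_static_mask(tokens)

# ... so the model sees the same tokens hidden every time this sequence appears.
for epoch in range(3):
    masked = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
    print(f"epoch {epoch}: {' '.join(masked)}")
```

Running this prints an identical masked sequence in every epoch, which is exactly the repetition the original BERT setup suffers from.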
