The authors of the BERT (Bidirectional Encoder Representations from Transformers) model proposed an innovative architecture designed to pretrain deep bidirectional representations from unlabeled text. This was achieved by jointly conditioning on both left and right context in all layers of the model, which is a departure from previous models that typically looked at text in a unidirectional manner, either left-to-right or right-to-left. By using a masked language modeling (MLM) objective during pretraining, BERT was able to predict missing words in a sequence based on their bidirectional context, thereby capturing a more nuanced understanding of language.
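To make this fill-in-the-blank behaviour concrete, the sketch below queries a pretrained BERT model through the Hugging Face `transformers` fill-mask pipeline. The library, the `bert-base-uncased` checkpoint, and the example sentence are illustrative choices and are not part of the original paper.

```python
# A minimal sketch of masked-token prediction, assuming the Hugging Face
# `transformers` library is installed (illustrative; not the authors' code).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context on both sides of the [MASK] token when ranking candidates.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```

Because the model conditions on the words both before and after the mask, plausible completions such as "paris" are ranked far above words that only fit the left-hand context.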
Additionally, BERT employs a next sentence prediction (NSP) task during pretraining, which involves predicting whether a given sentence B is the actual next sentence that follows a given sentence A. This helps BERT understand the relationship between sentence pairs, which is beneficial for tasks like question answering and natural language inference.
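The sketch below shows one plausible way to assemble NSP training pairs from a list of consecutive sentences. The 50/50 split between genuine and random follow-up sentences and the IsNext/NotNext labels follow the paper's description, while the helper `make_nsp_pairs` itself is an illustrative assumption.

```python
# A minimal sketch of building next sentence prediction (NSP) pairs.
import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            # Positive example: sentence B really follows sentence A.
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Negative example: sentence B is a random sentence from the corpus
            # (a fuller implementation would avoid picking the true next sentence).
            pairs.append((sentences[i], random.choice(sentences), "NotNext"))
    return pairs

corpus = ["The cat sat on the mat.", "It purred quietly.", "Rain fell outside."]
print(make_nsp_pairs(corpus))
```

During pretraining, the model classifies each pair from the representation of the special [CLS] token, encouraging it to encode inter-sentence relationships.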
As a result of these pretraining techniques, BERT achieved significant improvements over previous state-of-the-art models across a range of natural language processing (NLP) benchmarks. Notably, the authors reported an absolute improvement of around 7% in the GLUE benchmark score, underscoring BERT's stronger performance on language understanding tasks such as natural language inference and question answering compared to earlier models. This leap in performance demonstrated the effectiveness of bidirectional pretraining and set a new standard for pretrained language models in the NLP community.