Text Summarization using BERT and T5

Mubasheer Siddiqui
Jan 7, 2022

Introduction and need for text summarization

We often find ourselves needing a summary of some material rather than the full report. The usual approach is to read the whole text, mark the important direct or indirect details, and then rewrite them. This is definitely time-consuming, and as the number of documents grows, the value of automatic text summarization becomes clear.

Text summarization is the task of producing a concise, accurate, and fluent summary of a lengthy text document.

Automatic text summarization methods are desperately needed to deal with the ever-increasing amount of text data available online, in order to improve both the discovery and consumption of relevant information.

Automatic Text Summarization can be used to summarize research papers, long reports, full books, online pages, and news, among other things. We have seen new highs in this discipline as a result of recent breakthroughs in Machine Learning, particularly Deep Learning.

Deep learning has proven particularly promising for this task. Loosely inspired by the way the human brain works, deep networks handle multiple levels of abstraction and map a given input to a given output through nonlinear transformations, with the output of one layer becoming the input of the next. The more layers a network has, the deeper it is. Deep neural networks are commonly used for NLP problems because their architecture fits well with the complicated structure of language; for example, each layer can handle a particular task before passing its output to the next.

What are the types of Text Summarization?

In Natural Language Processing (NLP), there are two main ways to summarize text:
1. Extractive Summarization
2. Abstractive Summarization

Fig. Types of Summarization

1. Extractive Text Summarization:

Extractive text summarization picks out the essential words, phrases, or sentences from a source document and combines them into a summary. The extraction follows a predefined scoring measure and makes no changes to the extracted text.

Common methods and tools for extractive summarization include:

· Gensim with TextRank.
· Sumy.
· LexRank.
· Latent Semantic Analysis.
and many more.
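
To make the idea of "extraction according to a predefined measure" concrete, here is a minimal sketch in plain Python. It is not TextRank, LexRank, or LSA; it swaps in a much simpler word-frequency score, and the stop-word list, the sentence splitter, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real libraries ship larger ones).
STOP_WORDS = {"a", "an", "and", "the", "to", "in", "on", "of", "is", "was", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words with the stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, unchanged and in document order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        # Predefined measure: average document frequency of the sentence's words.
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)


text = "Cats sleep a lot. Cats and dogs are pets. The weather is nice."
print(extractive_summary(text, num_sentences=1))
```

Note that the selected sentences are copied verbatim from the input, which is exactly what distinguishes extraction from abstraction.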

2. Abstractive Text Summarization:

In the abstractive approach, parts of the source document are interpreted and condensed rather than copied. When deep learning is used for summarization, abstraction can avoid the grammatical inconsistencies that stitched-together extracts often suffer from.

Abstractive text summarization algorithms, like humans, produce new phrases and sentences that convey the most relevant information from the original text.

For these reasons, abstractive summaries often read better than extractive ones. On the other hand, the algorithms required for abstraction are more challenging to build, which is why extraction is still widely used.

Methods for abstractive summarization include:

· Sequence to Sequence Recurrent Neural Networks (RNN).
· Encoder and Decoder Networks.
· Attention Mechanism.
· Text To Text Transfer Transformer (T5).

Example:

Text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to a child named Jesus.

Extractive Summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.

Abstractive Summary: Joseph and Mary came to Jerusalem where Jesus was born.


It’s not an exaggeration to say that BERT has considerably altered the Natural Language Processing scene. Imagine using a single model, trained on a huge unlabeled dataset, to obtain best-in-class results on eleven different NLP tasks, and all of this with very little fine-tuning. That’s BERT! It’s a tectonic shift in how we design models.

Many recent NLP architectures, training approaches, and language models, such as OpenAI’s GPT-2, Google’s TransformerXL, RoBERTa, ERNIE 2.0, and XLNet, have been inspired by BERT.

What is BERT?

You’ve probably heard about BERT and read about how amazing it is and how it may change the NLP landscape. But, first and foremost, what is BERT?

The BERT research team describes the model roughly as follows:

“BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.”

To begin, BERT stands for Bidirectional Encoder Representations from Transformers. Each word in that name has a meaning, which we will cover one by one throughout this article. For now, the most important takeaway from this section is that BERT is built on the Transformer architecture.

Second, BERT is pre-trained on a vast corpus of unlabeled text: the whole of English Wikipedia (about 2,500 million words) and the BooksCorpus (about 800 million words).
This pre-training step accounts for much of BERT’s success. When a model is trained on a large text corpus, it picks up a deeper understanding of how language works. This knowledge can then be used as a Swiss Army knife in almost any NLP project.

Third, BERT is a “deeply bidirectional” model. During the training phase, BERT learns information from both the left and right sides of the context of a token.
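
The difference between a left-to-right model and a bidirectional one can be illustrated with a toy example. The sketch below involves no neural network at all; it only shows which words are available when predicting a hidden token, which is the setting BERT's masked-word pre-training works in. The function name and the sentence are assumptions chosen for illustration.

```python
def visible_context(tokens, position, bidirectional=True):
    """Return the (left, right) context visible when predicting tokens[position]."""
    left = tokens[:position]
    # A left-to-right model cannot look at anything after the hidden token.
    right = tokens[position + 1:] if bidirectional else []
    return left, right


tokens = "we went to the river bank to fish".split()
position = tokens.index("bank")

print(visible_context(tokens, position, bidirectional=False))
print(visible_context(tokens, position, bidirectional=True))
```

With only the left context, "river" is the sole hint about which sense of "bank" is meant; the bidirectional view also sees "to fish", which is the kind of extra evidence a deeply bidirectional model can exploit.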

Extractive summarization using BERT

In this section we will look at Extractive Text Summarization using BERT. As we know, in Extractive Summarization we select sentences from the text to form the summary. It can therefore be treated as a classification problem, where we decide for each sentence whether or not it belongs in the summary.

The challenge here is that the model has to interpret the entire text and choose the right sentences without losing important information. To avoid compromising on speed or accuracy, we use BERTSUM. Because it builds on BERT’s learned representations, it can parse meaning from the raw text, so manual preprocessing steps such as stop word removal and lemmatization are generally not needed.


The BERTSUM model consists of 2 parts:

1. BERT encoder.
2. Summarization Classifier.

The Encoder provides us with a vector representation of each sentence, which the Summarization Classifier then uses to assign a label to each sentence indicating whether or not it will be incorporated into the final summary.
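
As a rough sketch of the second part, the simplest kind of summarization classifier is a linear layer followed by a sigmoid, applied to each sentence vector. The code below assumes that form; the two-dimensional sentence vectors and the weights are made-up numbers for illustration, whereas in practice the vectors come from the BERT encoder and the weights are learned during fine-tuning.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_sentences(sentence_vectors, weights, bias=0.0):
    """Linear layer + sigmoid: one score in (0, 1) per sentence vector."""
    return [
        sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)
        for vec in sentence_vectors
    ]


def select_sentences(scores, top_k=2):
    """Indices of the top_k highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical 2-dimensional sentence vectors and classifier weights.
sentence_vectors = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
weights = [2.0, -1.0]

scores = score_sentences(sentence_vectors, weights)
print(select_sentences(scores, top_k=2))
```

Keeping the selected indices in document order means the extracted sentences appear in the summary in the same order as in the source.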

The input to BERTSUM differs slightly from that of the BERT model. A [CLS] token is added before each sentence (and a [SEP] token after it), which separates the sentences and lets each [CLS] collect the features of the sentence that follows it. Each sentence is also assigned a segment embedding, alternating between Ea and Eb according to whether the sentence’s position in the document is odd or even. The model then gives each sentence a score reflecting its importance, and these scores decide whether or not the sentence is included in the summary.
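
The BERTSUM input layout can be sketched as follows. This is an illustration under simplifying assumptions: it uses whitespace splitting where the real model uses BERT's WordPiece tokenizer, the function name is hypothetical, and segment id 0 stands for Ea and 1 for Eb, alternating by sentence position in the document.

```python
def build_bertsum_input(sentences):
    """Lay out sentences as [CLS] ... [SEP] blocks with alternating segment ids."""
    tokens, segment_ids, cls_positions = [], [], []
    for i, sentence in enumerate(sentences):
        # Remember where this sentence's [CLS] sits; its output vector
        # is what represents the sentence.
        cls_positions.append(len(tokens))
        sent_tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
        tokens.extend(sent_tokens)
        # 0 -> Ea, 1 -> Eb; the first sentence gets Ea, the second Eb, and so on.
        segment_ids.extend([i % 2] * len(sent_tokens))
    return tokens, segment_ids, cls_positions


sentences = ["Mary rode a donkey", "She reached Jerusalem"]
tokens, segment_ids, cls_positions = build_bertsum_input(sentences)
print(tokens)
print(segment_ids)
print(cls_positions)
```

The positions in `cls_positions` are where one would read off the per-sentence vectors that are passed on to the summarization classifier.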


T5, built by Google, is one of the most powerful tools for text summarization. T5, or Text-to-Text Transfer Transformer, is a Transformer model that can be fine-tuned for any task that can be framed as text in, text out.

The T5 work makes the following contributions:
1. It builds the Colossal Clean Crawled Corpus (C4), a cleaned version of the enormous Common Crawl data collection. This data set is about two orders of magnitude larger than Wikipedia.
2. It proposes that all NLP tasks be reframed as an input-text to output-text formulation.
3. It shows that fine-tuning the pre-trained T5 on various tasks (summarization, question answering, reading comprehension) using the text-to-text formulation produces state-of-the-art results.
4. The T5 team also conducted a thorough investigation into the best procedures for pre-training and fine-tuning.
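
The text-to-text formulation in point 2 is easy to picture in code: every task becomes a plain string with a task prefix, and the answer is a plain string too. The prefixes below are the ones shown in the T5 paper's examples; the dictionary, its keys, and the helper function are assumptions made for illustration.

```python
# Task prefixes as they appear in the T5 paper's examples.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}


def to_text_to_text(task, text):
    """Turn a (task, input) pair into the single string fed to the model."""
    return TASK_PREFIXES[task] + text


print(to_text_to_text("summarization", "Joseph and Mary rode on a donkey to Jerusalem."))
print(to_text_to_text("translation_en_de", "That is good."))
```

Because inputs and outputs are both just text, the same model, loss, and decoding procedure can be reused across all of these tasks.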

Abstractive summarization using T5

For the reasons listed above, T5 is well suited to abstractive summarization. Abstractive summarization is a Natural Language Processing (NLP) task that aims to produce a short summary of a source text. As mentioned earlier, unlike extractive summarization it does not merely reproduce essential phrases from the original text; it can also generate new, relevant phrases, which amounts to paraphrasing.

T5 can be used for abstractive summarization in a few simple steps:

1. Install the transformers library
2. Import the library
3. Configure the GPUs and model to be used.
4. Specify the text to be summarized.
5. Use a summarizer to summarize the content.
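
The steps above can be sketched with the Hugging Face transformers library as follows. This is one possible way to do it, not necessarily the exact code the authors used: the `t5-small` checkpoint, the beam size, the length limits, and the function names are assumptions chosen for illustration. The heavy imports sit inside the function so that the small input-building helper can be used on its own.

```python
# Step 1 (in a shell): pip install transformers torch sentencepiece


def build_t5_input(text):
    """Collapse whitespace and add the 'summarize: ' task prefix T5 expects."""
    return "summarize: " + " ".join(text.split())


def summarize(text, model_name="t5-small", max_length=60):
    # Step 2: import the libraries.
    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    # Step 3: configure the device (GPU if available) and load the model.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

    # Step 4: encode the text to be summarized (truncated to 512 tokens).
    inputs = tokenizer(
        build_t5_input(text), return_tensors="pt", max_length=512, truncation=True
    ).to(device)

    # Step 5: generate the summary with beam search and decode it back to text.
    with torch.no_grad():
        output_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            num_beams=4,
            max_length=max_length,
            early_stopping=True,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model on first use):
# print(summarize("Joseph and Mary rode on a donkey to attend the annual event "
#                 "in Jerusalem. In the city, Mary gave birth to a child named Jesus."))
```

Larger checkpoints generally give better summaries at the cost of more memory and slower generation, so `t5-small` is mainly a convenient starting point.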

Md Mubasheeruddin Siddiqui (C-12)
Balaji Padamwar (C-30)
Shriram Pareek (C-37)
Kishan Partani (C-39)
Aniket Ghorpade (B-07)