Summarizing Text with Transformer Models

"Text summarization has come a long way from basic techniques like picking out keyword sentences. Early methods often struggled with capturing the nuance and flow of text. Today, fine-tuned transformer models offer a significant leap forward. These powerful models, trained on massive datasets and customized for summarization tasks, can generate summaries that are not only factually accurate but also coherent and readable. This allows users to quickly grasp the main points of lengthy texts, making information retrieval and comprehension more efficient." - Gemini, 2024

Text Summarization

Text summarization is a powerful natural language processing technique with wide-ranging applications across various industries and everyday life. It enables the quick extraction of key information from large volumes of text, saving time and improving efficiency.

Here are just a few ways we see summaries in use today:

  • News Industry: Condense lengthy articles into brief, informative snippets
  • Businesses: Distill important details from customer feedback, market reports, and internal documents, facilitating faster decision-making
  • Legal Contexts: Help lawyers quickly grasp the essence of long case documents
  • Academic Research: Aid in quick review and synthesis of information from numerous papers, accelerating the literature review process

Extractive Summarization with POS Tagging

Part-of-speech (POS) tagging, which labels each word with its grammatical role (noun, verb, adjective, and so on), is a cornerstone of many NLP applications. It gives machines a way to grasp sentence structure. This helps disambiguate word meaning (such as identifying whether "bat" is a noun or a verb), improves feature extraction for tasks like sentiment analysis, and supports more complex NLP systems that parse sentences or identify entities.

By providing a basic understanding of language structure, POS tagging empowers machines to process and analyze text more effectively.
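
To make the idea concrete, here is a minimal sketch of filtering tagged words down to nouns and verbs. The (word, tag) pairs are written by hand in the shape that a tagger such as nltk.pos_tag returns, using Penn Treebank tags; they are illustrative rather than real tagger output, and the helper name keep_nouns_and_verbs is made up for this example.

```python
# Hand-written (word, tag) pairs using Penn Treebank tags.
# Assumption: these mimic the shape of tagger output; they are not real output.
tagged = [
    ("researchers", "NNS"),
    ("recently", "RB"),
    ("introduced", "VBD"),
    ("a", "DT"),
    ("new", "JJ"),
    ("architecture", "NN"),
]

def keep_nouns_and_verbs(tagged_words):
    """Return the words whose tag marks a noun (NN*) or a verb (VB*)."""
    return [word for word, tag in tagged_words if tag.startswith(("NN", "VB"))]

key_words = keep_nouns_and_verbs(tagged)
print(key_words)
```

Matching on the tag prefix catches every noun and verb variant (NN, NNS, NNP, VB, VBD, and so on) without listing each tag.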

Text summarization can be achieved with simple techniques, like extractive summarization, which has long served as a reliable method to condense lengthy texts. Such methods often involve selecting key sentences based on factors like word frequency or position.
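
As a point of reference for the position-based approach, the sketch below simply keeps the first few sentences of a text. It is a baseline under simplifying assumptions: the function name lead_summary is hypothetical, and the regex-based sentence split is naive (abbreviations such as "e.g." will confuse it).

```python
import re

def lead_summary(text, num_sentences=3):
    """Return the first num_sentences sentences of text."""
    # Naive split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:num_sentences])

example = lead_summary("First point. Second point! Third point? Fourth point.", 2)
print(example)
```

For news articles, which tend to put the key facts up front, a baseline like this can be a useful yardstick for anything more elaborate.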

In this example, we create a simple text-summarization script that uses POS tagging to identify and extract important sentences from a given text.

Overview of steps in script

  1. Implement POS tagging
  2. Score sentences based on important words (nouns and verbs)
  3. Extract top-scoring sentences as a summary

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

# Download required NLTK data
# (recent NLTK releases may also need 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

def pos_summarizer(text, num_sentences=3):
    """Return the num_sentences highest-scoring sentences of text, in original order."""
    # Tokenize words and remove stopwords
    words = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Get POS tags
    pos_tags = nltk.pos_tag(words)

    # Extract nouns and verbs and calculate frequency
    key_words = [word for word, pos in pos_tags if pos.startswith('NN') or pos.startswith('VB')]
    word_freq = FreqDist(key_words)

    # Score sentences based on frequency of key words
    sentences = sent_tokenize(text)
    scores = {}
    for i, sentence in enumerate(sentences):
        for word in word_tokenize(sentence.lower()):
            if word in word_freq:
                scores[i] = scores.get(i, 0) + word_freq[word]

    # Get the [num_sentences] top scoring sentences and order by appearance in text
    top_sentences = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    summary = ' '.join([sentences[i] for i in sorted(top_sentences)])

    return summary

# Example usage

from io import BytesIO
from PyPDF2 import PdfReader

# Extract the text from a locally saved PDF copy of this article:
# https://techxplore.com/news/2024-07-brain-artificial-dendritic-neural-circuit.html

text = ''
with open('2024-07-brain-artificial-dendritic-neural-circuit.pdf', 'rb') as f:
    pdf_content = BytesIO(f.read())
    reader = PdfReader(pdf_content)
    for page in reader.pages:
        text += page.extract_text()

# Preprocess text
text = text.strip().replace('\n', ' ')

# Get summary
summary = pos_summarizer(text)
print(summary)

This outputs the following summary, which is not too bad!

Researchers at Tsinghua University recently introduced a new neuromorphic computational architecture designed to replicate the organization of synapses (i.e., connections between neurons) and the tree- like structure of dendrites (i.e., projections extending from the body of neurons). "When I was a master student in AI and brain bioengineering at the Polytechnic of Milano in Italy, I conceived the idea to emulate brain connectivity sparsity and morphology, such as that of the neuron's dendrites, to design efficient AI," Carlo Vittorio Cannistraci, one of the corresponding authors, told Tech Xplore. As part of this recent study, he teamed up with other researchers at Tsinghua University to replicate the morphology of dendrites and the underpinnings of synapses using a neuromorphic 2/8 computing model.
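
The stray "2/8" in the summary above looks like a page marker carried over from the PDF extraction. One way to clean such markers before summarizing is sketched below. It assumes the markers always take the form page/total surrounded by whitespace; the name strip_page_markers is hypothetical, and the pattern would also remove legitimate standalone fractions such as "1/2", so it should be adapted to the document at hand.

```python
import re

# Matches standalone tokens such as "1/8" or "2/8".
# Assumption: the PDF footer uses the form <page>/<total>.
PAGE_MARKER = re.compile(r"(?<!\S)\d+/\d+(?!\S)")

def strip_page_markers(text):
    """Remove page markers and collapse the whitespace left behind."""
    without_markers = PAGE_MARKER.sub(" ", text)
    return re.sub(r"\s+", " ", without_markers).strip()

cleaned = strip_page_markers("using a neuromorphic 2/8 computing model.")
print(cleaned)
```

Applied to the extracted text right after the preprocessing step, this would keep page numbers out of both the word frequencies and the final summary.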

Text Summarization with LLMs

While simpler methods are efficient, they can lack nuance. LLMs have the potential to generate more engaging and informative summaries, but they raise concerns about computational cost and potential bias. This section compares the output of a general-purpose model with that of a related model fine-tuned for summarization.

This script compares the output of a text-to-text transformer model (google-t5/t5-small) with a related model fine-tuned for the task (koppolusameer/t5-finetuned-summarization-samsum).

from transformers import T5Tokenizer, T5ForConditionalGeneration

def summarize_text(text, model_name, max_length=150):
    """Summarize text with the given T5 model and return the decoded summary."""
    # Load pre-trained model and tokenizer
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # Build prompt
    prompt = "summarize: " + text

    # Tokenize the text (anything beyond the first 512 tokens is truncated)
    tokens = tokenizer.encode(prompt, return_tensors='pt', max_length=512, truncation=True)

    # Generate and decode summaries
    summaries = model.generate(
        tokens,
        num_beams=4,
        no_repeat_ngram_size=2,
        min_length=30,
        max_length=max_length,
        early_stopping=True
    )
    summary = tokenizer.decode(summaries[0], skip_special_tokens=True)

    return summary

# Example usage

from io import BytesIO
from PyPDF2 import PdfReader

# Extract the text from a locally saved PDF copy of this article:
# https://techxplore.com/news/2024-07-brain-artificial-dendritic-neural-circuit.html

text = ''
with open('2024-07-brain-artificial-dendritic-neural-circuit.pdf', 'rb') as f:
    pdf_content = BytesIO(f.read())
    reader = PdfReader(pdf_content)
    for page in reader.pages:
        text += page.extract_text()

# Preprocess text
text = text.strip().replace('\n', ' ')

# Get summary for each model to compare
base_model = 'google-t5/t5-small'
fine_tuned = 'koppolusameer/t5-finetuned-summarization-samsum'

summary = summarize_text(text, base_model)
print("\nBase model\n\n", summary)

summary = summarize_text(text, fine_tuned)
print("\nFine tuned model\n\n", summary)

In the output from this script, the summaries get progressively better from the base model to the fine-tuned one:

Base model

engineers have been working on new architectures and 1/8 hardware components that replicate the organization and functions of the human brain. most brain-inspired technologies draw inspiration from the firing of brain cells (i.e., neurons), rather than mirroring the overall structure of neural elements and how they contribute to information processing.

Fine tuned model

A new brain-inspired artificial dendritic neural circuit was introduced by Ingrid Fadelli on July 5 2024. The new architectures and 1/8 hardware components replicate the organization and functions of the human brain.
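
Note that summarize_text truncates its input to 512 tokens (max_length=512, truncation=True), so both models only see the opening of the article. One possible workaround, sketched here under assumptions, is to split the text into chunks and summarize each one. The names chunk_words and summarize_long_text are hypothetical, the 300-word limit is a rough guess at what fits within 512 tokens, and a stand-in summarizer is used so the sketch runs without downloading a model.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long_text(text, summarize, max_words=300):
    """Summarize each chunk with `summarize` and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))

# Stand-in summarizer for demonstration: keeps the first three words of a chunk.
# A real run could pass something like
#   lambda chunk: summarize_text(chunk, 'koppolusameer/t5-finetuned-summarization-samsum')
def first_three_words(chunk):
    return " ".join(chunk.split()[:3])

demo_text = " ".join(f"w{i}" for i in range(10))
demo_summary = summarize_long_text(demo_text, first_three_words, max_words=4)
print(demo_summary)
```

Splitting on word counts ignores sentence boundaries, so a more careful version would chunk on sentences; the trade-off is more model calls and therefore more run time.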

The Verdict

Extractive    General LLM    Fine-tuned LLM
Okay          Good           Better

Comparing LLMs with simpler techniques lets us assess the trade-off between efficiency and quality in text summarization. Considering the resource cost, execution time, and potential for bias, is the improvement in summary quality worth it?
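
Execution time, at least, is easy to measure. The sketch below times any summarizer function; time_call is a hypothetical helper, and a trivial stand-in summarizer is used so the example runs on its own. When timing summarize_text as written above, keep in mind that it reloads the model on every call, so the measurement would include load time.

```python
import time

def time_call(func, *args, repeats=3):
    """Return (last_result, best_seconds) over `repeats` runs of func(*args)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in summarizer: returns everything before the first ". ".
# Swap in pos_summarizer or summarize_text to compare real methods.
def first_sentence(text):
    return text.split(". ")[0]

result, seconds = time_call(first_sentence, "One. Two. Three.")
print(f"{seconds:.6f}s -> {result}")
```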

Ultimately, the decision comes down to necessity and resources, both of which depend heavily on the particular use case and environment.