Part-of-Speech Tagging

"Part-of-speech tagging, often abbreviated as POS tagging, acts like a linguistic passport for words in a sentence. It assigns grammatical labels like noun, verb, adjective, or adverb to each word. This helps computers understand the structure and meaning of a sentence. By knowing a word's part of speech, a machine translation model can translate it more accurately, or a sentiment analysis tool can determine if a sentence expresses positive or negative emotions."- Gemini 2024

I/PRP like/VBP cats/NNS

The goal of part-of-speech tagging is to label each token in a sequence with its grammatical category. Languages are made up of words, symbols, and rules for constructing sequences; together, these define a language's grammar.
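
The tagged example above can be reproduced with NLTK's pos_tag (a minimal sketch; it assumes the punkt and averaged_perceptron_tagger resources have already been downloaded):

from nltk import pos_tag, word_tokenize

# One-time setup (uncomment on first run):
# import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

tagged = pos_tag(word_tokenize('I like cats'))
print(' '.join(f'{word}/{tag}' for word, tag in tagged))  # I/PRP like/VBP cats/NNS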

Word classes (categories)

Sets of words in a language can be relatively static (unchanging) or dynamic, where new words are added and unused words are retired. Some examples in the English language include:

Closed (Static)
Relatively fixed word sets (e.g. articles, pronouns, prepositions, ...)
Open (Dynamic)
Changing word sets (e.g. nouns, verbs, adjectives, ...)
For instance, in September 2023, Merriam-Webster added 690 new words to its English dictionary.

Penn TreeBank

The English Penn Treebank corpus is one of the best-known and most widely used corpora for evaluating sequence-labelling models.

On paperswithcode.com we see publications referencing the Penn Treebank from 2004 to today.
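
A sample of the treebank ships with NLTK, so gold-standard tags can be inspected directly (a minimal sketch; it assumes the treebank resource has been downloaded):

from nltk.corpus import treebank

# One-time setup: import nltk; nltk.download('treebank')
print(treebank.tagged_sents()[0])    # first sentence as (word, tag) pairs
print(len(treebank.tagged_sents()))  # number of sentences in the sample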

Penn Treebank POS Tagset

Tag    Description                                 Examples
CC     Coordinating conjunction                    for, and, nor
CD     Cardinal number                             one, three, million
DT     Determiner                                  a, an, the
EX     Existential there                           There is a cat on the patio.
FW     Foreign word                                Aloha, Hola, Bonjour
IN     Preposition or subordinating conjunction    in, on, because
JJ     Adjective                                   big, small, happy
JJR    Adjective, comparative                      bigger, smaller, happier
JJS    Adjective, superlative                      biggest, smallest, happiest
LS     List item marker                            *, -, 1.
MD     Modal auxiliary                             can, must, will
NN     Noun, singular or mass                      book, table, dog
NNS    Noun, plural                                books, tables, dogs
NNP    Proper noun, singular                       Bob, Arizonan, American
NNPS   Proper noun, plural                         Bobs, Arizonans, Americans
PDT    Predeterminer                               all, both, half
POS    Possessive ending                           's (e.g., dog's)
PRP    Personal pronoun                            I, you, he
PRP$   Possessive pronoun                          my, your, his
RB     Adverb                                      well, little, much
RBR    Adverb, comparative                         better, less, more
RBS    Adverb, superlative                         best, least, most
RP     Particle                                    up, down, in
SYM    Symbol                                      %, &, @
TO     to                                          to
UH     Interjection                                oh, wow, hello
VB     Verb, base form                             run, eat, sleep
VBD    Verb, past tense                            ran, ate, slept
VBG    Verb, gerund or present participle          running, eating, sleeping
VBN    Verb, past participle                       (has) run, eaten, slept
VBP    Verb, non-3rd person singular present       I run, you eat, we sleep
VBZ    Verb, 3rd person singular present           He runs, she eats, it sleeps
WDT    Wh-determiner                               what, which, that
WP     Wh-pronoun                                  who, what, whom
WP$    Possessive wh-pronoun                       whose
WRB    Wh-adverb                                   when, how, why
$      Dollar sign                                 $
.      Sentence-final punctuation                  ., ?, !
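
NLTK can also print the official description and examples for any of these tags (it assumes the tagsets resource has been downloaded):

import nltk

# One-time setup: nltk.download('tagsets')
nltk.help.upenn_tagset('RB')    # describe a single tag
nltk.help.upenn_tagset('NN.*')  # regular expression: all noun tags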

Generating Tags

NLTK

The nltk.tag package provides methods for part-of-speech tagging.

Docs
from nltk.tag import pos_tag_sents
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time setup (uncomment on first run):
# import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

quotes = '''
All programs have a desire to be useful.
If a machine can learn the value of human life, maybe we can too.
We marveled at our own magnificence as we gave birth to AI.
'''

sents = sent_tokenize(quotes)                  # split the text into sentences
tokenized = [word_tokenize(s) for s in sents]  # split each sentence into tokens
tags = pos_tag_sents(tokenized)                # tag every tokenized sentence
for t in tags:
    print(t)

Types of POS Taggers

  • Rule-based taggers rely on a pre-defined set of rules.
  • Statistical taggers use statistical models trained on large datasets of pre-tagged text. For example, a Hidden Markov Model (HMM) decoded with the Viterbi algorithm assigns part-of-speech tags using learned probabilities (see the sketch after this list).
  • Transformation-based taggers combine rule-based and statistical methods (e.g. the Brill tagger).
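
To make the HMM idea concrete, below is a minimal Viterbi sketch over a toy model. The tags, transition probabilities, and emission probabilities are invented for illustration; a real tagger estimates them from a tagged corpus:

import math

TAGS = ["PRP", "VBP", "NNS"]

# P(tag | previous tag); "<s>" is the start state (toy numbers)
trans = {
    ("<s>", "PRP"): 0.6, ("<s>", "NNS"): 0.3, ("<s>", "VBP"): 0.1,
    ("PRP", "VBP"): 0.7, ("PRP", "NNS"): 0.2, ("PRP", "PRP"): 0.1,
    ("VBP", "NNS"): 0.6, ("VBP", "PRP"): 0.2, ("VBP", "VBP"): 0.2,
    ("NNS", "VBP"): 0.5, ("NNS", "NNS"): 0.3, ("NNS", "PRP"): 0.2,
}

# P(word | tag) (toy numbers); unseen pairs get a tiny floor probability
emit = {("PRP", "i"): 0.5, ("VBP", "like"): 0.4, ("NNS", "cats"): 0.3}
FLOOR = 1e-8

def viterbi(words):
    # best[t][tag] = (log-prob of best tag path ending in tag at step t, backpointer)
    best = [{tag: (math.log(trans.get(("<s>", tag), FLOOR) *
                            emit.get((tag, words[0]), FLOOR)), None)
             for tag in TAGS}]
    for t, word in enumerate(words[1:], start=1):
        row = {}
        for tag in TAGS:
            score, prev = max(
                (best[t - 1][pt][0]
                 + math.log(trans.get((pt, tag), FLOOR))
                 + math.log(emit.get((tag, word), FLOOR)), pt)
                for pt in TAGS)
            row[tag] = (score, prev)
        best.append(row)
    # follow backpointers from the best final tag
    tag = max(best[-1], key=lambda k: best[-1][k][0])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return list(reversed(path))

print(list(zip("i like cats".split(), viterbi("i like cats".split()))))
# [('i', 'PRP'), ('like', 'VBP'), ('cats', 'NNS')]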

The noisy channel model is a conceptual framework used in NLP. Because typos and errors are common in text, it illustrates the challenges of part-of-speech tagging and how statistical taggers address them. (See noisy.html for an example of spelling correction using this framework.)

POS Tagging Tools

Python libraries and some of their POS-tagging methods:

  • NLTK
    • pos_tag()
    • pyStatParser - NLTK parse trees
    • Stanford: from nltk.tag.stanford import StanfordTagger
    • MaltParser: nltk.parse.malt.MaltParser
    • Pattern: from pattern.en import tag
  • SpaCy
    • en_core_web_trf, a language model based on the transformer architecture
  • TextBlob (see the sketch after this list)
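
As a quick taste of TextBlob's interface (a minimal sketch; it assumes textblob is installed, e.g. pip install textblob, plus python -m textblob.download_corpora for its NLTK data):

from textblob import TextBlob

# .tags returns a list of (word, tag) pairs using TextBlob's default tagger
blob = TextBlob('All programs have a desire to be useful.')
print(blob.tags)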

Comparison of Taggers (2021) - Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models

Survey Paper (2022) - Part of speech tagging: a systematic review of deep learning and machine learning approaches

Comparing NLTK and SpaCy Tags

SpaCy

The spacy.io package provides methods for NLP tasks, including POS tagging.

Docs
import spacy
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from tabulate import tabulate

quotes = [
    'All programs have a desire to be useful.',
    'If a machine can learn the value of human life, maybe we can too.',
    'We marveled at our own magnificence as we gave birth to AI.'
]

# One-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for quote in quotes:
    print(f'\n{quote}\n')
    data = []
    tokens = word_tokenize(quote)
    nltk_tags = pos_tag(tokens)                                      # NLTK (word, tag) pairs
    spacy_tags = [(token.text, token.tag_) for token in nlp(quote)]  # spaCy fine-grained tags

    # compare tags only where both tokenizers produced the same token
    for x, y in zip(nltk_tags, spacy_tags):
        if x[0] == y[0]:
            data.append([x[0], x[1], y[1]])

    print(tabulate(data, ['token', 'nltk', 'spacy'], "simple"))

Code Output


    All programs have a desire to be useful.

    token     nltk    spacy
    --------  ------  -------
    All       DT      DT
    programs  NNS     NNS
    have      VBP     VBP
    a         DT      DT
    desire    NN      NN
    to        TO      TO
    be        VB      VB
    useful    JJ      JJ
    .         .       .

    If a machine can learn the value of human life, maybe we can too.

    token    nltk    spacy
    -------  ------  -------
    If       IN      IN
    a        DT      DT
    machine  NN      NN
    can      MD      MD
    learn    VB      VB
    the      DT      DT
    value    NN      NN
    of       IN      IN
    human    JJ      JJ
    life     NN      NN
    ,        ,       ,
    maybe    RB      RB
    we       PRP     PRP
    can      MD      MD
    too      RB      RB
    .        .       .

    We marveled at our own magnificence as we gave birth to AI.

    token         nltk    spacy
    ------------  ------  -------
    We            PRP     PRP
    marveled      VBD     VBD
    at            IN      IN
    our           PRP$    PRP$
    own           JJ      JJ
    magnificence  NN      NN
    as            IN      IN
    we            PRP     PRP
    gave          VBD     VBD
    birth         NN      NN
    to            TO      IN
    AI            NNP     NNP
    .             .       .
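
Note the single disagreement above: for the prepositional "to" in "birth to AI", NLTK's default tagger outputs TO, while spaCy tags prepositional to as IN and reserves TO for the infinitival marker (as in "to be" in the first quote).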