"Part-of-speech tagging, often abbreviated as POS tagging, acts like a linguistic passport for words in a sentence. It assigns grammatical labels like noun, verb, adjective, or adverb to each word. This helps computers understand the structure and meaning of a sentence. By knowing a word's part of speech, a machine translation model can translate it more accurately, or a sentiment analysis tool can determine if a sentence expresses positive or negative emotions."- Gemini 2024
The goal of part-of-speech tagging is to label each word in a sequence with its grammatical category. For example, in "Dogs bark loudly.", "Dogs" is a noun, "bark" a verb, and "loudly" an adverb. Languages are made up of words, symbols, and rules for constructing sequences; together these define a language's grammar.
Word classes (categories)
Sets of words in a language can be relatively static (closed classes, whose membership rarely changes) or dynamic (open classes, where new words are added and unused words fall out of use). In English, determiners, prepositions, conjunctions, and pronouns are closed classes, while nouns, verbs, adjectives, and adverbs are open classes that regularly gain new members; a rough illustration follows.
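As an illustration (a sketch added here, not from the original page), the tagged Brown corpus can be used to compare the vocabulary size of a closed class with that of an open class. This assumes the NLTK brown and universal_tagset data packages are available.

from collections import defaultdict

import nltk
from nltk.corpus import brown

# One-time downloads (assumption: the data is not yet installed)
nltk.download('brown')
nltk.download('universal_tagset')

# Collect the distinct word forms observed under each universal tag
vocab_by_tag = defaultdict(set)
for word, tag in brown.tagged_words(tagset='universal'):
    vocab_by_tag[tag].add(word.lower())

# Closed classes (e.g. determiners) have a small, stable vocabulary;
# open classes (e.g. nouns) are far larger and keep growing.
print(len(vocab_by_tag['DET']), 'distinct determiners')
print(len(vocab_by_tag['NOUN']), 'distinct nouns')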
Penn Treebank
The English Penn Treebank corpus is one of the best-known and most widely used corpora for evaluating sequence-labelling models.
On paperswithcode.com, publications referencing the Penn Treebank appear from 2004 to the present.
Penn Treebank POS Tagset
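The tagset can also be inspected programmatically through NLTK's built-in help; a minimal sketch, assuming the tagsets help data has been downloaded with nltk.download('tagsets'):

import nltk

# One-time download of the tagset documentation (assumption: not yet installed)
nltk.download('tagsets')

# Describe a single Penn Treebank tag, with example words
nltk.help.upenn_tagset('NN')

# Tags can also be looked up by regular expression, e.g. all verb tags
nltk.help.upenn_tagset('VB.*')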
Generating Tags
The nltk.tag package provides methods for part-of-speech tagging.
nltk.tag
from nltk.tag import pos_tag_sents
from nltk.tokenize import word_tokenize, sent_tokenize

# Requires the NLTK tokenizer and tagger data,
# e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

quotes = '''
All programs have a desire to be useful.
If a machine can learn the value of human life, maybe we can too.
We marveled at our own magnificence as we gave birth to AI.
'''

# Split the text into sentences, then tokenize each sentence into words
sents = sent_tokenize(quotes)
tokenized = [word_tokenize(x) for x in sents]

# Tag every tokenized sentence in a single call
tags = pos_tag_sents(tokenized)
for t in tags:
    print(t)
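Each sentence comes back as a list of (token, tag) tuples; for the first quote the printed result should look something like this (tags consistent with the comparison table further below):

[('All', 'DT'), ('programs', 'NNS'), ('have', 'VBP'), ('a', 'DT'), ('desire', 'NN'), ('to', 'TO'), ('be', 'VB'), ('useful', 'JJ'), ('.', '.')]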
Types of POS Taggers
The noisy channel model is a conceptual framework used in NLP. Because typos and errors are common in text, it illustrates both the challenges of part-of-speech tagging and how statistical taggers address them. (See noisy.html for an example of spelling correction using this framework.)
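In this view, the observed words are treated as a noisy rendering of a hidden tag sequence, and a statistical (HMM-style) tagger recovers the most probable tags. A standard formulation, added here as a sketch rather than taken from the original page:

\[
\hat{t}_{1:n} = \operatorname*{arg\,max}_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
             = \operatorname*{arg\,max}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\]

where \(P(w_i \mid t_i)\) is the emission (channel) probability and \(P(t_i \mid t_{i-1})\) is the tag-transition (source) probability.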
POS Tagging Tools
Python libraries and some POS methods:
- NLTK: pos_tag()
- Stanford tagger (via NLTK): from nltk.tag.stanford import StanfordTagger
- MaltParser (via NLTK): nltk.parse.malt.MaltParser
- Pattern: from pattern.en import tag (a short usage sketch follows this list)
- spaCy: en_core_web_trf
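As a quick illustration of one of these interfaces, Pattern's tag() returns (word, tag) pairs directly from a string. A minimal sketch, assuming the pattern package is installed and compatible with your Python version:

from pattern.en import tag

# tag() tokenizes the string and returns a list of (word, POS) tuples
for word, pos in tag('All programs have a desire to be useful.'):
    print(word, pos)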
Comparison of Taggers (2021) - Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models
Survey Paper (2022) - Part of speech tagging: a systematic review of deep learning and machine learning approaches
Comparing NLTK and spaCy Tags
The spaCy package (spacy.io) provides methods for NLP tasks, including POS tagging.
spacy.io
import spacy
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from tabulate import tabulate

quotes = [
    'All programs have a desire to be useful.',
    'If a machine can learn the value of human life, maybe we can too.',
    'We marveled at our own magnificence as we gave birth to AI.'
]

# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for quote in quotes:
    print(f'\n{quote}\n')
    data = []

    # NLTK: tokenize, then tag
    tokens = word_tokenize(quote)
    nltk_tags = pos_tag(tokens)

    # spaCy: the pipeline tokenizes and tags in one pass
    spacy_tags = [(token.text, token.tag_) for token in nlp(quote)]

    # Compare tags only where both tools produced the same token
    for x, y in zip(nltk_tags, spacy_tags):
        if x[0] == y[0]:
            data.append([x[0], x[1], y[1]])

    print(tabulate(data, ['token', 'nltk', 'spacy'], "simple"))
Code Output
All programs have a desire to be useful.

token     nltk    spacy
--------  ------  -------
All       DT      DT
programs  NNS     NNS
have      VBP     VBP
a         DT      DT
desire    NN      NN
to        TO      TO
be        VB      VB
useful    JJ      JJ
.         .       .

If a machine can learn the value of human life, maybe we can too.

token    nltk    spacy
-------  ------  -------
If       IN      IN
a        DT      DT
machine  NN      NN
can      MD      MD
learn    VB      VB
the      DT      DT
value    NN      NN
of       IN      IN
human    JJ      JJ
life     NN      NN
,        ,       ,
maybe    RB      RB
we       PRP     PRP
can      MD      MD
too      RB      RB
.        .       .

We marveled at our own magnificence as we gave birth to AI.

token         nltk    spacy
------------  ------  -------
We            PRP     PRP
marveled      VBD     VBD
at            IN      IN
our           PRP$    PRP$
own           JJ      JJ
magnificence  NN      NN
as            IN      IN
we            PRP     PRP
gave          VBD     VBD
birth         NN      NN
to            TO      IN
AI            NNP     NNP
.             .       .

The one disagreement is "to" in the last quote: used here as a preposition before "AI", it keeps the Penn Treebank convention of TO under NLTK, while spaCy tags prepositional "to" as IN and reserves TO for infinitival "to".