"Part-of-speech tagging, often abbreviated as POS tagging, acts like a linguistic passport for words in a sentence. It assigns grammatical labels like noun, verb, adjective, or adverb to each word. This helps computers understand the structure and meaning of a sentence. By knowing a word's part of speech, a machine translation model can translate it more accurately, or a sentiment analysis tool can determine if a sentence expresses positive or negative emotions."- Gemini 2024
The goal of part-of-speech tagging is to label each word in a sequence with its grammatical category. For example, in "Dogs bark loudly.", "Dogs" is a noun, "bark" a verb, and "loudly" an adverb. Languages are made up of words, symbols, and rules for constructing sequences; together these define a language's grammar.
Word classes (categories)
Sets of words in a language can be relatively static (closed classes, whose membership rarely changes) or dynamic (open classes, where new words are added and unused words fall out of use). In English, determiners, prepositions, conjunctions, and pronouns are closed classes, while nouns, verbs, adjectives, and adverbs are open classes that regularly gain new members; a rough illustration follows.
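As an illustration (a sketch added here, not from the original page), the tagged Brown corpus can be used to compare the vocabulary size of a closed class with that of an open class. This assumes the NLTK brown and universal_tagset data packages are available.

from collections import defaultdict

import nltk
from nltk.corpus import brown

# One-time downloads (assumption: the data is not yet installed)
nltk.download('brown')
nltk.download('universal_tagset')

# Collect the distinct word forms observed under each universal tag
vocab_by_tag = defaultdict(set)
for word, tag in brown.tagged_words(tagset='universal'):
    vocab_by_tag[tag].add(word.lower())

# Closed classes (e.g. determiners) have a small, stable vocabulary;
# open classes (e.g. nouns) are far larger and keep growing.
print(len(vocab_by_tag['DET']), 'distinct determiners')
print(len(vocab_by_tag['NOUN']), 'distinct nouns')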
Penn Treebank
The English Penn Treebank corpus is one of the best-known and most widely used corpora for evaluating sequence-labelling models.
On paperswithcode.com, publications referencing the Penn Treebank appear from 2004 to the present.
Penn Treebank POS Tagset
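The tagset can also be inspected programmatically through NLTK's built-in help; a minimal sketch, assuming the tagsets help data has been downloaded with nltk.download('tagsets'):

import nltk

# One-time download of the tagset documentation (assumption: not yet installed)
nltk.download('tagsets')

# Describe a single Penn Treebank tag, with example words
nltk.help.upenn_tagset('NN')

# Tags can also be looked up by regular expression, e.g. all verb tags
nltk.help.upenn_tagset('VB.*')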
Generating Tags
The nltk.tag package provides methods for part-of-speech tagging.
nltk.tag
from nltk.tag import pos_tag_sents
from nltk.tokenize import word_tokenize, sent_tokenize

# Requires the NLTK tokenizer and tagger data,
# e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

quotes = '''
All programs have a desire to be useful.
If a machine can learn the value of human life, maybe we can too.
We marveled at our own magnificence as we gave birth to AI.
'''

# Split the text into sentences, then tokenize each sentence into words
sents = sent_tokenize(quotes)
tokenized = [word_tokenize(x) for x in sents]

# Tag every tokenized sentence in a single call
tags = pos_tag_sents(tokenized)
for t in tags:
    print(t)
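Each sentence comes back as a list of (token, tag) tuples; for the first quote the printed result should look something like this (tags consistent with the comparison table further below):

[('All', 'DT'), ('programs', 'NNS'), ('have', 'VBP'), ('a', 'DT'), ('desire', 'NN'), ('to', 'TO'), ('be', 'VB'), ('useful', 'JJ'), ('.', '.')]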
Types of POS Taggers
The noisy channel model is a conceptual framework used in NLP. Because typos and errors are common in text, it illustrates both the challenges of part-of-speech tagging and how statistical taggers address them. (See noisy.html for an example of spelling correction using this framework.)
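In this view, the observed words are treated as a noisy rendering of a hidden tag sequence, and a statistical (HMM-style) tagger recovers the most probable tags. A standard formulation, added here as a sketch rather than taken from the original page:

\[
\hat{t}_{1:n} = \operatorname*{arg\,max}_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
             = \operatorname*{arg\,max}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\]

where \(P(w_i \mid t_i)\) is the emission (channel) probability and \(P(t_i \mid t_{i-1})\) is the tag-transition (source) probability.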
POS Tagging Tools
Python libraries and some POS methods:
- NLTK: pos_tag()
- Stanford tagger (via NLTK): from nltk.tag.stanford import StanfordTagger
- MaltParser (via NLTK): nltk.parse.malt.MaltParser
- Pattern: from pattern.en import tag (a short usage sketch follows this list)
- spaCy: en_core_web_trf
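As a quick illustration of one of these interfaces, Pattern's tag() returns (word, tag) pairs directly from a string. A minimal sketch, assuming the pattern package is installed and compatible with your Python version:

from pattern.en import tag

# tag() tokenizes the string and returns a list of (word, POS) tuples
for word, pos in tag('All programs have a desire to be useful.'):
    print(word, pos)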
Comparison of Taggers (2021) - Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models
Survey Paper (2022) - Part of speech tagging: a systematic review of deep learning and machine learning approaches
Comparing NLTK and spaCy Tags
The spaCy package (spacy.io) provides methods for NLP tasks, including POS tagging.
spacy.io
import spacy
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from tabulate import tabulate

quotes = [
    'All programs have a desire to be useful.',
    'If a machine can learn the value of human life, maybe we can too.',
    'We marveled at our own magnificence as we gave birth to AI.'
]

# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for quote in quotes:
    print(f'\n{quote}\n')
    data = []

    # NLTK: tokenize, then tag
    tokens = word_tokenize(quote)
    nltk_tags = pos_tag(tokens)

    # spaCy: the pipeline tokenizes and tags in one pass
    spacy_tags = [(token.text, token.tag_) for token in nlp(quote)]

    # Compare tags only where both tools produced the same token
    for x, y in zip(nltk_tags, spacy_tags):
        if x[0] == y[0]:
            data.append([x[0], x[1], y[1]])

    print(tabulate(data, ['token', 'nltk', 'spacy'], "simple"))
Code Output
All programs have a desire to be useful.

token     nltk    spacy
--------  ------  -------
All       DT      DT
programs  NNS     NNS
have      VBP     VBP
a         DT      DT
desire    NN      NN
to        TO      TO
be        VB      VB
useful    JJ      JJ
.         .       .

If a machine can learn the value of human life, maybe we can too.

token    nltk    spacy
-------  ------  -------
If       IN      IN
a        DT      DT
machine  NN      NN
can      MD      MD
learn    VB      VB
the      DT      DT
value    NN      NN
of       IN      IN
human    JJ      JJ
life     NN      NN
,        ,       ,
maybe    RB      RB
we       PRP     PRP
can      MD      MD
too      RB      RB
.        .       .

We marveled at our own magnificence as we gave birth to AI.

token         nltk    spacy
------------  ------  -------
We            PRP     PRP
marveled      VBD     VBD
at            IN      IN
our           PRP$    PRP$
own           JJ      JJ
magnificence  NN      NN
as            IN      IN
we            PRP     PRP
gave          VBD     VBD
birth         NN      NN
to            TO      IN
AI            NNP     NNP
.             .       .

The one disagreement is "to" in the last quote: used here as a preposition before "AI", it keeps the Penn Treebank convention of TO under NLTK, while spaCy tags prepositional "to" as IN and reserves TO for infinitival "to".