Information Extraction in NLP

"In the vast realm of unstructured text, lies a treasure trove of valuable information waiting to be unearthed. Information extraction, a fundamental task in natural language processing (NLP), empowers us to unlock this hidden knowledge by transforming raw text into structured data. By delving into the semantic content and meaning of text, information extraction enables us to discover and extract key insights, relationships, and entities. This series will embark on a comprehensive exploration of information extraction techniques, shedding light on the methods and algorithms that enable us to extract meaningful information from the written word." - Gemini 2024

Information Extraction Tasks

Named Entity Recognition

Entities

Names of people, places, and companies
Dates and times
Amounts (quantities)
Domain based entities
- 'omics - genes and proteins
- social media - usernames and pages

Reference resolution

Entity clustering - e.g. John, John Smith and Mr. Smith refer to the same entity
Coreference resolution - e.g. an instance of 'he' refers to the entity John Smith (In Depth)
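
The entity clustering case above can be illustrated with a small heuristic. This is only a sketch, assuming that two mentions belong together when the shorter name's words all appear in the longer one once honorifics are removed. The names `TITLES`, `name_tokens`, and `cluster_mentions` are hypothetical, and real systems use far richer evidence (context, gender, entity type).

```python
# Hypothetical list of honorifics to ignore when comparing names
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def name_tokens(mention):
    """Lowercase the mention, drop periods, and remove honorifics."""
    words = mention.lower().replace(".", "").split()
    return frozenset(w for w in words if w not in TITLES)


def cluster_mentions(mentions):
    """Greedy clustering: longer names are processed first, and a mention
    joins the first cluster whose canonical name contains all of its tokens."""
    clusters = []  # list of (canonical token set, member mentions)
    ordered = sorted(mentions, key=lambda m: len(name_tokens(m)), reverse=True)
    for mention in ordered:
        tokens = name_tokens(mention)
        for canonical, members in clusters:
            if tokens <= canonical:
                members.append(mention)
                break
        else:
            clusters.append((tokens, [mention]))
    return [members for _, members in clusters]


mentions = ["John", "John Smith", "Mr. Smith", "Mary Jones"]
print(cluster_mentions(mentions))
# [['John Smith', 'John', 'Mr. Smith'], ['Mary Jones']]
```

Note that this heuristic would also attach a bare "John" to whichever full name containing "John" it meets first, which is exactly the same-name ambiguity discussed next.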

Ambiguity challenges

Same name, same type (John)
Same name, different type (Paris)

Technique

Sequence labeling - train classifier to label tokens with tags
Similar to POS tagging

Text ⇨ Human Annotation ⇨ Encoding and Feature Extraction ⇨ Train-Test Split ⇨ Train Labeler (HMM, SVM, etc)
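
To make the "label tokens with tags" step concrete, here is a minimal sketch of how human annotations might be encoded before training. It assumes the common BIO scheme (B = beginning of an entity, I = inside, O = outside), which is one popular choice rather than something this post prescribes. The function name `spans_to_bio` and the sample sentence are hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Convert annotated entity spans into one BIO tag per token.

    spans: list of (start, end, label) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["John", "Smith", "flew", "to", "Paris", "on", "Monday"]
spans = [(0, 2, "PER"), (4, 5, "LOC"), (6, 7, "DATE")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

The resulting (token, tag) pairs are what a sequence labeler such as an HMM would be trained to predict, in the same way a POS tagger predicts part-of-speech tags.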

Relation detection & classification

Semantic relations - employment, family, membership, geospatial, parts of a whole
Examples - parent company & subsidiaries, company & employees, etc
Technique - Supervised learning
1. Extract features of named entities
2. Detect presence of relation (train classifier to make binary decision of existence of relation between pairs of entities)
3. Classify relation (train classifier to label relations using multiclass labeling)
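
Step 1 above can be sketched as follows. This is an illustrative feature set only (entity types, token distance, and the words between the two entities), assumed for the example rather than taken from a specific system. The function `pair_features` and the sample sentence are hypothetical. The classifiers in steps 2 and 3 would be trained on rows like these.

```python
from itertools import combinations


def pair_features(tokens, entities):
    """Build one feature dict per candidate pair of named entities.

    entities: list of (start, end, label) token spans, end exclusive,
    sorted by start position and non-overlapping.
    """
    rows = []
    for (s1, e1, t1), (s2, e2, t2) in combinations(entities, 2):
        between = tokens[e1:s2]  # tokens separating the two entities
        rows.append({
            "arg1": " ".join(tokens[s1:e1]),
            "arg2": " ".join(tokens[s2:e2]),
            "type_pair": f"{t1}-{t2}",
            "distance": len(between),
            "words_between": [w.lower() for w in between],
        })
    return rows


tokens = "Jane Doe works for Acme Corp in Paris".split()
entities = [(0, 2, "PER"), (4, 6, "ORG"), (7, 8, "LOC")]

for row in pair_features(tokens, entities):
    print(row)
```

Each row would receive a binary label (relation present or not) for the detection classifier, and the positive rows a relation type such as employment or geospatial for the multiclass classifier.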

Event detection & classification

Identify expressions denoting an event or state that can be assigned to a point or interval in time
Rule based or statistical methods
Reference resolution
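
As a rough illustration of the rule-based option, the sketch below flags event mentions by looking words up in a small trigger lexicon. The lexicon contents, the event type names, and the function `detect_events` are all hypothetical; a statistical approach would instead learn triggers from annotated text.

```python
import re

# Hypothetical trigger lexicon mapping surface forms to event types
TRIGGERS = {
    "acquired": "ACQUISITION",
    "merged": "MERGER",
    "resigned": "RESIGNATION",
    "hired": "HIRING",
}


def detect_events(sentence):
    """Return (trigger word, event type, character offset) per lexicon match."""
    events = []
    for match in re.finditer(r"\w+", sentence):
        word = match.group().lower()
        if word in TRIGGERS:
            events.append((match.group(), TRIGGERS[word], match.start()))
    return events


sentence = "Acme acquired Widgets Inc. after its CEO resigned."
print(detect_events(sentence))
```

The character offsets make it possible to link each detected event back to nearby entities and temporal expressions in later steps.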

Temporal expression recognition & temporal analysis

Identify precise (e.g. 3/12, 1:20) and relative (2 days ago) temporal expressions
Normalize expressions to a standard form (e.g. ISO 8601 timestamps)
Fix temporal expressions to a date-time "anchor"
Associate temporal expressions with events
Organize events in a complete and coherent timeline
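
The anchoring and normalization steps can be sketched with the standard library. This minimal example, under the assumption that only expressions of the form "N days ago" or "N weeks ago" need handling, resolves a relative expression against an anchor and emits an ISO 8601 date. The pattern `RELATIVE` and the function `normalize_relative` are hypothetical names.

```python
import re
from datetime import datetime, timedelta

# Matches relative expressions such as "2 days ago" or "3 weeks ago"
RELATIVE = re.compile(r"(\d+)\s+(day|week)s?\s+ago", re.IGNORECASE)


def normalize_relative(expr, anchor):
    """Resolve a relative expression against an anchor datetime.

    Returns an ISO 8601 date string, or None if the expression is not
    one of the supported forms.
    """
    match = RELATIVE.fullmatch(expr.strip())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2).lower()
    days = amount * (7 if unit == "week" else 1)
    return (anchor - timedelta(days=days)).date().isoformat()


anchor = datetime(2024, 3, 12, 13, 20)  # e.g. the document's creation time
print(normalize_relative("2 days ago", anchor))   # 2024-03-10
print(normalize_relative("3 weeks ago", anchor))  # 2024-02-20
```

Once every expression is normalized to the same format, events tied to those expressions can be sorted into a timeline.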

Text Preprocessing

Preprocessing text means cleaning and preparing data for analysis. The exact steps vary with the task. This example outlines some typical steps to normalize and preprocess a sentence, using a test sentence that may pose a challenge for coreference resolution.

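import re
import nltk
from nltk.corpus import stopwords

# one-time downloads: stopword list, tokenizer models, WordNet for the lemmatizer
nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead
nltk.download('wordnet')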

def simple_clean(text):
    t = text.lower()
    t = re.sub(r'\d+', '', t)  # remove digits
    t = re.sub(r'[^\w\s]', '', t)  # remove punctuation (anything that is not a word character or whitespace)

    # tokenize
    tokens = nltk.word_tokenize(t)

    # remove stopwords
    stops = set(stopwords.words('english'))
    tokens = [x for x in tokens if x not in stops]

    # may need to correct spelling first

    # stemming or lemmatization
    lem = nltk.WordNetLemmatizer()
    lem_tokens = [lem.lemmatize(x) for x in tokens]

    return ' '.join(lem_tokens)

test = "The captain's ferret escaped, if you see him report to him immediately."

print(simple_clean(test))

# >> captain ferret escaped see report immediately