Context-Free Grammars & Pushdown Automata

Moving beyond regular languages to the richer, stack-powered world of context-free languages and parsing.

Learning Objectives

Context-Free Grammars

Regular languages capture a wide range of useful patterns, but they cannot express certain recursive structures. For example, the language {aⁿbⁿ | n ≥ 0} — equal numbers of a's followed by b's — is not regular. Such nested or balanced structures require a more powerful formalism: the context-free grammar (CFG).

Definition — Context-Free Grammar (CFG)

A CFG is a 4-tuple G = (V, Σ, R, S) where V is a finite set of variables (non-terminals); Σ is the terminal alphabet (V ∩ Σ = ∅); R is a finite set of production rules of the form A → α where A ∈ V and α ∈ (V ∪ Σ)*; and S ∈ V is the start variable. The language L(G) is the set of all terminal strings derivable from S.

The term "context-free" means that a production rule A → α can be applied wherever A appears, regardless of the surrounding context. This is what distinguishes CFGs from the more general (and more powerful) context-sensitive grammars.

Derivations and Parse Trees

To generate a string, we begin with the start symbol S and repeatedly replace a variable using one of its production rules. The sequence of replacements is a derivation. A parse tree is a tree that records the structure of the derivation: the root is S, internal nodes are variables, and leaves are terminals.

Example — Grammar for {aⁿbⁿ | n ≥ 0}
  • Variables: {S}   Terminals: {a, b}   Start: S
  • Rules: S → aSb | ε
  • Derivation of "aabb": S ⇒ aSb ⇒ aaSbb ⇒ aabb
  • This grammar generates the balanced language {aⁿbⁿ} which no regular expression can describe.
Example — Arithmetic Expressions
  • E → E + T | T
  • T → T × F | F
  • F → ( E ) | id
  • This grammar captures the precedence of × over + and handles parenthesisation, mirroring the grammar of most programming languages.
A parse tree for the expression "aab".
Image: Source

Pushdown Automata (PDA)

Context-free languages are recognised by pushdown automata — machines like NFAs but augmented with an unbounded stack. The stack allows the machine to remember an unlimited amount of information in a last-in-first-out fashion, making it powerful enough to match nested structures.

Definition — Pushdown Automaton (PDA)

A PDA is a 7-tuple M = (Q, Σ, Γ, δ, q₀, Z₀, F) where Q is the set of states; Σ is the input alphabet; Γ is the stack alphabet; δ: Q × (Σ ∪ {ε}) × Γ → 𝒫(Q × Γ*) is the transition function; q₀ is the start state; Z₀ ∈ Γ is the initial stack symbol; and F ⊆ Q is the set of accepting states.

At each step a PDA reads an input symbol (or ε), pops the top stack symbol, moves to a new state, and pushes a string of stack symbols. The stack provides the "memory" needed to track unmatched symbols — for example, pushing an 'A' for each 'a' read, then popping one 'A' for each 'b', ensures equal counts.

Schematic of a pushdown automaton: it has a read-only input tape (like a finite automaton) plus a stack for unlimited auxiliary storage.
Image: Source

The CFG–PDA Equivalence

Theorem — CFG/PDA Equivalence

A language L is context-free if and only if there exists a pushdown automaton that accepts L. Every CFG can be converted to an equivalent PDA, and every PDA can be converted to an equivalent CFG.

The forward direction (CFG → PDA) is constructive: build a single-state PDA that uses its stack to simulate leftmost derivations of the grammar. At each step, if the top of the stack is a variable A, non-deterministically replace it with the right-hand side of some A-production; if the top is a terminal, match it against the next input symbol.


Parsing Context-Free Languages

Parsing is the process of taking a string and determining whether it belongs to a CFL, and if so, recovering the parse tree that exhibits its grammatical structure. Parsing is a core operation in every compiler and interpreter.

CYK Algorithm

The Cocke-Younger-Kasami (CYK) algorithm is a classic dynamic programming parser that works for any CFG in Chomsky Normal Form (CNF), where every rule has the form A → BC or A → a. It fills a triangular table in O(n³) time, where n is the length of the input string.

Top-Down and Bottom-Up Parsing

Parsing Strategies
  • Top-down (LL parsers): Start from S and try to derive the input left-to-right. Recursive descent parsers are a practical implementation of this approach.
  • Bottom-up (LR parsers): Shift input symbols onto a stack and reduce using grammar rules when a right-hand side is recognised. More powerful than LL and used in most production compilers (yacc, bison).
  • Earley's algorithm: A general O(n³) parser that handles all CFGs, including ambiguous ones.

Context-free grammars and pushdown automata sit at a crucial level of the Chomsky hierarchy: they are powerful enough to describe the syntax of programming languages, yet weak enough to allow efficient parsing — a balance that underpins the design of every language compiler in existence.