Context-Free Grammars
Regular languages capture a wide range of useful patterns, but they cannot express certain recursive structures. For example, the language {aⁿbⁿ | n ≥ 0} — equal numbers of a's followed by b's — is not regular. Such nested or balanced structures require a more powerful formalism: the context-free grammar (CFG).
A CFG is a 4-tuple G = (V, Σ, R, S) where V is a finite set of variables (non-terminals); Σ is the terminal alphabet (V ∩ Σ = ∅); R is a finite set of production rules of the form A → α where A ∈ V and α ∈ (V ∪ Σ)*; and S ∈ V is the start variable. The language L(G) is the set of all terminal strings derivable from S.
The term "context-free" means that a production rule A → α can be applied wherever A appears, regardless of the surrounding context. This is what distinguishes CFGs from the more general (and more powerful) context-sensitive grammars.
Derivations and Parse Trees
To generate a string, we begin with the start symbol S and repeatedly replace a variable using one of its production rules. The sequence of replacements is a derivation. A parse tree is a tree that records the structure of the derivation: the root is S, internal nodes are variables, and leaves are terminals.
- Variables: {S} Terminals: {a, b} Start: S
- Rules: S → aSb | ε
- Derivation of "aabb": S ⇒ aSb ⇒ aaSbb ⇒ aabb
- This grammar generates the balanced language {aⁿbⁿ} which no regular expression can describe.
- E → E + T | T
- T → T × F | F
- F → ( E ) | id
- This grammar captures the precedence of × over + and handles parenthesisation, mirroring the grammar of most programming languages.
Image: Source
Pushdown Automata (PDA)
Context-free languages are recognised by pushdown automata — machines like NFAs but augmented with an unbounded stack. The stack allows the machine to remember an unlimited amount of information in a last-in-first-out fashion, making it powerful enough to match nested structures.
A PDA is a 7-tuple M = (Q, Σ, Γ, δ, q₀, Z₀, F) where Q is the set of states; Σ is the input alphabet; Γ is the stack alphabet; δ: Q × (Σ ∪ {ε}) × Γ → 𝒫(Q × Γ*) is the transition function; q₀ is the start state; Z₀ ∈ Γ is the initial stack symbol; and F ⊆ Q is the set of accepting states.
At each step a PDA reads an input symbol (or ε), pops the top stack symbol, moves to a new state, and pushes a string of stack symbols. The stack provides the "memory" needed to track unmatched symbols — for example, pushing an 'A' for each 'a' read, then popping one 'A' for each 'b', ensures equal counts.
Image: Source
The CFG–PDA Equivalence
A language L is context-free if and only if there exists a pushdown automaton that accepts L. Every CFG can be converted to an equivalent PDA, and every PDA can be converted to an equivalent CFG.
The forward direction (CFG → PDA) is constructive: build a single-state PDA that uses its stack to simulate leftmost derivations of the grammar. At each step, if the top of the stack is a variable A, non-deterministically replace it with the right-hand side of some A-production; if the top is a terminal, match it against the next input symbol.
Parsing Context-Free Languages
Parsing is the process of taking a string and determining whether it belongs to a CFL, and if so, recovering the parse tree that exhibits its grammatical structure. Parsing is a core operation in every compiler and interpreter.
CYK Algorithm
The Cocke-Younger-Kasami (CYK) algorithm is a classic dynamic programming parser that works for any CFG in Chomsky Normal Form (CNF), where every rule has the form A → BC or A → a. It fills a triangular table in O(n³) time, where n is the length of the input string.
Top-Down and Bottom-Up Parsing
- Top-down (LL parsers): Start from S and try to derive the input left-to-right. Recursive descent parsers are a practical implementation of this approach.
- Bottom-up (LR parsers): Shift input symbols onto a stack and reduce using grammar rules when a right-hand side is recognised. More powerful than LL and used in most production compilers (yacc, bison).
- Earley's algorithm: A general O(n³) parser that handles all CFGs, including ambiguous ones.
Context-free grammars and pushdown automata sit at a crucial level of the Chomsky hierarchy: they are powerful enough to describe the syntax of programming languages, yet weak enough to allow efficient parsing — a balance that underpins the design of every language compiler in existence.