Evolution of compilers and parsers
The name "compiler" seems to have been invented in 1952 by Grace Hopper. It is around this time that is implemented for the first time what can be considered a parser. However, Grace Hopper's compiler has nothing to do with what is now called a compiler: a tool that converts one language into another, usually a lower level one. Assemblers existed before compilers, but an assembler is only a text representation of the binary code, translation is done with a simple lookup table.
Meta-circular compiler (1951)
Corrado Böhm describes in a thesis a program that translates a computed language of assignments and expressions to machine code. Meta-circular means that the compiler is written in the same language it compiles.
Autocode (1952)
It is a set of macro instructions that are translated to machine language. There is no particular technique behind this tool.
IT (1956)
The Carnegie Institute of Technology advertises the IBM 650-based IT compiler. The language has arithmetic expressions but no operator precedence.
Chomsky's hierarchy and contextual grammar (1956)
In a paper published in 1956, Chomsky presents the structure of a real compiler:
- The lower layer prefigures later lexers. It is based on Markov chains.
- The middle layer uses a non-contextual grammar, as used in modern parsers.
- The upper layer performs the source language transformations in the target language.
At the same time, he invents non-contextual grammar context: It is composed of rules where x is defined by an group of items, and x is independent of the context in which it appears. Such a grammar is easily written in BNF, a format that will appear later.
Fortran (1957)
The first evolved language, created by John Backus, does not use the Chomsky structure, or a particular technique, but processes the programs line by line. To obtain precedence, it uses a trick, it surrounds parts of the expression around operators betwwen parentheses pairs whose number denotes their importance of the operator. Then the decortication of the expressions results in the precedence of the multiplication on the addition.
Most Fortran constructs only appeared in the following versions.
Lisp (1958)
The firt Lisp compiler is no more elaborate than the Fortran compiler, since the programmer must himself add parentheses to tell it how to transform the code.
BNF (1958)
John Backus and Peter Naur use an original notation that they will use to describe the grammar of the language Algol 58: The Backus-Naur Form, or BNF.
Example of BNF grammar:
<ifelse> ::= <if> [ { else <if> } ] <if> ::= if "(" <condition> ")" ( <instruction> | "{" { <insttuction> } "}" )
Precedence (1959)
A theoretical implementation of the precedence of operators is demonstrated by Samuelson and Bauer by the use of a stack.
Algol (1960)
The first structured language based on a grammar described in BNF. The description of an ALGOL parser is published by Ned Irons.
Algo 60 has blocks of instructions and supports recursion, concepts which where new then. It also includes dynamic arrays, so C and Pascal were regressions on this aspect.
Recursive descending parser (1961)
Peter Lucas and Ned Irons both describe a recursive descent parser. The latter by an initial rule that refers to other rules and interprets the language following the hierarchy of their composition. In addition, a rule can refer to itself and be recursive as in the BNF description above when an instruction, which is part of the "if" rule, is also an "if" rule.
Dijkstra' Shutting-Yard algorithm (1961)
From an arithmetic expression with parentheses or not (and thus by recognition of precedence), this algorithm produces an abstract syntactic tree. It uses a stack and produces an infix notation (called Polish) or an operator follows a pair of operands with those of higher priority arriving at the end of the chain.
LR (1965) and LALR (1971) parsers
Invented by Donald Knuth, a LR parser reads a non-contextual grammar from left to right.
One can have a Simple Left-Right (SLR) parser, invented in 1969 by Frank DeRemer, or LALR (Look-Ahead Left-Right) imagined by the same one in 1971. In the second case one attempts a correspondence between the grammar rule and the instruction of the source code and in case of failure, we return to the first element and try an alternative. An error message is issued when no alternative matches.
We write LR (k) where k is the number of "lookaheads" levels that the parser explores in the source code before deciding that a rule applies to an instruction.
In 1968, Donald Knuth improved Iron's (1968) synthetic attribute concept with inherited attributes. While synthetic attributes define a node from the lower nodes, we have an inherited attribute when a node uses the attributes of the parent node instead.
LL (1968) vs. LR parsers
Conceived by Lewis and Stearns, the LL principle applies to a grammar designed to be analyzed by a simplified parser and that can be written by hand.
An LL parser analyzes an instruction form left to right and from top to bottom, it constructs a representation of the instruction by replacing each node with the group of node that defines it.
An LR parser analyzes an instruction from left to right as well, but from bottom to top, ie it considers terminal nodes and combines them to find the higher level node.
CYK algorithm (1960+)
Designed to parse non-contextual grammars, using bottom-up and dynamic programming, it is effective in specific cases, but the others methods outperform it in the average case.
Earley's parser (1970)
Jay Earley invented an algorithm in 1968 able to analyze most non-contextual grammars, which he simplified in 1970. However, he was slow and other techniques were used instead for mainstream compilers. A faster version was found later.
GLR parser (1974)
A GLR (Generalized LR) parser, is an extension of LR for ambiguous grammars like that of C. An algorithm that implemented it was described by Bernard Lang, and improved since 1974.
The C, C ++, etc. languages are parsed by a version of this algorithm by most compilers.
Yacc (1975)
Based on LALR, this is the first automatic parser generator, made by Stephen Johnson at ATT for Unix.
The C compiler goes then from a simple recursive descent parser to LALR. But Yacc will only be available to the public in 1979.
Yacc was used by GCC, then for Awk, R, Ruby, etc ...
GCC and CLang
GCC has been using Yacc for some time, but in 2004, the C and Objective-C compilers replace Yacc with a manually written recursive descent parser. This provided only a 1.5% speedup. The C++ compiler followed.
CLang uses a similar parser for all languages.
ANTLR (1989)
Based on an LL(k) algorithm, ANTLR, created by Terence Parr, is arguably the most used parser generation tool. Version 4, which is claimed to be ALL(k), and which is a combination of LL(k) and GLR, works either with listeners or visitors, and allows practically any grammar to be analyzed. Writing a compiler becomes much simpler and in the listener version, also facilitates multiple target languages.
See also:
Author: Denis Sureau, creator of the Scriptol and Cryonie programming languages.