Posted on 2017-Mar-27 at 13:33:24 (last update on 2017-Apr-12 at 08:20:46) by Phil
One thing that’s always bugged me about language (and thus, compiler) design is that the books seem to spend 80–90% of their time on detailed mathematical analysis of expression parsing. They leave everything else as exercises for the reader: most of the lexical work, input file handling (especially the ins and outs of embedding include files), error detection/reporting and fixup (to continue compilation, not to make a runnable program), source listings (at what point in file inclusion and macro expansion?), macros and other preprocessing, optimization, and code generation. These are practical matters that need to be dealt with in making a useful application. Some guidance, suggestions, and examples for how to do these things would be appreciated.
Yes, there has to be a certain amount of rigor to the basic operations of lexical and syntactical analysis, or the whole thing falls apart, but authors tend to get so deeply into the mathematics of it (they must love that stuff) that the subject matter gets buried under discussion of alphabets, expressions, productions, set theory, and whatnot. Do you even need to rigidly separate lexical analysis (tokenizing) from syntactical analysis (parsing)? In some cases, feedback from parsing would be useful for determining what a token is and where the input should be split up. Can syntactical analysis of an expression be simply handled with operator precedence and associativity and a couple of stacks? Can control structures (basically, the stuff that wraps around expressions) be simply handled with recursive descent? Are there real world (useful) languages that couldn’t be handled this way (perhaps Lisp-like ones)?
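For what it's worth, the two-stack idea is a real, well-known technique: Dijkstra's shunting-yard algorithm handles precedence and associativity with exactly one operator stack and one value stack. A minimal sketch in Python (the token set and precedence table here are my own illustrative choices, not taken from any particular book):

```python
# Expression evaluation with operator precedence, associativity, and
# two stacks (the shunting-yard algorithm, evaluating as it goes).
import re

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
RIGHT_ASSOC = set()  # e.g. {"**"} if exponentiation were added

def evaluate(expr):
    """Evaluate an infix arithmetic expression using two stacks."""
    ops, vals = [], []

    def apply_op():
        op = ops.pop()
        b, a = vals.pop(), vals.pop()
        vals.append({"+": a + b, "-": a - b,
                     "*": a * b, "/": a / b}[op])

    for tok in re.findall(r"\d+|[-+*/()]", expr):
        if tok.isdigit():
            vals.append(int(tok))
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                apply_op()
            ops.pop()  # discard the "("
        else:  # binary operator: reduce while the stack top binds at least as tightly
            while (ops and ops[-1] != "(" and
                   (PREC[ops[-1]] > PREC[tok] or
                    (PREC[ops[-1]] == PREC[tok] and tok not in RIGHT_ASSOC))):
                apply_op()
            ops.append(tok)
    while ops:
        apply_op()
    return vals[0]
```

For example, `evaluate("2+3*4")` yields 14 and `evaluate("10-4-3")` yields 3, showing that precedence and left-associativity both fall out of the stack discipline.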
Various languages have their own peculiarities. For example, FORTRAN needs some preprocessing to squeeze out whitespace (except within literal constants) while gluing together continuation lines and eliminating comment lines. When doing this, how do you best keep track of the original lines and positions of keywords so that error reporting makes sense? Would it be better not to try to tokenize the input at all (or even to define keywords with \s* between each letter), but to rely on constant feedback from the syntax analysis to determine what's coming up next? C has an explicit and fairly well-documented preprocessor pass, but for other languages, preprocessing (macro expansions/replacements, file inclusion, etc.) is often vaguely described, if at all. PHP was never formally described at the outset, and thus its syntax has accumulated a lot of irregularities. At least FORTRAN and COBOL have the excuse that they were set in stone long before modern language design theory was codified!
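As one hypothetical answer to the bookkeeping question, here is a sketch that glues fixed-form FORTRAN continuation lines together and squeezes out blanks while recording, for every surviving character, its original (line, column) so error messages can point back at the real input. The column-1 comment and column-6 continuation conventions are real fixed-form rules; everything else is simplified for illustration:

```python
# Fixed-form preprocessing with position tracking: comment lines are
# dropped, continuation lines (nonblank, non-zero column 6) are glued
# onto the previous statement, blanks are squeezed out, and a parallel
# array remembers each character's original 1-based (line, column).

def preprocess(lines):
    text, positions = [], []              # parallel arrays: char, (line, col)
    for lineno, line in enumerate(lines, start=1):
        if line[:1] in ("C", "c", "*"):   # column-1 comment line: drop it
            continue
        body = line[6:72]                 # statement field, columns 7-72
        cont = len(line) > 5 and line[5] not in (" ", "0")
        if not cont and text:
            text.append("\n")             # statement boundary
            positions.append((lineno, 0))
        for col, ch in enumerate(body, start=7):
            if not ch.isspace():          # squeeze out blanks
                text.append(ch)
                positions.append((lineno, col))
    return "".join(text), positions
```

A later pass that finds an error at character i of the squeezed text can report `positions[i]` as the location in the user's actual source file.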
At the other extreme, some books are so oriented towards “hands on” practical results that they go over background theory very lightly and plunge right into the needed code. This approach can leave the reader, while appreciative that they aren’t deeply buried in math, wondering if they have a good grasp of the overall picture. I have yet to meet a book that leaves me fully satisfied with the author’s approach.
Posted on 2018-Apr-13 at 17:25:47 (last update on 2019-Nov-13 at 11:49:38) by Phil
This is an interesting discussion from a while back that I came across.
I can sympathize with the thread-starter’s request for a “cookbook” of compiler writing without all the heavy theory and math. The discussion didn’t go on for very long, but did include a lot of literature reviews, and discussion of whether you can write anything useful with a cookbook approach (i.e., not being reasonably well-versed in compiler and language theory).
It got me to thinking: if you take all the chapters devoted to State Machines, Finite Automata, and such in the “Dragon” books, etc., and stand way back and squint, aren’t they just using regular expression parsing to identify keywords and other tokens? It seems those chapters could be replaced by “Read up on RE, and then come back here and we’ll talk about using RE to tokenize input.”
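To make the "regexes are enough" point concrete, a whole tokenizer can be one combined regular expression with named alternatives, applied left to right. A sketch (the token classes and keyword set below are invented for a toy language):

```python
# Regex-driven tokenizer: one alternation of named groups; the name of
# the group that matched tells us the token's class.
import re

TOKEN_RE = re.compile(r"""
    (?P<NUMBER> \d+)
  | (?P<IDENT>  [A-Za-z_]\w*)
  | (?P<OP>     [-+*/=<>()])
  | (?P<SKIP>   \s+)
  | (?P<ERROR>  .)
""", re.VERBOSE)

KEYWORDS = {"if", "then", "else", "while"}   # hypothetical language

def tokenize(source):
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                          # discard whitespace
        if kind == "ERROR":
            raise SyntaxError(f"bad character {text!r} at offset {m.start()}")
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"                  # reclassify reserved words
        yield kind, text
```

Reclassifying keywords after matching identifiers (rather than listing every keyword in the pattern) is the usual trick, since the identifier rule would otherwise swallow prefixes like `iffy`.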
Posted on 2022-Apr-14 at 14:04:00 by Phil
This approach appears to have been taken in the ANTLR system, at least in version 3. It allows some feedback from the syntax analysis stage to direct tokenizing (lexical analysis). This is enabled by not separating lexical and syntactical analysis into separate, cascaded passes, as most traditional compiler designs do.
Note that by “cascaded passes”, I’m not necessarily talking about completely distinct passes of the compiler (although it can be done that way) — the usual method is for the syntax parser to ask the lexical analyzer for the next token, without feeding back any information about the state of the parse.
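To make the feedback idea concrete, here is a toy sketch in which the parser tells the lexer whether it currently expects an operand, and the lexer splits the input differently as a result. The divide-versus-regex-literal ambiguity is my own invented example (though a similar one really does exist in JavaScript):

```python
# A lexer that takes parse-state feedback: "/" is a divide operator when
# the parser expects an operator, but starts a /regex/ literal when the
# parser expects an operand.

class Lexer:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def next_token(self, expect_operand):
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1                      # skip whitespace
        if self.pos >= len(self.text):
            return None                        # end of input
        ch = self.text[self.pos]
        if ch == "/" and expect_operand:       # parser feedback decides
            end = self.text.index("/", self.pos + 1)   # naive: no escapes
            tok = ("REGEX", self.text[self.pos:end + 1])
            self.pos = end + 1
            return tok
        if ch in "+-*/":
            self.pos += 1
            return ("OP", ch)
        start = self.pos                       # identifier/number atom
        while self.pos < len(self.text) and self.text[self.pos].isalnum():
            self.pos += 1
        return ("ATOM", self.text[start:self.pos])
```

So for the input `a / /b+/`, the parser asks for an operand (getting `a`), then an operator (getting `/` as divide), then an operand again — and that same `/` character now begins the regex literal `/b+/`. A blind cascaded lexer would have to guess.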
ANTLR supposedly handles syntactical analysis (at least as of version 3) with straightforward recursive descent, with safeguards to prevent excessive backtracking and infinite recursion. I haven’t looked at it in great detail, but this stands in contrast to the table-driven, bottom-up approach of traditional parser generators.
All content © copyright 2005 – 2022
by Catskill Technology Services, LLC.
All rights reserved.
Note that Third Party software (whether Open Source or proprietary) on this site remains under the copyright and license of its owners. Catskill Technology Services, LLC does not claim copyright over such software.
This page is https://www.catskilltech.com/philosophy-and-approach-of-compiler-design-books.html
Last updated Mon, 18 Apr 2022 at 10:59 PM