

Another approach to compiler design

Posted on 2024-Jul-03 at 17:54:00 by Phil

A lot has changed since people first started writing compilers for high-level languages (anything above assembly language). Still, compiler design appears to be bogged down in Ancient Times (the 1950s and 1960s). Back then, even the biggest computers were severely constrained in memory and processing speed, so language designers and compiler authors worried a great deal about minimizing both, concerns that are not as critical today. That’s not to say that these things can be completely ignored; it’s just that they can now be less of a constraint on the design of a language and of the compiler that processes it.

lex and yacc are the more-or-less original implementations of the traditional style of compiler writing. They have been largely superseded by improved versions, flex and bison, although these two retain essentially the same architecture. A more recent compiler-construction tool is ANTLR, which simplifies control-structure definition by using recursive descent, but still defines expression processing in the traditional lex/yacc manner (even though the definition files for the two are merged into one file).

The new idea is this: freed from the shackles of having to be tiny in size and fast above all else, how could designing a compiler for a given language be made much, much easier? Current systems get bogged down in the minutiae of lexical and syntactical definition, exhausting the author before they can even begin to think about novel language features, or about how to advance beyond the syntax stage to optimize and generate code. We may be able to explore new methodologies (and even language features) that were previously dismissed as taking too many resources (too big, or too slow).

Traditional compiler design theory covers three areas: lexical scanning (tokenization), syntactical parsing, and semantic analysis. Optimization and code generation (usually to assembly language), if not left as an exercise for the reader, form the back end. These are often implemented as completely separate passes or phases of a compiler, with little or no feedback to an earlier phase (e.g., from semantic analysis to direct and refine syntactical analysis). That is, each phase is usually completely stand-alone, with static blocks of data passed in-core or via files. There are languages, such as C++, in which several different constructs can be lexically identical, requiring sophisticated patching to arrive at different syntactical meanings. It is highly desirable to be able to modify processing in earlier phases based on what is determined in later phases, something the traditional design makes quite difficult.

Given a well-designed language, it should be possible to port it easily to another platform (hardware and/or operating system). At the front end, if everything involved with the first three phases is written in a high-level, platform-independent manner, moving those phases to a new platform should be trivial. Optimization may need minor tweaking (e.g., there is, or isn’t, a hardware increment or decrement instruction), and some intermediate-language constructs may be more efficient on one architecture than another, but much of it should be independent of the target platform. Output code generation, of course, will be quite different, unless the output is platform-independent, such as some sort of “byte code” or the like. A well-designed back end (optimizer and code generator) might be almost language-independent, leaving the front end (lexical, syntactical, and semantic analysis) language-specific.

LL, LR, SLR, LALR, etc. all break the process down into minute steps with complicated mathematics that make it difficult to design a grammar. You literally cannot see the meadow for the individual blades of grass. Let’s consider instead easy-to-use recursive descent for control structures, plus operator-precedence parsing of expressions. A legitimate and major concern is avoiding more than trivial backtracking, which can make a compiler unusably slow. Another, related, bugaboo is left recursion, which goes away if you avoid yacc-style grammar productions and use recursive descent and operator-precedence parsing. ANTLR does this in part, using recursive descent for control structures, although it still uses yacc-style expression parsing.
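As a rough illustration of that combination, here is a minimal sketch in Python (the token handling and the tuple-based parse nodes are hypothetical simplifications, not any particular compiler’s internals) of a recursive-descent statement parser that hands expressions off to a precedence-climbing, operator-precedence-style expression parser:

    # Minimal sketch: recursive descent for statements, precedence climbing
    # (operator-precedence style) for expressions.  Token handling and the
    # tuple-based parse nodes are hypothetical simplifications.

    BINARY_PRECEDENCE = {'+': 10, '-': 10, '*': 20, '/': 20}

    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def advance(self):
            tok = self.peek()
            self.pos += 1
            return tok

        def expect(self, tok):
            if self.advance() != tok:
                raise SyntaxError(f'expected {tok!r}')

        # --- recursive descent for control structures ---
        def parse_statement(self):
            if self.peek() == 'if':
                return self.parse_if()
            if self.peek() == 'while':
                return self.parse_while()
            # no keyword matched: assume an expression (including assignment)
            return ('expr_stmt', self.parse_expression(0))

        def parse_if(self):
            self.advance()                       # consume 'if'
            cond = self.parse_expression(0)
            self.expect('then')
            return ('if', cond, self.parse_statement())

        def parse_while(self):
            self.advance()                       # consume 'while'
            cond = self.parse_expression(0)
            self.expect('do')
            return ('while', cond, self.parse_statement())

        # --- operator-precedence (precedence climbing) for expressions ---
        def parse_expression(self, min_prec):
            left = self.parse_primary()
            while self.peek() in BINARY_PRECEDENCE and \
                    BINARY_PRECEDENCE[self.peek()] >= min_prec:
                op = self.advance()
                right = self.parse_expression(BINARY_PRECEDENCE[op] + 1)
                left = (op, left, right)         # left-associative
            return left

        def parse_primary(self):
            tok = self.advance()
            if tok == '(':
                inner = self.parse_expression(0)
                self.expect(')')
                return inner
            return tok                           # identifier or literal

    # Parser(['x', '+', '2', '*', 'y']).parse_statement()
    #   -> ('expr_stmt', ('+', 'x', ('*', '2', 'y')))

Because expression nesting is handled by the precedence loop rather than by left-recursive grammar productions, neither left recursion nor backtracking ever comes up.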

Consider phase 1 to be determining structure, with syntactic and semantic feedback to lexical analysis in certain cases (supporting not just context-free grammars, but any reasonable grammar); any limited backtracking needed at this point is low-cost. If phase 1 passes without fatal error, go through a second time, now knowing the structure, and actually generate intermediate code; no backtracking at all should be necessary in phase 2. From there, do the standard optimization and code generation.
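As a toy sketch of the two-pass idea (hypothetical and deliberately trivial: phase 1 only checks and records block structure, phase 2 emits flat pseudo-intermediate code), consider:

    # Phase 1 determines structure; phase 2 re-walks the tokens with that
    # structure in hand and emits IR with no backtracking at all.

    def phase1_structure(tokens):
        """Return (start, end) index pairs for each block, or raise on bad nesting."""
        stack, blocks = [], []
        for i, tok in enumerate(tokens):
            if tok == 'begin':
                stack.append(i)
            elif tok == 'end':
                if not stack:
                    raise SyntaxError(f"unmatched 'end' at token {i}")
                blocks.append((stack.pop(), i))
        if stack:
            raise SyntaxError(f"unclosed 'begin' at token {stack[-1]}")
        return blocks

    def phase2_generate(tokens, blocks):
        """Emit trivial intermediate code; every token is visited exactly once."""
        starts = {s for s, _ in blocks}
        ends = {e for _, e in blocks}
        ir = []
        for i, tok in enumerate(tokens):
            if i in starts:
                ir.append('ENTER_BLOCK')
            elif i in ends:
                ir.append('LEAVE_BLOCK')
            else:
                ir.append(f'EVAL {tok}')
        return ir

    tokens = ['begin', 'x', 'begin', 'y', 'end', 'end']
    print(phase2_generate(tokens, phase1_structure(tokens)))
    # ['ENTER_BLOCK', 'EVAL x', 'ENTER_BLOCK', 'EVAL y', 'LEAVE_BLOCK', 'LEAVE_BLOCK']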


Posted on 2024-Oct-07 at 15:19:00 by Phil

Let’s say we’re at a “synchronization point”, where we could have either a keyword or an expression. If using recursive descent for keywords, would it be possible to first try the list of all keywords possible at this point in the program (assuming reserved keywords), and if none matches, assume it’s an expression (including assignments)? If no keyword matches, or it has to be an expression at this point (e.g., an opening index operator has been seen), call the expression parser. Knowing what kind of statement(s) we’re working within, deeper statements (their keywords) and even expressions can be understood better, without having to backtrack at all.
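A rough Python sketch of that dispatch (the keyword table and parser stubs are hypothetical placeholders for real recursive-descent routines):

    # Hypothetical dispatch at a synchronization point: try the keywords that
    # are legal here first; if none matches, hand the statement to the
    # expression parser (which also handles assignments).

    KEYWORD_PARSERS = {
        'if':     lambda toks: ('if_stmt', toks),      # stand-ins for real
        'while':  lambda toks: ('while_stmt', toks),   # recursive-descent routines
        'return': lambda toks: ('return_stmt', toks),
    }

    def parse_expression(tokens):
        # placeholder for the operator-precedence expression parser
        return ('expr_stmt', tokens)

    def parse_statement(tokens, keywords_reserved=True):
        head = tokens[0]
        if keywords_reserved and head in KEYWORD_PARSERS:
            # Reserved keywords: no fallback is ever needed.
            return KEYWORD_PARSERS[head](tokens)
        # No keyword matched (or keywords are not reserved): it must be an
        # expression, possibly an assignment.
        return parse_expression(tokens)

    print(parse_statement(['while', 'i', '<', '10']))   # ('while_stmt', ...)
    print(parse_statement(['x', '=', 'x', '+', '1']))   # ('expr_stmt', ...)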

While this should work well for “modern” languages, old, poorly-thought-out languages such as FORTRAN could still be difficult to parse with a context-free grammar, even with some feedback from deeper processing. To be fair, this language was developed before Computer Science was really a thing, and before tools such as lex and yacc had been created. Whitespace is ignored (except in strings), so DO 10 I=1,5, D O1 0I = 1 , 5, and DO10I=1,5 are all the same thing. How to deal with that? One trick would be to automatically add \s* between each character of the keyword’s regular-expression pattern (DO in this case), permitting embedded spaces. However, the problem remains that FORTRAN does not reserve keywords, and in any case a variable name could begin with “DO” (or even be exactly that). One possibility would be to treat each statement as an expression, and if that fails to parse, try the keyword. Or, try the keyword first, and if it fails to parse correctly as a statement, try it as an expression before giving up and declaring an error.
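A small sketch of the \s* trick, using a hypothetical and much-simplified pattern for a DO-loop header (the label and variable parts are made blank-tolerant the same way):

    import re

    # Build a keyword pattern that tolerates embedded blanks by inserting \s*
    # between the characters of the keyword.
    def blank_tolerant(keyword):
        return r'\s*'.join(re.escape(ch) for ch in keyword)

    # Hypothetical, simplified pattern for a DO-loop header: DO label var = ...
    do_header = re.compile(
        r'^\s*' + blank_tolerant('DO')
        + r'\s*(\d(?:\s*\d)*)'             # statement label, blanks allowed
        + r'\s*([A-Z](?:\s*[A-Z0-9])*)'    # loop variable, blanks allowed
        + r'\s*=',
        re.IGNORECASE)

    for stmt in ('DO 10 I=1,5', 'D O1 0I = 1 , 5', 'DO10I=1,5'):
        print(stmt, '->', bool(do_header.match(stmt)))    # all three match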

Let’s say you want to send a space probe to, oh, Venus. The rocket’s computer is running FORTRAN, and someone made a mistake typing in a DO-loop. Instead of DO 10 I=1,5 (a loop construct), it was typed in as DO 10 I=1.5! Supposedly this really happened back around 1962. When the rocket lifted off from Cape Canaveral, it suddenly changed direction and headed for downtown Orlando, and had to be blown up right over the assembled spectators and dignitaries. You see, the erroneous code was interpreted as the assignment DO10I = 1.5, using the default declaration as a REAL variable (the name starts with ‘D’), and the constant 1.5 happened to match that type. If your FORTRAN compiler offers something like IMPLICIT NONE, get into the habit of using it, and of explicitly declaring all variables. You’ll find a lot of typos in your variable names!

So does this mean that irregular-grammar languages such as FORTRAN cannot use modern tools with a context-free grammar? Not necessarily. I suspect the answer is: if a keyword matches, treat it as a potential keyword, understanding that it may in fact be a complete or partial variable name. This would be some sort of flag in the compiler definition (in addition to auto-inserting optional spaces between keyword characters). If the statement then fails to parse properly (i.e., it may not have been a keyword after all), give an expression (e.g., an assignment) a try. Hopefully, one or the other will parse cleanly.
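A tiny sketch of that fallback, with deliberately simplified, hypothetical patterns (blanks are stripped first, since FORTRAN ignores them outside strings, which this sketch does not handle):

    import re

    # Try to parse as a DO statement first; if that fails, fall back to an
    # assignment.  Both patterns are hypothetical simplifications.
    DO_STMT    = re.compile(r'^DO(\d+)([A-Z][A-Z0-9]*)=(.+),(.+)$')
    ASSIGNMENT = re.compile(r'^([A-Z][A-Z0-9]*)=(.+)$')

    def classify(stmt):
        text = stmt.replace(' ', '').upper()       # blanks are not significant
        m = DO_STMT.match(text)
        if m:                                      # parsed cleanly as a DO loop
            return ('do_loop', m.groups())
        m = ASSIGNMENT.match(text)                 # maybe DO was part of a name
        if m:
            return ('assignment', m.groups())
        raise SyntaxError(f'cannot parse: {stmt!r}')

    print(classify('DO 10 I = 1, 5'))   # ('do_loop', ('10', 'I', '1', '5'))
    print(classify('DO 10 I = 1.5'))    # ('assignment', ('DO10I', '1.5'))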


Posted on 2024-Oct-07 at 15:51:00 by Phil

Assuming that irregular grammars can be dealt with, how do we handle odd formatting requirements, as well as macro expansion, in a language such as FORTRAN? If we can get past this, the new compiler generator should be able to handle any modern language. Note that some languages have macro capability but, unlike C and C++ (at least), no explicit preprocessor pass to expand macros and the like. FORTRAN, at the least, also has column-usage requirements that must be obeyed: a ‘C’ or ‘*’ in column 1 (at a minimum) denotes a comment line, columns 1-5 are for numeric statement labels, column 6 is reserved for a non-blank character (any of many) denoting a continuation of the previous statement, and columns 73 and up are reserved for card sequence numbers. There are “free-form” FORTRAN compilers around, but that seems fairly rare in the wild.

Should a general purpose compiler generator include a separate preprocessor pass, regardless of whether the language definition calls for one? It could certainly make life easier for both the compiler writer and the generator writer, but could it always be done without altering the definition of the language? For something like FORTRAN, a preprocessor pass could get rid of comment lines and sequence fields, glue together continuation lines, mark statement labels as such, and otherwise turn the format into something more conventional. It probably should not attempt to pre-tokenize lines by adding or removing spaces — leave that to the lexical scan’s pattern matching.
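A sketch of such a preprocessor pass for fixed-form FORTRAN might look like the following (hypothetical and simplified; it keeps the original line numbers with each statement so later phases can still report locations accurately):

    # Sketch of a fixed-form FORTRAN preprocessor pass: drop comment lines and
    # sequence fields, glue continuation lines, and mark statement labels.
    def preprocess(lines):
        """Return a list of (original_line_numbers, label, statement_text) tuples."""
        statements = []
        for lineno, raw in enumerate(lines, start=1):
            if not raw.strip():
                continue
            if raw[0] in ('C', 'c', '*'):          # column 1: comment line
                continue
            body = raw[:72]                        # drop card sequence field (73+)
            label = body[0:5].strip()              # columns 1-5: statement label
            cont = body[5:6].strip()               # column 6: continuation marker
            text = body[6:]                        # columns 7-72: the statement
            if cont and statements:                # glue continuation onto previous
                prev_lines, prev_label, prev_text = statements[-1]
                statements[-1] = (prev_lines + [lineno], prev_label, prev_text + text)
            else:
                statements.append(([lineno], label, text))
        return statements

    source = [
        'C     COMPUTE THE SUM OF ARRAY A',
        '      S = 0.0',
        '      DO 10 I = 1,',
        '     1        5',
        '   10 S = S + A(I)',
    ]
    for entry in preprocess(source):
        print(entry)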

If one is preprocessing a language file to produce something more palatable to a standard-model lexical and syntactical parser, don’t forget that the original formatting needs to be preserved (to some extent) for program listings and error-location reporting. It doesn’t do much good to present a radically altered program to the user and mark it up with error locations; they will spend too much time trying to correlate the revised listing with what they typed in, and become frustrated. Expanding macros should be OK, provided that the original line(s) are kept too, and the expansions are clearly marked.
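A minimal sketch of that bookkeeping, assuming a hypothetical simple text-substitution macro table, where every expanded line carries its original text and line number along with it so errors can be reported against what the user actually typed:

    # Keep (original line number, original text, expanded text) together so
    # diagnostics can point at the user's own source.  The macro mechanism
    # here is a hypothetical plain text substitution.
    MACROS = {'MAXLEN': '256'}

    def expand(lines):
        table = []
        for lineno, text in enumerate(lines, start=1):
            out = text
            for name, body in MACROS.items():
                out = out.replace(name, body)
            table.append((lineno, text, out))
        return table

    def report(table, index, message):
        lineno, original, _ = table[index]
        # Report against the original text, not the expanded form.
        print(f'line {lineno}: {message}')
        print(f'    {original}')

    prog = ['      INTEGER BUF(MAXLEN)', '      BUF(MAXLEN+1) = 0']
    table = expand(prog)
    report(table, 1, 'error detected here (reported against the original text)')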

 


 
