General guidelines and resources

Posted on 2017-Mar-27 at 12:53:54 by Phil
Last update on 2019-Jul-28 at 17:15:09 by Phil

Not many of us will have call to write an entire language’s compiler from scratch, but most of us have been called upon to process configuration files, etc., written in a somewhat “high level” language. Such languages should be clean and understandable to humans, so something as cluttered and rigidly machine-oriented as XML is undesirable. This board is oriented towards discussion of methodologies for constructing such translators, interpreters, and compilers. Ad hoc freehand design of such a system is often not a good thing — you really should have some method behind your madness, or your code is likely to be very brittle, difficult to extend or reuse, and generally buggy.

The intent is to discuss a range of language translators, whether you’re building a utility to translate old COBOL code to C# (good luck!), a reader (and possibly writer) for configuration files, an interpreter for some purpose, or an out-and-out compiler. It can be a standalone program or a module to drop into other code (in which case the implementation language is already chosen for you). It can be a full translator to low-level machine language (assembly code), a Just-In-Time bytecode compiler, or something else. Whatever you need is fair game, whether it is for your work or just for expanding your skill set.

Some resources to get us started (please suggest more!):


  • Aho & Ullman, Principles of Compiler Design (1977), and its successor, Aho, Sethi, & Ullman, Compilers: Principles, Techniques, and Tools (1986). These are the famed “Dragon” books referred to by all authors.
  • Lewis, Rosenkrantz, & Stearns Compiler Design Theory.
  • Holub Compiler Design in C. This is not well structured, more of an ad hoc attack on the problem, but useful for bits and pieces showing how to actually do things in building a C compiler. Be sure to attend to the rather lengthy online errata list at some point. There is an online copy (including errata list).
  • Pyster Compiler Design and Construction. This covers the implementation of a Pascal-like language in itself, using a very informal approach. Fun fact: the output is IBM 360 Assembly Language.
  • Niklaus Wirth Compiler Construction. A brief (131 page) introductory CS course featuring a subset of the Oberon language.
  • Pratt Programming Languages: Design and Implementation. While not going deeply into compilation, it is a good survey and overview of different approaches to languages.
  • Waite & Goos Compiler Construction. This book is lighter on parsing theory, but goes more heavily into the details of error reporting and fixup, optimization, and code generation than do most other books.


  • lex and yacc: these are the Ur translation system, originally from Bell Labs for the Unix operating system, and oriented towards C-like languages. lex is the lexical analysis portion, which tokenizes the input stream and feeds tokens to the syntactical analyzer generated by yacc (Yet Another Compiler Compiler). Despite the flippant name, yacc has been a workhorse in the field (even if it wasn’t necessarily the first).
  • flex and bison: the free (open source) counterparts of lex and yacc, somewhat more modern and updated in design. Supposedly “flex” is “fast lex”, and “bison” is just wordplay on “yacc” (which sounds like “yak”, the Asian domestic animal). They have been widely used in all sorts of free software compiler projects.
  • ANTLR: Terence Parr’s ANother Tool for Language Recognition, using input much like lex/yacc or flex/bison, but combined in one input file and allowing some context-directed parsing (and thus more flexibility in the language design). It also permits input specifications (e.g., a comma-separated list of expressions) to be written in a more natural manner, closer to the concept of the “railroad track” syntax diagrams (a.k.a. syntax charts or bead diagrams) some of you may have worked with. It has many associated tools, and is Java-based, although it can generate parsers in many other target languages.
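To make the division of labor between a lexical analyzer and a syntactical analyzer concrete, here is a minimal hand-written sketch. It is not output from lex, yacc, or ANTLR. The token names, the toy grammar (a comma-separated list of sums), and the function names are all assumptions made up for illustration. Python is used only for brevity; the same structure carries over to any implementation language.

```python
import re

# Token definitions in the spirit of a lex specification: (name, regex) pairs.
# These names and patterns are hypothetical, chosen for this example only.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("COMMA", r","),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Lexical analysis: turn a character stream into (kind, value) tokens."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":  # whitespace is recognized, then dropped
            yield match.lastgroup, match.group()
    yield "EOF", ""


class Parser:
    """Syntax analysis for the toy grammar:
        list : expr (',' expr)* ;
        expr : NUMBER ('+' NUMBER)* ;
    The repetition (...)* is written directly as a loop, much as a
    railroad-track diagram would draw it.
    """

    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.current = next(self.tokens)

    def expect(self, kind):
        """Consume the current token if it has the given kind, else fail."""
        if self.current[0] != kind:
            raise SyntaxError(f"expected {kind}, found {self.current[0]}")
        value = self.current[1]
        self.current = next(self.tokens, ("EOF", ""))
        return value

    def parse_list(self):
        values = [self.parse_expr()]
        while self.current[0] == "COMMA":
            self.expect("COMMA")
            values.append(self.parse_expr())
        self.expect("EOF")
        return values

    def parse_expr(self):
        total = int(self.expect("NUMBER"))
        while self.current[0] == "PLUS":
            self.expect("PLUS")
            total += int(self.expect("NUMBER"))
        return total


def evaluate_list(text):
    """Tokenize and parse a comma-separated list of sums, returning the values."""
    return Parser(tokenize(text)).parse_list()
```

Calling `evaluate_list("1 + 2, 3, 4+5+6")` returns `[3, 3, 15]`. The parser never looks at raw characters, only at tokens, which is the same contract lex and yacc have with each other.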

Please try to keep any algorithms you give in a more or less language-independent style, unless the subject matter is for a specific implementation language. You may love writing in Python, but you want your work to be accessible to someone who doesn’t know Python.

Finally, discussion is open for implementing not only “traditional” languages that you find compilers and interpreters for, like C or FORTRAN (with expressions inside control structures), but also “inside-out” languages such as HTML, where keywords and other control structures are often embedded inside running text content. There are still rules that apply (especially for tag nesting), but the structure is often more one of a lot of disconnected little pieces. See the JavaScript Document Object Model.
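As a rough illustration of the tag nesting rule, the sketch below keeps a stack of open tags and reports any closing tag that does not match the most recently opened one. It uses `HTMLParser` from Python’s standard library. The class name `NestingChecker`, the function `check_nesting`, and the short list of void elements are assumptions for this example. Real HTML also allows some closing tags to be omitted, so this check is stricter than a browser would be.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration only).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            innermost = self.open_tags[-1] if self.open_tags else None
            if innermost is None:
                self.errors.append(f"</{tag}> found with no tag open")
            else:
                self.errors.append(f"</{tag}> found while <{innermost}> is still open")


def check_nesting(markup):
    """Return a list of nesting problems found in the markup (empty if none)."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    unclosed = [f"<{tag}> never closed" for tag in checker.open_tags]
    return checker.errors + unclosed
```

For example, `check_nesting("<p>Some <b>bold</b> text<br></p>")` returns an empty list, while `check_nesting("<p><b>text</p></b>")` reports two problems: the misplaced `</p>` and the `<p>` that is then left open.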

Posted on 2022-Dec-14 at 13:09:00 by Phil

Another interesting attempt at a compiler toolkit is LLVM. I haven’t had a deep dive into it yet, but it appears to be worth keeping in mind.

Posted on 2023-Jan-20 at 10:18:00 by Phil

There is an interesting article on Physics.org about natural language processing re-inventing a re-invention of computer science. It concerns an ancient Sanskrit technique for word formation that was just rediscovered by a scholar. I commented that it sure sounds like a compiler shift-reduce conflict, with a similar rule (take the shift) to resolve it.
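For anyone who hasn’t met a shift-reduce conflict, the usual textbook case is the “dangling else”: after parsing `if a then if b then x`, a parser that sees `else` could either reduce (finish the inner `if`) or shift (attach the `else` to the inner `if`). The sketch below is not an LR parser and says nothing about the Sanskrit rule itself. It is a small recursive-descent parser for a hypothetical toy grammar that makes the same choice as “take the shift”, so the `else` binds to the nearest `if`.

```python
KEYWORDS = {"if", "then", "else"}


def parse_statement(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos).

    Toy grammar (hypothetical, for illustration):
        stmt : 'if' NAME 'then' stmt [ 'else' stmt ]
             | NAME
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")

    if tokens[pos] == "if":
        if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
            raise SyntaxError("expected: if NAME then ...")
        condition = tokens[pos + 1]
        then_branch, pos = parse_statement(tokens, pos + 3)
        else_branch = None
        # The conflict point: the inner statement is complete, and the next
        # token is 'else'. Consuming it here is the "take the shift" choice,
        # so the 'else' attaches to the innermost unfinished 'if'.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_statement(tokens, pos + 1)
        return ("if", condition, then_branch, else_branch), pos

    if tokens[pos] in KEYWORDS:
        raise SyntaxError(f"unexpected keyword {tokens[pos]!r}")
    return tokens[pos], pos + 1
```

Parsing `"if a then if b then x else y".split()` gives `("if", "a", ("if", "b", "x", "y"), None)`: the `else y` belongs to the inner `if b`, not the outer `if a`.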

Posted on 2024-Feb-08 at 12:35:00 by Phil

Someone on Stack Overflow posted a question about why programming languages are so English-centric (more-or-less English for all keywords and commands). The answers point to a Wikipedia entry that discusses this and lists a few newer languages designed with non-English keywords, as well as many implementations of languages with English keywords that have been translated into other languages.

It seems that the TL;DR is that computer programming primarily started in the U.S. (and other English-speaking countries), establishing a convention of using English keywords. Some compilers have been hard-translated to other (non-English) languages, while others allow run-time selection of which language (of many) to use.
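One plausible way to get run-time selection of the keyword language is to keep the keyword spellings in a table that the lexical analyzer consults, so the parser only ever sees canonical token kinds. The sketch below assumes exactly that. The table contents, the French spellings, and the function name `classify_words` are made up for illustration and do not describe any particular compiler.

```python
# Hypothetical keyword tables: each maps a surface spelling to a canonical
# token kind. Adding a language means adding a table, not changing the parser.
KEYWORD_TABLES = {
    "en": {"if": "IF", "then": "THEN", "else": "ELSE", "while": "WHILE"},
    "fr": {"si": "IF", "alors": "THEN", "sinon": "ELSE", "tantque": "WHILE"},
}


def classify_words(source, language="en"):
    """Split source on whitespace and tag each word as a keyword kind or NAME."""
    keywords = KEYWORD_TABLES[language]
    tokens = []
    for word in source.split():
        kind = keywords.get(word.lower(), "NAME")
        tokens.append((kind, word))
    return tokens
```

With these tables, `classify_words("si x alors y sinon z", "fr")` and `classify_words("if x then y else z", "en")` produce the same sequence of token kinds (`IF NAME THEN NAME ELSE NAME`), so everything downstream of the lexer is unaffected by the choice of keyword language.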

Is it worth it, or is it just an exercise in nationalism and narcissism? Understandably, some people will not like American cultural, business, and technical hegemony; but the downside is that your software will not be usable by much of the world (at least, if they need to look inside the code). English is the lingua franca of technology, including software development; it might have been others (e.g., French or German), but English was pretty much first (and thus prevailed and set a precedent), for better or worse. If I have to learn Slovenian in order to use your great new language — guess what? I won’t be using it!

Of course, the user interface should be (as much as possible) usable in the local language. A French-speaking end-user using software in France has the right to expect that the UI will be in French; what we’re talking about here is the internal code written in some computer language, which likely uses English keywords.


All content © copyright 2005 – 2024 by Catskill Technology Services, LLC.
All rights reserved.
Note that Third Party software (whether Open Source or proprietary) on this site remains under the copyright and license of its owners. Catskill Technology Services, LLC does not claim copyright over such software.


This page is https://www.catskilltech.com/utils/show.php?link=general-guidelines-and-resources

Last updated Wed, 05 Jun 2024 at 8:50 PM
