🐌
Yours truly is not an expert in compilers but does enjoy diving into this topic and learning something new from time to time.

Are you looking to implement your own compiler?

An awesome idea, do it! You’ll have lots of fun along the way and learn a great deal.

Compiling theory and practice is a vast landscape. It can feel overwhelming to navigate and easy to get lost in. I’m not a compiler expert by any stretch of the imagination — well, exactly like that disclaimer snail claims. However, I’d still like to share a few things that either worked well for me or could’ve, had I known about them going in. Hopefully, these can be helpful to other people who are enthusiastic about writing a compiler but might be struggling to figure out where to start.

Theory

There’s is no way around studying a bit of theory before getting down to coding. The key, in my opinion, is to be careful to not get stuck on it, pick up just enough and get your hands dirty as soon as possible. Keep the references close by and read up on relevant subjects as you go.

The learning resources you’re going to use are a big deal — with the right ones you’ll be ready to apply the newly gained knowledge to practice in a reasonably short time. Keeping that in mind, don’t start with classics textbooks like the Dragon Book, or “Modern Compiler Implementation”. These books are great, and you can always come back to them later, when you’re better prepared to take the most out of them. It’s way more productive to first read ones geared toward beginners.

Here’re two books that I’d opt for.

  • Crafting Interpreters. Yes, you read that right, interpreters. Bear with me for a second. Compilers and interpreters have a lot in common and the book is exceptionally beginner-friendly. Basically, this is the perfect one to get started with.

  • Introduction to Compilers and Language Design again is really accessible, doesn’t assume any preexisting compilers knowledge, and teaches all the basics necessary to build one using a hands-down approach.

Both are freely available online. (If you can, I encourage you to consider buying the paid versions though, to support the authors).

The section Learning resources discusses these books in a little more detail and lists a couple more equally great sources.

Don’t try to come up with your own language just yet

Go with an existing educational language instead, and focus on learning about compilers.

Unless, of course, inventing a language is what you’re really after. In that case, you’ll still want to get a handle on lexing and parsing (we’ll get to these two shortly) while producing C code as the compilation result. Doing so helps focus on the syntax and semantics of your language instead of the mechanics of encoding high-level language concepts in assembly. Nim is an example of a well-known language following this approach. An alternative to emitting C is targeting LLVM IR. The post isn’t going to go into any more details regarding these approaches though.

I’m aware of two programming languages that, in my opinion, are best fit for a hobby compiler project. These are specifically designed for classroom use in compilers courses and by extension are great for pet compiler projects.

ChocoPy

ChocoPy is a concise Python subset, which can be implemented in a time-frame reasonable for a hobby compiler. Although concise, the language is quite expressive — the features include lists, classes, dynamic dispatch, nested functions. These represent exciting compiler design and implementation challenges. ChocoPy is statically typed, which again leads to interesting implications for design and implementation. The language being a Python subset means programs in it can be executed directly in a Python (3.6+) interpreter or an online playground. The code syntax highlighting, editing, and formatting is supported out of the box by any Python tool.

Cool 2020

Cool 2020 is a small Scala subset including classes, inheritance, virtual dispatch, and even a very simple form of pattern matching. All of the advantages listed for ChocoPy apply to Cool 2020 too. The language is narrow in scope so that implementing a compiler takes a limited, realistic amount of time. While it’s still expressive and presents interesting design and implementation challenges. Tools for Scala code syntax highlighting, formatting, and editing work out of the box. A Cool 2020 program can be (with just a little bit of help) directly compiled by a Scala compiler or executed in an online playground.

Thoughts on implementing a C compiler

C is considered to be a small language. On the other hand, it’s one of the most widely-used programming languages, innumerable quantities of software written in C work in production and power mission-critical systems. The language’s standard library has been ported to a huge number of platforms. So, given that C is small but also very “real” and popular, a thought of writing a C compiler is exceptionally tempting.

If you decide to go down this route, it’s good to keep the following in mind. When people discuss the size of C what’s usually implied is it’s small for a production-grade, standardized language. As a natural consequence of that C has a lot of nuanced and complicated details. I’ll quote from the log Rui Ueyama kept on writing the 8cc C compiler to give a few examples.

I cannot pass a struct to a function. In x86 calling convention, structs are copied to the stack and their pointers are passed to functions. But in x86-64, you have to destructure a struct into multiple pieces of data and pass them via registers. It’s complicated…

— How I wrote a self-hosting C compiler in 40 days. Day 17

Implementing C preprocessor is no easy task.

It’s a part of the C standard, so it’s defined in the spec. But the spec says too little about it that it’s not useful to implement by itself. The spec includes a few examples of macros with their expanded forms, but it say only little about the algorithm. I think it doesn’t even say the details of its expected behavior. In short, it’s underspecified.

[…]

C preprocessor is itself an independent language. It’s rich in features, and only seasoned C programmers understand that well.

— How I wrote a self-hosting C compiler in 40 days. Day 21

I think I’ve been writing C for almost 15 years, but today I feel like I finally understand the C type syntax for the first time. It’s no wonder I couldn’t write working code. It’s because I simply wasn’t understanding it correctly.

The code that I just wrote is too complicated and fragile that even I can barely understand. I don’t believe Dennis Ritchie, the creator of the C language, understood the implications of what he was doing. I suspect that he invented a syntax, wrote code for it, which turned out to be more complicated than he had expected. And that was eventually standardized by the ANSI committee. It’s hard to implement a standardized language because you have to get everything right. It’s rather easy to write your own toy language.

— How I wrote a self-hosting C compiler in 40 days. Day 28

All that said, don’t get discouraged! Just consider whether you have enough time and energy to dedicate to such a project. I believe, having a finished compiler of an educational language is better than having a C compiler abandoned half-way through. But it’s definitely doable and a truly thrilling project to be working on. After all, as a careful reader might’ve already inferred, Rui Ueyama did 8cc in 40 days and fewer than 30 files of code. Although it seems they’re simply on a different level as a programmer… :)

You might also want to check out the source code of chibicc, which is another project of Rui Ueyama with “[the] core value [of] simplicity and the rea[da]bility of its source code.” And the code written not for the author but “for first-time readers.”

Despite all the complexities, writing a C compiler comes with a very nice bonus. You don’t need to come up with a runtime/standard library of your own. The C standard library is at your disposal, if you target one of the numerous supported platforms.

Parsing

Don’t get hung up on parsing theory. It’s a fascinating area of computer science, but it gets real complex really fast. The thing is, coding up a recursive descent parser takes very little knowledge of the theory. Of course, you can always come back later and study parsing in great details. Actually, first getting a feel for it by doing is going to be a great help in grasping the theory later.

Using a parser generator tool or a parser combinator library might sounds like a good way to save time and effort vs writing a parser by hand. But writing a hand-crafted parser might turn out easier than it seems from the outset. Once you get into the groove, you may realize that while it does take more typing, than using a tool wood, in return you get straightforward handwritten code that is easy to reason about and debug. And if you hit a special case, you just add a branch to handle it instead of fighting with a tool which might or might not be well documented.

Again, I cannot recommend “Crafting Interpreters” enough for a guide on implementing a parser.

It’s hard to underestimate the importance of having a working example to look at and fully comprehend. Here goes a couple:

Please, keep in mind, the last two are much bigger than 8cc and chibicc and don’t have easy of exploring their sources as an explicit goal.

An impressive thing to note is production-grade, major-league language implementations like GCC, Clang, Roslyn (C# and VB.NET compiler), D, Rust, Swift, Go, Java, V8, and TypeScript use hand-crafted recursive descent parsers too. So, if you decide to go down this route, your experience will actually be closer to that of professional compiler developers :)

Assembly generation

As I already mentioned, it’s possible to build a compiler that translates the source programming language into another high-level language like C. In fact, depending on the goals, this might actually be the best route to go down. That said, high-level languages are a bit too… well, high-level for someone who wants to grasp how things work behind the scenes.

On the other hand, the fact you’re enthusiastic about compilers means it isn’t unlikely you have interest in going even deeper down to the level of binary machine code and linking. I’d encourage you to put off working on these undeniably enticing subjects until after you have a first working version of the compiler built. For now, let GNU as and ld build and link native executables for you.

Many educational compilers emit MIPS assembly. Assuming you’re using Windows, Linux, or Mac as your daily driver, although it’s possible to run MIPS code using an emulator, to me, running a native executable produced by my own compiler feels much more rewarding.

Taking all the discussed points into account, assembly code native to your platform is a sweet spot for translation. It gets you close-enough to the metal to learn how high-level languages do their magic. But is abstract enough to avoid getting distracted by somewhat orthogonal subjects.

Refer to Chapter 11 – Code Generation of Introduction to Compilers and Language Design for a basic approach to assembly generation. Examining the sources of chibicc: A Small C Compiler or 8cc C Compiler is another great option.

Register allocation

In short, programs use processor registers to temporarily store variable and expression values. This post won’t dig any deeper than that, as by the time you get to implementing register allocation, you’ll in all likelihood have read longer descriptions by people who are actually good at explaining things.

Anyway, something to keep in mind, coding up register allocation can be a rather involved exercise. Register allocation in production compilers is a matter of scientific research. But there’s an extremely simple approach that is great to get a compiler up and running and can get it surprisingly far. It’s laid out in that same “Chapter 11 – Code Generation”:

[Y]ou can see that we set aside each register for a purpose: some are for function arguments, some for stack frame management, and some are available for scratch values. Take those scratch registers and put them into a table […] Then, write scratch_alloc to find an unused register in the table, mark it as in use, and return the register number r. scratch_free should mark the indicated register as available. scratch_name should return the name of a register, given its number r. Running out of scratch registers is possible, but unlikely, as we will see below. For now, if scratch_alloc cannot find a free register, just emit an error message and halt.

— 11.2 Supporting Functions

Runtime library

An implementation of educational language relies on a number of built-in functions, classes, and objects. Some of these are available to the user directly, such as ChocoPy’s print function or Cool 2020’s IO class. References to other built-ins can only be injected into a program by the compiler — for example, to allocate the memory for a new object or abort the execution in case things have gone off the rails.

The only readily available runtime implementation for an educational language I’m aware of is written in MIPS assembly for the language COOL1 designed by Alex Aiken used in his famous online course.

Examining COOL runtime’s MIPS code provides enough insights to implement a similar one for a different “client” language and dialect of assembly. This is how I came up with a runtime for Cool 2020 in x86-64 assembly. It’s written in an extremely naive and unoptimized way, but gets the job done. One serious omission though, there’s no garbage collection code (One day…)

Learning resources

Crafting Interpreters

Yes, yes, you read that right, interpreters. Crafting Interpreters by Robert Nystrom is simply a brilliant book, which despite its name can serve as a great introduction to compilers. Compilers and interpreters have a lot in common and by the time you get to the interpreters-specific topics, you’ll gain a ton of useful knowledge. In my opinion, the explanations on general programming language implementation concepts, lexing, and parsing are downright the best in class.

Introduction to Compilers and Language Design

Introduction to Compilers and Language Design by Douglas Thain is really accessible and doesn’t assume any preexisting compilers knowledge. It teaches all the basics necessary to build a compiler: from lexing and parsing to a program’s memory organization and stack management to assembly generation. It’s self-contained and takes a hands-on approach: you’ll end up with a working compiler if you choose to following the book through. If later you feel like digging into more advanced compiling techniques, you’ll have a solid foundation to build upon.

Both are freely available online. (I encourage you to consider buying the paid versions of the books though, if you can, to support the authors).

Building a Compiler video series by Immo Landwerth

Who’s gonna watch a Fortnite stream on Twitch, when Immo Landwerth is coding up a compiler live on YouTube?! Apart from all the usual compiling topics, the series touch upon adding a basic debugger support, so that it’s possible to step through the source of a compiled program. Immo is an awesome educator and their code is accessible and simply beautiful. You can explore it in the project’s github repo.

CS 6120: Advanced Compilers: The Self-Guided Online Course

A free self-guided online course from Cornell University. Goes in depth on IR and optimizations but is very accessible at the same time. Again, cannot recommend highly enough. CS 6120: Advanced Compilers: The Self-Guided Online Course.

LLVM IR Tutorial - Phis, GEPs …

A talk on intermediate representations (IR). Previous knowledge is not required. What’s so cool about this talk is it’s built around one of the most widely used IR — LLVM IR. 2019 EuroLLVM Developers’ Meeting: V. Bridgers & F. Piovezan “LLVM IR Tutorial - Phis, GEPs …”.

Conclusion

Compiling is a huge field. You’ll encounter seemingly endless choices, techniques, and opinions each step of the way. At times it’s hard not to give in to the temptation of going down the rabbit hole of researching all kind of ways of implementing a compilation stage or algorithm. I believe keeping the project’s scope as narrow as possible is a recipe for actually getting through to a working compiler.

Stick to straightforward simple ways of doing things at first. Consider writing down all the interesting links and nuggets of information you come across and setting them aside for a time. Once you finally make it to the other side, you can start going through your notes and replacing portions of the code with more advanced algorithms or extending it with additional functionality.

Good luck!


  1. Notice, while the names Cool 2020 and COOL sound confusingly similar, the languages themselves are distinct: Cool 2020 is a Scala subset and COOL is an invention of Alex Aiken. The name Cool 2020 is definitely not a coincidence. In the reference manual they say the language is named after Alex Aiken’s COOL. ↩︎