Starforth v1: How to Write a Compiler without a Compiler!

Jul 8, 2023

This article is part of an ongoing series on writing a self-hosting compiler. You can find the first installment here.

This post goes over how I shaped up the first version of the compiler. The main criteria for version 1 were, the compiler should be able to compile its own source code, and also be bootstrappable by the compiler written in assembly. Each version of the compiler is kept in its own branch, so you can find the latest update for version 1 in the v1 branch, though I am going to mainly reference how the source code looked like at tag v1.0.

I try to explain Forth concepts as we encounter them, but I’m not going to go into details. You can find many good references. I personally started by reading the online version of the book “Starting Forth”. You can find the online version of the book here.

Getting Started

There was a problem when I started working on the Starforth compiler: I didn’t know the source language. Not only I wasn’t much of an expert on Forth itself, but I didn’t know the particulars of my own variant of Forth, since there are a lot of Forths out there. I immediately dove in writing assembly code for the bootstrap compiler, but I soon realized that not knowing what primitives I’m going to need is a big issue. It was so frustrating that I soon abandoned that route. That effort is not even in the git history.

I then came up with another idea. What if I start writing the Forth code first? I didn’t have a compiler to compile and run it, but it was still going to give me a more concrete idea about what I needed to implement.

Many tries later, things started getting more concrete in my head. The main problem that needed to be solved were immediates. An immediate is a word (remember that’s a functions in Forth) that are run at compile-time. Lisp programmers would call it a macro. With an interpreter that should be relatively easy to implement, but with an ahead-of-time compiler, I needed to compile the code to memory and run it from there. That sounded like a lot of work. Trouble was, control flow structures like IF…THEN and DO…LOOP are traditionally implemented using immediates, so it looked like that I really needed to have them.

I found the solution some time later when I was reading about preForth/seedForth. This is a cool project that aims to bootstrap itself from a small kernel and then grow from there. Looking at what they had done I realized that they don’t actually have immediates in their bootstrapper. The only control flow they have is ?exit. That’s a word that reads a boolean from the stack and returns if it’s true. That sounded like what I needed. It was going to be hard not having any loops or even “if”, but that could only add to the fun of the project!

Another issue was variables. Variables are traditionally implemented using Forth’s dictionary which maps words to their code. This dictionary most definitely does not exist in the code produced by my compiler, so what should I do? I decided the program will have a block of memory it can use however it wants. There would be a primitive named “mem” that returns the start address to this block of memory. So declaring variables would become something like this:

: foo mem ;
: bar mem 4 + ;

So “foo” returns the address of the start of user memory. And “bar” returns the address to 4 bytes after that. As I mentioned in the previous article, variables in Forth return their addresses, so these work exactly like normal Forth variables. We just need to manually assign them addresses in our block of memory.

We also need some string operations. I decided a single string variable is going to suffice for everything. We’re going to compile the input one word at a time. We’ll implement strings as a buffer that starts with the string length (in the first 4 bytes), followed by the contents of the string. Our single string buffer is only 36 bytes, so our compiler can handle names no longer than 32 bytes. I implemented a couple of helper words to make things slightly easier to read. str-length returns the length of the current string, str-start returns the start address of the string contents, and str-ch receives an index n on the stack, and returns the nth character of the string. clear-string sets the string buffer to an empty string, and finally append-char appends one character to the end of the string. That should be enough for our compiler.

Diving into the Code

Let’s start with some code. A Starforth program starts with the “main” word. This is what I imagined the compiler’s own “main” word would look like:

: main init compile-all epilogue ;

The final version of main had a little more to it, but those are the main steps. compile-all is the more interesting word here, so let’s try implementing that:

: compile-all read eof? ?exit compile tail compile-all ;

This starts by reading a single token. This is read into the compiler’s string buffer. We then check if we’ve hit the end of file. If not we compile the token we just read and finally compile-all calls itself again to repeat everything. In order to avoid exhausting the stack, we use the tail word to eliminate the tail call. I’ll explain how that works later. For now, we know this code just keeps reading a word, compiling it, and repeating that until we reach the end of the input file.

Let’s see how compile is supposed to work:

: compile
  \ try user-defined words
  compile-call? ?exit

  \ try built-ins
  compile-colon? ?exit
  compile-semicolon? ?exit
  compile-mem? ?exit
  \ ... more built-ins

  \ try numbers
  compile-char-literal? ?exit
  compile-number? ?exit

  \ unknown word error
  1 halt
  ;

In the absence of any control flow besides ?exit, we repeatedly try calling different “compile-*” words. Each of these returns true if it manages to compile the word in the string buffer. So at first compile-call? tries looking up the token as a user-defined word and if it is, emits a call instruction for it. If that didn’t succeed, we try compiling any of the built-ins we have. If that didn’t work either, we trying compiling the token as a number or character literal. If that doesn’t work, the compiler exits with a “1” exit code, meaning it encountered an unknown word.

Compiling Built-in Words

Now let’s see how we compile one of the simpler built-ins:

: compile-mem?
  false
  str-length 3 <> ?exit
  0 str-ch 'm' <> ?exit
  1 str-ch 'e' <> ?exit
  2 str-ch 'm' <> ?exit
  drop
  57h emit1 \ push edi ; start of user memory area is in edi
  true
  ;

Since we don’t have any sophisticated string operators, we have to check the string one character at a time. We first check the string length of course. If the token is three characters long and consists of the characters “m”, “e”, and “m”, we go ahead and compile it into a single instruction: PUSH EDI

We expect an initialization code allocate a block of memory and store it’s start address in the EDI register, so executing the “mem” primitive simply puts the value of EDI on the stack.

57h (which is a hex literal), is the x86 op code for push edi. emit1 emits a one byte value to the output code. We also have emit2, emit3 and emit4.

Compiling all other built-ins follows the same pattern: we put false on the stack and check if the string matches what we want. If not, we return, leaving false on the stack. If everything does match, we drop false from the stack, emit machine code, push true on the stack and return.

Defining New Words

Now let’s see how user-defined words are implemented. The colon (:) built-in word does that:

: compile-colon?
  false
  str-length 1 <> ?exit
  0 str-ch ':' <> ?exit
  drop
  read
  dict-add
  cur-offset @ , \ write the word start offset in the dictionary entry
  emit-prologue-if-main
  emit-word-preamble
  true
  ;

Let’s unpack this one step at a time.

First, we check if current token is indeed a colon. If not, we return false. Otherwise we read another token. This will be the name of the user-defined word.

dict-add is the important part here. It adds the token in the string buffer (the name after colon) to the dictionary, which is a simple key-value map. After that we store the current code offset in the dictionary entry, that is, we associate the name with its code address.

The dictionary is a linked list. We keep a pointer to its tail (the last word added), and each entry has a link to the previous one. A zero in the “previous” field marks the beginning of the dictionary. So dict-add allocates a dictionary entry, links it to the last entry, marks it as the new last entry, and copies the name.

How does memory allocation work? This is pretty much how things are done in normal Forths too. We have a block of memory, and we keep a pointer to where we haven’t written anything yet. We call this pointer here. The comma word (,) reads a value from the stack and writes it to where here points to, and then advances here to the next free location.

Back to the code above, dict-add creates a dictionary entry. After it’s done, here points to where we can write the “value” of the dictionary entry. A dictionary entry has a “key”, which is the word name, and a “value” which is the start code offset of the word. So after cur-offset @ , the name after colon becomes associated with current code offset.

emit-prologue-if-main and emit-word-preamble are not terribly important here. The first emits some startup code if the main word is being defined. The other, emits some “preamble” code at the beginning of each definition. I’ll explain this preamble later.

Executing User-Defined Words

This is the code that compiles a call to a user-defined word:

: compile-call?
  dict-find dup 0= ?exit

  \ we now have a pointer to the value of the dictionary entry (after the
  \ header). the cell contains an offset to call (which should be converted to a
  \ relative offset first).

  @ \ read the actual offset from the dictionary entry

  0ec87h emit2 \ xchg ebp, esp
  cur-offset @ 5 + - \ convert offset to relative value (5 is the length of the call instruction)
  0e8h emit1   \ call ...
  emit4        \ relative address to call
  0ec87h emit2 \ xchg ebp, esp

  true
  ;

This looks up the name in the dictionary. dict-find performs the lookup. If it returns zero, the name does not exist in the dictionary. If it is found, it returns a pointer to the “value” of the relevant dictionary entry, that is, where we stored the code offset.

Let’s ignore the xchg ebp, esp instruction for now. I’m going to explain that later. We now need to emit a call instruction. The operand of the call instruction is an address relative to the instruction immediately after call. So we add 5 to the current offset, because a call instruction is five bytes long, and then subtract that from the offset dict-find returned.

Tail Calls

A tail call is a call that happens as the last action in a subroutine. There’s no code after the tail call, we simply return whatever the call returned. These calls can usually be eliminated depending on the programming language. In Forth, there’s no “stack frame” as other programming languages, so a tail call can always be converted to a jump instruction without any adjustments to the stack.

This is useful for us, since we don’t have any loops in our language yet, and we rely heavily on recursion. The tail word is going to help with that. If we eliminate tail calls, we can use recursion instead of loops without having to worry about exhausting the stack. At the earliest versions of Starforth though, I relied on a trick I learned from an IOCCC entry that implemented a simple Forth. This is the definition of the tail word:

: tail r> r> drop >r ;

Forth has two stacks, the parameter stack (which is the normal stack used for most operations), and the return stack, which keeps the return address when a word is called.

The r> word moves an item from the return stack to the parameter stack. The >r word does the reverse and moves an item from the parameter stack to the return stack.

When tail is called, the top of the return stack contains the address of the instruction right after tail, which should be a tail recursive call. We move this to the parameter stack. We then move the item right behind it, which is the address of the caller of the word that called tail. We drop this second address and move the tail return address back. So at the end of tail we return to where we should have, but one address before that is gone. This way, tail prevents the stack from growing without bounds. There’s a small problem though. The return address of the word that called the caller of tail is removed. For recursive calls that’s fine. But we’re also eliminating the address of the initial caller. So if main calls compile-all and compile-all recursively calls itself using tail, we actually never return to main, we try to return to one level above that when compile-all is done. This won’t work obviously. In order to fix that, we add an extra caller to all words that recursively tail-call themselves. So this will be the actual definition of compile-all:

: _compile-all read eof? ?exit compile tail _compile-all ;
: compile-all _compile-all ;

When _compile-all is done, we never return to compile-all, but return to its caller, which is just fine.

I/O and Interfacing with the OS

We need to interface with the operating system for certain tasks, like reading and writing files, or exiting the program. Since we only target Linux, we need to call Linux “system calls”. Linux system calls on x86 are called using the int 0x80 instruction. Each system call has a number that is stored in the EAX register. If the system call has any parameters, these are stored in EBX, ECX, EDX, ESI, EDI, and EBP (so a maximum of six arguments) in that order. As an example, one of the simplest system calls is EXIT, which has the number 1 (in 32-bit x86 at least). The following code compiles the halt built-in that exits the program using this system call:

: compile-halt?
  false
  str-length 4 <> ?exit
  0 str-ch 'h' <> ?exit
  1 str-ch 'a' <> ?exit
  2 str-ch 'l' <> ?exit
  3 str-ch 't' <> ?exit
  drop
  0xb8 emit1       \ mov eax, ...
  0x00000001 emit4 \ ...1 (EXIT system call)
  0x5b emit1       \ pop ebx
  0x80cd emit2     \ int 0x80
  true
  ;

The generated code will read a number from the stack and uses that as the first and only argument to the system call (which is exit code). It also sets 1 in EAX and performs int 0x80.

I initially had built-ins like halt, read, write, etc to allow a program to interface with the operating system. But in the end, I decided to remove those and add a number of built-ins that allow a program to directly call any system call. These built-ins are called syscall0, syscall1, syscall2, all the way through syscall6. The number at the end shows the number of arguments the system call receives.

Exiting the program with exit code 42 then becomes:

42 1 syscall1

Handling Two Stacks

As we said before, Forth has two stacks: a parameter stack and a return stack. An x86 CPU however only supports one stack. This is setup for us by Linux and its address is stored in the ESP register. The code generated by Starforth then allocates a block of memory for the return stack and stores its address in EBP.

Whenever we want to perform an operation that needs to work on the return stack, we perform an xchg ebp, esp, which swaps the values of EBP and ESP. Afterwards, we emit another xchg ebp, esp to switch back to the normal stack. This is a little inefficient of course. We’ll make it work slightly better in future versions.

Here are the places where we issue an xchg instruction:

Before and after each call instruction
Before a ret instruction
Right at the beginning of a word definition.

So when we want to execute a user-defined word, an xchg instruction is run first to switch to the return stack, we then perform the call which stores the return address on the return stack as we want. Right at the beginning of the callee code, there will be another xchg that switches back to the parameter stack. Before returning, we do another xchg to switch to the return stack, allowing ret to read the return address from the correct stack, and then we get to the point after the call instruction that does yet another xchg to switch back to the normal stack.

Compiler Primitives at Version 1

In order to keep the bootstrap compiler as compact as possible, these are the only built-in words that are supported by the compiler at version 1:

: start a word definition
; end a word definition
ps@ get the current value of the parameter stack pointer
ps! write a new address to the parameter stack pointer
rs@ get the current value of the return stack pointer
rs! write a new address to the parameter stack pointer
@ read the value of a cell (4 bytes) from memory
! write a value to a cell in memory
and bitwise AND two integers
- subtract two integers
* multiply two integers
lsr logical right shift an integer
mem give the pointer to the start of user memory
?exit exit current word if the top of stack is true (non zero)
0< put true (-1) on the stack if current top of the stack is less than zero, otherwise put false.
syscall0-syscall6: perform a Linux system call with 0 to 6 arguments.

This is very minimal in order to keep the bootstrap compiler small. So initially even + is implemented in terms of -.

Tests

Tests are awfully important for a compiler. You keep changing the internals of the compiler, including how language constructs are translated into machine code. We need to make sure things behave the way they should, and they keep behaving like that when we make changes. That’s why I had to write some sort of test suite early.

In addition to the language constructs, the test suite also tests some of the compiler internals, so it needs to be able to call those. In addition, since the core language is very minimal, we need access to the utility words written in the compiler. Trouble is, we don’t have any library support, or even programs consisting of multiple files. This shell script is what I came up with:

#!/bin/sh

# remove the main word in startforth.f, add the test file (which includes its
# own main word) and compile the result.
{ sed 's/: main .* ;//' starforth.f && cat test.f; } | ./sf
exit_code="$?"
if [ "$exit_code" -ne 0 ]; then
    echo "Compilation failed with code $exit_code."
    exit 1
fi

# Run the tests
./a.out
exit_code="$?"
if [ "$exit_code" -ne 0 ]; then
    echo
    echo "Some tests failed. Exit code=$exit_code"
    exit 2
fi

This combines the the compiler source code (starforth.f) with the test suite (test.f), but removes the compiler’s main definition so we won’t have two main words. It them compiles the combined source code and runs it.

The test suite itself is just a number of test words, with a main that simply calls them, and makes sure each has returned true.

A Little Cheating: refc and refi

It’s time to confess that I did a bit of cheating to make things a little easier for myself. I wrote a couple of other tools to help me with development and debugging. These are not used for bootstrapping, and were actually removed in future versions, but they were still really useful to get me to version 1.

I first wrote a “reference compiler” in Python. This is a small script named “refc.py” and it can compile the Starforth source code into an ELF executable. It basically does the same thing as the compiler itself (starforth.f) and also as the bootstrap compiler (bootstrap.s), but since it’s in Python, I could write it very quickly and use it for debugging.

I also wrote a “reference interpreter” in C (refi.c). This allowed me to step through the source code, examine memory and registers and track bugs much more easily.

Both of these are actually capable of bootstrapping the version 1 compiler, but of course they are not used in actual bootstrapping (even though the verify.sh script calls refc too, and compares the result with what the bootstrap compiler generates).

One interesting use of the refc script was to help with debugging in gdb. Let’s say there’s a crash in the compiler. I go to gdb, but gdb can only show me assembly code and raw addresses. How can we get a stack trace? Enter refc! One interesting thing that refc does when compiling the source code is logging each definition that it compiles along with its address. We can write these into a text file as a sort of poor man’s debug info. Then in GDB we examine the contents of where ESP and EBP registers point to. It’s usually pretty easy to spot which one contains the return stack and which the parameter stack, since the return stack almost always contains only code addresses.

So by examining memory, we can see the call stack in raw form. By cross referencing these values with what refc has logged, we can map each address to a word name. I also wrote a script (findaddr.py) to help me with this cross referencing. Debugging is still rather difficult, but not impossible.

Even More Minimal?

In this first version, I tried to make things as minimal as possible, even going so far as to not have a + word and implementing that in terms of -. I could have gone even more minimal than this by removing comments, character literals and hex literals. Technically none of those are really necessary, but without them the code would become much harder to read. So I drew a line at that. I can work without built-in addition, but not without comments!

Some References

Here are links to some of the materials that helped me.

Jones Forth is a thoroughly commented implementation of a small Forth written in assembly.
Here are the design notes for the IOCCC entry I mentioned in the article:
And finally here’s the code repository for preForth.

Coming Up

We now have a compiler that can compile its own source code. I plan to write more blog entries describing how the language and the compiler evolved after that. Stay tuned!