Starforth v1: How to Write a Compiler without a Compiler!
This article is part of an ongoing series on writing a self-hosting compiler. You can find the first installment here.
This post goes over how I shaped up the first version of the compiler. The main criteria for version 1 were, the compiler should be able to compile its own source code, and also be bootstrappable by the compiler written in assembly. Each version of the compiler is kept in its own branch, so you can find the latest update for version 1 in the v1 branch, though I am going to mainly reference how the source code looked like at tag v1.0.
I try to explain Forth concepts as we encounter them, but I’m not going to go into details. You can find many good references. I personally started by reading the online version of the book “Starting Forth”. You can find the online version of the book here.
Getting Started
There was a problem when I started working on the Starforth compiler: I didn’t know the source language. Not only I wasn’t much of an expert on Forth itself, but I didn’t know the particulars of my own variant of Forth, since there are a lot of Forths out there. I immediately dove in writing assembly code for the bootstrap compiler, but I soon realized that not knowing what primitives I’m going to need is a big issue. It was so frustrating that I soon abandoned that route. That effort is not even in the git history.
I then came up with another idea. What if I start writing the Forth code first? I didn’t have a compiler to compile and run it, but it was still going to give me a more concrete idea about what I needed to implement.
Many tries later, things started getting more concrete in my head. The main problem that needed to be solved were immediates. An immediate is a word (remember that’s a functions in Forth) that are run at compile-time. Lisp programmers would call it a macro. With an interpreter that should be relatively easy to implement, but with an ahead-of-time compiler, I needed to compile the code to memory and run it from there. That sounded like a lot of work. Trouble was, control flow structures like IF…THEN and DO…LOOP are traditionally implemented using immediates, so it looked like that I really needed to have them.
I found the solution some time later when I was reading about
preForth/seedForth. This is a cool project that aims to bootstrap itself
from a small kernel and then grow from there. Looking at what they had done I
realized that they don’t actually have immediates in their bootstrapper. The
only control flow they have is ?exit
. That’s a word that reads a boolean from
the stack and returns if it’s true. That sounded like what I needed. It was
going to be hard not having any loops or even “if”, but that could only add to
the fun of the project!
Another issue was variables. Variables are traditionally implemented using Forth’s dictionary which maps words to their code. This dictionary most definitely does not exist in the code produced by my compiler, so what should I do? I decided the program will have a block of memory it can use however it wants. There would be a primitive named “mem” that returns the start address to this block of memory. So declaring variables would become something like this:
: foo mem ;
: bar mem 4 + ;
So “foo” returns the address of the start of user memory. And “bar” returns the address to 4 bytes after that. As I mentioned in the previous article, variables in Forth return their addresses, so these work exactly like normal Forth variables. We just need to manually assign them addresses in our block of memory.
We also need some string operations. I decided a single string variable is going
to suffice for everything. We’re going to compile the input one word at a time.
We’ll implement strings as a buffer that starts with the string length (in the
first 4 bytes), followed by the contents of the string. Our single string buffer
is only 36 bytes, so our compiler can handle names no longer than 32 bytes. I
implemented a couple of helper words to make things slightly easier to read.
str-length
returns the length of the current string, str-start
returns the
start address of the string contents, and str-ch
receives an index n on the
stack, and returns the nth character of the string. clear-string
sets the
string buffer to an empty string, and finally append-char
appends one
character to the end of the string. That should be enough for our compiler.
Diving into the Code
Let’s start with some code. A Starforth program starts with the “main” word. This is what I imagined the compiler’s own “main” word would look like:
: main init compile-all epilogue ;
The final version of main
had a little more to it, but those are the main
steps. compile-all
is the more interesting word here, so let’s try
implementing that:
: compile-all read eof? ?exit compile tail compile-all ;
This starts by reading a single token. This is read into the compiler’s string
buffer. We then check if we’ve hit the end of file. If not we compile
the
token we just read and finally compile-all
calls itself again to repeat
everything. In order to avoid exhausting the stack, we use the tail
word to
eliminate the tail call. I’ll explain how that works later. For now, we know
this code just keeps reading a word, compiling it, and repeating that until we
reach the end of the input file.
Let’s see how compile
is supposed to work:
: compile
\ try user-defined words
compile-call? ?exit
\ try built-ins
compile-colon? ?exit
compile-semicolon? ?exit
compile-mem? ?exit
\ ... more built-ins
\ try numbers
compile-char-literal? ?exit
compile-number? ?exit
\ unknown word error
1 halt
;
In the absence of any control flow besides ?exit
, we repeatedly try calling
different “compile-*” words. Each of these returns true if it manages to compile
the word in the string buffer. So at first compile-call?
tries looking up the
token as a user-defined word and if it is, emits a call instruction for it. If
that didn’t succeed, we try compiling any of the built-ins we have. If that
didn’t work either, we trying compiling the token as a number or character
literal. If that doesn’t work, the compiler exits with a “1” exit code, meaning
it encountered an unknown word.
Compiling Built-in Words
Now let’s see how we compile one of the simpler built-ins:
: compile-mem?
false
str-length 3 <> ?exit
0 str-ch 'm' <> ?exit
1 str-ch 'e' <> ?exit
2 str-ch 'm' <> ?exit
drop
57h emit1 \ push edi ; start of user memory area is in edi
true
;
Since we don’t have any sophisticated string operators, we have to check the
string one character at a time. We first check the string length of course. If
the token is three characters long and consists of the characters “m”, “e”, and
“m”, we go ahead and compile it into a single instruction: PUSH EDI
We expect an initialization code allocate a block of memory and store it’s start address in the EDI register, so executing the “mem” primitive simply puts the value of EDI on the stack.
57h (which is a hex literal), is the x86 op code for push edi
. emit1
emits a
one byte value to the output code. We also have emit2
, emit3
and emit4
.
Compiling all other built-ins follows the same pattern: we put false on the
stack and check if the string matches what we want. If not, we return, leaving
false
on the stack. If everything does match, we drop false
from the stack,
emit machine code, push true
on the stack and return.
Defining New Words
Now let’s see how user-defined words are implemented. The colon (:) built-in word does that:
: compile-colon?
false
str-length 1 <> ?exit
0 str-ch ':' <> ?exit
drop
read
dict-add
cur-offset @ , \ write the word start offset in the dictionary entry
emit-prologue-if-main
emit-word-preamble
true
;
Let’s unpack this one step at a time.
First, we check if current token is indeed a colon. If not, we return false. Otherwise we read another token. This will be the name of the user-defined word.
dict-add is the important part here. It adds the token in the string buffer (the name after colon) to the dictionary, which is a simple key-value map. After that we store the current code offset in the dictionary entry, that is, we associate the name with its code address.
The dictionary is a linked list. We keep a pointer to its tail (the last word added), and each entry has a link to the previous one. A zero in the “previous” field marks the beginning of the dictionary. So dict-add allocates a dictionary entry, links it to the last entry, marks it as the new last entry, and copies the name.
How does memory allocation work? This is pretty much how things are done in
normal Forths too. We have a block of memory, and we keep a pointer to where we
haven’t written anything yet. We call this pointer here
. The comma word (,)
reads a value from the stack and writes it to where here
points to, and then
advances here
to the next free location.
Back to the code above, dict-add
creates a dictionary entry. After it’s done,
here
points to where we can write the “value” of the dictionary entry. A
dictionary entry has a “key”, which is the word name, and a “value” which is the
start code offset of the word. So after cur-offset @ ,
the name after colon
becomes associated with current code offset.
emit-prologue-if-main and emit-word-preamble are not terribly important here.
The first emits some startup code if the main
word is being defined. The
other, emits some “preamble” code at the beginning of each definition. I’ll
explain this preamble later.
Executing User-Defined Words
This is the code that compiles a call to a user-defined word:
: compile-call?
dict-find dup 0= ?exit
\ we now have a pointer to the value of the dictionary entry (after the
\ header). the cell contains an offset to call (which should be converted to a
\ relative offset first).
@ \ read the actual offset from the dictionary entry
0ec87h emit2 \ xchg ebp, esp
cur-offset @ 5 + - \ convert offset to relative value (5 is the length of the call instruction)
0e8h emit1 \ call ...
emit4 \ relative address to call
0ec87h emit2 \ xchg ebp, esp
true
;
This looks up the name in the dictionary. dict-find performs the lookup. If it returns zero, the name does not exist in the dictionary. If it is found, it returns a pointer to the “value” of the relevant dictionary entry, that is, where we stored the code offset.
Let’s ignore the xchg ebp, esp
instruction for now. I’m going to explain that
later. We now need to emit a call
instruction. The operand of the call
instruction is an address relative to the instruction immediately after call. So
we add 5 to the current offset, because a call instruction is five bytes long,
and then subtract that from the offset dict-find returned.
Tail Calls
A tail call is a call that happens as the last action in a subroutine. There’s no code after the tail call, we simply return whatever the call returned. These calls can usually be eliminated depending on the programming language. In Forth, there’s no “stack frame” as other programming languages, so a tail call can always be converted to a jump instruction without any adjustments to the stack.
This is useful for us, since we don’t have any loops in our language yet, and we
rely heavily on recursion. The tail
word is going to help with that. If we
eliminate tail calls, we can use recursion instead of loops without having to
worry about exhausting the stack. At the earliest versions of Starforth though,
I relied on a trick I learned from an IOCCC entry that implemented a simple
Forth. This is the definition of the tail
word:
: tail r> r> drop >r ;
Forth has two stacks, the parameter stack (which is the normal stack used for most operations), and the return stack, which keeps the return address when a word is called.
The r> word moves an item from the return stack to the parameter stack. The >r word does the reverse and moves an item from the parameter stack to the return stack.
When tail
is called, the top of the return stack contains the address of the
instruction right after tail
, which should be a tail recursive call. We move
this to the parameter stack. We then move the item right behind it, which is the
address of the caller of the word that called tail
. We drop this second
address and move the tail
return address back. So at the end of tail
we
return to where we should have, but one address before that is gone. This way,
tail
prevents the stack from growing without bounds. There’s a small problem
though. The return address of the word that called the caller of tail
is
removed. For recursive calls that’s fine. But we’re also eliminating the address
of the initial caller. So if main
calls compile-all
and compile-all
recursively calls itself using tail
, we actually never return to main, we try
to return to one level above that when compile-all
is done. This won’t work
obviously. In order to fix that, we add an extra caller to all words that
recursively tail-call themselves. So this will be the actual definition of
compile-all
:
: _compile-all read eof? ?exit compile tail _compile-all ;
: compile-all _compile-all ;
When _compile-all
is done, we never return to compile-all
, but return to its
caller, which is just fine.
I/O and Interfacing with the OS
We need to interface with the operating system for certain tasks, like reading
and writing files, or exiting the program. Since we only target Linux, we need
to call Linux “system calls”. Linux system calls on x86 are called using the
int 0x80
instruction. Each system call has a number that is stored in the EAX
register. If the system call has any parameters, these are stored in EBX, ECX,
EDX, ESI, EDI, and EBP (so a maximum of six arguments) in that order. As an
example, one of the simplest system calls is EXIT, which has the number 1 (in
32-bit x86 at least). The following code compiles the halt
built-in that exits
the program using this system call:
: compile-halt?
false
str-length 4 <> ?exit
0 str-ch 'h' <> ?exit
1 str-ch 'a' <> ?exit
2 str-ch 'l' <> ?exit
3 str-ch 't' <> ?exit
drop
0xb8 emit1 \ mov eax, ...
0x00000001 emit4 \ ...1 (EXIT system call)
0x5b emit1 \ pop ebx
0x80cd emit2 \ int 0x80
true
;
The generated code will read a number from the stack and uses that as the first
and only argument to the system call (which is exit code). It also sets 1 in EAX
and performs int 0x80
.
I initially had built-ins like halt
, read
, write
, etc to allow a program
to interface with the operating system. But in the end, I decided to remove
those and add a number of built-ins that allow a program to directly call any
system call. These built-ins are called syscall0
, syscall1
, syscall2
, all
the way through syscall6
. The number at the end shows the number of arguments
the system call receives.
Exiting the program with exit code 42 then becomes:
42 1 syscall1
Handling Two Stacks
As we said before, Forth has two stacks: a parameter stack and a return stack. An x86 CPU however only supports one stack. This is setup for us by Linux and its address is stored in the ESP register. The code generated by Starforth then allocates a block of memory for the return stack and stores its address in EBP.
Whenever we want to perform an operation that needs to work on the return stack,
we perform an xchg ebp, esp
, which swaps the values of EBP and ESP.
Afterwards, we emit another xchg ebp, esp
to switch back to the normal stack.
This is a little inefficient of course. We’ll make it work slightly better in
future versions.
Here are the places where we issue an xchg instruction:
- Before and after each call instruction
- Before a
ret
instruction - Right at the beginning of a word definition.
So when we want to execute a user-defined word, an xchg instruction is run first
to switch to the return stack, we then perform the call which stores the return
address on the return stack as we want. Right at the beginning of the callee
code, there will be another xchg that switches back to the parameter stack.
Before returning, we do another xchg to switch to the return stack, allowing
ret
to read the return address from the correct stack, and then we get to the
point after the call instruction that does yet another xchg to switch back to
the normal stack.
Compiler Primitives at Version 1
In order to keep the bootstrap compiler as compact as possible, these are the only built-in words that are supported by the compiler at version 1:
:
start a word definition;
end a word definitionps@
get the current value of the parameter stack pointerps!
write a new address to the parameter stack pointerrs@
get the current value of the return stack pointerrs!
write a new address to the parameter stack pointer@
read the value of a cell (4 bytes) from memory!
write a value to a cell in memoryand
bitwise AND two integers-
subtract two integers*
multiply two integerslsr
logical right shift an integermem
give the pointer to the start of user memory?exit
exit current word if the top of stack is true (non zero)0<
put true (-1) on the stack if current top of the stack is less than zero, otherwise put false.syscall0
-syscall6
: perform a Linux system call with 0 to 6 arguments.
This is very minimal in order to keep the bootstrap compiler small. So initially
even +
is implemented in terms of -
.
Tests
Tests are awfully important for a compiler. You keep changing the internals of the compiler, including how language constructs are translated into machine code. We need to make sure things behave the way they should, and they keep behaving like that when we make changes. That’s why I had to write some sort of test suite early.
In addition to the language constructs, the test suite also tests some of the compiler internals, so it needs to be able to call those. In addition, since the core language is very minimal, we need access to the utility words written in the compiler. Trouble is, we don’t have any library support, or even programs consisting of multiple files. This shell script is what I came up with:
#!/bin/sh
# remove the main word in startforth.f, add the test file (which includes its
# own main word) and compile the result.
{ sed 's/: main .* ;//' starforth.f && cat test.f; } | ./sf
exit_code="$?"
if [ "$exit_code" -ne 0 ]; then
echo "Compilation failed with code $exit_code."
exit 1
fi
# Run the tests
./a.out
exit_code="$?"
if [ "$exit_code" -ne 0 ]; then
echo
echo "Some tests failed. Exit code=$exit_code"
exit 2
fi
This combines the the compiler source code (starforth.f) with the test suite
(test.f), but removes the compiler’s main
definition so we won’t have two
main
words. It them compiles the combined source code and runs it.
The test suite itself is just a number of test words, with a main
that simply
calls them, and makes sure each has returned true.
A Little Cheating: refc and refi
It’s time to confess that I did a bit of cheating to make things a little easier for myself. I wrote a couple of other tools to help me with development and debugging. These are not used for bootstrapping, and were actually removed in future versions, but they were still really useful to get me to version 1.
I first wrote a “reference compiler” in Python. This is a small script named “refc.py” and it can compile the Starforth source code into an ELF executable. It basically does the same thing as the compiler itself (starforth.f) and also as the bootstrap compiler (bootstrap.s), but since it’s in Python, I could write it very quickly and use it for debugging.
I also wrote a “reference interpreter” in C (refi.c). This allowed me to step through the source code, examine memory and registers and track bugs much more easily.
Both of these are actually capable of bootstrapping the version 1 compiler, but of course they are not used in actual bootstrapping (even though the verify.sh script calls refc too, and compares the result with what the bootstrap compiler generates).
One interesting use of the refc script was to help with debugging in gdb. Let’s say there’s a crash in the compiler. I go to gdb, but gdb can only show me assembly code and raw addresses. How can we get a stack trace? Enter refc! One interesting thing that refc does when compiling the source code is logging each definition that it compiles along with its address. We can write these into a text file as a sort of poor man’s debug info. Then in GDB we examine the contents of where ESP and EBP registers point to. It’s usually pretty easy to spot which one contains the return stack and which the parameter stack, since the return stack almost always contains only code addresses.
So by examining memory, we can see the call stack in raw form. By cross referencing these values with what refc has logged, we can map each address to a word name. I also wrote a script (findaddr.py) to help me with this cross referencing. Debugging is still rather difficult, but not impossible.
Even More Minimal?
In this first version, I tried to make things as minimal as possible, even going
so far as to not have a +
word and implementing that in terms of -
. I could
have gone even more minimal than this by removing comments, character literals
and hex literals. Technically none of those are really necessary, but without
them the code would become much harder to read. So I drew a line at that. I can
work without built-in addition, but not without comments!
Some References
Here are links to some of the materials that helped me.
- Jones Forth is a thoroughly commented implementation of a small Forth written in assembly.
- Here are the design notes for the IOCCC entry I mentioned in the article:
- And finally here’s the code repository for preForth.
Coming Up
We now have a compiler that can compile its own source code. I plan to write more blog entries describing how the language and the compiler evolved after that. Stay tuned!