8 October 2015 / Consolite

The Consolite Compiler

In my previous post, I created Breakout in my custom assembly language, Consolite Assembly. Writing a game in assembly was a fun challenge, but I am used to writing code at a higher level of abstraction. To make it easier on myself, I created a compiler that takes source code in a C-like language and converts it to Consolite Assembly. The source of the Consolite Compiler, written in C++, can be found here. A rough specification for the language, which I will refer to as Consolite C, can be found here.

Link to Tetris demo

What Is a Compiler?

A compiler is a program that takes in source code in one language and outputs target code in another language. In this case, the source code is written in Consolite C, and the target code is Consolite Assembly. To give a contrived example, the compiler takes in code like this:

uint16 apples = 3;
uint16 oranges = 5;
uint16 fruit = apples + oranges;

And turns it into code like this:

MOVI E 0x0003  ; apples = 3
MOVI F 0x0005  ; oranges = 5
MOV G E        ; fruit = apples
ADD G F        ; fruit = fruit + oranges

One of the primary advantages of the former is that everything is more human-readable. You can have descriptive variable names, complex mathematical expressions, if-else statements, loops, and many other features that would otherwise need to be tediously hand-coded in assembly. For a full list of language features, take a look at the spec.

How Does It Work?

The Consolite Compiler works in two stages. In the first stage, it parses the source code into a syntax tree, checking for errors along the way. For example, suppose the compiler is parsing a function like the following, which sums the values in an array:

uint16 sum(uint16 array, uint16 length) {
  uint16 i;
  uint16 s = 0;
  for (i = 0; i < length; i = i + 1) {
    s = s + array[i];
  }
  return s;
}

The parser knows that this is a function because the first three tokens it finds are: "uint16", a type; "sum", a name; and "(", an open parenthesis that distinguishes a function from a global variable declaration. It saves the return type and function name, then parses the comma-separated parameters until it reaches a closing parenthesis. Next it consumes the opening brace "{" and begins parsing statements in the function body, until it reaches a closing brace "}".

Statements within a function body can take many forms. In the example above, there are two local variable declarations, one of which has an initial value. Next, there is a for-loop, which is of the form:

for (initial_expression; condition; loop_expression)
  body_statement

The parser goes through the following steps to parse the for-loop:

Recognize that what follows is a for-loop by the "for" keyword.
Consume the "(" token, or give an error if not found.
Go into expression parsing mode to get the initial expression.
Consume the ";" token, or give an error if not found.
Go into expression parsing mode to get the conditional expression.
Consume the ";" token, or give an error if not found.
Go into expression parsing mode to get the loop expression.
Consume the ")" token, or give an error if not found.
Go into statement parsing mode and get the loop body. The loop body could be a compound statement between "{" and "}", another for-loop, an assignment, etc. Statement parsing mode works the same as parsing the next statement in a function body.

The resulting syntax tree for the for-loop in the example above would look like the following:

for-loop syntax tree

After the syntax tree has been constructed, if there were no errors the compiler proceeds to the second stage. In the second stage, it traverses the syntax tree and outputs the linear assembly code equivalent for each function, statement, expression, etc. Since there are no explicit loop constructs in assembly, linearizing the above for-loop would result in something like the following in Consolite C:

// Initial expression
i = 0;
loop_start:
// Check condition
if (i < length) {
  goto loop_end;
}
// Body statement
s = s + array[i];
// Loop expression
i = i + 1;
goto loop_start;
loop_end:

When we actually run the original for-loop through the compiler, we get the following assembly (annotated for clarity):

; i is in register E, s is in register F,
; array is in register A, length is in register B
        MOVI N 0x0000
        MOV E N     ; i = 0
        MOV L N     ; Compiler artifact, does nothing.
sum_for_start:      ; Labels are prefixed with the
                    ; function name.
        MOV N B
        MOV M E
        CMP M N     ; Compare i and length.
        JB label    ; If i is below length, jumps to
                    ; code for setting the result of
                    ; the expression "i < length" to 1.
        MOVI M 0x0  ; Reached if i >= length, sets
                    ; the result of the expression
                    ; "i < length" to 0.
        JMPI label1 ; Skip the next line since
                    ; i >= length.
label:
        MOVI M 0x1  ; Sets the result of the expression
                    ; "i < length" to 1.
label1:
        MOV L M
        TST L L     ; Test if the result of the
                    ; expression "i < length" was 0.
                    ; If so, break out of the loop.
        JEQ sum_for_break
                    ; Start of the loop body.
        MOV N E     ; N = i
        MOV M A     ; M = address of first element
                    ; of array
        MOVI L 0x0002
        MUL N L     ; Each element is 2 bytes, so
                    ; multiply i by 2.
        ADD M N     ; Add i * 2 to the address of
                    ; the first element of the array
                    ; to get the current element.
        MOV N M
        LOAD N N    ; Load the value at that address
                    ; into N, now N = array[i].
        MOV M F     ; M = s
        ADD M N     ; M = s + array[i]
        MOV N M
        MOV F N     ; Put the result of s + array[i]
                    ; back into the register for s.
        MOV L N
sum_for_continue:   ; Label to jump to if we had used
                    ; the "continue" keyword anywhere.
        MOVI N 0x0001
        MOV M E     ; M = i
        ADD M N     ; M = i + 1
        MOV N M
        MOV E N     ; i = i + 1
        MOV L N
        JMPI sum_for_start  ; Jump to right before we
                            ; check the condition.
sum_for_break:      ; Label to jump if the condition is
                    ; false, or if we had used the
                    ; "break" keyword anywhere.

Register Conventions

Consolite has 16 registers, and groups of these registers are reserved for specific purposes by the compiler. The register SP is special because it is the stack pointer, so it is modified by PUSH, POP, CALL, and RET instructions. The register FP is the frame pointer, and this stores what the value of the stack pointer was at the beginning of the current function. This is used for getting parameters and local variables that are not stored in registers, because they are at a certain offset from the frame pointer.

Registers A through D are used to pass values of the first four parameters to a function call. If there are more than four parameters, they are stored on the stack prior the return address.

Registers E through K are used to store the values of local variables. If there are more variables than can fit in these registers, they are stored on the stack after the return address. Array contents are always stored on the stack.

Registers L, M, and N are scratch registers, used for storing intermediate calculations in mathematical expressions. Additionally, register L is used to store the return value of a function.

How Optimized Is the Output?

Good question! The answer is: not very. Hand-written assembly code is pretty much guaranteed to run faster than the output of this compiler, with far fewer memory accesses.

One optimization I have made is that whenever the compiler would have output a PUSH instruction directly followed by a POP instruction, it instead outputs a single MOV instruction or nothing at all. For example:

PUSH M  ; These lines can be omitted
POP M

PUSH M  ; These lines can be replaced
POP N   ; by the single MOV below

MOV N M ; Moves values without using main memory

Language Shortcomings

In order to simplify the coding of the compiler, I cut a few corners in the design of the language. As of this writing, the following is an (incomplete) list of the shortcomings of the language:

The compiler will not warn you if you have a non-void function that doesn't return a value along some code paths. For example, you could compile the function uint16 getValue() { }, but its return value will be garbage.
The boolean operators || and && don't short-circuit. Meaning, if you have code like func1() && func2(), both func1() and func2() will be called, regardless of the output of func1(). This is different from C, in which func2() will only be called if the output from func1() evaluated to true.
There is only one data type, uint16. This is an unsigned 16-bit integer that can hold values from 0 to 65,535, inclusive. I had hoped to at least include a signed 16-bit integer to make working with negative numbers easier, but then every expression would need to have either a signed or unsigned attribute and this would have complicated the code. I hope to add more data types in the future.
Array sizes are evaluated at compile-time, and if they include global variables then the initial value of the global variable is used. It would be nice to have a const keyword so that constant local variables could be used in array sizes, and also so that non-constant globals would not be permitted in array sizes.
There are no shorthand assignment operators like +=. In order to do something like a += b;, you would instead write a = a + b;.
There are no increment and decrement operators, so increasing a variable by 1 looks like i = i + 1; instead of i++;.
There is no preprocessor. This means no #includes, so all code must be written in one file. Additionally, this means no #defines, so in order to avoid magic numbers global variables must be used in their place, which incurs a performance overhead for the memory access.
There is no option to have a forward declaration of a function. This forces you to write your functions in a certain order. For example, you cannot have code like the following:

void func1();
void func2() {
  func1();
}
void func1() { }

Tetris Demo

Click the emulator screen below to run the emulator. Click off the screen to pause the emulator. Source code for this example can be found here.

Controls

Spacebar – start the game / drop piece to bottom.
Up arrow – rotate piece.
Left arrow – move piece left.
Right arrow – move piece right.
Down arrow – move piece down.

Mobile users: touch controls are not supported. Sorry!

Future Plans

Now that I have written the emulator, assembler, and compiler, I would like to begin work on the hardware side of the console. As I have mentioned in previous posts, I purchased a Mimas V2 FPGA board over the summer. That $50 board should have enough resources to support the microprocessor and display controller for a hardware version of Consolite. However, I may take a hiatus from this project to work on a few of my other ideas, so my next post may be about something entirely different. Stay tuned!