In my spare time this summer I set out to write a programming language from scratch. I’ve always been interested in learning about how programming languages worked, and sometimes the best way to learn something is to get your hands dirty. So after quite a few nights of work, I created Dash, a very simple procedural language, virtual machine, and bytecode.
So without further ado, let’s write some Dash!
So what’re you looking at?
Dash is similar to C, with a few style differences. Here we define a procedure main that accepts no variables and outputs an integer. Dash has only two data types it recognizes, signed integers and floating point numbers. There are no arrays, data structures, or even strings. This certainly makes Hello World difficult.
So what in the world is going on here?
Well Dash can’t handle strings, but it can handle characters. So we use an external procedure to print each character of “Hello, world”. We make do with what we have, right?
Let’s take a look at some examples more suited to Dash. Here’s a calculation of Pi using a series:
As you can see, the first syntax difference from C is in variable declarations. Multiple variables can be defined at a time, and their types are inferred from the right hand side of the declaration. This also means we can’t declare a variable without defining it. The only time that we explicitly specify type is in procedure declarations, where type can’t be inferred.
Dash also has some standard statements, such as If, While statements. Some notable exceptions though are For loops, and Break statements. There’s also no implicit casts, not even for constants. The type of a constant is an integer unless there is a decimal part. So you need to add a trailing ‘.’ on integer constants to make them real constants. Otherwise you’ll probably get an error about mismatching real and integer in an expression. It’s the most common error right now. Dash also supports recursion (which kind of comes with supporting procedure calling and a stack).
Here’s two different versions of Fibonnaci:
Mentioned earlier, Dash has a limited library of external procedures written in C. Currently there is just printing procedures, Sin, Cos, Tan, and Pow.
Here’s a quick example of using an external procedure:
The feature closest to my heart in Dash is multiple return types. Procedures can return multiple values, and can be used to initialize multiple variables, specify multiple parameters to a procedure, or return multiple values from the current procedure. It’s pretty neat!
Here’s an example of some of its uses:
Behind the Scenes - The Library
The Dash core library is written in C and is divided into two parts. A VM that executes procedures of bytecode, and a compiler that generates those procedures.
The VM is a register based machine that executes procedures of contiguous bytecode. A register is just a machine word sized slot for any kind of data. A procedure takes in a certain amount of registers as input, uses a certain amount of registers for calculations, then returns a certain sized range of registers as output. Instructions operate on registers using an index encoded in the instruction, and currently only operations on signed integers and floating point numbers are supported. A full list of instructions can be found here.
Some important instructions are listed below:
|dvm_opcode_addf||adds two floating point numbers|
|dvm_opcode_jmp_c||jmp by an offset if the value in a register is nonzero|
|dvm_opcode_cmpi_e||compares two integers for equality and stores the result in a register|
|dvm_opcode_call||calls a procedure, specifying an input range, and an output range|
|dvm_opcode_ret||finishes the current procedure, and returns a specified register range|
|dvm_opcode_stor||stores the machine word after this instruction in a specified register. This is the only double length instruction in the set|
The VM was relatively straightforward to write, and once it was written I could actually hand assemble instructions and write simple procedures like Fibonacci or exponentiation. The compiler on the other hand was not so simple.
The compiler is divided into two parts. The front-end that parses Dash source into an AST, and the back-end that generates bytecode from an AST.
The front-end was mostly written with off the shelf tools. The Dash language was designed to be context-free, so I could use parser generators instead of writing it all by hand. Lexical analysis is done with flex, and parsing is done with yacc. Input comes in as characters and is tokenized by flex into some simple tokens. Yacc then reduces these tokens, and builds an AST. Once the source is accepted and we have an AST, we move to the back-end to generate bytecode.
The back-end was the most difficult part of the project. There is an incredible amount of work you can put into a good code generator. In fact, there’s a lot of work needed to just make a passable code generator. Which is where I aimed to hit. The codegen process for Dash is essentially just a big tree traversal. I’m not going into the details today, but if there’s interest I’ll write a post about it.
The bytecode generated by the back-end is of decent quality. The codegen process is single pass, so there is no dead code elimination or expression simplification. The code you write is very close to the code that will be executed by the VM. Most of the optimization is done around register allocation.
Once the codegen process is completed, the bytecode is loaded into the VM and is available to be executed at any time in the future.
Try and write some expressions and check out the disassembly:
Overall I’m very happy with the project. It’s not exactly groundbreaking, but it’s just cool to be able to understand the whole process. I don’t have any plans to add new features in the immediate future, but that doesn’t mean you can’t! Feel free to experiment with my code, and create something new.
Questions, comments, found a bug? Leave a comment below!