I've been working a bit on a compiler that targets the 6502 (yes, I know there are plenty of things that are already good enough, but its mostly just a fun project to work on for me). It supports operations on 1, 2, 3, and 4 byte values, including the usual 6502 operations, plus variable shift operations (including arithmetic shift right), multiply, divide, modulo, logical operations, and others.
David Wheeler's ideas were a big help in figuring out the runtime for a high level language. I decided to build a split stack machine.
There are 4 data stacks -- STACK0, STACK1, STACK2, and STACK3. STACK0 could be on the zero page to save code space and allow some ldy ZP, x instructions, and STACK1, STACK2, STACK3 can be elsewhere. There is only one stack pointer, stored in the X register. This allows stack access quite quickly using indexed absolute addressing. Multi-byte values are stored across multiple stacks. For example, the bytes of a 24-bit integer would be stored at STACK0+x, STACK1+x, and STACK2+x. Manipulating the stack pointer is also very cheap, it only takes a dex or inx.
As an optimization, the current top-of-stack value is stored contiguously in the zero page at TOS. This means that most operations, which conceptually pop 2 values from the stack, then push the result back on the stack, only have to modify TOS in-place and then increment x in order to calculate their values.
Here's an example, the 2-byte addition operation:
A generated program would look something like this:
There are a few things that I am considering to improve the performance. Currently, the stack usage is very wasteful. If you use 1 byte operands, then STACK1, STACK2, and STACK3 are unused. Because the 1 and 2 byte operations take less code space than the 3 and 4 byte operations, I could create, for example, 1 byte operations the operate on STACK1 instead of STACK0, and 2 byte operations that operate on STACK2 and 3 instead of 0 and 1. Packing multiple values into one stack slot could be powerful, but it would also be very difficult to write a compiler that can take full advantage of it. A good compiler would generate normal 6502 code for 1 byte instructions anyway, rather than using the stack.
David Wheeler's ideas were a big help in figuring out the runtime for a high level language. I decided to build a split stack machine.
There are 4 data stacks -- STACK0, STACK1, STACK2, and STACK3. STACK0 could be on the zero page to save code space and allow some ldy ZP, x instructions, and STACK1, STACK2, STACK3 can be elsewhere. There is only one stack pointer, stored in the X register. This allows stack access quite quickly using indexed absolute addressing. Multi-byte values are stored across multiple stacks. For example, the bytes of a 24-bit integer would be stored at STACK0+x, STACK1+x, and STACK2+x. Manipulating the stack pointer is also very cheap, it only takes a dex or inx.
As an optimization, the current top-of-stack value is stored contiguously in the zero page at TOS. This means that most operations, which conceptually pop 2 values from the stack, then push the result back on the stack, only have to modify TOS in-place and then increment x in order to calculate their values.
Here's an example, the 2-byte addition operation:
Code:
add2: ; 30 cycles total
clc ; 2 cycles
lda STACK0, x ; 4 cycles
adc TOS ; 3 cycles
sta TOS ; 3 cycles
lda STACK1, x ; 4 cycles
adc TOS+1 ; 3 cycles
sta TOS+1 ; 3 cycles
inx ; 2 cycles
rts ; 6 cycles
clc ; 2 cycles
lda STACK0, x ; 4 cycles
adc TOS ; 3 cycles
sta TOS ; 3 cycles
lda STACK1, x ; 4 cycles
adc TOS+1 ; 3 cycles
sta TOS+1 ; 3 cycles
inx ; 2 cycles
rts ; 6 cycles
A generated program would look something like this:
Code:
main: ; calculates $1234 + $1337
; put $1234 on the stack
lda #<$1234
sta STACK0, x
lda #>$1234
sta STACK1, x
; put $1337 on the top of the stack
lda #<$1337
sta TOS
lda #>$1337
sta TOS+1
jmp add2
; put $1234 on the stack
lda #<$1234
sta STACK0, x
lda #>$1234
sta STACK1, x
; put $1337 on the top of the stack
lda #<$1337
sta TOS
lda #>$1337
sta TOS+1
jmp add2
There are a few things that I am considering to improve the performance. Currently, the stack usage is very wasteful. If you use 1 byte operands, then STACK1, STACK2, and STACK3 are unused. Because the 1 and 2 byte operations take less code space than the 3 and 4 byte operations, I could create, for example, 1 byte operations the operate on STACK1 instead of STACK0, and 2 byte operations that operate on STACK2 and 3 instead of 0 and 1. Packing multiple values into one stack slot could be powerful, but it would also be very difficult to write a compiler that can take full advantage of it. A good compiler would generate normal 6502 code for 1 byte instructions anyway, rather than using the stack.