I've started working on an ambitious project that some folks here might find interesting. (I have a habit of starting ambitious projects and not finishing them, but I've been interested in SNES emulation for at least a year and a half now, so hopefully I will stick with it!)
Though it is ambitious, I've put lots of thought into it and I'm heading in a direction that seems good so far. Over the last few months I've made several attempts to capture the semantics of a 65816 (for example) in a "simple specification" which I could then write tools to generate code from (i.e. generate an emulation core for that chip). I wanted a representation where each "timing template" fit on one line. After struggling with various attempts to capture this information, I've finally hit on a representation with the characteristics I want.
A "timing template" is meant to be a loose description of the externally visible effects of an instruction during its execution. In other words, it describes what the chip does in each cycle -- a memory read or write, an operand fetch, or some internal operation. After struggling with various attempts to capture this information, I've hit on a format that I think will work. Here are some examples from my G65816 templates (which are derived directly from table 6-7 in the W65C816S datasheet):
1a is used for ALU instructions with Absolute addressing mode. 1d is used for R-M-W instructions with Absolute addressing mode. 10a would be used for ALU instructions with Direct addressing mode. Each {term} represents one cycle. Extra effects which appear outside the curly braces can be thought of as part of the previous cycle. The ?1 at the beginning of certain cycles is a "predicate", and means that cycle is ONLY part of the instruction for situations where "condition 1" is true. In this case, "condition 1" means the instruction has 16-bit data (i.e. the X flag=0 for instructions which use X/Y, or the M flag=0 for instructions which use the accumulator).
So, what can I do with these things, anyway?
Well, I am hoping to produce a code generator (written in Java) which can output various emulation cores for these chips. I hope to eventually use the same code generator infrastructure to produce slow-but-readable C cores, optimized assembly cores and an exhaustive set of automated tests for several chips (NES CPU, SNES CPU and APU, Gameboy CPU, whatever). I would then use these cores as part of the implementation of a multi-system emulator. The slow-but-readable C cores will be derived more or less directly from the timing templates of each chip, so they will be easy to read and debug (but still cycle-accurate). The optimized assembly cores will have a significantly different internal structure, but using automated tests I will exhaustively compare their behaviour with that of the slow-but-readable C cores and ensure that every instruction has exactly the same visible side effects in the assembly core as in its slower cousin.
How will this code generator work?
Essentially, it will start with a table of all instructions (in a format easily derived from available documents). It will parse the table into an internal format, and compute the proper base "timing template" for each instruction.
For each instruction, the code generator will scan the template, counting the minimum and maximum number of cycles, and comparing those to what is known about the instruction (to make sure it is consistent). It will use the predicates mentioned in the template to decide if we need multiple instantiations of this instruction (for example, if it has the ?1 predicate in it, we will need either M0 and M1 versions, or X0 and X1 versions).
It will the proceed to refine the template into a more useful form, for a particular instantiation. It will essentially search-and-replace certain parts of it with other parts. For example, when generating the assembly core, it might replace "{F_AAL} {F_AAH}" with something like "{}{F16_AA}", because the assembly core is going to use one function to read both bytes simultaneously. (This is kind of like a peephole optimization).
The replacement is context-sensitive, so if necessary, I can interpret the same term in different ways depending on the instruction it is used in. For example, in 1a for a STA instruction, the {AX_L} might become an {AW_L} to write the low byte. For a different instruction like ADC, it might become {AR_L} instead. (Or for the assembly core and M0, both cycles would be replaced by {}{AR16_T}).
The predicate ?1 is a good example. In the slow-but-readable C core, "{AX_L} ?1{AX_H}" might generate code for two cycles, with an if-statement around the second cycle's code so it only executes if M=0. In the assembly core, the entire instruction will be instantiated twice and different handlers will be generated for each mode (i.e. M1 and M0). So the predicate can be evaluated at code-generation time, as it were... the M1 handler will have those two terms replaced by something like {AR_L}", and the M0 handler would have them replaced by "{}{AR16_T}".
Notice that refinement can add semantic information which was not present in the original templates. For example, in the 10a template, somehow between the {F_DO} term and the first direct read or write (the {DX_L} term) we need to take the direct address "DO" and add the D register to it. The template intentionally does NOT capture this requirement (leaving it out makes it easier to manipulate the template in various ways). Either through refinement, or later when we generate the code for those terms, we need to make sure the effect of calculating the direct address happens. Maybe in the C core, we will use refinement to insert an explicit action "DO += D" after the {F_DO} term. But in the assembly core, I might instead expect the function called to implement the read/write term to handle the calculation of the address.
After refinement, there are some steps which manipulate the structure instantiated instruction templates. (Long story short--in some of the assembly cores, I plan to split most of the instruction handlers into two routines chained together, to reduce the amount of duplicated code. At this point, it would find points in the instantiated templates where they could be split, and work out what the two halves of each instruction's handler are, etc).
Finally, we generate code in a greedy fashion by matching parts of the final template string and spitting out prefab blocks of code. The advantage here is that each prefab block of code is represented in one place, and can be re-used in different but similar templates. So if I have a bug in the code for some addressing mode (for example), I only have to fix it in one place. This is almost like using a macro in the assembler, but its more convenient for me this way.
As mentioned above, I also want to generate automated tests for each instruction of each CPU. I haven't thought about how that will work in as much detail, but hopefully the same code generator thingy will support that task as well. =)
How far have I got?
So far, I've got instruction lists for N6502 (documented only) and G65816 and SPC700 which are being parsed into usable data structures. I've got timing templates for G65816 and SPC700, and I'm computing the correct timing template for each instruction for the G65816. I'm about ready to try and write the first code for matching and refining parts of G65816 templates.
Though it is ambitious, I've put lots of thought into it and I'm heading in a direction that seems good so far. Over the last few months I've made several attempts to capture the semantics of a 65816 (for example) in a "simple specification" which I could then write tools to generate code from (i.e. generate an emulation core for that chip). I wanted a representation where each "timing template" fit on one line. After struggling with various attempts to capture this information, I've finally hit on a representation with the characteristics I want.
A "timing template" is meant to be a loose description of the externally visible effects of an instruction during its execution. In other words, it describes what the chip does in each cycle -- a memory read or write, an operand fetch, or some internal operation. After struggling with various attempts to capture this information, I've hit on a format that I think will work. Here are some examples from my G65816 templates (which are derived directly from table 6-7 in the W65C816S datasheet):
Code:
"1a : {F_AAL} {F_AAH} {AX_L} ?1{AX_H} alu_op",
"1b : {F_NewPCL} {F_NewPCH} PC=NewPC",
"1c : {F_NewPCL} {F_NewPCH} {IO} {PUSH_PCH} {PUSH_PCL} PC=NewPC",
"1d : {F_AAL} {F_AAH} {AR_L} ?1{AR_H} {rmw_op} ?1{AW_H} {AW_L}",
"10a: {F_DO} ?2{IO} {DX_L} ?1{DX_H} alu_op",
"1b : {F_NewPCL} {F_NewPCH} PC=NewPC",
"1c : {F_NewPCL} {F_NewPCH} {IO} {PUSH_PCH} {PUSH_PCL} PC=NewPC",
"1d : {F_AAL} {F_AAH} {AR_L} ?1{AR_H} {rmw_op} ?1{AW_H} {AW_L}",
"10a: {F_DO} ?2{IO} {DX_L} ?1{DX_H} alu_op",
1a is used for ALU instructions with Absolute addressing mode. 1d is used for R-M-W instructions with Absolute addressing mode. 10a would be used for ALU instructions with Direct addressing mode. Each {term} represents one cycle. Extra effects which appear outside the curly braces can be thought of as part of the previous cycle. The ?1 at the beginning of certain cycles is a "predicate", and means that cycle is ONLY part of the instruction for situations where "condition 1" is true. In this case, "condition 1" means the instruction has 16-bit data (i.e. the X flag=0 for instructions which use X/Y, or the M flag=0 for instructions which use the accumulator).
So, what can I do with these things, anyway?
Well, I am hoping to produce a code generator (written in Java) which can output various emulation cores for these chips. I hope to eventually use the same code generator infrastructure to produce slow-but-readable C cores, optimized assembly cores and an exhaustive set of automated tests for several chips (NES CPU, SNES CPU and APU, Gameboy CPU, whatever). I would then use these cores as part of the implementation of a multi-system emulator. The slow-but-readable C cores will be derived more or less directly from the timing templates of each chip, so they will be easy to read and debug (but still cycle-accurate). The optimized assembly cores will have a significantly different internal structure, but using automated tests I will exhaustively compare their behaviour with that of the slow-but-readable C cores and ensure that every instruction has exactly the same visible side effects in the assembly core as in its slower cousin.
How will this code generator work?
Essentially, it will start with a table of all instructions (in a format easily derived from available documents). It will parse the table into an internal format, and compute the proper base "timing template" for each instruction.
For each instruction, the code generator will scan the template, counting the minimum and maximum number of cycles, and comparing those to what is known about the instruction (to make sure it is consistent). It will use the predicates mentioned in the template to decide if we need multiple instantiations of this instruction (for example, if it has the ?1 predicate in it, we will need either M0 and M1 versions, or X0 and X1 versions).
It will the proceed to refine the template into a more useful form, for a particular instantiation. It will essentially search-and-replace certain parts of it with other parts. For example, when generating the assembly core, it might replace "{F_AAL} {F_AAH}" with something like "{}{F16_AA}", because the assembly core is going to use one function to read both bytes simultaneously. (This is kind of like a peephole optimization).
The replacement is context-sensitive, so if necessary, I can interpret the same term in different ways depending on the instruction it is used in. For example, in 1a for a STA instruction, the {AX_L} might become an {AW_L} to write the low byte. For a different instruction like ADC, it might become {AR_L} instead. (Or for the assembly core and M0, both cycles would be replaced by {}{AR16_T}).
The predicate ?1 is a good example. In the slow-but-readable C core, "{AX_L} ?1{AX_H}" might generate code for two cycles, with an if-statement around the second cycle's code so it only executes if M=0. In the assembly core, the entire instruction will be instantiated twice and different handlers will be generated for each mode (i.e. M1 and M0). So the predicate can be evaluated at code-generation time, as it were... the M1 handler will have those two terms replaced by something like {AR_L}", and the M0 handler would have them replaced by "{}{AR16_T}".
Notice that refinement can add semantic information which was not present in the original templates. For example, in the 10a template, somehow between the {F_DO} term and the first direct read or write (the {DX_L} term) we need to take the direct address "DO" and add the D register to it. The template intentionally does NOT capture this requirement (leaving it out makes it easier to manipulate the template in various ways). Either through refinement, or later when we generate the code for those terms, we need to make sure the effect of calculating the direct address happens. Maybe in the C core, we will use refinement to insert an explicit action "DO += D" after the {F_DO} term. But in the assembly core, I might instead expect the function called to implement the read/write term to handle the calculation of the address.
After refinement, there are some steps which manipulate the structure instantiated instruction templates. (Long story short--in some of the assembly cores, I plan to split most of the instruction handlers into two routines chained together, to reduce the amount of duplicated code. At this point, it would find points in the instantiated templates where they could be split, and work out what the two halves of each instruction's handler are, etc).
Finally, we generate code in a greedy fashion by matching parts of the final template string and spitting out prefab blocks of code. The advantage here is that each prefab block of code is represented in one place, and can be re-used in different but similar templates. So if I have a bug in the code for some addressing mode (for example), I only have to fix it in one place. This is almost like using a macro in the assembler, but its more convenient for me this way.
As mentioned above, I also want to generate automated tests for each instruction of each CPU. I haven't thought about how that will work in as much detail, but hopefully the same code generator thingy will support that task as well. =)
How far have I got?
So far, I've got instruction lists for N6502 (documented only) and G65816 and SPC700 which are being parsed into usable data structures. I've got timing templates for G65816 and SPC700, and I'm computing the correct timing template for each instruction for the G65816. I'm about ready to try and write the first code for matching and refining parts of G65816 templates.