My crazy project - NESdev BBS

My crazy project
by mozz on 2006-04-23 (#12160)

I've started working on an ambitious project that some folks here might find interesting. (I have a habit of starting ambitious projects and not finishing them, but I've been interested in SNES emulation for at least a year and a half now, so hopefully I will stick with it!)

Though it is ambitious, I've put lots of thought into it and I'm heading in a direction that seems good so far. Over the last few months I've made several attempts to capture the semantics of a 65816 (for example) in a "simple specification" which I could then write tools to generate code from (i.e. generate an emulation core for that chip). I wanted a representation where each "timing template" fit on one line. After struggling with various attempts to capture this information, I've finally hit on a representation with the characteristics I want.

A "timing template" is meant to be a loose description of the externally visible effects of an instruction during its execution. In other words, it describes what the chip does in each cycle -- a memory read or write, an operand fetch, or some internal operation. Here are some examples from my G65816 templates (which are derived directly from table 6-7 in the W65C816S datasheet):

Code:

       "1a : {F_AAL} {F_AAH} {AX_L} ?1{AX_H} alu_op",
       "1b : {F_NewPCL} {F_NewPCH} PC=NewPC",
       "1c : {F_NewPCL} {F_NewPCH} {IO} {PUSH_PCH} {PUSH_PCL} PC=NewPC",
       "1d : {F_AAL} {F_AAH} {AR_L} ?1{AR_H} {rmw_op} ?1{AW_H} {AW_L}",

       "10a: {F_DO} ?2{IO} {DX_L} ?1{DX_H} alu_op",

1a is used for ALU instructions with Absolute addressing mode. 1d is used for R-M-W instructions with Absolute addressing mode. 10a would be used for ALU instructions with Direct addressing mode. Each {term} represents one cycle. Extra effects which appear outside the curly braces can be thought of as part of the previous cycle. The ?1 at the beginning of certain cycles is a "predicate", and means that cycle is ONLY part of the instruction for situations where "condition 1" is true. In this case, "condition 1" means the instruction has 16-bit data (i.e. the X flag=0 for instructions which use X/Y, or the M flag=0 for instructions which use the accumulator).
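To make the format concrete, here is a minimal sketch of how one template line could be split into cycles, predicates and trailing effects. It is written in Python for brevity rather than the Java the real tool uses, and the names `Cycle`, `TOKEN` and `parse_template` are made up for this illustration.

```python
import re
from typing import List, NamedTuple, Optional

class Cycle(NamedTuple):
    term: str                 # text inside the braces, e.g. "F_AAL"
    predicate: Optional[int]  # 1 for "?1", None if the cycle is unconditional
    effects: List[str]        # bare words that follow this cycle

# Either an optionally predicated {term}, or a bare word (an extra effect).
TOKEN = re.compile(r"(?:\?(\d+))?\{([^}]*)\}|(\S+)")

def parse_template(line: str):
    """Split 'name : {A} ?1{B} effect' into (name, list of Cycle)."""
    name, body = line.split(":", 1)
    cycles: List[Cycle] = []
    for pred, term, effect in TOKEN.findall(body):
        if effect:
            # Text outside the braces belongs to the previous cycle.
            cycles[-1].effects.append(effect)
        else:
            cycles.append(Cycle(term, int(pred) if pred else None, []))
    return name.strip(), cycles

name, cycles = parse_template("1a : {F_AAL} {F_AAH} {AX_L} ?1{AX_H} alu_op")
for cycle in cycles:
    print(name, cycle)
```

Each `Cycle` keeps its own predicate, so later passes can decide per cycle whether to drop it, guard it with an if-test, or fuse it with a neighbour.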

So, what can I do with these things, anyway?

Well, I am hoping to produce a code generator (written in Java) which can output various emulation cores for these chips. I hope to eventually use the same code generator infrastructure to produce slow-but-readable C cores, optimized assembly cores and an exhaustive set of automated tests for several chips (NES CPU, SNES CPU and APU, Gameboy CPU, whatever). I would then use these cores as part of the implementation of a multi-system emulator. The slow-but-readable C cores will be derived more or less directly from the timing templates of each chip, so they will be easy to read and debug (but still cycle-accurate). The optimized assembly cores will have a significantly different internal structure, but using automated tests I will exhaustively compare their behaviour with that of the slow-but-readable C cores and ensure that every instruction has exactly the same visible side effects in the assembly core as in its slower cousin.

How will this code generator work?

Essentially, it will start with a table of all instructions (in a format easily derived from available documents). It will parse the table into an internal format, and compute the proper base "timing template" for each instruction.

For each instruction, the code generator will scan the template, counting the minimum and maximum number of cycles, and comparing those to what is known about the instruction (to make sure it is consistent). It will use the predicates mentioned in the template to decide if we need multiple instantiations of this instruction (for example, if it has the ?1 predicate in it, we will need either M0 and M1 versions, or X0 and X1 versions).
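The cycle-counting check could look roughly like the sketch below (Python standing in for the Java tool; `cycle_bounds` and `consistent_with_table` are hypothetical names). Note that it only counts the cycles written in the template; whether the opcode fetch is included depends on how the templates are written, so the known counts passed in are assumed to use the same convention.

```python
import re

# One cycle: an optional ?N predicate followed by a {term}.
CYCLE = re.compile(r"(?:\?(\d+))?\{[^}]*\}")

def cycle_bounds(template: str):
    """Return (min_cycles, max_cycles, predicates used) for a template body."""
    lo = hi = 0
    predicates = set()
    for match in CYCLE.finditer(template):
        hi += 1                      # every cycle counts toward the maximum
        if match.group(1) is None:
            lo += 1                  # unconditional cycles also count toward the minimum
        else:
            predicates.add(int(match.group(1)))
    return lo, hi, predicates

def consistent_with_table(template: str, known_min: int, known_max: int) -> bool:
    """Compare the template against cycle counts taken from the instruction table."""
    lo, hi, _ = cycle_bounds(template)
    return (lo, hi) == (known_min, known_max)

print(cycle_bounds("{F_DO} ?2{IO} {DX_L} ?1{DX_H} alu_op"))
```

The returned predicate set is what would drive the decision about instantiations: a template that mentions predicate 1 needs both a 16-bit and an 8-bit version.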

It will then proceed to refine the template into a more useful form for a particular instantiation. It will essentially search-and-replace certain parts of it with other parts. For example, when generating the assembly core, it might replace "{F_AAL} {F_AAH}" with something like "{}{F16_AA}", because the assembly core is going to use one function to read both bytes simultaneously. (This is kind of like a peephole optimization.)

The replacement is context-sensitive, so if necessary, I can interpret the same term in different ways depending on the instruction it is used in. For example, in 1a for a STA instruction, the {AX_L} might become an {AW_L} to write the low byte. For a different instruction like ADC, it might become {AR_L} instead. (Or for the assembly core and M0, both cycles would be replaced by {}{AR16_T}).
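A rough sketch of this kind of context-sensitive refinement, in Python rather than the tool's Java. The `refine` function and the set of store mnemonics are assumptions made for the example; the real classification would come from the instruction table.

```python
# Hypothetical classification: these mnemonics write their operand, others read it.
STORES = {"STA", "STX", "STY", "STZ"}

def refine(template: str, mnemonic: str, target: str) -> str:
    """Rewrite generic terms into concrete ones for one instruction and target."""
    access = "W" if mnemonic in STORES else "R"
    out = template.replace("{AX_L}", "{A%s_L}" % access)
    out = out.replace("{AX_H}", "{A%s_H}" % access)
    if target == "asm":
        # Peephole-style rewrite: one 16-bit fetch instead of two byte fetches.
        out = out.replace("{F_AAL} {F_AAH}", "{}{F16_AA}")
    return out

template_1a = "{F_AAL} {F_AAH} {AX_L} ?1{AX_H} alu_op"
print(refine(template_1a, "STA", "c"))
print(refine(template_1a, "ADC", "asm"))
```

The same generic template yields write cycles for STA and read cycles for ADC, and only the assembly target gets the fused address fetch.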

The predicate ?1 is a good example. In the slow-but-readable C core, "{AX_L} ?1{AX_H}" might generate code for two cycles, with an if-statement around the second cycle's code so it only executes if M=0. In the assembly core, the entire instruction will be instantiated twice and different handlers will be generated for each mode (i.e. M1 and M0). So the predicate can be evaluated at code-generation time, as it were... the M1 handler will have those two terms replaced by something like "{AR_L}", and the M0 handler will have them replaced by "{}{AR16_T}".
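Evaluating a predicate at code-generation time might be sketched like this (Python for illustration; `instantiate` and `fuse_16bit` are hypothetical names, and the `truth` mapping says whether each numbered condition holds for the handler being generated).

```python
import re

PREDICATED = re.compile(r"\?(\d+)(\{[^}]*\})")

def instantiate(template: str, truth: dict) -> str:
    """Resolve every ?N predicate: keep the cycle if truth[N], else drop it."""
    def resolve(match):
        return match.group(2) if truth[int(match.group(1))] else ""
    resolved = PREDICATED.sub(resolve, template)
    return " ".join(resolved.split())      # tidy up any doubled spaces

def fuse_16bit(template: str) -> str:
    """Assembly-core rewrite: two byte reads become one 16-bit read."""
    return template.replace("{AR_L} {AR_H}", "{}{AR16_T}")

terms = "{AR_L} ?1{AR_H}"
m1_handler = instantiate(terms, {1: False})              # 8-bit data
m0_handler = fuse_16bit(instantiate(terms, {1: True}))   # 16-bit data
print(m1_handler, "|", m0_handler)
```

Condition 1 means 16-bit data, so it is true for the M0 handler and false for the M1 handler, which gives the two results described above.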

Notice that refinement can add semantic information which was not present in the original templates. For example, in the 10a template, somehow between the {F_DO} term and the first direct read or write (the {DX_L} term) we need to take the direct address "DO" and add the D register to it. The template intentionally does NOT capture this requirement (leaving it out makes it easier to manipulate the template in various ways). Either through refinement, or later when we generate the code for those terms, we need to make sure the effect of calculating the direct address happens. Maybe in the C core, we will use refinement to insert an explicit action "DO += D" after the {F_DO} term. But in the assembly core, I might instead expect the function called to implement the read/write term to handle the calculation of the address.
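The two options for the direct-address calculation can be sketched as follows (Python, with made-up names). The explicit action is spelled "DO+=D" without spaces so that it stays a single effect token, and the helper assumes native mode, where the direct address wraps within bank 0.

```python
def refine_direct(template: str, target: str) -> str:
    """Make the 'add D to the direct offset' step explicit for the C core only."""
    if target == "c":
        # The effect follows {F_DO}, so it counts as part of that cycle.
        return template.replace("{F_DO}", "{F_DO} DO+=D", 1)
    return template              # assembly core: left to the read/write helper

def read_direct(memory: dict, direct_offset: int, d_reg: int) -> int:
    """Assembly-style helper that folds the address calculation into the access."""
    address = (d_reg + direct_offset) & 0xFFFF   # assumed: wraps within bank 0
    return memory.get(address, 0)

template_10a = "{F_DO} ?2{IO} {DX_L} ?1{DX_H} alu_op"
print(refine_direct(template_10a, "c"))
print(refine_direct(template_10a, "asm"))
```

Either way the template itself stays free of the requirement; it only shows up in the refined string or inside the helper.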

After refinement, there are some steps which manipulate the structure of the instantiated instruction templates. (Long story short--in some of the assembly cores, I plan to split most of the instruction handlers into two routines chained together, to reduce the amount of duplicated code. At this point, it would find points in the instantiated templates where they could be split, and work out what the two halves of each instruction's handler are, etc.)

Finally, we generate code in a greedy fashion by matching parts of the final template string and spitting out prefab blocks of code. The advantage here is that each prefab block of code is represented in one place, and can be re-used in different but similar templates. So if I have a bug in the code for some addressing mode (for example), I only have to fix it in one place. This is almost like using a macro in the assembler, but it's more convenient for me this way.
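A minimal sketch of greedy matching against prefab blocks, in Python. The `PREFAB` table is a made-up example; its C lines are borrowed from the style of generated 6502 code shown later in this thread, not from the real generator.

```python
# Hypothetical prefab table: template fragment -> lines of C to emit.
PREFAB = {
    "{F_AAL} {F_AAH}": ["SETL(AA,Read(PC++));", "SETH(AA,Read(PC++));"],
    "{F_AAL}": ["SETL(AA,Read(PC++));"],
    "{AR_L}": ["T = Read(AA);"],
    "alu_op": ["do_alu(opcode);"],
}

def generate(template: str):
    """Greedy codegen: at each position, emit the block for the longest matching key."""
    keys = sorted(PREFAB, key=len, reverse=True)
    lines, rest = [], template.strip()
    while rest:
        for key in keys:
            if rest.startswith(key):
                lines.extend(PREFAB[key])
                rest = rest[len(key):].lstrip()
                break
        else:
            # No prefab block matched at this position.
            raise ValueError("no prefab block matches: " + rest)
    return lines

for line in generate("{F_AAL} {F_AAH} {AR_L} alu_op"):
    print(line)
```

Because every template that contains "{F_AAL} {F_AAH}" goes through the same table entry, a fix to that entry reaches every handler that uses it.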

As mentioned above, I also want to generate automated tests for each instruction of each CPU. I haven't thought about how that will work in as much detail, but hopefully the same code generator thingy will support that task as well. =)

How far have I got?

So far, I've got instruction lists for N6502 (documented only) and G65816 and SPC700 which are being parsed into usable data structures. I've got timing templates for G65816 and SPC700, and I'm computing the correct timing template for each instruction for the G65816. I'm about ready to try and write the first code for matching and refining parts of G65816 templates.

by mozz on 2006-04-23 (#12161)

And by "we" I mean "I". (The "royal We"). Years of working on projects with collective code ownership have apparently messed me up in the head.

by blargg on 2006-04-23 (#12164)

I have been thinking of writing a code generator for a few CPU cores, since it is one area where extensive abstraction tools could significantly improve ease-of-coding without sacrificing speed. The main demotivator is that I don't think it'd help much, due to a small number of instructions being much more common than others.

What practical benefits do you hope to achieve? Some that come to mind are many opportunities for re-use and collaboration at various levels, and the ability to generate cores with various levels of emulation accuracy.

Quote:
The advantage here is that each prefab block of code is represented in one place, and can be re-used in different but similar templates. So if I have a bug in the code for some addressing mode (for example), I only have to fix it in one place.

A while back I wrote a fresh 6502 core that had absolutely no duplication of any code, and it was a breeze to get debugged since any bug would affect many opcodes, either all using a certain addressing mode or all using a certain operation.

by mozz on 2006-04-24 (#12199)

blargg wrote:
What practical benefits do you hope to achieve? Some that come to mind are many opportunities for re-use and collaboration at various levels, and the ability to generate cores with various levels of emulation accuracy.

Mostly I want to experiment with different strategies for making the assembly cores smaller and faster. I want to be able to (for example) change what register a particular temporary value is stored in, without having to track down 20 places in the code and change them all, and without having to riddle my assembly code with macros.

You could think of my tool as a very specialized compiler which has the input (the specs for each chip) built in. Its Intermediate Representation (IR) is just the template string. Refining the template is sort of like optimizing the IR (though peephole optimizations are usually done near the end). Strings are easy to manipulate, and it can generate straight-line code from each string template with ease, because it doesn't have to worry about control flow and register allocation and so on---the code it is generating is specialized for one particular task (cycle-accurate emulation of an instruction) so not much smarts will be needed for that (I hope).

In the last few months, I have written by hand most of a slow-but-readable C core for a 65816. I have written most of an optimized-for-size assembly core for SPC700, and some parts of an optimized-for-size assembly core for a 65816. Enough so that I have a pretty good idea of what I want to generate. I more or less wrote those by manually following the steps I hope to automate with my tool. Which is good, because I sometimes overlook little details (and introduce hard-to-find bugs) when I do a bunch of manual steps like that. Once I get the tool doing the right thing, I can change it and generate a new core with the press of a button. Unless I introduce a bug in the tool, I don't have to worry about one of the derivation steps being performed incorrectly.

Also, by having the slow-but-readable C core and the very different optimized assembly core, I can test them against each other. Any discrepancy points to some sort of bug or oversight, either in the code generator or in the blocks of code it is generating.

Ultimately I want to have fully automated tests for each part of my emulator. For the cores, this means exhaustive tests for each instruction, testing all the corner cases of each addressing mode (wrapping behaviour, exact timing of read and write cycles, interrupting for DMA, multi-byte accesses that overlap different memory regions, etc). For example, the assembly core for a 65816 calls one function to fetch two bytes, and the C core does two 8-bit reads and combines the result afterwards. The assembly core has 5 separate dispatch tables and uses different handlers for M0 and M1 versions of an instruction; the C core uses one switch statement and has if-tests where the M flag affects execution. The assembly core relies on X86 flag values wherever possible to compute the flags of the emulated processor; the C core will have to evaluate them separately. The C core will be an executable embodiment of "what the semantics should be". The assembly core's behaviour will then be compared against the C core's behaviour in great detail to make sure the optimizations did not break the assembly core in some subtle fashion. Whenever I change things or fix a bug or disable or enable new optimizations, I generate both cores and run the automated tests to prove they are still equivalent.
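The idea of comparing two differently structured implementations can be shown on a tiny scale. The sketch below (Python; both functions are illustrative models of 8-bit binary ADC, not output from either core) computes the flags in two unrelated ways and exhaustively checks that they agree.

```python
from itertools import product

def adc_reference(a, t, c):
    """Readable model: returns (result, C, V, N, Z) for 8-bit binary ADC."""
    total = a + t + c
    res = total & 0xFF
    carry = total >> 8
    overflow = 1 if ((a ^ res) & (t ^ res) & 0x80) else 0
    return res, carry, overflow, res >> 7, int(res == 0)

def adc_optimized(a, t, c):
    """Differently structured model standing in for the assembly core."""
    res = (a + t + c) & 0xFF
    carry = int(a + t + c > 0xFF)
    # Overflow from the signed point of view instead of bit tricks.
    sa = a - 256 if a & 0x80 else a
    st = t - 256 if t & 0x80 else t
    overflow = int(not -128 <= sa + st + c <= 127)
    return res, carry, overflow, int(res >= 0x80), int(not res)

def find_discrepancies(ref, opt):
    """Try every (A, T, carry-in) combination and collect the mismatches."""
    return [args for args in product(range(256), range(256), (0, 1))
            if ref(*args) != opt(*args)]

print(len(find_discrepancies(adc_reference, adc_optimized)))
```

A real harness would compare whole machine states and bus traces rather than one ALU operation, but the shape is the same: enumerate inputs, run both, report any difference.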

Another kind of easy automated test I thought of is ROM-loading tests. Write a tool which processes the output of a ROM-cleaning utility (e.g. for SNES ROMs, I'd use NSRT; for NES ROMs, perhaps GoodNES?) and caches the information about how each ROM should be loaded: LoROM/HiROM or mapper type, what special chips are activated, etc. Then test the emulator by having it load each ROM and compare the decisions it made while loading the ROM with the cached info from the ROM tool. So if the cached info says the ROM uses a certain mapper, you check that the emulator decided to set up the same mapper. There's one test for each ROM, which fails if there are any discrepancies (i.e. it fails if the emulator interprets the ROM differently than the tool's database said it should be interpreted). Collect a large set of ROMs for each system, and whenever I make changes to the ROM-loading code, run these tests.
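A sketch of what such a test runner could look like, in Python. The field names ("layout", "chips"), the ROM names and the loader hook are all hypothetical; the cached database here is written by hand rather than parsed from any real tool's output.

```python
def compare_load(expected: dict, actual: dict) -> dict:
    """Return the fields on which the emulator disagrees with the cached info."""
    return {key: (want, actual.get(key))
            for key, want in expected.items()
            if actual.get(key) != want}

def run_rom_tests(cached_db: dict, load_rom) -> dict:
    """cached_db maps ROM name -> expected decisions; load_rom is the emulator hook."""
    failures = {}
    for name, expected in cached_db.items():
        diff = compare_load(expected, load_rom(name))
        if diff:
            failures[name] = diff        # one failing test per ROM
    return failures

# Hand-written stand-in for the cached database and for the emulator's loader.
cached_db = {
    "game_a.sfc": {"layout": "LoROM", "chips": []},
    "game_b.sfc": {"layout": "HiROM", "chips": ["DSP-1"]},
}
always_lorom = lambda name: {"layout": "LoROM", "chips": []}
print(run_rom_tests(cached_db, always_lorom))
```

Each entry in the result names a ROM and shows the expected value next to what the loader actually decided, which is usually enough to see which detection rule went wrong.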

Edit: also, the tools will format the generated code to make it nice and easy to read. The C code generator will automatically count open curly braces and indent each line the correct number of spaces. The assembly code generator would accept something like "foo: add eax,edx ; 2 ...comment" and would spread out the fields the proper amount:
Code:
foo:            add     eax,edx         ; 2 ...comment
If it needs to make more room for the label or comment, it can wrap it onto the next line, etc. That number ("2") is the size of the instruction; my text editor has a column mode and I can sum the columns to find out how many bytes of code are generated. With a little smarts built into it, the tool could compute that number for me. It could also compute 4-1-1 decoder templates and insert the number of P3 micro-ops required for each instruction, etc (something I occasionally do by hand just for curiosity's sake).
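The field spreading and the size summing could be sketched like this (Python; the column positions 16, 24 and 40 and the function names are arbitrary choices for the example).

```python
def pad_to(text: str, col: int) -> str:
    """Pad the last line of text so the next field starts at column col."""
    last = text.rsplit("\n", 1)[-1]
    if last and len(last) >= col:
        return text + "\n" + " " * col   # field overran: wrap onto the next line
    return text + " " * (col - len(last))

def format_asm(line: str, op_col=16, arg_col=24, comment_col=40) -> str:
    """Spread 'label: op args ; comment' into fixed columns."""
    code, _, comment = line.partition(";")
    words = code.split()
    label = words.pop(0) if words and words[0].endswith(":") else ""
    mnemonic = words.pop(0) if words else ""
    operands = " ".join(words)
    out = pad_to(label, op_col) + mnemonic
    if operands:
        out = pad_to(out, arg_col) + operands
    if comment:
        out = pad_to(out, comment_col) + ";" + comment.rstrip()
    return out.rstrip()

def total_size(lines) -> int:
    """Sum the size field, assumed to be the first number in each comment."""
    total = 0
    for line in lines:
        fields = line.partition(";")[2].split()
        if fields and fields[0].isdigit():
            total += int(fields[0])
    return total

print(format_asm("foo: add eax,edx ; 2 ...comment"))
print(format_asm("a_very_long_label_name: nop"))
```

A label that does not fit in its column gets a line to itself and the instruction starts on the next line at the usual position.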

by blargg on 2006-04-24 (#12204)

Wouldn't most of these issues go away if you were targeting a modern RISC architecture? If you're targeting x86, that basically means a fast desktop machine, so optimization doesn't matter near as much. The main value of optimization seems to be for portables, which virtually all use RISC processors. In that case, you can let the C compiler optimize and schedule the instructions better.

In my opinion, any investment in the x86 architecture is doomed to become obsolete in the future, whereas targeting RISC basically means writing your code in a straightforward way in C and using plenty of local variables. Sorry, I just hate the x86 architecture with a strong passion. :)

by mozz on 2006-04-24 (#12206)

blargg wrote:
Wouldn't most of these issues go away if you were targeting a modern RISC architecture? If you're targeting x86, that basically means a fast desktop machine, so optimization doesn't matter near as much. The main value of optimization seems to be for portables, which virtually all use RISC processors. In that case, you can let the C compiler optimize and schedule the instructions better.

People have been saying for at least 10 years now that x86 would die. They've been wrong and they're still wrong. But that's irrelevant. I like writing assembly for x86. =) There is no other good reason to do it.

It's true that decent C compilers do much better on RISC than they do on x86 (mostly because there are more architecturally-visible registers; modern x86 chips are all RISC inside). But writing C code is not nearly as fun or challenging as writing assembly. Emulators can take advantage of architectural features that aren't visible to C, such as machine flags, continuation-style code, your own funky register passing convention that requires less spillage for your specific use case, etc. And other things that are hard to make use of in portable C code: rotate instructions, SIMD/MMX instructions, etc.

And of course, there are zillions of x86 boxes out there already. It would be nice to have emulators which were both highly accurate, and fast/lightweight enough to run on that older hardware.

by tepples on 2006-04-24 (#12207)

blargg wrote:
Wouldn't most of these issues go away if you were targeting a modern RISC architecture? If you're targeting x86, that basically means a fast desktop machine

Not necessarily. My main computer is a five-year-old machine with a 0.9 GHz Pentium III. In addition, things such as rewind, fast forward, and multiple systems in multiple windows need a fast CPU.

Quote:
In my opinion, any investment in the x86 architecture is doomed to become obsolete in the future

Then why did Apple switch from PowerPC to x86?

mozz wrote:
Emulators can take advantage of architectural features that aren't visible to C, such as machine flags

Tried C--? It's like C but it exposes more low-level machine features, making it suitable as an output language for compiler front ends such as yours.

by mozz on 2006-04-24 (#12213)

tepples wrote:
Tried C--? It's like C but it exposes more low-level machine features, making it suitable as an output language for compiler front ends such as yours.

I looked at it many years ago, but not lately. (I didn't like it much then, but maybe it has improved...)

Besides, that would take all the fun out of it! Fiddling with x86 assembly is sort of a hobby of mine.

One thing I've noticed from looking at various people's emulation cpu cores... they are kind of big. I have a feeling I could produce one that is 1/4th the code size of a typical assembly core, and comparably fast. I want to see if I can make it work. (Small is beautiful.)

by abonetochew on 2006-04-25 (#12244)

I'm not sure how well this will work in terms of performance, but generating a CPU emulator from a (relatively) readable set of instruction definitions just sounds cool.

by blargg on 2006-04-25 (#12246)

Quote:
I'm not sure how well this will work in terms of performance

I'd expect performance to possibly exceed that of any other CPU emulator, since you would automate everything and then override this at various layers with optimizations. Optimizations wouldn't just apply to a specific opcode (though they could); they might apply to a whole addressing mode or logical operation. This is the beauty of code generators (or any kind of abstraction): changes can be made at high levels without having to manually update code in several places to avoid inconsistency.

by mozz on 2006-04-26 (#12260)

Yeah, the beauty of code generators is that you don't have "copy and paste" code; you can generate similar-looking code in several places in the output, from a single source.

Example: Suppose I wanted to add a comment to each assembly handler for an undocumented instruction that said "; (undocumented)". That would be a one- or two-line change somewhere in the code generator. I don't even have to figure out manually which instructions are the undocumented ones (my tool already knows that). If I change my mind and decide to put the comment on entries in the dispatch table instead, that's another simple change in one place. The same principle applies to less trivial changes (such as changing what register a temporary value lives in, or changing the calling convention for the memory access functions). Hopefully changes like that can be made in fewer places.

Some of the same benefits can be had from using macros in the assembler, but I don't really like to use macros because they can obscure what instructions are actually being generated.

by mozz on 2006-06-25 (#14525)

Just an update for anyone who's interested.

I'm still working on this in my copious free time. I'm up to about 50 classes and 5,000 lines of Java code now. It took several redesigns, but I now have a pretty flexible code generation framework. It has pretty-printers for C++ and assembly, and supports hierarchically composable output through the use of "output contexts" (language-specific write-only text buffers) that can be nested (i.e. you can embed a context inside another context) and can have other things (text generators, or hints for the pretty printer) embedded in them wherever you want. It allows the code-generation code to execute in whatever order is useful, and have the output go where you want it to go--you don't have to produce output in the order it will eventually appear. I'm actually pretty happy with it.
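The nesting idea can be shown with a very small sketch (Python instead of Java; the class and method names are invented for the example and the real framework does a lot more, such as pretty-printer hints).

```python
class OutputContext:
    """Write-only buffer whose items are strings or nested contexts."""

    def __init__(self):
        self.items = []

    def write(self, text: str) -> None:
        self.items.append(text)

    def nest(self) -> "OutputContext":
        """Reserve a spot in the output now; fill it in whenever convenient."""
        child = OutputContext()
        self.items.append(child)
        return child

    def render(self) -> str:
        return "".join(item.render() if isinstance(item, OutputContext) else item
                       for item in self.items)

root = OutputContext()
root.write("; header\n")
table = root.nest()                 # the dispatch table will appear here
root.write("; handlers\n")
table.write("; dispatch table\n")   # written last, rendered in the middle
print(root.render(), end="")
```

The generator can emit handlers first and fill in the dispatch table afterwards, and the rendered file still has the table above the handlers.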

Here is a snippet of generated C++ code, for example (the indenting is automatic):
Code:
/**************************************************************************/

#define setNZ_8 do { NF=(res>>7)&1; ZF=!(res&0xFF); }while(0)

void alu_op(int opcode) {
    switch (opcode) {
    case ADC:
        VF=(A^T); A=res=(A+T+CF); VF=((VF^res)>>7)&1;
        CF=(res>>8)&1; setNZ_8; break;
    case SBC:
        VF=(A^T); A=res=(A-T-CF); VF=((VF^res)>>7)&1;
        CF=!(res>>8); setNZ_8; break;

    case CMP: res=(A-T); CF=!(res>>8); setNZ_8; break;
    case CPX: res=(X-T); CF=!(res>>8); setNZ_8; break;

and generated assembly will look something like this (horizontal spacing is automatic):
Code:
; n6502small.asm
; THIS FILE IS AUTOMATICALLY GENERATED. DO NOT EDIT.

;---------------------------------------------------------------------------
; Dispatch table
;---------------------------------------------------------------------------

; The macro arguments already carry their op_/op1_/op2_ prefix.
%macro F 1
                dw (%1 - APUBASE),0
%endmacro
%macro FF 2
                dw (%1 - APUBASE),(%2 - APUBASE)
%endmacro
%macro FC 2
                dw (%1 - APUBASE),(%2)
%endmacro

                ALIGN 32,nop
dispatchTab:    F   op_brk_impl         ; 4 ;00: BRK impl
                FF  op1_dxi, op2_ora    ; 4 ;01: ORA dxi
                F   op_hlt_impl         ; 4 ;02: HLT impl
                FF  op1_dxi, op2_slo    ; 4 ;03: SLO dxi
                FF  op1_d, op2_skb      ; 4 ;04: SKB d
                FF  op1_d, op2_ora      ; 4 ;05: ORA d
                FF  op1_d, op2_asl      ; 4 ;06: ASL d

Regarding actual code generation, I wrote 95% of a "simple C core generator" for the 6502, and about two thirds of the one for the 65816. The output they were producing was very large and verbose though. I also made an attempt at an assembly core generator for the 6502 and ran into difficulties because the way I organized my template-refining code was not very flexible. So I backed off of that and focused on the C cores for a bit.

In an effort to solve the verbosity problem of the C code, I wrote 95% of a 6502 core by hand yesterday---one that uses roughly one case per addressing mode in the giant switch (instead of one case per instruction). I then refactored my template-refining code and wrote a new 6502 simple C core generator that generates exactly what I wrote by hand, except using the internal data structures of my code generation stuff. I am very happy with the result (so far)--it's about 462 lines long where the old verbose core was well over 4000 lines, making the new one much easier to read.

Here's an example of the code being generated for the new "simple-but-readable" 6502 C core:
Code:
    case IG_ALU_A:
        /* 2 */ SETL(AA,Read(PC++));
        /* 3 */ SETH(AA,Read(PC++));
        /* 4 */ T = Read(AA); do_alu(opcode);
        break;

    case IG_ALU_AY: ireg=Y;     /* fall through */
    case IG_ALU_AX:
        /* 2 */ SETL(AA,Read(PC++));
        /* 3 */ SETH(AA,Read(PC++));
                carry = ((GETL(AA)+ireg)>>8); SETL(AA,GETL(AA) + ireg);
        /* 4 */ T = Read(AA); SETH(AA,GETH(AA) + carry);
        /* 5?*/ if (carry) { T = Read(AA); }
                do_alu(opcode);
        break;
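To see what the indexed case does on the bus, here is a small Python model of the same sequence that records every address read. It assumes SETL/SETH store only the low 8 bits of their argument, which the snippet above does not show, and `alu_abs_indexed` is a name made up for this sketch.

```python
def alu_abs_indexed(memory: dict, pc: int, ireg: int):
    """Model of the IG_ALU_AX case: returns (T, new PC, list of addresses read)."""
    reads = []

    def read(addr):
        reads.append(addr)
        return memory.get(addr, 0)

    lo = read(pc); pc += 1              # cycle 2: fetch address low
    hi = read(pc); pc += 1              # cycle 3: fetch address high
    carry = (lo + ireg) >> 8
    lo = (lo + ireg) & 0xFF
    t = read((hi << 8) | lo)            # cycle 4: read before the high byte is fixed
    hi = (hi + carry) & 0xFF
    if carry:
        t = read((hi << 8) | lo)        # cycle 5: re-read from the corrected address
    return t, pc, reads

# Operand bytes F0 12 at $8000 give a base address of $12F0.
memory = {0x8000: 0xF0, 0x8001: 0x12, 0x12F5: 0x11, 0x1310: 0x42}
print(alu_abs_indexed(memory, 0x8000, 0x05))   # no page crossing: three reads
print(alu_abs_indexed(memory, 0x8000, 0x20))   # page crossing: extra read
```

With an index of $20 the first data read goes to $1210 (the not-yet-corrected address) and the second to $1310, which is the extra cycle marked "5?" in the generated code.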

While refactoring I made the 6502 and 65816 stuff more similar internally, and I am pretty sure I can produce a 65816 generator in this new style fairly easily. The next thing I plan to do though, is make another stab at an assembly core generator for the 6502.

...I haven't compiled any of the generated code yet. It looks pretty though!