Implementation of a Binary Translator (JIT) for the NES

Implementation of a Binary Translator (JIT) for the NES
by laughy on 2006-06-05 (#13832)

I did one, and I wrote a paper on it for y'all:

http://www.laughy.net/Laughy-DBT.doc

No, an NES emulator does not need a JIT - this was for educational purposes only, and I was hoping to share the information I learned with anyone who was thinking about doing the same thing, either for the NES or other emulators. No doubt though, it was fun, and the performance increase in NES CPU emulation is astounding.

Opinions, comments and/or questions are more than welcome. Sorry its in DOC format - my converter thingie didn't work right.

(times are for 2.2ghz machine)

by blargg on 2006-06-05 (#13834)

What interpreter did you use? On this ancient 400 MHz Mac my interpreter runs as fast as the one you used on your 2.2 GHz machine (20x real-time, that is), and that's with PPU and sound emulation enabled.

Could be good for emulation on GBA
by tepples on 2006-06-05 (#13837)

laughy wrote:
No, an NES emulator does not need a JIT

Unless you're on a platform with a 16.8 MHz CPU and a video chip whose rendering model resembles that of the NES. In that case, you'll need to implement a 6502 to ARM JIT, even if only as practice before implementing 65c816, SPC700, MC68000, and Z80 JITs.

by laughy on 2006-06-05 (#13838)

blargg wrote:
What interpreter did you use? On this ancient 400 MHz Mac my interpreter runs as fast as the one you used on your 2.2 GHz machine (20x real-time, that is), and that's with PPU and sound emulation enabled.

I wrote a very simple and object-oriented interpreter to use for writing the translator: I was wary in posting benchmarks since the purpose of the paper is not to say "hey look DBT is so much faster LOL", but to outline the DBTs implementation. The benchmarks were mainly to show how a translator compares in its optimizations. I guess you could say the above picture compares a naive interpreter solution to an optimized dynamic translator for gags.

Increasing the speed of my interpreter should also speed up the DBT as well, since it often makes use of the interpreter.

by blargg on 2006-06-05 (#13848)

I just read the paper and liked it (it showed up fine in TextEdit on Mac OS X, except for the images at the end). It was neat how the algorithm came together, with the block convertor meshing smoothly with translated blocks and the interpreter. The most interesting part for me is optimization of flags. On older architectures with implicit flag setting, that's one of the biggest opportunities for elimination of unneeded operations.

One area of optimization I didn't see mentioned is translation-time differentiation of absolute memory accesses based on what address range they access, i.e. $0-$2000 (low memory), $2000-$4017 (PPU/APU registers), etc. For the registers, you could even dispatch specific ones separately, since $2002 for example is read heavily by some games. So an lda $2000 would turn into a direct call to read_ppu_2002(). How extensive is clock counting? I'm wondering how much that bogs it down.

by Dwedit on 2006-06-05 (#13849)

I'm most familiar with the GBA based nes emulator PocketNES, with an optimized ARM-ASM 6502 interpreter. The impression I get is that if the JIT did nothing other than inline each instruction, performance would only go up by about 5 instructions per 6502 instruction. But if it did lots of optimization and replaced a layered and emulated memory model with a native memory model, performance could increase dramatically. Mainly from removing layers from the memory system, but changing the 6502 sequences to optimized ARM sequences wouldn't hurt either.

by laughy on 2006-06-05 (#13852)

blargg wrote:
One area of optimization I didn't see mentioned is translation-time differentiation of absolute memory accesses based on what address range they access, i.e. $0-$2000 (low memory), $2000-$4017 (PPU/APU registers), etc. For the registers, you could even dispatch specific ones separately, since $2002 for example is read heavily by some games. So an lda $2000 would turn into a direct call to read_ppu_2002(). How extensive is clock counting? I'm wondering how much that bogs it down.

Yes you are right - I didn't mention this, and I probably should. In my JIT I do in fact differentiate absolute memory accesses and use specific translations in this case (I don't call a method). Calling a method would still be better than abandoning the translation and letting the interpreter handle it however.

One of the difficult things I'm struggling with is how to translate a block with this:

LDA 0x7
STA $05
LDA 0x20
STA $06
LDA 0x2
STA ($05)

See how we write to address 0x2007 indirectly? However address $05 doesn't contain 0x2007 until the block executes - so I have NO WAY of determing this is going to write to 0x2007 at translation time (without complicating things like a mother). Therefore when it comes time to translate the STA ($05) I incorrectly have it modify memory - when I should have it translate to an address-specific translation or method call like you said. I'm still working on this problem - possibly have the interpreter run blocks at least once before translation to determine if this occurred.

Clock counting really doesn't bog things down much. In the actual translated code, its only one extra instruction per branch, and two or three extra instructions per block. In the actual translation, we have to add cycles as we're translating, but remember the speed of the translator is not critcal(take with grain of salt). In 10,000 vblanks, we have a whopping 350 translations in Mario Bros. That's not much, and so the extra code to count cycles as one is going along in the translation becomes moot.

Thanks for the questions Blargg.

Dwedit wrote:
I'm most familiar with the GBA based nes emulator PocketNES, with an optimized ARM-ASM 6502 interpreter. The impression I get is that if the JIT did nothing other than inline each instruction, performance would only go up by about 5 instructions per 6502 instruction. But if it did lots of optimization and replaced a layered and emulated memory model with a native memory model, performance could increase dramatically. Mainly from removing layers from the memory system, but changing the 6502 sequences to optimized ARM sequences wouldn't hurt either.

I think after block-linking (one of the optimizations) any JIT will be significantly faster than any interpreter, and becomes worth doing. Note how the performance increased five times when this was done.

However you're right - if you really want speed extra optimizations are the way to go. How you handle memory is a pretty implementation specific issue, but you're right in the fact no translation should go through any type of memory layer: a translation must access memory directly for it to be fast. My translator does this by having the mapper be responsible for returning pointers to memory locations that will later be accessed directly. That way I can still keep the memory access one-level deep and be flexible (i.e. crazy mapper stuff) at the same time.

I couldn't imagine how much slower the translator would be if the translations went through levels of memory access. yikes!

Thanks for the input.

by blargg on 2006-06-05 (#13859)

Quote:
See how we write to address 0x2007 indirectly? However address $05 doesn't contain 0x2007 until the block executes - so I have NO WAY of determing this is going to write to 0x2007 at translation time (without complicating things like a mother).

In my profiles zero-page indirect instructions are fairly rare (the most-used in my profile is lda (zero-page),y, occurring only 1.35% of the time). I doubt that this optimization would give any noticeable payoff.

Quote:
Clock counting really doesn't bog things down much. In the actual translated code, its only one extra instruction per branch, and two or three extra instructions per block.

What about the adjustment necessary at every memory-mapped I/O access so the device knows when the access occurred? And how are wait-states (introduced by the DMC) handled?

by laughy on 2006-06-05 (#13860)

blargg wrote:
In my profiles zero-page indirect instructions are fairly rare (the most-used in my profile is lda (zero-page),y, occurring only 1.35% of the time). I doubt that this optimization would give any noticeable payoff.

Except its not an optimization - games won't even work because the translator incorrectly determines its a "normal" instruction when it MUST detect that it is an access to $2007 so that correct code can be inserted into the translation. A lot of games actually seem to use this, especially for things like $40XX registers.

blargg wrote:
What about the adjustment necessary at every memory-mapped I/O access so the device knows when the access occurred? And how are wait-states (introduced by the DMC) handled

Forgive me if I misunderstand what you mean by the first question: I'm a little rusty on my NES architecture details. If things like video and sound require knowledge of the cycle certain accesses occurred, then they must be logged somehow, which no doubt will add more overhead I didn't account for. If I was going to turn this in to a full emulator, writes and reads to special I/O registers would likely log to a linked list chain of "access blocks" or the like, and be processed later. Waitstates are currently not handled at all.

The great thing about a translator is that at any time you can go "you know what, this particular thing is to hard to handle" and claim the block is un-translatable: leaving your interpreter (which I guess yours is pretty fast) to handle it.

The questions you posed could probably be answered if I attempted to implement them. I'd love to hear any ideas on how to do so in the context of the translator: no doubt it IS possible and would probably be faster than a corresponding interpreter solution.

by Nessie on 2006-06-05 (#13861)

Great work, laughy. I've been thinking of a NES JIT myself but haven't gotten around to start experimenting with one yet. Nice to see you got it working.

laughy wrote:
blargg wrote:
In my profiles zero-page indirect instructions are fairly rare (the most-used in my profile is lda (zero-page),y, occurring only 1.35% of the time). I doubt that this optimization would give any noticeable payoff.

Except its not an optimization - games won't even work because the translator incorrectly determines its a "normal" instruction when it MUST detect that it is an access to $2007 so that correct code can be inserted into the translation. A lot of games actually seem to use this, especially for things like $40XX registers.

Why do you need to translate that particular memory access at all? Can't you just dispatch all indirect accesses to some mmu code that knows what to do?
Code:
; STA ($05)

mov ax [esi+$05] ; read target address from zero page somehow
add al cl ; add Y, intel handles page wrapping for you ;)
push eax ; put address
movzx eax bl ; and A
push eax ; on the stack
call WriteHandler
mov bl al ; put result in A
;
; (I'm not familiar with VC asm syntax, but you get the idea)

And then in WriteHandler, you've got a 'switch (address)' statement or something that knows what to do with 0x2007 writes.

Btw, here's another interesting doc regarding JIT console emulation:
http://www.squish.net/generator/gendoc.pdf

by laughy on 2006-06-06 (#13862)

Nessie wrote:
Great work, laughy. I've been thinking of a NES JIT myself but haven't gotten around to start experimenting with one yet. Nice to see you got it working.

laughy wrote:
blargg wrote:
In my profiles zero-page indirect instructions are fairly rare (the most-used in my profile is lda (zero-page),y, occurring only 1.35% of the time). I doubt that this optimization would give any noticeable payoff.

Except its not an optimization - games won't even work because the translator incorrectly determines its a "normal" instruction when it MUST detect that it is an access to $2007 so that correct code can be inserted into the translation. A lot of games actually seem to use this, especially for things like $40XX registers.

Why do you need to translate that particular memory access at all? Can't you just dispatch all indirect accesses to some mmu code that knows what to do?
Code:
; STA ($05)

mov ax [esi+$05] ; read target address from zero page somehow
add al cl ; add Y, intel handles page wrapping for you ;)
push eax ; put address
movzx eax bl ; and A
push eax ; on the stack
call WriteHandler
mov bl al ; put result in A
;
; (I'm not familiar with VC asm syntax, but you get the idea)

And then in WriteHandler, you've got a 'switch (address)' statement or something that knows what to do with 0x2007 writes.

Btw, here's another interesting doc regarding JIT console emulation:
http://www.squish.net/generator/gendoc.pdf

Adding calls to MMU code on every indirect access would greatly affect the speed of the translator, although that definately is a viable option. Wouldn't it still be more efficient to have the interpreter always execute a block first, and then it can log if an indirect address goes to a I/O register? If it does, we abandon translation of the block.

Thanks for the doc link! He doesn't talk about JIT too much, but he does talk about the same flag optimization I did (which is neat).

He also says one thing that REALLY BOTHERS ME:

Quote:
The start of the block is just the start of the unknown code we encounter, and it ends with the first instruction that might change the PC

He ends blocks on BRANCHES!!! This leaves most blocks EXTREMELY small (3-4 instructions?) and you spend most of your time jumping into your translator to translate small blocks instead of actually running code. I would greatly suggest to this guy to pick a JMP, JSR, etc. instruction as ends to blocks. Picking a branch as ends of blocks simplifies things though.

Thanks for the input Nessie!

by tepples on 2006-06-06 (#13869)

laughy wrote:
Adding calls to MMU code on every indirect access would greatly affect the speed of the translator, although that definately is a viable option.

Correctness before speed.

Quote:
Wouldn't it still be more efficient to have the interpreter always execute a block first, and then it can log if an indirect address goes to a I/O register?

And what happens if the same instruction writes to RAM or to I/O depending on what comes before the block? It's been done.

Quote:
He ends blocks on BRANCHES!!! This leaves most blocks EXTREMELY small (3-4 instructions?) and you spend most of your time jumping into your translator to translate small blocks instead of actually running code. I would greatly suggest to this guy to pick a JMP, JSR, etc. instruction as ends to blocks. Picking a branch as ends of blocks simplifies things though.

Again, correctness before speed. Simplicity breeds correctness.

by ershu on 2006-06-06 (#13870)

So great~~
I am dothing something similar. Your paper is great helpful for me.

by Nessie on 2006-06-06 (#13885)

laughy wrote:
Adding calls to MMU code on every indirect access would greatly affect the speed of the translator, although that definately is a viable option. Wouldn't it still be more efficient to have the interpreter always execute a block first, and then it can log if an indirect address goes to a I/O register? If it does, we abandon translation of the block.

But it's impossible to determine whether the indirect access will ever affect an I/O register or not. If it's difficult to insert a call to a WriteHandler, perhaps it's simpler to just abandon translation of all blocks with indirect access. As blargg pointed out, they're not very common anyway.
Quote:
Thanks for the doc link! He doesn't talk about JIT too much, but he does talk about the same flag optimization I did (which is neat).

He also says one thing that REALLY BOTHERS ME:

Quote:
The start of the block is just the start of the unknown code we encounter, and it ends with the first instruction that might change the PC

He ends blocks on BRANCHES!!! This leaves most blocks EXTREMELY small (3-4 instructions?) and you spend most of your time jumping into your translator to translate small blocks instead of actually running code. I would greatly suggest to this guy to pick a JMP, JSR, etc. instruction as ends to blocks. Picking a branch as ends of blocks simplifies things though.

Yeah, I think loops is where we would see the biggest performance gain, but that's lost when ending blocks on branches.

On a related note, someone made a PSX emulator in Java, and it uses a JIT:
http://www.javagaming.org/forums/index. ... ic=13865.0
What's cool about making a JIT in Java is that you don't need to worry about optimizing the translated code (dead code removal, register allocation). If you just translate into Java bytecode, Hotspot will then take care of the final optimization for the target platform - and Hotspot is very good at that. And furthermore, you get three platforms in one - Windows, Linux, Mac.

Martin

by laughy on 2006-06-06 (#13887)

tepples wrote:
laughy wrote:
Adding calls to MMU code on every indirect access would greatly affect the speed of the translator, although that definately is a viable option.

Correctness before speed.

Quote:
Wouldn't it still be more efficient to have the interpreter always execute a block first, and then it can log if an indirect address goes to a I/O register?

Quote:
He ends blocks on BRANCHES!!! This leaves most blocks EXTREMELY small (3-4 instructions?) and you spend most of your time jumping into your translator to translate small blocks instead of actually running code. I would greatly suggest to this guy to pick a JMP, JSR, etc. instruction as ends to blocks. Picking a branch as ends of blocks simplifies things though.

Again, correctness before speed. Simplicity breeds correctness.

If correctness was the priority, one wouldn't be doing a translator in the first place! Hehe. However I agree with you - getting it working should be a priority, but at the end of the paper he claims it wasn't really worth the work: if he had worked a bit harder in the speed area he may have seen the fruits of his labor.

Tepples wrote:
And what happens if the same instruction writes to RAM or to I/O depending on what comes before the block? It's been done.

Nessie wrote:
But it's impossible to determine whether the indirect access will ever affect an I/O register or not. If it's difficult to insert a call to a WriteHandler, perhaps it's simpler to just abandon translation of all blocks with indirect access. As blargg pointed out, they're not very common anyway.

You guys bring up a good issue. Normally this would work if the interpreter could assume an indirect instruction would always write to a I/O register OR always write to RAM. I can see three possibilities:

1) Execute a block multiple times in the interpreter looking for this special case (maybe in 10 times, a morphing indirect instruction will show its ugly head). - will possibly fail if we're unlucky
2) Keep a list of games which do this - abandon translations which use indirect addressing in these games, or use method call approach
3) Always use method call approach

Abanonding blocks can be fatal - I've seen the same blocks can be executed hundreds of thousands of times - if the indirect access is in a tight loop its a big loss. If I was gong to do this I'd probably go in the above in order, seeing which one worked out.

Nessie wrote:
Yeah, I think loops is where we would see the biggest performance gain, but that's lost when ending blocks on branches.

The size of the block has no affect on the speed of looping, but you bring up a good point in that the small block problem won't be as visible one is tightly looping. That said, if he isn't using the block linking optimization, tight loops are almost as slow as using an interpreter anyway. Notice how he said what made his emulator playable in many cases was when he detected tight loops and "skipped" them by just executing the interrupt the loop was trying to detect. I can see why he would need to do this.

Thanks for your comments/input guys. Let me know if I can better clear anything up.

by mozz on 2006-06-06 (#13888)

Ending blocks on branches does make sense, because blocks are then linked together.

You guys should look into an old HP project called Dynamo. It used JIT recompilation of native PA-RISC instructions into more efficient ("dynamically optimized") sequences of native PA-RISC instructions. Executing PA-RISC binaries under this system was often 5% or more faster than just running them on the underlying hardware!

I think there was an X86 version of this project started later under a different name (can't remember what at the moment). The technique was not dependent on a nice orthogonal RISC instruction set.

Dynamo used interpretation for the first few runs through some code (up to 100 or so?) in order to profile backwards branches and identify the "hot path" through a set of basic blocks. It then collected these instructions together into "traces" which consisted of several basic blocks stuck back-to-back (i.e. single-entry, multi-exit control flow) and ran a set of linear-time optimizations on the traces to improve their efficiency. This optimized trace was entered into a trace cache and thereafter, whenever the interpreter wanted to execute the code starting with the basic block at the beginning of the trace, it would just execute the native code in the cache.

Just like laughy, they found that switching from compiled code to the interpreter and back was a huge performance hit, so they implemented block linking (in Dynamo, a branch that exits from a trace would either go back to the interpreter, or to the beginning of a different trace. If it went to a different trace, they would replace the branch instruction that exits the trace with a direct branch to the new trace.) That way, you'd spend all your inner loops running directly-linked native code in the trace cache, and only have to go out to the interpreter for rarely-executed code like error handling.

by laughy on 2006-06-06 (#13899)

mozz wrote:
Ending blocks on branches does make sense, because blocks are then linked together.

Linking blocks together will be done no matter what the blocks are ended with: one doesn't have to be at the end of a block to link it to another one. The only differences I know of between a block that ends in a branch and block that ends with a jump is that the blocks which end in branches are easier to implement, and slower since more time will be spent going back and forth between the translator. Also smaller blocks means less flag optimization.

Quote:
Dynamo used interpretation for the first few runs through some code (up to 100 or so?) in order to profile backwards branches and identify the "hot path" through a set of basic blocks. It then collected these instructions together into "traces" which consisted of several basic blocks stuck back-to-back (i.e. single-entry, multi-exit control flow) and ran a set of linear-time optimizations on the traces to improve their efficiency. This optimized trace was entered into a trace cache and thereafter, whenever the interpreter wanted to execute the code starting with the basic block at the beginning of the trace, it would just execute the native code in the cache.

Yeah I'm starting to think this is needed to catch things like indirect writes to I/O registers and such. Its also a great way to determine which optimizations are good to run a block.

I can also see how running optimizations on traces would be a good idea.

Thanks for the info.

by mozz on 2006-06-07 (#13919)

Sorry. I was thinking of "jump" and "branch" as being the same thing (control transfer instruction, a.k.a. the last instruction of a basic block in the conventional sense). I guess I should actually read your paper (not just the summary).

The reason a "basic block" is considered "basic" is because, control flows linearly from the first instruction in the block to the last. That makes it easy to reason about (for example) what happens if you move instructions around, or remove one of them, etc. Thats why basic blocks often show up as an abstraction in algorithms that handle code (data flow analysis, or quick optimizations, or whatever else).