Is it possible to parallelize (i.e. prefetch instructions and execute them on separate threads before encountering one that has some kind of stateful dependency) a 6502 core? If so, is it done in any popular emulators and does it achieve noticeable performance gains?
No, I don't believe so. Everything about the CPU is a serial operation. What do you propose to split into threads?
You could maybe run 2A03 and 2C02 in parallel simulations, but the frequent synchronization between the two (think of ~60 $2007 syncs per frame) might make it difficult to improve performance this way.
It's sometimes suggested to alternate between CPU/PPU simulation only at sync points (e.g. just run the CPU by itself until you hit a BIT $2002 or NMI or something, then run the PPU to catch up to that point) instead of simulating them together in lock-step. There is already a potential advantage here from not jumping back and forth on each instruction, but if you wanted to try parallelization you'd have to implement this as a stepping stone anyway.
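As a minimal sketch of that catch-up idea (hypothetical names, not from any real emulator): the CPU runs by itself and the PPU is only simulated forward to the CPU's current time when the two actually need to agree.
Code:
class CatchUpSketch {
    long cpuCycles = 0;   // how far the CPU has been emulated
    long ppuDots = 0;     // how far the PPU has been emulated (3 dots per CPU cycle on NTSC)

    void stepCpuInstruction() {
        cpuCycles += 3;                    // pretend every instruction costs 3 CPU cycles
    }

    void catchUpPpu() {
        while (ppuDots < cpuCycles * 3) {
            ppuDots++;                     // a real PPU would render one dot here
        }
    }

    int readPpuStatus() {                  // the "BIT $2002" sync point mentioned above
        catchUpPpu();                      // only now do the two chips need to agree
        return 0;                          // a real PPU would return its status bits
    }

    void runToEndOfFrame() {
        while (cpuCycles < 29781) {        // ~29780.5 CPU cycles per NTSC frame
            stepCpuInstruction();
        }
        catchUpPpu();                      // always finish the frame in sync
    }
}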
I suppose I was thinking: are there enough situations where you have several instructions not dependent on each other like:
;just random set of instructions that don't depend on each other
lda $00
inx
inc $0a
where you could execute all three in parallel and once you find a dependency wait for the results, or something?
I'm grasping at straws lately to speed up my 6502 core on Android devices (it already runs startlingly well in pure Java, but I need to push it over the top to make it truly viable for distribution). The battery-saving features of such devices seem to make threads run at different speeds seemingly arbitrarily, no matter what I do to request higher priority. I think for Android I may have to write the CPU portion in C.
GradualGames wrote:
;just random set of instructions that don't depend on each other
lda $00
inx
inc $0a
where you could execute all three in parallel and once you find a dependency wait for the results, or something?
That's just way too granular for multithreading. Think about all the extra work you've got to do just to identify separable instructions, and then merge the common results, etc... the time it takes to synchronize data between threads greatly outweighs the time taken by any single instruction. You need to be able to separate into batches where the threads don't have to talk to each other for longer periods.
That kind of granularity actually is done by the hardware of modern CPUs internally, though. Not as a threading thing, but on a lower level, as the CPU's pipelining hardware. You could potentially take advantage of that hardware by doing JIT compilation, or static recompilation, where you rebuild the whole program as native instructions instead of the usual byte-by-byte dispatch. Recompilation has some caveats that might make it difficult too (self modifying code, code copied to RAM, overlapping code, etc.).
I've not heard of recompilation being done for an NES emulator, but it's definitely used in several emulators for more powerful systems (not sure if it would really pay off on NES). There's a small list here:
https://en.wikipedia.org/wiki/Dynamic_recompilation#Gaming
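To illustrate just the dispatch-cost part of that idea, here's a toy Java sketch (not how any real recompiling emulator is structured) that translates a run of straight-line instructions into a cached Runnable once, so later executions skip the per-instruction fetch/decode. It only understands three opcodes and ignores everything that makes real recompilation hard (flags, memory-mapped I/O, self-modifying code).
Code:
import java.util.ArrayList;
import java.util.List;

class RecompileSketch {
    static final int NOP = 0xEA, INX = 0xE8, DEX = 0xCA, BRK = 0x00;

    int x;                                       // the only register this toy models
    int[] memory = new int[0x10000];
    Runnable[] blockCache = new Runnable[0x10000];

    // Translate straight-line code starting at pc into one Runnable, once.
    Runnable compile(int pc) {
        List<Runnable> ops = new ArrayList<>();
        loop:
        for (int p = pc; ; p++) {
            switch (memory[p]) {
                case INX: ops.add(() -> x = (x + 1) & 0xFF); break;
                case DEX: ops.add(() -> x = (x - 1) & 0xFF); break;
                case NOP: ops.add(() -> {}); break;
                default:  break loop;            // stop at anything we can't translate
            }
        }
        Runnable[] compiled = ops.toArray(new Runnable[0]);
        return () -> { for (Runnable op : compiled) op.run(); };
    }

    void runBlock(int pc) {
        if (blockCache[pc] == null) blockCache[pc] = compile(pc);
        blockCache[pc].run();                    // later executions skip fetch/decode entirely
    }

    public static void main(String[] args) {
        RecompileSketch cpu = new RecompileSketch();
        cpu.memory[0x8000] = INX;
        cpu.memory[0x8001] = INX;
        cpu.memory[0x8002] = DEX;
        cpu.memory[0x8003] = BRK;                // terminator: compile() stops here
        cpu.runBlock(0x8000);
        System.out.println("X = " + cpu.x);      // prints X = 1
    }
}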
Makes sense. What about other tricks... lazy evaluation of CPU flags, for example? A lot of instructions affect many flags, but the code that follows may or may not use the flags calculated from the previous result. Just trying to learn if there's anything clever one can do in a 6502 core to reduce the amount of work that needs to be done.
There are only four ALU flags, and computing them should be basically free on a modern CPU (ironically, since the flags don't depend on each other, the CPU *will* probably be able to compute them in parallel). The overhead involved in trying to compute them lazily will almost certainly be higher than the cost of just computing them.
Why is that? I'm thinking, for example, of evaluating status_zero only when beq/bne is used. There are probably many hundreds of instructions executed in any given game that are not immediately followed by beq/bne (but which do affect the zero flag), so it seems like when you sum that all up it could save at least a small portion of time overall. Note my core is written in Java, so it is unclear to me how much parallelization (in the real physical CPU) is actually taken advantage of at the lowest level after the JIT is done and the bytecode is finally interpreted, etc.
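A minimal sketch of that idea (hypothetical field names; the replies below explain why a single stored result isn't quite sufficient on its own):
Code:
class LazyFlagsSketch {
    int lastResult;          // last 8-bit ALU result, stored instead of N and Z

    void lda(int value) {
        lastResult = value & 0xFF;   // don't compute any flag bits here
    }

    boolean zero()     { return lastResult == 0; }          // evaluated only at BEQ/BNE
    boolean negative() { return (lastResult & 0x80) != 0; } // evaluated only at BMI/BPL
}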
How would the emulator remember from which value the flags shall be calculated, particularly when PLP can load inconsistent flags (e.g. N and Z both set)?
My hope was I'd just store off the result of the last instruction and evaluate flags from that result variable when I need them. What's this about inconsistent flags? Is that a hardware glitch or is that something else?
One thing that makes Java super awkward for writing a 6502 core is I constantly have to do & 0xff to get an unsigned byte value (it always expands to an int, since Java only allows bytes to be signed). Perhaps one optimization I could make would be, when I load the cartridge, to expand it to twice the size and store each byte in a short or int (doing the & 0xff calculation once when loading), eliminating the need to use & 0xff everywhere. Should shave off some time anyway...
GradualGames wrote:
One thing that makes Java super awkward for writing a 6502 core is I constantly have to do & 0xff to get an unsigned byte value (it always expands to an int, since Java only allows bytes to be signed). Perhaps one optimization I could make would be, when I load the cartridge, to expand it to twice the size and store each byte in a short or int (doing the & 0xff calculation once when loading), eliminating the need to use & 0xff everywhere. Should shave off some time anyway...
I would actually expect Java to handle simple byte masking like "& 0xFF" in a practical/efficient way. Most compilers handle this kind of thing very well, and attempts to "optimize" by manually packing bits in less intuitive ways could actually lower performance. (I could be wrong about this, but as always: measure your optimizations after implementing them to make sure.)
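For reference, a sketch of the two approaches side by side (hypothetical field names; as noted above, measure before assuming either one wins):
Code:
class PrgRomSketch {
    byte[] prgBytes;     // raw ROM: every read needs "& 0xFF" to get 0..255
    int[]  prgInts;      // pre-widened copy: the mask is paid once, at load time

    void load(byte[] rom) {
        prgBytes = rom;
        prgInts = new int[rom.length];
        for (int i = 0; i < rom.length; i++) {
            prgInts[i] = rom[i] & 0xFF;          // do the unsigned conversion up front
        }
    }

    int readMasked(int addr)  { return prgBytes[addr] & 0xFF; }  // mask on every access
    int readWidened(int addr) { return prgInts[addr]; }          // no mask needed
}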
GradualGames wrote:
My hope was I'd just store off the result of the last instruction and evaluate flags from that result variable when I need them. What's this about inconsistent flags? Is that a hardware glitch or is that something else?
He means that the flags aren't really set by a single result. The Zero flag may have been set by a different instruction than the Carry flag, etc. so you'd actually need to store a result for each flag, I suppose... but I'm not sure how that would be significantly different than just storing the flags. (Dunno if you're doing this, but you don't need to store flags as a packed 8-bit value- that's only needed for things that use flags on the stack, PHP/RTI/etc. Perfectly fine to just use separate booleans for the internal representation.)
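A minimal sketch of the separate-booleans representation (hypothetical names): the packed P byte only exists at the moments PHP/PLP/RTI actually need it.
Code:
class StatusFlagsSketch {
    boolean carry, zero, interruptDisable, decimal, overflow, negative;

    // Build the packed P byte only when PHP (or BRK) needs to push it.
    int pack() {
        return (negative ? 0x80 : 0) | (overflow ? 0x40 : 0)
             | 0x20 | 0x10                      // bits 5 and 4 read back as set after PHP
             | (decimal ? 0x08 : 0) | (interruptDisable ? 0x04 : 0)
             | (zero ? 0x02 : 0) | (carry ? 0x01 : 0);
    }

    // Split a pulled byte back into booleans when PLP or RTI loads it.
    void unpack(int p) {
        negative         = (p & 0x80) != 0;
        overflow         = (p & 0x40) != 0;
        decimal          = (p & 0x08) != 0;
        interruptDisable = (p & 0x04) != 0;
        zero             = (p & 0x02) != 0;
        carry            = (p & 0x01) != 0;
    }
}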
Going back to my earlier thing about recompilation, lazy evaluation is something that compilers are really good at doing, so it actually is a technique that would work very well for JIT / dynamic recompilation. However, if you try to do it at runtime, you're probably spending a lot of CPU time managing the laziness, which could easily negate the work saved.
The "store the last value that the ALU calculated and calculate N and Z therefrom" model won't always work. Normally it's not possible for N and Z to get set at once:
- ALU result $00: N false, Z true
- ALU result $01-$7F: N false, Z false
- ALU result $80-$FF: N true, Z false
But there are two ways for N and Z to get set at once. One is an instruction that pops flags from the stack, namely PLP or RTI:
Code:
FLAGS_N = $80
FLAGS_V = $40
FLAGS_D = $08
FLAGS_I = $04
FLAGS_Z = $02
FLAGS_C = $01
lda #FLAGS_N|FLAGS_Z
pha
plp
The other is BIT:
Code:
lda #$80
sta $00
lda #$01
bit $00
This sets Z because there are no 1 bits in common between the value in A and the value read from $00. But it sets N based only on the value read from $00, disregarding the value in A.
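To put that in emulator terms, a tiny hypothetical sketch of BIT shows why a single stored "last result" can't reproduce both flags:
Code:
class BitFlagsSketch {
    int a;                 // accumulator
    boolean zero, negative, overflow;

    void bit(int operand) {
        zero     = (a & operand) == 0;      // Z comes from A AND operand
        negative = (operand & 0x80) != 0;   // N comes from bit 7 of the operand alone
        overflow = (operand & 0x40) != 0;   // V comes from bit 6 of the operand
        // With a = $01 and operand = $80, Z and N end up set at the same time,
        // which no single 8-bit "last result" value can represent.
    }
}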
From Stack Overflow... leaving things as ints might actually (slightly) improve performance, or if not, at least make it easier to code.
http://stackoverflow.com/questions/1453 ... float-inst
I do think doing it in C will speed it up vs Java. Then there are the old tricks like using computed gotos instead of switch.
Well, the odd thing is, it's actually performing great most of the time, more than enough for what I'm using it for (see the GGVm thread), even on a 3 year old phone. The problem is, Android seems to put the thread at different priorities or on different cores outside of my control, and it sometimes runs at 1/4 speed, at which point, depending on the phone, I might see some dropped frames (the effective speed winds up slightly slower than the real NES would execute a ROM). If I could just get it not to throttle down to 1/4 speed I'd be golden. However, I suspect this may be something I can't control, and I'll just have to bite the bullet and write it in C for this particular platform.
calima wrote:
Then there are the old tricks like using computed gotos instead of switch.
Switches often compile to jump tables anyway, especially for in-order contiguous values.
Doing a computed goto is (slightly) faster than using a switch, because you do it at the end of an opcode instead of looping back around. One jump instead of two.
Of course, Java doesn't have any facilities for doing a computed goto (AFAIK). And unconditional branches are all but free on a modern CPU anyway. On CPUs that execute two billion cycles a second, a million extra unconditional jumps per second really, really doesn't matter.
I'd like fceux to be more efficient, if only from a power saving standpoint. Currently it takes about 35% of one core, with X taking an additional 20% (apparently it draws inefficiently?).
calima wrote:
I'd like fceux to be more efficient, if only from a power saving standpoint. Currently it takes about 35% of one core, with X taking an additional 20% (apparently it draws inefficiently?).
Er. Even on my Athlon/1333 it was tremendously lighter weight than that...
Overall, Atom and ARM tend to be slower than an Athlon. An Atom's work per clock is close to that of a Pentium 4.
This is on a Phenom II, but the core is likely not at full speed.
Honestly, if 6502 emulation performance is a problem then you might just not be doing an optimal job at a serial implementation.
PPU is usually way more bottlenecky on slow CPUs.
It's actually working really well, like I said. I'm kind of casting a wide net for improving the performance further. I think the actual issue I'm experiencing has to do with power-saving features on Android devices, because it just mysteriously throttles down to about 1/4 speed even though my thread is doing the exact same work every time.
I don't even have a PPU, not really, anyway. It just renders the tiles, straight, rather than scanline by scanline. I.e. not a real NES emulator; see the GGVm thread if curious.
In other words, you have a HLE PPU (high level emulated Picture Processing Unit). The best comparison I guess would be PocketNES for Game Boy Advance, which also has a HLE PPU because it maps the NES's tiled backgrounds and sprites onto those of the GBA. Yet it somehow runs at full speed on a 16.8 MHz ARM7TDMI.
I have another question relevant to what I'm working on: I know the CPU clock speed from the wiki is 1.789773 MHz, which amounts to about 29829.55 cycles per 1/60th of a second, correct? How many actual instructions per frame does that amount to, on average? I am going to guess the average number of cycles taken by any given instruction is about 3 (seeing some as low as 2 and some as high as 7)?
Right now, one of the metrics my cpu spits out is "instructions per second." I have no concept of ticks or cycles, it is a purely high level cpu simulator. Thus when I'm seeing a metric such as: 2,570,423 instructions per second, this is ridiculously faster than the NES needs to be, if I'm correct in the above paragraph at all.
One thing I learned today is Android throttles down CPUs when they get hot. I do seem to observe a degradation of performance over time, and if I let it sit and start over the speed is restored.
Perhaps all I need to do is manually throttle the CPU with Thread.sleep, at least on mobile devices, in an attempt to stress the CPU less? After all, there's absolutely no reason to be "overclocking" to the extreme degree that it is right now.
...hmm...from what I'm reading, sleep will still use the cpu. Sounds like I want wait. Problem is I can't know when I should do that. Since GGVm already has game-specific knowledge perhaps I can tell it where all nmi wait spin loops are and do wait/notify to give the thread some rest between frames.
Dots per frame = 341 * 262 - 0.5 = 89341.5
Cycles per frame = 89341.5 / 3 = 29780.5
where each "cycle" is one time the CPU reads or writes memory, including dummy reads for certain instructions
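Combining that with the rough 3-cycles-per-instruction guess from the question above gives only a ballpark, since the real mix varies by game:
Instructions per frame ≈ 29780.5 / 3 ≈ 9,927
Instructions per second ≈ 9,927 × 60 ≈ 595,600
So a core measuring ~2.5 million instructions per second has roughly 4× headroom.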
So how hard would it be to make your CPU spit out the metric "memory reads and writes per second"?
Yes, once you run out of memory accesses, blocking until the next host vblank would be a good idea.
Another good idea is automatic speed hacking, as implemented in PocketNES: if the system reads a location in a tight loop, stop the CPU until the next interrupt, like this:
Code:
lda nmis
:
cmp nmis
beq :-
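One possible (purely hypothetical) way an emulator might detect that shape of loop at runtime, rather than patching known addresses: count how many times the same backward branch is taken without any memory write in between, and once it settles, block until the next NMI.
Code:
class IdleLoopSketch {
    private int lastLoopPc = -1;
    private int sameLoopCount = 0;

    // Called each time the core takes a backward branch (hypothetical hook).
    boolean looksLikeIdleLoop(int branchPc, boolean wroteMemoryThisIteration) {
        if (branchPc == lastLoopPc && !wroteMemoryThisIteration) {
            sameLoopCount++;                 // same read-only loop, spinning again
        } else {
            lastLoopPc = branchPc;           // different loop, or it did real work
            sameLoopCount = 0;
        }
        return sameLoopCount > 16;           // arbitrary threshold: safe to sleep until NMI
    }
}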
"At least one new post has been made to this topic. You may wish to review your post in light of this."
GradualGames wrote:
Since GGVm already has game-specific knowledge perhaps I can tell it where all nmi wait spin loops are
Can you tell it to automatically recognize patterns like that?
I think I have a proof of concept working for a hard-coded example (pattern recognition can wait, especially if this doesn't wind up solving the hot cpu problem on android devices). Now I just need to figure out how to measure instructions per second correctly taking into account when the thread is sleeping, so I can observe whether this really does get me an improvement or not.
Here's the top oprofile results from fceux if anyone's interested. Indeed not the 6502 core, but sound and drawing are taking most of the time.
Code:
samples  %        image name                          symbol name
12098    31.7391  fceuxg                              NeoFilterSound(int*, int*, unsigned int, int*)
10644    27.9245  fceuxg                              Blit8ToHigh(unsigned char*, unsigned char*, int, int, int, int, int)
 2273     5.9632  fceuxg                              RefreshLine(int)
 1948     5.1106  libasound_module_rate_speexrate.so  resampler_basic_interpolate_single
 1869     4.9033  fceuxg                              X6502_RunDebug(int)
  979     2.5684  fceuxg                              RDoSQ1()
  843     2.2116  fceuxg                              RDoSQ2()
  809     2.1224  fceuxg                              FCEUPPU_Loop(int)
  749     1.9650  fceuxg                              FlushEmulateSound()
  741     1.9440  libc-2.7.so                         memset
You get the biggest CPU emulation speedup from *idle loop skipping*. But unless you're on a 16MHz ARM or something, you probably won't notice.
Welp, as a quick update, it turns out my 6502 core wasn't the bottleneck after all, it was how I was using OpenGL. One game was running around 23fps on a 3 year old phone of mine---changed some things in how I'm using OpenGL and now it's 60fps. Crazy stuff...
That said...I actually found a simpler way to use wait/notify on my cpu thread. I just count the instructions and when it reaches 10,000 (a rough estimate based on roughly 30,000 cycles per frame) it blocks, and then I notify on every nmi.
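That scheme looks roughly like the following sketch (hypothetical names; it assumes only the CPU thread touches the instruction counter):
Code:
class CpuThrottleSketch {
    private final Object frameLock = new Object();
    private boolean frameDone = false;          // set by the CPU thread, cleared on NMI
    private int instructionsThisFrame = 0;      // touched only by the CPU thread
    private static final int INSTRUCTIONS_PER_FRAME = 10_000;  // rough, per the post above

    // Called by the CPU thread after each emulated instruction.
    void onInstruction() throws InterruptedException {
        if (++instructionsThisFrame >= INSTRUCTIONS_PER_FRAME) {
            instructionsThisFrame = 0;
            synchronized (frameLock) {
                frameDone = true;
                while (frameDone) {
                    frameLock.wait();           // block instead of spinning the core
                }
            }
        }
    }

    // Called once per NMI / vblank by the render side.
    void onNmi() {
        synchronized (frameLock) {
            frameDone = false;
            frameLock.notifyAll();              // let the CPU thread run the next frame
        }
    }
}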
Is this a case of glTexImage2D vs glTexSubImage2D?
There's a huge difference between those two. One is a memory allocation and garbage collection call, and the other one just updates a texture.
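In raw GL20 terms, as exposed by LibGDX, the difference looks roughly like this (texture binding and pixel buffer setup are assumed, and the class and method names are just for illustration):
Code:
import com.badlogic.gdx.Gdx;
import com.badlogic.gdx.graphics.GL20;
import java.nio.ByteBuffer;

class FrameTextureSketch {
    // Call once at startup: glTexImage2D (re)allocates the texture's storage.
    void allocateOnce(int width, int height, ByteBuffer pixels) {
        Gdx.gl.glTexImage2D(GL20.GL_TEXTURE_2D, 0, GL20.GL_RGBA, width, height, 0,
                GL20.GL_RGBA, GL20.GL_UNSIGNED_BYTE, pixels);
    }

    // Call every frame: glTexSubImage2D only overwrites the existing storage.
    void updateEveryFrame(int width, int height, ByteBuffer pixels) {
        Gdx.gl.glTexSubImage2D(GL20.GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL20.GL_RGBA, GL20.GL_UNSIGNED_BYTE, pixels);
    }
}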
Dwedit wrote:
Is this a case of glTexImage2D vs glTexSubImage2D?
There's a huge difference between those two. One is a memory allocation and garbage collection call, and the other one just updates a texture.
I'm actually not sure what is actually going on OpenGL wise, as I'm using LibGDX, which is a thin layer above OpenGL. At a high level, however, I am aware that the bottleneck was because I had been storing 8 textures in the gpu (4 copies of each pattern table, one for each attribute, for background and sprites) and, when drawing the nametable and sprites, it was switching between these textures quite frequently, which is an expensive operation. So I rewrote the code which generates images from the ppu to use just one texture rather than 8...the speed boost was phenomenal, went from 23fps on an old phone of mine to 60fps.
Would it help to use a pixel shader for palette lookup so that you don't have to store eight copies of the pattern table in memory at once?
tepples wrote:
Would it help to use a pixel shader for palette lookup so that you don't have to store eight copies of the pattern table in memory at once?
I actually do use a fragment shader to interpret the textures I'm generating, which does perform a palette lookup on a texture that is 32 pixels wide by 1 pixel tall. I actually encode the pixel value, attribute, and bg or spr in the rgb values of the actual textures. Without the shader turned on, everything looks sickly greenish/red/blue hues, haha.
I've been trying to think of a way to delegate all of the ppu decoding to the shader since that is performed in parallel on the gpu for each pixel (afaik), however the bandwidth needed for uploading 8k to the gpu every frame, though small, might be prohibitive? I'm not sure. #longtermgoals. It'd be neat if I could get that to work though, because then suddenly live CHR-RAM updates would be possible. Right now, I can only support it during transitions.
Just peeking around... What are you actually doing to draw stuff?
https://github.com/libgdx/libgdx/blob/m ... xture.java
In that file, in GLTexture::uploadImageData, I see only calls to texImage2D, nothing to subimage.
https://github.com/libgdx/libgdx/blob/m ... xture.java
But in Texture::draw, it does call texSubImage2D.
The PPU graphics data is generated pixel by pixel into a LibGDX Pixmap (only on startup for CHR-ROM based games, or during transitions when graphics are off for CHR-RAM games...live updates not supported), which is converted into a Texture, which then is converted into TextureRegions (one for each chr tile of each pattern table/attribute combination for bg and spr). Finally, Sprite objects are created from TextureRegions. I don't know what's going on at a lower level than this (never did OpenGL before, so I'd probably not easily understand what the library's doing)---just that if I were to use multiple Textures from which to derive TextureRegions/Sprites, if I'm not careful with how many times I switch between using those Textures it slows down drawing a lot. I actually was able to obtain a temporary speed up by sorting drawing by attribute so the 8 textures I had previously been generating would only get switched to once, each. Having rewritten this so only 1 master texture is used, I don't have to do this sorting anymore so it both gets me the speedup and keeps the code simple.
On a related topic to the OP...
Premature optimization truly is the root of all evil. For most of the duration of this project, I had been executing the CPU on a background thread. Apparently on Android devices this is a bad idea. Just for the hell of it, I experimented with synchronously advancing the CPU in the render thread. Games which had been performing decently but with a lot of stuttering are now very close to being buttery smooth, even on my oldest phone.