This all seems hacky and unnecessary.
Doing things "before" the CPU or "after" shouldn't matter as long as you're consistent.
It's confusing to visualize what happens when the CPU and PPU try to access something at the same time -- so just assume you got one of the power-on CPU/PPU alignments where they never do:
Code:
P = ppu cycle
C = cpu cycle
PPPCPPPCPPPCPPPC
You might look at this and say "he's running the PPU first" .. but this sequence could just as easily be rewritten as:
Code:
CPPPCPPPCPPPCPPP
And now the CPU is first. There's no difference -- they're the exact same thing.
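If you want to convince yourself, here's a quick sketch. The `schedule` function and its `cpu_phase` parameter are names I made up for illustration, and it assumes the NTSC ratio of 3 PPU cycles per CPU cycle:

```python
def schedule(ticks, cpu_phase):
    """Return a string of 'P'/'C' for `ticks` steps.

    Every 4th step is a CPU cycle; cpu_phase (0..3) picks which one.
    """
    return "".join("C" if i % 4 == cpu_phase else "P" for i in range(ticks))


ppu_first = schedule(16, cpu_phase=3)   # PPPCPPPCPPPCPPPC
cpu_first = schedule(16, cpu_phase=0)   # CPPPCPPPCPPPCPPP

# Trim three P's off the front of one and three off the end of the other:
# what is left is the same run of cycles.
assert ppu_first[3:] == cpu_first[:-3]
```

The two strings only differ in where you started counting, not in how the cycles are interleaved.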
The only thing that matters is when something happens on one P cycle, how it impacts the surrounding C cycles (and vice versa). Logically, something that happens on a P cycle can't impact a C cycle that is already complete, so you only have to worry about the next one:
Code:
  PPU generates NMI here
  |
  V
CPPPCPPPCPPP
    ^
    |
    |
    CPU 'sees' NMI here (but won't actually perform
    it until this instruction is complete)
Notice how I "ran the CPU first" here... but this logic doesn't change even if you run the PPU first:
Code:
 PPU generates NMI here
 |
 V
PPPCPPPCPPPC
   ^
   |
   |
   Exactly the same .. even though PPU is "run first"
On this abstract level, it really isn't any more complicated than that. You're making this problem out to be way harder than it really is.
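The diagrams above can be turned into a rough sketch like this. Everything here is simplified and hypothetical: `ppu_cycles_until_cpu_sees_nmi` is a made-up name, the "NMI line" is a plain flag, and I'm assuming the CPU just checks that flag once per CPU cycle (real NMI polling and edge detection on the 6502 are more involved than this):

```python
def ppu_cycles_until_cpu_sees_nmi(order, nmi_ppu_cycle, repeats=4):
    """Step PPU and CPU cycles in `order` ("CPPP" or "PPPC"), repeated.

    The PPU raises NMI on PPU cycle number `nmi_ppu_cycle` (0-based).
    Returns how many PPU cycles have completed when the CPU first
    sees the line asserted, or None if that never happens.
    """
    nmi_line = False      # set by the PPU, checked by the CPU
    ppu_cycle = 0
    for kind in order * repeats:
        if kind == "P":
            if ppu_cycle == nmi_ppu_cycle:
                nmi_line = True
            ppu_cycle += 1
        elif nmi_line:    # a C cycle that comes after the NMI was raised
            return ppu_cycle
    return None


# Same answer whichever chip is "run first".
for n in range(9):
    cpu_first = ppu_cycles_until_cpu_sees_nmi("CPPP", n)
    ppu_first = ppu_cycles_until_cpu_sees_nmi("PPPC", n)
    assert cpu_first == ppu_first
```

The point is that the CPU can only notice the NMI on the next C cycle after the P cycle that raised it, and that's true for both orderings.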
In all likelihood, you're seeing shaky status bars because you have an off-by-1 error somewhere in your code. Maybe you're firing an NMI one PPU cycle too early/late or something like that. That could be easily exposed if you dump a tracelog and examine it.
If you make a trace log that tells you exactly what cycle NMI happens on, and what cycles the CPU is writing to scroll regs, or reading from $2002 -- and compare that to a disassembly of what the code is actually doing, you should be able to spot where your emu is going wrong.
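Something along these lines would do for the trace log. This is only a sketch of one possible format: the `TraceLog` class, its method names, and the line layout are all made up, and you'd call it from wherever your emu raises NMI and handles CPU bus accesses. The `addr & 0x2007` bit relies on the PPU registers being mirrored every 8 bytes across $2000-$3FFF:

```python
# PPU registers worth watching when chasing scroll/NMI timing bugs.
WATCHED = {
    0x2000: "PPUCTRL",
    0x2002: "PPUSTATUS",
    0x2005: "PPUSCROLL",
    0x2006: "PPUADDR",
}


class TraceLog:
    def __init__(self):
        self.lines = []

    def _emit(self, scanline, dot, text):
        self.lines.append(f"sl={scanline:3d} dot={dot:3d}  {text}")

    def nmi(self, scanline, dot):
        """Call on the PPU cycle where NMI is raised."""
        self._emit(scanline, dot, "NMI raised")

    def cpu_access(self, scanline, dot, pc, addr, value, is_write):
        """Call on every CPU bus access; only watched registers are logged."""
        if not 0x2000 <= addr < 0x4000:
            return
        name = WATCHED.get(addr & 0x2007)   # fold mirrors down to $2000-$2007
        if name is None:
            return
        op = "W" if is_write else "R"
        self._emit(scanline, dot,
                   f"PC={pc:04X} {op} ${addr:04X} ({name}) = {value:02X}")


# Example with made-up numbers:
log = TraceLog()
log.nmi(241, 1)
log.cpu_access(241, 13, 0xC123, 0x2002, 0x80, is_write=False)
print("\n".join(log.lines))
```

With the scanline/dot in front of every line, an NMI or register access that lands one PPU cycle off should stand out when you put the log next to the disassembly.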
----------------------
Or... if you want to run the CPU as "half-cycles" you might have a pattern like this:
Code:
P = PPU cycle
r = rising edge of CPU clock
f = falling edge of CPU clock
rPfPPrPfPPrPfPP
or
rPPfPrPPfPrPPfP
This might help if you find that writes take effect on the rising edge of a CPU clock, while reads take effect on the falling edge. But I *think* they both take place on the rising edge, so this is probably unnecessary. You could verify that with Visual2A03 and Visual2C02 if you want.
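For what it's worth, both patterns fall out of counting master clocks. Here's a sketch under a few assumptions: NTSC dividers (CPU = master / 12, PPU = master / 4), a CPU clock that is high for exactly half of its 12 master clocks (which real hardware may not match), and a made-up `half_cycle_pattern` function whose `ppu_offset` parameter stands in for the CPU/PPU alignment:

```python
def half_cycle_pattern(ppu_offset, master_clocks=36):
    """Build an r/f/P string by stepping one master clock at a time.

    ppu_offset (0..3) is the master clock within each group of 4 on
    which the PPU ticks. CPU edges are listed first on a shared clock.
    """
    out = []
    for m in range(master_clocks):
        if m % 12 == 0:
            out.append("r")          # rising edge of CPU clock
        elif m % 12 == 6:
            out.append("f")          # falling edge of CPU clock
        if m % 4 == ppu_offset:
            out.append("P")          # PPU cycle
    return "".join(out)


assert half_cycle_pattern(3) == "rPfPPrPfPPrPfPP"
assert half_cycle_pattern(1) == "rPPfPrPPfPrPPfP"
```

Which of the two you get just depends on where the PPU's ticks land relative to the CPU's edges.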