Right now I have VERY basic timing code implemented, and I am rendering on a per-frame basis. I was wondering if I am going to run into unforeseen trouble down the road using this method. I have gotten to the point where a few games are working, but I don't want to do a lot of heavy PPU coding and regret it later. I'm still at a point where I can switch the way rendering is done, so I'd appreciate your feedback.
To sum up:
Should I switch to a per pixel or per scanline technique instead to avoid problems down the road?
Thanks!
EDIT: Removed last comment because it sounded snotty. SORRY!
Per-pixel and per-scanline aren't much different. Per-pixel just means you divide up the scanlines into smaller sections if the PPU changes state while rendering a scanline.
If you ever want to play Punch Out or Fire Emblem, you need per-pixel, since the mapper switches CHR banks when the PPU fetches a particular tile number.
Per frame will only work with the simpler games. Any that have a scrolling area and a fixed status bar won't work.
It would be recommended to at least do per-scanline. The state of things at the end of rendering, which is likely when you draw the frame now, is not enough to fully emulate the PPU. Graphics banks can change, scrolling and nametable registers can change, etc.
The easiest from a coding point of view is to emulate per pixel: run one CPU opcode, then run 3 PPU cycles for every CPU cycle it took. This keeps the PPU and CPU roughly in sync. To be even more accurate, you would need to actually break CPU opcodes down into their individual cycles and execute 1 CPU cycle followed by 3 PPU cycles.
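As a minimal sketch of that interleave (the `cpu_step` and `ppu_tick` hooks here are hypothetical placeholders, not from any particular emulator):

```c
/* Sketch of the opcode-then-3x-PPU interleave.  cpu_step and
   ppu_tick are illustrative stand-ins for real emulator hooks. */
static long ppu_dots = 0;

static int cpu_step(void) {     /* execute one opcode, return its cycle count */
    return 2;                   /* stand-in: pretend every opcode takes 2 cycles */
}

static void ppu_tick(void) {    /* advance the PPU by one dot */
    ppu_dots++;
}

static void emulate_step(void) {
    int cycles = cpu_step();
    for (int i = 0; i < cycles * 3; i++)   /* 3 PPU cycles per CPU cycle (NTSC) */
        ppu_tick();
}
```

The main loop just repeats `emulate_step` until a frame's worth of dots has been emulated.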
But as far as running most games goes, you can do per-scanline rendering. You can do hacky things; it can work if you want it to.
An alternative to per-pixel and per-scanline is the "catch-up" method: you execute CPU cycles until an event happens that needs the PPU to catch up, at which point you run your PPU emulation until it catches up to the CPU. In the end, it's up to you to decide what works best for you.
MottZilla wrote:
It would be recommended to at least do per-scanline. The state of things at the end of rendering, which is likely when you draw the frame now, is not enough to fully emulate the PPU. Graphics banks can change, scrolling and nametable registers can change, etc.
The easiest from a coding point of view is to emulate per pixel: run one CPU opcode, then run 3 PPU cycles for every CPU cycle it took. This keeps the PPU and CPU roughly in sync. To be even more accurate, you would need to actually break CPU opcodes down into their individual cycles and execute 1 CPU cycle followed by 3 PPU cycles.
But as far as running most games goes, you can do per-scanline rendering. You can do hacky things; it can work if you want it to.
An alternative to per-pixel and per-scanline is the "catch-up" method: you execute CPU cycles until an event happens that needs the PPU to catch up, at which point you run your PPU emulation until it catches up to the CPU. In the end, it's up to you to decide what works best for you.
I am going to switch over to a per pixel method this weekend. I'd rather not take shortcuts and regret it in the end. Plus, the code I have in place for per-frame rendering shouldn't be too difficult to change into per pixel. Let's hope it's as easy as I'm thinking it will be.
The hardest part about more precision is breaking in and out of the loops. It's a lot harder to go from scanline to pixel than from frame to scanline.
Say you load in tile data for the next 8 pixels; if you then return from your loop, you lose that data. So you either have to store it globally or reload it for each pixel, the former being much more correct.
But overall, not very hard. A weekend project or so.
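The "store it globally" approach might look like the sketch below: the in-flight tile fetch lives in a persistent struct instead of loop locals, so the render loop can return mid-scanline and resume later. The struct layout and function names are hypothetical, loosely modeled on the PPU's background shift registers.

```c
#include <stdint.h>

/* Hypothetical persistent PPU state: lives across calls, so the
   render loop can exit mid-scanline without losing the tile fetch. */
typedef struct {
    int      dot;                 /* current dot within the scanline */
    uint8_t  tile_lo, tile_hi;    /* latched pattern bytes for the next 8 pixels */
    uint16_t shift_lo, shift_hi;  /* shift registers fed from the latches */
} PpuState;

static PpuState ppu;              /* global, not a local inside the loop */

/* Render up to `budget` dots, then return; state survives the return. */
static void ppu_run(int budget) {
    while (budget--) {
        if (ppu.dot % 8 == 0) {
            /* reload the shift registers from the latched tile bytes */
            ppu.shift_lo = (uint16_t)((ppu.shift_lo & 0xFF00) | ppu.tile_lo);
            ppu.shift_hi = (uint16_t)((ppu.shift_hi & 0xFF00) | ppu.tile_hi);
        }
        /* ... output one pixel from the top bits of the shift registers ... */
        ppu.shift_lo <<= 1;
        ppu.shift_hi <<= 1;
        ppu.dot++;
    }
}
```

Because `ppu` is persistent, calling `ppu_run(3)` and then `ppu_run(5)` produces the same result as one `ppu_run(8)` call.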
MottZilla wrote:
[An] alternative to the per pixel and per scanline is to do the "catch up" method. You execute CPU cycles until an event happens that needs the PPU to catch up, at which point you execute your PPU emulation to run PPU cycles until it catches up to the CPU. In the end it's up to you to decide what works best for you.
Catch-up is a threading method, so it's orthogonal to PPU emulation granularity. You can do per-frame, scanline, or pixel with catch-up, green threads, etc.
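A bare-bones sketch of the catch-up pattern at pixel granularity (all names here are illustrative, not from any real emulator):

```c
/* Catch-up sketch: the PPU lags behind the CPU and is only brought
   current when the CPU touches something PPU-related. */
static long cpu_cycles = 0;   /* total CPU cycles executed so far */
static long ppu_dots   = 0;   /* total PPU dots emulated so far */

static void ppu_tick(void) {  /* advance the PPU by one dot */
    ppu_dots++;
}

/* Run the PPU until it reaches the CPU's current point in time. */
static void ppu_catch_up(void) {
    while (ppu_dots < cpu_cycles * 3)   /* 3 dots per CPU cycle (NTSC) */
        ppu_tick();
}

/* Called from the CPU core on any access to a PPU register. */
static unsigned char ppu_register_read(unsigned short addr) {
    ppu_catch_up();          /* PPU state must be current before we read */
    (void)addr;
    return 0;                /* ... actual register logic would go here ... */
}
```

Between register accesses the CPU is free to run long stretches of opcodes with no PPU work at all, which is where the speed win comes from.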
You can also do per-pixel with full CPU timing accuracy even with a per-opcode CPU emulator by having it keep track of the times of memory accesses. For example, if a STA $2000 started at cycle count 10, then the $2000 write would be passed a cycle count of 13.
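That idea can be sketched as follows, assuming the bus write of an absolute STA lands on the last of its 4 cycles (the `ppu_write` and `exec_sta_2000` names are hypothetical):

```c
/* STA absolute takes 4 cycles; the actual bus write happens on the
   last one.  A per-opcode CPU can still stamp the write precisely. */
#define STA_ABS_CYCLES 4

static long last_write_time = -1;

/* Hypothetical PPU register write that receives a cycle timestamp. */
static void ppu_write(unsigned short addr, unsigned char val, long when) {
    (void)addr; (void)val;
    last_write_time = when;   /* sync or timestamp logic would use this */
}

/* Execute STA $2000 as one unit, but timestamp the write exactly.
   Returns the cycle count after the opcode completes. */
static long exec_sta_2000(long start_cycle, unsigned char a) {
    ppu_write(0x2000, a, start_cycle + STA_ABS_CYCLES - 1);
    return start_cycle + STA_ABS_CYCLES;
}
```

So an opcode starting at cycle 10 hands the PPU a write stamped at cycle 13, exactly as in the example above, without the CPU core itself being cycle-stepped.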
My favorite is to do catch-up in both directions :)
In MottZilla's example, allow the PPU to then run ahead of the CPU until it needs to access something the CPU can modify. This is less effective for CPU<>PPU, since the PPU's registers are accessed constantly, but for certain chip setups (an audio processor, for example) it can be really impressive.
A further optimization I have been thinking about lately is write-ahead. Instead of simply syncing the PPU up to the CPU on any access, only do it on reads. Store CPU writes with their timestamps, say in a binary min-heap for O(1) peek and O(log n) insert and delete, and then replay them when the PPU eventually takes over; and vice versa. This way, you can probably cut the number of synchronizations needed in half.
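A sketch of such a write queue, using a fixed-size binary min-heap keyed on the timestamp (the `Write` record layout and `replay_until` interface are assumptions for illustration):

```c
#include <stddef.h>

/* Hypothetical write-ahead queue: CPU writes are recorded with a
   timestamp and replayed in time order when the PPU catches up. */
typedef struct { long when; unsigned short addr; unsigned char val; } Write;

#define QCAP 256
static Write  heap[QCAP];
static size_t heap_len = 0;

static void heap_push(Write w) {              /* O(log n) insert */
    size_t i = heap_len++;
    heap[i] = w;
    while (i > 0 && heap[(i - 1) / 2].when > heap[i].when) {
        Write t = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static Write heap_pop(void) {                 /* O(log n) delete-min */
    Write top = heap[0];
    heap[0] = heap[--heap_len];
    size_t i = 0;
    for (;;) {
        size_t l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < heap_len && heap[l].when < heap[m].when) m = l;
        if (r < heap_len && heap[r].when < heap[m].when) m = r;
        if (m == i) break;
        Write t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
    return top;
}

/* Replay all queued writes with timestamps up to `now`, oldest first. */
static void replay_until(long now, void (*apply)(Write)) {
    while (heap_len > 0 && heap[0].when <= now)   /* O(1) peek at min */
        apply(heap_pop());
}
```

When the PPU finally runs, it calls `replay_until` as it advances, applying each buffered register write at the right moment instead of having forced a sync at write time.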
And of course, the holy grail would be a way to "unwind" reads that turn out to differ from what the state actually was in the past. This is pivotal for high performance on processors that directly share, e.g., a large bank of RAM. It's very, very unlikely that they will both access the same byte of RAM right away, but since you don't know, it forces a lot of unnecessary syncs. Even if the unwind action were relatively painful, the number of synchronizations eliminated could lead to a real speedup overall.
The more you run ahead with this model, the more painful the rewind operations become. You'd essentially be getting into adaptive territory, trying to weigh the tradeoffs as best you can. It could be a very exciting area of research at the very top end of computer science.
I won't be so strict as to say that per-frame rendering doesn't work with games that use scrolling in part of the screen.
I mean, for example: if your emu does per-frame rendering BUT records which scanline(s) the scroll registers were modified on, then you can still compose your frame only once at the beginning of vblank and still fully support partial-screen scrolling.
What I mean is there's no single way of doing per-frame rendering, just as there isn't a single way of doing per-scanline rendering.
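A sketch of that record-and-replay idea, under the assumption that only the X scroll written via $2005 matters for the split (the event log layout and function names are made up for illustration):

```c
#include <stdint.h>

/* Hypothetical scroll-change log for a per-frame renderer: each
   $2005 write is recorded with the scanline it occurred on, and the
   frame is composed at vblank using the value in effect per line. */
typedef struct { int scanline; uint8_t scroll_x; } ScrollEvent;

#define MAX_EVENTS 262
static ScrollEvent events[MAX_EVENTS];
static int n_events = 0;

/* Called from the $2005 write handler during the frame. */
static void log_scroll_write(int scanline, uint8_t value) {
    if (n_events < MAX_EVENTS)
        events[n_events++] = (ScrollEvent){ scanline, value };
}

/* At vblank: find the scroll value in effect on a given scanline.
   Events are logged in scanline order, so the last match wins. */
static uint8_t scroll_for_scanline(int scanline, uint8_t initial) {
    uint8_t x = initial;
    for (int i = 0; i < n_events; i++)
        if (events[i].scanline <= scanline)
            x = events[i].scroll_x;
    return x;
}
```

The renderer then draws each scanline of the frame with `scroll_for_scanline(line, ...)`, so a status bar with one scroll value and a playfield with another both come out right from a single render pass at vblank.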
Petruza wrote:
if your emu does per-frame rendering BUT records which scanline(s) the scroll registers were modified on, then you can still compose your frame only once at the beginning of vblank and still fully support partial-screen scrolling.
That's called "timestamping", and in theory it can handle even mid-scanline changes. You just have to be extra careful with code that checks for a sprite 0 hit, or with mappers that assert an IRQ on PPU memory reads, because PPU register writes logged before those events can affect the result.
I have already switched my code over to per-pixel rendering. It was a little more of a pain than I expected, but it seems to be working well now. After working on this for a bit, the benefits are quite clear.