Based on my own experience and the experiences I have read about on this forum, most NES emulator developers end up writing the PPU in iterative stages of complexity. In other words, the PPU is virtually rewritten several times, and each version approximates the actual hardware better than the last. There are several reasons to do this. First, it takes time to comprehend each aspect of the PPU, and part of the learning process involves coding; working through multiple versions may be the only practical way to learn the material. Second, there is the amount of time you have to spend on the project, which, depending on your current programming knowledge and experience, may be quite immense. Each version of the PPU will be able to play some subset of games, and you can call it quits at any stage. But if you jump right into the most complex PPU design, you might never get anything playable completed. Third, and related to that, is motivation. Once you see some games running with a simple PPU implementation, you'll probably find it a lot easier to work on the next version than to wait and wait for the complex one to get done.
Understanding timing is key to making the emulator work. On NTSC, a new frame needs to be displayed roughly every 16.6 milliseconds (about 60.1 frames per second). You'll need some sort of sleep function that delays until it is time to generate the next frame:
Code:
while(true) {
    renderFrame();
    waitForNextFrameTime();
}
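One way waitForNextFrameTime() might be written is sketched below. It assumes a POSIX system (clock_gettime and clock_nanosleep are POSIX calls, not standard C), and the FRAME_NS constant and the advanceDeadline() helper are names I made up for illustration. The idea is to sleep until an absolute deadline instead of sleeping for a fixed duration, so that a slow frame does not push every later frame back.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Assumed NTSC frame period: 1e9 ns / 60.0988 frames per second, rounded. */
#define FRAME_NS   16639267L
#define NS_PER_SEC 1000000000L

/* Hypothetical helper: move a deadline forward by ns (ns must be < 1 second). */
static void advanceDeadline(struct timespec *t, long ns) {
    t->tv_nsec += ns;
    if (t->tv_nsec >= NS_PER_SEC) {   /* carry into the seconds field */
        t->tv_nsec -= NS_PER_SEC;
        t->tv_sec += 1;
    }
}

static struct timespec nextFrame;
static int started = 0;

void waitForNextFrameTime(void) {
    if (!started) {
        /* First call: start the schedule from the current time. */
        clock_gettime(CLOCK_MONOTONIC, &nextFrame);
        started = 1;
    }
    advanceDeadline(&nextFrame, FRAME_NS);
    /* Sleep until the absolute deadline; try again if a signal interrupts us. */
    while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &nextFrame, NULL) == EINTR) {
    }
}
```

If a frame takes longer than its budget, the deadline is already in the past and the sleep returns immediately, so the emulator catches up on its own. On other platforms you would swap in whatever timer the OS or your media library provides.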
The CPU and PPU execute in parallel and they are synchronized by a common clock. But, the first approximation of this might look like:
Code:
void renderFrame() {
    renderBackground();
    renderSprites();
    generateNMI();
    runCpuForNumberOfCyclesInFrame();
}
That is sufficient for the simplest games like Donkey Kong and Popeye.
The next approximation of the PPU is scanline based:
Code:
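void renderFrame() {
    // Scanline -1 is the pre-render line; 0 through 239 are visible.
    for(int i = -1; i < 240; i++) {
        renderScanline(i);
        runCpuForNumberOfCyclesInScanline();
    }
    // This approximation treats post-render scanline 240 as part of vblank.
    generateNMI();
    // Scanlines 240 through 260 bring the NTSC total to 262 per frame.
    for(int i = 240; i < 261; i++) {
        runCpuForNumberOfCyclesInScanline();
    }
}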
Ultimately, you should create a PPU function that renders a single pixel:
Code:
void renderFrame() {
    // 262 NTSC scanlines (-1 through 260), 341 dots each.
    for(int i = -1; i < 261; i++) {
        for(int j = 0; j < 341; j++) {
            renderDot(i, j);
        }
    }
}
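In this model, the PPU drives the CPU: the dot loop decides when the CPU gets to execute its next cycle. For NTSC, the ratio is 3:1 (3 dots per CPU cycle). For PAL, the ratio is 16:5 (3.2 dots per CPU cycle) and there are additional vblank scanlines. The sleep delay between frames will also be slightly different. These ratios can be maintained by using floats or by an integer accumulator that carries the remainder from one dot to the next.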
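Here is a minimal sketch of the integer approach. The ClockDivider struct and the cpuCyclesDue() and runDots() functions are hypothetical names of mine, not part of any existing emulator. The accumulator gains the CPU side of the ratio on every dot, and a CPU cycle becomes due each time it reaches the dot side of the ratio.

```c
/* Hypothetical clock divider: den CPU cycles for every num PPU dots. */
typedef struct {
    int num;  /* dots:       3 for NTSC, 16 for PAL */
    int den;  /* CPU cycles: 1 for NTSC,  5 for PAL */
    int acc;  /* running remainder, always in [0, num) */
} ClockDivider;

/* Call once per PPU dot. Returns 1 if a CPU cycle is now due, else 0.
   Because den < num, at most one cycle can become due per dot. */
static int cpuCyclesDue(ClockDivider *d) {
    d->acc += d->den;
    if (d->acc >= d->num) {
        d->acc -= d->num;   /* keep the remainder so no time is lost */
        return 1;
    }
    return 0;
}

/* Advance by a number of dots and count the CPU cycles that became due.
   A real emulator would execute one CPU cycle where this adds to the total. */
static long runDots(ClockDivider *d, long dots) {
    long cycles = 0;
    for (long i = 0; i < dots; i++) {
        cycles += cpuCyclesDue(d);
    }
    return cycles;
}
```

With {3, 1, 0} the CPU runs on every third dot. With {16, 5, 0} it runs 5 times in every 16 dots, spread as evenly as whole dots allow, and the remainder left in acc carries across scanline and frame boundaries so the ratio never drifts.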
The PPU does several things in parallel. Such a renderDot() function will contain a lot of switching logic that decides what to do based on the current scanline and the current dot index. The wikis that describe the PPU are not written in procedural pseudocode. Instead, they are written as a collection of possible cases. You'll need the switching logic to direct execution to each case.
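To give a feel for that switching logic, here is a rough skeleton for NTSC. The dot ranges follow the usual wiki descriptions of a scanline. The DotKind enum, the classifyDot() helper, and the two state variables are placeholders I invented for the sketch, and the actual fetching and drawing is reduced to a counter.

```c
#include <stdbool.h>

typedef enum {
    DOT_IDLE,          /* dot 0 */
    DOT_PIXEL,         /* dots 1..256: background fetches and pixel output */
    DOT_SPRITE_FETCH,  /* dots 257..320: sprite data for the next scanline */
    DOT_PREFETCH,      /* dots 321..336: first two tiles of the next scanline */
    DOT_DUMMY_FETCH    /* dots 337..340: unused nametable fetches */
} DotKind;

/* Hypothetical PPU state used only by this sketch. */
static bool vblankFlag = false;
static long pixelsEmitted = 0;

static DotKind classifyDot(int dot) {
    if (dot == 0)   return DOT_IDLE;
    if (dot <= 256) return DOT_PIXEL;
    if (dot <= 320) return DOT_SPRITE_FETCH;
    if (dot <= 336) return DOT_PREFETCH;
    return DOT_DUMMY_FETCH;
}

void renderDot(int scanline, int dot) {
    if (scanline == -1) {
        /* Pre-render line: the vblank flag is cleared at dot 1. The same
           fetches as on a visible line happen here, but nothing is drawn. */
        if (dot == 1) vblankFlag = false;
    } else if (scanline < 240) {
        /* Visible scanlines 0..239. */
        switch (classifyDot(dot)) {
        case DOT_PIXEL:
            pixelsEmitted++;  /* stand-in for fetching and drawing one pixel */
            break;
        case DOT_SPRITE_FETCH:  /* load sprites for the next scanline */
        case DOT_PREFETCH:      /* load the next scanline's first two tiles */
        case DOT_IDLE:
        case DOT_DUMMY_FETCH:
            break;
        }
    } else if (scanline == 241 && dot == 1) {
        /* Start of vblank. An NMI would be raised here if it is enabled. */
        vblankFlag = true;
    }
    /* Scanline 240 and the rest of 241..260: the PPU does nothing. */
}
```

Each empty case is where one of the wiki's described behaviors would eventually go. Keeping the classification in its own function makes it easy to check the ranges against the documentation before any rendering code exists.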
Finally, do not optimize early. Modern CPUs are insanely fast. Write clean, readable code and your emulator will likely run perfectly with plenty of time to spare for each frame.