Simplest way I can explain:
Keep vars for a CPU timestamp and an APU timestamp. To keep things simple, let's say your CPU timestamp counts in CPU cycles (so if the timestamp is 0 and you run LDA #$xx, a 2-cycle instruction, your timestamp would then be 2).
On writes to APU registers, call a 'RunAPU()' function or something of the sort, which catches the APU up to the current CPU timestamp. Example: your APU timestamp is 10, your CPU timestamp is 100, and the game just wrote to $4000... so you'd run the APU for 90 cycles. Once the APU is caught up, apply the changes that the write performed.
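A rough sketch of that write path, to show the ordering (catch up first, apply the write second). Everything here other than RunAPU and the two timestamps is a made-up name for illustration: WriteAPURegister, apuRegs and nAPUCyclesRun are assumptions, and this RunAPU is only a stand-in that moves the timestamp without emulating anything.

```c
#include <stdint.h>

/* Hypothetical state; names are illustrative only. */
long nCPUTimestamp = 0;    /* advanced by the CPU core, in CPU cycles */
long nAPUTimestamp = 0;    /* how far the APU has been emulated so far */
long nAPUCyclesRun = 0;    /* total cycles "emulated", for demonstration */
uint8_t apuRegs[0x18];     /* raw register values for $4000-$4017 */

/* Stand-in for the real RunAPU(): only advances the APU timestamp. */
void RunAPU(void)
{
    long doticks = nCPUTimestamp - nAPUTimestamp;
    nAPUTimestamp = nCPUTimestamp;
    nAPUCyclesRun += doticks;  /* a real version would emulate the channels here */
}

/* Assumed entry point the CPU core calls for writes to $4000-$4017. */
void WriteAPURegister(uint16_t addr, uint8_t value)
{
    RunAPU();                        /* 1. catch up using the OLD register state */
    apuRegs[addr - 0x4000] = value;  /* 2. only then apply the write */
}
```

With nAPUTimestamp at 10 and nCPUTimestamp at 100, a call to WriteAPURegister(0x4000, ...) runs the APU for 90 cycles and leaves both timestamps at 100 before the new value lands in the register.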
In RunAPU, you'd emulate all your sound channels and actually produce your sound. Assuming you're using that simplest, low quality method I described earlier... RunAPU might look something like this:
Code:
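#include <math.h>   /* ceil() */

/* nCPUTimestamp, nAPUTimestamp, fTicksUntilNextSample and fTicksPerSample
   are globals declared elsewhere. */
void RunAPU()
{
    int minticks;
    int doticks;
    int soundout;

    /* how many CPU cycles the APU is behind */
    doticks = nCPUTimestamp - nAPUTimestamp;
    nAPUTimestamp = nCPUTimestamp;

    while(doticks > 0)
    {
        /* run up to the next sample point, or until we're caught up */
        minticks = (int)ceil( fTicksUntilNextSample );
        if(minticks > doticks)
            minticks = doticks;
        doticks -= minticks;

        ClockSquare1( minticks );

        fTicksUntilNextSample -= minticks;
        if( fTicksUntilNextSample <= 0 )
        {
            /* time to output a sample; keep the fractional remainder */
            fTicksUntilNextSample += fTicksPerSample;

            soundout = 0;
            soundout += GetSquare1Output();

            //clip at 8-bits
            if(soundout < -128) soundout = -128;
            if(soundout > 127)  soundout = 127;

            //convert to 8-bit unsigned (instead of signed)
            soundout ^= 0x80;

            OutputSample( soundout );
        }
    }
}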
fTicksPerSample would be that CPU_CLOCK / SAMPLERATE value (~40.58 @ 44100 Hz). fTicksUntilNextSample is a counter which, as the name implies, tracks how many cycles need to pass before you output another sample. For this example to work decently these would probably have to be floating point to avoid roundoff (you don't want to round off to 40 or 41, since that bends the pitch of the sound). There are ways to do this without any floating point vars (which would perform better)... but the concept is easiest to show this way.
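For what it's worth, one integer-only way to do the sample timing is an accumulator: add SAMPLE_RATE for every CPU cycle that passes, and output a sample each time the total reaches CPU_CLOCK. This is only a sketch of that idea, not necessarily the method alluded to above. The names nSampleAccum, TicksUntilNextSample and AdvanceSampleTimer are made up, and CPU_CLOCK is assumed to be the NTSC rate of roughly 1789773 Hz.

```c
#include <stdint.h>

#define CPU_CLOCK   1789773u   /* assumed NTSC CPU cycles per second */
#define SAMPLE_RATE 44100u     /* output samples per second */

/* Hypothetical accumulator; always stays in the range [0, CPU_CLOCK). */
uint32_t nSampleAccum = 0;

/* CPU cycles that must pass before the next sample is due (always >= 1). */
uint32_t TicksUntilNextSample(void)
{
    /* integer ceil( (CPU_CLOCK - accum) / SAMPLE_RATE ) */
    return (CPU_CLOCK - nSampleAccum + SAMPLE_RATE - 1) / SAMPLE_RATE;
}

/* Advance by 'ticks' CPU cycles, where ticks <= TicksUntilNextSample().
   Returns 1 if a sample should be output now, 0 otherwise. */
int AdvanceSampleTimer(uint32_t ticks)
{
    nSampleAccum += ticks * SAMPLE_RATE;
    if (nSampleAccum >= CPU_CLOCK)
    {
        nSampleAccum -= CPU_CLOCK;   /* keep the remainder so nothing drifts */
        return 1;
    }
    return 0;
}
```

In RunAPU these two would stand in for the ceil() call and the fTicksUntilNextSample bookkeeping. Since the remainder is carried over exactly, the gap between samples alternates between 40 and 41 cycles and averages out to CPU_CLOCK / SAMPLE_RATE.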
ClockSquare1() would be the function that does your emulation for Square 1. Like... clocking the Programmable Timer and updating the Duty Cycle and all that jazz. The actual output of the channel is represented here by GetSquare1Output()... which would return the output. OutputSample() would be where you'd buffer the generated sample (and send to waveOut or whatever).
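To give a feel for the shape of those two functions, here's a heavily simplified sketch. All the state names (nSq1Period, nSq1Duty, nSq1Volume, nSq1Timer, nSq1DutyStep, kDutyHighSteps) are assumptions for illustration. It assumes the duty cycle advances one of 8 steps every (period + 1) * 2 CPU cycles, uses a fixed volume, approximates the duty ratios without reproducing the hardware's exact step order, and ignores the sweep, length counter and decay entirely. The output scale is arbitrary.

```c
#include <stdint.h>

/* Hypothetical Square 1 state; names are illustrative only. */
uint16_t nSq1Period   = 253;  /* 11-bit wavelength from $4002/$4003 */
uint8_t  nSq1Duty     = 2;    /* duty select 0-3 from $4000 */
uint8_t  nSq1Volume   = 15;   /* fixed volume 0-15 (no decay unit here) */
int      nSq1Timer    = 508;  /* CPU cycles until the next duty step */
uint8_t  nSq1DutyStep = 0;    /* position 0-7 within the duty cycle */

/* High steps out of 8 per duty setting: 12.5%, 25%, 50%, 75%. */
const uint8_t kDutyHighSteps[4] = { 1, 2, 4, 6 };

/* Run the channel's timer for 'ticks' CPU cycles. */
void ClockSquare1(int ticks)
{
    int stepLen = (nSq1Period + 1) * 2;   /* CPU cycles per duty step */

    nSq1Timer -= ticks;
    while (nSq1Timer <= 0)
    {
        nSq1Timer += stepLen;                    /* reload, keeping any overshoot */
        nSq1DutyStep = (nSq1DutyStep + 1) & 7;   /* wrap after 8 steps */
    }
}

/* Current channel output as a signed value (arbitrary scale, max +/-60). */
int GetSquare1Output(void)
{
    int amp = nSq1Volume * 4;
    return (nSq1DutyStep < kDutyHighSteps[nSq1Duty]) ? amp : -amp;
}
```

With a period of 253, each duty step lasts 508 CPU cycles, so a full 8-step cycle takes 4064 cycles, which works out to roughly 440 Hz at the NTSC clock.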
Note this method isn't really optimized but it should provide you with a concept to help you get things working. Also note that this example leaves out the APU frame thingamajig (which clocks the Sweep Units, Length Counters, Decay Units, etc)... but I wouldn't worry about that stuff until after you get the main sound working.
Also... the ~40.58 cycle thing only applies if you're outputting at 44100 Hz. If you try this and the sound is still way off-key, double-check your samplerate and make sure it's 44100 Hz.