A thought on Frame IRQ latency

A thought on Frame IRQ latency
by Disch on 2006-06-25 (#14552)

Some confusing APU Frame IRQ behavior has been bugging me for a while. I've mentioned it several times in the past, and speculated on it a few times.

In blargg's "tests.txt" doc which comes with the APU test roms:

Quote:

Frame interrupt flag is set three times in a row 29831 clocks after
writing $4017 with $00.

[snip]

IRQ handler is invoked at minimum 29833 clocks after writing $00 to
$4017.

Note the 2 clock discrepency between when the IRQ flag raises and when the IRQ actually trips. I speculated that this might be due to a 2 cycle latency in IRQ firing. However games which use mapper IRQs (specifically ones which use CPU cycle based counters, such as VRC mappers) borked up when applying latency to all IRQs. Then I thought that the latency only existed for APU frame IRQs -- but that didn't really sit right with me (doesn't really make sense....)

Now I'm thinking something else. Note in above quote: "Frame interrupt flag is set three times in a row". While this in itself is nothing new -- it hit me that the LAST time the flag is set is interesting:

29831 - Flag first set
29832 - Flag set again
29833 - Flag set for the last time, IRQ

The significance of this? Well of course this is all just more speculation on my part, but now I'm thinking that the IRQ flag is raised before the /IRQ line is asserted*. Now I'm not saying there's latency -- latency would imply that the IRQ is being requested -- but it just doesn't happen for 2 cycles. I'm thinking that the IRQ isn't even being requested until that last cycle.

Maybe the APU is too busy in this time frame to request an IRQ until it's done with everything else? When you think about it -- on this step in the frame sequencer it's supposedly clocking Decay/Linear + Length/Sweep + IRQ (3 seperate areas taking 3 seperate cycles? coincidence?).

It could be that instead of all these things occuring on one cycle, they occur across a group of cycles. The IRQ flag being set 3 times is a possible indication of this. This might also explain a quirk blargg briefly mentions in the readme:

Quote:

Not documented here is a delay when changing modes by writing to $4017.
This is quite complex and I haven't fully worked out its exact
operation. Once determined, documented, and tested, the information here
should still be valid. This delay when changing modes involves the
current mode running a few clocks before switching to the new mode, so
it only affects the rare case where $4017 is written within a few clocks
of a frame counter step. This delay does not cause the steps to occur
any later than shown below; it only causes the first few clocks of the
new mode to be transparent, allowing the previous mode to "show
through".

Operations might "show through" if the APU is still in the process of performing them -- if they can't be done as quickly as on one cycle. But again this is just more speculation.

That's about all. Just thought I'd post this here because it was something on my mind -- maybe it'll be helpful/interesting to someone out there.

* (note my understanding of hardware is zero -- so my reference to /IRQ line is likely off and maybe even nonsensical)

by hap on 2006-06-26 (#14588)

That's an interesting thought. I've always assumed that the APU frame IRQ flag, and the _IRQ line from the APU frame sequencer are basically connected from the same 'wire'. So I set that _IRQ line and flag at the same time 3 times in a row.

Implementing your idea here makes blargg's 08.irq_timing.nes test fail with #3.

*edit* I've said before, and it's still the case, I haven't implemented a delay.

by Disch on 2006-06-26 (#14591)

hap wrote:
*edit* I've said before, and it's still the case, I haven't implemented a delay.

I still don't get how you're pulling that off. If I don't put in a delay either the 07.irq_flag_timing test or 08.irq_timing test fails. For me anyway.

*shrugs*

by hap on 2006-06-27 (#14654)

My CPU loop (skeleton underneath) is pretty standard:
- get opcode, decrement cpu cycles

- update dmc, framesequencer, and mapper irq

- run opcode
PPU, APU, mapper hardware get updated here if they are written to.

- run interrupt
If any irq was first set at the last cycle of the current opcode, the interrupt handler will instead be called after the next opcode.

by Disch on 2006-06-27 (#14659)

The way you layed that out reminds me of something I was doing before...

When I was messing with frame IRQs, before I implimented a prediction routine, I was updating the APU between every CPU instruction. Therefore I ended doing the following (in this order):

1) Check for pending IRQs
2) Update APU
3) Run CPU for 1 instruction (adjust cpu cyc)

This worked out to where I passed blargg's test without implimenting any latency -- but then it hit me that I actually was unintentionally implimenting latency of 1 instruction. Since, if the APU was run and it wanted an IRQ, it would not happen until after the CPU runs for one more instruction. Once I switched steps 1 and 2, blargg's test complained again and I had to put in the 2 cycle delay.

It doesn't look like that's what you're doing though -- since you say you adjust your timestamp before updating the frame sequencer.

Maybe this is my problem:
Quote:
If any irq was first set at the last cycle of the current opcode, the interrupt handler will instead be called after the next opcode.

That sort of puts in some latency -- but only 1 cycle.

Gah who knows.

by byuu on 2006-06-27 (#14694)

I'll try and share some more SNES experience, since once again -- things seem very similar between the two.

First, tests for IRQ occur at the "start" of the last opcode cycle. This is actually a hackaround to simulate the two-stage pipeline of the CPU. The CPU *has* to test one cycle early because the bus is always one cycle ahead of the work cycle. So if you don't test one cycle early, you will have already started the opcode fetch cycle before you can trigger your IRQ. The bus opcode fetch actually still occurs! This is why the first cycle of an IRQ is a memory read from PC instead of an I/O cycle. However, PC isn't incremented. The CPU kind of hijacks the processor into the interrupt routine instead. When you return, you perform that opcode fetch cycle again for obvious reasons. Or actually, that may be why rti doesn't increment PC when it reads it from the stack like rts does. But don't quote me on that.

Next, with the SNES the smallest clock step is two. Pretend each reference to two clocks is one on the NES and you have the same behavior. So the IRQ acknowledge bit ($4211.d7) is set at say, V=225,H=12 as an example. Then, the /IRQ line does not go high until V=225,H=16. The irq read bit actually stays high for the first two possible read locations (V=225,H=12 or 14). If you read during this time, then you can read $4211 and see the IRQ acknowledge bit set twice.

Anyway, the reason is not that this bit is being set three times in a row, at H=12,14 and 16. It is because the bit is being asserted via bus hold delay. That's why your read has no effect until the bus hold delay has elapsed. On the SNES, an access to $4211 takes exactly six clock cycles.

Now, why does IRQ read get set four cycles before /IRQ goes low? Easy. You're reading $4211, whereas /IRQ is being written to. Reads for fast-memory access areas are acknowledged 2 clocks into the six-clock cycle, and writes are not completed until 6 clocks into the cycle. 6-2=4, there you go. The SNES is holding a write to $4211 internally for clock 2 and 4, so your read isn't doing you any good. Once you read at 6 or beyond, you can actually clear the bit since the write to that register has stopped.

The last anomaly is pairs like cli : rti , plp : cli, and all of those combos. I forget exactly what they are, but the basic idea is that you can affect whether the actual IRQ will trigger by updating i in the last cycle of the opcode. The test still occurs at the start of last cycle, but the new state of I is considered when on the first cycle of the next opcode before actually triggering an IRQ. Again, makes sense if you sit down and think about how the pipelining has a one cycle delay between bus cycles and work cycles.

Anyway, you basically need to understand pipelining, and bus hold delays to understand the quirks with IRQ. Think of pipelining as splitting the CPU cycles in half to actual reads/writes and work against the values read/written. The work cycles are always one cycle behind the bus cycles for obvious reasons of the data not being ready at the start of the work cycle. Think of bus hold delays as breaking down your CPU from cycles into raw clocks. Lucky for you, you can "hack" your way around this and not have to implement either one. You just get fun little cases like you're describing where you don't understand it because you're thinking at too high a level.

All of these notes also apply to NMI (if the NES even has that), but the exact behavior is a little different. Thankfully, not by much.

by Disch on 2006-06-27 (#14700)

Awesome... that actually makes a lot of sense, though I can't say I understand it 100% (a lot of this is new info for me -- I think that's the first time I've ever heard of a bus hold delay ;P)

But conceptually I think I get what you're saying.

As you hinted... a similar effect does happen with NMIs on the NES, as well. There's a way to sort of manually trigger an NMI (by changing $2000.7 0->1 when $2002.7 is high), but the NMI doesn't occur until an instruction after it's triggered. Which is probably because the write to $2000 isn't completed until after the next opcode fetch?

Anyway, thanks a million byuu =)

by byuu on 2006-06-27 (#14704)

Well, essentially think of bus hold delays like this... your one CPU cycle that performs a read or write takes what, 6 clocks or so to complete? Obviously, the read isn't going to happen instantaneously at the very start of that 6 clock cycle period. They happen somewhere in the middle for reads, and at the end for writes. Luckily, the SNES has a latching mechanism to get the H/V screen counter positions within 4 clocks of accuracy. Using a clever trick (use a 14-cycle opcode to toggle bit 2 of the counter) I was able to double that to 2 clocks of accuracy, basically the smallest you can step by anyway. So then I was able to write a routine that gets to V=0,H=0 every time you call it, making it trivial to carefully plan out exactly what clock you read from the IRQ status register, etc to find out all of these little tricks.
Anyway, back on topic: $2137 reads latch this counter, as well as $4201 writes. But a fun anomaly was that the reads would latch the counter at 4 clock cycles lower than $4201 writes. They both took 6 clock cycles to complete, so obviously the reads and writes weren't happening instantly. The WDC 65c816 manual basically confirms my timings (it lists the bus hold delays in microseconds that TRAC and I "estimated" against to match my findings). We then extended them to the reads/writes that take longer than 6 clocks. My results were that reads held for all but the last four clocks of the cycle, and writes held the entire cycle.
But to be honest, I can't say with 100% certainty whether the read and write hold delays and acknowledgements aren't longer for different components (eg do accesses take longer to travel between CPU<>APU than CPU<>PPU?) without an oscilliscope and hardware knowledge I do not have. But at least I'm mostly certain I'm correct about the timings for reads and writes to fast memory registers. Hope that helps.

Yes, you can manually trigger NMIs since they are edge-sensitive. If I recall correctly, an NMI fires whenever /NMI goes from high to low. As long as you don't read the NMI status register, then you can keep writing to the NMI enable bit and toggling enable between 1 and 0 and force it to trigger whenever you want. You have to let the NMI status bit get set first though for this to work.

IRQs are even easier. Level-sensitive. So just don't read the IRQ status register and IRQs will fire whenever the I flag is cleared. Set the first opcode in your IRQ handler to cli and crash the system, heh.

And then lastly, the NMI status flag gets cleared at the start of a new frame, unlike the IRQ status flag. Hence you can only keep triggering NMIs after V=225 (or V=240 with overscan enabled). I imagine you can probably leave overscan off, go to V>=225&&V<240, then enable overscan, then trigger NMIs. Why you'd do this, I don't know.

I really need to document NMI and IRQ this time around. I figured them all out and wrote about 40 hardware tests for each that require perfect timing to pass. So I have them almost completely figured out (excluding conflicts such as DMA overlapping NMI), but since I never documented them, my memory on them is rusty and I have to relearn the stuff to rewrite my CPU core with this cooperative threading thing I'm doing. Oh well.

by mattmatteh on 2006-06-28 (#14730)

byuu,

you have test for the nes or snes or both ? any test that pass on hardware would be great. blarggs ( and others ) tests have been a great help.

matt

by byuu on 2006-06-28 (#14732)

I have SNES tests. They just run and show a blue screen on passing, red screen on failure. Test # that failed is stored in SRAM[0]. Not as nice as most tests, but it wasn't meant to be public. I'll post them between now and three weeks from now, if I remember.

by Disch on 2006-07-05 (#15073)

An update:

I have since implimented a simulated version of the behavior byuu describes for all IRQs, and set the frame IRQ on the first cycle only in that set of 3 (rather than doing it on all 3 cycles) -- and Frame IRQs, MMC3 IRQs, VRC IRQs, and all other mapper IRQs I've tried so far all work great without any timing adjustments or hacks! There are still some issues with a handful of extraordinarily nitpicky MMC3 games (*cough*Days of Thunder*cough*), but overall I am very satisfied. This is the smoothest I've had things running so far.

Thanks again for the input, byuu. It's much appreciated.

by hap on 2006-07-08 (#15210)

You mean the fact that tests for IRQ occur at the start of the last opcode cycle? (hence my 1 cycle 'latency' if an IRQ happened there). Or have you simulated bus hold delays?

Setting the frame IRQ on the 1st cycle instead of all 3 here makes blargg's test pass too, though I still need to set the readable IRQ flag 3 times in a row.

MMC3: any success with the games mentioned in this thread? (most notably Juuouki, Mickey's Safari in Letterland)

by Disch on 2006-07-08 (#15221)

hap wrote:
You mean the fact that tests for IRQ occur at the start of the last opcode cycle? (hence my 1 cycle 'latency' if an IRQ happened there). Or have you simulated bus hold delays?

Both.

Quote:
MMC3: any success with the games mentioned in this thread? (most notably Juuouki, Mickey's Safari in Letterland)

Unfortunately, no. Those were the "extraordinarily nitpicky" games I was referring to ;D

I have a hard time seeing how Mickey's Safari in Letterland could be an MMC3 issue though -- I mean screen shaking I can understand, but why would it split the status bar in two? Does it put each half in a seperate nametable or something to get more vertical scrolling space?

by hap on 2006-07-08 (#15227)

(kind of trailing off your original topic, but your problem was solved anyway)

Yes, I don't know the reason, but it does put each half in a separate nametable. The IRQ over there seems to happen at either scanline 206 (looking OK), or 205 (looking squashed), always at 5 PPU cycles after hblank.

by byuu on 2006-07-11 (#15333)

Disch wrote:
Thanks again for the input, byuu. It's much appreciated.

Glad to help. There's definitely a lot of parallels between the NES and SNES, so it makes sense to share our knowledge. And yet obviously we need to test the differences very closely on both hardware units and not assume anything directly translates from one system to another. I'm actually kind of hesitant at giving advice to NES devs who lack copiers to test my ideas. Even if they work, that doesn't mean they're right.

Anyway, if you're still having trouble in some games, there may be NES differences, or you may not have what I was describing down pat just yet. My explanations are honestly rather lousy, and it took me several weeks to RE and implement what I described properly.