I don't really understand why all the documentation says that DMA takes 160 microseconds,
and that the default wait loop waits about that long.
But if I'm not mistaken, it's doing 40*16 cycles: 4 for decrementing A and 12 for the JR NZ.
And that's 640 cycles, which is not 160 microseconds; that would be about 671 cycles, right?
Am I missing something?
In fact it's 39 * 16 + 36 = 660 cycles in HRAM, isn't it? Because the final JR isn't taken (-4), but there's the LD A (+8) and a RET (+16). That still comes in under 160 µs though.
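For reference, here's that arithmetic spelled out as a quick sanity check (the per-instruction T-cycle costs are the standard SM83 timings for the canonical wait routine; the constant names are just for illustration):

```python
# T-cycle costs of the canonical OAM DMA wait routine run from HRAM:
#   ld a, 40       ; 8 T-cycles
# .wait:
#   dec a          ; 4 T-cycles
#   jr nz, .wait   ; 12 T-cycles taken, 8 when it falls through
#   ret            ; 16 T-cycles
LD_A, DEC_A, JR_TAKEN, JR_NOT_TAKEN, RET = 8, 4, 12, 8, 16

full_iterations = 39 * (DEC_A + JR_TAKEN)   # 39 * 16 = 624
final_pass = DEC_A + JR_NOT_TAKEN           # last JR falls through
total = LD_A + full_iterations + final_pass + RET
print(total)  # 660 T-cycles
```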
So what you're saying makes sense: the Pan Docs seem to miscount the cycles for the JR as justification, while the official docs use the same code and give the correct cycle count, glossing over the fact that the routine wouldn't take 160 µs.
But at any rate, that code does seem to work, and it would fail hard if the delay weren't long enough: the CPU would read garbage once you RET'd back to ROM, and the program would almost certainly crash. I've used that code myself on real hardware and it's worked fine. Since it's the official Nintendo code, and since Pan & co. probably got it by disassembling commercial ROMs, I assume real games use it too, and they obviously work.
So if I had to guess, the 160 µs figure that Nintendo gave (and that Pan et al. copied?) is imprecise: it probably takes 160 * 4 clock cycles rather than 160 µs.
Maybe looking through Gambatte's source code would reveal more? I tried it myself but it got a little thorny.
Or if you want to be sure, you could write a test that lets you adjust the wait time and reduce it until the CPU crashes, to find the correct cut-off.
You are confusing clock cycles and instruction cycles. The master clock runs at ~4 MHz (Actually 4 MiHz, or 4*1024*1024 Hz to be precise, but that's not really important for the argument.) However, instructions always take a multiple of 4 clock cycles, so instruction timing is often counted in machine cycles, where one nop is said to take 1 machine cycle and so on. 1 machine cycle takes ~1 us, whereas 1 clock cycle obviously takes ~0.25 us.
In short, you have to divide your 640 figure by 4, which indeed gives 160 us.
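A minimal sketch of that conversion, assuming the nominal 1 us per machine cycle:

```python
T_CLOCK_HZ = 4 * 1024 * 1024    # 4 MiHz master clock, ~4.194 MHz
t_cycles = 640
m_cycles = t_cycles // 4        # one machine cycle = 4 clock cycles
us_nominal = m_cycles           # ~1 us per machine cycle at a nominal 4 MHz
us_actual = t_cycles / T_CLOCK_HZ * 1e6
print(m_cycles, us_nominal, round(us_actual, 1))  # 160 160 152.6
```

At the real clock rate the routine actually finishes a bit under the quoted 160 us.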
nitro2k01 wrote:
zerowalker wrote:
And that's 640 cycles, which is not 160 microseconds; that would be about 671 cycles, right?
The master clock runs at ~4 MHz (Actually 4 MiHz, or 4*1024*1024 Hz to be precise, but that's not really important for the argument.)
I saw that as exactly the argument, since 640 * 1.024 * 1.024 ≈ 671. So I'm inclined to believe adam_smasher's explanation of imprecision between the 4 MHz nominal and 4.194 MHz actual clock rates, where a "micro" is 10^9/2^20 ≈ 954 ns.
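Both numbers fall straight out of the clock-rate mismatch; a quick check (plain arithmetic, nothing Game-Boy-specific):

```python
actual_hz = 4 * 1024 * 1024              # 4,194,304 Hz real clock
cycles_in_160us = 160e-6 * actual_hz     # 160 "true" microseconds, in T-cycles
ns_per_rounded_us = 1e9 / 2**20          # one "microsecond" if you call 4 MiHz "4 MHz"
print(round(cycles_in_160us), round(ns_per_rounded_us))  # 671 954
```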
Yes, I misread the post. And yes, Martin seems to round 1 instruction cycle to 1 us.
Wait, this got a bit confusing hehe.
I always count in clock cycles (not instruction cycles), so 4 is the minimum.
So is 640 cycles correct then?
Cause that's how long mine takes until it's out of the wait loop.
It's really hard when the documentation is so fluffy at times; I usually read all over the place and try to figure out which one is correct xd.
Not a fan of checking other people's source code, as I would like to learn to write the code myself and how the system works, though I do at times.
It's even more confusing when you consider the fact that the total duration is actually 161 machine cycles if you count from the DMA register write
(at least on DMG/SGB/MGB/SGB2).
When OAM DMA is started, there is one machine cycle delay before the actual transfer starts. So, let's say you start OAM DMA with the LDH ($46), a instruction. These are the machine cycles that happen:
Code:
------------------------------------------------------------------
-3: opcode read and decoding of LDH (n), a (= $E0 is read)
-2: memory read of the DMA register address (= $46 is read)
-1: memory write to the DMA register (= value of A is written)
------------------------------------------------------------------
0: CPU continues to the next instruction. OAM DMA has not really started yet so the OAM area is still accessible during this one cycle.
1: first cycle of OAM DMA. OAM area is inaccessible
2: second cycle of OAM DMA. OAM area is inaccessible
...
...
160: 160th cycle of OAM DMA. OAM area is inaccessible
------------------------------------------------------------------
161: OAM DMA is no longer running so the OAM area is now accessible
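The table above can be condensed into a tiny predicate (an illustrative sketch of the DMG/SGB behavior, using the same cycle numbering as the table; not a hardware-accurate model):

```python
def oam_accessible(cycle):
    # cycle 0 = the machine cycle right after the write to $FF46.
    # One delay cycle, then 160 transfer cycles during which OAM is locked.
    return not (1 <= cycle <= 160)

print(oam_accessible(0))    # True: delay cycle, OAM still readable
print(oam_accessible(1))    # False: transfer has started
print(oam_accessible(160))  # False: last transfer cycle
print(oam_accessible(161))  # True: DMA finished
```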
Is -3 to 0 one instruction?
Aren't the cycles bundled together within the entire instruction?
Not sure if this is what you're asking, but -3 to -1 inclusive are LDH [$46], A.
They're not (usually) externally visible, but internal state changes happen during each constituent cycle of the instruction - the CPU isn't just sitting around waiting N cycles and then instantly executing the instruction.
Yeah, that's what I am asking.
And that part I get, but I was wondering about the timer and interrupts.
Those are updated on a per-instruction basis, right?
At least the effects they have?
I just chose the numbers to reflect the relative machine cycle position compared to the "first OAM DMA cycle".
Here's a real hardware trace of a case almost identical to the one posted earlier:
The CPU is running these instructions:
Code:
$0150: LD A, $40
$0151: LDH ($46), A
$0154: NOP
You can see the OAM DMA accessing $4000 and $4001 at the end... note the one-cycle delay, during which the CPU executes a NOP in this example.
(if you're curious about what the CPU executes after the NOP, the answer is that the last two machine cycles actually involve an OAM DMA conflict in this case, but that's a story for another day...)
So for the first instruction, the CPU actually reads (FF) in 1 cycle, then it reads the remaining in the other 3 cycles?
After that it does the FF thing again, then does the load for 3 cycles?
How do TIMA and interrupts work in these cases? Aren't those essentially pseudo-async to the CPU?
I mean, the CPU must somehow check the data there, but if it does things internally like this, when does that check occur: every cycle, or after every complete instruction?
Quote:
And that part I get, but I was wondering about the timer and interrupts.
Those are updated on a per-instruction basis, right?
Timer (TIMA) works at M-cycle granularity (not at instruction granularity!).
The real interrupt sources in the system work at various granularities all the way down to half T-cycles. However, the CPU checks interrupts only at the start of instructions so the CPU never dispatches to an interrupt handler in the middle of an instruction.
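A toy model of that ordering might look like this (names and structure are made up purely for illustration; real hardware obviously doesn't run a Python loop):

```python
# Sketch: interrupt requests can be raised mid-instruction (e.g. by TIMA),
# but the CPU only samples them at instruction boundaries.
class Cpu:
    def __init__(self):
        self.ime = True        # master interrupt enable
        self.pending = False   # some interrupt source has fired
        self.pc = 0x0150
        self.dispatched = False

def tick_timer(cpu):
    # Stand-in for TIMA overflowing during one of the instruction's M-cycles.
    cpu.pending = True

def step(cpu):
    # Interrupts are checked only HERE, at the start of an instruction.
    if cpu.ime and cpu.pending:
        cpu.dispatched = True
        cpu.pc = 0x0050        # e.g. the timer interrupt vector
        return
    tick_timer(cpu)            # the timer ticks during the instruction's cycles...
    cpu.pc += 1                # ...but the instruction still completes normally

cpu = Cpu()
step(cpu)   # interrupt fires mid-instruction; not dispatched yet
step(cpu)   # next boundary: now it dispatches
print(cpu.dispatched, hex(cpu.pc))  # True 0x50
```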
zerowalker wrote:
So the first instruction, the CPU actually reads (FF) in 1 cycle, then it reads the remaning in the other 3 cycles.
After that it does the FF thing again, then do the load for 3 cycles?
I think you're mixing T-cycles and M-cycles. In this screenshot CLK = T-cycles, PHI = M-cycles. None of the instructions here contain the byte $FF.
During M-cycle labeled -5 the CPU reads $3E. During M-cycle -4 the CPU reads $40.
Quote:
Timer (TIMA) works at M-cycle granularity (not at instruction granularity!).
With that you mean it has its own clock, right?
So, say that in the next 8 cycles TIMA will overflow and trigger an interrupt,
and the CPU's next instruction happens to take 16 cycles.
It would first check the state before doing the instruction.
Then do it (+16 cycles).
Then repeat, and this time an interrupt occurs; it then performs a jump to the interrupt vector (if enabled), which takes 20 cycles.
Then it does its stuff, then it gets back to the instruction that came after those 16 cycles before?
//EDIT:
My bad, I read the "FF" on the image, below 80/81.
But wait, T-cycles are the "real" cycles, right?
And an M-cycle is 4 T-cycles, because everything is divisible by 4, right?
Is there any place to get those hardware traces?
It seems quite rare to find what the CPU actually does every clock cycle (or even every machine cycle).
How come some instructions need decoding that takes 4 cycles?
For CB I get it: I assume it takes 4 cycles to read the next 8 bits, and then 4 cycles to execute it (if it operates on a register only).
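If that guess is right, the timing works out like this (these are the standard published SM83 cycle counts; the helper function itself is just an illustration):

```python
M = 4  # T-cycles per machine cycle

def cb_t_cycles(hl_operand=False, bit_op=False):
    cycles = 2                 # 1 M-cycle to fetch $CB + 1 to fetch/execute the CB opcode
    if hl_operand:
        cycles += 1            # extra read of (HL)
        if not bit_op:
            cycles += 1        # write-back for rotates/shifts/RES/SET
    return cycles * M

print(cb_t_cycles())                    # 8:  e.g. RLC B
print(cb_t_cycles(True, bit_op=True))   # 12: BIT n, (HL) reads but never writes
print(cb_t_cycles(True))                # 16: e.g. RLC (HL)
```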