In this post, Fx3 wondered why a CPU test ROM was reporting a +1 or -1 cycle error.
Fx3 wrote:
Too bad here, unfortunately. What I call a "CPU clock" is one PPU access, which renders 3 pixels. My CPU core is simple with regard to the instruction set, i.e. how each opcode is emulated. For some obscure reason, your test ROM is giving a +1 cycle error for opcodes $01 and $04 (right now). Opcode $04 is odd... if I take out 1 PPU access, it displays -1 cycle; otherwise, +1 cycle. Go figure...
Any help?
Every 6502 cycle has a top half (clock = HI) and a bottom half (clock = LO). A read or write may take effect only at one half.
What do you mean? Could you give me an example of this?
Something can happen at the rising edge of the clock and/or at the falling edge of the clock. For a D flip-flop triggered by a rising edge, you can invert the clock to trigger it on the falling edge.
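In emulator terms, an "edge" is just a change between two successive samples of the clock line. A minimal sketch, assuming the clock is sampled at fixed steps; `detect_edge` and `DFlipFlop` are hypothetical names, not taken from any emulator. Feeding the same rising-edge flip-flop an inverted clock makes it capture on the falling edge instead.

```python
from typing import Optional


def detect_edge(prev: int, cur: int) -> Optional[str]:
    """Classify the transition between two successive samples of a clock line."""
    if not prev and cur:
        return "rising"
    if prev and not cur:
        return "falling"
    return None


class DFlipFlop:
    """Toy edge-triggered D flip-flop: q captures d on a rising edge of its clock input."""

    def __init__(self, initial_clk: int) -> None:
        self.q = 0
        self.prev_clk = initial_clk

    def step(self, clk: int, d: int) -> None:
        if detect_edge(self.prev_clk, clk) == "rising":
            self.q = d
        self.prev_clk = clk


clk = [0, 0, 1, 1, 0, 0, 1, 1]  # sampled clock waveform
d = [1, 1, 1, 0, 1, 0, 0, 1]    # data input at each sample

on_rise = DFlipFlop(initial_clk=clk[0])
on_fall = DFlipFlop(initial_clk=1 - clk[0])  # same flip-flop, fed the inverted clock
rise_log, fall_log = [], []
for c, x in zip(clk, d):
    on_rise.step(c, x)
    on_fall.step(1 - c, x)  # inverting the clock moves the capture to the falling edge
    rise_log.append(on_rise.q)
    fall_log.append(on_fall.q)

print(rise_log)  # [0, 0, 1, 1, 1, 1, 0, 0]
print(fall_log)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The two flip-flops see identical data but latch it half a clock apart, which is the same kind of offset the rest of this thread is about.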
No.
I mean in emulation terms. If an instruction takes 1 cycle to read the opcode, 1 cycle to read the argument (next byte, address), 1 cycle to read from address XXh and 1 cycle to do the operation... what am I missing here? How should I think about the rising/falling edges of CPU cycles? Looks like that weird MMC3 IRQ clocking!
The read or write doesn't happen for the whole cycle. Only half of it will have the read or write asserted, and changes only happen on the rising or falling edge.
matt
No++.
I need an example. I don't understand what the rising/falling edges of CPU cycles mean.
The CPU clock signal is roughly a square wave:
Code:
high clock state       rising edge
     vvvvv                  v
     _____       _____       _____       _____
    |     |     |     |     |     |     |     |
____|     |_____|     |_____|     |_____|     |_ ...
           ^^^^^                  ^
      low clock state       falling edge
Each cycle consists of a rising edge, a high state, a falling edge, and a low state. Different things can be defined to happen on the rising or falling edge of the clock signal. For example, when performing a memory read, a CPU may put an address on the address bus on a rising edge and then read the data bus on the next falling edge.
Now if you have two processors running at different speeds:
Code:
CPU
      _____       _____       _____       _____
     |     |     |     |     |     |     |     |
_____|     |_____|     |_____|     |_____|     |_ ...
PPU
  _   _   _   _   _   _   _   _   _   _   _   _
 | | | | | | | | | | | | | | | | | | | | | | | |
_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_ ...
Then the rising and falling edges of a single CPU cycle may occur a PPU cycle or two apart, which may cause a read or write to appear to be delayed by one or two PPU cycles (a fraction of a CPU cycle).
Awesome. ^_^;;
By the way, in terms of emulating a particular instruction (like LDA $00), how would a rising/falling edge be detected?
LDA $00 has six edges, more or less like the following:
- a rise and fall for reading opcode LDA,
- a rise and fall for reading address $00,
- a rise while address $0000 is put on the address bus, and
- a fall while the value is read from the data bus.
Each rise or fall may affect other hardware connected to the CPU bus, such as the PPU registers.
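Following the six-edge breakdown above, here is a minimal sketch that models each CPU cycle of LDA $00 as a rising edge (address driven) followed by a falling edge (data latched). It assumes the instruction sits at a hypothetical address $8000 and uses $A5, the 6502 opcode for LDA zero page; `Bus`, `read_cycle`, and `lda_zero_page` are illustrative names, not from any emulator.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bus:
    """Toy CPU bus that records both clock edges of every access (hypothetical model)."""
    mem: Dict[int, int]
    log: List[str] = field(default_factory=list)

    def read_cycle(self, addr: int, what: str) -> int:
        # Rising edge: the CPU puts the address on the address bus.
        self.log.append(f"rise: ${addr:04X} on address bus ({what})")
        # Falling edge: the CPU latches the value on the data bus.
        value = self.mem.get(addr, 0)
        self.log.append(f"fall: ${value:02X} read from data bus")
        return value


def lda_zero_page(bus: Bus, pc: int) -> int:
    """LDA zp: three read cycles, i.e. six edges."""
    bus.read_cycle(pc, "opcode LDA")
    zp_addr = bus.read_cycle(pc + 1, "operand")
    return bus.read_cycle(zp_addr, "data")


bus = Bus(mem={0x8000: 0xA5, 0x8001: 0x00, 0x0000: 0x42})  # LDA $00 at $8000, RAM[$00] = $42
a = lda_zero_page(bus, 0x8000)
print("\n".join(bus.log))
print(f"A = ${a:02X}")
```

Other hardware on the bus (the PPU registers, for example) could be hooked in between the two log entries of each cycle rather than once per whole cycle.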
Hmm... interesting. By the way, as far as I know, there are no public docs that cover this rising/falling thing, only "numeric" CPU cycles as units. Plus, it's tricky to emulate something like 1 cycle being broken into 2 steps (rise/fall). Anyway, it makes sense, since it could explain the unexpected errors I detected in most of blargg's other tests.
LDA $xx takes 1 cycle to fetch the opcode, 1 cycle to fetch the operand byte and 1 cycle to read from RAM[$xx]. By the way, each cycle takes 2 'steps'. So I must write a function that runs a SINGLE PPU access to correct the problem. As it stands, mine runs 3 PPU cycles per CPU cycle in one go, and that seems to be incorrect... -_-;; That's it. Thanks for the help.
Fx3 wrote:
By the way, as far as I know, there are no public docs that cover this rising/falling thing
The timing diagrams in the 6502 data sheet should cover it nicely.
I thought the 6502 also used a two-phase clock like this, where actions occur on the rising edge of each of the phases:
Code:
  ___     ___     ___
_|   |___|   |___|   |___
      ___     ___     ___
_____|   |___|   |___|   |___
But perhaps this is just another way of describing a single clock at double the rate with actions occurring on the rising and falling edges.
I still don't see how this would help explain Fx3's problem with instruction timing.
blargg wrote:
I still don't see how this would help explain Fx3's problem with instruction timing.
Just forgive me. ^_^;; Ah yes, thanks for the test ROMs, they're awesome, no joke.
blargg wrote:
I still don't see how this would help explain Fx3's problem with instruction timing.
Any test ROM that measures the timing of an instruction by comparing it with the length of a PPU frame will have different behavior if the NMI comes a half-cycle early or late, especially if that triggers the bug where a PPUSTATUS read cancels NMI.
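To make the PPUSTATUS/NMI race concrete, here is a minimal sketch. The dot offsets follow the behavior commonly described on the NESdev wiki (a read one dot before the vblank flag is set sees it clear, and the flag and NMI never happen that frame; a read on the same dot or one after sees it set but suppresses the NMI); treat them as assumptions to check against hardware tests. `vbl_read_outcome` and `VBL_DOT` are hypothetical names.

```python
from typing import Tuple


def vbl_read_outcome(read_dot: int, vbl_set_dot: int) -> Tuple[bool, bool]:
    """Return (flag_seen_by_read, nmi_fires) for a $2002 read near the vblank flag set.

    ASSUMED offsets, relative to the dot where the flag would be set:
      -1     -> read sees the flag clear; flag and NMI never happen this frame
      0, +1  -> read sees the flag set, clears it, and the NMI is suppressed
      other  -> normal behavior (NMI assumed enabled in PPUCTRL)
    """
    delta = read_dot - vbl_set_dot
    if delta == -1:
        return (False, False)
    if delta in (0, 1):
        return (True, False)
    if delta < -1:
        return (False, True)  # read well before: flag still clear, NMI fires later
    return (True, True)       # read well after: flag already set, NMI already fired


VBL_DOT = 1000  # hypothetical dot index at which the vblank flag gets set
for shift in range(-3, 4):
    print(shift, vbl_read_outcome(VBL_DOT + shift, VBL_DOT))
```

A shift of only one or two dots, less than a single CPU cycle, moves a read between these cases, so a half-cycle error in when a read lands can flip whether the NMI arrives at all.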
My CPU timing test has large timing margins, since it uses NMI to time thousands of executions of the instruction, not just one. It further allows an error of up to +/- 6 iterations of the loop compared to the reference values. For instructions whose execution times differ by one clock, the iteration counts differ by at least 200.
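The size of that margin can be estimated with a simple model, assuming the timed window is W CPU cycles and one loop iteration takes c cycles (both symbols are illustrative, not taken from the test's source). Adding one cycle to the instruction under test changes the iteration count by roughly:

```latex
N(c) = \left\lfloor \frac{W}{c} \right\rfloor,
\qquad
N(c) - N(c+1) \approx \frac{W}{c} - \frac{W}{c+1} = \frac{W}{c\,(c+1)}
```

For example, with a hypothetical W = 1,000,000 and c = 50, the difference is about 392 iterations, far outside a +/- 6 tolerance, so a half-cycle shift in bus timing should not change this kind of result.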
Other timing tests, which time events down to 1 PPU clock, first synchronize the CPU clock with the PPU clock so that the error is at most 3/4 PPU clock. The PPU clock is master / 4 and the CPU clock is master / 12, so there are four possible fixed alignments at power-up, depending on the random state of the dividers (P = PPU, C = CPU, one character = one ~21.5 MHz master clock):
Code:
P---P---P---P---P---P---
C-----------C-----------
-C-----------C----------
--C-----------C---------
---C-----------C--------
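The four alignments in the diagram follow directly from the two dividers. A minimal sketch that redraws them and counts the distinct cases, assuming the NTSC ratios stated above (PPU = master / 4, CPU = master / 12); the function names are illustrative.

```python
MASTER_PER_PPU = 4   # one PPU dot = 4 master clocks
MASTER_PER_CPU = 12  # one CPU cycle = 12 master clocks


def alignment_rows(width: int = 24) -> list:
    """Redraw the diagram: the PPU dot grid plus each distinct CPU phase offset."""
    rows = ["".join("P" if t % MASTER_PER_PPU == 0 else "-" for t in range(width))]
    for offset in range(MASTER_PER_PPU):
        rows.append("".join(
            "C" if (t - offset) % MASTER_PER_CPU == 0 else "-" for t in range(width)))
    return rows


def distinct_alignments() -> int:
    """Whatever state the CPU divider starts in, only its offset within one
    PPU dot (modulo 4 master clocks) changes how it lines up with the PPU."""
    return len({start % MASTER_PER_PPU for start in range(MASTER_PER_CPU)})


for row in alignment_rows():
    print(row)
print(distinct_alignments(), "alignments; worst-case offset =",
      (MASTER_PER_PPU - 1) / MASTER_PER_PPU, "PPU dot")
```

The worst case of 3 master clocks out of 4 is where the "at most 3/4 PPU clock" figure comes from.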
In other words, I can't treat 1 CPU cycle as simply 3 PPU cycles; instead, 1 CPU cycle has 2 phases (rising/falling) interleaved with 3 PPU accesses, like:
cpuclock -> ppu -> cpuclock_H -> ppu -> cpuclock_L -> ppu
Well, instead of a function that "renders" 3 PPU pixels, I need to rebuild it so that it can run a single PPU cycle/access.
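A minimal sketch of that restructuring: a PPU that advances exactly one dot per call, driven in the cpuclock -> ppu -> cpuclock_H -> ppu -> cpuclock_L -> ppu order proposed above. `PPU`, `cpu_cycle`, and the `on_high`/`on_low` hooks are hypothetical; the odd-frame skipped dot and all rendering work are left out.

```python
from typing import Callable, List, Optional, Tuple


class PPU:
    """Hypothetical PPU that can be advanced exactly one dot at a time."""
    DOTS_PER_SCANLINE = 341
    SCANLINES_PER_FRAME = 262  # NTSC; the odd-frame skipped dot is ignored here

    def __init__(self) -> None:
        self.dot = 0
        self.scanline = 0

    def step(self) -> None:
        """Run one PPU cycle (render one pixel / do one access)."""
        self.dot += 1
        if self.dot == self.DOTS_PER_SCANLINE:
            self.dot = 0
            self.scanline = (self.scanline + 1) % self.SCANLINES_PER_FRAME


Hook = Optional[Callable[[], None]]


def cpu_cycle(ppu: PPU, trace: List[Tuple[str, int]],
              on_high: Hook = None, on_low: Hook = None) -> None:
    """One CPU cycle: cpuclock -> ppu -> cpuclock_H -> ppu -> cpuclock_L -> ppu."""
    trace.append(("cpuclock", ppu.dot))
    ppu.step()
    trace.append(("cpuclock_H", ppu.dot))
    if on_high is not None:
        on_high()  # e.g. drive the address bus (hypothetical hook)
    ppu.step()
    trace.append(("cpuclock_L", ppu.dot))
    if on_low is not None:
        on_low()   # e.g. latch read data / commit a write (hypothetical hook)
    ppu.step()


ppu = PPU()
trace: List[Tuple[str, int]] = []
for _ in range(2):
    cpu_cycle(ppu, trace)
print(trace)
print(ppu.dot)  # 6: still 3 PPU dots per CPU cycle, but now visible one at a time
```

The total stays at 3 PPU dots per CPU cycle; the only change is that a register access can now land between dots instead of only before or after a block of three.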
I don't think you need any finer granularity than one CPU cycle for 6502 emulation. Each 6502 cycle always accesses memory, either with a read or a write. I'd be very surprised if this access were done at a different time each cycle; I expect it always occurs at the same relative time. One difference I would not be surprised to find is that writes effectively occur earlier in a cycle than reads, since the written data appears immediately while the data being read isn't latched until later, to give the device time to respond. A device that responded quickly to writes and did not latch data being read (i.e. during a read the data lines could change if the data being read changed) could produce such a difference between write and read times.
OK, I was upset about this last test ROM. -_-;; Also, I'm sorry for any English mistakes below...
Basically, all the previous test ROMs have passed. I had made huge fixes to my pAPU/CPU core, plus a long debugging session to understand why a specific test ROM was failing. Every cycle and CPU event matched, so I was happy.
What my CPU instruction core does is very "simple": the PPU runs 3 cycles for each memory access. What I meant was to "break" those 3 PPU cycles into single cycles, after the "new" info about the low/high edges of the clock... -_-;;
I haven't been able to debug this new test ROM yet; the level of complexity has drained my motivation, so I'm not making any updates until I get proper help some day.
I don't think you need to do half PPU cycles. I was thinking about this a while ago when getting those tests to pass. The only thing I changed was the catch-up: reads would catch up to the current cycle, while writes would catch up to the cycle + 1 PPU cycle. This would be because a read sets the read line and holds it for the CPU read, while a write is held for the CPU write and the PPU only gets it after that. But this is really a guess, and I have nothing to show that it is correct.
matt
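The catch-up rule described above can be sketched in a few lines. This is a minimal sketch of that unverified guess, assuming NTSC's 3 PPU dots per CPU cycle and a PPU that is only run when the CPU touches $2000-$2007; `CatchUpPPU` and `ppu_register_access` are hypothetical names.

```python
PPU_DOTS_PER_CPU_CYCLE = 3  # NTSC


class CatchUpPPU:
    """Hypothetical lazily-run PPU: it is only advanced when the CPU touches it."""

    def __init__(self) -> None:
        self.dots_run = 0

    def run_until(self, target_dot: int) -> None:
        while self.dots_run < target_dot:
            self.dots_run += 1  # a real emulator would render one dot here


def ppu_register_access(ppu: CatchUpPPU, cpu_cycles: int, is_write: bool) -> int:
    """Catch the PPU up before a register access made after `cpu_cycles` CPU cycles.

    Reads catch up to the CPU cycle itself; writes catch up one extra PPU dot,
    following the guess above (not a verified hardware rule).
    """
    target = cpu_cycles * PPU_DOTS_PER_CPU_CYCLE
    if is_write:
        target += 1
    ppu.run_until(target)
    return ppu.dots_run


read_ppu, write_ppu = CatchUpPPU(), CatchUpPPU()
print(ppu_register_access(read_ppu, 10, is_write=False))  # 30
print(ppu_register_access(write_ppu, 10, is_write=True))  # 31
```

This keeps the CPU core at one-cycle granularity and moves the whole difference into where the PPU is caught up to, which matches the argument that half PPU cycles aren't needed.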
I don't mean "half" of anything, but something like...
Code:
LDA $00 [3 cycles]
1. opcode fetch (1 CPU cycle -> 3 PPU clocks)
2. operand byte fetch (+1 CPU cycle -> +3 PPU clocks)
3. A = read from RAM[$00] (+1 CPU cycle -> +3 PPU clocks)

What I see as "correct" follows:
3. A = read from RAM[$00] (+1 CPU cycle -> +3 PPU clocks)
   - put $0000 on the address bus (ppu clock) -- rising edge
   - temp = read from RAM[$00] (ppu clock)
   - A = temp (ppu clock) -- falling edge
For example, opcode $01 (ORA) fails. If I take out that CPU cycle (or 3 PPU cycles), it passes. That's what my question is about.
Another interesting issue (or "hack") concerns fetching the bytes after the opcode (immediate = 1 byte, absolute = 2 bytes): when fetching the opcode, the PPU is clocked before checking the NMI condition (so that the NMI triggers after the instruction completes); when fetching the operand byte, the PPU is clocked after checking the NMI condition. Bump.
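One way to avoid per-fetch ordering hacks may be to model when the 6502 decides to take an interrupt. A minimal sketch, assuming the commonly documented rule that the decision for an instruction uses the NMI state latched by the end of its second-to-last cycle, so an edge arriving in the final cycle is only serviced after the next instruction; `nmi_service_point` is a hypothetical name, and branches and other documented exceptions are not modeled.

```python
from typing import Optional, Sequence


def nmi_service_point(instr_cycles: Sequence[int], nmi_cycle: int) -> Optional[int]:
    """Return the index of the instruction after which the NMI handler would start.

    instr_cycles: cycle count of each instruction executed in order (>= 2 each).
    nmi_cycle:    0-based CPU cycle during which the NMI edge is detected.
    ASSUMPTION: polling uses the state at the end of each instruction's
    second-to-last cycle; returns None if no instruction in the list sees it.
    """
    start = 0
    for index, length in enumerate(instr_cycles):
        poll_cycle = start + length - 2  # second-to-last cycle of this instruction
        if nmi_cycle <= poll_cycle:
            return index
        start += length
    return None


program = [2, 3, 4]  # e.g. a 2-, 3- and 4-cycle instruction in a row
for cycle in range(9):
    print(cycle, nmi_service_point(program, cycle))
```

Under this model, an NMI edge that lands in the last cycle of an instruction (for instance one cycle late because the PPU was clocked after the check instead of before) is delayed by a whole instruction, which shows up in a test as a cycle-count error.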