Currently, my CPU core is littered with this nasty hack:
Code:
#define L lastCycle();
void WDC65816::instructionJumpIndirect() {
  V.l = fetch();
  V.h = fetch();
  W.l = read(uint16(V.w + 0));
L W.h = read(uint16(V.w + 1));
  PC.w = W.w;
}
The idea of L (lastCycle()) is to test interrupts.
This becomes important for eg CLI:
Code:
void WDC65816::CLI() {
L idleIRQ();
  P.i = 0;
}
Eg: https://wiki.nesdev.com/w/index.php/CPU ... 2C_and_PLP
The understanding I've heard in the past is that this is a side-effect of the 65xx's two-stage pipeline. While the CPU is doing the work for cycle N, it's already performing bus cycle N+1. So the last work cycle is really the first bus cycle of the next instruction, and that's where the interrupt test (IRQ/NMI) happens. The test ends up happening before P.i can be set to zero.
So breaking down cycle timings ... presume V=scanline#, H=scanline clock counter, and each opcode cycle takes 6 clocks.
Code:
V,H     bus         work
  0,12  fetch 0x58  cli
L 0,18  idle        I = 0
  0,24  fetch 0xea  nop
So the bus fetches 0x58 (CLI) during clocks 12-17. Then there's an idle bus cycle during clocks 18-23. Then the CPU fetches the next opcode during clocks 24-29 ... but we are testing for interrupts at clock 18, not 24!!
If we re-imagine the bus/work breakdown, we could say that the opcode fetch was already done from the previous instruction.
Code:
V,H     bus         work
  0,12  idle
L 0,18  fetch 0xea  I = 0
  0,24  idle
In code:
Code:
void WDC65816::CLI() {
L prefetch();  //load the next instruction byte
  P.i = 0;
}
But we aren't eliminating any complexity: now every single instruction has to end with the opcode byte fetch, so that our final work cycle can come after it.
Then there's this weird thing I found while chasing cycle-perfect interrupt timings:
Code:
//immediate, 2-cycle opcodes with idle cycle will become bus read
//when an IRQ is to be triggered immediately after opcode completion.
//this affects the following opcodes:
//  clc, cld, cli, clv, sec, sed, sei,
//  tax, tay, txa, txy, tya, tyx,
//  tcd, tcs, tdc, tsc, tsx, txs,
//  inc, inx, iny, dec, dex, dey,
//  asl, lsr, rol, ror, nop, xce.
auto WDC65816::idleIRQ() -> void {
  if(interruptPending()) {
    //modify I/O cycle to bus read cycle, do not increment PC
    read(PC.d);
  } else {
    idle();
  }
}
I spent weeks testing this extensively to confirm it's exactly what was happening ... exhaustively covering every possibility was pretty hellish. The catch is that a read from a slow ROM region takes 8 clocks instead of 6, so substituting a read for the idle cycle can change an instruction's timing.
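To make the clock math concrete, here's a minimal sketch of that idea. The region split, clock costs, and the names readClocks/idleIRQClocks are illustrative assumptions for this post, not the actual SNES memory map:

```cpp
#include <cassert>
#include <cstdint>

//sketch: an internal idle cycle always costs 6 clocks, but a bus read costs
//6 or 8 clocks depending on the region addressed. idleIRQ() turning the idle
//cycle into read(PC.d) can therefore lengthen the instruction whenever PC
//sits in a slow region.
constexpr int idleClocks = 6;

int readClocks(uint32_t address) {
  //hypothetical region split: treat banks 0x80+ as fast, everything else as slow
  bool fastRegion = (address >> 16) >= 0x80;
  return fastRegion ? 6 : 8;
}

int idleIRQClocks(uint32_t pc, bool interruptPending) {
  //mirrors the shape of idleIRQ() above: a pending interrupt swaps idle for read
  return interruptPending ? readClocks(pc) : idleClocks;
}
```

With PC in a slow bank, the substituted read costs 8 clocks where the idle cycle would have cost 6, which is exactly the timing difference that had to be tested for.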
If we fetch opcode bytes and then execute their instructions, the interrupt() function looks like this:
Code:
void WDC65816::interrupt() {
  read(PC.d);
  idle();
N push(PC.b);
  push(PC.h);
  push(PC.l);
  push(EF ? P & ~0x10 : P);
  IF = 1;
  DF = 0;
  PC.l = read(r.vector + 0);
  PC.h = read(r.vector + 1);
  PC.b = 0x00;
}
But if we fetch opcodes at the end of each instruction (thus the next opcode has already been fetched before executing instructions), it looks like this:
Code:
void WDC65816::interrupt() {
  idle();
  PC.w--;  //undo the last instruction prefetch increment
N push(PC.b);
  push(PC.h);
  push(PC.l);
  push(EF ? P & ~0x10 : P);
  IF = 1;
  DF = 0;
  PC.l = read(r.vector + 0);
  PC.h = read(r.vector + 1);
  PC.b = 0x00;
  prefetch();  //since PC changed, the old opcode prefetch was invalidated
}
If an interrupt fires after say, NOP, and each instruction ends with an opcode prefetch, then the last cycle of NOP was an opcode fetch and that takes +8 cycles instead of +6 cycles.
But that only explains the exception case; so far all we've done is invert idleIRQ(). An instruction like XBA (fetch + idle + idle) doesn't exhibit the effect. With opcode fetching moved to the end of the instruction, XBA would become (idle + idle + opfetch), yet my testing showed there was no opcode fetch when an interrupt fires after XBA. That means the last cycle of XBA would have to call an idleInvertedIRQ() that turns the opcode fetch into an idle cycle (along with a PC.w increment) if an interrupt is pending.
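Here's a sketch of what that idleInvertedIRQ() might look like, reduced to a toy model so the state changes are visible. MiniCPU and its members are hypothetical stand-ins, not the real core:

```cpp
#include <cassert>
#include <cstdint>

//toy model of the hypothetical idleInvertedIRQ() described above: the mirror
//image of idleIRQ(). Where idleIRQ() promotes an idle cycle to an opcode-fetch
//read when an interrupt is pending, this demotes the end-of-instruction opcode
//fetch to an idle cycle, while still incrementing PC so the program counter
//matches a completed prefetch.
struct MiniCPU {
  uint16_t pc = 0;
  bool pending = false;   //interruptPending() stand-in
  bool didFetch = false;  //records whether a bus fetch occurred

  void idle() {}                              //internal cycle, no bus access
  void prefetch() { didFetch = true; pc++; }  //opcode fetch + PC increment

  void idleInvertedIRQ() {
    if(pending) {
      idle();  //suppress the opcode fetch...
      pc++;    //...but keep PC consistent with a completed prefetch
    } else {
      prefetch();
    }
  }
};
```

The symmetry is the point: either way, the pending-interrupt case swaps the nature of the final bus cycle while leaving PC in the same place.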
So I don't really know which method is better. The only nice thing about putting the opcode fetch at the end is that it becomes a consistent final cycle. Branch conditions are currently pretty annoying otherwise:
Code:
void WDC65816::instructionBranch(bool take) {
  //prefetch() implied right before executing this by WDC65816::interrupt() dispatch call
  if(!take) {
L   fetch();
  } else {
    U.l = fetch();
    V.w = PC.d + (int8)U.l;
    idle6(V.w);
L   idle();
    PC.w = V.w;
    idleBranch();
  }
}
But with the opcode fetch at the end:
Code:
void WDC65816::instructionBranch(bool take) {
  if(!take) {
    fetch();
  } else {
    U.l = fetch();
    V.w = PC.d + (int8)U.l;
    idle6(V.w);
    idle();
    PC.w = V.w;
    idleBranch();
  }
L prefetch();
}
And that means we can combine all those annoying 8/16-bit CPU instructions:
Code:
void WDC65816::instructionImmediateRead8(alu8 op) {
L W.l = fetch();
  alu(W.l);
}

void WDC65816::instructionImmediateRead16(alu16 op) {
  W.l = fetch();
L W.h = fetch();
  alu(W.w);
}
To eg:
Code:
void WDC65816::instructionImmediateRead(alu16 op, bool word) {
  W.w = fetch();
  if(word) W.h = fetch();
  alu(W.w, word);
L prefetch();
}
...
Ultimately what I want to do here is get rid of the need for L. My thought for this was, what if we do the lastCycle() test on every single cycle, and use the second-to-last result to determine if we should fire an interrupt after an instruction?
But with my current lastCycle() test, it's not possible to call it on every cycle because it has side effects.
Code:
bool CPU::nmiTest() {
  if(!status.nmiTransition) return false;
  status.nmiTransition = false;
  r.wai = false;
  return true;
}

bool CPU::irqTest() {
  if(!status.irqTransition && !r.irq) return false;
  status.irqTransition = false;
  r.wai = false;
  return !r.p.i;
}

void CPU::lastCycle() {
  if(!status.irqLock) {
    if(nmiTest()) status.nmiPending = true;
    if(irqTest()) status.irqPending = true;
    status.interruptPending = status.nmiPending || status.irqPending;
  }
}
The (irq,nmi)Transition flags are tested every clock cycle and set the instant an interrupt triggers. If we call lastCycle() on every cycle, we could end up clearing the transition flags early. The same goes for the WAI instruction flag, though that's probably less important since WAI just sits in a spin loop testing the interrupt flag constantly.
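One way around the side effects, sketched as a toy model (PollCPU, peek(), and the field names are my assumptions here): split the per-cycle test into a pure "peek" that only reads the transition flags, and defer clearing them (and r.wai) to the moment the interrupt is actually serviced. Each cycle then records the previous poll result, so the second-to-last cycle's poll decides whether to fire after the instruction:

```cpp
#include <cassert>

//mirrors the flags used by nmiTest()/irqTest()/lastCycle() above
struct Status {
  bool nmiTransition = false;
  bool irqTransition = false;
  bool irqLock = false;
};

struct PollCPU {
  Status status;
  bool irqLine = false;   //level-sensitive /IRQ input (r.irq)
  bool iFlag = false;     //P.i
  bool lastPoll = false;  //poll result from the previous cycle
  bool priorPoll = false; //poll result from two cycles ago (second-to-last)

  //no side effects: safe to call every cycle. The transition flags are only
  //cleared later, when the interrupt is actually taken.
  bool peek() const {
    if(status.irqLock) return false;
    if(status.nmiTransition) return true;
    if((status.irqTransition || irqLine) && !iFlag) return true;
    return false;
  }

  void cycle() {  //run once per CPU cycle
    priorPoll = lastPoll;
    lastPoll = peek();
  }

  bool shouldInterrupt() const { return priorPoll; }  //second-to-last result
};
```

An NMI asserted during cycle N is then seen by the poll at cycle N, but only acted on once that result has aged into priorPoll, two cycles later.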
...
The reason I bring this up is because Sour appears to be doing exactly what I've always wanted to do in testing interrupts every cycle and then using the second-to-last result for triggering interrupts. If he can actually pass my test_nmi and test_irq test ROMs then that is really, really impressive. Because right now, I don't see how that's possible >_<