Has anyone ever written a tool that takes 6502 assembler code, and changes it to run in fixed number of clock cycles?
It is a fairly mechanical process, and should be possible to automate. The need for such process occurs fairly often when you are trying to do scanline effects in an NMI: You have some code that needs to run, but you also need to do specific things at a specific time. You might use delay_n to delay a certain number of clock cycles before doing that time-sensitive thing, but if your delays are large, you might not have any enough time remaining in the frame to execute your actual payload. So you want to execute the payload while the delay runs. In essence, the payload is factored into the delay. That's when you must know exactly how many clock cycles the payload takes.
Now I realize that it is not possible to do for all code samples. Specifically, if the code includes loops (branches back into already-visited code), the loop boundary must be possible to calculate at compile-time. You also must know whether branch instructions do a page-wrap or not. And you must know whether absolute-indexed reads, or indirect-indexed reads do a page-wrap or not. Unconditional branches (such as BNE after nonzero LDA) must be possible to identify. And if you are playing DPCM samples at the same time, you must take that into account. If peripherals are generating interrupts, things become even more complicated.
Let me show here an example of what I recently did.
The process involves finding all forward branches, and analyzing the cycle counts taken by the different paths up until the point where they merge.
For example, let's study the part from bcs @nohide2 until reaching @next.
If the branch is taken, the following instructions are executed: bcs @nohide2 (3), and lda $300,x (4), for a total of 7.
If the branch is not taken, the following instructions are executed: bcs @nohide2 (2), lda #$F8 (2), sta $200,x (5), and bne @next (3), for a total of 12. The difference between these cycle counts is 5, and therefore a delay_n 5 must be inserted in the shorter path, as was done here.
Next, we have the branch from bcs @nohide1 until @next, where all execution paths merge again. If the branch is taken, the list of instructions executed is bcs @nohide1 (3), lda $300,x (4), and delay_n 5 (5), for a total of 12 cycles. If the branch is not taken, the list of instructions executed is bcs nohide1 (2), lda $203,x (4), sbc #10 (2), cmp #216 (2), blob that takes 12 cycles (12), for a total of 22 cycles. The difference is 10 cycles, which means that delay_n 10 must be inserted somewhere that is only executed along the way of the shorter path, as was done here.
Finally, the branch bne :- goes backwards. The branch depends on the value of X, because the last instruction before the branch that affects the flags used by the bne is an inx. Before the loop, X is initialized with a constant value of 0. Inside the loop, X is incremented by 4, and the final branch is bne, which means that 64 iterations of the loop will be run. All code inside the loop, now that branches have been perfectly balanced, from lda $201,x until inx, measures exactly 64 cycles. The branch instruction itself takes 3 cycles. This means 64 loops of 67 cycles. Outside the loop is one ldx #, which adds 2. The whole cycle count is therefore 2+64×67−1 cycles (−1 because the final branch is not taken), or 4289 cycles, plus JSR+RTS (which are 12).
Difficult special cases:
If there is a case where the difference different branches is 1 cycle, then two cycles must be added to one and three to the other, because it is not possible to do 1 cycle of delay. It may be possible to compensate this delay with a single bcc *+2 instruction (or equivalent) at the merging point, if it is known that one branch always comes out with certain flags and the other branch comes out with the opposite flags.
If one of the branches merges into the other without executing any instructions (example: bcs skip ― adc #1 ― tax ― skip:), instructions must still be added into the shorter branch somehow. If the difference is not 1 cycle and/or it is not possible to use the dummy branch trick as explained above, it will require modifying the branch into an artificial new location, executing the delay there, and then from that location, jumping to the original target. In this case, the replacement would be: bcs outside ― adc #1 ― tax ― skip: (2+2+2=6) and outside: jmp skip (3+3=6). If the outside location is too far, it would be: bcc inside ― jmp outside ― inside: adc #1 ― tax ― delay_n 3 ― skip: (3+2+2+3=10) and outside: delay_n 2 ― jmp skip (2+3+2+3=10). But this kind of changes may lead into a cascading problem with branch distances.
It is a fairly mechanical process, and should be possible to automate. The need for such process occurs fairly often when you are trying to do scanline effects in an NMI: You have some code that needs to run, but you also need to do specific things at a specific time. You might use delay_n to delay a certain number of clock cycles before doing that time-sensitive thing, but if your delays are large, you might not have any enough time remaining in the frame to execute your actual payload. So you want to execute the payload while the delay runs. In essence, the payload is factored into the delay. That's when you must know exactly how many clock cycles the payload takes.
Now I realize that it is not possible to do for all code samples. Specifically, if the code includes loops (branches back into already-visited code), the loop boundary must be possible to calculate at compile-time. You also must know whether branch instructions do a page-wrap or not. And you must know whether absolute-indexed reads, or indirect-indexed reads do a page-wrap or not. Unconditional branches (such as BNE after nonzero LDA) must be possible to identify. And if you are playing DPCM samples at the same time, you must take that into account. If peripherals are generating interrupts, things become even more complicated.
Let me show here an example of what I recently did.
Code:
StarField:
ldx #0
: lda $201,x ;attr ;4
and #3 ;2
sec ;2
sbc $203,x ;X ;4
eor #$FF ;2
sta $203,x ;5
; Hide if Y >= $48 && Y < $D0
; && X >= $0B && X < $E3
lda $300,x ;4
sec ;2
sbc #$48 ;2
cmp #($D0-$48) ;2
bcs @nohide1 ;2
lda $203,x ;4 ;4
; carry is clear
sbc #$0B-1 ;2 ;2
cmp #($E3-$0B) ;2 ;2
bcs @nohide2 ;2 ;2
;
@hide: lda #$F8 ;2 ;2
sta $200,x ;5 ;5
bne @next ;3 ;3
@nohide1:
;1 cycle done, 10 to go
delay_n 10
@nohide2: ;1
lda $300,x ;4
delay_n 5 ;5
@next: sta $200,x ;5
inx ;2
inx ;2
inx ;2
inx ;2
bne :- ;3
rts ;-1
ldx #0
: lda $201,x ;attr ;4
and #3 ;2
sec ;2
sbc $203,x ;X ;4
eor #$FF ;2
sta $203,x ;5
; Hide if Y >= $48 && Y < $D0
; && X >= $0B && X < $E3
lda $300,x ;4
sec ;2
sbc #$48 ;2
cmp #($D0-$48) ;2
bcs @nohide1 ;2
lda $203,x ;4 ;4
; carry is clear
sbc #$0B-1 ;2 ;2
cmp #($E3-$0B) ;2 ;2
bcs @nohide2 ;2 ;2
;
@hide: lda #$F8 ;2 ;2
sta $200,x ;5 ;5
bne @next ;3 ;3
@nohide1:
;1 cycle done, 10 to go
delay_n 10
@nohide2: ;1
lda $300,x ;4
delay_n 5 ;5
@next: sta $200,x ;5
inx ;2
inx ;2
inx ;2
inx ;2
bne :- ;3
rts ;-1
The process involves finding all forward branches, and analyzing the cycle counts taken by the different paths up until the point where they merge.
For example, let's study the part from bcs @nohide2 until reaching @next.
If the branch is taken, the following instructions are executed: bcs @nohide2 (3), and lda $300,x (4), for a total of 7.
If the branch is not taken, the following instructions are executed: bcs @nohide2 (2), lda #$F8 (2), sta $200,x (5), and bne @next (3), for a total of 12. The difference between these cycle counts is 5, and therefore a delay_n 5 must be inserted in the shorter path, as was done here.
Next, we have the branch from bcs @nohide1 until @next, where all execution paths merge again. If the branch is taken, the list of instructions executed is bcs @nohide1 (3), lda $300,x (4), and delay_n 5 (5), for a total of 12 cycles. If the branch is not taken, the list of instructions executed is bcs nohide1 (2), lda $203,x (4), sbc #10 (2), cmp #216 (2), blob that takes 12 cycles (12), for a total of 22 cycles. The difference is 10 cycles, which means that delay_n 10 must be inserted somewhere that is only executed along the way of the shorter path, as was done here.
Finally, the branch bne :- goes backwards. The branch depends on the value of X, because the last instruction before the branch that affects the flags used by the bne is an inx. Before the loop, X is initialized with a constant value of 0. Inside the loop, X is incremented by 4, and the final branch is bne, which means that 64 iterations of the loop will be run. All code inside the loop, now that branches have been perfectly balanced, from lda $201,x until inx, measures exactly 64 cycles. The branch instruction itself takes 3 cycles. This means 64 loops of 67 cycles. Outside the loop is one ldx #, which adds 2. The whole cycle count is therefore 2+64×67−1 cycles (−1 because the final branch is not taken), or 4289 cycles, plus JSR+RTS (which are 12).
Difficult special cases:
If there is a case where the difference different branches is 1 cycle, then two cycles must be added to one and three to the other, because it is not possible to do 1 cycle of delay. It may be possible to compensate this delay with a single bcc *+2 instruction (or equivalent) at the merging point, if it is known that one branch always comes out with certain flags and the other branch comes out with the opposite flags.
If one of the branches merges into the other without executing any instructions (example: bcs skip ― adc #1 ― tax ― skip:), instructions must still be added into the shorter branch somehow. If the difference is not 1 cycle and/or it is not possible to use the dummy branch trick as explained above, it will require modifying the branch into an artificial new location, executing the delay there, and then from that location, jumping to the original target. In this case, the replacement would be: bcs outside ― adc #1 ― tax ― skip: (2+2+2=6) and outside: jmp skip (3+3=6). If the outside location is too far, it would be: bcc inside ― jmp outside ― inside: adc #1 ― tax ― delay_n 3 ― skip: (3+2+2+3=10) and outside: delay_n 2 ― jmp skip (2+3+2+3=10). But this kind of changes may lead into a cascading problem with branch distances.