Making sure that code runs in fixed number of cycles

Making sure that code runs in fixed number of cycles
by Bisqwit on 2016-05-04 (#170220)

Has anyone ever written a tool that takes 6502 assembler code, and changes it to run in fixed number of clock cycles?

It is a fairly mechanical process, and should be possible to automate. The need for such process occurs fairly often when you are trying to do scanline effects in an NMI: You have some code that needs to run, but you also need to do specific things at a specific time. You might use delay_n to delay a certain number of clock cycles before doing that time-sensitive thing, but if your delays are large, you might not have any enough time remaining in the frame to execute your actual payload. So you want to execute the payload while the delay runs. In essence, the payload is factored into the delay. That's when you must know exactly how many clock cycles the payload takes.

Now I realize that it is not possible to do for all code samples. Specifically, if the code includes loops (branches back into already-visited code), the loop boundary must be possible to calculate at compile-time. You also must know whether branch instructions do a page-wrap or not. And you must know whether absolute-indexed reads, or indirect-indexed reads do a page-wrap or not. Unconditional branches (such as BNE after nonzero LDA) must be possible to identify. And if you are playing DPCM samples at the same time, you must take that into account. If peripherals are generating interrupts, things become even more complicated.

Let me show here an example of what I recently did.

Code:

StarField:
        ldx #0
:       lda $201,x ;attr                        ;4
        and #3                                  ;2
        sec                                     ;2
        sbc $203,x ;X                           ;4
        eor #$FF                                ;2
        sta $203,x                              ;5
        ; Hide if Y >= $48 && Y < $D0
        ;      && X >= $0B && X < $E3
        lda $300,x                              ;4
        sec                                     ;2
        sbc #$48                                ;2
        cmp #($D0-$48)                          ;2
        bcs @nohide1                            ;2
        lda $203,x      ;4                      ;4
        ; carry is clear
        sbc #$0B-1      ;2                      ;2
        cmp #($E3-$0B)  ;2                      ;2
        bcs @nohide2    ;2                      ;2
        ;
@hide:  lda #$F8        ;2                      ;2
        sta $200,x      ;5                      ;5
        bne @next       ;3                      ;3
@nohide1:
        ;1 cycle done, 10 to go
        delay_n 10
@nohide2:               ;1
        lda $300,x      ;4
        delay_n 5       ;5
@next:  sta $200,x                              ;5
        inx                                     ;2
        inx                                     ;2
        inx                                     ;2
        inx                                     ;2
        bne :-                                  ;3
        rts                                     ;-1

The process involves finding all forward branches, and analyzing the cycle counts taken by the different paths up until the point where they merge.

For example, let's study the part from bcs @nohide2 until reaching @next.
If the branch is taken, the following instructions are executed: bcs @nohide2 (3), and lda $300,x (4), for a total of 7.
If the branch is not taken, the following instructions are executed: bcs @nohide2 (2), lda #$F8 (2), sta $200,x (5), and bne @next (3), for a total of 12. The difference between these cycle counts is 5, and therefore a delay_n 5 must be inserted in the shorter path, as was done here.

Next, we have the branch from bcs @nohide1 until @next, where all execution paths merge again. If the branch is taken, the list of instructions executed is bcs @nohide1 (3), lda $300,x (4), and delay_n 5 (5), for a total of 12 cycles. If the branch is not taken, the list of instructions executed is bcs nohide1 (2), lda $203,x (4), sbc #10 (2), cmp #216 (2), blob that takes 12 cycles (12), for a total of 22 cycles. The difference is 10 cycles, which means that delay_n 10 must be inserted somewhere that is only executed along the way of the shorter path, as was done here.

Finally, the branch bne :- goes backwards. The branch depends on the value of X, because the last instruction before the branch that affects the flags used by the bne is an inx. Before the loop, X is initialized with a constant value of 0. Inside the loop, X is incremented by 4, and the final branch is bne, which means that 64 iterations of the loop will be run. All code inside the loop, now that branches have been perfectly balanced, from lda $201,x until inx, measures exactly 64 cycles. The branch instruction itself takes 3 cycles. This means 64 loops of 67 cycles. Outside the loop is one ldx #, which adds 2. The whole cycle count is therefore 2+64×67−1 cycles (−1 because the final branch is not taken), or 4289 cycles, plus JSR+RTS (which are 12).

Difficult special cases:

If there is a case where the difference different branches is 1 cycle, then two cycles must be added to one and three to the other, because it is not possible to do 1 cycle of delay. It may be possible to compensate this delay with a single bcc *+2 instruction (or equivalent) at the merging point, if it is known that one branch always comes out with certain flags and the other branch comes out with the opposite flags.

If one of the branches merges into the other without executing any instructions (example: bcs skip ― adc #1 ― tax ― skip:), instructions must still be added into the shorter branch somehow. If the difference is not 1 cycle and/or it is not possible to use the dummy branch trick as explained above, it will require modifying the branch into an artificial new location, executing the delay there, and then from that location, jumping to the original target. In this case, the replacement would be: bcs outside ― adc #1 ― tax ― skip: (2+2+2=6) and outside: jmp skip (3+3=6). If the outside location is too far, it would be: bcc inside ― jmp outside ― inside: adc #1 ― tax ― delay_n 3 ― skip: (3+2+2+3=10) and outside: delay_n 2 ― jmp skip (2+3+2+3=10). But this kind of changes may lead into a cascading problem with branch distances.

Re: Making sure that code runs in fixed number of cycles
by darryl.revok on 2016-05-04 (#170251)

This in intriguing. I made a tool to calculate values for a nested delay loop to match a known cycle count of another routine, but nothing close to what you're talking about.

I have a love/hate relationship with raster effects. I did a lot of them on my game project and came up with some cool stuff. On the other hand, I spent months working on overcoming the challenges associated with what I was doing, primarily in performing the nametable updates in accordance with the varying scroll speeds, and unlimited X/Y scrolling, within acceptable cycle limits.

Something to make part of the raster effect process more streamlined would be nice to take some of the load off. At the same time, I hope you don't mind me to make a brief public service announcement that the people who advised me to work on building a solid gameplay engine first were right. Parallax scrolling can really add a lot of depth to the playfield, but if you intend to do something complex, you might want to carve a year out of your schedule just to refine that.

That all being said, I like the idea of the project, and I could see it finding good use to simplify timing status bars, intro screens, or for the demo scene. I'd use it if it was available, although I wouldn't want to do anything too crazy with it, from my previous experiences. But it would help in timing a status bar, so that I could still keep Sprite 0 available.

If the one cycle delay is your biggest issue, the most obvious solution is to add a JMP in the longer branch, and then wait 4 cycles on the shorter branch. I don't know if it's the BEST solution, but sometimes I like to do what I know will work until I come up with something better.