Split-screen vertical scrolling and precise delays

by Bisqwit on 2013-01-19 (#106501)

Split-screen horizontal scrolling is easy: just write two values to $2005 at any time during the scanline, one of them being the X scrolling offset.
But it turns out that vertical scrolling is not that easy, especially if you want to do it without any mapper support.

In a recent project of mine I did that anyway. It has been tested with MMC1, VRC6 and MMC3 (in emulation only).
You can see a screenshot of that here:

Here is the source code. It is in the form of two macros.

The first one prepares a PPU write which sets the scrolling offset to a particular value.
The second one commits the write. It must be performed within a specific time window during h-blank, or you will get artifacts.

Code:

.macro PPU_WriteOffset
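        ; In:  A = NTA (nametable index, 0-3)
        ;      X = X-offset
        ;      Y = Y-offset
        ; Out: Y = value for first $2006 write, A = value for second $2006 write,
        ;      X = X-offset, $01 = Y-offset. Uses $00,$01 as temp.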
        asl
        asl       ; For first write to $2006: NTA*4
        sty $01
        tay
        ;sta $2006 ; t: yyy NN YYYYY XXXXX xxx
        ;          ;    *54 32 10--- ----- --- <- These are affected. * = set zero.
        ;          ;        NN                 <- UNIQUE DATA
        txa
        lsr
        lsr
        lsr
        sta $00
        lda $01
        asl
        asl
        and #$E0
        ora $00   ; Second write to $2006:   X/8 + 32*Y/8
        ;
        ; Cycle cost: 2+2+3+2+ 2*4 + 3+3+2+2+2+3 = 32
.endmacro
.macro PPU_WriteOffset_Finish
        sty $2006
        ldy $01 
        sty $2005 ; t: yyy NN YYYYY XXXXX xxx
        ;              210 -- 76543 ----- ---  <- These are affected.
        ;         ;    yyy    YY               <- UNIQUE DATA
        ;
        ; At THIS point, we must be in h-blank
        ;
        stx $2005 ; t: yyy NN YYYYY XXXXX xxx
        ;              --- -- ----- 76543 210  <- These are affected.
        ;         ;                       xxx  <- UNIQUE DATA
        
        ;
        sta $2006 ; t: yyy NN YYYYY XXXXX xxx
        ;              --- -- --765 43210 ---  <- These are affected.
        ;
        ;          This last write is the only one that updates v = t!
        ;
        ; Cycle cost: 4*4+3 = 19
.endmacro
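
A minimal usage sketch, assuming both macros above are already defined and that $00 and $01 are free as temporaries. The register values and the delay placeholder are illustrative only; the real timing depends on where in the frame your code runs.

Code:

        ; Hypothetical split: show the top-left corner of nametable 1.
        lda #1                  ; A = NTA (nametable index)
        ldx #0                  ; X = X-offset in pixels
        ldy #0                  ; Y = Y-offset in pixels
        PPU_WriteOffset         ; 32 cycles; leaves A, X, Y and $01 loaded for the commit
        ; ... insert your own delay here, so that the writes below
        ;     land in the h-blank window described above ...
        PPU_WriteOffset_Finish  ; 19 cycles; writes $2006, $2005, $2005, $2006

Because the first macro only computes values and touches no PPU register, it can run at any time before the split. Only the second macro is timing-critical.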

Related to this, here is a macro that delays execution for a given number of scanlines, plus or minus a configurable number of cycles. All of its arguments must be compile-time constants. The macro depends on a compile-time constant called "PAL", which must be 0 if you are compiling for an NTSC system and 1 if you are compiling for a PAL system.

Code:

.macro ScanlineDelay already_done, do_scanlines, delta
    .if PAL=0
        compensate_dmc_delay (already_done), ((do_scanlines)*341   /3 + (delta))
    .else
        compensate_dmc_delay (already_done), ((do_scanlines)*341*5/16 + (delta))
    .endif 
.endmacro
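
The two constants come from the scanline length: a scanline is 341 PPU dots, and the CPU runs one cycle per 3 dots on NTSC and one cycle per 3.2 (16/5) dots on PAL. The assertions below are a small self-check of that arithmetic, not part of the original code. They rely on ca65 evaluating constant expressions with truncating integer division.

Code:

        ; NTSC: 341/3 = 113.67 cycles per scanline.
        .assert (1*341/3)     = 113,  error, "NTSC: 1 scanline should truncate to 113 cycles"
        .assert (2*341/3)     = 227,  error, "NTSC: 2 scanlines should truncate to 227 cycles"
        .assert (3*341/3)     = 341,  error, "NTSC: 3 scanlines should be exactly 341 cycles"
        ; PAL: 341*5/16 = 106.5625 cycles per scanline.
        .assert (1*341*5/16)  = 106,  error, "PAL: 1 scanline should truncate to 106 cycles"
        .assert (16*341*5/16) = 1705, error, "PAL: 16 scanlines should be exactly 1705 cycles"

In other words, the cycle count is a whole number only at multiples of 3 scanlines on NTSC and 16 scanlines on PAL; for other scanline counts the macro rounds down, and the delta argument is there to adjust the result.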

Here is a version that does not need the number of scanlines to be a compile-time constant. (It still requires the PAL/NTSC variable.)

Code:

.pushseg
.segment "FUNC_SCANLINEDELAY"
; A = number of scanlines to delay. Must be >= 1.
; Uses $00,$01,$F8 as temp.
VariableScanlineDelay:
        @nearby_rts_14cyc = $C450 ; lda #1 + rts
        @nearby_rts       = rts12
        ; Some zeropage variable.
        @remain = $F8
        sta @remain             ;3
.if PAL=0
        @n_cases = 3
.else
        @n_cases = 16
.endif
@loop:
        lda #@n_cases           ;2
        cmp @remain             ;3
        Jcc @thatmuch           ;3
        ; n_cases >= remain
        ; remain <= n_cases
        lda @remain             ;-1+3
        tax                     ;2   
        lda #0                  ;2   
        sta @remain             ;3   
        ; X = remain, remain = 0     
        jmp @jump               ;3 -- 20 so far
@thatmuch:
        ; n_cases < remain
        ; remain > n_cases
        tax                     ;2
        lda @remain             ;3
        sec                     ;2
        sbc #@n_cases           ;2
        sta @remain             ;3 -- 20 so far
        ; X = n_cases, remain -= n_cases
@jump:
        TableWrapCheck (@lo_ptr_table-1), @n_cases, "@lo_ptr_table causes page wrap"
        TableWrapCheck (@hi_ptr_table-1), @n_cases, "@hi_ptr_table causes page wrap"

        lda @lo_ptr_table-1,x   ;4 assuming no wrap
        sta $00                 ;3
        lda @hi_ptr_table-1,x   ;4 assuming no wrap
        sta $01                 ;3
        jmp ($00)               ;5 -- 39 so far
@continue:
        ;                       ;  -- 42 so far (JMP)
        lda @remain             ;3
        Jne @loop               ;3
        ;                       ;-1
        ; Overhead so far: 6+3-1 = 8 cycles (including JSR).
        ;          Add 6 for RTS = 14 cycles.
        rts

.segment "DATA_SCANLINEDELAY_POINTERS"
        .byte 0 ; We don't need a delay0 pointer.
@lo_ptr_table:
    .repeat @n_cases, I
        .byte <.ident (.sprintf("@delay%d", I+1))
    .endrepeat
@hi_ptr_table:
    .repeat @n_cases, I
        .byte >.ident (.sprintf("@delay%d", I+1))
    .endrepeat

    .repeat @n_cases, I
      .segment .sprintf("DELAY%d", I+1)
      .ident (.sprintf("@delay%d", I+1)):
        ScanlineDelay 48, (I+1), 0
        jmp @continue
    .endrepeat
.popseg

It will produce a number of segments, which you will have to include in your linker script.
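
For an NTSC build (@n_cases = 3), the segment names that appear in the code are FUNC_SCANLINEDELAY, DATA_SCANLINEDELAY_POINTERS and DELAY1 to DELAY3; a PAL build goes up to DELAY16. A sketch of the corresponding ld65 configuration entries follows. The memory area name PRG is a placeholder for whatever your own configuration defines.

Code:

SEGMENTS {
    # ... your existing segments ...
    FUNC_SCANLINEDELAY:          load = PRG, type = ro;
    DATA_SCANLINEDELAY_POINTERS: load = PRG, type = ro;
    DELAY1:                      load = PRG, type = ro;
    DELAY2:                      load = PRG, type = ro;
    DELAY3:                      load = PRG, type = ro;
}

Keeping each piece in its own segment presumably makes it easier to place the pieces so that the page-crossing checks pass, for example with the align segment attribute.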

Both of these depend on the compensate_dmc_delay macro, which produces code that sleeps for the given number of cycles regardless of whether a DMC sample is playing. The detection of DMC playback must be customized for your program.
http://bisqwit.iki.fi/src/6502-inline_dmc_delay_compensation.inc

That, in turn, depends on a delay_n macro, which sleeps for an exact number of cycles. You can find it implemented in... well, there are a number of files:
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepy.inc This preserves Y, but clobbers A, X, C, Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepaxyc.inc This preserves A, X, Y and C, but clobbers Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepaxyczndv.inc This preserves A, X, Y, C, Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepy-a25.inc This preserves just Y, and requires the presence of a function called delay_a_25_clocks. You can find the implementation below.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keep-ax33-xa30.inc This clobbers all registers besides S and I, and requires the presence of functions called delay_256a_x_33_clocks and delay_256x_a_30_clocks.
The delay code is optimized for size, and is surprisingly good. Most values produce 5-7 bytes of code. The versions which preserve more registers produce larger code.
The delay is customized and optimized for size for up to 5000 cycles; longer delays are built recursively by halving the delay.
None of the macros change the contents of memory locations. Many of them do read RAM locations. Some of them perform two successive RAM writes, where the first changes a value and the second changes it back.
You can find a version for your own set of preserved registers by editing the filename in the URL, following the established pattern. In the future, the macro might receive the set of registers as a parameter.

You can find the implementation of the aforementioned three runtime delay functions below. Portions were written by Blargg and dclxvi.

Code:

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A:X clocks+overhead
; Time: 256*A+X+33 clocks (including JSR)
; Written by Joel Yliluoma. Clobbers A. Preserves X,Y. Has relocations.
;;;;;;;;;;;;;;;;;;;;;;;;
:       ; do 256 cycles.        ; 5 cycles done so far. Loop is 2+1+ 2+3+ 1 = 9 bytes.
        sbc #1                  ; 2 cycles - Carry was set from cmp
        pha                     ; 3 cycles
         lda #(256-25-10-2-4)   ; +2
         jsr delay_a_25_clocks
        pla                     ; 4 cycles
delay_256a_x_33_clocks:
        cmp #1                  ; +2; 2 cycles overhead
        bcs :-                  ; +2; 4 cycles overhead
        ; 0-255 cycles remain, overhead = 4
        txa                     ; +2; 6; +27 = 33
        ; 15 + JSR + RTS overhead for the code below. JSR=6, RTS=6. 15+12=27
        ;          ;    Cycles        Accumulator     Carry flag
        ;          ; 0  1  2  3  4       (hex)        0 1 2 3 4
        sec        ; 0  0  0  0  0   00 01 02 03 04   1 1 1 1 1
:       sbc #5     ; 2  2  2  2  2   FB FC FD FE FF   0 0 0 0 0
        Jcs :-     ; 4  4  4  4  4   FB FC FD FE FF   0 0 0 0 0
        lsr a      ; 6  6  6  6  6   7D 7E 7E 7F 7F   1 0 1 0 1
        Jcc :+     ; 8  8  8  8  8   7D 7E 7E 7F 7F   1 0 1 0 1
:       sbc #$7E   ;10 11 10 11 10   FF FF 00 00 01   0 0 1 1 1
        Jcc :+     ;12 13 12 13 12   FF FF 00 00 01   0 0 1 1 1
        Jeq :+     ;      14 15 14         00 00 01       1 1 1
        Jne :+     ;            16               01           1
:       rts        ;15 16 17 18 19   (thanks to dclxvi for the algorithm) 
;;;;;;;;;;;;;;;;;;;;;;;;  
; Delays X:A clocks+overhead
; Time: 256*X+A+30 clocks (including JSR)
; Written by Joel Yliluoma. Clobbers A,X. Preserves Y. Has relocations.
;;;;;;;;;;;;;;;;;;;;;;;;
delay_256x_a_30_clocks:
        cpx #0                  ; +2
        Jeq delay_a_25_clocks   ; +3  (25+5 = 30 cycles overhead)
        ; do 256 cycles.        ;  4 cycles so far. Loop is 1+1+ 2+3+ 1+3 = 11 bytes.
        dex                     ;  2 cycles
        pha                     ;  3 cycles
         lda #(256-25-9-2-7)    ; +2
         jsr delay_a_25_clocks
        pla                        ; 4
        jmp delay_256x_a_30_clocks ; 3.
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A clocks + overhead
; Preserved: X, Y
; Time: A+25 clocks (including JSR)  (13+6+6)
;;;;;;;;;;;;;;;;;;;;;;;;
:       sbc #7          ; carry set by CMP
delay_a_25_clocks:
        cmp #7          ;2
        Jcs :-          ;2    do multiples of 7
        ;               ; Cycles          Accumulator            Carry           Zero
        lsr a           ; 0 0 0 0 0 0 0   00 01 02 03 04 05 06   0 0 0 0 0 0 0   ? ? ? ? ? ? ?
        Jcs :+          ; 2 2 2 2 2 2 2   00 00 01 01 02 02 03   0 1 0 1 0 1 0   1 1 0 0 0 0 0
:       Jeq @zero       ; 4 5 4 5 4 5 4   00 00 01 01 02 02 03   0 1 0 1 0 1 0   1 1 0 0 0 0 0
        lsr a           ; : : 6 7 6 7 6   :: :: 01 01 02 02 03   : : 0 1 0 1 0   : : 0 0 0 0 0
        Jeq :+          ; : : 8 9 8 9 8   :: :: 00 00 01 01 01   : : 1 1 0 0 1   : : 1 1 0 0 0
        Jcc :+          ; : : : : A B A   :: :: :: :: 01 01 01   : : : : 0 0 1   : : : : 0 0 0
@zero:  Jne :+          ; 7 8 : : : : C   00 01 :: :: :: :: 01   0 1 : : : : 1   1 1 : : : : 0
:       rts             ; 9 A B C D E F   00 01 00 00 01 01 01   0 1 1 1 0 0 1   1 1 1 1 0 0 0
; ^ (thanks to dclxvi for the algorithm)
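
As a usage sketch based on the timing formulas in the comments above: the cycle counts below include the JSR and RTS but not the instructions that load the registers, and they assume that none of the branches inside the routines cross a page.

Code:

        lda #75                     ; 2 cycles
        jsr delay_a_25_clocks       ; 75+25 = 100 cycles

        ldx #2                      ; 2 cycles
        lda #0                      ; 2 cycles
        jsr delay_256x_a_30_clocks  ; 256*2+0+30 = 542 cycles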

These functions require macros called Jeq, Jcc, Jne etc. (with a capital J). They are branch macros that check for page crossing. You can find their implementation below. They were written by Blargg.

Code:

.macro branch_check opc, dest
    opc dest
    .assert >* = >dest, warning, "branch_check: failed, crosses page"
.endmacro

.macro Jcc dest
        branch_check bcc, dest
.endmacro
.macro Jcs dest
        branch_check bcs, dest
.endmacro  
.macro Jeq dest
        branch_check beq, dest
.endmacro
.macro Jne dest
        branch_check bne, dest
.endmacro
.macro Jmi dest
        branch_check bmi, dest
.endmacro
.macro Jpl dest
        branch_check bpl, dest
.endmacro
.macro Jvc dest
        branch_check bvc, dest
.endmacro  
.macro Jvs dest  
        branch_check bvs, dest
.endmacro
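
The reason this matters for timed code: a taken branch costs 3 cycles, but 4 if its destination is on a different page than the instruction following the branch. A small sketch with a hypothetical loop label, assuming the macros above are defined:

Code:

        ldx #10         ; 2 cycles
@wait:  dex             ; 2 cycles
        Jne @wait       ; 3 cycles when taken, 2 on the final pass
        ; Total: 2 + 9*(2+3) + (2+2) = 51 cycles, provided no page-crossing warning is issued.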

Related to this, there is also a macro that checks whether a table spans two pages, and produces a link-time warning if access to the table would cross a page boundary.

Code:

.macro TableWrapCheck table, last_index, message
        .assert >(table) = >(table+(last_index)), warning, message
.endmacro
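
A usage sketch with a hypothetical 16-entry table. The second argument is the highest index that will be used, so the check covers the addresses colors+0 through colors+15.

Code:

        TableWrapCheck colors, 15, "colors causes page wrap"
        lda colors,x            ; X = 0..15: 4 cycles as long as the check passes

        ; ... elsewhere ...
colors: .res 16                 ; placeholder contents for the hypothetical table

In VariableScanlineDelay the table argument is @lo_ptr_table-1 rather than @lo_ptr_table, because X runs from 1 to @n_cases there and the instruction's base address is the one that determines page crossing.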