Split-screen horizontal scrolling is easy: just write two values to $2005, the first being the X scroll offset, at any time during the scanline.
But it turns out that vertical scrolling is not that easy, especially if you want to do it without any mapper support.
In a recent project of mine I did it anyway. It has been tested on MMC1, VRC6 and MMC3 (in emulators only).
You can see a screenshot of that here:
Here is the source code. It is in the form of two macros.
The first one prepares the values for a PPU write sequence that sets the scrolling offset to a particular value.
The second one commits the write. It must be performed within a specific time window during h-blank, or you will get artifacts.
Code:
.macro PPU_WriteOffset
; A = NTA
; X = X-Offset
; Y = Y-offset
asl
asl ; For first write to $2006: NTA*4
sty $01
tay
;sta $2006 ; t: yyy NN YYYYY XXXXX xxx
; ; *54 32 10--- ----- --- <- These are affected. * = set zero.
; ; NN <- UNIQUE DATA
txa
lsr
lsr
lsr
sta $00
lda $01
asl
asl
and #$E0
ora $00 ; Second write to $2006: X/8 + 32*Y/8
;
; Cycle cost: 2+2+3+2+ 2*4 + 3+3+2+2+2+3 = 32
.endmacro
.macro PPU_WriteOffset_Finish
sty $2006
ldy $01
sty $2005 ; t: yyy NN YYYYY XXXXX xxx
; 210 -- 76543 ----- --- <- These are affected.
; ; yyy YY <- UNIQUE DATA
;
; At THIS point, we must be in h-blank
;
stx $2005 ; t: yyy NN YYYYY XXXXX xxx
; --- -- ----- 76543 210 <- These are affected.
; ; xxx <- UNIQUE DATA
;
sta $2006 ; t: yyy NN YYYYY XXXXX xxx
; --- -- --765 43210 --- <- These are affected.
;
; This last write is the only one that updates v = t!
;
; Cycle cost: 4*4+3 = 19
.endmacro
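As a rough usage sketch (the scroll values and the placement of the delay are assumptions, not taken from the project above): load A, X and Y, run the first macro early, then run the second one so that its last three writes land in h-blank. The first macro uses $00 and $01 as scratch, and A, X, Y and $01 must survive until the second macro has run. The sequence also assumes that the PPU write toggle is in its first-write state, which a read of $2002 guarantees.

```asm
    ; Hypothetical example: from the next scanline on, show nametable 0
    ; starting at pixel column 0, pixel row 64.
    bit $2002               ; reset the $2005/$2006 write toggle
    lda #0                  ; A = nametable (0-3)
    ldx #0                  ; X = X offset (0-255)
    ldy #64                 ; Y = Y offset (0-239)
    PPU_WriteOffset         ; 32 cycles; leaves its results in A, X, Y and $01
    ; ... delay here so that the writes below hit h-blank.
    ;     The exact amount depends on your program, and the delay
    ;     code must preserve A, X, Y and $01 ...
    PPU_WriteOffset_Finish  ; 19 cycles; four PPU register writes
```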
Related to this, here is a macro that delays execution for a given number of scanlines, plus or minus a configurable number of cycles. All arguments must be compile-time constants. The macro depends on a compile-time constant called "PAL", which must be 0 when building for an NTSC system and 1 when building for a PAL system.
Code:
.macro ScanlineDelay already_done, do_scanlines, delta
.if PAL=0
compensate_dmc_delay (already_done), ((do_scanlines)*341 /3 + (delta))
.else
compensate_dmc_delay (already_done), ((do_scanlines)*341*5/16 + (delta))
.endif
.endmacro
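To make the arithmetic concrete: a scanline is 341 PPU dots, and the macro converts that to CPU cycles by dividing by 3 on NTSC and by 16/5 on PAL, using the assembler's integer division. A few hypothetical invocations follow. Here already_done appears to be the number of cycles the surrounding code has already spent; that reading is an assumption based on how the macro is used in this post, so check it against compensate_dmc_delay.

```asm
PAL = 0                         ; assumption: building for NTSC

    ; Requests 3*341/3 = 341 CPU cycles, which is exactly 3 scanlines.
    ScanlineDelay 0, 3, 0

    ; Requests 8*341/3 = 909 CPU cycles (909.33 rounded down).
    ScanlineDelay 0, 8, 0

    ; Same, but 12 cycles shorter: 909 - 12 = 897.
    ScanlineDelay 0, 8, -12
```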
Here is a version that does not need the number of scanlines to be a compile-time constant. (It still requires the compile-time PAL constant.)
Code:
.pushseg
.segment "FUNC_SCANLINEDELAY"
; A = number of scanlines to delay. Must be >= 1.
; Uses $00,$01,$F8 as temp.
VariableScanlineDelay:
@nearby_rts_14cyc = $C450 ; lda #1 + rts
@nearby_rts = rts12
; Some zeropage variable.
@remain = $F8
sta @remain ;3
.if PAL=0
@n_cases = 3
.else
@n_cases = 16
.endif
@loop:
lda #@n_cases ;2
cmp @remain ;3
Jcc @thatmuch ;3
; n_cases >= remain
; remain <= n_cases
lda @remain ;-1+3
tax ;2
lda #0 ;2
sta @remain ;3
; X = remain, remain = 0
jmp @jump ;3 -- 20 so far
@thatmuch:
; n_cases < remain
; remain > n_cases
tax ;2
lda @remain ;3
sec ;2
sbc #@n_cases ;2
sta @remain ;3 -- 20 so far
; X = n_cases, remain -= ncases
@jump:
TableWrapCheck (@lo_ptr_table-1), @n_cases, "@lo_ptr_table causes page wrap"
TableWrapCheck (@hi_ptr_table-1), @n_cases, "@hi_ptr_table causes page wrap"
lda @lo_ptr_table-1,x ;4 assuming no wrap
sta $00 ;3
lda @hi_ptr_table-1,x ;4 assuming no wrap
sta $01 ;3
jmp ($00) ;5 -- 39 so far
@continue:
; ; -- 42 so far (JMP)
lda @remain ;3
Jne @loop ;3
; ;-1
; Overhead so far: 6+3-1 = 8 cycles (including JSR).
; Add 6 for RTS = 14 cycles.
rts
.segment "DATA_SCANLINEDELAY_POINTERS"
.byte 0 ; We don't need a delay0 pointer.
@lo_ptr_table:
.repeat @n_cases, I
.byte <.ident (.sprintf("@delay%d", I+1))
.endrepeat
@hi_ptr_table:
.repeat @n_cases, I
.byte >.ident (.sprintf("@delay%d", I+1))
.endrepeat
.repeat @n_cases, I
.segment .sprintf("DELAY%d", I+1)
.ident (.sprintf("@delay%d", I+1)):
ScanlineDelay 48, (I+1), 0
jmp @continue
.endrepeat
.popseg
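Calling it could look like the following sketch. The scanline count of 10 is arbitrary. Going by the code above, the routine overwrites A, X, $00, $01 and $F8; what else is clobbered depends on which delay_n variant ScanlineDelay ends up using.

```asm
    lda #10                     ; number of scanlines to wait, must be >= 1
    jsr VariableScanlineDelay   ; returns roughly 10 scanlines later
```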
It will produce a number of segments, which you will have to include in your linker script.
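For ld65, the corresponding entries might look like the sketch below. The memory area name PRG is a placeholder for whatever your configuration defines, and an NTSC build is assumed, which generates DELAY1 to DELAY3; a PAL build generates DELAY1 to DELAY16.

```cfg
SEGMENTS {
    # PRG is a hypothetical memory area name; use your own.
    FUNC_SCANLINEDELAY:          load = PRG, type = ro;
    DATA_SCANLINEDELAY_POINTERS: load = PRG, type = ro;
    DELAY1:                      load = PRG, type = ro;
    DELAY2:                      load = PRG, type = ro;
    DELAY3:                      load = PRG, type = ro;
}
```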
Both of these depend on the compensate_dmc_delay macro, which produces code that sleeps for the given number of cycles regardless of whether a DMC sample is playing. The detection of DMC playback must be customized for your program.
http://bisqwit.iki.fi/src/6502-inline_dmc_delay_compensation.inc
That macro, in turn, depends on a delay_n macro, which sleeps for an exact number of cycles. It comes in a number of files, which differ in the registers they preserve:
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepy.inc This preserves Y, but clobbers A, X, C, Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepaxyc.inc This preserves A, X, Y and C, but clobbers Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepaxyczndv.inc This preserves A, X, Y, C, Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepy-a25.inc This preserves just Y, and requires the presence of a function called delay_a_25_clocks. You can find the implementation below.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keep-ax33-xa30.inc This clobbers all registers besides S and I, and requires the presence of functions called delay_256a_x_33_clocks and delay_256x_a_30_clocks.
The delay code is optimized for size, and is surprisingly good. Most values produce 5-7 bytes of code. The versions which preserve more registers produce larger code.
Delays of up to 5000 cycles are individually optimized for size; longer delays are built recursively by halving the delay.
None of the macros leave the contents of memory changed. Many of them read RAM locations, and some perform two successive RAM writes, where the first changes a value and the second changes it back.
You can get a version for your own set of preserved registers by editing the filename in the URL, following the established pattern. In the future, the macro might receive the set of registers as a parameter.
You can find the implementation of the aforementioned three runtime delay functions below. Portions were written by Blargg and dclxvi.
Code:
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A:X clocks+overhead
; Time: 256*A+X+33 clocks (including JSR)
; Written by Joel Yliluoma. Clobbers A. Preserves X,Y. Has relocations.
;;;;;;;;;;;;;;;;;;;;;;;;
: ; do 256 cycles. ; 5 cycles done so far. Loop is 2+1+ 2+3+ 1 = 9 bytes.
sbc #1 ; 2 cycles - Carry was set from cmp
pha ; 3 cycles
lda #(256-25-10-2-4) ; +2
jsr delay_a_25_clocks
pla ; 4 cycles
delay_256a_x_33_clocks:
cmp #1 ; +2; 2 cycles overhead
bcs :- ; +2; 4 cycles overhead
; 0-255 cycles remain, overhead = 4
txa ; +2; 6; +27 = 33
; 15 + JSR + RTS overhead for the code below. JSR=6, RTS=6. 15+12=27
;          ;    Cycles        Accumulator     Carry flag
;          ; 0  1  2  3  4      (hex)         0 1 2 3 4
sec        ; 0  0  0  0  0   00 01 02 03 04   1 1 1 1 1
: sbc #5   ; 2  2  2  2  2   FB FC FD FE FF   0 0 0 0 0
Jcs :-     ; 4  4  4  4  4   FB FC FD FE FF   0 0 0 0 0
lsr a      ; 6  6  6  6  6   7D 7E 7E 7F 7F   1 0 1 0 1
Jcc :+     ; 8  8  8  8  8   7D 7E 7E 7F 7F   1 0 1 0 1
: sbc #$7E ;10 11 10 11 10   FF FF 00 00 01   0 0 1 1 1
Jcc :+     ;12 13 12 13 12   FF FF 00 00 01   0 0 1 1 1
Jeq :+     ;      14 15 14         00 00 01       1 1 1
Jne :+     ;            16               01           1
: rts      ;15 16 17 18 19   (thanks to dclxvi for the algorithm)
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays X:A clocks+overhead
; Time: 256*X+A+30 clocks (including JSR)
; Written by Joel Yliluoma. Clobbers A,X. Preserves Y. Has relocations.
;;;;;;;;;;;;;;;;;;;;;;;;
delay_256x_a_30_clocks:
cpx #0 ; +2
Jeq delay_a_25_clocks ; +3 (25+5 = 30 cycles overhead)
; do 256 cycles. ; 4 cycles so far. Loop is 1+1+ 2+3+ 1+3 = 11 bytes.
dex ; 2 cycles
pha ; 3 cycles
lda #(256-25-9-2-7) ; +2
jsr delay_a_25_clocks
pla ; 4
jmp delay_256x_a_30_clocks ; 3.
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A clocks + overhead
; Preserved: X, Y
; Time: A+25 clocks (including JSR) (13+6+6)
;;;;;;;;;;;;;;;;;;;;;;;;
: sbc #7 ; carry set by CMP
delay_a_25_clocks:
cmp #7 ;2
Jcs :- ;2 do multiples of 7
; ; Cycles Accumulator Carry Zero
lsr a ; 0 0 0 0 0 0 0 00 01 02 03 04 05 06 0 0 0 0 0 0 0 ? ? ? ? ? ? ?
Jcs :+ ; 2 2 2 2 2 2 2 00 00 01 01 02 02 03 0 1 0 1 0 1 0 1 1 0 0 0 0 0
: Jeq @zero ; 4 5 4 5 4 5 4 00 00 01 01 02 02 03 0 1 0 1 0 1 0 1 1 0 0 0 0 0
lsr a ; : : 6 7 6 7 6 :: :: 01 01 02 02 03 : : 0 1 0 1 0 : : 0 0 0 0 0
Jeq :+ ; : : 8 9 8 9 8 :: :: 00 00 01 01 01 : : 1 1 0 0 1 : : 1 1 0 0 0
Jcc :+ ; : : : : A B A :: :: :: :: 01 01 01 : : : : 0 0 1 : : : : 0 0 0
@zero: Jne :+ ; 7 8 : : : : C 00 01 :: :: :: :: 01 0 1 : : : : 1 1 1 : : : : 0
: rts ; 9 A B C D E F 00 01 00 00 01 01 01 0 1 1 1 0 0 1 1 1 1 1 0 0 0
; ^ (thanks to dclxvi for the algorithm)
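A couple of hypothetical calls, with the cycle totals worked out from the timing formulas in the header comments (the argument values are arbitrary):

```asm
    lda #100
    jsr delay_a_25_clocks       ; 100 + 25 = 125 cycles, JSR and RTS included

    ldx #3                      ; high byte of the delay
    lda #20                     ; low byte of the delay
    jsr delay_256x_a_30_clocks  ; 256*3 + 20 + 30 = 818 cycles
```

These totals hold only if none of the branches inside the routines cross a page boundary, which is what the capital-J branch macros check.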
These functions require macros called Jeq, Jcc, Jne and so on (with a capital J). They are branch macros that check for page crossing. You can find their implementation below. They were written by Blargg.
Code:
.macro branch_check opc, dest
opc dest
.assert >* = >dest, warning, "branch_check: failed, crosses page"
.endmacro
.macro Jcc dest
branch_check bcc, dest
.endmacro
.macro Jcs dest
branch_check bcs, dest
.endmacro
.macro Jeq dest
branch_check beq, dest
.endmacro
.macro Jne dest
branch_check bne, dest
.endmacro
.macro Jmi dest
branch_check bmi, dest
.endmacro
.macro Jpl dest
branch_check bpl, dest
.endmacro
.macro Jvc dest
branch_check bvc, dest
.endmacro
.macro Jvs dest
branch_check bvs, dest
.endmacro
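As a small illustration (the label and the loop count are made up), a counted loop written with Jne assembles to the same bytes as one written with bne. The difference is that it produces a warning if the instruction following the branch and the branch target end up on different pages, in which case a taken branch would cost 4 cycles instead of 3:

```asm
    ldx #10         ; 2 cycles
@wait:
    dex             ; 2 cycles per iteration
    Jne @wait       ; 3 cycles when taken, 2 on the final pass
    ; Total: 2 + 10*2 + 9*3 + 2 = 51 cycles, provided no page is crossed.
```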
Related to this, there is also a macro that checks whether a table spans two pages, and produces a link-time warning if indexing into the table would cross a page boundary.
Code:
.macro TableWrapCheck table, last_index, message
.assert >(table) = >(table+(last_index)), warning, message
.endmacro
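A hypothetical use with a 16-entry table (the table name and contents are invented for the example). The second argument is the highest index that will be used, so the check compares the pages of the first and last entries:

```asm
my_table:
    .repeat 16, I
    .byte I * 3
    .endrepeat
    TableWrapCheck my_table, 15, "my_table crosses a page boundary"

    ; Elsewhere, with X in 0..15:
    lda my_table,x  ; 4 cycles; a page crossing would make it 5
```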