Today I came up with a way to upload data to the SPC over twice as fast as the normal loader. I haven't looked at how games do it, but I figure they probably have similarly optimized loaders. I figured I'd post this anyway.
At best, the standard loader transfers one byte every 25 clocks, allowing 40K/sec. Its inner loop looks like this:
Code:
loop: CMP Y,$F4 ; 3
BNE not_ready ; 2
MOV A,$F5 ; 3
MOV $F4,Y ; 4
MOV [$00]+Y,A ; 7
INC Y ; 2
BNE loop ; 4
Since the SPC is the slower CPU, the most efficient arrangement is for the SPC to read data as fast as it can and send a "clock" to the S-CPU. The S-CPU uploads some data, then waits for the next clock before continuing. The SPC doesn't wait for any acknowledgement from the S-CPU, since the S-CPU can easily keep up.
There are four input ports, so it makes sense to transfer four bytes of data between each clock. We want to use the most efficient transfer instruction. I looked it over and there's no gain possible over a plain MOV A,$F4 for loading a byte. For writing, MOV !addr+Y,A is one clock faster than MOV [ptr]+Y,A. We can use self-modifying instructions to update the MSB of the address every 256 bytes.
If we received data in normal order, we'd have to update the index every byte, costing 2 clocks per byte. Instead, we can receive one byte for each quarter of a 256-byte page per iteration, and only update the index once every four bytes (0.5 clocks per byte), saving 1.5 clocks per byte. This means we receive data like this for each page:
Code:
$F4: $00-$3F
$F5: $40-$7F
$F6: $80-$BF
$F7: $C0-$FF
It's more efficient to decrement the index, so each quarter is received in reverse order as well. Here's the inner loop:
Code:
MOV Y,#$3F
quad: MOV A,$F4 ; 3
MOV !$0000+Y,A ; 6
MOV A,$F5 ; 3
MOV !$0040+Y,A ; 6
MOV A,$F6 ; 3
MOV !$0080+Y,A ; 6
MOV A,$F7 ; 3
MOV $F7,Y ; 4 tell S-CPU we're ready for more
MOV !$00C0+Y,A ; 6
DEC Y ; 2
BPL quad ; 4
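To make the ordering concrete, here's a small Python model of which page offsets get filled on each pass of the inner loop. It's just a sketch for checking the layout, not code that runs on either CPU; the names `QUARTER_BASE` and `page_transfer_order` are made up for illustration.

```python
# Hypothetical model of the transfer order for one 256-byte page.
# Each port ($F4..$F7) fills one quarter of the page.
QUARTER_BASE = (0x00, 0x40, 0x80, 0xC0)

def page_transfer_order():
    """Return one tuple of four page offsets per loop iteration."""
    order = []
    for y in range(0x3F, -1, -1):  # Y counts down from $3F to $00
        order.append(tuple(base + y for base in QUARTER_BASE))
    return order

order = page_transfer_order()
print([hex(o) for o in order[0]])   # offsets written on the first pass
print([hex(o) for o in order[-1]])  # offsets written on the last pass
```

So the first four bytes the S-CPU sends land at $3F, $7F, $BF, $FF of the page, and the last four land at $00, $40, $80, $C0. Every offset is hit exactly once over the 64 passes.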
Each iteration takes 46 clocks, so that works out to 11.5 clocks per byte, allowing about 87K/sec.
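If you want to check the arithmetic, here's a quick Python sketch that sums the cycle counts from the two listings. It assumes the nominal 1.024 MHz SPC700 clock and takes "K" to mean 1024 bytes; both are my assumptions, and it only counts the inner loops, ignoring the small per-page overhead.

```python
SPC_CLOCK_HZ = 1_024_000  # assumed nominal SPC700 clock rate

# Cycle counts copied from the comments in the two inner loops.
standard_loop = [3, 2, 3, 4, 7, 2, 4]          # 1 byte per pass
fast_loop = [3, 6, 3, 6, 3, 6, 3, 4, 6, 2, 4]  # 4 bytes per pass

def kib_per_second(cycles, bytes_per_pass):
    """Throughput in KiB/sec for a loop moving bytes_per_pass per iteration."""
    return SPC_CLOCK_HZ * bytes_per_pass / sum(cycles) / 1024

standard = kib_per_second(standard_loop, 1)
fast = kib_per_second(fast_loop, 4)

print(sum(standard_loop), round(standard, 1))  # 25 40.0
print(sum(fast_loop), round(fast, 1))          # 46 87.0
print(round(fast / standard, 2))               # 2.17
```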
Here's the entire loop, including page handling:
Code:
mov x,#page_count
page:
; Transfer four-byte chunks
mov y,#$3F
quad: mov a,$F4
mov0: mov !$0000+y,a
mov a,$F5
mov1: mov !$0040+y,a
mov a,$F6
mov2: mov !$0080+y,a
mov a,$F7
mov $F7,y ; tell S-CPU we're ready for more
mov3: mov !$00C0+y,a
dec y
bpl quad
; Increment MSBs of addresses
inc mov0+2
inc mov1+2
inc mov2+2
inc mov3+2
dec x
bne page
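Here's a rough Python model of the whole loop, mainly to show what the four `inc movN+2` lines do: they bump the high byte of each store's address operand, so all four quarters move to the next page. The sender function is only my guess at the order the S-CPU would have to emit bytes in, the destination `0x0200` is an arbitrary example, and the handshake through `$F7` isn't modeled at all.

```python
QUARTERS = (0x00, 0x40, 0x80, 0xC0)

def sender_groups(data):
    """Yield 4-byte groups in the order the loop expects (hypothetical sender)."""
    assert len(data) % 256 == 0, "this model assumes whole pages"
    for page in range(0, len(data), 256):
        for y in range(0x3F, -1, -1):
            yield tuple(data[page + q + y] for q in QUARTERS)

def receive(ram, dest, groups, page_count):
    """Model of the SPC loop; `bases` stands in for the four store operands."""
    bases = [dest + q for q in QUARTERS]
    groups = iter(groups)
    for _ in range(page_count):
        for y in range(0x3F, -1, -1):
            # One pass of the inner loop: four port reads, four stores.
            for base, value in zip(bases, next(groups)):
                ram[base + y] = value
        # The four `inc movN+2`: add $100 to each operand (wraps at 64K).
        bases = [(b + 0x100) & 0xFFFF for b in bases]

data = bytes((i * 7 + 3) & 0xFF for i in range(3 * 256))  # 3 pages of test data
ram = bytearray(0x10000)
receive(ram, 0x0200, sender_groups(data), page_count=3)
```

After this runs, `ram[0x0200:0x0500]` holds `data` in normal order, even though each page arrived interleaved and backwards.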
The S-CPU side sending code looks similar, and the interleaved page transfer doesn't complicate it much. Using this in an SPC player allows uploading in about 3/4 of a second. I can post a complete easily-usable routine if anyone's interested.
EDIT: commented where S-CPU is notified that SPC is ready for more data.