In another thread I started talking about a way to do VRAM updates that is completely stack-based, and can be used for the entire game. I decided to start this new thread mainly to share what I have designed so far, possibly helping people in search of a VRAM update system, but also to get some input from you guys and maybe make this better.
The first thing I have is this piece of code in the NMI handler, after the OAM DMA, that takes care of running all the updates:
Code:
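    ;swap stack pointers
    tsx
    stx RealSP      ;save the real stack pointer
    ldx FakeSP      ;point the stack at the update data
    txs
UpdateVRAM:
    ;set the output address and jump to the byte copy code
    pla
    sta $2006       ;high byte of the VRAM address
    pla
    sta $2006       ;low byte of the VRAM address
    rts             ;continue at the address stored next in the update data, plus 1
RestoreSP:
    ;restore the stack pointer
    ldx RealSP
    txs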
For each VRAM update, an output VRAM address is pulled from the stack and written to $2006, and the RTS jumps to some point in the middle of the following unrolled loop, depending on the number of bytes that need to be copied:
Code:
Copy32Bytes:
    pla
    sta $2007
Copy31Bytes:
    pla
    sta $2007
(...)
Copy2Bytes:
    pla
    sta $2007
Copy1Byte:
    pla
    sta $2007
CopyNothing:
    jmp UpdateVRAM
This copies the requested number of bytes and JMPs back to the update loop, so the next update can be processed. I decided that 32 bytes is the ideal length for this unrolled loop because it allows you to update the whole palette or a whole row of name table entries in one go. You'll most likely need more than that only for pattern updates, but there's nothing wrong with breaking those up into pairs of tiles (32 bytes = 2 tiles).
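Since every step of the unrolled loop is one PLA (1 byte) plus one STA $2007 (3 bytes), the entry point for a given byte count can also be computed instead of looked up. Here is a minimal sketch of that idea. GetCopyTarget, Temp, TargetLo and TargetHi are names I made up for illustration, and it assumes nothing else is assembled between the CopyNBytes labels:

```asm
; Hypothetical helper. Temp, TargetLo and TargetHi are assumed to be
; zero page variables; they are not part of the code above.
; In:  A = number of bytes to copy, 0 to 32.
; Out: TargetLo/TargetHi = address (minus 1) to place in the update data.
GetCopyTarget:
    asl a
    asl a                   ; A = count * 4, at most 128
    sta Temp
    lda #<(CopyNothing-1)
    sec
    sbc Temp                ; low byte of (CopyNothing - 1) - count * 4
    sta TargetLo
    lda #>(CopyNothing-1)
    sbc #0                  ; carry the borrow into the high byte
    sta TargetHi
    rts
```

With a count of 1 this gives Copy1Byte - 1, and with a count of 32 it gives Copy32Bytes - 1. A look-up table does the same job and may be faster, at the cost of 66 bytes of ROM.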
When does the update loop end then, if the program keeps JMPing back to UpdateVRAM? Well, in order to process the updates as fast as possible, I decided not to check explicitly for the end of the updates, but instead to feed the RTS of the last iteration of the update loop with the address of RestoreSP. Yes, this means that the last address written to $2006 is a bogus one (if it makes you feel better, you can write $0000, like some people do anyway), but it takes 16 cycles to set that address (two PLA + STA pairs at 4 cycles per instruction), which is equivalent to the 8 non-taken branches (2 cycles each) that would otherwise be needed inside the loop to check for a flag that breaks out of it. I'm optimizing for the worst case here, so whenever there are more than 8 updates, it's cheaper to set the address unnecessarily than to check for a flag on every update.
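For comparison, this is roughly what the loop would look like with an explicit end check. It is only a sketch of the alternative I decided against. UpdateVRAMChecked is a made-up label, and it assumes the end marker is a byte with bit 7 set, which a real VRAM address high byte ($00-$3F) never has:

```asm
; Hypothetical variant with an end marker, shown only to count cycles.
; Assumes RestoreSP is within branch range and on the same page.
UpdateVRAMChecked:
    pla             ; 4 cycles, also sets the N flag from bit 7
    bmi RestoreSP   ; 2 cycles when not taken, 3 when taken
    sta $2006       ; 4 cycles
    pla             ; 4 cycles
    sta $2006       ; 4 cycles
    rts             ; 6 cycles
```

By my count, every real update pays 2 extra cycles for the BMI, while the end marker costs 7 cycles (PLA + taken BMI) instead of the 22 cycles (two PLA + STA pairs and the RTS) that the bogus address costs in the unchecked loop. That is a one-time saving of 15 cycles against 2 cycles per update, so the unchecked loop comes out ahead from about 8 updates per frame, which is in line with the estimate above.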
What about PPU address increments? Same thing. Whenever the increment mode has to be changed, the RTS is fed with the address of a routine that changes the increment mode, and that routine RTSes again to reach the actual byte copy code.
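The routines that change the increment mode could look something like this. This is a sketch, not necessarily what ends up in the final code. SetIncrement32, SetIncrement1, WriteCtrl and PPUCtrlShadow are hypothetical names, and PPUCtrlShadow is assumed to be a RAM copy of the last value written to $2000:

```asm
; Hypothetical routines, reached through RTS from UpdateVRAM.
SetIncrement32:
    lda PPUCtrlShadow
    ora #%00000100      ; bit 2 set: address advances by 32 per $2007 access
    bne WriteCtrl       ; always taken, the result is never zero
SetIncrement1:
    lda PPUCtrlShadow
    and #%11111011      ; bit 2 clear: address advances by 1 per $2007 access
WriteCtrl:
    sta PPUCtrlShadow
    sta $2000
    rts                 ; pulls the address of the byte copy code next
```

Because these end in RTS, the two bytes that follow the routine's address in the update data must be the address (minus 1) of the byte copy entry point.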
All decisions so far were made to make the most out of the vblank time, which means there's a lot of work to do before vblank starts to get the stack formatted correctly. Personally, I'd rather use indexed addressing to fill the update stack, to avoid having to manipulate the stack pointer mid-frame and to avoid having to write all the data backwards. If you want to build the update stack using PHA, you'll have to do the next steps in the opposite order.
For each update you need to "schedule", first check whether the current PPU address increment mode is the one you need. Then write the entry in the order the NMI code pulls it: the output VRAM address (high byte first, as $2006 expects); if the increment mode has to change, the address (minus 1, since it will be called with an RTS) of the routine that changes the increment mode; the address (minus 1) of the entry point in the unrolled loop that copies the right number of bytes; and finally the data.
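To make the layout concrete, here is a minimal sketch that schedules a single byte using indexed addressing. The names ScheduleOneByte, UpdateBuffer, BufferIndex, VRAMAddrHi, VRAMAddrLo and DataByte are all made up for this example. It assumes the update data starts at UpdateBuffer in page 1 and that FakeSP holds the low byte of UpdateBuffer - 1, because PLA increments the stack pointer before reading:

```asm
; Hypothetical routine. No overflow check, and the increment mode is
; assumed to be correct already.
ScheduleOneByte:
    ldx BufferIndex             ; offset of the next free byte
    lda VRAMAddrHi
    sta UpdateBuffer+0,x        ; pulled first, goes to $2006 (high byte)
    lda VRAMAddrLo
    sta UpdateBuffer+1,x        ; pulled second, goes to $2006 (low byte)
    lda #<(Copy1Byte-1)
    sta UpdateBuffer+2,x        ; RTS pulls the low byte of its target first
    lda #>(Copy1Byte-1)
    sta UpdateBuffer+3,x        ; then the high byte
    lda DataByte
    sta UpdateBuffer+4,x        ; the byte that PLA + STA $2007 will copy
    txa
    clc
    adc #5                      ; this entry used 5 bytes
    sta BufferIndex
    rts
```

If the increment mode had to change, the address (minus 1) of the routine that changes it would go between the VRAM address and the Copy1Byte address, in the same low byte, high byte order.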
Once all updates have been written, you need a bogus VRAM address, which can be anything really, but you can use $0000 if that makes you more comfortable. Then you need the address of RestoreSP - 1, because that's what will allow the update loop to end. Even if there are no updates at all, you still need these last 4 bytes so the program doesn't crash.
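Closing the list could be sketched like this, using the same made-up names as before (UpdateBuffer and BufferIndex are hypothetical, as is the FinishUpdates label). Resetting BufferIndex for the next frame is left out:

```asm
; Hypothetical routine: writes the 4 bytes that end the update loop.
FinishUpdates:
    ldx BufferIndex
    lda #$00
    sta UpdateBuffer+0,x        ; bogus VRAM address, high byte
    sta UpdateBuffer+1,x        ; bogus VRAM address, low byte
    lda #<(RestoreSP-1)
    sta UpdateBuffer+2,x        ; RTS target, low byte
    lda #>(RestoreSP-1)
    sta UpdateBuffer+3,x        ; RTS target, high byte
    rts
```

Calling this with BufferIndex at 0 produces the "no updates" case: the NMI code sets the bogus address and goes straight to RestoreSP.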
Another interesting advantage of this method is that you can cheat, and use the RTS to jump to other types of byte copying routines besides the chain of PLA + STA. For example, to update attribute table data, I prefer to copy the bytes directly from my shadow attribute tables, instead of wasting time copying them over to the stack. In cases like this, you can simply use the address of the specialized byte copy code (minus 1) instead of the addresses from the regular look-up table. This allows you to have a constant NMI handler for the whole program without preventing you from doing more specialized updates when you need them, such as switching banks and copying data directly from ROM. As long as you JMP back to UpdateVRAM in the end, anything goes.
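As an illustration of such a specialized routine, this sketch copies one row of attribute bytes (8 bytes) straight from a shadow table. CopyAttributeRow and ShadowAttributes are hypothetical names. ShadowAttributes is assumed to be a 64-byte RAM copy of one attribute table, and the entry in the update stack is assumed to be the VRAM address, then CopyAttributeRow - 1, then a single byte with the offset of the row (0, 8, ... 56):

```asm
; Hypothetical routine. Assumes the increment mode is +1.
CopyAttributeRow:
    pla                         ; the only data byte: offset into the table
    tax
    lda ShadowAttributes+0,x
    sta $2007
    lda ShadowAttributes+1,x
    sta $2007
    lda ShadowAttributes+2,x
    sta $2007
    lda ShadowAttributes+3,x
    sta $2007
    lda ShadowAttributes+4,x
    sta $2007
    lda ShadowAttributes+5,x
    sta $2007
    lda ShadowAttributes+6,x
    sta $2007
    lda ShadowAttributes+7,x
    sta $2007
    jmp UpdateVRAM              ; back to the update loop, like the regular copy
```

Each byte still costs 8 cycles during vblank as long as the indexed loads don't cross a page, the same as PLA + STA, so the gain is outside vblank: the entry takes 5 bytes in the update stack instead of 12, and nothing has to be copied there beforehand.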
Does anyone have any comments or suggestions? The thing that's bothering me the most is the bogus VRAM address at the end, but like I said, in the worst case, it's cheaper to have that than to check a flag for every update.