Everyone knows that the PPU offers a special address auto-increment mode of 32 to make it easier to draw columns of tiles to the name tables, but unfortunately it doesn't offer the option of incrementing the address by 8, which would be really handy for updating attributes.
One common way to avoid having to set the address before writing each attribute byte is to make use of the increment 32 mode anyway and only set the address for every 2 attribute bytes, which must be written non-sequentially (bytes 0 and 4, 1 and 5, 2 and 6, 3 and 7). This works really well when the attribute updates do not cross a screen boundary, which is indeed the case in many scrolling engines.
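For reference, a minimal sketch of that paired write, assuming the attribute table of the first name table ($23C0) and an update that stays inside one screen. AttrBuffer (the 8 attribute bytes of one column, top to bottom) and PpuCtrlShadow (a copy of the last value written to $2000) are made-up names for illustration. After the first $2007 write the address advances by 32, which is 4 attribute rows of 8 bytes each, hence the 0/4, 1/5, 2/6, 3/7 pairing.

```asm
; Sketch only. AttrBuffer and PpuCtrlShadow are hypothetical names.
; X = attribute column (0-7); runs in vblank with the write toggle clear.
    lda PpuCtrlShadow
    ora #%00000100      ; bit 2 set: each $2007 access adds 32 to the address
    sta $2000
    txa
    ora #$c0            ; low byte of $23C0 + column (row 0)
    tay                 ; keep it around for the following pairs
    lda #$23
    sta $2006           ; high byte
    sty $2006           ; low byte
    lda AttrBuffer+0
    sta $2007           ; row 0; address then advances by 32 = 4 rows
    lda AttrBuffer+4
    sta $2007           ; row 4
    ; next pair (rows 1 and 5): same column, one attribute row further down
    lda #$23
    sta $2006
    tya
    clc
    adc #$08            ; 8 attribute bytes per row
    tay
    sta $2006
    lda AttrBuffer+1
    sta $2007           ; row 1
    lda AttrBuffer+5
    sta $2007           ; row 5
    ;(...) likewise for rows 2/6 and 3/7
```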
In my case, however, I'm using a 4-screen name table layout, and column updates are always split between 2 screens, so there isn't much to gain from the increment 32 mode. Only some bytes can be written in pairs, depending on how the column is aligned with the screens, and handling all the different cases would probably negate the benefits in the end.
So, seeing as I'm stuck with setting the VRAM address for each byte, I started to think of ways to do this as fast as possible. The high byte of the address never changes within the same screen (since each attribute table is only 64 bytes), so it would seem logical to keep it permanently loaded in a register so I could simply write it to $2006 over and over without having to reload it. The low byte could be kept in the accumulator, so I could quickly jump to the next line with ADC #$08, which would leave an index register free for the data itself. I'd rather not load the data from a fixed memory position (since the space for VRAM updates is allocated dynamically), and using the stack is out of the question because it kills the ADC #$08 way of incrementing the address.
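A minimal sketch of the register allocation described above, to make the trade-off concrete. FixedBuffer is a made-up name for a fixed RAM location; with X and A both tied up by the address, Y is the only register left for the data, so the data can't be fetched through an index register.

```asm
; Sketch only. FixedBuffer is a hypothetical fixed RAM location.
; X = high byte of the attribute table address (e.g. $23)
; A = low byte of the first attribute byte
; Carry must be clear on entry; it stays clear because the low byte never
; goes past $FF while the address is inside the attribute table.
    stx $2006           ; high byte, straight from X
    sta $2006           ; low byte, straight from A
    ldy FixedBuffer+0   ; data has to come from a fixed address
    sty $2007
    adc #$08            ; next attribute row
    stx $2006
    sta $2006
    ldy FixedBuffer+1
    sty $2007
    ;(...)
```

Pulling the data off the stack instead would mean PLA, which overwrites the low byte held in A.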
Just as I was about to give up on this solution, I remembered that $2006 and $2005 share the toggle that selects between 1st and 2nd write, and thought that maybe I could use that to my advantage. As it turns out, the first $2005 write only affects bits in the low byte of the VRAM address (the coarse X scroll, plus the fine X scroll, which isn't part of the address), and the second $2006 write overwrites the entire low byte. So I don't really need to keep the high byte of the address loaded anywhere: I can simply write junk data to $2005 and then write the actual low byte of the address, which the accumulator holds, to $2006. The high byte left over from the previous full address write is reused. This means that both X and Y are free for me to load the data from a dynamic location. The code would then look something like this:
Code:
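;NOTE: X points to the buffer position that contains the data for this update
;write the first byte (24 cycles)
lda UpdateBuffer+0, x ;high byte of the attribute address
sta $2006
lda UpdateBuffer+1, x ;low byte of the attribute address (stays in A)
sta $2006
ldy UpdateBuffer+2, x ;attribute data
sty $2007
;write the remaining bytes (18 cycles per byte)
sta $2005 ;<- junk first write; could just as well be stx or sty too, it doesn't matter!
adc #$08 ;next attribute row (assumes carry is clear)
sta $2006 ;second write: new low byte, the high byte is kept
ldy UpdateBuffer+3, x
sty $2007
;(...)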
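The code above relies on two things: the $2005/$2006 write toggle being clear on entry, and the carry flag being clear before the first ADC #$08. Also, the junk $2005 writes change the fine X scroll and the $2006 writes change the PPU's temporary address, so the scroll has to be set again before rendering starts (as after any $2006 use). A minimal sketch of one way to frame the routine, where PpuCtrlShadow, ScrollX and ScrollY are hypothetical variables:

```asm
; Sketch only. PpuCtrlShadow, ScrollX and ScrollY are hypothetical names.
    bit $2002           ; reading PPU status clears the $2005/$2006 write toggle
    clc                 ; so that each ADC #$08 adds exactly 8
    ;(... attribute updates go here; they leave the toggle clear again ...)
    lda PpuCtrlShadow   ; name table select bits plus the usual settings
    sta $2000
    lda ScrollX
    sta $2005           ; first write: X scroll
    lda ScrollY
    sta $2005           ; second write: Y scroll
```

Since the address is set again before every $2007 write, the increment mode selected in $2000 makes no difference to the updates themselves.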
Of course, since the update is split across 2 screens, there'll be a "first byte" in each screen and a total of 7 "remaining bytes", so the final cycle count for this is 2 × 24 + 7 × 18 = 174 (significantly better than the 200-something I had with my stack method). That disregards the logic necessary to handle the two variable-length updates (which in the worst case can be eliminated by having unrolled routines for all 8 possible alignments).
Now, I'm writing this post for 2 reasons. First, I want to run the idea by you guys to make sure it's solid and that I didn't overlook anything. Second, I want to share the idea in case it may be useful to anyone, since combined $2005/$2006 writes are not something we usually discuss for purposes other than mid-screen scroll changes. So, what do you guys think? Is it safe to update the VRAM address in this manner? Am I forgetting anything?