Has it occurred to anyone that it might be useful to have a page of ROM filled with values 0 through 255, so that you can perform operations between the accumulator and the index registers?
Instructions like ADC, SBC, AND, ORA and EOR all have "Absolute, X" and "Absolute, Y" modes, so if you point to the table with the values and use one of the index registers, the value fetched will be the same as the one in the index register, making it seem like the operation used the values of both registers as operands.
I have used this trick before, but only for a small subset of the numbers. Now that I think of it, having the full table seems very useful, especially for avoiding temporary variables.
I don't know exactly why I started this topic, but I'm sure we can discuss other useful 6502 ASM tricks. I also like very much the one where you push an address (minus 1) to the stack and then use the RTS instruction to jump to that address. This can be useful for implementing jump tables, and I'm using this a lot in my game.
Great idea! It makes up for the 65xx's general lack of operations that combine A and X directly. Here's an asm summary in case not everyone gets it:
Code:
; Original code
stx temp
eor temp ; A = A EOR X
; New solution
eor table,x ; A = A EOR X
table:
.byte $00,$01,$02...$0F
.byte $10,$11,$12...$1F
...
.byte $F0,$F1,$F2...$FF
On the other hand, this only saves 2 clocks and 1 byte, so it'd have to be used more than 256 times or in a time-critical area to pay off.
EDIT: corrected clock count and major inefficiency in original code (tax?!? thanks tepples)
tokumaru wrote:
Has it occurred to anyone that it might be useful to have a page of ROM filled with values 0 through 255, so that you can perform operations between the accumulator and the index registers?
If you have the ROM space for such a table, and it's aligned, it saves a byte and two cycles over the temporary variable way:
Code:
A:
stx $FF ; 2b 3c
ora $FF ; 2b 3c
B:
ora table,x ; 3b 4c
EDIT: thanks tokumaru
Quote:
I also like very much the one where you push an address (minus 1) to the stack and then use the RTS instruction to jump to that address. This can be useful for implementing jump tables, and I'm using this a lot in my game.
It saves about four bytes off the temporary variable way to implement jump tables but is one cycle slower:
Code:
A:
lda hi
pha ; 1b 3c
lda lo
pha ; 1b 3c
rts ; 1b 6c
B:
lda hi
sta $01 ; 2b 3c
lda lo
sta $00 ; 2b 3c
jmp ($0000) ; 3b 5c
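A detail worth spelling out for method A: RTS pulls the low byte first, then the high byte, and adds one to the result before fetching, so hi/lo must hold the target address minus one, with the high byte pushed first. A minimal sketch (Target is a made-up label):
Code:
; Sketch: jump to Target via RTS.
; RTS adds 1 to the pulled address, so push Target-1,
; high byte first so it ends up deeper on the stack.
lda #>(Target-1)
pha
lda #<(Target-1)
pha
rts             ; execution continues at Target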
You guys are right, the savings are not that incredible. But I always feel bad about using temp variables (because the code looks messy), and it's very hard not to on the 6502, which has very few work registers. I liked the illusion of actually having these operations between the accumulator and the index registers.
256 bytes out of a whole game ROM is not such a high price to pay for cleaner code and slightly more speed. And this can be used for other things too, such as mapper writes on boards with bus conflicts.
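For the bus-conflict case, the identity table doubles as a safe write target for bank switching. A sketch (ByteTable and WantedBank are made-up names):
Code:
; Sketch: bus-conflict-safe mapper write via the identity table.
; On boards with bus conflicts, the value the CPU writes must match
; the ROM byte at the same address; ByteTable+X always holds X.
ldx WantedBank    ; made-up RAM variable holding the desired bank
txa
sta ByteTable,x   ; writes X to a ROM byte that already contains X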
And tepples, as far as I know, "ora table,x" takes 4 cycles to execute, not 5, as long as the table is perfectly aligned to a memory page. Or am I wrong?
About the jump tables, yeah, it depends if you're aiming at speed or size.
Quote:
But I always feel bad about using temp variables (because the code looks messy), and it's very hard not to do so with the 6502, that has very few work registers.
Change your idea of messy. The main problem with zero page variables is when two routines try to use one at once. The best way to avoid this is to have temp variables that aren't used across subroutine calls, and aren't used by more than one thread at once (like main code and interrupt handler). But the 6502 has a ton of extended registers: 256 of them. That's why X and Y can't be used directly by arithmetic, only for indexing and counting. Embrace zero page!
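For example, most arithmetic and read-modify-write instructions accept a zero page operand directly, so a zero page variable really can be treated like a register. A small sketch:
Code:
; Zero page bytes as pseudo-registers:
inc count    ; modify a variable in place, no need to go through A
lda total
clc
adc delta    ; zero page ADC: 2 bytes, 3 cycles
sta total
asl mask     ; shift a variable without disturbing A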
Try coding for the Z80/8085/GB-Z80 for a while and you'll appreciate the elegance of the 65xx. Sure, you can do lots of register to register operations, but everything has a layer of bloat on it.
blargg wrote:
The best way to avoid this is to have temp variables that aren't used across subroutine calls, and aren't used by more than one thread at once (like main code and interrupt handler).
OK, but how do you do that and still keep things looking nice? Saying that a byte can only be used by one subroutine is a waste of space, as that byte will probably be unused most of the time. And reusing bytes is not easy when you have many nested subroutines. For routines that need a few bytes of work RAM, you can have a few groups of bytes, each dedicated to a different depth, but then you can't go very deep. And recursion is out of the question. What do you guys do about this?
Quote:
But the 6502 has a ton of extended registers: 256 of them.
Fair enough. I've heard the argument that the 6502 has 256 bytes worth of registers, and I guess this is mostly right.
Quote:
Try coding for the Z80/8085/GB-Z80 for a while and you'll appreciate the elegance of the 65xx. Sure, you can do lots of register to register operations, but everything has a layer of bloat on it.
I've done very little Z80 work, but enjoyed the fact that I could perform some fairly complicated work without having to touch a byte of RAM. And those shadow registers... that feature has to be useful! I know that instructions take more CPU cycles than on a 6502 though, probably even more than equivalent 6502 code using zero page RAM.
By the way, this is a very interesting topic about tricks on the 6502.
I must say I absolutely love the 6502 way of doing things; for me it easily beats the PIC, 8080 and Atmel so far. Maybe there are other CPUs/MCUs I haven't tried yet.
I had never thought of having such a table of constants. I guess it's only useful if you're short on temp variables and/or if speed is very important; I think wasting 256 bytes is significant on the NES (unless you have maybe more than 256 KB of PRG-ROM). It could also work if you know the number is small enough (something like 0-16) and such a table is needed anyway (on a cart with bus conflicts). I have encountered a few temporary variable problems so far. I did a whole game engine with only 4 "Temp" variables and 4 "NMITemp" variables (used separately inside and outside the NMI code, to avoid pushing the Temp variables or some other stupid time-wasting thing in the NMI handler). I ran into a problem when I wrote a routine that, for example, uses Temp1 and Temp2, which calls a routine that uses Temp3 and Temp4, which itself calls a routine that also uses Temp2 (and expects it to be fully available); this is a real pain to debug. Pushing Temp2 before calling the second routine is the way to go (or doing it some other way). Eventually it's better to give explicit names to variables. The best solution would be an assembler that can undefine zero page variables so they can be reused, so that the same address can be used by two pieces of code if the programmer guarantees they will never call each other and that neither expects a particular value to be there when called.
I had also never thought of the push/push/RTS way of doing an indirect jump; I always use jmp (). The main problem is that the pushed address is not the real address, and I never remember how much should be added or subtracted to make it work. However, it becomes interesting if you use it a lot, as the few bytes saved each time may become significant. On the downside, the code looks messier (though that can also add to the geek factor).
tokumaru wrote:
OK, but how do you do that and still keep things looking nice? Saying that a byte can only be used by one subroutine is a waste of space, as that byte will probably be unused most of the time. And reusing bytes is not easy when you have many nested subroutines.
Note my limitation of "that aren't used across subroutine calls". This rules out using them for loop counters, for example (if the loop makes a subroutine call). I admit setting up local variables on the stack is cumbersome and inefficient.
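When a temp really does need to survive a nested call, one common compromise is to save just that byte on the hardware stack around the call. A sketch with made-up names:
Code:
; Sketch: preserving a shared temp across a nested call.
Outer:
    lda Temp1
    pha          ; save our local; Inner may clobber Temp1 freely
    jsr Inner
    pla
    sta Temp1    ; restore it and carry on
    rts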
Quote:
I've done very little Z80 work, but enjoyed the fact that I could perform some fairly complicated work without having to touch a byte of RAM.
What's so bad about touching RAM? You're constantly reading it anyway for the opcodes.
Quote:
And those shadow registers... that feature has to be useful!
I think it's mainly to allow extremely quick interrupt response, where the handler just exchanges registers then processes the interrupt. It doesn't have to save the previous values,
and it can keep values in the shadow registers across interrupt handlings. For normal code, it doesn't seem very useful because it swaps so much.
Quote:
I know that instructions take more CPU cycles than on a 6502 though, probably even more than equivalent 6502 code using zero page RAM.
That's one problem, always paying for those extras even when the 6502's lean register set would be sufficient. My main gripe is the inconsistencies that you constantly run into. I actually like the SPC-700 sound processor in the SNES a bit better than the 6502. It's like a 6502 with fewer limitations on X and Y, and many instructions to really treat direct (zero) page variables as first-class registers. Most arithmetic and move instructions can use a direct page variable just as easily as A.
Bregalad wrote:
The main problem is that the rts adress is not the real adress, and I never know how many it should be added/removed to work.
Use RTI then:
Code:
lda #>addr
pha
lda #<addr
pha
php ; RTI will restore status, so push it now
rti
That's actually a really clever idea about the table thing. I never really thought about it. One thing that I do that I'm very glad I thought about is my NMI routine can do anything whenever it wants:
Code:
nmi:
jmp ($00)
jmp ($02)
jmp ($04)
jmp ($06)
jmp ($08)
jmp ($0A)
jmp ($0C)
jmp ($0E)
jmp ($10)
jmp ($12)
jmp ($14)
jmp ($16)
jmp ($18)
jmp ($1A)
jmp ($1C)
jmp ($1E)
lda #$00
sta $20
rti
Return:
inc $20
inc $20
inc $20
jmp ($20)
There may be a slight delay at the end of every routine, but I think it's worth it. There's nothing I hate more than doing a bunch of comparisons to have the NMI figure out what to do and when. Bytes $00-$21 of zero page are used up: $00-$1F start out containing the high and low parts of the "Return" address, and $20/$21 contain the high/low parts of wherever you are in the NMI routine. I personally am very, very happy with it. I almost considered it a trick, or cheating, when I first thought of it.
My solution to have a bunch of different NMI routines is this:
Code:
NMI:
jmp (NMIAddress)
That is all there is to the actual NMI routine indicated by the vector at the end of the ROM. The label "NMIAddress" points to a zero page location, and depending on where in the game we are, that location points to one of many different NMI routines:
Code:
NMITitleScreen:
(...)
rti
NMIPlayerSelect:
(...)
rti
NMIMainGame:
(...)
rti
NMIEndingSequence:
(...)
rti
I'm defining lots of "program modes" for my game, where each section (title screen, player select, title card screen, main game, bonus stage, etc.) is represented by a program mode that, when initialized, sets the address of the NMI routine it uses. All modes have triggers that enable the transition to other modes.
This way does not waste RAM (only 2 bytes are used to hold the address of the current NMI routine), and there is no speed penalty besides the time taken by the JMP instruction.
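Switching modes then just means rewriting the pointer, something like this (a sketch; the store should happen while NMIs are disabled, or at a point where an NMI can't strike between the two writes):
Code:
; Sketch: selecting the NMI routine for a new program mode.
lda #<NMIMainGame
sta NMIAddress+0
lda #>NMIMainGame
sta NMIAddress+1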
Why not just have the main code wait in a loop until NMI fires and sets a global flag? Then you don't have to worry about taking too long before the next NMI and having it interrupt a previous invocation. Or maybe you are saying you do this, you just also have a settable NMI routine that does things that must be done every frame, even if that frame appears the same as the previous.
blargg wrote:
Why not just have the main code wait in a loop until NMI fires and sets a global flag?
I have other projects that require constant calculation in order to avoid severe slowdown. Waiting for the NMI would be a waste of time when you could already be preparing data for the next frame. Not in my current project though.
Quote:
Or maybe you are saying you do this, you just also have a settable NMI routine that does things that must be done every frame, even if that frame appears the same as the previous.
That is certainly true for the music routine, for example. And since I also enable rendering late in the frame, I need NMIs to always take the same amount of time. I can't ignore a single VBlank, or else the screen will jump.
Blargg, the method you describe is close to the one used in Final Fantasy, where the NMI just returns, doing absolutely nothing. The game is free to call the NMI handler whenever it wants without problems. The only real problem is that it's possible to completely miss an NMI.
Quote:
That is certainly true for the music routine, for example. And since I also enable rendering late in the frame, I need NMIs to always take the same amount of time. I can't ignore a single VBlank, or else the screen will jump.
In theory, Final Fantasy's music would also lag when the game does, but it never seems to lag anyway. You can, however, hear this in Final Fantasy II: when you change rooms, the music doesn't change (like it does in Final Fantasy) and it seriously lags (the game also seems to silence all channels for some reason, so the music stops and restarts a bit late on the next note); this also applies when entering/exiting the menu.
I also remember Zelda and SMB happen to lag, music included. It looks extremely bad.
Final Fantasy III does exactly what tokumaru says: it has a "variable NMI vector", which is slightly better. Instead of wasting a jmp (xxx) instruction, the NMI vector points directly to RAM where a jmp instruction is stored (which takes less time). This instruction is also occasionally changed to an rti to completely ignore NMIs.
Celius: I don't understand the method you described to us at all. It looks interesting, however. Could you try to clarify it a little, please?
By the way, I personally went the way of defining the NMI the standard way (in ROM), and having it do the main graphics update and sound. That way the music never lags, and even if the game lags, the NMI will still update the screen as fast as it can. It's even possible for the main program to synchronize on the graphics update flag (instead of the NMI flag), so if updating lots of graphics at once takes more than a frame, it causes no problem.
This just seems logical, and as long as different parts of the game use the same format of graphics buffers, the same NMI handler can be used for the whole game. That would be suboptimal for really big games, I think (games with lots of unrelated minigames, each with its own independent use of the screen, or an RPG where battle/field/menus could be separated because each manages the screen completely differently).
Bregalad wrote:
Celius: I don't understand the method you described to us at all. It looks interesting, however. Could you try to clarify it a little, please?
Oh, sorry about that. Let's pretend that there was an Indirect Absolute JSR instruction:
Code:
nmi:
jsr ($00)
jsr ($02)
jsr ($04)
...
jsr ($1E)
rti
That's pretty much what my routine would do. It has the low/high parts of the addresses you want to go to in RAM. So if I want to do a screen updating routine at the beginning of the NMI, somewhere in the code I'll store the low/high parts of the address of the screen updating routine in $00 and $01, because at the beginning of the NMI, you see that it jumps to whatever address is stored in $00/$01. After the routine is done, we return to the NMI routine, and it goes on to the next address, which is stored at $02/$03. So if an Indirect Absolute JSR were possible, it would look like this:
Code:
nmi:
jsr ($00) ;The Low/High parts of the label "ScreenUpdate" are stored in $00/$01
jsr ($02) ;The Low/High parts of the label "Control" are stored in $02/$03.
...
jsr ($1E) ;The Low/High parts of the label "Nothing" are stored in $1E/$1F.
rti
ScreenUpdate:
.... ;We come here at the beginning of the NMI routine.
rts
Control:
.... ;We come here after ScreenUpdate
rts
Nothing:
rts ;This is a blank routine we come to if we have nothing to do
That's basically what my routine does, except there's no such JSR ($xx), so I have to use JMP ($xx). Instead of doing RTS at the end of every routine I go to, I jump to a label called "Return". At the Return label, I tell it to jump back into the NMI routine, but instead of jumping back to where it was before, it jumps to that plus 3 bytes, so it moves on to the next JMP ($xx) instruction. If I'm not using all 16 addresses, I make the unused ones point to the address of the Return label. It's just like what I showed above with the JSR ($xx), except I manipulated jump instructions to get the same effect. If I still didn't explain that very well, I can try again if you want... I'm not very good at explaining things.
Oh, your idea looks quite good! Maybe a little TOO flexible, but it could come in useful.
I guess there are plenty of ways to improve it: have the NMI point to a ROM address which does jmp ($00), then the code at $00 would automatically do jmp ($02) when it's done, etc. The problem is that the order would be fixed, and I guess you don't want that limitation. You could also have a big jsr xxxx / jsr xxxx / jsr xxxx table in RAM, have the NMI point to it, and just change the addresses as you wish. You can replace a jsr with a cmp or something to skip a routine without wasting time, change the first jsr to an rti to completely ignore the NMI, or replace the instruction after the last jsr with an rti, making a variable-length NMI routine (but still with a maximum, of course).
You'd still want to push the registers on the stack before the first jsr.
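A RAM-resident chain like that might look something like this (a sketch; Task1-Task3 are made-up labels, and the chain is copied to RAM once at init):
Code:
; Sketch: jsr chain living in RAM, patched at run time.
NMIChain:        ; some RAM address the NMI stub jumps to
jsr Task1        ; rewrite the operand to swap in another task
jsr Task2        ; or replace the opcode with $2C (bit abs) to skip it
jsr Task3        ; or with $40 (rti) to end the list early
rti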
I'll probably do some improving in the future, but for now I think it's pretty okay. I'm very comfortable using it. I think I'll first change it so that, instead of having the unused ones go to "Return", the first unused one goes to a routine that just does RTI, so as not to waste NMI time. However, as it is, I can simply tell it to do whatever I want, whenever I want, and I don't have to look at an ugly NMI routine when debugging.
I understand why you needed such a dynamic way to handle your NMIs, because I had the same problem. My game has a LOT of things to do during VBlank, but the time is really short and there is no way to do everything every frame. Because of this, some tasks are interchangeable. For example: I have reserved some time for updating a row of metatiles, but if the screen did not scroll there is no need to, and I'll update the palette instead. I'm pretty sure that a similar situation inspired you to come up with that method.
But I see a serious problem with your solution: The insane amount of valuable zero page RAM you set aside just for holding pointers that most of the time you won't even use.
My solution to the same problem, as I said above, was to define interchangeable tasks. It ends up looking a lot like yours, but I waste less memory by defining which tasks replace which.
As I said before, I use the concept of multiple NMIs. In the main game, for example, which has its own NMI, I have defined that during VBlank I need to:
1. Update a row of metatiles;
2. Update a column of metatiles;
3. Update the palette;
4. Update sparse metatiles;
5. Update the OAM through a sprite DMA;
6. Update some patterns;
There is no time to do it all every frame, but I can right away tell what will be done every frame: OAM updates. The rest of the tasks I handle just like you:
Code:
jsr CopySlot1
jsr CopySlot2
jsr CopySlot3
;Do OAM transfer here
;Handle music here
rti
This would be the NMI for the main game. I used a different trick to compensate for the lack of an indirect JSR instruction:
Code:
CopySlot1:
jmp (CopyRout1) ;column or palette
CopySlot2:
jmp (CopyRout2) ;pattern block or palette and metatiles
CopySlot3:
jmp (CopyRout3) ;row or metatiles
With the concept of interchangeable tasks, I defined just a few time slots that I allocate to subroutines as needed. And in the cases where you need to do 2 things during the time otherwise used by a slower routine, you can just make a wrapper routine to call the other two:
Code:
UpdatePalette:
(...)
rts
UpdateMetatiles:
(...)
rts
UpdatePaletteAndMetatiles:
jsr UpdatePalette
jsr UpdateMetatiles
rts
Anyway, I'm just pointing out a different solution for the same problem. Your way is probably more dynamic, because you can just use the same NMI routine for the whole game. But when combined, the two techniques I use (multiple NMIs with interchangeable tasks) are almost as dynamic, without using so much RAM.
Well, tokumaru's solution also looks clever, and definitely more efficient.
My simple solution so far has always been to just give a priority to tasks. If something like the palette absolutely must be updated, the main program knows that it should do it once all requests to update metatiles are done, otherwise the palette update will be ignored, so I just program things with this in mind. My game doesn't scroll very often and does it pretty slowly, so it is drastically different from a Sonic platformer, which scrolls very fast and almost all the time.
I will post the NMI I use (which has worked fine for me) for general information, but be sure not to use it as-is, because it's a piece of code I actually use (not just an example).
Code:
NMI
pha
txa
pha
tya
pha ;Saves the registers
bit $2002.w ;VBlank flag clear
lda SpriteDMAFlag
beq _skipSpriteDMA ;Is the sprite buffer ready ?
lda #$00
sta SpriteDMAFlag ;Sprite buffer is read
sta $2003.w
lda #$02
sta $4014.w ;Do sprite DMA at $200
_skipSpriteDMA
lda BlockFlag
beq +
lda #$00
sta BlockFlag
jsr WriteBlock ;Need to write some block data to the PPU ?
jmp _skipPPUUpload
+ lda PPUStringSize ;Need to upload a name-table string ?
beq +
jsr WriteNamStrings
jmp _skipPPUUpload
+ lda PalFlag ;Need to upload the palette ?
beq _skipPPUUpload
lda #$00
sta PalFlag
jsr WritePalette
_skipPPUUpload
lda #$3f
sta $2006.w
lda #$00
sta $2006.w
sta $2006.w
sta $2006.w
sta $2005.w ;PPUScrollH is always 0 (exept when changed midframe)
lda PPUScrollV
sta $2005.w ;Upload PPU regs
lda M2000
sta $2000.w
jsr PlaySound
jsr NumberLogic
lda #$00 ;Clears the flag to stop waiting the frame
sta NMIFlag
pla ;Restore registers
tay
pla
tax
pla
IRQ rti
So yeah, this is exactly the routine I use; it's absolutely simple and works absolutely fine. The sprite DMA is done every frame, unless the main program skips it or the game lags. Then the other PPU updates are sorted by priority: first comes the "block" update (block is just the word I use for metatiles; the routine writes a single "big" 32x32 metatile to the screen, the format I use in my game). If no block write is requested, it checks whether a regular PPU string update should be done (there is a universal format my game uses for all non-map graphics). If neither of these is needed, the palette is updated if requested. If all 3 of these things are requested by the main program at the exact same time, 3 NMIs are needed to complete everything. Finally, the music is handled every frame (whether or not we're still in VBlank), and "Number Logic" is applied (this routine increases/decreases some counters/timers and re-seeds the random number generator). Then the NMI returns. I could make some improvements: it might be possible to update both a PPU string and the palette in the same frame, or to update only the BG or sprite palette to save time. However, I keep things as simple as possible in the NMI (as long as it doesn't limit the main program too much), and try to work with it as it is.
By the way, I noticed I have used, a couple of times (4, exactly), code looking like this:
Code:
asl A
tax
lda JumpTable,X
sta PointerL
lda JumpTable+1,X
sta PointerH
jmp [Pointer]
JumpTable:
.dw Adress1
.dw Adress2
etc...
Would it be possible to just replace it with this, saving space without corrupting anything?
Code:
asl A
tax
lda JumpTable+1,X ;high byte must be pushed first...
pha
lda JumpTable,X ;...then the low byte, since RTS pulls low, then high
pha
rts
JumpTable:
.dw Adress1-1
.dw Adress2-1
etc...
This saves 4 bytes plus the use of the Pointer word, which is free for other uses anyway, since I already wrote the code with that in mind and use that variable for other functions as well.
By the way, since tokumaru mentioned this: I have a complete game engine (without much running on it for now) that uses zero page bytes $00 to $9E (62% of zero page taken) and BSS RAM bytes $300 to $583 (50.2% of BSS RAM taken, not counting the stack and OAM at $100-$2FF).
So even if your game engine is more complex than mine, and even though my game isn't completely finished, I'd guess you still have enough RAM to do what you need to do.
tokumaru wrote:
But I see a serious problem with your solution: The insane amount of valuable zero page RAM you set aside just for holding pointers that most of the time you won't even use.
I really don't see how I'm using an insane amount of zero page. I just think I'm using a moderately sized section of it. I have plenty of RAM to work with. I can store a variable or two in something that isn't zero page. I'm sure it'll take more time, and the cycles will add up. But, for the flexibility that I have now, I think it's well worth the zero page. I'll change it in the future to use less.
I tried the pha/pha/rts thing, and it works fine! In fact, I think this is the standard way to do the jmp () thing on some other processors that don't have a jmp () instruction.
About the NMI, I thought of a program that could automatically write an address to $2006 and a string to $2007, while computing how long this will take, so it can automatically maximize the amount of data written to the PPU. Then the main program doesn't have to worry much; things get updated as soon as possible, in the order specified, with as much data as possible in a single frame, and the rest is automatically left for the next frame. Think of it as a FIFO between the main program and the NMI, where the buffer is emptied by a computed amount automatically. This would be somewhat complicated to implement, though. I'd still prefer it to having the program manually change the address where the NMI is going to update things, as the NMI can happen anywhere and you're always ready for it, and the code will be clearer overall and easier to use as seen from the main program.
Soooo... I was thinking of a way to generalize the fast updating of Name Tables I talked about in this post, and I ended up coding some sort of pseudo-DMA thing for writing data to the PPU.
This uses a lot of ROM space, because of the big jump table and the big unrolled loop (about 1KB total, not counting the ByteTable I use for X and Y operations), but it greatly optimizes the whole process of writing data to the Name Tables. Here is the code I used to test the idea (it runs on the 6502 simulator):
Code:
TempAddress = $00
Buffer = $0480
.start Reset
.org $8000
ByteTable:
ByteValue .set 0
.rept 256
.db ByteValue
ByteValue .set ByteValue + 1
.endr
JumpTableLo:
CopyAddress .set CopyEnd - 6
.rept 128
.db <CopyAddress
CopyAddress .set CopyAddress - 6
.endr
JumpTableHi:
CopyAddress .set CopyEnd - 6
.rept 128
.db >CopyAddress
CopyAddress .set CopyAddress - 6
.endr
Reset:
ldx #$7f
FillBuffer:
txa
sta Buffer, x
dex
bpl FillBuffer
lda #$0f ;data length
ldx #$00 ;data offset
jsr UpdateData
brk
UpdateData:
;INPUT:
;A: length of the string to copy -1
;X: where in the buffer the string starts;
tay
clc
adc ByteTable, x
tax
lda JumpTableLo, y
sta TempAddress+0
lda JumpTableHi, y
sta TempAddress+1
jmp (TempAddress)
CopyData:
Displacement .set 128
.rept 128
Displacement .set Displacement - 1
lda Buffer-Displacement, x
sta $2007
.endr
CopyEnd:
rts
Just watch the values being written to $2007 to make sure the right ones are fetched. This code is pretty incomplete, of course (I can't even say it's 100% bug-free); it just demonstrates the idea. But you could very well implement a list of updates to perform during VBlank, composed of the destination address, the increment mode (1 or 32), the length of the data and its offset into the buffer. Then, during VBlank, you simply scan this list and execute all the requested copying.
I limited the buffer to 128 bytes because that's the most you can do without crossing pages (which would make reads 1 cycle slower), and the buffer must be placed in the upper half of a memory page for the same reason. However, 128 bytes should be enough for data that is written to unpredictable locations. I mean, you wouldn't use this to update the palette, since you always know where the palette goes; this is mostly for Name Table writes, and you'll hardly need to write more than 128 bytes to them when scrolling.
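The list of updates mentioned above could be as simple as a few bytes per entry; here is one possible (made-up) layout, not something the test code implements:
Code:
; Sketch of one VBlank update record (hypothetical format):
; byte 0: high byte of the PPU address ($00 = end of list)
; byte 1: low byte of the PPU address
; byte 2: length of the data - 1 (bit 7 set = add-32 increment mode)
; byte 3: offset of the data into Buffer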
One thing that bothers me is that useless byte table:
Code:
stx Temp
clc
adc Temp
tax
Would take about the same time as:
Code:
tay
clc
adc ByteTable, x
tax
and would save 256 bytes.
Also, why use indirect addressing in the Displacement routine? This could potentially waste cycles if a page boundary is crossed, but well... anyway, that was just for the idea. It looks good for really fast transfers, as it would be almost as fast as self-modifying code but wouldn't waste a lot of RAM.
Using $300-$7FF (or $200-$6FF or anything else), you're able to store a chain of exactly 256 lda #$xx / sta $2007 instructions, which is cool.
Bregalad wrote:
One thing that bothers me is that useless byte table
In this case, yeah, there is no advantage in using it, but I used because I have this table in my game and got used to using it for X and Y operations (as described in the first post of this topic). Since I have the table anyway, and in some cases it really does reduce the complexity of the code a lot, I used it here too, just to avoid the temp.
Quote:
Also, why use indirect adressing in the Displacement routine ? This could potentially waste cycles if a page boundary is crossed, but well...
Indirect? Where? Do you mean indexed? If you think about this code a bit more, you'll see that LDA absolute can't be used for this, as it wouldn't allow random access to the buffer. X is used to displace the start of the buffer around. As far as I know, the way I did it is the only way to access any part of the buffer, starting anywhere in the list of LDAs/STAs (depending on the number of bytes to copy), so that the copying process ends with the last byte you want to copy. If you know of a better way to do the same thing, I'd like to know.
Also, as long as you respect the rules I talked about (the limit of 128 bytes and the location of the buffer), there will be no page crossing.
EDIT: Just to make it clear, the beauty of this is that you can copy a variable amount of data from anywhere in the buffer, without having to manage an index register for every byte, and without having to check for an ending condition. This is what makes it fast, it's a flexible unrolled loop. With absolute addressing you'd only be able to read hardcoded positions of the buffer.
I've actually been using something similar in my tile copying loop, it needs to be able to start and stop at arbitrary indices because a single transfer usually takes several frames. In my case I need to transfer up to 240 different characters from two separate buffers which would correspond to about 53k of unrolled code, perhaps a bit hefty ;)
Thankfully, with a bit of arithmetic you can get much the same benefit from a partially unrolled loop. The following code tries to copy the buffer to VRAM starting from x and continuing up to (but not including) limit. Unfortunately you lose as many bytes as the unrolling factor at the end of the array, so in this case we can only copy up to 248 bytes, and you have to place those "wasted" bytes at the start of the page so that buffer-8 doesn't cross a page. Furthermore, the initialization code, and especially the computed goto, could be sped up a bit, but I'm thinking it might as well be prepared outside of vblank anyway. Oh, and "identity" is tokumaru's byte table, which I've actually been using in my own code. It's damned nice having bytes to spare for such things for once :)
Code:
copy:
lda limit
clc
sbc identity,x
and #$07
tay
adc identity,x
tax
lda .enter_lo,y
sta trampoline+0
lda .enter_hi,y
sta trampoline+1
lda #$ff ; A=$FF so that (A AND X) in SBX below is just X
jmp (trampoline)
.enter_lo:
.byte <.enter7,<.enter6,<.enter5,<.enter4
.byte <.enter3,<.enter2,<.enter1,<.enter0
.enter_hi:
.byte >.enter7,>.enter6,>.enter5,>.enter4
.byte >.enter3,>.enter2,>.enter1,>.enter0
.loop:
sbx #-8 ; undocumented opcode (AXS): X = (A AND X) - (-8) = X + 8
.enter0:
ldy buffer-8,x
sty $2007
.enter1:
ldy buffer-7,x
sty $2007
.enter2:
ldy buffer-6,x
sty $2007
.enter3:
ldy buffer-5,x
sty $2007
.enter4:
ldy buffer-4,x
sty $2007
.enter5:
ldy buffer-3,x
sty $2007
.enter6:
ldy buffer-2,x
sty $2007
.enter7:
ldy buffer-1,x
sty $2007
cpx limit
bne .loop
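In higher-level terms this is a Duff's-device-style loop: handle the leftover (limit - x) mod 8 bytes first by entering the unrolled body at the matching offset, then run full 8-byte chunks until the limit. A rough Python equivalent of that control flow (ignoring the cycle-exact SBX/trampoline details):

```python
def partial_unrolled_copy(buffer, x, limit, unroll=8):
    """Copy buffer[x:limit]: the (limit - x) % unroll leftover bytes
    first (this is what the computed entry point achieves), then
    whole unrolled chunks until the limit is reached."""
    out = []
    i = x
    for _ in range((limit - x) % unroll):  # partial first chunk
        out.append(buffer[i])              # ldy buffer-k,x / sty $2007
        i += 1
    while i < limit:                       # full 8-byte chunks
        for _ in range(unroll):
            out.append(buffer[i])
            i += 1
    return out
```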
Yeah, the idea is very similar... I understand that you have to do things a bit differently, as the purposes aren't the same.
My motivation for this was my Sonic game: because of the fast scrolling, I must be able to write up to 128 bytes of Name Table data for the new columns and rows of metatiles that scroll in. I doubt it would be possible to write that much data within standard VBlank time, considering there are still other things to write, such as Attribute Tables, sprite DMA, and so on.
tokumaru wrote:
Yeah, the idea is very similar... I understand that you have to do things a bit differently, as the purposes aren't the same.
I'd say it's exactly the same thing; the only difference is that I'm copying up to 8k, which makes complete unrolling a tad impractical. But my "inner loop" works exactly the same way as your raw copying, well, except for the fact that you're copying downwards IIRC.
Perhaps even 768 bytes of code might become too much some day, if the interrupt code has to run from the fixed bank.
Quote:
My motivation for this was my Sonic game: because of the fast scrolling, I must be able to write up to 128 bytes of Name Table data for the new columns and rows of metatiles that scroll in. I doubt it would be possible to write that much data within standard VBlank time, considering there are still other things to write, such as Attribute Tables, sprite DMA, and so on.
Just how are you extending the blanking period by the way? I had all kinds of issues with this if you recall.
doynax wrote:
Perhaps even 768 bytes of code might become too much some day if the interrupt code has to run from the fixed bank.
I see what you mean... I'm not worried about this because I have an 8KB ROM bank dedicated to the VBlank routines, enough to unroll everything.
Quote:
Just how are you extending the blanking period by the way? I had all kinds of issues with this if you recall.
I'm not. I was when I was using the MMC1 with CHR-RAM, so that I could copy 256 bytes to it every frame, in addition to all the other PPU operations. I simply had all my VBlank code use a fixed amount of cycles (which was a bit hard, because I alternated some tasks, so I had to make sure that alternating tasks took the same amount of time), so I always enabled rendering at the same point, 16 scanlines past the start of the frame. It worked fine, except for a few sprite problems (I couldn't seem to find the exact moment to turn sprite rendering on, so I always had a few glitchy pixels at the top left corner of the screen), and the different dot crawl pattern that results from skipping the pre-render scanline, which may or may not bother people.
Because of how hard it was to keep the Vblank code constant-timed and because of the sprite glitches, I ended up switching to the MMC3 with CHR-ROM, and all PPU updates fit in the standard Vblank time now. I didn't have to sacrifice much, and ended up getting more usable frame time from not having to manually copy tiles to the PPU. I still blank the top 16 scanlines though, by having blank tiles switched in, and only switching the actual patterns when a mapper IRQ fires 16 scanlines into the frame. This is to avoid scrolling glitches usually visible when you use vertical mirroring and scroll vertically.
I know a lot of commercial games had visual glitches, but I just couldn't live with them in my game(s). If it is possible to get rid of them, I'll do everything I can! =)
Well, time-optimising is definitely the exact opposite of byte-optimising.
Lately I've had to do a lot of byte-optimising, where I could save 4 bytes or so just about everywhere in my game engine, and now you guys talk about 8k of unrolled code, which makes me crazy.
That identity thing is a bit of fun; it's true it allows easy and fast operations with the index registers (add, subtract, and even AND, OR, etc.), and each time you use it you save 1 byte, but you still have to use the trick more than 256 times for it to really be worth it.
Speaking of 6502 tricks, I noticed how many times I had to do 4 consecutive ASL A or LSR A, so I made 2 routines that do them and return, and call them, saving one byte every time the trick is used. I was able to save about 30 bytes doing that, which is great.
However I had to do a lot of time-optimising for my mode-7 demo, as I perform raster-timed code and calculations at the same time.
Putting some code in RAM and modifying the operands of the instructions rocks, as you add one level of indirection and save time (lda #$xx can be the equivalent of lda $xxxx, and lda $xxxx,Y can be the equivalent of lda [$xx],Y, plus you get the equivalent of the nonexistent lda [$xx],X).
I know you are not using any kind of PRG bankswitching in your current project, so these tricks probably do not interest you much.
These tricks are for games with much larger PRG-ROM, which are usually more complex anyway. You'll hardly see an NROM or CNROM with complex scrolling and animations, large sprites, and so on...
You can definitely make a very fun game with NROM or CNROM. You may have heard of a little game called Super Mario Bros. :p
But seriously, there are a lot of fun games that use no mapper. At least I find them fun.
Bregalad wrote:
Speaking of 6502 tricks, I noticed how many times I had to do 4 consecutive ASL A or LSR A, so I made 2 routines that do them and return, and call them, saving one byte every time the trick is used. I was able to save about 30 bytes doing that, which is great.
I was thinking about a trick to do divisions by 16/multiplications by 16. If you wanted to save about 3 cycles, you could do something like this:
ldx SomeVariable
lda Table,x
Table:
.db $00,$10,$20,$30,$40,$50,$60,$70,$80,$90,$A0,$B0,$C0,$D0,$E0,$F0
That would be good for multiplying 4-bit values by 16. You could also extend it to a 256-byte table so the trick works for any byte value (keeping the low byte of the result). The same could be applied to dividing, but it would pretty much require a 256-byte table. While it's a huge waste of ROM, it may end up saving you a scanline or two, given how frequent divisions and multiplications by 16 are.
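The tables are trivial to generate and check; a quick Python sanity check of the idea (`table` and `table256` are just illustrative names):

```python
# 16-entry multiply-by-16 table for 4-bit inputs
table = [i * 16 for i in range(16)]

# 256-entry variant: only the low byte of the result survives
table256 = [(i * 16) & 0xFF for i in range(256)]

assert all(table[i] == i << 4 for i in range(16))
assert all(table256[i] == (i << 4) & 0xFF for i in range(256))
```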
Bregalad wrote:
That identity thing is a bit of fun; it's true it allows easy and fast operations with the index registers (add, subtract, and even AND, OR, etc.)
Hey, I just thought of another very good use for the identity table:
Code:
ldx identity, y
Code:
ldy identity, x
These work like TYX and TXY, which obviously don't exist. I was just coding my game and felt the need to do a TYX, when I noticed this could in fact be done with the table. Seriously, for anyone that still thinks that this table is not worth the 256 bytes it uses: It really increases the functionality of X and Y, usually saving RAM that would be used as temporary storage, and saving ROM that would be used by the extra code needed to perform the same tasks. This table makes me feel like I gained a lot of new opcodes. =) If you can spare a bit of ROM, you really should use this table.
EDIT: I'm almost tempted to create some macros named after the pseudo-opcodes this table provides... It'd be like legal undocumented opcodes! =)
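In Python terms the whole identity-table trick boils down to `identity[i] == i`, which is why an indexed read behaves like a register-to-register transfer or an ALU operation between A and an index register:

```python
identity = list(range(256))  # the 256-byte ROM table: identity[i] == i

a, x, y = 0x5A, 0x3C, 0x77
assert a ^ identity[x] == a ^ x   # eor identity,x -> A = A EOR X
assert a & identity[x] == a & x   # and identity,x -> A = A AND X
assert identity[y] == y           # ldx identity,y -> the missing "TYX"
assert identity[x] == x           # ldy identity,x -> the missing "TXY"
```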
Quote:
These work like TYX and TXY, which obviously don't exist. I was just coding my game and felt the need to do a TYX, when I noticed this could in fact be done with the table. Seriously, for anyone that still thinks that this table is not worth the 256 bytes it uses: It really increases the functionality of X and Y, usually saving RAM that would be used as temporary storage, and saving ROM that would be used by the extra code needed to perform the same tasks. This table makes me feel like I gained a lot of new opcodes. =) If you can spare a bit of ROM, you really should use this table.
Well, you need a couple of temporary storage variables ANYWAY, whatever you're going to do. I remember having a lot of headaches sticking to only 4 temporary variables, and whenever I need more I use differently named half-general-purpose variables.
Also, even if it saves a couple of bytes in the code here and there, I guess it would be very rare to actually save 256 bytes that way. You'd only manage that with an unrolled loop that uses the table, or something like that. So memory-wise this isn't a good solution, but time-wise or ease-of-use-wise, maybe it is.
Also, TXA/TAY and TYA/TAX take 1 less byte than ldx Identity,Y and ldy Identity,X, and take the exact same time, so I don't know why you'd want to do this. And yeah, it overwrites A, but usually in a single loop/iteration you put X and Y to a single use, so I don't see much point in the trick. The only case where it would be really useful is if you use an instruction like rol $xx,X, which can't be done with Y, and then sta [$xx],Y, which can't be done with X, but you want the same "index" and don't want to overwrite A in the process. So yeah, in that case it's useful, but that's not really frequent.
Honestly, with 256 bytes you could have a very large additional level in your game, or a new music track with 3 channels, which are much better uses than a stupid identity table.
@Celius: Yeah, your idea should be great for the other guys who want really fast code, but it's not great for me, who wants to save bytes even if that slows the process a little. Your trick uses 5 bytes instead of 4, or even instead of 3 if you have a subroutine that does 4 ASLs and an RTS (I do, and as mentioned above I use it over 25 times in the whole code).
And if you want the equivalent table for LSR, you could have the assembler place a byte with $00, $01, $02, etc... every 16 bytes, and manage to have 15 very small subroutines of 15 bytes or less interleaved in between. Things like a routine that polls $2002 and returns, or writes to the mapper while avoiding bus conflicts, etc... I'm pretty sure a complete game engine has 15 routines of 15 bytes or less.
I just recently discovered the BIT trick:
Code:
Sub1: ldx #00
.db $2C ; opcode for BIT absolute: swallows the next 2-byte instruction
Sub2: ldx #07
.db $2C
Sub3: ldx #11
stx somewhere
; go about business
It's not terribly useful, but it has its moments.
Can't resist bumping this old thread to mention I've found a use for combining BIT and the identity table.
I kind of really miss the bit immediate instruction available on the 65C02. There's quite a few cases where I'd like to test certain bits in a byte with a bitwise AND without destroying the contents of the accumulator:
Code:
lda mapFlags,X
bit #FLAG1+FLAG2
beq :+
jsr DoSomething
:
bit #FLAG5+FLAG6
beq :+
jsr DoSomeOtherThing
:
But even though the bit immediate instruction is not there, it could be emulated with BIT absolute and an identity table, using 1 more byte and 2 more cycles:
Code:
lda mapFlags,X
bit Identity+FLAG1+FLAG2
beq :+
jsr DoSomething
:
bit Identity+FLAG5+FLAG6
beq :+
jsr DoSomeOtherThing
:
A more optimized way would of course be to reserve a few zeropage bytes for the combinations you really need to test, but that's not as generic. Though with a powerful enough macroassembler, I guess you could have a BIT immediate macro that employs the identity table as a safe fallback, but uses zeropage locations for the most popular combinations. :)
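The emulation works because the byte stored at Identity+mask is the mask itself, so the BIT reads exactly the constant you wanted while leaving A alone. A quick Python check, with hypothetical FLAG values:

```python
identity = list(range(256))   # identity[i] == i
FLAG1, FLAG2 = 0x01, 0x02     # hypothetical flag constants

mask = FLAG1 + FLAG2
value = 0x03
# "bit Identity+mask" sets Z from (A AND memory) without changing A
z_flag = (value & identity[mask]) == 0
assert identity[mask] == mask            # the table hands back the constant
assert z_flag == ((value & mask) == 0)   # same Z as a true BIT immediate
```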
If you're doing 65816 in 16-bit mode, it would need to be a HiROM cart, either in a $40-$7d bank or a $80-$ed bank.
Celius wrote:
The same could be applied to dividing, but it would pretty much require a 256-byte table. While it's a huge waste of ROM, it may end up saving you a scanline or two, given how frequent divisions and multiplications by 16 are.
Here's the code I use to divide 12-bit numbers by 16 with an 8-bit quotient (n = $000 to $FFF):
Code:
; assume Y holds "n" lsb, X holds "n" msb, A will have result
lda m16tbl_hi, Y
ora m16tbl_lo, X
and the tables:
m16tbl_hi
Code:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03
04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 04
05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
06 06 06 06 06 06 06 06 06 06 06 06 06 06 06 06
07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07
08 08 08 08 08 08 08 08 08 08 08 08 08 08 08 08
09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09
0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A
0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B
0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C
0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D
0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F
m16tbl_lo
Code:
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
I think it can be extended to work with 16-bit numbers by using X to index into m16tbl_hi for the high byte. I haven't verified it though.
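The two tables are just y >> 4 and the low byte of x << 4, so it's easy to verify in Python that the LDA/ORA pair really computes n / 16 for every 12-bit n:

```python
m16tbl_hi = [y >> 4 for y in range(256)]           # contribution of the lsb
m16tbl_lo = [(x << 4) & 0xFF for x in range(256)]  # contribution of the msb

for n in range(0x1000):                  # all 12-bit inputs
    msb, lsb = n >> 8, n & 0xFF
    result = m16tbl_hi[lsb] | m16tbl_lo[msb]  # lda m16tbl_hi,Y : ora m16tbl_lo,X
    assert result == n // 16
```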
You know, Baramos, it was a great idea to revive this thread.
How many times I've had to do something like
pha
and #$xx
...blah blah
pla
...blah blah
when the BIT instruction would have been better but couldn't be used because there is no BIT immediate!
Without going as far as having a 256-byte identity table, simply use .db with the constant you need (a single-byte table...).
Since you'd usually do the BIT instruction with only a single bit set, and bit 7 can be tested directly with the N flag, a 7-byte table should be enough for anything (in fact only 6 will be necessary; I'll explain below):
.db $01, $02, $04, $08, $10, $20, $40
Also, I've found another trick which is simple but worth mentioning. If you load a value into A, the only bit you can "quickly" test is bit 7, with the N flag.
But if you use the ASL A instruction, then C = old bit 7 and N = old bit 6, so you can quickly test 2 bits without using AND or BIT instructions! Pretty useful!
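Modeling the flag behaviour in Python confirms it: after ASL A, the carry holds the old bit 7, and the new bit 7 (which is what the N flag reflects) is the old bit 6:

```python
def asl_a(a):
    """Model of ASL A on an 8-bit value: returns (result, carry, negative)."""
    result = (a << 1) & 0xFF
    carry = (a >> 7) & 1          # C = old bit 7
    negative = (result >> 7) & 1  # N = new bit 7 = old bit 6
    return result, carry, negative

for a in range(256):
    _, c, n = asl_a(a)
    assert c == (a >> 7) & 1
    assert n == (a >> 6) & 1
```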
Another very simple but clever thing to keep in mind: if you made a subroutine that you're only going to call once in your code, replace it with a macro. You'll save 6 cycles and 4 bytes (assuming there were no branch instructions around the call). This keeps the code as structured as if it were a subroutine, and you can always change it back to a subroutine if you end up calling it somewhere else.
Finally, how many times do you call a subroutine needing more than 3 bytes of arguments (more than what A, X and Y can hold)?
Well, the solution to that comes from the SMB disassembly...
Code:
jsr MyRoutine
.dw Pointer1, Pointer2 ;4 bytes arguments
MyRoutine
jsr GetArguments
.....
rts
GetArguments
tsx
lda $103,X ;Get return address from the stack
sta PtrL
clc
adc #$04 ;Add 4 to return address
sta $103,X
lda $104,X
sta PtrH
adc #$00
sta $104,X
ldy #$04 ;Copy arguments to Temp variables
- lda [Ptr],Y ;We should not forget the address on the stack is the return address - 1!
sta Temp-1,Y
dey
bne -
rts
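Here's a toy Python model of what GetArguments does to the stack (simplified, not cycle-accurate): the pushed return address points at the last byte of the JSR, so the four inline bytes sit right after it, and bumping the saved address by 4 makes the eventual RTS skip over them.

```python
def get_arguments(memory, stack, count=4):
    """Fetch `count` inline bytes placed after a JSR and fix up the
    return address so RTS skips them.

    stack[-1] holds the 6502-style return address, i.e. the address of
    the JSR's last byte (real return target minus one).
    """
    ret = stack[-1]
    args = [memory[ret + 1 + i] for i in range(count)]
    stack[-1] = ret + count  # RTS now lands just past the arguments
    return args

# jsr MyRoutine occupies $00-$02; four inline argument bytes at $03-$06
memory = [0x20, 0x10, 0x80, 0xAA, 0xBB, 0xCC, 0xDD, 0xEA]
stack = [0x02]                   # address of the JSR's last byte
args = get_arguments(memory, stack)
assert args == [0xAA, 0xBB, 0xCC, 0xDD]
assert stack[-1] + 1 == 0x07     # RTS returns to the byte after them
```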
Sure, the GetArguments routine can be pretty long and big, but in the end it will save you from writing, possibly hundreds of times, something like:
Code:
lda #BlahBlah
sta Temp
lda #BlahBlah
ldy #BlahBlah
ldx #Blah
jsr Routine
So this saves approximately 6 bytes for each call, which can add up to a lot if it's done frequently.
My GetArguments routine is 32 bytes, so if you use this trick more than 10 or so times it's definitely a gain.
However, you can't use variable arguments unless you place your code in RAM (which ends up being impractical on the NES, as you'd have to copy it there).
I wonder if there is any way to improve this argument thing to save even more bytes; I have a feeling it's possible.
Quote:
when the BIT instruction would have been better but couldn't be used because there is no BIT immediate!
The special case where you want to BIT #$80, #$40, #$20, etc should be doable without destroying A:
Code:
lda foobar
bmi bit7_set
cmp #$40 ; we know that bit 7 wasn't set
bcs bit6_set
cmp #$20
bcs bit5_set
; and so on
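The cascade works because once the higher bits are known to be clear, a compare against the bit's own value tells you whether that bit is set; in effect it finds the highest set bit. Verified in Python over all byte values:

```python
def highest_set_bit_tests(v):
    """Model of the BMI/CMP cascade: returns the first bit found set,
    scanning from bit 7 down, or None if the value is zero."""
    if v & 0x80:                 # bmi bit7_set
        return 7
    for bit in (6, 5, 4, 3, 2, 1, 0):
        if v >= (1 << bit):      # cmp #value / bcs bitN_set
            return bit           # higher bits clear, so bit N must be set
    return None

for v in range(256):
    expected = v.bit_length() - 1 if v else None
    assert highest_set_bit_tests(v) == expected
```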
Bregalad wrote:
However, you can't use variable arguments unless you place your code in RAM (which ends up being impractical on the NES, as you'd have to copy it there).
I wonder if there is any way to improve this argument thing to save even more bytes; I have a feeling it's possible.
Coincidentally there's a similar discussion (inlining paramters) on
AtariAge.
Don't know if I'd call it an improvement, but I have used code something like this.
The list is zero-terminated: the code consumes a pair of zeros and outputs a single zero, while an unpaired zero terminates the list.
The first two bytes are the address of the target routine.
It only deals with the least significant byte of the return address, so the parameter list can't straddle a page boundary.
Pointer needs to be initialized to zero.
Code:
GET_PARAMETERS
pla
tay
pla
pha ; put the high byte back
sta pointer+1
ldx #$00
beq SKIP
LOOP
sta parameters,x
inx
SKIP
iny ; pointing one short first pass here fixes that
lda (pointer),y
bne LOOP
iny
lda (pointer),y
beq LOOP
dey ; fix the return address guess we can't return to a
; break
tya
pha
jmp (parameters)
How to check if a number at mem_ptr is a power of 2?
Code:
ldx mem_ptr
dex
txa
and mem_ptr
beq power_of_two
;it is not a power of two
krzysiobal wrote:
How to check if a number at mem_ptr is a power of 2?
Code:
ldx mem_ptr
dex
txa
and mem_ptr
beq power_of_two
;it is not a power of two
As long as mem_ptr is not 0.
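This is the classic n AND (n - 1) test: subtracting 1 flips the lowest set bit and everything below it, so the AND is zero exactly when there was only one set bit. A Python check covering the n = 0 caveat:

```python
def is_power_of_two(n):
    """8-bit version of the DEX/AND trick, with the zero case handled."""
    return n != 0 and (n & ((n - 1) & 0xFF)) == 0

powers = {1, 2, 4, 8, 16, 32, 64, 128}
for n in range(256):
    assert is_power_of_two(n) == (n in powers)
```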
Well, you could say that 2^(-infinity) = 0, therefore 0 is a power of two, but that's a bit of a cheat.
@bogax: I think your approach is interesting.
I like the idea of using PLA instructions cleverly to retrieve the return address!
However, in many cases you'll need more than a single subroutine that uses the "advanced argument" system.
Therefore, of course, you do not want to copy ALL the argument fetching code into ALL the routines that use the system, or else it will waste a lot of space, killing the idea of the system.
This is why, in my example, the called routine starts by itself calling the argument fetching routine, which reads 3 bytes deeper in the stack (it ignores its own return address and goes to the return address before that).
And yet I am still sure there is a way to improve it...
Bregalad wrote:
@bogax: I think your approach is interesting.
I like the idea of using PLA instructions cleverly to retrieve the return address!
However, in many cases you'll need more than a single subroutine that uses the "advanced argument" system.
Therefore, of course, you do not want to copy ALL the argument fetching code into ALL the routines that use the system, or else it will waste a lot of space, killing the idea of the system.
This is why, in my example, the called routine starts by itself calling the argument fetching routine, which reads 3 bytes deeper in the stack (it ignores its own return address and goes to the return address before that).
And yet I am still sure there is a way to improve it...
I'm not sure I understand your objection.
Rather than JSR to a routine that then JSRs to the argument fetching
code you just JSR to the argument fetching code supplying it with
the address of the routine you want to JSR to as a parameter
which the argument fetch code then jumps to.
You shouldn't need to duplicate the argument fetch code, but you
have to include the argument fetch routine address in each JSR
to a routine that uses the argument fetching code (while saving
a JSR to the argument fetch code in each of those routines)
And while it may save a little messing with the stack, it costs
25 cycles or so.
Like I said, I'm not sure it's any improvement
but it will fetch a variable number of arguments.
Perhaps it would make more sense to do something closer to
your code, but pass it the number of arguments.
Sorry - my bad - I didn't think hard enough.
I haven't really decided whether "my" (really Nintendo's) solution or your solution is better.
I should take the time to analyze how many bytes each solution saves.
Anyway, I think your solution is very elegant, while mine would need a "jsr FetchArguments" at the start of every routine which needs arguments, so I think your solution is probably better.
I wonder if there should be a wiki page about 6502 asm optimisations, as it might be easier to find info from a wiki page than from a thread that is eventually going to be at the 432nd page even if the forums are fully preserved.
Go ahead and make the wiki page.
Bregalad wrote:
Sorry - my bad - I didn't think hard enough.
This is entirely my fault for not commenting
the code better.
Bregalad wrote:
I haven't really decided whether "my" (really Nintendo's) solution or your solution is better.
I should take the time to analyze how many bytes each solution saves.
Anyway, I think your solution is very elegant, while mine would need a "jsr FetchArguments" at the start of every routine which needs arguments, so I think your solution is probably better.
With my code you avoid dealing with an extra
return address inside the fetch arguments routine
at the expense of pushing that address out of the
routine and into the JSR/argument list for each time
you call a routine that uses the fetch arguments
routine.
The timing is probably approximately the same but
I think it will end up costing some extra bytes
of code.
Speed wise, using the same index for source and
destination is faster so here's some code that
gets the number of parameters to transfer from
the head of the parameters list.
Sort of like Pascal style strings rather than
C style strings.
Code:
; routine to fetch parameters inline from
; the return address currently on the stack
; to be called from within a routine that will
; use the parameters
; the number of bytes to fetch is the first byte
; inline, the byte at the address on the
; stack + 1
; the number of bytes does not include itself
; so it's the number of bytes following the first
; byte that is fetched, i.e. if the first byte is
; 4 then there should be 5 bytes inline total
; and the last 4 will be fetched as parameters
; the first byte should always be one less than
; the total of in line bytes so
; the return address is adjusted to point to the
; last inline byte by adding the first byte
; plus one
; no provision is made for a carry so the parameters
; should not cross a page boundary
GET_PARAMETERS
; get the return address into pointer
tsx
lda $104,x
sta pointer+1
lda $103,x
sta pointer ; the return address is one short
inc pointer ; so fix the pointer
ldy #$00 ; pointer points to the first byte inline
; adjust the return address
sec ; plus one for the first byte
adc (pointer),y
sta $103,x ; and put it back on the stack
lda (pointer),y ; now get the number of bytes to transfer
tay ; into y
LOOP
lda (pointer),Y
sta parameters-1,Y
dey
bne LOOP
rts
I've got an awesome way to test individual bits without destroying A, simulating the nonexistent BIT #$xx opcode: just use bit $xxxx on known opcodes!
Test bit 6 :
Since your code is extremely likely to have a RTI instruction somewhere, just bit this instruction, and it will test bit 6. (RTI = $40)
Test bit 5 :
BIT any JSR opcode
Test bit 4 :
BIT any BPL opcode
Test bit 3 :
BIT a PHP opcode (okay PHP isn't as common but chances are you'll still use it at least once)
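This works because BIT tests A against the operand byte, and each of these opcode encodings happens to be a single power of two, exactly the bit you want; checked in Python:

```python
# 6502 opcode bytes reused as BIT operands (these encodings are fixed)
opcodes = {"RTI": 0x40, "JSR": 0x20, "BPL": 0x10, "PHP": 0x08}
bits = {"RTI": 6, "JSR": 5, "BPL": 4, "PHP": 3}

for name, op in opcodes.items():
    assert op == 1 << bits[name]  # exactly one bit set: the one tested
```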
OK, I just made another awesome discovery.
For a function that needs a bool value, use V as the input flag. Since so few instructions affect V (ADC, SBC, BIT, and obviously CLV, PLP and RTI), in some cases you may want a routine to do something slightly different based on a flag, and you don't want to use C: C tends to be affected before the part that can take two different routes, while V is more likely to survive until that part.
This can really save you a few bytes if you would otherwise use two different routines, or use C with PHP/PLP, which is not as efficient.
Also, BIT an RTI instruction to set V.
The same could be done for returning a bool value, although I'm so used to returning C as a bool that this habit will be hard to change.
Second part: I analyzed the passing of >4 bytes of parameters very carefully.
bogax's solution is officially the best.
I counted bytes as follows, assuming you have M functions that use N parameters:
- Standard solution (last 3 args in registers and anything else stored to ZP before calling): 9+N bytes for each call, 0 overhead for the callee.
Overall: (9+N)*M bytes.
- Nintendo's solution (store arguments after the jsr opcode and have the callee call a function that copies the args to ZP): 3+N bytes for each call, 3 bytes of overhead for the callee; the argument retrieval function takes 32 bytes.
Overall: (6+N)*M + 32 bytes.
- bogax's solution (jsr to a special function followed by the address of the actual function, the # of arguments and the arguments): 6+N bytes for each call, 0 overhead for the callee; the argument retrieval function takes 29 bytes.
Overall: (6+N)*M + 29 bytes.
So bogax's is the winner, if only by a close call over Nintendo's, and both beat the standard solution once the 3 bytes saved per use outweigh the roughly 30-byte retrieval routine, i.e. after about 10 uses.
I haven't checked speed, but I'm sure bogax's is less slow, since it avoids an extra jsr/rts of overhead each time.
His routine could possibly be improved further by using BRK to simulate a JSR to the argument retrieval function, if IRQs aren't used.
I just thought of something. I don't think it's revolutionary or a new discovery, but I'll share the basic idea nevertheless.
I thought that the zero page can be used as a data stack, to pass parameters to subroutines and to allocate auto variables (auto as in the B and C programming languages).
Random points about it:
1- Let's call the data stack pointer "dSP" in what follows (strange case, so as not to be confused with DSP, which has nothing to do with this).
dSP normally lives in register X, but a zero-page address is reserved to keep its value if we need X for other purposes. The data stack grows downward like the regular stack; let's use a post-decrement/pre-increment scheme for pushing/pulling values, like the regular stack.
The data stack grows downward like the regular stack. Let's use a post-decrement/pre-increment scheme for pushing/pulling values to/from data stack, like the regular stack.
2- X is chosen because with RMW instructions, zero-page indexed can only use X.
3- For non-RMW instructions, zero-page indexed takes the same cycles as absolute indexed, but takes less space. For RMW instructions, zero-page indexed takes both less space and less cycles to execute.
4- As an added benefit, pointers (to data) in the stack can be directly used thanks to the rarely used (Indirect, X) addressing.
5- Example of allocating/deallocating space on data stack:
Code:
lda dSP ; if dSP already in X, do "txa" instead
sec ; can be optimized-out
sbc #space_needed
sta dSP
tax
lda #init_val ; now you can do whatever you want with
sta $01,X ; allocated space
...
...
txa ; time to free space
clc
adc #space_needed
sta dSP
tax
If X is used to keep dSP at all times, then we can omit storing its value at its zero-page location, and ISRs can safely use X as dSP. Otherwise, you must keep a valid dSP somewhere in memory at all times so ISRs can load it, because you can't entirely rely on X holding a valid dSP inside an ISR. Of course, if the data stack is not used in any ISRs, then you don't have any trouble.
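A small Python model of the scheme (class and method names are hypothetical): allocation just moves dSP down, freeing moves it back up, and slots are addressed as offsets from dSP, like the `$01,X` accesses in the assembly above:

```python
class DataStack:
    """Toy model of a downward-growing zero-page data stack indexed by X."""
    def __init__(self, top=0xFF):
        self.zp = [0] * 256   # the zero page
        self.dsp = top        # data stack pointer, grows downward

    def alloc(self, n):
        self.dsp -= n         # sbc #space_needed / tax

    def free(self, n):
        self.dsp += n         # adc #space_needed / tax

    def store(self, offset, value):
        self.zp[self.dsp + offset] = value  # sta $offset,X

    def load(self, offset):
        return self.zp[self.dsp + offset]   # lda $offset,X

ds = DataStack()
ds.alloc(2)                  # reserve two bytes of locals
ds.store(1, 0x42)
assert ds.load(1) == 0x42
ds.free(2)                   # locals vanish when the frame is popped
assert ds.dsp == 0xFF
```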
6- A (slightly) more complicated example, the memcpy function:
Code:
; void *memcpy(void *dest, void *src, uint8_t n)
; (cdecl)
; A non-conforming implementation of memcpy
; Copies src into dest, returns original dest
memcpy_cdecl:
lda $04,X
sta ret0
lda $05,X
sta ret1
ldy $01,X ; fetch 'n'
bne memcpy_cdecl_loop ; protect against n = 0
rts
memcpy_cdecl_loop:
lda ($02,X)
sta ($04,X)
dey
bne :+ ; end?
rts
: inc $02,X
bne :+
inc $03,X
: inc $04,X
bne memcpy_cdecl_loop
inc $05,X
jmp memcpy_cdecl_loop
Here I introduced ret0 and ret1, which are fixed zero-page locations storing return values. This isn't part of my idea; one could use only the stack for return values, or registers if the result's datatype is small enough, just to give some examples. It will work as long as you're consistent with your calling convention. The key here is that parameters are "pushed" from left to right, so n is at dSP+1, src at dSP+2 and dest at dSP+4.
7- To be really useful in a project, one could mix several calling conventions, say "cdecl" (my idea) and "fastcall" (where the regular registers are used to pass values, so it's quicker), and suffix the name of the subroutine with "_cdecl" or "_fastc" to make it explicit. Moreover, if one changes one's mind about the calling convention of a given subroutine and renames it accordingly, assembly errors will point out everywhere the old subroutine was called and where to adapt the code to the changed convention.
Sorry if my code examples don't compile or don't work because I messed something up, but I hope you get the basic idea.
Yes,
I always found it silly that CC65 uses (),Y to address the data stack when it could do what you describe instead.
Also, is there any reason the stack should go "upwards" like the hardware stack? I think it would make just as much sense to have a downwards stack instead (I don't think it would affect performance in any way), but it would be better in that when you push pointers on the stack you push the low byte first, and then you can use (,X). If we do it as you describe, we have to be sure to push the high byte first.
Or even better, get rid of data stacks completely, like SDCC does, at the small price of being unable to write re-entrant functions, which nobody ever uses anyway.
Also, I think there's a 99% chance X has to be used for something else once in a while, so you should definitely save it somewhere.
Bregalad, I think you misunderstood something. I described the data stack as a "downward stack", that is, the data stack pointer points initially at a "high" address, and the more you "push" things into it, the more the pointer get low, and hence it is like the regular stack. It has nothing to do with the order of the individual bytes "pushed" into it — obviously, you put pointers in the stack in little endian, so you can use correcty the (indirect, X) addressing.
Using stack-based memory allocation has the benefit of allowing recursion, but it wasn't the intent. In the absence of an intelligent linker and a keyword to define a memory location to use only within a subroutine¹, you have to ensure yourself that any "temp" variable is used only by one active subroutine at a time (including ISRs), or you can welcome bugs, especially the gay ones. Using a data stack remove that burden. Note that you don't have to manipulate the return address nor the regular stack to get parameters, and because it is located in zero page, accesses are fairly fast (slower than absolute access though), compact and you can use pointers directly from your data stack without having to move them into zero page. You can manipulate your data directly into zero-page, and when you're done with it, you can free it automagically by adjusting your data stack pointer.
But another idea I've just got is to simply treat part of the zero page as a big set of slow virtual registers: part of it must be saved by the caller and the other part by the callee, if necessary. ISRs conforming to that model must then save all zero-page locations needed prior to using them. Therefore, no data stack is needed; the "saved-by-caller" part must be big enough for any return type, the "saved-by-callee" part big enough for temp data and parameters. With care, it can be both efficient and easy to maintain.
—
[1] An external tool for an assembler that analyses the usage of marked memory locations and determines when the subroutines using them are active also counts.
~J-@D!~ wrote:
But another idea I've just got is to treat simply part of the zero page as a big set of slow virtual registers, part of it must be saved by the caller and the other part saved by the callee, if necessary.
Yeah, that's exactly what I adopted for parts of Thwaite: $0000-$0007 saved by caller, $0008-$000F saved by callee.
I don't think I misunderstood something, but it's possible I got "downward" and "upward" backwards.
If you use a stack like the hardware stack at $100 — for example, let's say the pointer points to $ff — then when you push a pointer you'll have to push the MSB first because:
$ff = MSB of pointer
$fe = LSB of pointer
If X = $fe :
lda ($00,X) -> load from pointer
If you would have pushed the LSB first this would not have worked.
However, if you opt for the other order, let's say the stack starts at $0 and goes downward.
Then when you push your pointer, you'll have to push it in the regular little-endian format:
$00 = LSB of pointer
$01 = MSB of pointer
X = 0;
lda ($00,X) -> load from pointer
I'm not saying one is better than the other, it's just you'll have to be careful with that.
And you're completely right, and I 100% agree about the issues that come up if you don't use a data stack. I know those problems very well, given how many nasty bugs I've had because I called a routine that called another routine that modified the temp variables I relied on.
However, a good compiler such as SDCC could compile for the 6502 while avoiding those clashes, as you said, by treating part of the zero page as registers. I think this would be the most efficient solution, even though a zero-page data stack would still be reasonably efficient, while the current solution implemented in CC65 is very inefficient.
I'm going to use this one:
Normally, you would do this (“flag” is a byte in the zero page):
Code:
function code bytes cycles
set lda #1
sta flag 4 5
clear lda #0
sta flag 4 5
test lda flag
beq clear 4 5/6
Woz did it like this:
Code:
function code bytes cycles
set lda #$80
sta flag 4 5
clear lsr flag 2 5
test bit flag
bpl clear 4 5/6
This assumes only bit 7 is important. (You could also use the overflow flag without much more effort for two flags.)
For setting the flag you could also use:
sec
ror flag
So you never have to touch a register, but it is slower.
Credited to Wozniak. Thank you, Apple Computer.
http://www.pagetable.com/?p=33
^ But that's a RAM waster, considering that 1 byte is basically sacrificed as a 1-bit flag. I don't think that's useful on the NES, which has only 2KB of CPU-side RAM.
Yeah, speed/convenience vs space are always an issue, but many games use one byte for some flags. You do save rom space.
Maybe on a personal computer of the 80's where there's around 8KB+ of system RAM and around 32KBish for programs. But not NES.
I often use "inc" to set flags... you just don't want to do this if you risk setting it more than 255 times between two clears.
3gengames wrote:
Maybe on a personal computer of the 80's where there's around 8KB+ of system RAM and around 32KBish for programs. But not NES.
I'm hardly any kind of authority, but looking over the source of SMB and Metroid I see lots of 1 byte flags.
After you've tried coding for the Atari 2600 and its 128 bytes of RAM, the 2KB of the NES seem like a lot of space, and suddenly flags that use entire bytes don't sound so bad.
tokumaru wrote:
I also like very much the one where you push an address (minus 1) to the stack and then use the RTS instruction to jump to that address. This can be useful for implementing jump tables, and I'm using this a lot in my game.
I have thought of the same thing and used that in modifying PPMCK.
Another (very simple) thing: if a call to a subroutine is followed immediately by the return-from-subroutine instruction, you don't need the return address on the stack; you can just jump directly to the subroutine, making a tail call. (Some NES programs did not use this and I have fixed them.)
I thought of a way to handle an object's state with bit-packed variables (2 bits or 4 variables per byte).
Code:
LDA #$55 ; becomes #$AA
STA var
ldx #4
.Loop:
JSR CheckState
ROL
ROL
STA var
DEX
BNE .Loop
ROL
brk
CheckState:
lda var
bit var
bpl .0x
bvc .10
jmp MakeState00 ; @11
.10: ; @10
jmp MakeState11
.0x:
bvc .00
jmp MakeState10 ; @01
.00: jmp MakeState01 ; @00
MakeState00:
AND #$3F
RTS
MakeState01:
AND #$3F
ORA #$40
RTS
MakeState10:
AND #$3F
ORA #$80
RTS
MakeState11:
ORA #$C0
RTS
This way there aren't a bunch of and's and cmp's.
tokumaru wrote:
After you've tried coding for the Atari 2600 and its 128 bytes of RAM, the 2KB of the NES seem like a lot of space, and suddenly flags that use entire bytes don't sound so bad.
/me shivers thinking back at that one time he tried Atari.
Specific to ca65, but maybe workable in other assemblers:
Call any compatible subroutine and pass stack parameters and no messing around with return address:
Code:
.macro call function, param1, param2, param3, param4, param5, param6
lda #>@return
pha
lda #<@return-1
pha
.ifnblank param6
lda param6
pha
.endif
.ifnblank param5
lda param5
pha
.endif
.ifnblank param4
lda param4
pha
.endif
.ifnblank param3
lda param3
pha
.endif
.ifnblank param2
lda param2
pha
.endif
.ifnblank param1
lda param1
pha
.endif
jmp function
@return:
.endmacro
'function' should use pla to get all paramaters and rts when done.
Improved version:
Code:
.macro call function, param1, param2, param3, param4, param5, param6, param7
.local return
.if .paramcount > 3+1
lda #>(return-1)
pha
lda #<(return-1)
pha
.endif
.ifnblank param7
lda param7
pha
.endif
.ifnblank param6
lda param6
pha
.endif
.ifnblank param5
lda param5
pha
.endif
.ifnblank param4
lda param4
pha
.endif
.ifnblank param3
ldy param3
.endif
.ifnblank param2
ldx param2
.endif
.ifnblank param1
lda param1
.endif
.if .paramcount <4+1
jsr function
.else
jmp function
.endif
return:
.endmacro
Fixes issue of possible page wrap for return address at the cost of a byte or two, only uses stack as needed, local label.
[quote="Movax12"]
Code:
.macro call function, param1, param2, param3, param4, param5, param6
lda #>@return
pha
lda #<@return-1
pha
...
Potential rare but ugly bug here. Let's imagine when expanding the macro, "@return" resolve to $C000 (for example), then it'll push $C0 then $FF, and when your subroutine returns the program will go haywire. So both the high and the low part should get a "-1", not just the low part. Otherwise, nice trick.
Also, that somehow reminds me of some TI's DSPs that doesn't have a "call", "jsr", "bx", "blx" instruction or anything equivalent, with these you have to put the return address in a register (B3) then jump to your subroutine.
Another (small) thing, it's better to use .local control command in CA65 instead of cheap local labels (@foo) for labels inside the macro, because the cheap local label will still be visible outside the macro in this case.
Thanks for the feedback. I did know about .local in a macro, fixed. I think I fixed the return address problem in a slightly hacky way, but I am happy with it.
Movax12 wrote:
Thanks for the feedback. I did know about .local in a macro, fixed. I think I fixed the return address problem in a slightly hacky way, but I am happy with it.
Why not this:
Code:
lda #>(return-1)
pha
lda #<(return-1)
pha
Oh yeah that's much better. Thanks
Added one last revision to that post. It could be improved more, but the idea is there anyway.
Bregalad wrote:
Second part, I analyzed the passing of >4 bytes of parameters very carefully.
bogax's solution is officially the best.
I counted bytes as following, assuming you have M functions that uses N parameters :
- Standard solution (3 last args in registers and anything else stored to ZP before calling) : 9+N bytes for each call, 0 overhead for the calee
Overall : (9+N)*M bytes.
- Nintendo's solution (store arguments after the jsr opcode and have the calee call a function that copies args to ZP) : 3+N bytes for each call, and 3 bytes overhead for the callee, the argument retrieve function takes 32 bytes :
Overall : (6+N)*M + 32 bytes
- bogax's solution (jsr to a special function followed by the adress of the actual function, # of arguments and arguments) : 6+N bytes for each call, 0 overhead for the calee, argument retrieve function takes 29 bytes :
Overall : (6+N)*M + 29 bytes
So bogax's is the winner (for a close call) is M>3, that if a program needs at least 4 functions with more than 3 bytes of arguments.
I haven't checked speed, but I'm sure bogyax's is less slow since it avoids an extra jsr/rts of overhead each time.
Eventually his routine can be improved further more using BRK to simulate a JSR to the argument retrieving function if IRQs aren't used.
Just to confuse the issue some more
It occured to me to move the target addresses and numbers of parameters
into tables. I think this might work out better provided you're calling each
target routine several times.
Code:
;gets parameters inline
;then jumps to a target routine that uses the parameters
;the first byte inline is a pointer into a table of
;addresses of target routines and their associated
;number of parameters
;the low byte of the pointer to the inline parameters
;is kept in y and pushed back on the stack as the
;lo byte of the address the target routine returns to
;the paramters are fetched inline in ascending order
;but put into memory in descending order
;there's no provision for propagating a carry to the
;high byte of the inline pointer/return address
;so the inline parameters should not straddle a page boundary
;you jsr to this routine supplying a pointer as the
;first byte inline followed by the parameters to fetch
;in reverse order
GET_PARAMETERS
pla ;first get the return address
tay ;into the inline parameters pointer
pla ;lo byte in y
sta ptr+1 ;hi byte in ptr
pha ;put the hi byte back on the stack
iny
lda (ptr),y ;get the first byte inline
tax ;and put it in x to point to
lda jmp_tbl_lo,x ;the target routine's paramter's
sta jmp_add ;target address
lda jmp_tbl_hi,x
sta jmp_add+1
lda num_parameters_tbl,x
tax ;number of parameters to fetch
LOOP
iny
lda (ptr),y
sta parameters-1,x
dex
bne LOOP
tya ;put the lo byte of the return address
pha ;back on the stack
jmp (jmp_add)
jmp_tbl_lo
.db <ROUTINE1,<ROUTINE2,<ROUTINE3 ;etc
jmp_tbl_hi
.db >ROUTINE1,>ROUTINE2,>ROUTINE3 ;etc
num_parameters_tbl
.db num_parameters1,num_parameters2,num_parameters3 ;etc
So you jsr to the GET_PARAMETERS routine and the first parameter
inline is a pointer into the GET_PARAMETERS routine parameters table(s)
with an entry for each routine that uses GET_PARAMETERS,
containing that routine's address and the number of parameters to fetch.
strat wrote:
I thought of a way to handle an object's state with bit-packed variables (2 bits or 4 variables per byte).
Code:
LDA #$55 ; becomes #$AA
STA var
ldx #4
.Loop:
JSR CheckState
ROL
ROL
STA var
DEX
BNE .Loop
ROL
brk
CheckState:
lda var
bit var
bpl .0x
bvc .10
jmp MakeState00 ; @11
.10: ; @10
jmp MakeState11
.0x:
bvc .00
jmp MakeState10 ; @01
.00: jmp MakeState01 ; @00
MakeState00:
AND #$3F
RTS
MakeState01:
AND #$3F
ORA #$40
RTS
MakeState10:
AND #$3F
ORA #$80
RTS
MakeState11:
ORA #$C0
RTS
This way there aren't a bunch of and's and cmp's.
If I'm reading that right, you just increment two bits at a time
that is you've got four independent two bit counters that count
to 11 and then wrap back to 00.
How about just incrementing the top two bits where a carry
can't do you any harm?
Code:
lda var
ldx #$04
LOOP
clc
adc #$40 ; increment top two bits
asl ; rotate two bits through
adc #$80 ; the carry
rol
dex
bne LOOP
sta var
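I brute-forced this loop in Python over all 256 starting values, modelling the carry explicitly, and it does increment all four 2-bit fields (my simulation, for the record):

```python
def inc_2bit_fields(a):
    """Simulate bogax's loop: four rotate-and-increment passes over an 8-bit value."""
    for _ in range(4):
        a = (a + 0x40) & 0xFF          # clc / adc #$40: bump the top 2-bit field
        c = (a >> 7) & 1               # asl: carry = old bit 7
        a = (a << 1) & 0xFF
        t = a + 0x80 + c               # adc #$80: pushes bit 7 out to carry
        c, a = t >> 8, t & 0xFF
        a = ((a << 1) | c) & 0xFF      # rol: net effect is a circular rotate by 2
    return a
```

After four passes the rotations add up to a full 8 bits, so each field ends up back in place, incremented mod 4.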
I just thought of a good old trick I'd like to share:
Divide a value by 2^n (in this example 8) but round to the nearest integer, instead of rounding down:
Code:
[...]
lsr A
lsr A
lsr A
adc #$00
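The reason this rounds correctly: the third lsr leaves bit 2 of the original value (the "half" bit) in the carry, and adc #$00 adds it back in. A quick Python check of the equivalence:

```python
def div8_round(v):
    """Model lsr A x3 / adc #$00 on an 8-bit value: divide by 8, round to nearest."""
    carry = (v >> 2) & 1     # the bit shifted out by the third lsr
    return (v >> 3) + carry
```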
Do addition in another base (typically 10):
Code:
lda foo
clc
adc bar ;Supposed to be in the 0-9 range
cmp #10
bcc +
sbc #10
+ sta whatever ;At this point, we've got the low digit in A AND the carry is set if and only if there was an overflow
;Then we can continue with the tens, etc...
lda foo2
adc bar2
cmp #10
etc, etc....
Aah, this is good stuff; only assembly language can do it so elegantly.
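Here's the same digit-by-digit scheme sketched in Python, one decimal digit per byte, least significant digit first (my own framing of the idea, not a translation of anyone's actual code):

```python
def add_decimal(a_digits, b_digits):
    """Add two equal-length digit arrays (0-9 per entry, low digit first)."""
    result, carry = [], 0
    for a, b in zip(a_digits, b_digits):
        s = a + b + carry             # adc digit
        carry = 1 if s >= 10 else 0   # cmp #10 leaves carry set on overflow
        if carry:
            s -= 10                   # sbc #10
        result.append(s)
    return result, carry
```

The returned carry plays the role of the carry flag between digit positions.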
Bregalad wrote:
Code:
[...]
lsr A
lsr A
lsr A
adc #$00
I've done this a couple of times.
Quote:
Do addition in another base (typically 10)
And this too, for scores and such. Instead of converting numbers from binary to decimal just for displaying, I prefer to give each digit 1 byte and do all the math straight in decimal. This only gets complicated if you have to do more than add/subtract/compare these numbers.
Same here, I don't even care to pack the digits into nybbles, because the minor RAM savings isn't worth the loads of shifts you're going to add to your code.
Also, I'm pretty sure most games do this. Only games with a decent amount of numbers and math make it worthwhile to actually store them in binary and convert them for display on the screen. Heh, I'm even sure it would be possible to code an entire RPG while keeping the HP, Mana, Money, etc... coded entirely in BCD at all times. It would also make it harder for people to find cheat codes
Quote:
I've done this a couple of times.
I know it's nothing so spectacular, but it shows the power of assembly language, there is no way to use the carry like this in any high level language.
If I wanted, for example, to do this in C (divide by 8 and round to the nearest integer), I'd have no choice but to do this:
Code:
if ((variable & 0x4) == 0)  /* bit 2 is the rounding bit; the parentheses matter, == binds tighter than & */
    result = variable / 8;
else
    result = variable / 8 + 1;
It would be the exact same, but I think there is no way the compiler would make it this efficient (completely removing the if-else clause), unless it was specifically written with this case in mind.
Bregalad wrote:
I'm even sure it would be possible to code an entire RPG while keeping the HP, Mana, Money, etc... coded entirely in BCD at all times. It would also make it harder for people to find cheat codes
No, it'd be
easier. You just find the location that changes from 0 to 1 when your character earns 100 cio.
Quote:
If I wanted for example to do this in C (divide by 8 and round to the nearest integer) I'd have no choice but to do this :
Code:
if ((variable & 0x4) == 0)
    result = variable / 8;
else
    result = variable / 8 + 1;
To round to nearest, add half the divisor before floor-dividing:
Code:
// In C, assuming variable and result are unsigned
result = (variable + 4U) / 8U;
# In Python
from __future__ import division
result = (variable + 4) // 8
It would be just as easy (not easier) if they know the data is stored in BCD internally, but if they're looking for binary values they can search for a long time...
Quote:
// In C, assuming variable and result are unsigned
result = (variable + 4U) / 8U;
This would translate into:
Code:
lda variable
clc
adc #$04
lsr A
lsr A
lsr A
sta result
which is only one byte and 2 cycles more than my super-optimized assembly code, so I think it's OK. Great find!
For signed variables when using this method, if the argument is negative, you have to negate the variable first, then perform the division, and then negate it again to get correct rounding. Just thought I'd mention that since it might not be immediately obvious.
I just stumbled into this one: How to check an unsigned byte range with one branch:
Code:
; assume register a has the value to check:
clc
adc #(256 - LessThanThis)
cmp #(EqualOrGreaterThanThis + ( 256 - LessThanThis) )
bcs IsInRange
As well, EqualOrGreaterThanThis cannot be more than (LessThanThis - 1 ). You could reverse the condition and check for numbers outside this range with bcc.
Edit: Oops... fixed some mistakes. I think that is correct now.
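I brute-forced this over all 8-bit values in Python to convince myself (lo and hi are my names for EqualOrGreaterThanThis and LessThanThis):

```python
def in_range(a, lo, hi):
    """Model: clc / adc #(256-hi) / cmp #(lo+256-hi) / bcs IsInRange."""
    t = (a + 256 - hi) & 0xFF      # the 8-bit add wraps around for a >= hi
    return t >= (lo + 256 - hi)    # cmp leaves carry set when A >= operand
```

The constraint lo <= hi - 1 keeps the cmp operand in byte range, matching the note above.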
Didn't check if it was already mentioned, but here's something I have used a couple of times:
Code:
cmp something
lda #0
rol a
I.e. how to set a flag based on the comparison result without using branches, in constant time.
Now if I could figure out how to do that for arbitrary values. Currently I am doing this:
Code:
ldy Bat_YCoordinateHi ;4
bcs @nohide ;3
ldy #$FF ;-1+2
jmp @hideornohide ;3
@nohide:
nop
nop ; 4 balance cycles
@hideornohide:
Or with word data:
Code:
setw $00, BatData1 ;10
lda FrameCounter ;3
and #8 ;2
bne @frame0 ;3
setw $00, BatData2 ;-1+10
jmp @common ;+3
@frame0:jsr rts12 ; 12 balance cycles: jsr to some random location that contains just rts.
@common:ldx #8*4 ; Sprite index ;2
Where setw is defined as lda #, sta, lda #, sta.
(In my Castlevania II retranslation patch I have a savescreen with animation going on, and the animation is performed during scanline wait loops, and those loops are cycle-counted and ensured that all branches yield the same number of cycles.)
If the data is compile-time constant, you could do:
Code:
cmp ...
lda #0
rol a
tax
ldy @values,x
.pushseg
.data
@values: .byte $55, $FF ; example
.popseg
Which is 10 cycles excluding the cmp, 1 cycle fewer than the above.
To set A to one of two arbitrary constant values based on carry in eight cycles, do something like this:
Code:
; CMP goes here
lda #$FF
adc #$00
; here, A = $00 if C was true else $FF; C unchanged
and #value_if_set ^ value_if_clear
; here, A = $00 if C was true else the bitwise difference
eor #value_if_set
All this requires is that the value if carry is set and the exclusive OR of the two values, or (with a small change) the value if carry is set and the difference of the values, be known.
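A Python rendition of the three instructions, to show the algebra (carry is 1 if the preceding CMP set it):

```python
def select(carry, value_if_set, value_if_clear):
    """Model: lda #$FF / adc #$00 / and #(s^c) / eor #s."""
    a = (0xFF + carry) & 0xFF            # $00 if carry set, $FF if clear
    a &= value_if_set ^ value_if_clear   # $00 or the bitwise difference
    return a ^ value_if_set              # collapses to one value or the other
```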
Wow, this is cool! But it takes 8 bytes, while my variant:
Code:
lda #value1
bcc +
lda #value2
+
Only takes 6. Also it is 5/6 cycles, not constant, but almost.
Bregalad wrote:
Only takes 6
Taking 6 or 5 cycles is fine in some cases, and I do this a lot in Thwaite for velocity modifications based on TV system because Thwaite doesn't do any raster effects other than a split on some fairly static screens to switch to the bank with font tiles. If you're trying to do something in constant time, on the other hand, you need to eliminate all time variability. Extending it to constant time would likewise take 8 bytes and 8 cycles:
Code:
; CMP goes here
lda #value_if_clear
bcc +
+
bcc +
lda #value_if_set
+
EDIT: clarify intent for the record
Quote:
It takes 6 or 5.
I meant, 6 bytes (instead of 8 in your variant).
Flipping bit 0 on non-zp variable if carry is set (9 bytes, constant 12 cycles) (clears carry):
Code:
; Flip direction if carry was set
lda #0 ;2
rol a ;2
eor Bat_Direction ;4
sta Bat_Direction ;4
To flip bit 7, change "rol" into "ror".
To change flipping into setting, change eor into ora.
Flipping bit 0 on non-zp variable if carry is CLEAR (11 bytes, constant 14 cycles) (clears carry):
Code:
; Flip direction if carry was clear
lda #0 ;2
rol a ;2
eor #1 ;2
eor Bat_Direction ;4
sta Bat_Direction ;4
To flip bit 7, change "rol" into "ror" and #1 into #$80.
To change flipping into setting, change the second eor into ora.
Flipping any bit on non-zp variable if carry is set (12 bytes, constant 14 cycles) (doesn't change carry):
Code:
; Flip direction if carry was set
lda Bat_Direction ;4
bcc *+4 ;3
bcc *+6 ;3
eor #2 ;-2+2 (flip bit 1)
sta Bat_Direction ;4
To flip if carry is clear, change both bcc into bcs.
To change flipping into setting, change eor into ora.
To change flipping into clearing, change eor into and and invert the constant.
To transfer the carry flag into any particular bit on a non-zp variable (15 bytes, constant 15 cycles) (doesn't change carry):
Code:
lda Bat_Direction ;4
bcc @clear ;3
ora #2 ;-1+2
bne @ok ;3 :: 11 so far
@clear: and #(255-2) ;2
nop ;2 :: 11 so far
@ok: sta Bat_Direction ;4
To invert the flag, change the bcc into bcs.
Subtract 2 bytes and 2 cycles from each example if your variable is in zeropage.
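For the first of these snippets, a quick Python model (var is the non-zp variable, carry is 0 or 1):

```python
def flip_bit0_if_carry(carry, var):
    """Model: lda #0 / rol a / eor var / sta var -- flip bit 0 when carry set."""
    a = carry          # rol a shifts the carry into bit 0 of $00
    return a ^ var     # eor flips bit 0 only when carry was set
```

Swapping the rol for a ror would move the carry into bit 7 instead, as described above.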
Ah, how many times I did the good old
Code:
lda #$0
rol A
trick to have A = 0 if there is no carry and A = 1 if there is carry.
Also, when acting on a simple flag I usually use the carry, but sometimes a situation comes up where you have to use an ADC or CMP instruction that will affect the flag (and overwrite it). I came up with an elegant solution:
Code:
[...]
ror Temp ;Our flag in bit 7 of temp (we don't care about what value was stored in temp)
[...]
adc #whatever ;Carry is affected
[...]
bit Temp ;Test the flag
bpl _flag_is_clear
[....]
asl Temp ;If anyhow the flag should be restored to the carry
Dunno if this is clever enough for this topic, but here's something basic I only just now thought of. Often I find myself adding an 8 bit number to a 16 bit number like this:
Code:
lda low16
clc
adc low8
sta low16
lda high16;3 (assuming zero page)
adc #$00;2
sta high16;3
when this would work, since the high byte of the 8-bit addend is always 0:
Code:
lda low16
clc
adc low8
sta low16
bcc skip;2/3
inc high16;5 (assuming zero page)
skip:
Saves two bytes and a couple of cycles.
I might make it a macro.
Edit: Ah yeah... but you could gain time on a page cross. It'd be rare, though
Edit2: Wait, no, you wouldn't gain time. Because when the branch is taken the inc isn't run.
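Both versions compute the same 16-bit sum; here's a brute-force equivalence check in Python (variable names follow the snippets):

```python
def add16_8(low16, high16, low8):
    """Model: lda low16 / clc / adc low8 / sta low16 / bcc skip / inc high16."""
    s = low16 + low8
    low16 = s & 0xFF
    if s > 0xFF:                       # carry set: the bcc falls through
        high16 = (high16 + 1) & 0xFF   # inc high16
    return low16, high16
```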
I don't know if it's really a trick, or if it's been mentioned. When comparing signed values, you can just flip the sign bits and compare them as you would any other values.
Code:
lda val1
eor #$80
cmp #<the value you'd compare against, pre-XORed with $80>
Or against a variable:
Code:
lda val1
eor #$80
sta temp
lda val2
eor #$80
cmp temp
You guys could probably figure out something faster for the 2nd code.
This trick was mentioned in the old Allegro documentation: to convert signed samples to unsigned, you just had to flip the MSB. What this does is shift the range (so e.g. -0x80~0x7F gets shifted to 0x00~0xFF - it's equivalent to adding 0x80 to all values).
Flipping the MSB seems the best way to deal with comparing to a (pre-flipped) constant. But otherwise I would refer to the link tepples posted a few posts back. It has a method using the N and V flags that avoids using a temp variable.
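The bias trick is easy to brute-force: XORing with $80 maps the signed order -128..127 onto the unsigned order 0..255, so a plain unsigned compare gives the signed answer. In Python:

```python
def to_signed(v):
    """Interpret an 8-bit value as two's complement."""
    return v - 256 if v & 0x80 else v

def signed_lt(a, b):
    """Compare two 8-bit signed values via eor #$80 and an unsigned compare."""
    return (a ^ 0x80) < (b ^ 0x80)
```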
I revive this old thread with some recent thoughts about the use of the identity table by tokumaru (1st post in thread, also
see here).
Basically, you can do these operations:
Code:
((X or Y) ± cst) [OP A] → A
(X ± cst) → Y
(Y ± cst) → X
cst is an unsigned constant that you want to add to or subtract from X or Y;
[OP A] is some optional ALU operation with A; obviously, if OP is
cmp, A isn't affected.
The restriction is that the addition/subtraction of X/Y with the constant must not overflow/underflow; otherwise the result is undefined. If one wants (((X or Y) ± cst) & 0xFF) and never any undefined result, they'll need an identity table twice the size, or it could be done in hardware with a non-inverting tri-state buffer that connects A0-A8 to D0-D8, with output enabled by reads into a memory area of interest.
As you might have guessed, these pseudo-instructions use absolute indexed addressing, and the arithmetic is just a constant offset added to the identity table address. Undefined results happen simply when the access goes outside the table. Assuming the table is page-aligned, when a constant is
added to X or Y it takes 4 cycles, whereas the subtractive case takes 5 cycles because of the page crossing.
Some examples of use: (idt is the identity table)
Code:
ldy idt+4,X ; Y = X + 4
cmp idt-1,Y ; Compare A with Y - 1, Y is != 0
adc idt+3,X ; ADC A with X+3
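In high-level terms, the table turns an indexed read into an add; a minimal Python sketch of the additive case:

```python
idt = bytes(range(256))  # the identity table: one page of ROM where idt[i] == i

def ldy_idt(x, cst):
    """ldy idt+cst,X -> Y = X + cst, valid only while x+cst stays inside the table."""
    return idt[x + cst]  # going past the table is the "undefined result" case
```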
Code:
HideSprites: ; Make all sprites in OAM invisible (in next sprite DMA)
; Could be modified to hide some sprites only
lda #$FF
ldx #0
: ldy idt+4,x
sta OAMbuff,x
ldx idt+4,y
sta OAMbuff,y
cpy #$FC
bne :-
rts
It's possible that the HideSprites routine above is the fastest one without using unofficial opcodes and without total unrolling (and without clearing a bit in $2001, you fools!). There's very little unrolling here, just what's necessary so it works.
EDIT: fixed (enhanced) code.
Quote:
[...]the fastest one without using unofficial opcodes and without total unrolling
Who said total unrolling was required?
What you're doing is 2 loop iterations in one, which gains you roughly 50% of the speed you'd gain from total unrolling.
If you do 4 loop iterations in one, you gain 75% of the speed of total unrolling.
There's really no point in going any further than that...
Code:
[...] blah blah initialization
_loop
sta $200,Y
sta $240,Y
sta $280,Y
sta $2c0,Y
dey
dey
dey
dey
bne _loop
If this isn't fast enough then some other part of your code has a major problem.
Jarhmander's code can start the clear at any X position, not just 0, which allows using it to clear the rest of the sprites after the ones that are already displayed.
I mentioned unrolled loops because obviously it would be faster (and of limited utility). Also, I tried to showcase a useful example, but it turns out that even when getting rid of the last cpy (by putting a 0 byte after the identity table, a bit of a hack) it is as fast as the following with the same number of stores in the loop:
Code:
ldx #0
clc
: lda #$FF
sta OAMbuff,x
sta OAMbuff+4,x
txa
adc #8
tax
bcc :-
32 iterations, 21 cycles for the loop. Bad example, then.
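For what it's worth, I simulated that loop in Python to check that it really does touch every 4th OAM byte before X wraps:

```python
def hide_sprites():
    """Simulate the loop: store $FF at OAMbuff+x and OAMbuff+4+x, x += 8, 32 times."""
    oam = [0] * 256
    x = 0
    while True:
        oam[x] = 0xFF         # sta OAMbuff,x
        oam[x + 4] = 0xFF     # sta OAMbuff+4,x
        t = x + 8             # txa / adc #8 / tax (carry known clear on entry)
        x = t & 0xFF
        if t > 0xFF:          # bcc :- keeps looping until the add wraps
            break
    return oam
```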
This will probably sound trivial to some, but here it goes:
Testing for equality without affecting the carry flag
Yesterday I caught myself in a situation where I needed to verify if a variable contained a specific value, but the carry flag was holding the result of a previous operation that I would like to keep for a future decision. This meant I couldn't use CMP, since that would modify the carry flag. Then, I realized I could use EOR for this:
Code:
lda Variable
eor #VALUE
bne NotEqual
EOR is a drop-in replacement for CMP in this case (unless you need the value in the accumulator to be preserved, of course), there's no need to change anything else. This works because EOR turns bits that are the same in both bytes into 0s, and the bits that are different into 1s, so the only way to get all 0s is if all bits in both bytes are the same.
Again, I'm sure this is known to some people, but since I had never thought of this trick before (I sure hope it wasn't mentioned in this thread already!), I thought it was a good idea to share it here so everyone knows that there's another way to compare numbers for equality that might be useful in a few special cases.
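The equivalence is trivial to brute-force in Python, for the record:

```python
def equal_keep_carry(a, value):
    """Model: eor #VALUE / bne NotEqual -- Z set iff a == value; C untouched."""
    return (a ^ value) == 0  # XOR is zero exactly when every bit matches
```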
zeroone wrote:
I've been doing something similar for a while, but I use DEC and INC to flip my flags between false ($00) and true ($FF).
Which is fine until you try to clear to false a flag that is already false. Then the next time you try to set it to true, it'll still be false.
To clear a flag, as in the article:
Code:
lsr flag
To set a flag, beating the article by one byte but adding two cycles:
Code:
sec
ror flag
tepples wrote:
Which is fine until you try to clear to false a flag that is already false.
True, but I only use INC/DEC when the state of the flag is known, which is most of the time in my programs so far. In the few cases when I don't know the value, I indeed have to set the new value the old LDA + STA way.
tepples wrote:
To set a flag, beating the article by one byte but adding two cycles:
Code:
sec
ror flag
As well as not modifying any registers!