Nessie wrote:
Instead of keeping all flags in a single byte, you should use one boolean for each flag.
For a while I was working on x86 assembly code for an SPC700 core (the SNES sound CPU, which is 6502-like). Using the x86 flags is a nice trick if your core is written in assembly: in nearly all cases, the x86 instruction incidentally computes into the x86 flags the values you need for the 6502 flags. Here's some example code to save away the x86 flags. This is common tail code I would stick right before my dispatch loop and jump into for non-RMW instructions (warning: UNTESTED):
Code:
vhcnz_tail: seto  [ebp+FLAG_V]      ; 4 *3  0000000V
            lahf                    ; 1  1
            mov   [ebp+FLAG_H],ah   ; 3 *2  ???H????
cnz_tail:   setc  [ebp+FLAG_C]      ; 4 *3  0000000C
nz_tail:    lahf                    ; 1  1
            mov   [ebp+FLAG_NZ],ah  ; 3 *2  NZ??????
Keep in mind that after a subtract, the 6502 carry flag has the opposite value from the x86 carry flag: x86 sets CF on a borrow, while the 6502 clears C on a borrow. So use SETNC for that case.
The NZ flags are (almost?) always set together, so it's convenient to combine them into one byte. Note that N and Z are stored in bits 7 and 6 of the LAHF result, and half-carry is stored in bit 4. So those are the meaningful bits of my FLAG_NZ and FLAG_H bytes, whereas I use bit 0 in the FLAG_C and FLAG_V bytes.
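To make the storage layout concrete, here is a rough C model of an 8-bit ADC followed by the vhcnz_tail code above. This is only a sketch: the Flags struct, adc8() and adc8_example() are names I made up for illustration, the x86 parity flag (bit 2 of AH) is left out because nothing reads it, and like the assembly it has not been run against a real core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical context layout; the fields mirror the FLAG_* offsets above. */
typedef struct {
    uint8_t flag_nz; /* bit 7 = N, bit 6 = Z, other bits are don't-care */
    uint8_t flag_h;  /* bit 4 = H, other bits are don't-care */
    uint8_t flag_c;  /* bit 0 = C */
    uint8_t flag_v;  /* bit 0 = V */
} Flags;

/* Models "adc al,cl" followed by vhcnz_tail. */
uint8_t adc8(Flags *f, uint8_t a, uint8_t b)
{
    unsigned cin  = f->flag_c & 1u;
    unsigned wide = (unsigned)a + b + cin;
    uint8_t  r    = (uint8_t)wide;

    unsigned cf = (wide >> 8) & 1u;                          /* x86 CF */
    unsigned af = (((a & 0x0Fu) + (b & 0x0Fu) + cin) >> 4) & 1u; /* x86 AF */
    unsigned of = (((a ^ r) & (b ^ r)) >> 7) & 1u;           /* x86 OF */
    unsigned sf = r >> 7;
    unsigned zf = (r == 0);

    /* AH after LAHF is SF:ZF:0:AF:0:PF:1:CF (PF omitted in this model). */
    uint8_t ah = (uint8_t)(sf << 7 | zf << 6 | af << 4 | 0x02u | cf);

    f->flag_v  = (uint8_t)of; /* seto [FLAG_V]            */
    f->flag_h  = ah;          /* lahf / mov [FLAG_H],ah   */
    f->flag_c  = (uint8_t)cf; /* setc [FLAG_C]            */
    f->flag_nz = ah;          /* lahf / mov [FLAG_NZ],ah  */
    return r;
}

/* Usage: 0x7F + 0x01 overflows into the sign bit and carries out of bit 3. */
void adc8_example(void)
{
    Flags f = {0, 0, 0, 0};
    assert(adc8(&f, 0x7F, 0x01) == 0x80);
    assert((f.flag_nz & 0xC0) == 0x80); /* N set, Z clear */
    assert((f.flag_h & 0x10) == 0x10);  /* H set */
    assert(f.flag_v == 1 && f.flag_c == 0);
}
```

Readers of the flags only ever mask out the meaningful bits, which is why the same AH byte can be dumped into both FLAG_H and FLAG_NZ.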
The SETcc instructions are available and efficient on all modern x86 processors. LAHF looks pretty efficient on paper, but I haven't really tried this stuff so I'm not 100% sure. On paper at least, on a Pentium II or III it's a 1-uop instruction with a 3-cycle latency, and on an AMD chip it's a direct-path instruction with a 3-cycle latency. So it's no more costly than a cache-hit load. LAHF is also a nice way to get at the x86 half-carry flag (AF, the auxiliary carry), which (as far as I'm aware) works exactly the same as the carry flag, including having to flip it for SBC. Here's a snippet of code for a read-modify-write SBC instruction:
Code:
op2_sbc_t_t2_W8:
            mov   dl,[ebp+FLAG_C]   ; 3 *1
            sub   dl,1              ; 3  1  CF=(!C)
            sbb   al,cl             ; 2 *2
sbc_tail_w8:
            lahf                    ; 1  1
            seto  [ebp+FLAG_V]      ; 4 *3
            setnc [ebp+FLAG_C]      ; 4 *3  C=(!CF)
            xor   ah,0x10           ; 3  1
            mov   [ebp+FLAG_H],ah   ; 3 *2  H=(!AF)
            jmp   short nz_tail_w8.1 ; 2  1
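Here is the same SBC path as a C model, to show where the two inversions happen. Again just a sketch: Flags, sbc8() and sbc8_example() are made-up names, PF is omitted, and I'm assuming nz_tail_w8.1 (not shown above) simply stores AH into FLAG_NZ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical context layout; the fields mirror the FLAG_* offsets above. */
typedef struct {
    uint8_t flag_nz; /* bit 7 = N, bit 6 = Z */
    uint8_t flag_h;  /* bit 4 = H */
    uint8_t flag_c;  /* bit 0 = C */
    uint8_t flag_v;  /* bit 0 = V */
} Flags;

/* Models op2_sbc_t_t2_W8 / sbc_tail_w8: r = a - b - !C. */
uint8_t sbc8(Flags *f, uint8_t a, uint8_t b)
{
    unsigned borrow_in = (f->flag_c & 1u) ^ 1u;         /* sub dl,1: CF = !C */
    unsigned wide      = (unsigned)a - b - borrow_in;   /* sbb al,cl */
    uint8_t  r         = (uint8_t)wide;

    unsigned cf = (wide >> 8) & 1u;                     /* x86 CF = borrow out */
    unsigned af = (((a & 0x0Fu) - (b & 0x0Fu) - borrow_in) >> 4) & 1u; /* nibble borrow */
    unsigned of = (((a ^ b) & (a ^ r)) >> 7) & 1u;      /* signed overflow */
    unsigned sf = r >> 7;
    unsigned zf = (r == 0);

    uint8_t ah = (uint8_t)(sf << 7 | zf << 6 | af << 4 | 0x02u | cf); /* lahf */

    f->flag_v = (uint8_t)of;           /* seto  [FLAG_V]          */
    f->flag_c = (uint8_t)(cf ^ 1u);    /* setnc [FLAG_C]: C = !CF */
    ah ^= 0x10;                        /* xor ah,0x10:    H = !AF */
    f->flag_h  = ah;                   /* mov [FLAG_H],ah         */
    f->flag_nz = ah;                   /* assumed nz_tail_w8.1    */
    return r;
}

/* Usage: 0x10 - 0x01 with C=1 borrows from the high nibble but not overall. */
void sbc8_example(void)
{
    Flags f = {0, 0, 1, 0};
    assert(sbc8(&f, 0x10, 0x01) == 0x0F);
    assert(f.flag_c == 1);             /* no borrow, so C stays set */
    assert((f.flag_h & 0x10) == 0x00); /* nibble borrow clears H */
}
```

Note that the XOR only touches bit 4, so the N and Z bits in AH are still good when the same byte is stored to FLAG_NZ.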
The only tricky part about this non-uniform representation is merging the flags into a 6502 flag byte, or splitting a byte of 6502 flags back into the internal representation. Here's some more code (again, UNTESTED):
Code:
%macro MERGE_FLAGS 0                ; what we need: NVP0H0ZC
            mov   cl,[ebp+FLAG_H]   ; 3  1  ???H????
            and   cl,0x10           ; 3  1  000H0000
            shr   cl,3              ; 3  1  000000H0
            mov   al,[ebp+FLAG_NZ]  ; 3  1  NZ??????
            shr   al,7              ; 3  1  0000000N  C=Z
            adc   cl,cl             ; 2  1  00000H0Z
            mov   bl,[ebp+FLAG_C]   ; 3  1  0000000C
            add   al,al             ; 2  1  000000N0
            add   cl,cl             ; 2  1  0000H0Z0
            add   al,[ebp+FLAG_V]   ; 3 *2  000000NV
            add   cl,bl             ; 2  1  0000H0ZC
            add   al,al             ; 2  1  00000NV0
            add   al,[ebp+FLAG_P32+1] ; 3 *2  00000NVP
            shl   al,5              ; 3  1  NVP00000
            add   al,cl             ; 2  1  NVP0H0ZC
%endmacro
%macro SPLIT_FLAGS 0                ; start with: NVP?H?ZC
            test  al,0x20           ; 2  1  NZ=(P)
            setnz [ebp+FLAG_P32+1]  ; 4 *3  ---> 0000000P
            mov   bl,al             ; 2  1  NVP?H?ZC
            test  al,0x01           ; 2  1  NZ=(C)
            setnz [ebp+FLAG_C]      ; 4 *3  ---> 0000000C
            add   al,al             ; 2  1  VP?H?ZC0  C=N
            and   bl,0x80           ; 3  1  N0000000
            mov   [ebp+FLAG_H],al   ; 3 *2  ---> xxxHxxxx
            rol   al,4              ; 3  1  ?ZC0VP?H
            test  al,0x08           ; 2  1  NZ=(V)
            setnz [ebp+FLAG_V]      ; 4 *3  ---> 0000000V
            and   al,0x40           ; 2  1  0Z000000
            add   bl,al             ; 2  1  NZ000000
            mov   [ebp+FLAG_NZ],bl  ; 3 *2  ---> NZxxxxxx
%endmacro
The numbers in the ; comments are the instruction size in bytes and the number of uops on a P2/P3. A * marks an insn that has to pass through the first decoder on a P2/P3 (remember, they use the 4-1-1 decoder template). That doesn't matter for P4s but it probably does for modern Pentium Ms (I've never bothered to look into that).
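If the bit shuffling in the two macros is hard to follow, this C version does the same merge and split one flag at a time. It's a sketch under my own naming (Flags, merge_flags(), split_flags(), and flag_p standing in for the byte at FLAG_P32+1), and it masks each field, whereas the assembly relies on the C, V and P bytes already being 0 or 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical context layout; flag_p stands in for the byte at FLAG_P32+1. */
typedef struct {
    uint8_t flag_nz; /* bit 7 = N, bit 6 = Z */
    uint8_t flag_h;  /* bit 4 = H */
    uint8_t flag_c;  /* bit 0 = C */
    uint8_t flag_v;  /* bit 0 = V */
    uint8_t flag_p;  /* bit 0 = P */
} Flags;

/* MERGE_FLAGS: build NVP0H0ZC from the split representation. */
uint8_t merge_flags(const Flags *f)
{
    unsigned n = (f->flag_nz >> 7) & 1u;
    unsigned z = (f->flag_nz >> 6) & 1u;
    unsigned h = (f->flag_h >> 4) & 1u;
    unsigned v = f->flag_v & 1u;
    unsigned p = f->flag_p & 1u;
    unsigned c = f->flag_c & 1u;
    return (uint8_t)(n << 7 | v << 6 | p << 5 | h << 3 | z << 1 | c);
}

/* SPLIT_FLAGS: scatter NVP?H?ZC back into the split representation. */
void split_flags(Flags *f, uint8_t psw)
{
    f->flag_p  = (psw >> 5) & 1u;
    f->flag_c  = psw & 1u;
    f->flag_v  = (psw >> 6) & 1u;
    f->flag_h  = (uint8_t)(psw << 1);  /* H moves from bit 3 to bit 4 */
    f->flag_nz = (uint8_t)((psw & 0x80u) | ((psw & 0x02u) << 5)); /* Z: bit 1 -> bit 6 */
}

/* Usage: a split followed by a merge keeps every bit except 4 and 2. */
void flags_roundtrip_example(void)
{
    Flags f;
    split_flags(&f, 0xFF);
    assert(merge_flags(&f) == 0xEB);
}
```

Bits 4 and 2 of the flag byte come back as zero after a round trip, matching the NVP0H0ZC target in the MERGE_FLAGS comment.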
Hopefully reading the above will give people some clever ideas.
Here's another (unrelated) trick I came up with, to cheaply support executing code out of I/O port memory. The SPC700 has only a small range of I/O ports in its address space; the rest is basically RAM. I use handler functions for those addresses which store the port state somewhere else, and I fill those bytes within the address space with 0xFF, the opcode for a rarely-used instruction (STOP). Instruction fetch then ignores the possibility that it fetched an opcode from a port address. The port check is actually done in the handler for the STOP instruction: if it turns out we fetched the 0xFF from a port, it fixes up the cycle counter, does a "real" port fetch using the port handler, and then dispatches again to the new opcode. In the SPC700 case my port check is so fast it might not matter (two ALU insns and one highly-predictable branch).
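In C the idea looks roughly like this. Everything here is illustrative: the Core struct, the function names, and the assumption that the ports sit at 0x00F0 to 0x00FF are mine, the cycle-counter fix-up is left out, and only the STOP handler is sketched.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OP_STOP = 0xFF, PORT_FIRST = 0x00F0, PORT_COUNT = 16 }; /* assumed port range */

typedef struct {
    uint8_t  ram[0x10000];
    uint8_t  port_state[PORT_COUNT]; /* real port contents live outside RAM */
    uint16_t pc;
    int      stopped;
    unsigned port_fetches;           /* how often the slow path ran */
} Core;

void core_init(Core *c)
{
    memset(c, 0, sizeof *c);
    /* Plant the sentinel opcode over every port address. */
    memset(c->ram + PORT_FIRST, OP_STOP, PORT_COUNT);
}

static uint8_t read_port(const Core *c, uint16_t addr)
{
    return c->port_state[addr - PORT_FIRST];
}

/* Executes one instruction and returns the opcode that was dispatched. */
uint8_t step(Core *c)
{
    uint16_t at = c->pc;
    uint8_t  op = c->ram[c->pc++];   /* fast path: fetch never checks for ports */

    if (op == OP_STOP) {             /* this is the STOP handler */
        if ((uint16_t)(at - PORT_FIRST) < PORT_COUNT) {
            op = read_port(c, at);   /* the 0xFF was only a placeholder */
            c->port_fetches++;
        }
        if (op == OP_STOP)
            c->stopped = 1;          /* a genuine STOP */
    }
    /* ...dispatch to the handler for op would go here... */
    return op;
}

/* Usage: fetching from a port address yields the port's value, not 0xFF. */
void port_fetch_example(Core *c)
{
    core_init(c);
    c->port_state[4] = 0x00;
    c->pc = PORT_FIRST + 4;
    assert(step(c) == 0x00);
    assert(c->port_fetches == 1 && !c->stopped);
}
```

The point is that ordinary opcodes pay nothing for the port check; only a fetched 0xFF does.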
[Edit: does the 6502 even have a half-carry flag? Maybe you get off easy not having to worry about that one. Hmm.]
[Edit: I forgot to mention, part of my rationale for combining NZ into one byte was to reduce the number of writes. But since most flag-writing instructions only set two or three flags, and since modern processors have lots of store buffers... maybe it's simpler to just use separate SETcc instructions. Premature optimization is a favorite pastime of mine...]
[Edit: hmm, this got me thinking: if you use SETcc for all flags and lay the flag bytes out in your context structure the way the bits are laid out in the 6502 flags register, then you might be able to merge flags with the MMX instruction PMOVMSKB. I think it makes my head hurt too much.]
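For what it's worth, here is a portable C model of how that PMOVMSKB merge might go. It is pure speculation on top of speculation: the eight-byte layout (byte 0 = C up to byte 7 = N) and the function name are made up, and since PMOVMSKB collects bit 7 of each byte while SETcc writes 0 or 1, the model assumes a PSLLQ by 7 in between.

```c
#include <assert.h>
#include <stdint.h>

/* flags[i] holds 0 or 1 (as written by SETcc) for bit i of the flag byte:
   flags[0] = C, flags[1] = Z, ..., flags[7] = N. Hypothetical layout. */
uint8_t merge_by_movemask(const uint8_t flags[8])
{
    uint64_t q = 0;
    for (int i = 0; i < 8; i++)
        q |= (uint64_t)flags[i] << (8 * i);   /* movq mm0,[flags] */

    q <<= 7;                                  /* psllq mm0,7: bit 0 -> bit 7 of each byte */

    uint8_t mask = 0;
    for (int i = 0; i < 8; i++)               /* pmovmskb eax,mm0 */
        mask |= (uint8_t)(((q >> (8 * i + 7)) & 1u) << i);
    return mask;
}

/* Usage: C, H (bit 3) and N (bit 7) set gives 0x89. */
void movemask_example(void)
{
    const uint8_t flags[8] = {1, 0, 0, 1, 0, 0, 0, 1};
    assert(merge_by_movemask(flags) == 0x89);
}
```

Whether the real instruction sequence would beat the scalar MERGE_FLAGS macro is something I haven't measured.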