faster 'P' emulation

by WedNESday on 2006-02-07 (#9230)

I have read somewhere that it is possible to use the host CPU's status flags in place of the NES's 'P' register. This would mean less code within our CPU emulators. We could then use assembler to access these flags. However, I fear that after a register transfer has been made, changing the PC and CC could overwrite our work. For example:

Code:

inline void OpticCode98()
{
   CPU.A = CPU.Y;
   // ^^ Would set the necessary flags
   CPU.PC++;
   CPU.CC += 2;
   // ^^ Would reset the necessary flags?
}

Can anyone shed more light on this?

by RoboNes on 2006-02-07 (#9243)

I believe the CPU flags are saved and restored when the OS switches to another running program, if that's what you wanted to know.

by blargg on 2006-02-07 (#9256)

WedNESday is probably referring to legacy processor architectures with only one set of status flags that are set as if the last result of an operation was compared with zero (like the 6502, for example LDA, ORA, ADC, etc.). On those, any intervening operations between the flag setting and the branch must be severely limited (STA doesn't affect the flags). It's unlikely that using the status flags would give a speed benefit, because accessing them probably stalls the pipeline, as it's not a common operation to need.

WedNESday, if you do a profile of how often each instruction is used, and look at what 6502 status flags they modify, you might find some opportunities for optimization without using non-portable techniques like this.

by WedNESday on 2006-02-07 (#9262)

Here is what I mean exactly. I want to use the x86 processors status flag while the emulator is running. I know that this method of implementation is possible.

Code:
inline void OpticCode98()
{
CPU.A = CPU.Y;

// if CPU.Y == 0 then the x86's zero flag would be set, no problem

CPU.PC++;
CPU.CC += 2;

// but incrementing these two would modify the x86 status flag, therefore losing the data
}

If we could retain the status flags register in the way that we wanted then we could omit data like the following...

Code:
CPU.P &= 0x7D;
if( !CPU.A )
CPU.P += 0x02;
CPU.P += (CPU.A & 0x80);

...from just about every instruction. That would be an obvious speed increase.

by tepples on 2006-02-07 (#9267)

WedNESday wrote:
Here is what I mean exactly. I want to use the x86 processors status flag while the emulator is running.

For that, you probably have to use assembly language. C definitely won't work, but C-- (C minus minus) might.

by Nessie on 2006-02-07 (#9273)

Instead of keeping all flags in a single byte, you should use one boolean for each flag. That way, you won't have to mask out any bits whenever you want to access a flag. When you push the status register to the stack you just convert those eight booleans to a single byte (and the other way around when you pull the status from stack).
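
A minimal sketch in C of the one-boolean-per-flag idea, assuming the usual 6502 status layout NV1BDIZC. The names Flags, pack_flags, unpack_flags and set_nz are made up for illustration and are not taken from any emulator mentioned in this thread; bit 5 is given no storage and is simply forced to 1 when packing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: one bool per 6502 status flag (bit 5 has no storage). */
typedef struct {
    bool n, v, b, d, i, z, c;
} Flags;

/* Build the P byte (NV1BDIZC), e.g. for PHP or an interrupt push. */
static uint8_t pack_flags(const Flags *f)
{
    return (uint8_t)((f->n << 7) | (f->v << 6) | 0x20 | (f->b << 4) |
                     (f->d << 3) | (f->i << 2) | (f->z << 1) | (f->c << 0));
}

/* Split a pulled P byte back into booleans, e.g. for PLP or RTI. */
static void unpack_flags(Flags *f, uint8_t p)
{
    f->n = (p & 0x80) != 0;
    f->v = (p & 0x40) != 0;
    f->b = (p & 0x10) != 0;
    f->d = (p & 0x08) != 0;
    f->i = (p & 0x04) != 0;
    f->z = (p & 0x02) != 0;
    f->c = (p & 0x01) != 0;
}

/* N/Z update shared by loads and transfers: two stores, no masking of P. */
static void set_nz(Flags *f, uint8_t value)
{
    f->n = (value & 0x80) != 0;
    f->z = (value == 0);
}
```

With this layout a transfer such as TYA would end in set_nz(&flags, CPU.A), and the packed byte only gets built on the comparatively rare push and pull instructions.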

You don't want to keep your flags in the x86 flag register (too much overhead), but you can use the x86 flags after an arithmetic operation to set your own boolean flag vars.
Here's an ADC example in assembly. I'm not sure exactly how to make your C compiler understand asm.

Code:
// ADC, operand is in al

shr flagC, 1 // Shift our C flag into the x86 carry flag
adc a, al // Do an adc and let the x86 set all flags

sets flagS // Yay, here we
setz flagZ // use the x86 flags
setc flagC // to set our booleans
seto flagV // to the proper values

I believe this is how Q did it in Nintendulator, you should check his source.
There's also blargg's approach where you don't evaluate any flags until you need to. It's described somewhere in the wiki.
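
For the deferred approach, here is a rough C sketch of the general idea of keeping the last result around and only deriving N and Z when something asks for them. This is a reading of the idea, not a copy of the wiki description; LazyCpu, op_tya, flag_z, flag_n and set_nz_from_p are hypothetical names, and the 16-bit nz encoding is an assumption chosen so that N and Z can both be set after a PLP.

```c
#include <stdint.h>

typedef struct {
    uint8_t  a, y;
    /* Deferred N/Z: Z means the low byte is zero,
       N means bit 7 or bit 15 is set. */
    uint16_t nz;
} LazyCpu;

/* TYA: the only flag work is remembering the result. */
static void op_tya(LazyCpu *cpu)
{
    cpu->a  = cpu->y;
    cpu->nz = cpu->a;
}

/* Flags are derived on demand (branches, PHP, interrupts). */
static int flag_z(const LazyCpu *cpu) { return (cpu->nz & 0x00FF) == 0; }
static int flag_n(const LazyCpu *cpu) { return (cpu->nz & 0x8080) != 0; }

/* Load N and Z from a pulled status byte (PLP/RTI). Bit 15 carries N
   on its own, so the low byte can stay zero when Z is also set. */
static void set_nz_from_p(LazyCpu *cpu, uint8_t p)
{
    cpu->nz = (uint16_t)(((p & 0x80) << 8) | ((p & 0x02) ? 0x00 : 0x01));
}
```

The per-instruction cost drops to a single store, at the price of a little extra work in the few places that actually read N or Z.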

--Martin

by mozz on 2006-03-07 (#10475)

Nessie wrote:
Instead of keeping all flags in a single byte, you should use one boolean for each flag.

For a while I was working on x86 assembly code for an SPC700 core (the sound chip for SNES, which is 6502-like). Using the x86 flags is a nice trick you can do if your core is written in assembly. In nearly all cases, the x86 instructions incidentally compute into x86 flags the values you need for the 6502 flags. Here's some example code to save away the x86 flags. This is common tail code I would stick right before my dispatch loop and jump into for non-RMW instructions (warning: UNTESTED):

Code:
vhcnz_tail: seto [ebp+FLAG_V] ; 4 *3 0000000V
lahf ; 1 1
mov [ebp+FLAG_H],ah ; 3 *2 ???H????
cnz_tail: setc [ebp+FLAG_C] ; 4 *3 0000000C
nz_tail: lahf ; 1 1
mov [ebp+FLAG_NZ],ah ; 3 *2 NZ??????

Keep in mind that after a subtract, the carry flag in 6502 has the opposite value from the x86 carry flag. So use SETNC for that case.
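
To make the inverted carry concrete, here is a small C model of a binary-mode 6502 SBC (decimal mode ignored). Alu and sbc are illustrative names only. The borrow computed in the middle is what the x86 carry flag would hold after SBB, and the 6502 C flag is its complement on both the way in and the way out.

```c
#include <stdint.h>

typedef struct {
    uint8_t a;   /* accumulator */
    uint8_t c;   /* 6502 carry, 0 or 1 */
    uint8_t v;   /* 6502 overflow, 0 or 1 */
} Alu;

static void sbc(Alu *s, uint8_t operand)
{
    /* 6502: C clear means "borrow one more". x86 SBB wants CF set for that. */
    unsigned borrow_in = s->c ? 0u : 1u;

    /* Unsigned arithmetic wraps, so bit 8 is set exactly when a borrow occurred. */
    unsigned diff = (unsigned)s->a - operand - borrow_in;
    uint8_t result = (uint8_t)diff;
    unsigned borrow_out = (diff >> 8) & 1u;   /* what x86 CF would hold */

    /* Signed overflow: operands differ in sign and the result's sign differs from A. */
    s->v = (uint8_t)((((s->a ^ operand) & (s->a ^ result)) >> 7) & 1);
    s->c = (uint8_t)(borrow_out ^ 1u);        /* the SETNC step */
    s->a = result;
}
```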

The NZ flags are (almost?) always set together, so it's convenient to combine them into one byte. Note that N and Z are stored in bits 7 and 6 of the LAHF result. Half-carry is stored in bit 4. So those are the meaningful bits of my FLAG_NZ and FLAG_H bytes. Whereas I use bit 0 in the FLAG_C and FLAG_V bytes.

The SETcc instructions are available and efficient on all modern x86 processors. The LAHF looks pretty efficient on paper but I haven't really tried this stuff so I'm not 100% sure. On paper at least, on a Pentium II or III it's a 1-uop instruction with a 3-cycle latency and on an AMD chip it's a direct-path instruction with a 3-cycle latency. So it's no more costly than a cache-hit load. LAHF is a nice way to get at the x86 Half-carry flag too, which (as far as I'm aware) works exactly the same as the Carry flag (where you have to flip it for SBC). Here's a snippet of code for a read-modify-write SBC instruction:

Code:
op2_sbc_t_t2_W8: mov dl,[ebp+FLAG_C] ; 3 *1
sub dl,1 ; 3 1 CF=(!C)
sbb al,cl ; 2 *2
sbc_tail_w8: lahf ; 1 1
seto [ebp+FLAG_V] ; 4 *3
setnc [ebp+FLAG_C] ; 4 *3 C=(!CF)
xor ah,0x10 ; 3 1
mov [ebp+FLAG_H],ah ; 3 *2 H=(!AF)
jmp short nz_tail_w8.1 ; 2 1

The only tricky part about having this non-uniform representation of the flags is what to do when you need to merge them into a 6502 flag byte, or split a byte of 6502 flags back into your internal representation. Here's some more code (again, UNTESTED):
Code:
%macro MERGE_FLAGS 0 ; what we need: NVP0H0ZC
mov cl,[ebp+FLAG_H] ; 3 1 ???H????
and cl,0x10 ; 3 1 000H0000
shr cl,3 ; 3 1 000000H0
mov al,[ebp+FLAG_NZ] ; 3 1 NZ??????
shr al,7 ; 3 1 0000000N C=Z
adc cl,cl ; 2 1 00000H0Z
mov bl,[ebp+FLAG_C] ; 3 1 0000000C
add al,al ; 2 1 000000N0
add cl,cl ; 2 1 0000H0Z0
add al,[ebp+FLAG_V] ; 3 *2 000000NV
add cl,bl ; 2 1 0000H0ZC
add al,al ; 2 1 00000NV0
add al,[ebp+FLAG_P32+1] ; 3 *2 00000NVP
shl al,5 ; 3 1 NVP00000
add al,cl ; 2 1 NVP0H0ZC
%endmacro

%macro SPLIT_FLAGS 0 ; start with: NVP?H?ZC
test al,0x20 ; 2 1 NZ=(P)
setnz [ebp+FLAG_P32+1] ; 4 *3 ---> 0000000P
mov bl,al ; 2 1 NVP?H?ZC
test al,0x01 ; 2 1 NZ=(C)
setnz [ebp+FLAG_C] ; 4 *3 ---> 0000000C
add al,al ; 2 1 VP?H?ZC0 C=N
and bl,0x80 ; 3 1 N0000000
mov [ebp+FLAG_H],al ; 3 *2 ---> xxxHxxxx
rol al,4 ; 3 1 ?ZC0VP?H
test al,0x08 ; 2 1 NZ=(V)
setnz [ebp+FLAG_V] ; 4 *3 ---> 0000000V
and al,0x40 ; 2 1 0Z000000
add bl,al ; 2 1 NZ000000
mov [ebp+FLAG_NZ],bl ; 3 *2 ---> NZxxxxxx
%endmacro
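
Since the macros above are marked untested, a plain C reference model of the same merge and split can be handy for checking them against. This sketch assumes the representation described earlier (N and Z in bits 7 and 6 of one byte, H in bit 4 of another, C, V and P in bit 0 of their own bytes) and the target layout NVP0H0ZC. SplitFlags, merge_flags and split_flags are hypothetical names. Unlike the assembly, which adds the C, V and P bytes directly and so relies on them being exactly 0 or 1, the model masks them.

```c
#include <stdint.h>

typedef struct {
    uint8_t nz;  /* N in bit 7, Z in bit 6, other bits don't matter */
    uint8_t h;   /* H in bit 4, other bits don't matter */
    uint8_t c;   /* bit 0 */
    uint8_t v;   /* bit 0 */
    uint8_t p;   /* bit 0 */
} SplitFlags;

/* Internal representation -> NVP0H0ZC */
static uint8_t merge_flags(const SplitFlags *f)
{
    return (uint8_t)((f->nz & 0x80)            /* N stays in bit 7   */
                   | ((f->v & 1) << 6)         /* V -> bit 6         */
                   | ((f->p & 1) << 5)         /* P -> bit 5         */
                   | ((f->h & 0x10) >> 1)      /* H: bit 4 -> bit 3  */
                   | ((f->nz & 0x40) >> 5)     /* Z: bit 6 -> bit 1  */
                   | (f->c & 1));              /* C -> bit 0         */
}

/* NVP?H?ZC -> internal representation (bits 4 and 2 are ignored) */
static void split_flags(SplitFlags *f, uint8_t psw)
{
    f->nz = (uint8_t)((psw & 0x80) | ((psw & 0x02) << 5));
    f->h  = (uint8_t)((psw & 0x08) << 1);
    f->c  = (uint8_t)(psw & 1);
    f->v  = (uint8_t)((psw >> 6) & 1);
    f->p  = (uint8_t)((psw >> 5) & 1);
}
```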

The numbers in the ; comments are the instruction size in bytes and the number of uops on a P2/P3. A * marks insns that have to pass through the first decoder on a P2/P3 (remember, they used the 4-1-1 decoder template). That doesn't matter for P4s but it probably does for modern Pentium Ms (I've never bothered to look into that).

Hopefully reading the above will give people some clever ideas.

Here's another (unrelated) trick I came up with, to cheaply support executing code out of I/O port memory: the SPC700 has only a small range of I/O ports in its address space, the rest is basically RAM. I use handler functions for those addresses which store the port state somewhere else; then I fill those bytes within the address space with 0xFF, the opcode for a rarely-used instruction (STOP). Instruction fetch then ignores the possibility that I fetched an opcode from a port address. The port check is actually done in the handler for the STOP instruction, and if it turns out we fetched the 0xFF from a port, it fixes up the cycle counter, does a "real" port fetch using the port handler and then dispatches again to the new opcode. In the SPC700 case my port check is so fast it might not matter (two alu insns and one highly-predictable branch).
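
A rough C sketch of that sentinel trick, under some loudly stated assumptions: the port range is taken to be 0x00F0-0x00FF, the ports array stands in for whatever the real port handlers do, and the names Spc, spc_init, read_port and resolve_opcode are invented here. The cycle-counter fix-up is left out, and in a real core the "is it STOP?" test would just be the normal opcode dispatch rather than an explicit comparison.

```c
#include <stdint.h>

enum { PORT_FIRST = 0x00F0, PORT_LAST = 0x00FF, OP_STOP = 0xFF };

typedef struct {
    uint8_t  mem[0x10000];  /* flat address space; port addresses hold OP_STOP */
    uint8_t  ports[16];     /* hypothetical stand-in for the real port state */
    uint16_t pc;
} Spc;

/* Fill the port range with the sentinel opcode. */
static void spc_init(Spc *s)
{
    for (unsigned a = PORT_FIRST; a <= PORT_LAST; a++)
        s->mem[a] = OP_STOP;
}

/* Hypothetical port handler. */
static uint8_t read_port(const Spc *s, uint16_t addr)
{
    return s->ports[addr - PORT_FIRST];
}

/* Returns the opcode that should actually be executed at pc. */
static uint8_t resolve_opcode(const Spc *s)
{
    uint8_t op = s->mem[s->pc];      /* fast path: plain array read, no range check */
    if (op != OP_STOP)
        return op;

    /* "STOP handler": only here do we pay for the port check. */
    if (s->pc >= PORT_FIRST && s->pc <= PORT_LAST)
        return read_port(s, s->pc);  /* the sentinel came from a port address */
    return OP_STOP;                  /* a genuine STOP in RAM */
}
```

Every ordinary fetch stays a bare array read; only the rare 0xFF opcode pays for the range check.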

[Edit: does the 6502 even have a half-carry flag? Maybe you get off easy not having to worry about that one. Hmm.]

[Edit: I forgot to mention, part of my rationale for combining NZ into one byte was to reduce the number of writes. But since most flag-writing instructions only set two or three flags, and since modern processors have lots of store buffers...maybe it's simpler to just use separate SETcc instructions. Premature optimization is a favorite pastime of mine...]

[Edit: hmm, this got me thinking---if you use SETcc for all flags and lay the flag bytes out in your context structure the way the bits are laid out in the 6502 flags register...then you might be able to merge flags with the MMX instruction PMOVMSKB. I think it makes my head hurt too much.]

by baisoku on 2006-03-07 (#10488)

mozz wrote:
Here's another (unrelated) trick I came up with, to cheaply support executing code out of I/O port memory: the SPC700 has only a small range of I/O ports in its address space, the rest is basically RAM.

Hmm... In a chat I had with Burger Bill once, he told me he stored code in some SNES I/O port registers in one of his titles, Wolfenstein 3-D I believe. Was this a common technique? Neat emulation trick, though.