A few months back, somebody sent me this test ROM to run on a real GSU board. I don't remember who sent it to me. I've gone through my post history and can't find it. If you recognize it, please chime in.
I tested it on every revision I have on hand. The MARIO-CHIP board (Star Fox) just gave me a blank screen, but that could be a problem with the board itself, since I've been messing around trying to transplant its SRAM. All the other revisions (GSU-1, GSU-1A, and GSU-2) gave the same result.
However, the test does not match any emulator I have tested.
higan-accuracy 0.94
Snes9x 1.53
I'm not even going to bother downloading ZSNES to try it on there. But there you go. I have no idea what the numbers mean; hopefully whoever sent me the ROM sees this and can give some more information.
Probably just counting roughly how many iterations of a loop happen within a given time. Note that this is tricky to get right, since it involves proper emulation of the CPU at the cycle (or even subcycle) level, memory timings (this can throw things off really badly if you don't take everything into account), as well as proper emulation of whatever is being used to do the timing in the first place.
Saying this from experience since I did something similar to benchmark 68000 cycle count accuracy in Mega Drive emulators and none gets it right. Turns out that besides VDP emulation issues, there are memory access timings that were still not being taken into account (RAM refresh is emulated by nothing, and I'm not sure if ROM-side refresh was known at the time either).
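To make that sensitivity concrete, here is a minimal C++ sketch of this kind of measurement. The window length and per-iteration costs are made-up numbers, and `iterations_in_window` is a hypothetical helper, not anything taken from the actual test ROM.

```cpp
#include <cstdint>

// Hypothetical model of the test: the timer gives a fixed window (in cycles),
// and the result is how many whole loop iterations fit inside it.
constexpr std::uint64_t iterations_in_window(std::uint64_t window_cycles,
                                             std::uint64_t cycles_per_iteration) {
  return window_cycles / cycles_per_iteration;
}

// Made-up example values: a 100000-cycle window and a 10-cycle loop body.
// An emulator that charges one cycle too many per iteration (11 instead of 10)
// ends up reporting roughly 9% fewer iterations.
constexpr std::uint64_t kWindow = 100000;
constexpr std::uint64_t kExpected = iterations_in_window(kWindow, 10);   // 10000
constexpr std::uint64_t kOffByOne = iterations_in_window(kWindow, 11);   // 9090
```

A single miscounted cycle per iteration is enough to move the final number visibly, which is why small timing mistakes show up so clearly in a test like this.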
I don't think the MARIO chip supports either 21 MHz mode or fast multiply, though I don't know why it would outright refuse to run...
Also, I was under the impression that fast multiply was disallowed in 21 MHz mode, and it seems byuu thought the same. Of course, just because it executes at the higher speed doesn't mean the answers are right...
Either way, those are some substantial differences. I hope I don't have to wreck a game just to find out if my port of [redacted] is even feasible...
Ah yes, I recognize that screen, thanks for taking the time to run it on hardware, very much appreciated.
I made that ROM to test whether the fast multiplication setting works in 21 MHz mode. The book says no, and so does bsnes, yet some Super FX games I've looked at seem to use it at 21 MHz, so I wanted to see what's up.
Apparently it works on hardware so this change to higan/sfc/chip/superfx/timing/timing.cpp would reflect that:
http://vpaste.net/VYOR0
There might be more to it than that, but it should be a start.
Here's the source for the original test:
https://github.com/ARM9/snesdev/tree/ma ... u/profiler
The reason it crashes on the MARIO chip could be that I don't check the chip version before setting the clock to 21 MHz, since I don't know what VCR returns on the different revisions. fullsnes only has this information:
Known versions: 1=MC1/Blob, ?=MC1/SMD, ?=GSU1, ?=GSU1A, 4=GSU2, ?=GSU2-SP1.
I would guess that 2=GSU1, 3=GSU1A but I don't know for sure.
93143 wrote:
I hope I don't have to wreck a game just to find out if my port of [redacted] is even feasible...
[redacted]
Sik wrote:
Probably just counting roughly how many iterations of a loop happen within a given time. Note that this is tricky to get right, since it involves proper emulation of the CPU at the cycle (or even subcycle) level, memory timings (this can throw things off really badly if you don't take everything into account), as well as proper emulation of whatever is being used to do the timing in the first place.
Saying this from experience since I did something similar to benchmark 68000 cycle count accuracy in Mega Drive emulators and none gets it right. Turns out that besides VDP emulation issues, there are memory access timings that were still not being taken into account (RAM refresh is emulated by nothing, and I'm not sure if ROM-side refresh was known at the time either).
Perhaps the cache functionality of the Super FX emulation in Higan/BSNES is not entirely accurate? Maybe that's why the numbers are off. I do recall hearing emulators like ZSNES and SNES9X do not simulate the cache at all.
(Crossposted from byuu's forum)
The following patch makes bsnes match hardware exactly:
Code:
diff --git a/bsnes/snes/chip/superfx/core/opcodes.cpp b/bsnes/snes/chip/superfx/core/opcodes.cpp
index 7d2f13a..3b14d81 100644
--- a/bsnes/snes/chip/superfx/core/opcodes.cpp
+++ b/bsnes/snes/chip/superfx/core/opcodes.cpp
@@ -366,7 +366,7 @@ template<int n> void SuperFX::op_mult_r() {
   regs.sfr.s = (regs.dr() & 0x8000);
   regs.sfr.z = (regs.dr() == 0);
   regs.reset();
-  if(!regs.cfgr.ms0) add_clocks(2);
+  if(!regs.cfgr.ms0) add_clocks(cache_access_speed);
 }
 //$80-8f(alt1): umult rN
@@ -375,7 +375,7 @@ template<int n> void SuperFX::op_umult_r() {
   regs.sfr.s = (regs.dr() & 0x8000);
   regs.sfr.z = (regs.dr() == 0);
   regs.reset();
-  if(!regs.cfgr.ms0) add_clocks(2);
+  if(!regs.cfgr.ms0) add_clocks(cache_access_speed);
 }
 //$80-8f(alt2): mult #N
@@ -384,7 +384,7 @@ template<int n> void SuperFX::op_mult_i() {
   regs.sfr.s = (regs.dr() & 0x8000);
   regs.sfr.z = (regs.dr() == 0);
   regs.reset();
-  if(!regs.cfgr.ms0) add_clocks(2);
+  if(!regs.cfgr.ms0) add_clocks(cache_access_speed);
 }
 //$80-8f(alt3): umult #N
@@ -393,7 +393,7 @@ template<int n> void SuperFX::op_umult_i() {
   regs.sfr.s = (regs.dr() & 0x8000);
   regs.sfr.z = (regs.dr() == 0);
   regs.reset();
-  if(!regs.cfgr.ms0) add_clocks(2);
+  if(!regs.cfgr.ms0) add_clocks(cache_access_speed);
 }
 //$90: sbk
diff --git a/bsnes/snes/chip/superfx/timing/timing.cpp b/bsnes/snes/chip/superfx/timing/timing.cpp
index aae7820..3f493d0 100644
--- a/bsnes/snes/chip/superfx/timing/timing.cpp
+++ b/bsnes/snes/chip/superfx/timing/timing.cpp
@@ -72,14 +72,17 @@ void SuperFX::update_speed() {
   if(clockmode == 2) {
     cache_access_speed = 1;
     memory_access_speed = 5;
-    regs.cfgr.ms0 = 0; //cannot use high-speed multiplication in 21MHz mode
     return;
   }
   //default: allow S-CPU to select mode
   cache_access_speed = (regs.clsr ? 1 : 2);
   memory_access_speed = (regs.clsr ? 5 : 6);
-  if(regs.clsr) regs.cfgr.ms0 = 0; //cannot use high-speed multiplication in 21MHz mode
+  //According to docs, CLSR and MS0 should not both be set to 1.
+  //Previously it was believed that setting CLSR forced MS0 to 0, but
+  //hardware tests show that this is not the case. It is possible that
+  //multiplication may not work reliably when CLSR and MS0 are both set.
+  //if(regs.clsr) regs.cfgr.ms0 = 0;
 }
 void SuperFX::timing_reset() {
AWJ wrote:
The following patch makes bsnes match hardware exactly:
Really? Awesome.
Why was there a difference with high-speed multiply turned off?
qwertymodo wrote:
Uh... I'm not trying to re-port Doom, if that's what you were trying to imply... I'm referring to that shmup I keep dragging into the conversation and then refusing to name.
93143 wrote:
Why was there a difference with high-speed multiply turned off?
Because low-speed multiply was always taking an additional 10MHz cycle, rather than either a 10MHz or 20MHz cycle depending on the clock multiplier.
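A rough sketch of the before/after behaviour, assuming `add_clocks()` counts 21 MHz ticks (so 2 ticks is one 10 MHz cycle). The function names here are hypothetical and only mirror the expressions in the patch; they are not bsnes code.

```cpp
// Ticks per GSU cycle, mirroring `cache_access_speed = (regs.clsr ? 1 : 2)`
// from update_speed(). Assumption: the unit is one 21 MHz tick.
constexpr unsigned cache_access_speed(bool clsr) { return clsr ? 1u : 2u; }

// Extra ticks charged to mult/umult before the patch:
// always 2 when high-speed multiply (MS0) is off, regardless of CLSR.
constexpr unsigned mult_extra_old(bool ms0, bool /*clsr*/) {
  return ms0 ? 0u : 2u;
}

// Extra ticks after the patch: one GSU cycle at whatever speed CLSR selects.
constexpr unsigned mult_extra_new(bool ms0, bool clsr) {
  return ms0 ? 0u : cache_access_speed(clsr);
}
```

With CLSR set and MS0 clear, the old code charges 2 ticks where the patched code charges 1, which is the case the test ROM exercised.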
ARM9, can you write another program that tests these two instructions?
Code:
//$9f(alt0): fmult
void GSU::op_fmult() {
  uint32_t result = (int16_t)regs.sr() * (int16_t)regs.r[6];
  regs.dr() = result >> 16;
  regs.sfr.s = (regs.dr() & 0x8000);
  regs.sfr.cy = (result & 0x8000);
  regs.sfr.z = (regs.dr() == 0);
  regs.reset();
  step(4 + (regs.cfgr.ms0 << 2));
}

//$9f(alt1): lmult
void GSU::op_lmult() {
  uint32_t result = (int16_t)regs.sr() * (int16_t)regs.r[6];
  regs.r[4] = result;
  regs.dr() = result >> 16;
  regs.sfr.s = (regs.dr() & 0x8000);
  regs.sfr.cy = (result & 0x8000);
  regs.sfr.z = (regs.dr() == 0);
  regs.reset();
  step(4 + (regs.cfgr.ms0 << 2));
}
Because I'm fairly sure they're also wrong in bsnes (for one thing, they take more cycles in high-speed mode than in low-speed mode).
AWJ wrote:
Because low-speed multiply was always taking an additional 10MHz cycle, rather than either a 10MHz or 20MHz cycle depending on the clock multiplier.
Thanks; I see that now. Apparently I'm fuzzy-headed today...
This seems like a really convoluted way to improve GSU timing ...
AWJ looks at instructions, ARM9 writes tests, qwertymodo runs tests and reports numbers, and I apply submitted patches.
... but hey, if it improves the emulation, then I'm all for it.
> I do recall hearing emulators like ZSNES and SNES9X do not simulate the cache at all.
That is correct. It's a lot more than one cache, too.
There's a ROM buffer cache, RAM buffer cache, primary pixel buffer cache, secondary pixel buffer cache, and 16x16 instruction cache. Nobody else emulates any of that (well, nobody can inspect what nocash is doing, so I guess we'd have to ask him ...), but I emulate it all. Just that, as you're seeing, it's never been compared to real hardware timings, so it certainly has its issues.
The one thing I don't know how to emulate is what happens when the secondary pixel buffer is full and you are executing a tight loop out of RAM? Does it stall the pixel cache (least likely), stall the CPU loop, or interleave the two operations?
93143 wrote:
qwertymodo wrote:
Uh... I'm not trying to re-port Doom, if that's what you were trying to imply... I'm referring to that shmup I keep dragging into the conversation and then refusing to name.
Sorry, the first thing to pop into my mind was that thread Espozo started about Doom awhile back.
Yeah, I remember. I seem to recall getting really into the question of how much picture one could stuff through VBlank with different levels of compromise. That was not meant to imply that I was going to do something about it myself - I'm already so busy and/or burnt-out that I can scarcely find time to work on the shmup.
Sorry if I disappointed you...
AWJ wrote:
ARM9, can you write another program that tests these two instructions?
Here you go:
https://dl.dropboxusercontent.com/u/134 ... f/mult.sfc
Source:
https://github.com/ARM9/snesdev/tree/master/gsu/mult
There might be some cache issues involved in this one: writing 0 to the Go flag in the GSU status/flags register doesn't invalidate the cache in bsnes. Supposedly, doing so clears all cache flags and sets the CBR to 0x0000. I'm writing some cache tests to help verify.
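For reference, the supposed behaviour could be sketched like this in C++. Everything here is hypothetical: `GsuCacheState` and `write_go` are made-up names, the count of 32 cache lines is an assumption, and the behaviour itself is exactly what still needs verifying on hardware.

```cpp
#include <array>
#include <cstdint>

// Hypothetical, stripped-down view of the state involved.
struct GsuCacheState {
  std::array<bool, 32> valid{};  // per-line "cache flags" (line count assumed)
  std::uint16_t cbr = 0;         // cache base register
  bool go = false;               // Go flag in the status/flags register
};

// Models an S-CPU write to the Go flag.
// Unverified assumption: writing 0 clears every cache flag and resets CBR.
inline void write_go(GsuCacheState& s, bool new_go) {
  s.go = new_go;
  if (!new_go) {
    s.valid.fill(false);
    s.cbr = 0x0000;
  }
}
```

If the hardware tests disagree, only the body of the `if` would need to change.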
byuu wrote:
This seems like a really convoluted way to improve GSU timing ...
haha yeah, I need to make a dev cart, but I'm not yet sure what to get in terms of rom and programmer.
byuu wrote:
The one thing I don't know how to emulate is what happens when the secondary pixel buffer is full and you are executing a tight loop out of RAM? Does it stall the pixel cache (least likely), stall the CPU loop, or interleave the two operations?
I don't know how to go about testing that yet, any ideas?
ARM9 wrote:
AWJ wrote:
ARM9, can you write another program that tests these two instructions?
Here you go:
https://dl.dropboxusercontent.com/u/134 ... f/mult.sfc
Source:
https://github.com/ARM9/snesdev/tree/master/gsu/mult
There might be some cache issues involved in this one: writing 0 to the Go flag in the GSU status/flags register doesn't invalidate the cache in bsnes. Supposedly, doing so clears all cache flags and sets the CBR to 0x0000. I'm writing some cache tests to help verify.
byuu wrote:
This seems like a really convoluted way to improve GSU timing ...
haha yeah, I need to make a dev cart, but I'm not yet sure what to get in terms of rom and programmer.
byuu wrote:
The one thing I don't know how to emulate is what happens when the secondary pixel buffer is full and you are executing a tight loop out of RAM? Does it stall the pixel cache (least likely), stall the CPU loop, or interleave the two operations?
I don't know how to go about testing that yet, any ideas?
Why does this ROM depend on cache behaviour when the previous one didn't?
AWJ wrote:
Why does this ROM depend on cache behaviour when the previous one didn't?
I think it's just that the granularity of the counter is too coarse on the previous one. I could make the S-CPU loop shorter (at the cost of more complexity).
No, that's okay. If we can fix incorrect cache behaviour in bsnes as well, that's two birds with one stone.
I was planning to make a more thorough cache test to check any edge cases there, such as how cache lines are loaded, fetching stalls, and various invalidation methods (for example, what happens if you write to < xxxfh in a cache line from the S-CPU).
byuu wrote:
This seems like a really convoluted way to improve GSU timing ...
AWJ looks at instructions, ARM9 writes tests, qwertymodo runs tests and reports numbers, and I apply submitted patches.
... but hey, if it improves the emulation, then I'm all for it.
You have seen nothing yet, wait until people start bringing in oscilloscopes =P
qwertymodo, we're all waiting for you to run that new test ROM.
mult.sfc
GSU-1 (didn't test it on the other GSU revisions since the last one was identical on all of them)
higan v0.94
As I suspected, fmult and lmult were reversing the sense of ms0 and were also off by one. The manual says that they take "4 or 8 cycles", but one of those cycles is the one that every one-byte instruction takes.
Code:
diff --git a/bsnes/snes/chip/superfx/core/opcodes.cpp b/bsnes/snes/chip/superfx/core/opcodes.cpp
index 3b14d81..35da5ef 100644
--- a/bsnes/snes/chip/superfx/core/opcodes.cpp
+++ b/bsnes/snes/chip/superfx/core/opcodes.cpp
@@ -476,7 +476,7 @@ void SuperFX::op_fmult() {
   regs.sfr.cy = (result & 0x8000);
   regs.sfr.z = (regs.dr() == 0);
   regs.reset();
-  add_clocks(4 + (regs.cfgr.ms0 << 2));
+  add_clocks((regs.cfgr.ms0 ? 3 : 7) * cache_access_speed);
 }
 //$9f(alt1): lmult
@@ -488,7 +488,7 @@ void SuperFX::op_lmult() {
   regs.sfr.cy = (result & 0x8000);
   regs.sfr.z = (regs.dr() == 0);
   regs.reset();
-  add_clocks(4 + (regs.cfgr.ms0 << 2));
+  add_clocks((regs.cfgr.ms0 ? 3 : 7) * cache_access_speed);
 }
 //$a0-af(alt0): ibt rN,#pp
With this patch, bsnes-classic comes very close to hardware, just 2-5 cycles off:
Wow, yeah. Even ignoring my boolean flag inversion (did that with the GBA sequential access speeds too >_>), 4 or 8 cycles isn't right at all. This can be up to 14 cycles. Did not expect CLSR to factor in here, too.
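Spelled out as a small C++ sketch, with the old and new expressions from the patch side by side. The function names are hypothetical, the unit is assumed to be the 21 MHz tick that `add_clocks()` takes, and the CLSR multiplier in the new formula has only been checked against hardware in 21 MHz mode.

```cpp
// Ticks per GSU cycle, mirroring `regs.clsr ? 1 : 2` from update_speed().
constexpr unsigned ticks_per_cycle(bool clsr) { return clsr ? 1u : 2u; }

// Old expression: 4 + (ms0 << 2). High-speed mode (ms0 = 1) came out slower,
// which is the inverted flag mentioned above.
constexpr unsigned fmult_extra_old(bool ms0) {
  return 4u + (ms0 ? 4u : 0u);
}

// New expression: (ms0 ? 3 : 7) * cache_access_speed. The counts are one less
// than the manual's "4 or 8" because the opcode fetch already costs a cycle.
constexpr unsigned fmult_extra_new(bool ms0, bool clsr) {
  return (ms0 ? 3u : 7u) * ticks_per_cycle(clsr);
}
```

The worst case is MS0 clear with CLSR clear: 7 cycles at 2 ticks each, which is where the 14 comes from.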
Retested Winter Gold and SMW2, doesn't seem to cause any regressions. Which is good, the former was always a nightmare.
Looks like we're just a tiny bit too slow in every case. But since it's such a small difference, it's a one-time error rather than cumulative for each loop of this test. Probably some kind of delay in starting the GSU ("go" or whatever)?
Anyway, thanks everyone for all the help on this! v095's shaping up to be a great release :D
byuu wrote:
Wow, yeah. Even ignoring my boolean flag inversion (did that with the GBA sequential access speeds too >_>), 4 or 8 cycles isn't right at all. This can be up to 14 cycles. Did not expect CLSR to factor in here, too.
Anyway, thanks everyone for all the help on this! v095's shaping up to be a great release
Using CLSR there is a complete guess. The timing test ROM only tests 21MHz mode, so it produces the same results whether I multiply by cache_access_speed or not. Perhaps ARM9 can modify the test ROMs to test 10MHz mode as well (in fact, it should be possible just by hex-editing a single byte in the ROM...)
Yeah, and maybe also test for that cache invalidation on stop thing, if that's even practical.
For now though, I think it's probably wise to guess that we multiply off CLSR. The whole point of 21MHz mode is supposed to be that it's "twice as fast"; at least for non-memory accesses.
byuu wrote:
Yeah, and maybe also test for that cache invalidation on stop thing, if that's even practical.
For now though, I think it's probably wise to guess that we multiply off CLSR. The whole point of 21MHz mode is supposed to be that it's "twice as fast"; at least for non-memory accesses.
bsnes is already running the test slightly slower than hardware rather than slightly faster, so adding more cache invalidation will probably only make the error worse.
If you have any other tests you want me to run, just let me know. Just know that it is a bit of a pain to reprogram the cart, since the GSU doesn't provide /CS or /WE signals, meaning I can't program the thing in-circuit from the cart edge and have to resort to desoldering the ROM, cleaning the rosin flux off the pins, reprogramming the chip in a socket, then resoldering it... I wish it was as easy as the Cx4
Wow that is really close, nice one.
qwertymodo wrote:
and have to resort to desoldering the ROM, cleaning the rosin flux off the pins, reprogramming the chip in a socket, then resoldering it... I wish it was as easy as the Cx4
Ouch, that's quite an arduous process. I'm making a test suite so you don't have to reprogram it for each test. So far I've got mult, cache, and plot timing tests, and you can toggle between 10 and 21 MHz. I'm going to add ROM/SRAM buffer tests. Any suggestions are welcome.
Maybe solder in a socket so you don't have to desolder the memory? o.o
I suggest multiply accuracy, possibly as a way to figure out why fast multiply in fast mode was discouraged.
Sik wrote:
Maybe solder in a socket so you don't have to desolder the memory? o.o
It's a TSOP-48 chip, mounted to a SOIC-32 footprint via an adapter board. Not only would it be a huge pain to wire up, I don't have another TSOP test socket just lying around.
ARM9 wrote:
Ouch, that's quite an arduous process. I'm making a test suite so you don't have to reprogram it for each test. So far I've got mult, cache, and plot timing tests, and you can toggle between 10 and 21 MHz. I'm going to add ROM/SRAM buffer tests. Any suggestions are welcome.
I don't know if what you listed covers this, but some of the only remaining game issues in higan-snes are special chip games running slightly faster or slower due to "cache delays" or whatever being unemulated. These issues are very minor and would be hard to detect in a blind comparison, but they're still worth fixing.
One noticeable example is SMW2's long intro that sort of desyncs over time. There's a scratching sound that's supposed to play when Yoshi looks at the map, but in higan it plays much later.
There is a similar issue in MMX2 (Cx4 chip), where Mega Man dies in the attract mode because of unknown cache behavior.
It's been a few weeks with no replies, so I figured I'd check back. Any new tests to run?