Espozo wrote:
Correct me if I'm wrong, but it seems the only advantage the Super FX has over the SA-1 is converting packed pixel to the SNES graphics format in hardware.
EDIT: I misread that as a comparison with the S-CPU. The SA-1 is indeed a much stronger chip than that.
The Super FX still had a number of advantages. For one, the clock speed was higher, at least for the later revisions. For another, it had a faster multiplier - it could do 8x8 in two master cycles, or 16x16 in eight master cycles, or nine if you wanted the full 32 bits of output. (On the other hand, it had no real division functionality as far as I can tell.) Also, the RISC idea wasn't entirely bogus - lots of stuff could be done in a single cycle, and lots more could be done in two, versus an average of about 4 for the 65816. And of course it had 16 general-purpose registers (including the program counter) and you could operate between any of them with no I/O penalty.
Instructions often consisted of a 4-bit opcode and a 4-bit operand register number, with source and destination set by FROM, TO, or WITH. If the source and destination registers were unset, they defaulted to R0, and most instructions reset them; this meant you could get better speed by using R0 as an accumulator of sorts.
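To make the prefix scheme a bit more concrete, here's a toy model in Python. It's a sketch of the behaviour described above, not an emulator: the class and method names are made up, only ADD is modelled, and I'm assuming WITH selects the same register as both source and destination.

```python
class PrefixModel:
    """Toy model of the FROM/TO/WITH prefix scheme (hypothetical names)."""

    def __init__(self):
        self.r = [0] * 16   # R0..R15
        self.sreg = 0       # source selector, defaults to R0
        self.dreg = 0       # destination selector, defaults to R0

    def from_(self, n):
        """FROM Rn: choose the source for the next instruction."""
        self.sreg = n

    def to(self, n):
        """TO Rn: choose the destination for the next instruction."""
        self.dreg = n

    def with_(self, n):
        """WITH Rn: assumed to choose Rn as both source and destination."""
        self.sreg = self.dreg = n

    def add(self, n):
        """ADD Rn: dest = source + Rn (16-bit wrap), then reset the selectors."""
        self.r[self.dreg] = (self.r[self.sreg] + self.r[n]) & 0xFFFF
        self.sreg = self.dreg = 0


m = PrefixModel()
m.r[1], m.r[2], m.r[3] = 10, 20, 5

m.add(1)      # no prefix needed: R0 = R0 + R1
m.add(2)      # still no prefix:  R0 = R0 + R2, accumulator style

m.from_(1)    # one prefix byte...
m.to(4)       # ...and another...
m.add(3)      # ...to get R4 = R1 + R3
```

The two unprefixed adds are one byte each in this model, while the R4 = R1 + R3 case costs three, which is why treating R0 as an accumulator pays off.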
The PLOT functionality was more than just a packed-to-planar converter (which the SA-1 had in the form of a couple of special DMA modes). It had features like automatic checkerboard dither and palettized 8bpp drawing with 4-bit input. It used two of the registers as screen coordinates, removing the need to calculate addresses, and it auto-incremented the x coordinate. All in one cycle, so you could continue with the rest of the algorithm while the pixel caching system did its work.
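For anyone who hasn't written one, this is roughly the packed-to-planar job that PLOT saves you from doing in software. It's a minimal sketch for a single 8-pixel row; the function name is made up, and it ignores the dither and 8bpp palette features and how the plane bytes are interleaved within a tile.

```python
def packed_row_to_planes(pixels, bpp=4):
    """Turn 8 packed pixel values into one byte per bitplane.

    The leftmost pixel ends up in bit 7 of each plane byte, as in SNES tiles.
    """
    if len(pixels) != 8:
        raise ValueError("expected exactly 8 pixels")
    planes = []
    for plane in range(bpp):
        byte = 0
        for x, value in enumerate(pixels):
            bit = (value >> plane) & 1      # this pixel's bit for this plane
            byte |= bit << (7 - x)          # leftmost pixel goes to bit 7
        planes.append(byte)
    return planes


row = packed_row_to_planes([1, 2, 3, 4, 5, 6, 7, 8])
```

That's 32 shift-and-mask steps per 4bpp row before you've even worked out where in the tile the bytes go, so having the hardware do it behind your back is a big deal.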
The Super FX actually had hardware texture mapping capability, kinda. The MERGE opcode takes the top bytes of R7 and R8 and concatenates them into the destination register. If R7 and R8 contain 8.8 fixed-point texture coordinates, and you prefix MERGE with TO R14, you can then read a texel from ROM.
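Here's a sketch of what that buys you, in Python. The helper names and the test texture are made up, and the byte order (R7's top byte in the high half of the result, R8's in the low half) is my understanding of the hardware rather than something I've verified on a cartridge.

```python
def merge(r7, r8):
    """Concatenate the integer parts of two 8.8 fixed-point values."""
    return (r7 & 0xFF00) | ((r8 >> 8) & 0x00FF)


def sample_span(texture, r7, r8, step7, step8, count):
    """Walk two 8.8 coordinates and fetch one texel per step."""
    texels = []
    for _ in range(count):
        # Stands in for TO R14 / MERGE followed by the ROM read.
        texels.append(texture[merge(r7, r8)])
        r7 = (r7 + step7) & 0xFFFF
        r8 = (r8 + step8) & 0xFFFF
    return texels


# Made-up 64 KB texture page: 256 rows of 256 texels.
texture = bytes((i * 7) & 0xFF for i in range(0x10000))

addr = merge(0x1280, 0x34C0)    # coordinates 18.5 and 52.75
span = sample_span(texture, 0x0000, 0x0000, 0x0000, 0x0080, 4)
```

The fractional bits just fall away, so stepping a coordinate by 0x0080 advances one texel every two pixels, with no shifting or masking in the inner loop.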
It may be interesting to note that the default ADD doesn't take carry into account. You need to prefix it with a flag instruction to get ADC. Kinda the opposite of the 65816, where you must prefix ADC with CLC to get ADD...
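The place you notice the difference is multi-word arithmetic. A quick Python sketch of a 32-bit add built from 16-bit pieces, with made-up helper names: the low word uses a plain add with no carry-in, and only the high word needs the add-with-carry form.

```python
def add16(a, b, carry_in=0):
    """16-bit add; returns (result, carry_out)."""
    total = a + b + carry_in
    return total & 0xFFFF, total >> 16


def add32(a, b):
    """32-bit add from two 16-bit adds; returns (result, carry_out)."""
    lo, carry = add16(a & 0xFFFF, b & 0xFFFF)       # plain ADD, no carry-in
    hi, carry = add16(a >> 16, b >> 16, carry)      # ADC picks up the carry
    return (hi << 16) | lo, carry
```

On the 65816 you'd spend a CLC before the low-word add; with a carry-less default, the extra byte goes on the high-word add instead.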
Quote:
I don't know how much better they could have made it
I'm guessing, but a 16-bit chip would obviously have a much larger potential opcode selection, and of course double the bus width at a given memory speed is double the bandwidth. You could do 16-bit addressing in two words, or 32-bit addressing in three, and reading or writing a word would be one cycle. If you were willing to constrain the opcode space a bit, you could do 8-bit addressing in one word, or 24-bit in two, but I'm not sure the internal architecture would be up to the former... Basically everything would be either twice as big or twice as fast, and opcode count would no longer be a significant constraint. You'd get no bonus for 8-bit data, but that was always a bit of a booby prize anyway...
...not to mention that if you eliminated the phi1/phi2 nonsense like Hudson did, you'd double performance for free (assuming the process was up to it)...
As long as we're making wish lists, how about some reasonably quick multiply and divide instructions? The 5A22 has an external multiplier and divider, though they aren't very good, and the SPC700 has them as actual instructions - it uses Y for the upper byte of 16-bit values. A hypothetical 6516 with 16x16 and 32/16 could do something similar, and with a Z register there'd still be two index registers free.
Are we getting into the 68000 price range here?
Quote:
I really don't understand, how did they get the SA-1 to run at 10MHz?
I think the speed of the core wasn't the issue with the S-CPU. It was the memory speed. (It also came out five years earlier. Five years was a long time back then. Remember, the SA-1 only came out a year before the N64...)
With the SA-1, I believe they used 16-bit ROM with a memory controller to split the words for the CPU core. This meant you'd get wait states if you accessed data or had to branch; only linear program counter reads and DMA went at full speed. Somebody did the math, and it seems that if the SA-1 used single-master-clock half-cycles, ordinary FastROM would be enough for 10.74 MHz with this setup.
But they also used 2 KB of fast RAM for the I-RAM cache, and you could run at the full 10.74 MHz in that. BW-RAM was bigger but slower; it would take the chip down to 5.37 MHz. And of course if the S-CPU accessed a particular memory at the same time you'd get extra wait states on the SA-1 side...
The later models of Super FX could run at 21.48 MHz. Unfortunately, this was only possible inside the 512-byte instruction cache. There was no data cache, just the internal registers. A cache miss or any sort of data access was five master clocks per byte, or 4.3 MHz. The dual busing and buffering saved it somewhat; if you were careful, you could generally keep working while data access was happening - unless you were reading from RAM, since there was no preload functionality for RAM (I suspect it would have been complicated) and out-of-order execution wasn't really a thing back then. Try to avoid reading from RAM a lot when using the Super FX.
...it kinda burns me that nobody thought to use 120ns ROM with the Super FX. That would be 3 master clocks per byte, and would dramatically speed up any sort of memory access including PLOT. I wonder if you could simply overclock it to 43 MHz and leave it in Slow mode - that gets you 3-cycle memory accesses, but your period authenticity is shot...
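The arithmetic behind those numbers, as a quick Python sketch. The master clock is the NTSC figure (about 21.477 MHz); the function names are made up, and this ignores address setup and latch margins, so treat the headroom as optimistic.

```python
MASTER_HZ = 21_477_272   # NTSC SNES master clock, approximately 945/44 MHz


def access_ns(master_clocks, master_hz=MASTER_HZ):
    """Length of one memory access window in nanoseconds."""
    return master_clocks * 1e9 / master_hz


def access_rate_mhz(master_clocks, master_hz=MASTER_HZ):
    """Byte accesses per second, in millions, at a given access length."""
    return master_hz / master_clocks / 1e6


five_clock_ns = access_ns(5)     # the stock five-clock access
three_clock_ns = access_ns(3)    # the hypothetical three-clock access

print(f"5 clocks: {five_clock_ns:.1f} ns, {access_rate_mhz(5):.2f} MHz")
print(f"3 clocks: {three_clock_ns:.1f} ns, {access_rate_mhz(3):.2f} MHz")
```

Five master clocks is a window of roughly 233 ns, and three is roughly 140 ns, which is why a 120ns part would fit on paper.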