And how does it compare to the addition/subtraction operations?
Addition and subtraction are pretty trivial on a 65816, though you have to watch out because all of the available opcodes are "with carry", meaning if you want a fresh operation you generally have to clc before adding or sec before subtracting. Also, you can't add or subtract the contents of two CPU registers; it's always between the accumulator and some memory location.
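To illustrate why the clc/sec matters, here is a rough Python model of a 16-bit add/subtract with carry. The function names adc16 and sbc16 are made up for this sketch, and it assumes binary (non-decimal) mode with a 16-bit accumulator; it is not an emulator, just the arithmetic.

```python
def adc16(a, operand, carry_in):
    """Model a 16-bit binary-mode ADC. Returns (result, carry_out)."""
    total = (a & 0xFFFF) + (operand & 0xFFFF) + (1 if carry_in else 0)
    return total & 0xFFFF, total > 0xFFFF


def sbc16(a, operand, carry_in):
    """Model a 16-bit binary-mode SBC: A - operand - (1 - carry).

    Subtraction is an add of the operand with its bits inverted, which is
    why the carry has to be SET (sec) for a clean subtract.
    """
    return adc16(a, ~operand & 0xFFFF, carry_in)


# clc / adc: carry cleared first, so the sum is exact.
clean_sum, _ = adc16(0x1234, 0x0001, carry_in=False)

# sec / sbc: carry set first, so the difference is exact.
clean_diff, _ = sbc16(5, 3, carry_in=True)

# Forgetting the sec leaves a stale clear carry and the result is one too low.
stale_diff, _ = sbc16(5, 3, carry_in=False)
```

With a stale clear carry, the subtraction 5 - 3 comes out as 1 instead of 2, which is the trap described above.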
The carry flag operation is one cycle to load the opcode and one cycle to execute it.
The actual arithmetic operation is one cycle to load the opcode, 0-3 cycles to load the data address (immediate, direct page, absolute, long), 0-4 cycles for indirect or stack-relative address handling, 0-1 cycles to deal with 16-bit indexing, indexing across page boundaries or nonzero direct page low byte, and 1-2 cycles to load the operand (8-bit or 16-bit). Total is between 2 and 8 cycles. There are no additional internal cycles associated with the arithmetic.
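For anyone who wants to tabulate this, here is a small Python sketch of the cycle count per addressing mode. The base counts are the 8-bit-accumulator figures as I recall them from the WDC datasheet, so treat the table as an assumption and check it against the datasheet before relying on it. The name adc_cycles and the mode labels are just for this example.

```python
# Base ADC/SBC cycle counts with an 8-bit accumulator (assumed from the
# WDC W65C816S opcode tables; verify before relying on them).
BASE_CYCLES = {
    "imm": 2, "dp": 3, "abs": 4, "long": 5,
    "dp,x": 4, "abs,x": 4, "abs,y": 4, "long,x": 5,
    "(dp)": 5, "(dp,x)": 6, "(dp),y": 5,
    "[dp]": 6, "[dp],y": 6,
    "sr,s": 4, "(sr,s),y": 7,
}

# Modes that pay one extra cycle when the direct page low byte is nonzero.
DIRECT_PAGE_MODES = {"dp", "dp,x", "(dp)", "(dp,x)", "(dp),y", "[dp]", "[dp],y"}

# Modes that pay one extra cycle for 16-bit indexing or a page crossing.
INDEX_PENALTY_MODES = {"abs,x", "abs,y", "(dp),y"}


def adc_cycles(mode, acc16=False, dl_nonzero=False, index_penalty=False):
    """Estimate the cycles for one ADC/SBC in the given addressing mode."""
    cycles = BASE_CYCLES[mode]
    if acc16:
        cycles += 1  # second byte of a 16-bit operand
    if dl_nonzero and mode in DIRECT_PAGE_MODES:
        cycles += 1  # direct page register low byte is not zero
    if index_penalty and mode in INDEX_PENALTY_MODES:
        cycles += 1  # 16-bit index registers or page boundary crossed
    return cycles


cheapest = min(adc_cycles(m) for m in BASE_CYCLES)
costliest = max(adc_cycles(m, True, True, True) for m in BASE_CYCLES)
```

With every penalty switched on, the worst case in this table comes out at 8 cycles and the best case (8-bit immediate) at 2, which agrees with the range quoted above.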
...
Multiplication and division are not in the instruction set; they're done on a memory-mapped ALU. On the SNES's 5A22, for an 8x8 unsigned multiplication, you put the numbers you want to multiply in $4202 (if it isn't already there) and $4203 (always), wait 8 cycles (or the number of significant bits in $4202), and read the 16-bit result from $4216-$4217. For a 16/8 unsigned division, you put the dividend in $4204-$4205 (if it isn't already there) and the divisor in $4206 (always), wait 16 cycles, and read the quotient from $4214-$4215 and the remainder from $4216-$4217. If you start with one operand in a register and one in memory somewhere, as with addition or subtraction, I calculate that it's at least 16-18 cycles for multiplication on top of what an add or subtract would take, or 25-27 cycles for division (or 29-32 if you want the remainder too). Note that it is possible to slip other operations in while the ALU is working, so the computational penalty may be lower than this.
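As a sanity check on the register layout, here is a minimal Python model of what the 5A22 ALU hands back. The function names are invented for this sketch, it ignores timing entirely (on hardware you still have to wait before reading), and division by zero is deliberately left unmodeled since nothing above describes it.

```python
def snes_multiply(multiplicand, multiplier):
    """8x8 unsigned multiply: bytes written to $4202 and $4203.

    Returns the 16-bit product that would be read from $4216-$4217.
    """
    return (multiplicand & 0xFF) * (multiplier & 0xFF)


def snes_divide(dividend, divisor):
    """16/8 unsigned divide: dividend in $4204-$4205, divisor in $4206.

    Returns (quotient, remainder) as read from $4214-$4215 and $4216-$4217.
    """
    dividend &= 0xFFFF
    divisor &= 0xFF
    if divisor == 0:
        # Hardware behavior for a zero divisor is not covered by this sketch.
        raise ZeroDivisionError("zero divisor is not modeled here")
    return dividend // divisor, dividend % divisor


product = snes_multiply(200, 150)
quotient, remainder = snes_divide(50000, 7)
```

Note that the product and the remainder share $4216-$4217, so starting a division overwrites the result of an earlier multiplication.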
On the SA-1, for a 16x16 signed multiplication, you set $2250 to 0 (or 2 if starting a cumulative string of multiplications), load the numbers you want to multiply in $2251-$2252 (if it isn't already there) and $2253-$2254 (always), wait 5 cycles, and read the 32-bit result from $2306-$2309, or (if using cumulative mode) wait 6 cycles instead of 5 and read the 40-bit result from $2306-$230A. For 16/16 signed/unsigned division, you set $2250 to 1, load the dividend in $2251-$2252 (always) and the divisor in $2253-$2254 (always), wait 5 cycles, and read the (signed) quotient from $2306-$2307 and the (unsigned) remainder from $2308-$2309.
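Here is a rough Python sketch of the SA-1 multiply and cumulative modes, mainly to show how the signed result lands in the result registers as little-endian bytes. The helper names are hypothetical, wrapping the running sum at 40 bits is my assumption, and I have left division out because I am not sure enough of how the hardware rounds a negative dividend.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def sa1_multiply(ma, mb):
    """16x16 signed multiply with $2250 = 0.

    ma is the word written to $2251-$2252, mb the word written to
    $2253-$2254. Returns the four bytes read from $2306-$2309.
    """
    product = to_signed(ma, 16) * to_signed(mb, 16)
    return (product & 0xFFFFFFFF).to_bytes(4, "little")


def sa1_accumulate(pairs):
    """Cumulative mode with $2250 = 2: sum of signed 16x16 products.

    Returns the five bytes read from $2306-$230A. Wrapping at 40 bits is
    an assumption of this sketch.
    """
    total = 0
    for ma, mb in pairs:
        total = (total + to_signed(ma, 16) * to_signed(mb, 16)) & 0xFFFFFFFFFF
    return total.to_bytes(5, "little")


minus_two = sa1_multiply(0xFFFF, 0x0002)            # -1 * 2
dot = sa1_accumulate([(3, 4), (0xFFFF, 5)])         # 3*4 + (-1)*5
```

The cumulative mode is effectively a multiply-accumulate, so a dot product only needs one pair of writes per term.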
The S-PPU has a 16x8 signed multiplier for internal use in Mode 7. Obviously you shouldn't touch it if the screen is in Mode 7, but if it's not you can upload a 16-bit signed value to $211b (one byte at a time, if it isn't already there) and an 8-bit signed value to $211c (if it isn't already there), and read the 24-bit result from $2134-$2136. It's ready way too fast for the S-CPU to ever read the result too soon, so you don't have to wait. Due to the shuffle you have to do to load $211b, this isn't really any faster than the ALU, but it's 16x8 signed rather than 8x8 unsigned. (Also, since it's on the B bus rather than being internal to the S-CPU, you could in principle address it with HDMA...)
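A quick Python model of the S-PPU multiplier, for checking what the three result bytes should look like. The function names are invented for this sketch, and it only models the arithmetic, not the two-write shuffle into $211b or the Mode 7 restriction.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def ppu_multiply(value16, value8):
    """16x8 signed multiply.

    value16 is the word loaded into $211b (low byte first, then high byte),
    value8 is the byte written to $211c. Returns the three bytes read from
    $2134, $2135 and $2136, low byte first.
    """
    product = to_signed(value16, 16) * to_signed(value8, 8)
    return (product & 0xFFFFFF).to_bytes(3, "little")


low, mid, high = ppu_multiply(0x0100, 0xFF)   # 256 * -1
```

Since the result is signed, a negative product shows up with $2136 holding the sign-extended top byte, which is $FF in the 256 * -1 case here.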
Alternately, you could construct multiplication and division operations out of the available instructions for a slow but interrupt-safe method...
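The usual way to do that is shift-and-add for multiplication and shift-and-subtract for division. Here is a sketch of both in Python, restricted to the operations the CPU actually has (add, subtract, shift, compare). It is written for clarity rather than as a cycle-optimal 65816 routine, and the function names are made up.

```python
def soft_mul8(a, b):
    """8x8 unsigned multiply by shift-and-add, giving a 16-bit product."""
    a &= 0xFF
    b &= 0xFF
    result = 0
    for _ in range(8):
        if b & 1:                      # lsr the multiplier, test the carry
            result = (result + a) & 0xFFFF
        a = (a << 1) & 0xFFFF          # asl the multiplicand
        b >>= 1
    return result


def soft_div16_8(dividend, divisor):
    """16/8 unsigned divide by shift-and-subtract (restoring division).

    Returns (quotient, remainder).
    """
    dividend &= 0xFFFF
    divisor &= 0xFF
    if divisor == 0:
        raise ZeroDivisionError("zero divisor")
    quotient = 0
    remainder = 0
    for bit in range(15, -1, -1):
        # Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:       # cmp, then sbc if it fits
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

Because everything lives in registers or in memory the routine owns, an interrupt handler that also wants to multiply cannot clobber a shared hardware register halfway through, which is the interrupt-safety point made above.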
Since both $42xx and $21xx are fast access, every S-CPU bus cycle addressing the multiplication/division hardware is 6 master clocks, which means that if you're running in FastROM and don't need to fetch anything from RAM, the whole operation is 6 master clocks per cycle unless you touch slow memory while the ALU is working. The SA-1 runs much faster, of course, typically 2 master clocks per cycle, or 4 if the S-CPU is getting in the way...
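To put the wait times in master clocks, here is the arithmetic as a tiny Python sketch. The clocks-per-cycle figures are the ones from this post; the constant and function names are just for illustration.

```python
FAST_CLOCKS_PER_CYCLE = 6      # S-CPU fast access, per the post
SA1_CLOCKS_PER_CYCLE = 2       # SA-1, typical case
SA1_CLOCKS_CONTENDED = 4       # SA-1 when the S-CPU is in the way


def master_clocks(cycles, clocks_per_cycle):
    """Convert a cycle count into master clocks."""
    return cycles * clocks_per_cycle


# Wait times quoted earlier in the thread, ignoring the register loads
# and the reads of the result.
mul_wait_5a22 = master_clocks(8, FAST_CLOCKS_PER_CYCLE)
div_wait_5a22 = master_clocks(16, FAST_CLOCKS_PER_CYCLE)
mul_wait_sa1 = master_clocks(5, SA1_CLOCKS_PER_CYCLE)
mul_wait_sa1_contended = master_clocks(5, SA1_CLOCKS_CONTENDED)
```

So the bare wait is 48 master clocks for a 5A22 multiplication and 96 for a division, against 10 (or 20 when contended) on the SA-1.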
Cool, thanks. So doing the multiplication/division on the Mode 7 hardware could free up some 5A22 CPU time and increase actual 3D performance? I mean, a 24-bit result would fit three 8-bit X, Y, Z coordinates nicely.
Just multiplication. The S-PPU doesn't do division.
And you have to make sure it isn't drawing a Mode 7 background at the moment you try to use the multiplier. Apparently this includes HBlank, so you can't use HDMA to dodge this requirement.
...
Oddly, the SPC700 (the CPU that controls the audio DSP) does have multiply and divide opcodes - they're 8x8 unsigned and 16/8 unsigned. And they work purely within the CPU registers - mul ya takes 9 cycles to multiply A by Y, putting the result in Y:A, and div ya, x takes 12 cycles to divide Y:A by X, putting the result in A and the remainder in Y (yes, this is an 8-bit result from a 16/8 division, but it sets the overflow and half carry flags so I guess it's okay...?). Theoretically, you could use it as a coprocessor for the main CPU via the APU I/O ports, but that would be complicated, and it's effectively a 1 MHz chip anyway with (usually) its own tasks to handle, so you'd have to search long and hard to find a situation where it'd be worth it to do such a thing...
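For reference, here is a small Python model of where the two SPC700 opcodes put their results. The function names are invented, and it only covers the well-behaved case: if the quotient does not fit in 8 bits or the divisor is zero, the sketch raises instead of guessing at what the hardware leaves in the registers.

```python
def spc_mul_ya(y, a):
    """mul ya: multiply A by Y. Returns the new (Y, A), high byte in Y."""
    product = (y & 0xFF) * (a & 0xFF)
    return product >> 8, product & 0xFF


def spc_div_ya_x(y, a, x):
    """div ya, x: divide the 16-bit value Y:A by X.

    Returns the new (A, Y): quotient in A, remainder in Y.
    """
    ya = ((y & 0xFF) << 8) | (a & 0xFF)
    x &= 0xFF
    if x == 0:
        raise ZeroDivisionError("zero divisor is not modeled here")
    quotient, remainder = divmod(ya, x)
    if quotient > 0xFF:
        # The real chip flags overflow here; the register contents in
        # that case are not modeled by this sketch.
        raise OverflowError("quotient does not fit in 8 bits")
    return quotient, remainder


y, a = spc_mul_ya(0x12, 0x34)          # 18 * 52
quotient, remainder = spc_div_ya_x(y, a, 0x34)
```

Feeding the product straight back into the divide recovers the original factor, since Y:A is both the multiply output and the divide input.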
I wonder if Electronic Arts programmers used this kind of multi-thread perversion.
Quote:
And you have to make sure it isn't drawing a Mode 7 background at the moment you try to use the multiplier. Apparently this includes HBlank, so you can't use HDMA to dodge this requirement.
In theory, it's possible to turn off rendering, do a multiplication, store the result somewhere in WRAM, and turn rendering back on mid-scanline, right?