Dealing with only X, Y, and Direct Page.

by Drew Sebastino on 2017-10-23 (#206574)

I know I've gone on in the past about how big of a pain in the ass this is, and I've asked some of you how you've worked around it, but it has been a big enough issue for me that I think it deserves its own thread. Basically, two index registers are not enough for a lot of things. Two examples where I ran into problems are my metasprite routine and my VRAM finding routine. My metasprite routine needs offsets for my object table, sprite buffer, and metasprite data, and my VRAM finding routine needs offsets for my object table, VRAM space table, and animation frame data. You actually have just enough registers if you include Direct Page but, aside from only saving you one cycle when it is a multiple of 256, it's got a major problem: it can only be in bank 0. If the SNES were designed to have more than 8KB of RAM in each bank (How the hell did anyone program for the PCE/Turbografx 16?) it wouldn't be a problem, but as it currently stands, it's a major pain in the ass. I really don't want to worry about cramming all my metasprite data, along with everything else, into one bank (as I had been doing until I realized the limits of Direct Page), but I also don't want to worry about running out of space for an object table. How many bytes make up each object in your code? I guess I could comfortably see using 48 bytes per slot and having 128 object slots, which should be 6KB. It's a bit ridiculous how I've gotten slower as time has gone on due to me revising my code one thousand times.
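
As a rough check on the sizing above, here is a minimal Python sketch of the arithmetic. The 8KB figure is the bank 0 RAM mentioned in the post; the function and variable names are hypothetical and only there for illustration.

```python
LOW_RAM_BYTES = 8 * 1024  # RAM reachable in bank 0, per the post


def object_table_bytes(bytes_per_slot, slots):
    """Total size of an object table made of fixed-size slots."""
    return bytes_per_slot * slots


# 48 bytes per slot and 128 slots, as proposed above
table = object_table_bytes(48, 128)
remaining = LOW_RAM_BYTES - table  # what is left for everything else in low RAM
print(table, remaining)
```

With these numbers the table comes to exactly 6KB, which leaves 2KB of low RAM for the stack, scratch variables, and anything else that has to live in bank 0.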

Re: Dealing with only X, Y, and Direct Page.
by TOUKO on 2017-10-24 (#206582)

On the PCE you don't need any VRAM finding routine, as you can cram sprites anywhere in your 64 KB of VRAM.
Personally, I use a double buffer in VRAM if dynamic sprites are needed (e.g. for a BTU), as you can write to it at any time.

For OAM/SAT, no registers are needed, as I write directly to VRAM.

I think if you lack index registers, you must use pointers, perhaps for your OAM buffer.

Re: Dealing with only X, Y, and Direct Page.
by HihiDanni on 2017-10-24 (#206583)

I ran into the same issue while refactoring my sprite routine, actually. I had the direct page pointing to the spritedef with X and Y pointing to the current object and the OAM destination, respectively. I realized that this would only let you put spritedefs into bank 0, so I reworked the function.

Now, X is the object as always, Y is now the spritedef, and D points to the OAM destination. This theoretically means that I can now have the spritedef in any LoROM style bank, since I can switch the data bank before calling into the function.

Currently I'm finding 8kB plenty to work with, although space might get tighter once I start working on the optimized collision routines where I'll need some data structures for indexing.

Edit: A VRAM slot finder probably doesn't need the current object index to be able to look for a new slot, so you should be able to just store it temporarily in a scratch variable (or the stack). I don't suspect such a function would be called super often each frame so you can easily afford it.

Re: Dealing with only X, Y, and Direct Page.
by tepples on 2017-10-24 (#206586)

Espozo wrote:

My metasprite routine needs offsets for my object table, sprite buffer, and metasprite data

My NES programs use 16 bytes of zero page for local variables. The metasprite routine in The Curse of Possum Hollow uses 13 of them.

The "draw individual actor" and "draw bullet" subroutines have to copy the position and identity of each object from the respective object table into zero page. But once all these are read out, it no longer has to access the object table for that sprite when drawing the metasprite proper.

It's called with the sprite sheet ID and frame number in X and A. Additional arguments are passed in the local variable area, taking 6 bytes:
2 bytes: X coordinate
2 bytes: Y coordinate
1 byte: Base tile number in video memory
1 byte: Attributes

Once it starts running, it can proceed to use Y to index into the metasprite data, X to index into the sprite buffer, and 7 more bytes of zero page for the current horizontal strip's state.
2 bytes: X coordinate
1 byte: Remaining width in sprites
1 byte: Y coordinate
1 byte: Attributes of current strip (for split-palette or layered sprites)
2 bytes: Pointer to start of metasprite data

Espozo wrote:

How the hell did anyone program for the PCE/Turbografx 16?

The TG16 has 8 KiB of RAM, which you observed is the same as the Super NES's low memory. The NES has 2 KiB of RAM, and the majority of games didn't expand that with extra RAM on the cartridge. Yet people programmed for it.

Espozo wrote:

How many bytes make up each object in your code?

Actors in Curse are 16 bytes, and there are 6 slots. There are 8 additional "entry queue" slots for actors, each 4 bytes in size. Bullets are 6 bytes, and there are 12 slots.

Re: Dealing with only X, Y, and Direct Page.
by psycopathicteen on 2017-10-24 (#206632)

tepples wrote:

Espozo wrote:

How the hell did anyone program for the PCE/Turbografx 16?

Espozo wrote:

How many bytes make up each object in your code?

Actors in Curse are 16 bytes, and there are 6 slots. There are 8 additional "entry queue" slots for actors, each 4 bytes in size. Bullets are 6 bytes, and there are 12 slots.

Pretty much it. SNES homebrewers want more stuff onscreen than NES homebrewers.

Now that I think about it, I wonder how much of a placebo effect memory size has on how you use it. Maybe if I try stuffing more stuff in 8kB, 8kB wouldn't seem as tight.

Something that gets on my nerves more and more is how you can do long indexing with X but not Y, but you can do long indirect indexing with Y but not X. I don't think you can directly load X or Y from long addresses.

Re: Dealing with only X, Y, and Direct Page.
by Drew Sebastino on 2017-10-24 (#206643)

HihiDanni wrote:

Now, X is the object as always, Y is now the spritedef, and D points to the OAM destination.

How are you able to use Direct Page for this with the SNES's dumbass HiOAM table?

HihiDanni wrote:

Currently I'm finding 8kB plenty to work with

I mean, it's plenty unless you plan on squeezing your object table into it. However, I have noticed that I have the majority of the data I need set aside for an object in each routine, and it's 20 bytes, so I should be able to have over twice that for each object and still have 128 objects.

psycopathicteen wrote:

Something that gets on my nerves more and more is how you can do long indexing with X but not Y, but you can do long indirect indexing with Y but not X. I don't think you can directly load X or Y from long addresses.

It's honestly not as absurd as not having an add without carry. Having under 256 different instructions must have been a bitch when designing this thing.

Re: Dealing with only X, Y, and Direct Page.
by psycopathicteen on 2017-10-24 (#206644)

That makes no sense either. Don't think it even needs an "ADC" and "SBC" in the first place. Just have an "ICS" increment if carry set, and "DCC" decrement if carry clear instructions and it would make more sense.

I'd like to know if there were any other cheap CPUs that actually fixed the problems of the 65xx architecture. Everything else seemed to be just a battle of who can make the most expensive CPU possible.

Re: Dealing with only X, Y, and Direct Page.
by HihiDanni on 2017-10-25 (#206690)

Espozo wrote:

HihiDanni wrote:

Now, X is the object as always, Y is now the spritedef, and D points to the OAM destination.

How are you able to use Direct Page for this with the SNES's dumbass HiOAM table?

I don't. I reuse the X index register to do it. This requires preserving the current value of X, and it's something I could optimize in the future maybe, but I personally don't think it's a big deal.

And most of the high OAM processing isn't done in AddSprite either.

Re: Dealing with only X, Y, and Direct Page.
by rainwarrior on 2017-10-25 (#206694)

psycopathicteen wrote:

This is only true if the only result you care about is what's left in the accumulator. The result of the other flags after adding or subtracting are dependent on that carry, and are essential for multi-byte/word operations. You need ADC/SBC for that.
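
A minimal Python sketch of the multi-byte case being described, assuming an 8-bit accumulator. It only models the carry flag, not the other flags, and the helper names are hypothetical.

```python
def adc8(a, b, carry_in):
    """8-bit add with carry: returns (result, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16_via_bytes(x, y):
    """Add two 16-bit values one byte at a time, low byte first."""
    lo, carry = adc8(x & 0xFF, y & 0xFF, 0)   # carry cleared first, like CLC
    hi, carry = adc8(x >> 8, y >> 8, carry)   # high byte consumes the carry
    return (hi << 8) | lo, carry


result, carry = add16_via_bytes(0x12FF, 0x0001)
print(hex(result), carry)
```

The second step is the part that needs the carry from the first one; how that carry gets folded in (an ADC, or an increment-if-carry as suggested earlier) is the design question under discussion.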

Re: Dealing with only X, Y, and Direct Page.
by HihiDanni on 2017-10-25 (#206695)

rainwarrior wrote:

This is only true if the only result you care about is what's left in the accumulator.

On the SNES, that is most of the time. It has a 16-bit CPU after all.

Re: Dealing with only X, Y, and Direct Page.
by 93143 on 2017-10-25 (#206697)

Espozo wrote:

Having under 256 different instructions must have been a bitch when designing this thing.

Try coding for the Super FX. 16 registers, 8-bit instruction size. And lots of instructions need source, operand and destination registers. Something as simple as XOR requires a prefix instruction just to specify the operation because they ran out of opcodes.

But yeah, it's a tad limiting. Just upgrading the 65xx concept to 16-bit without worrying about backward compatibility would have resulted in a massively more powerful processor. I kinda like the idea of a Z register too...

Re: Dealing with only X, Y, and Direct Page.
by psycopathicteen on 2017-10-25 (#206713)

rainwarrior wrote:

psycopathicteen wrote:

You need the carry bit, but that doesn't mean you need an ADC/SBC for that.

Re: Dealing with only X, Y, and Direct Page.
by Drew Sebastino on 2017-10-25 (#206714)

93143 wrote:

Espozo wrote:

Having under 256 different instructions must have been a bitch when designing this thing.

Sounds like a real POS. :lol:

Correct me if I'm wrong, but it seems the only advantage the Super FX has over the SA-1 is converting packed pixel to the SNES graphics format in hardware.

93143 wrote:

But yeah, it's a tad limiting. Just upgrading the 65xx concept to 16-bit without worrying about backward compatibility would have resulted in a massively more powerful processor. I kinda like the idea of a Z register too...

The lack of a Z register wouldn't be too bad if there were a way to quickly switch values in and out of X and Y. For example, if there was an instruction for swapping the value of X or Y with an area of memory, (if that's even possible) that would be fine, but as it stands, it's too slow for what you're doing. I don't know how much better they could have made it, but I will give it to the 65816 for stomping other processors from the period at the same clock speed despite having only an 8 bit data bus. Of course, most all of them ran faster than 3MHz.

I really don't understand, how did they get the SA-1 to run at 10MHz? Was it built using a smaller manufacturing process? (It doesn't have a fan, or even a heat sink.) The 5A22 in the SNES can't be that underclocked.

Re: Dealing with only X, Y, and Direct Page.
by 93143 on 2017-10-25 (#206717)

Espozo wrote:

Correct me if I'm wrong, but it seems the only advantage the Super FX has over the SA-1 is converting packed pixel to the SNES graphics format in hardware.

EDIT: I misread that as a comparison with the S-CPU. The SA-1 is indeed much less weak.

The Super FX still had a number of advantages. For one, the clock speed was higher, at least for the later revisions. For another, it had a faster multiplier - it could do 8x8 in two master cycles, or 16x16 in eight master cycles, or nine if you wanted the full 32 bits of output. (On the other hand, it had no real division functionality as far as I can tell.) Also, the RISC idea wasn't entirely bogus - lots of stuff could be done in a single cycle, and lots more could be done in two, versus an average of about 4 for the 65816. And of course it had 16 general-purpose registers (including the program counter) and you could operate between any of them with no I/O penalty.

Instructions often consisted of a 4-bit opcode and a 4-bit operand register number, with source and destination set by FROM, TO, or WITH. If the source and destination registers were unset, they defaulted to R0, and most instructions reset them; this meant you could get better speed by using R0 as an accumulator of sorts.

The PLOT functionality was more than just a packed-to-planar converter (which the SA-1 had in the form of a couple of special DMA modes). It had features like automatic checkerboard dither and palettized 8bpp drawing with 4-bit input. It used two of the registers as screen coordinates, removing the need to calculate addresses, and it auto-incremented the x coordinate. All in one cycle, so you could continue with the rest of the algorithm while the pixel caching system did its work.

The Super FX actually had hardware texture mapping capability, kinda. The MERGE opcode takes the top bytes of R7 and R8 and concatenates them into the destination register. If R7 and R8 contain 8.8 fixed-point texture coordinates, and you prefix MERGE with TO R14, you can then read a texel from ROM.
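
A minimal Python sketch of that byte concatenation. The byte order (R7's top byte ending up as the high byte of the result) is an assumption here, as are the example coordinates; `merge` is a hypothetical helper, not an emulator function.

```python
def merge(r7, r8):
    """Concatenate the top bytes of two 16-bit registers.

    Assumed order: top byte of r7 becomes the high byte of the result,
    top byte of r8 becomes the low byte.
    """
    return (r7 & 0xFF00) | ((r8 >> 8) & 0xFF)


u = 0x0380  # 3.5 in 8.8 fixed point
v = 0x1240  # 18.25 in 8.8 fixed point
texel_address = merge(u, v)  # integer parts only: $03 and $12
print(hex(texel_address))
```

The fractional bytes are simply dropped, so stepping R7 and R8 by 8.8 increments walks the texture without any shifting or masking in the inner loop.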

It may be interesting to note that the default ADD doesn't take carry into account. You need to prefix it with a flag instruction to get ADC. Kinda the opposite of the 65816, where you must prefix ADC with CLC to get ADD...

Quote:

I don't know how much better they could have made it

I'm guessing, but a 16-bit chip would obviously have a much larger potential opcode selection, and of course double the bus width at a given memory speed is double the bandwidth. You could do 16-bit addressing in two words, or 32-bit addressing in three, and reading or writing a word would be one cycle. If you were willing to constrain the opcode space a bit, you could do 8-bit addressing in one word, or 24-bit in two, but I'm not sure the internal architecture would be up to the former... Basically everything would be either twice as big or twice as fast, and opcode count would no longer be a significant constraint. You'd get no bonus for 8-bit data, but that was always a bit of a booby prize anyway...

...not to mention that if you eliminated the phi1/phi2 nonsense like Hudson did, you'd double performance for free (assuming the process was up to it)...

As long as we're making wish lists, how about some reasonably quick multiply and divide instructions? The 5A22 has an external multiplier and divider, though they aren't very good, and the SPC700 has them as actual instructions - it uses Y for the upper byte of 16-bit values. A hypothetical 6516 with 16x16 and 32/16 could do something similar, and with a Z register there'd still be two index registers free.

Are we getting into the 68000 price range here?

Quote:

I really don't understand, how did they get the SA-1 to run at 10MHz?

I think the speed of the core wasn't the issue with the S-CPU. It was the memory speed. (It also came out five years earlier. Five years was a long time back then. Remember, the SA-1 only came out a year before the N64...)

With the SA-1, I believe they used 16-bit ROM with a memory controller to split the words for the CPU core. This meant you'd get wait states if you accessed data or had to branch; only linear program counter reads and DMA went at full speed. Somebody did the math, and it seems that if the SA-1 used single-master-clock half-cycles, ordinary FastROM would be enough for 10.74 MHz with this setup.

But they also used 2 KB of fast RAM for the I-RAM cache, and you could run at the full 10.74 MHz in that. BW-RAM was bigger but slower; it would take the chip down to 5.37 MHz. And of course if the S-CPU accessed a particular memory at the same time you'd get extra wait states on the SA-1 side...

The later models of Super FX could run at 21.4 MHz. Unfortunately, this was only possible inside the 512-byte instruction cache. There was no data cache, just the internal registers. A cache miss or any sort of data access was five master clocks per byte, or 4.3 MHz. The dual busing and buffering saved it somewhat; if you were careful, you could generally keep working while data access was happening - unless you were reading from RAM, since there was no preload functionality for RAM (I suspect it would have been complicated) and out-of-order execution wasn't really a thing back then. Try to avoid reading from RAM a lot when using the Super FX.

...it kinda burns me that nobody thought to use 120ns ROM with the Super FX. That would be 3 master clocks per byte, and would dramatically speed up any sort of memory access including PLOT. I wonder if you could simply overclock it to 43 MHz and leave it in Slow mode - that gets you 3-cycle memory accesses, but your period authenticity is shot...
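
The clock figures in this post can be reproduced with a short Python sketch. It assumes the NTSC master clock of roughly 21.477 MHz; the helper names are hypothetical.

```python
MASTER_HZ = 21_477_272  # NTSC master clock, approximately 21.477 MHz


def mhz(master_clocks_per_cycle):
    """Effective rate in MHz when each cycle takes this many master clocks."""
    return MASTER_HZ / master_clocks_per_cycle / 1e6


def window_ns(master_clocks):
    """Length in nanoseconds of an access lasting this many master clocks."""
    return master_clocks / MASTER_HZ * 1e9


print(round(mhz(2), 2))     # SA-1 running from I-RAM
print(round(mhz(4), 2))     # SA-1 running from BW-RAM
print(round(mhz(5), 1))     # Super FX memory access outside the cache
print(round(window_ns(3)))  # window a 3-clock ROM access would have
```

A 3-clock access window works out to about 140 ns, which is the window the 120ns ROM idea above relies on.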

Re: Dealing with only X, Y, and Direct Page.
by creaothceann on 2017-10-26 (#206718)

93143 wrote:

...not to mention that if you eliminated the phi1/phi2 nonsense like Hudson did, you'd double performance for free

What did they do?

93143 wrote:

Five years was a long time back then.

Still is.

Re: Dealing with only X, Y, and Direct Page.
by TOUKO on 2017-10-26 (#206724)

Quote:

how did they get the SA-1 to run at 10MHz? Was it built using a smaller manufacturing process? (It doesn't have a fan, or even a heat sink.) The 5A22 in the SNES can't be that underclocked.

If I remember correctly, the 65816 was originally intended to run at this clock, and was quickly pushed to 14 MHz. And don't forget the C64's SuperCPU, released in 1996, which ran at 20 MHz.

I don't think the SNES's CPU frequency was a manufacturing problem, but a memory requirement problem.
Nintendo wanted to sell the cheapest system, and knew that the CPU could be easily upgraded; this is why, IMO, it was so weak.

Re: Dealing with only X, Y, and Direct Page.
by tepples on 2017-10-26 (#206736)

It sounds like 93143 has 68000 envy. Blast processing much?

Take it a step further and you have Thumb.

Re: Dealing with only X, Y, and Direct Page.
by TOUKO on 2017-10-26 (#206737)

tepples wrote:

It sounds like 93143 has 68000 envy. Blast processing much?

Take it a step further and you have Thumb.

The 68K, even in 1990, was expensive and not customisable (Motorola did not allow it).
For me the 65816 was a good choice, but not at 2.68 MHz; something SA-1-like (even with more limited features) would have been really nice and powerful.

Re: Dealing with only X, Y, and Direct Page.
by psycopathicteen on 2017-10-26 (#206764)

How expensive was the 68000 in the early 90s, anyway?

Re: Dealing with only X, Y, and Direct Page.
by TOUKO on 2017-10-27 (#206817)

psycopathicteen wrote:

How expensive was the 68000 in the early 90s, anyway?

Really, it's hard to tell; it's not easy to find prices from that era. But I think the MD's price when the SNES came out speaks for itself: the SNES's VRAM, PPUs, and audio hardware were more expensive than their MD counterparts, and the SNES's launch price was lower. The M68K + Z80 pair was not cheap, IMO.

Re: Dealing with only X, Y, and Direct Page.
by psycopathicteen on 2017-10-27 (#206827)

I'm confused, Wikipedia says they both cost $200 at launch, but the Genesis was $150 when the SNES launched.

Re: Dealing with only X, Y, and Direct Page.
by TOUKO on 2017-10-27 (#206828)

psycopathicteen wrote:

I'm confused, Wikipedia says they both cost $200 at launch, but the Genesis was $150 when the SNES launched.

Ah yeah, I had the European price in mind.

But the MD was at $150 in 1992, which was not the original SNES launch but the European SNES launch.
Here the SNES was at the same price ($150 after conversion) but with a game (SF2 or SMW for $129) and 2 pads, and $99 alone with 1 pad.

To summarise, both were at the same price for two years. The SNES's chipset was more expensive to produce than the MD's, plus the SNES has more expensive VRAM (not sure here) and more expensive audio RAM (64K, 100ns SRAM).

Re: Dealing with only X, Y, and Direct Page.
by Oziphantom on 2017-10-28 (#206887)

Set the Data Bank Register to $7F ( or 7E if you want)

now
DP relative hits your 256 bytes of "registers" in 00:XXXX
16 bit ABS hits 64K of WRAM in 7F:XXXX
if you need to read from another bank or ROM
24bit ABS will read from wherever; given that you have 16 bit index registers you can do
LDX #address in bank
LDA $820000,x ; reads the byte you want

or use a 24bit pointer in the DP and do direct indirect indexed long or direct indirect long modes to pull data from ROM/other RAM.

Thus you have 256 registers, 64K of RAM to play with and can access anywhere in RAM or ROM for a slight penalty.

Unless the SNES MMU mucks up the DBR and makes it also follow the lower shadow rule, but you then still get 32K+8K to play with.
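
The three access paths described above can be modelled in a few lines of Python. This is only a sketch of how the effective 24-bit address is formed in native mode; the function names are hypothetical and the register values are examples.

```python
def direct_page(d, offset):
    """Direct page always resolves in bank $00, whatever DBR holds."""
    return (d + offset) & 0xFFFF


def absolute(dbr, addr):
    """16-bit absolute addressing takes its bank from the Data Bank Register."""
    return (dbr << 16) | addr


def absolute_long_x(base, x):
    """24-bit absolute long indexed by X; the bank is part of the operand."""
    return (base + x) & 0xFFFFFF


dbr, d = 0x7F, 0x0000
print(hex(direct_page(d, 0x10)))                # 256 bytes of "registers" in bank 0
print(hex(absolute(dbr, 0x1234)))               # WRAM through DBR
print(hex(absolute_long_x(0x820000, 0x8000)))   # ROM, as in LDA $820000,x
```

Only the middle path depends on DBR, which is why pointing it at $7E or $7F leaves direct page and long accesses unaffected.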

Re: Dealing with only X, Y, and Direct Page.
by HihiDanni on 2017-10-28 (#206889)

Oziphantom wrote:

Set the Data Bank Register to $7F ( or 7E if you want)

now
DP relative hits your 256 bytes of "registers" in 00:XXXX

I would not recommend this, because even though the DP access only takes 3 cycles, they're slow cycles because you're hitting RAM. One of the keys to performance on the SNES is to minimize excessive RAM reads/writes.

Oziphantom wrote:

if you need to read from another bank or ROM
24bit ABS will read from wherever; given that you have 16 bit index registers you can do
LDX #address in bank
LDA $820000,x ; reads the byte you want

This is alright for one-off reads, but if you're accessing lots of values out of a different bank, better to set the data bank register and use absolutes.

Oziphantom wrote:

or use a 24bit pointer in the DP and do direct indirect indexed long or direct indirect long modes to pull data from ROM/other RAM.

This is an interesting idea, but would probably need some values pre-stored during scene initialization, or else it'd be slower than absolute long.

You don't really need 256 fake registers. My take is set program and data bank to 80, and set data bank inside of, or before calling into, routines where you want to use data from elsewhere (but only when you need to).

You're not going to want to use a data bank like 7e or 7f as a general purpose thing if it's going to prevent you from accessing ROM via absolute addressing, which means harder access to read-only data structures and lookup tables that you can access with fast cycles (assuming you're using FastROM).

So yeah, I'd basically just take a "set it when you need it" approach to SNES programming. The whole bank thing definitely creates some mental barriers, but once you become more familiar with the addressing modes, and you know how to work the data bank register, it'll become a lot more tangible.

Re: Dealing with only X, Y, and Direct Page.
by psycopathicteen on 2017-10-28 (#206905)

I use $80 as my default bank, but I occasionally use $7e and sometimes $7f, but I only change banks at the beginning and end of routines, because the bank register isn't easy to change.

Re: Dealing with only X, Y, and Direct Page.
by lidnariq on 2017-10-28 (#206907)

HihiDanni wrote:

Oziphantom wrote:

Set the Data Bank Register to $7F ( or 7E if you want)

now
DP relative hits your 256 bytes of "registers" in 00:XXXX

But... (1) only the cycles that actually access "slow" speed memory are slow, and (2) the only way you can get fast DP accesses is by setting D to something in the range of 0x2000-0x3FFF/0x4200-0x5FFF, because you can't move DP out of bank 0 anyway.

Re: Dealing with only X, Y, and Direct Page.
by Drew Sebastino on 2017-10-28 (#206922)

I have no clue what we're even talking about anymore, but I've come to the conclusion that my object table will be offset by Direct Page (again... :roll:). I figure that having to fit the object table in 6 or so KB shouldn't be too bad, especially considering that many of the variables I have are 1 or 3 bytes, and they had to be 2 and 4 bytes due to having a table per attribute. The only real bummer is that it is slower to increment Direct Page, but I figure if each object I have is 32 bytes, I'm also saving a cycle every 8 objects.
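
The "every 8 objects" figure follows from 8 x 32 = 256: direct page accesses avoid the extra cycle only when the low byte of D is zero. A minimal Python sketch, assuming a page-aligned table start and 128 slots (both assumptions, as are the names):

```python
OBJECT_BYTES = 32
SLOTS = 128      # hypothetical slot count
TABLE_BASE = 0   # hypothetical page-aligned start of the object table


def dp_penalty_free(d):
    """True when D's low byte is zero, so direct page costs no extra cycle."""
    return (d & 0xFF) == 0


aligned = [slot for slot in range(SLOTS)
           if dp_penalty_free(TABLE_BASE + slot * OBJECT_BYTES)]
print(aligned[:4], len(aligned))
```

If the table does not start on a 256-byte boundary, no slot lands on a page boundary at all with a 32-byte stride unless the start address is itself a multiple of 32.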

Re: Dealing with only X, Y, and Direct Page.
by 93143 on 2017-10-29 (#206950)

Espozo wrote:

I have no clue what we're even talking about anymore

Sorry 'bout that.

Re: Dealing with only X, Y, and Direct Page.
by psycopathicteen on 2017-10-29 (#206956)

@93143

How does the SuperFX's cache work anyway?

Re: Dealing with only X, Y, and Direct Page.
by 93143 on 2017-10-29 (#206967)

There's a CACHE opcode. When the Super FX hits that opcode, it sets the cache base register to the address after CACHE (but with the bottom nibble zeroed), and loads your code into the cache in 16-byte chunks as it executes it from ROM or RAM. Subsequent loops over that code are one cycle per byte, as long as it fits in the cache.

The CBR doesn't store any bank information, so LJMP is treated the same as CACHE.

You cannot store data in the cache. It's only for code. The S-CPU can write directly into the cache, but I'm not sure what the point is, since even DMA is slower than just letting the Super FX load it...
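
A small Python sketch of the cache base calculation described above. It takes the 512-byte cache size from the earlier post and the 16-byte chunk size from this one; the function names and example address are hypothetical, and bank handling is left out since the CBR does not store it.

```python
CACHE_BYTES = 512  # instruction cache size mentioned earlier in the thread
LINE_BYTES = 16    # code is loaded in 16-byte chunks


def cache_base(address_after_cache):
    """CBR: the address following CACHE with the bottom nibble zeroed."""
    return address_after_cache & 0xFFF0


def fits_in_cache(cbr, addr):
    """True if addr falls inside the 512-byte window starting at CBR."""
    return 0 <= addr - cbr < CACHE_BYTES


cbr = cache_base(0x8123)
lines = CACHE_BYTES // LINE_BYTES
print(hex(cbr), lines, fits_in_cache(cbr, 0x831F), fits_in_cache(cbr, 0x8320))
```

A loop gets the one-cycle-per-byte speed only if all of its code sits inside that window, so placing CACHE just before a tight loop is the natural use.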