This works fast enough. This makes me wonder how fast the Genesis would be if it had the same PPU as the SNES, and needed to do all this shit in order to have good animation, and vice-versa.
Agreed. It doesn't really seem that the SNES's CPU is slow, it actually seems like it has to do more work than the Genesis's because of the PPUs' design. The planar graphics format comes to mind...
psycopathicteen wrote:
I would've used 80 equally sized 32x32 slots.
Wow.
Well, you got to look on the bright side. You have more than 64 colors.
(I can barely get by with 256.)
Stef said I only got my Gunstar Heroes demo running on the SNES because I "simplified" everything to run good on the system, when my demo was actually running a much more complicated dynamic animation engine than the original game, and still had plenty of CPU time left.
I decided to convert part of the code to 68000 myself to compare the cycle counts.
Code:
65816 code:
-;
inx                     //2     next 32x32 slot
lda {vram_slot_table},x //4  6  load its 16x16 occupancy mask
cmp #$0f                //2  8  all four quadrants used?
beq -                   //3 11  yes: keep searching
-;
lsr                     //2     shift out one occupancy bit
bcc +                   //3  5  clear bit = free quadrant found
iny                     //2  7  count the used quadrant
bra -                   //3 10
+;
68000 code:
-;
move.b (a0)+,d0 //8     load next slot's occupancy mask
cmp.b d1,d0     //4 12  d1 holds #$0f (slot full)
beq -           //10 22 full: keep searching
-;
lsr.w #1,d0     //8     shift out one occupancy bit
bcc +           //10 18 clear bit = free quadrant found
addq.b #1,d2    //4 22  count the used quadrant
bra -           //10 32
+;
This is one of the speed-critical parts of the code. It looks for a 32x32 slot that is not completely full, then it looks for which 16x16 slot is still open. The first loop is 11 vs 22 cycles, the second loop is 10 vs 32 cycles. Look at how "fast" the 68000 is.
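Modeled in Python for clarity (the table layout here is assumed: one byte per 32x32 slot, low four bits marking used 16x16 quadrants, $0F = full):

```python
def find_free_quadrant(vram_slot_table, start=0):
    """Find the first 32x32 slot that isn't full, then the first
    free 16x16 quadrant inside it (lowest clear bit)."""
    for slot in range(start, len(vram_slot_table)):
        mask = vram_slot_table[slot]
        if mask == 0x0F:          # all four quadrants used: keep searching
            continue
        quadrant = 0
        while mask & 1:           # shift out used-quadrant bits
            mask >>= 1
            quadrant += 1
        return slot, quadrant
    return None                   # every slot full

# e.g. slot 0 full, slot 1 has quadrants 0 and 1 used
print(find_free_quadrant([0x0F, 0x03]))  # -> (1, 2)
```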
psycopathicteen wrote:
Stef said I only got my Gunstar Heroes demo running on the SNES because I "simplified" everything to run good on the system, when my demo was actually running a much more complicated dynamic animation engine than the original game, and still had plenty of CPU time left.
Why does it even matter how "simple" it is if it looks and plays exactly the same and doesn't have any slowdown? That'd be like me bragging that I could write "inc" 30 times instead of "clc adc #$30" and the game would run without slowdown. It isn't any more impressive from a gameplay perspective, and it just looks stupid.
psycopathicteen wrote:
This is one of the speed crucial parts of the code. It looks for a 32x32 slot that is not completely full, then it looks at which 16x16 slot is still open. The first loop is 11 vs 22 cycles, the second loop is 10 vs 32 cycles. Look at how "fast" the 68000 is.
Wow.
It's funny, because even though the SNES is clocked about 1/2 as fast as the Genesis, it takes about twice (or three times in the second example) the amount of time to do the same thing. I feel like video hardware generally matters more in a video game console than the main CPU anyway. You can try to optimize for a CPU, but not for video hardware, and if it's like the SNES, you can add more processing power with an expansion chip (again, I never said it would be easy, it's just possible), but you (unfortunately) can't add more overdraw. (No, the Yoshi's Island method is not a substitute for overdraw, tepples.)
This is pretty random, and although I haven't been doing anything SNES related for the past week or so (I think you know why), I wonder. I plan on making both characters able to carry a rocket launcher which will (obviously) cause an explosion. You won't be able to shoot more than one rocket at a time, so there will be no more than 2 explosions onscreen at the same time. The main problem is that the explosions are going to be 64x64 pixels each, so there will (obviously) be 1/4 of sprite VRAM gone. There will also only be about 2 32x32 sprites left I can animate, so I figure that I'd double buffer the explosions so there will be 3/8 of sprite VRAM taken up (there will be the 2 64x64 explosions and 2 64x32 for the double buffering), but I'll only be using about 2/5 of the DMA bandwidth. I plan on cutting the top and bottom 8 pixels of the screen if they are not visible anyway, and if it comes down to it, I'll cut off 16 more pixels for DS resolution, but I'm only going to do that if I have to. I was originally going to make the scoreboard out of sprites, but I'm not sure how that's going to fit in the sprite VRAM...
Espozo wrote:
even though the SNES is clocked about 1/2 as fast as the Genesis, it takes about twice (or three times in the second example) the amount of time to do the same thing.
Cycles, not time. In FastROM, the procedure would take roughly 6/7 as long on the SNES as on the Genesis, assuming neither one can be further optimized (they look pretty simple, but they're out of context, and I don't know 68K).
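A quick back-of-the-envelope check of that ratio (assumed clock rates: ~3.58 MHz effective for a FastROM 65816, ~7.67 MHz for the MD 68000, summing both loops from the earlier post):

```python
SNES_MHZ, MD_MHZ = 3.58, 7.67    # assumed clock rates

snes_cycles = 11 + 10            # both loops, 65816
md_cycles = 22 + 32              # both loops, 68000

snes_us = snes_cycles / SNES_MHZ  # microseconds per pass
md_us = md_cycles / MD_MHZ

print(round(snes_us / md_us, 2))  # ~0.83, i.e. roughly 5/6 to 6/7 as long
```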
Yes, that's what I meant. (The SNES's CPU obviously isn't twice as fast.)
I felt like posting a YouTube video of me ranting about how it is bullshit to design game logic around perceived CPU speed. I'll still do it if I have enough time today.
I did notice that the second loop could be optimized a little by rearranging the loop.
Code:
lsr //2
bcc + //3 5
-;
iny //2
lsr //2 4
bcs - //3 7
+;
lsr.w #1,d0 //8
bcc + //10 18
-;
addq.b #1,d2 //4
lsr.w #1,d0 //8 12
bcs - //10 22
+;
The 68000 still takes 3 times the cycles.
Decided to redo the 68000 version to see how fast I could get (going by the code originally posted):
Code:
moveq #0, d0 ; 4
lea @Table(pc), a1 ; 8
@Loop:
move.b (a0)+, d0 ; 8
move.b (a1,d0.w), d1 ; 14
bmi.s @Loop ; 10 usually
; ...
@Table:
dc.b $00, $01, $00, $02 ; %0000, %0001, %0010, %0011
dc.b $00, $01, $00, $03 ; %0100, %0101, %0110, %0111
dc.b $00, $01, $00, $02 ; %1000, %1001, %1010, %1011
dc.b $00, $01, $00, $FF ; %1100, %1101, %1110, %1111
That's 12 cycles for init and 32 cycles per iteration. For the sake of comparison, at 65816's usual speeds that'd be about 6 and 16, respectively.
Um, ouch. Although now I want to see psycopathicteen go ahead and try the same using look-up tables. Pretty sure that if there wasn't a dare (or I was heavily starved for cycles) I'd have tried an approach similar to his.
PS: if anybody wonders, bmi.s would take 6 cycles when not branching. That'd only happen in the last iteration though, so I've decided to not count that possibility for the purpose of profiling.
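For what it's worth, the table is just "index of the lowest clear bit, or $FF if full" for each 4-bit mask, so it can be generated rather than typed by hand (a Python sketch, not the original tooling):

```python
def lowest_clear_bit(mask, width=4):
    """Index of the first free 16x16 quadrant, or $FF if the slot is full."""
    for bit in range(width):
        if not (mask >> bit) & 1:
            return bit
    return 0xFF

# one entry per 4-bit occupancy mask, %0000 .. %1111
table = [lowest_clear_bit(m) for m in range(16)]
print(table)
# [0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 255]
```

The output matches the `dc.b` table in the post byte for byte.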
psycopathicteen wrote:
Stef said I only got my Gunstar Heroes demo running on the SNES because I "simplified" everything to run good on the system, when my demo was actually running a much more complicated dynamic animation engine than the original game, and still had plenty of CPU time left.
Your Bad Apple demo for SNES is proof that the SNES CPU and 65xx architecture are actually not crappy.
The SNES CPU is "slow" because it is not running at its full speed (3.58 MHz), mainly because of the crappy slow WRAM...
And add to this, what about inexperienced 65xx coders?
I saw the source code of Art of Fighting for PCE; this game is full of macros in 68k style, and the icing on the cake: the code is 6502, not even HuC6280, and the game runs very fast with a faked zoom...
Quote:
Stef said I only got my Gunstar Heroes demo running on the SNES because I "simplified" everything to run good on the system, when my demo was actually running a much more complicated dynamic animation engine than the original game, and still had plenty of CPU time left.
I have had some discussion about GH, and I told him it is simpler than it seems.
I think you can't do an exact port, not because of the SNES CPU, but only because you cannot put the same number of sprites in H32; readability becomes terrible, and you would have heavy sprite flicker.
TOUKO wrote:
psycopathicteen wrote:
Stef said I only got my Gunstar Heroes demo running on the SNES because I "simplified" everything to run good on the system, when my demo was actually running a much more complicated dynamic animation engine than the original game, and still had plenty of CPU time left.
Your Bad Apple demo for SNES is proof that the SNES CPU and 65xx architecture are actually not crappy.
The SNES CPU is "slow" because it is not running at its full speed (3.58 MHz), mainly because of the crappy slow WRAM...
Pretty much. People just look at 3.58 and 7.6 and say "Duh, 7.6 is biger dan 3.58." They don't know anything deeper than that. It's funny when people try to compare processor speeds by seeing how fast the screen scrolls.
Even James Rolfe does it in his second SNES vs Genesis video. (You wouldn't believe how stupid the comments there are, and I'm including some of the people who are on the SNES side.)
Quote:
Pretty much. People just look at 3.58 and 7.6 and say "Duh, 7.6 is biger dan 3.58."
True, I've heard that so many times, mainly on the best 16-bit forum, the well-known Sega-16.
Quote:
It's funny when people try to compare processor speeds by seeing how fast the screen scrolls.
And parallax? You know the SNES can't do the millions of tons of parallax layers we have in practically all MD games...
However, I like this kind of parallax; on PCE you can only do the same with 1 background layer, you have no choice, but these are called screen splits ("ruptures"), not parallax, because the layers are not overlapping.
3.580 on the Master System or 4.194 on the monochrome Game Boy is also bigger than 1.790 on the NES, yet the 6502 gets more done in each cycle than the Z80 or similar chips, so it's mostly a wash. I'm under the impression that the 68000 and 65816 share a similar relationship: the latter can do twice as much work per clock.
So to be able to compare work per clock on an even playing field, some time ago I invented a unit of time called gencycles. One gencycle is a fraction of a 65816 cycle intended to roughly approximate the period of one 68000 cycle. Each slow access (WRAM, slow ROM, and each byte of DMA) takes 3 gencycles, and each fast access (most I/O ports, fast ROM, and "internal operation") takes 2 gencycles. So you can cycle-count a 68000 subroutine, see how many cycles it used, cycle-count a 65816 subroutine that does the same thing, count gencycles, and you should get something fairly close to the machines' relative speeds.
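A worked conversion (the slow/fast split assumed for the example loop is my own guess, not from the original post):

```python
SLOW, FAST = 3, 2   # gencycles per slow / fast 65816 access

def gencycles(slow_accesses, fast_accesses):
    """Convert a 65816 cycle count, split by access speed, to gencycles."""
    return slow_accesses * SLOW + fast_accesses * FAST

# psycopathicteen's first loop: 11 cycles total; suppose 1 is a slow
# WRAM table read and 10 are fast (FastROM / internal) -- an assumption.
print(gencycles(1, 10))   # 23 gencycles vs the 68000's 22 cycles
```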
A Genesis emulator author (I don't remember his name) profiled the 68K in some games, and his MIPS figure is 0.7/0.8 MIPS at best...
This is not abnormal; the 68k was designed for workstations and servers, not for a gaming system.
Its strength is that it's easy to code for, even in a high-level language (like C), along with its 24-bit address space: you don't have to care about annoying memory bank management and can access a large memory space. Its strength is not its frequency/cycle ratio, which is why in 68K arcade systems the CPU manages only the game logic and lets all the custom chips do the dirty work.
The 68k profiling files are here :
http://exophase.devzero.co.uk/profiles.zip
http://forums.sonicretro.org/index.php?showtopic=33501 This thread has a lot of really stupid posts like this:
Quote:
I'd rather see it attempted with a plain vanilla SNES with nothing added. I would think these approaches are a good idea:
1. Optimize the hell out of the code and make sacrifices where it may not be noticed.
2. Maybe slow Sonic's max speed to 5 pixels per frame so that it doesn't look excessively fast.
3. Have an SCD-style pan forward when running.
4. Scale what art you can down to 80% original scale horizontally so that at least those don't look distorted.
News flash! Did it ever occur to you that deliberately causing the game to run slow at full speed will cause **gasp** the game to run slow at full speed? Maybe that's the reason why you're having a hard time optimizing your game? The SNES isn't actually lagging, you just programmed it that way?
Oh yes, I had the same kind of discussion with Stef on a French forum about the SNES 65816 and the HuC6280; he treats both as a vanilla 6502, and I think all the guys on the Sonic forum do the same.
And you know, "the 68k is better because of 32-bit instructions"... but on a 16-bit console and in 2D games, who cares about this?
For PCE I mainly use 8 bits and some 16 bits (for pointers), and the HuC6280 is clocked at 7.16 MHz, not 1.79.
At the same clock speed the 65816 is close to twice as fast as a 6502.
Quote:
News flash! Did it ever occur to you that deliberately causing the game to run slow at full speed, will cause **gasp** the game to run slow at full speed? Maybe that's the reason why you're having a hard time optimizing you're game? The SNES isn't actually lagging, you just programmed it that way?
Yes, if you code badly you get the wrong result; the SNES CPU doesn't have many cycles to waste on bad programming.
The big advantage of the 65xx architecture is its high potential for code optimization.
For GH, if the CPU were the limiting factor, then why does a similar game not exist with a Super FX or SA-1?
Same for BTU: if it is the CPU that limits it to 3/4 sprites on screen, why didn't Capcom put a faster processor in FF 2/3 like they did with the MMX series?
Simply because the sprite-pixels-per-scanline limit is the real limiting factor, and a faster CPU cannot change this fact.
In the post above I was referring to the sonicretro guy stating "maybe slow Sonic's max speed to 5 pixels per frame so that it doesn't look excessively fast" making no sense, because lowering the pixels-per-frame movement speed of Sonic's character wouldn't speed up the CPU at all, it would just make it look like the game is lagging, even when it is running at 60fps.
It's like recording a slow-motion voice into a tape recorder. If you play it at normal speed, it will sound like slow-motion, because it was recorded that way.
You missed something:
Some Bloke on Sonic Retro wrote:
3.54MHz do not a fast system make
Dat Gramar. Anyway, I'm pretty sure it's 3.58. I know I'm nitpicking, but come on. It's as easy as typing "SNES" into Google and looking at the Wikipedia page.
Quote:
Same for BTU: if it is the CPU that limits it to 3/4 sprites on screen, why didn't Capcom put a faster processor in FF 2/3 like they did with the MMX series?
Simply because the sprite-pixels-per-scanline limit is the real limiting factor, and a faster CPU cannot change this fact.
Again, I can't stress enough how video hardware is generally more important to a video game system than processing power. I do think they could have at least added another enemy though and just said "oh well!" when the "cheese grater effect" (that's what I like to call it anyway; yeah, I know I'm weird) kicks in.
Quote:
In the post above I was referring to the sonicretro guy stating "maybe slow Sonic's max speed to 5 pixels per frame so that it doesn't look excessively fast" making no sense, because lowering the pixels-per-frame movement speed of Sonic's character wouldn't speed up the CPU at all, it would just make it look like the game is lagging, even when it is running at 60fps.
Yes, moving a sprite by 5 pixels or 50 is the same in terms of CPU cycles, and sprite moving/repositioning is not CPU intensive at all.
Quote:
Again, I can't stress enough how video hardware is generally more important to a video game system than processing power.
Yes, the more powerful hardware you have, the less processing power is needed.
But it's better to have both if possible, because with CPU power you can do more.
But the SNES PPU is not good in the sprite department in my opinion: too convoluted and badly designed, and it forces coders to spend a big amount of CPU cycles dealing with it.
TOUKO wrote:
But the SNES PPU is not good in the sprite department in my opinion: too convoluted and badly designed, and it forces coders to spend a big amount of CPU cycles dealing with it.
I don't think the high OAM table thing is too bad, but the 1/4 of VRAM thing is certainly an issue. Not being able to use 8x8, 16x16, and 32x32 sprites at the same time is a major annoyance though. (I couldn't care less about 64x64, considering there isn't nearly enough VRAM for sprites or overdraw to justify its existence.)
Just thinking, wouldn't Sonic (or any sprite really) moving at 5fps actually take more processing power? Then you would have to deal with sub-pixel velocity.
psycopathicteen> Sorry you took it as a criticism... actually I was mainly speaking about the fact that you were using only one type of enemy, with simplistic AI, explosion animation physics and so on. I'm just shocked people really compare your demo to GH... Maybe your dynamic sprite allocation system is complex enough, but still, try to fit it into a real game with all the game logic and other calculations aside to see if it still works...
Honestly GH is one of the most impressive games on the Sega Megadrive; do you really believe you can replicate it accurately on the SNES, which does have a weaker CPU and a dumb PPU design? If you really believe that, sorry guys, but you are really naive or just fanboy blind.
And speaking about the 65816 vs 68000, I never compared them on their clocks; you definitely can't compare different CPU architectures on clock speed... The MD 68000 runs at 7.7 MHz compared to the 2.68 MHz (or ~3.1 MHz with FastROM) 65816, but if we compare bus speed then the 68000 only runs at 1.92 MHz. If you want to compare these CPUs on cycles, then do it on bus cycles; it's more fair.
And that's why I really think the 65xx architecture is just bad: it requires very fast memory to work with, and its efficiency definitely sucks. The 65816 uses faster memory than the MD 68000 and does far less with it (of course the 8-bit data bus does not help, but that is a tribute of the 65xx-inspired design).
If you really want to compare the 65816 versus the 68000, don't adapt your 65816 code into 68000 code; of course you won't use the same algorithm on the 68000, which has plenty of registers and a 16/32-bit architecture. At least try to optimize your code to take advantage of the 16-bit data bus. Here (in Sik's code) we can see there are two byte-wide accesses, which is already a good indication that the original code was not designed for the 68000; of course sometimes you don't have a choice, but most of the time there is a way to adapt for 16-bit accesses.
Stef wrote:
psycopathicteen> Sorry you took it as a criticism...
What else is he supposed to take it as?
Stef wrote:
actually I was mainly speaking about the fact that you were using only one type of enemy,
That's about all there is in the game.
Stef wrote:
with simplistic AI,
That's about how it looks in the game.
Stef wrote:
try to fit it into a real game with all the game logic and other calculations aside to see if it still works...
What other game logic is left exactly? Maybe a second player and other weapons, but not much else. I'm not saying that that's it, but the CPU-intensive stuff is for the most part done with. Adding more levels doesn't cause slowdown, it just takes up more memory. I guess you could be talking about the bosses or the stage set pieces (like the mine cart things), but those don't look like they'd be very processor intensive.
Stef wrote:
Honestly GH is one of the most impressive games on the Sega Megadrive;
I'm sure the Megadrive can do better. If you're going to pick a game to have a fetish for, pick a better game.
Stef wrote:
weaker CPU
I thought we'd gone over this...
Stef wrote:
a dumb PPU design?
I know, the SNES only having 64 colors was a really stupid design choice. The Megadrive is way better because it has 256 colors.
Stef wrote:
If you really believe that, sorry guys, but you are really naive or just fanboy blind.
I'm really not sure what to say...
Stef wrote:
And speaking about the 65816 vs 68000, I never compared them on their clocks; you definitely can't compare different CPU architectures on clock speed... The MD 68000 runs at 7.7 MHz compared to the 2.68 MHz (or ~3.1 MHz with FastROM) 65816, but if we compare bus speed then the 68000 only runs at 1.92 MHz. If you want to compare these CPUs on cycles, then do it on bus cycles; it's more fair.
I like how you made the MD go up .1 megahertz and the SNES go down .4 megahertz.
Stef wrote:
65xx architecture is just bad: it requires very fast memory to work with, and its efficiency definitely sucks
It takes about twice as many cycles to do the same thing on the 68000 as on the 65816, and we're talking about efficiency...
Stef wrote:
If you really want to compare the 65816 versus the 68000, don't adapt your 65816 code into 68000 code
You're right. He should try to adapt the code from the 68000 to the 65816 instead.
Stef wrote:
psycopathicteen> Sorry you took it as a criticism... actually I was mainly speaking about the fact that you were using only one type of enemy, with simplistic AI, explosion animation physics and so on. I'm just shocked people really compare your demo to GH... Maybe your dynamic sprite allocation system is complex enough, but still, try to fit it into a real game with all the game logic and other calculations aside to see if it still works...
Honestly GH is one of the most impressive games on the Sega Megadrive; do you really believe you can replicate it accurately on the SNES, which does have a weaker CPU and a dumb PPU design? If you really believe that, sorry guys, but you are really naive or just fanboy blind.
I don't have any idea what you're basing your assumptions on. You can argue all you want about the advantages of each processor but at the end of the day I'm still not seeing any reason that it can't be done. You haven't done any math here, you're basically saying "because it's inferior in these few respects it can't POSSIBLY run this game" which is ridiculous. Do you really think that GH is so painstakingly optimized and pushing the limits of the system so extremely that a very-nearly-equivalent system couldn't possibly do the same thing, no matter how you made use of its advantages? That is a HUGE assumption. You have absolutely nothing to back it up.
The only way you're going to prove your point is if you or somebody else actually does try to port GH to SNES. And if they fail that only proves that THEY couldn't do it - not that it can't be done. This conversation is frankly ridiculous.
Thank you Khaz. I'm 99% sure the reason psychopathicteen quit the GH port was because he got sick of it and didn't feel like finishing it, not because it wasn't possible.
Oh, it's on now!
Espozo and Stef: let's see if we can't not get the thread locked this time (ie: have something approximating a rational discussion) - it's a useful thread and it'd be a shame if this digression sank it...
Stef wrote:
And speaking about the 65816 vs 68000, I never compared them on their clocks; you definitely can't compare different CPU architectures on clock speed... The MD 68000 runs at 7.7 MHz compared to the 2.68 MHz (or ~3.1 MHz with FastROM) 65816, but if we compare bus speed then the 68000 only runs at 1.92 MHz. If you want to compare these CPUs on cycles, then do it on bus cycles; it's more fair.
And that's why I really think the 65xx architecture is just bad: it requires very fast memory to work with, and its efficiency definitely sucks. The 65816 uses faster memory than the MD 68000 and does far less with it (of course the 8-bit data bus does not help, but that is a tribute of the 65xx-inspired design).
We actually have clock data and instruction cycle counts for both SNES and MD, and it is possible to write equivalent code and compare wall-clock time (obviously, the larger and more functionally complete the code sample the better). Complaining that the 65x is a bus hog and requires faster memory is beside the point; the consoles are what they are, and have been for a quarter of a century.
On the topic of FastROM, notice how psycopathicteen's original example snippet only has a single slow-access cycle out of 21 cycles? Not all code is that insular, but to get near 3.1 MHz average speed you'd have to be doing non-indexed 16-bit direct page WRAM accesses pretty much exclusively; most instructions have a lower RAM access fraction than that.
(If you're really desperate, I think you can set direct page to $4300 and use the DMA registers as fast RAM; as long as $420B and $420C are zeroed, nothing too horrifying should happen...)
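The effective average clock follows directly from the slow/fast access mix (the 21.477 MHz master clock and the 6/8-tick access costs are standard SNES figures; the mix itself is the variable):

```python
MASTER = 21.477            # SNES master clock, MHz
FAST_DIV, SLOW_DIV = 6, 8  # master-clock ticks per fast / slow access

def effective_mhz(slow_fraction):
    """Average 65816 speed given the fraction of slow (8-tick) accesses."""
    avg_ticks = slow_fraction * SLOW_DIV + (1 - slow_fraction) * FAST_DIV
    return MASTER / avg_ticks

print(round(effective_mhz(0.0), 2))     # 3.58 (all FastROM/internal)
print(round(effective_mhz(1.0), 2))     # 2.68 (all slow accesses)
print(round(effective_mhz(1 / 21), 2))  # ~3.52 for the 1-in-21 snippet above
```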
93143 wrote:
Espozo and Stef: let's see if we can't not get the thread locked this time (ie: have something approximating a rational discussion)
Aww...
That's no fun!
Espozo wrote:
I'm not saying that that's it, but the CPU intensive stuff is for the most part done with. Adding more levels doesn't cause slowdown, it just takes up more memory.
Compression costs CPU time. I remember reading an interview stating that Gunstar Heroes would be over 2 MB if uncompressed. The same interview states that complex jointed bosses also cost CPU time, and I'm guessing that the 68000's multiply instructions help with that.
Quote:
Stef wrote:
And speaking about the 65816 vs 68000, I never compared them on their clocks; you definitely can't compare different CPU architectures on clock speed... The MD 68000 runs at 7.7 MHz compared to the 2.68 MHz (or ~3.1 MHz with FastROM) 65816
I like how you made the MD go up .1 megahertz and the SNES go down .4 megahertz.
I'm assuming the extra .4 MHz came out of two things: slow RAM and RAM refresh.
Quote:
I'm 99% sure the reason psychopathicteen quit the GH port was because he got sick of it
The other 1% being a cease and desist letter?
93143 wrote:
to get near 3.1 MHz average speed you'd have to be doing non-indexed 16-bit direct page WRAM accesses pretty much exclusively
How many cycles does STA [$69],Y take?
Espozo wrote:
Stef wrote:
psycopathicteen> Sorry you took it as a criticism...
What else is he supposed to take it as?
Stef wrote:
actually I was mainly speaking about the fact that you were using only one type of enemy,
That's about all there is in the game.
Stef wrote:
with simplistic AI,
That's about how it looks in the game.
Here's how I always thought the AI in Gunstar Heroes works:
Roll the dice. If it lands on X, perform action Y. Roll the dice again to determine how long before the enemy performs another move. Repeat.
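That pattern is cheap to implement; here's a minimal Python sketch of it (the action names and delay range are invented for illustration, not taken from the game):

```python
import random

ACTIONS = ["walk", "jump", "shoot", "throw"]  # hypothetical move list

def enemy_ai_step(rng):
    """Roll once to pick an action, roll again for the delay (in frames)
    before the next decision -- the 'roll the dice' pattern described."""
    action = rng.choice(ACTIONS)
    cooldown = rng.randint(10, 60)  # frames until the next roll
    return action, cooldown

rng = random.Random(1)
action, cooldown = enemy_ai_step(rng)
print(action, cooldown)
```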
tepples wrote:
How many cycles does STA [$69],Y take?
Two or three ROM/internal, four or five RAM. How often do you actually use direct indirect indexed long stores? Or any indirect indexed instructions?
(I've only ever programmed test code and mockups, and I don't hack ROMs, so I don't have a good idea of what game logic actually looks like...)
...
@psycopathicteen: If you load an indexed sprite sheet into the GIMP, copy it to a new RGB file, rescale it with high-quality interpolation, and copy-paste it back into the original indexed image (delete the original first if there's any transparency involved) so as to force it back to the original palette, there should be a fairly minimal amount of work involved in touching up the resulting rescaled sprites. Just be sure to delete any graphics that don't share the palette and re-index so you know you won't get misplaced colours. This gives much better results than simple nearest-neighbour rescaling.
Proof of concept attached:
Attachment: super_tie_fighter.gif
tepples wrote:
The other 1% being a cease and desist letter?
...wait, did he actually get one?
Oh wow, there's bullshit from both ends here. First the "let's limit Sonic's speed to 5 pixels per frame" on one side, then the other side taking it to mean "make the game run at 5FPS". WTF?
Also I want somebody to try the table look-up method on the 65816 to see how it performs there.
93143 wrote:
tepples wrote:
The other 1% being a cease and desist letter?
...wait, did he actually get one?
If so, I'm not aware. It was a guess. But I wouldn't put it beyond a lot of major video game publishers nowadays, especially with the growing willingness of courts to enforce exclusive rights in "look and feel".
Sik wrote:
Oh wow, there's bullshit from both ends here. First the "let's limit Sonic's speed to 5 pixels per frame" on one side, then the other side taking it to mean "make the game run at 5FPS". WTF?
I know the original quote was unknowledgeable, but what happens when you still can't get all the processing done at 30fps? I sincerely doubt that you could even make the SNES run at 5fps, and there's absolutely no question that the SNES version wouldn't run as slow as 5fps even if it could. (I know it would pain him dearly, but even Stef could admit that.)
Espozo wrote:
I sincerely doubt that you could even make the SNES run at 5fps
What kind of frame rate do you get out of games like Jurassic Park or some of the Super FX games like Doom, especially in more complex scenes?
I didn't think it ever went slower than 10. Anyway, the main question is how much the frame rate drops every time it doesn't finish processing, like if the frame rate always got cut in half to where it went 60, 30, 15, 7.5. I just don't have a clue how it works, and it can't be just any number because (at least to my knowledge) you can't have a game run at 59fps.
5fps means moving once every 12 tv frames.
If it's not a fixed rate you can have non-integer divisors. e.g. 59 could be "dropped one frame every second".
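In other words (a toy model, assuming a 60 Hz display):

```python
def effective_fps(frames_dropped_per_second, display_hz=60):
    """Average update rate when some updates miss the vblank deadline."""
    return display_hz - frames_dropped_per_second

# fixed divisors: every update shown for 1, 2, or 4 display frames
print(60 // 1, 60 // 2, 60 // 4)  # 60 30 15

# uneven dropping: lose a single frame each second -> the "impossible" 59
print(effective_fps(1))  # 59
```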
Espozo wrote:
Sik wrote:
Oh wow, there's bullshit from both ends here. First the "let's limit Sonic's speed to 5 pixels per frame" on one side, then the other side taking it to mean "make the game run at 5FPS". WTF?
I know the original quote was unknowledgeable, but what happens when you still can't get all the processing done at 30fps? I sincerely doubt that you could even make the SNES run at 5fps, and there's absolutely no question that the SNES version wouldn't run as slow as 5fps even if it could. (I know it would pain him dearly, but even Stef could admit that.)
Wait, what game are we talking about now? x_X Bleh, ignore that, although maybe it's a good moment to split the thread since this doesn't seem to be about dynamic VRAM slot algorithms anymore (I bet that split thread would get locked rather soon like the other one, though...).
I almost don't see the point in locking topics. All that happens is that the argument gets brought elsewhere to continue. I imagine people would get tired of arguing at one point or another anyway.
Quote:
The same interview states that complex jointed bosses also cost CPU time, and I'm guessing that the 68000's multiply instructions help with that.
Yes, but you only have the player and the boss, and the multiply instruction (if it's used here) is slow; you'd be better off using a LUT.
A game like Super Aleste is much more CPU intensive, and here too I think the problem with a SNES port of SA is the sprite limit.
This game, like GH, is designed to use the MD sprite capabilities to maximize sprites on screen with minimum flicker in H40.
Khaz wrote:
I don't have any idea what you're basing your assumptions on. You can argue all you want about the advantages of each processor but at the end of the day I'm still not seeing any reason that it can't be done. You haven't done any math here, you're basically saying "because it's inferior in these few respects it can't POSSIBLY run this game" which is ridiculous. Do you really think that GH is so painstakingly optimized and pushing the limits of the system so extremely that a very-nearly-equivalent system couldn't possibly do the same thing, no matter how you made use of its advantages? That is a HUGE assumption. You have absolutely nothing to back it up.
The only way you're going to prove your point is if you or somebody else actually does try to port GH to SNES. And if they fail that only proves that THEY couldn't do it - not that it can't be done. This conversation is frankly ridiculous.
Oh I see, "we cannot say it is not possible to make GH on SNES as nobody tried to do it"... OK, in this case maybe we could run Far Cry 4 on the SNES, but we will never know as nobody tried it? I admit you can always find new ways to improve/optimize your code, but in the end you need to find something usable under real game conditions. Imagine you need a 1 MB LUT and graphics organized in a very specific way with many hard-coded constraints; then your engine is definitely not usable in real conditions. After the optimization, then, you have the maths... A lot of people believe the 65816 is better because of its better per-clock efficiency, but the problem is that you will never be able to clock the 65816 at the same speed as a 68000 (because of the memory). Actually, even 3.58 MHz was not possible when the SNES was released; FastROM was used quite late in the SNES's lifetime... But anyway, I accept comparing with a 3.1 MHz 65816 (which is, I believe, an optimistic estimate of the CPU speed while it is running from FastROM), so let's do the maths.
Memory fill:
68000 = 1 byte every 3 cycles (classic move) or ~2.2 cycles (with movem)
65816 = 1 byte every 2 cycles (STA direct page with 16-bit accumulator)
Memory copy:
68000 = 1 byte every 5 cycles (classic move) or ~4.5 cycles (with movem)
65816 = 1 byte every 7 cycles (MVN) or 5 cycles (LDA/STA absolute sequence with 16-bit accumulator)
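To put those per-byte figures in perspective once clock speed is factored in, here is a quick back-of-the-envelope calculation (a sketch using the 7.67 MHz and 3.1 MHz figures assumed in this discussion, not a benchmark):

```python
# Bytes per second for each fill/copy method, using the cycles-per-byte
# figures quoted above and the clock speeds assumed in the post.
M68K_HZ = 7_670_000    # Mega Drive 68000
W65816_HZ = 3_100_000  # the optimistic SNES estimate used here (FastROM)

def bytes_per_sec(clock_hz, cycles_per_byte):
    return clock_hz / cycles_per_byte

# Memory fill
fill_68k = bytes_per_sec(M68K_HZ, 2.2)    # movem-based fill
fill_816 = bytes_per_sec(W65816_HZ, 2.0)  # 16-bit STA direct page

# Memory copy
copy_68k = bytes_per_sec(M68K_HZ, 4.5)    # movem-based copy
copy_816 = bytes_per_sec(W65816_HZ, 5.0)  # LDA/STA absolute, 16-bit

print(f"fill: 68000 {fill_68k/1e6:.2f} MB/s vs 65816 {fill_816/1e6:.2f} MB/s")
print(f"copy: 68000 {copy_68k/1e6:.2f} MB/s vs 65816 {copy_816/1e6:.2f} MB/s")
```

Under these assumptions the 68000 moves roughly twice as many bytes per second either way, which is the point being made about raw cycle counts being misleading.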
I don't know the 65816 well enough, but I believe we can't do faster; I used the simplest addressing modes to get the fastest usable cases. Yet even in these favorable situations we can see that, for basic operations like memory fill or copy, the 68000 is on par in terms of cycle counts... while the 68000 has many more cycles available.
Then we can find an unlimited number of examples, but I am thinking of the one posted on Sega-16 by tomaitheous to show that the 6502 can be fast, even for 24-bit calculations:
Code:
;24bit add for X and Y. 16:8 = 16bit for whole, 8bit for fractional
ldx #$xx ;2
jsr AddVelocity ;6
AddVelocity:
lda x_float,x ;4
clc ;2
adc x_float_inc,x ;4
sta x_float,x ;5
lda x_whole.l,x ;4
adc x_whole_inc,x ;4
sta x_whole.l,x ;5
lda x_whole.h,x ;4
adc x_whole_inc.h,x ;4
sta x_whole.h,x ;5 = 41
lda y_float,x ;4
clc ;2
adc y_float_inc,x ;4
sta y_float,x ;5
lda y_whole.l,x ;4
adc y_whole_inc,x ;4
sta y_whole.l,x ;5
lda y_whole.h,x ;4
adc y_whole_inc.h,x ;4
sta y_whole.h,x ;5 = 41
rts ;6
;41+41+6+6+2 = 96 cycles
The point of the method is to handle object displacement (player ship or enemies), and the claim was that the 6502 could even be faster here than a 68000 in terms of cycle count (Steve Snake wrote a version that was above 100 cycles on the 68000). But reworking the data structure and the code a bit for the 68000 ends up with this:
Code:
.loop:
move.l (a0)+,d0 ; 12 X_inc
move.l (a0)+,d1 ; 12 Y_inc
add.l d0,(a0)+ ; 20 X += X_inc
add.l d1,(a0)+ ; 20 Y += Y_inc
dbra d7, .loop ; 10
74 cycles.
OK, we were comparing against the 6502, which is not fair, and I'm certain we can lower the initial 96-cycle count a lot with the 65816. But even if you were able to get the calculation done in 40 cycles on the 65816, you would still be below the 68000's performance level.
Considering 7.67 MHz against 3.1 MHz, you need a ratio of ~2.5 to be equivalent to the 68000; so here, with 74 cycles on the 68000, you would have to do it in 74/2.5 ≈ 30 cycles to be on par... and this time I really doubt you can make it fit in 30 cycles on the 65816.
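The normalization used here can be spelled out as simple arithmetic (same assumed clocks as above; 74 is the cycle count of the 68000 listing):

```python
# Clock-normalized comparison of the two displacement routines above.
M68K_HZ, W65816_HZ = 7_670_000, 3_100_000

ratio = M68K_HZ / W65816_HZ   # ~2.47 68000 cycles for every 65816 cycle
m68k_cycles = 74              # the move.l/add.l loop body above
target = m68k_cycles / ratio  # 65816 cycle budget just to break even

print(f"clock ratio ~{ratio:.2f}")
print(f"65816 must finish in ~{target:.0f} cycles to match 74 on the 68000")
```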
Stef wrote:
Oh i see, "we cannot say it is not possible to make GH on SNES as nobody tried to do it"... ok in this case maybe we could run FarCry 4 on SNES but we will never know as nobody tried it ?
The audacity of this guy, making such claims about what is and is not possible on the SNES when he admits that:
Stef wrote:
I don't know the 65816 well enough
If you don't know enough then maybe you aren't qualified to be dictating what can and cannot be done on the system, huh? For example, are you even familiar with the concept of DMA? I would think that has some bearing on your calculations. And furthermore you missed my point entirely; the point was that pointing out subtle differences where the SNES doesn't come out on top is not proving ANYthing about what games can and can't run on it. This conversation is still ridiculous.
I'm sorry for continuing to post off topic.
You see! Stupid shit like this happens if you don't be aggressive and take action.
Stef wrote:
I don't know the 65816 well enough
I don't know enough about the 68000, but you don't see me picking fights. If you're here just to bitch about things, go somewhere else.
You know, why don't we just do a topic split right here and now for a "Useless Hate" thread.
Khaz wrote:
If you don't know enough then maybe you aren't qualified to be dictating what can and cannot be done on the system, huh?
I have worked with many CPUs with different architectures in my programming life, and the first assembly I ever touched (maybe 20 years ago) was actually 6502 assembly (on a VIC-20). So even if I am not currently programming the 65816 and don't know its ISA by heart, I have a perfect understanding of its architecture and a good view of its capabilities.
Quote:
For example, are you even familiar with the concept of DMA? I would think that has some bearing on your calculations.
Are you serious? I just wanted to compare CPU performance, to point out the problem with the 65xx architecture.
Quote:
And furthermore you missed my point entirely; the point was that pointing out subtle differences where the SNES doesn't come out on top is not proving ANYthing about what games can and can't run on it.
Oh really? So please tell me how you can quantify it then? The CPU is the heart of the system... if CPU A can't compute as much as CPU B, then necessarily you can't replicate games from system B on system A at the same speed; that's pure logic. Honestly, I am really amazed that some people still dare to compare the 65816 to the 68000; the latter is so much more advanced there is no competition. Even with a 65816 running @3.1 MHz you barely get about half the performance of a 7.7 MHz 68000, which only needs slower memory; that is the point. You can definitely find particular situations where the 65816 will perform well (and why not even outperform the 68000), but generally the 68000 will be the clear winner, and by a large margin.
Espozo wrote:
You see! Stupid shit like this happens if you don't be aggressive and take action.
I don't know enough about the 68000, but you don't see me picking fights. If you're here just to bitch about things, go somewhere else.
You know, why don't we just do a topic split right here and now for a "Useless Hate" thread.
Why do you need to be that aggressive? I was just replying to the hilarious stuff I read in this topic, and I came with arguments and numbers. Still, I agree it is getting off topic and we should split it, but IMO it can be an interesting discussion if people try to think and speak without a biased point of view.
Quote:
So even if I am not currently programming the 65816 and don't know its ISA by heart, I have a perfect understanding of its architecture and a good view of its capabilities.
No, and as I've told you all along, the 65816 and HuC6280 are not 6502s...
The HuC6280 is faster than a vanilla 6502 and runs at 7.16 MHz; the 65816 is faster than the other two at the same clock.
Do you remember you claimed the 68k was 4x faster than the 65816??
I told you they are close for a gaming machine, and at 2.58 the 68k is still faster, but not by a lot.
TOUKO wrote:
No, and as I've told you all along, the 65816 and HuC6280 are not 6502s...
The HuC6280 is faster than a vanilla 6502 and runs at 7.16 MHz; the 65816 is faster than the other two, but runs at a lower clock than the 6280.
Of course the 65816 is faster than a 6502. That's why I said we can lower the cycle count of the previous code by a large margin on the 65816 (from 96 cycles with the 6502, I guess we can at least get down to 50 cycles on the 65816).
Quote:
Do you remember you claimed that 68k was 4x more faster than 65816 ??
Please, I just invite you to find where I claimed that, OK?
As usual you probably misinterpreted my words or put them out of context.
I always said the 65816 in the SNES (and so clocked at ~3 MHz with FastROM) is about half the performance level of the 7.7 MHz 68000.
Quote:
I told you they are close for a gaming machine, and at 2.58 the 68k is still faster, but not by a lot.
Yeah, I know that, but you are just wrong.
@2.68 MHz the 68000 is actually *a lot* faster.
Quote:
Please, I just invite you to find where I claimed that, OK?
As usual you probably misinterpreted my words or put them out of context.
I always said the 65816 in the SNES (and so clocked at ~3 MHz with FastROM) is about half the performance level of the 7.7 MHz 68000.
No Stef, I'm sure of it, but I'm not sure I can find where in the "SNES vs MD" topic (a ton of posts).
Quote:
Yeah, I know that, but you are just wrong.
@2.68 MHz the 68000 is actually *a lot* faster.
A lot?? You're being really optimistic.
Do you claim again that Bad Apple in 8 MB is impossible on the SNES?? I don't see the "a lot" factor here.
Quote:
I'm just shocked people really compare your demo to GH
Hmmm, maybe like you comparing your Star Fox demo??
Or the Axelay scroll demo with nothing around??
To be honest, I am really impressed by the SNES version of Bad Apple, but even more impressed by the fact that they figured out a codec fast enough to work on the 65816 and fit the video and sound in 8 MB.
Still, even with 90% of the demo done, the last 10%, which consists of getting rid of the remaining slowdowns, will be really tricky, as the code is already quite optimized. There is also the sound streaming issue, but I believe that one is easier to sort out.
The MD version of the demo uses more complex compression methods. I used about 5 or 6 compression schemes for the tilemap and about 7 or 8 for the tile data... honestly, I think my code is definitely not optimal on that point.
For sure, the native 2bpp mode of the SNES helps with compression; in the MD version I need to encode two 2bpp frames into a single 4bpp frame, which definitely reduces the compression ratio, and I also have 25% more data to unpack compared to the SNES (wider resolution).
Quote:
Hmmm, maybe like you comparing your Star Fox demo??
Or the Axelay scroll demo with nothing around??
OK for Axelay, but the Star Fox demo is different: what costs the most, *by far*, in this case is the 3D math and the 3D rendering. If you get the engine done, then you can consider making a game from it without much trouble. Any game logic will have an insignificant cost on the CPU compared to the 3D stuff.
Quote:
OK for Axelay, but the Star Fox demo is different: what costs the most, *by far*, in this case is the 3D math and the 3D rendering. If you get the engine done, then you can consider making a game from it without much trouble. Any game logic will have an insignificant cost on the CPU compared to the 3D stuff.
Agreed, but that is not proof of anything; you can have the CPU resources for your demo, but not enough for collisions and everything around them.
Sorry, but your argument can be applied to the GH demo: the difficult part is having many sprites on screen with collisions and without any slowdown; the rest is easy to do. In GH the sprites are simply clones; I think all the animation frames fit in VRAM because they are very limited, same for the AI and physics, which are really simplistic.
Quote:
The MD version of the demo uses more complex compression methods. I used about 5 or 6 compression schemes for the tilemap and about 7 or 8 for the tile data
Maybe, but the final result is the same; if you want to impress, do the same with just the 68k, not 68k+Z80...
TOUKO wrote:
Maybe, but the final result is the same; if you want to impress, do the same with just the 68k, not 68k+Z80...
Should the SNES demo be done without the SPC700?
DoNotWant wrote:
Should the SNES demo be done without the SPC700?
It's different; I think most of the audio work in the SNES version is done by the 65816, which streams audio directly to the SPC.
But I don't know exactly what sort of techniques (audio and tiles) psycopathicteen uses...
In the MD's case all audio was done by the Z80, leaving a big amount of 68k cycles free...
I think we should stop digressing; this is not the right topic for CPU comparison.
Well, I originally typed out something much larger and "aggressive", but I've decided to play "nicer". Here we go...
Stef wrote:
Why do you need to be that aggressive ?
Do you see me starting arguments? That'd be like if I went to Sega-16 just to say "The 68000 is trash, the 65816 is better because it can handle Space Megaforce."
Stef wrote:
their biased point of view.
...
Stef wrote:
You can definitely find particular situations where the 65816 will perform well (and why not even outperform the 68000), but generally the 68000 will be the clear winner, and by a large margin.
Why do you find that the 68000 "is the clear winner", but nobody else does? Is it because of their "biased point of view"? Anyway, psychopathicteen wrote code that performed better on the 65816, and you didn't like it because when he "ported" it over to the 68000 it was still faster on the 65816; so you wrote code for the 68000 and "ported" it over to the 65816, and it was faster on the 68000, to show that the 68000 is faster because it handled that code better, completely disregarding what psychopathicteen wrote. From what I got of the code, the 65816 and the 68000 are better at different things, but you don't want a draw, you want to win. I understand that the Genesis's 68000 is still faster than the SNES's 65816 but not even relatively close to 4x. (Don't even pretend you didn't say that.)
Anyway, can we finally settle on a draw and not have another one of these issues?
Espozo wrote:
Do you see me starting arguments? That'd be like if I went to Sega-16 just to say "The 68000 is trash, the 65816 is better because it can handle Space Megaforce."
Did you notice that I basically replied to psychopathicteen and argued why his demo is not enough to say "we can port GH to the SNES"? Of course you are more than welcome on Sega-16 to get some technical talk, as long as your arguments are valid and you are not aggressive toward other members.
Stef wrote:
Why do you find that the 68000 "is the clear winner", but nobody else does? Is it because of their "biased point of view"?
Nobody else does? Are you speaking about the general opinion, or just what we find here?
I really wonder in which world you live... Have you ever had a look at a specialized hardware forum?
Quote:
Anyway, psychopathicteen wrote code that performed better on the 65816, and you didn't like it because when he "ported" it over to the 68000 it was still faster on the 65816; so you wrote code for the 68000 and "ported" it over to the 65816, and it was faster on the 68000, to show that the 68000 is faster because it handled that code better, completely disregarding what psychopathicteen wrote. From what I got of the code, the 65816 and the 68000 are better at different things, but you don't want a draw, you want to win.
But you can take the code from psychopathicteen if you want; take the *best* code for your 65816, but then also take the best one for the 68000... then you can compare. And of course there is a winner; again, these CPUs do not compete in the same category. Even with the 65816 clocked at 3 MHz (which is fast considering the required memory speed), the MD's 68000 has the edge by a large amount. You don't want to accept it, but that is just the simple truth. The only advantage of the 65xx series CPUs is the price: it's a really cheap CPU, but then you pay for it elsewhere, with generally very poor "efficiency".
Quote:
I understand that the Genesis's 68000 is still faster than the SNES's 65816 but not even relatively close to 4x. (Don't even pretend you didn't say that.)
Oh yeah, I said it; you know better than me
Are you Touko's cousin?
I said the Genesis's 68000 is faster than the SNES's 65816, about twice as fast, which is already a nice difference when you consider the Genesis is two years older. A 6 MHz 65816 should be on par with a 7.7 MHz 68000, which is, actually, not a good score for the 65816.
Stef wrote:
I have worked with many CPUs with different architectures in my programming life, and the first assembly I ever touched (maybe 20 years ago) was actually 6502 assembly (on a VIC-20).
Then it sounds like you should have enough experience to know better than to try to pick this particular fight. The war between Genesis and SNES has been going on for decades now. Like any war that lasts decades, nobody has emerged a clear winner. There are no fanboys left - they have grown into fanmen.
The Sega Genesis was the first console I ever had. I loved it to death back then and I still do now. If you forced me to pick a side, I would be on YOUR SIDE. The reason I'm against you right now is because you're trying to make a ridiculous point. You're throwing around wild speculation that you believe GH can't be replicated, with nothing that specifically backs that point up.
When I say to "back it up", I mean with some serious analysis that warrants consideration. Something like "to reproduce GH you'd need these tile sizes in this video mode, you'd need this much time for sprite routines and this much time to process the AI and this much time for the rest, and due to the much faster way the Genesis does ____ there is no possible way the SNES can do the same job." That would warrant a response. Saying "This specific sample of code is slower on SNES" is totally meaningless, and you being a programmer yourself should know that.
Stef wrote:
Are you serious? I just wanted to compare CPU performance, to point out the problem with the 65xx architecture.
Uh, yes, I'm quite serious. DMA makes a huge difference to the speed of writing/copying blocks of data to either WRAM or VRAM. Since you were doing all your comparisons based on direct page instructions and not considering this advantage I don't think your assessment was fair.
Stef wrote:
Quote:
And furthermore you missed my point entirely; the point was that pointing out subtle differences where the SNES doesn't come out on top is not proving ANYthing about what games can and can't run on it.
Oh really? So please tell me how you can quantify it then?
I DON'T.
I don't go around trying to say that Game X is too complicated for System Y because that would be a foolish thing to do - you can't prove it and you're just inviting people to prove you wrong. You talk exclusively about CPU power, which again you should know better than. The SNES is more than a 65816. The Genesis is more than a 68000.
Anyways, I don't see the point of this conversation here. I don't see why somebody would come into the SNES Development forums to try to convince people that the SNES is an inferior console. What is it you're trying to accomplish, Stef? Right now I only see two possible motivations: Either you're just the biggest GH fan on earth and you're trying to goad somebody into porting it for you, or you just hate the SNES and you're trying to distract us all from our work here.
Regardless, I think the best course of action would be to split this discussion out of this (extremely valuable!) thread. And I second my call for a dedicated "SNES VS GENESIS BLOODBATH" forum where we can just HAVE all these ridiculous discussions and keep them away from the actual work being done. I don't think it is possible to prevent people from debating it altogether.
I am sorry, again, for posting more.
Espozo wrote:
You know, why don't we just do a topic split
PM me the split point and I'll handle it.
Okay I have a lot of posts to reply to.
The slowdown in Bad Apple is mostly the 65816 waiting for the SPC700 to respond to it.
Somebody said that I kept the frames in VRAM in GH. I could've done that, because I only used 4 animation frames for enemies, but I used a dynamic animation engine instead because I anticipated eventually running out of VRAM.
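A rough sketch of that kind of dynamic slot allocator (the idea behind the cycle-counted slot-search loops earlier in the thread). The table layout and names here are my own illustration, not the actual demo code:

```python
# Hypothetical model of the VRAM slot search: each 32x32 slot holds four
# 16x16 sub-slots, tracked as a 4-bit occupancy mask ($0F = completely full).
FULL = 0x0F

def find_free_subslot(vram_slot_table):
    """Return (slot, subslot) of a free 16x16 sub-slot, or None if all full."""
    for slot, mask in enumerate(vram_slot_table):
        if mask == FULL:          # first loop: skip completely full 32x32 slots
            continue
        sub = 0
        while mask & 1:           # second loop: shift until a clear bit appears
            mask >>= 1
            sub += 1
        return slot, sub
    return None

table = [0x0F, 0x0F, 0x0B, 0x00]  # slot 2 = 0b1011: sub-slot 2 is free
print(find_free_subslot(table))    # → (2, 2)
```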
psycopathicteen wrote:
The slowdown in Bad Apple is mostly the 65816 waiting for the SPC700 to respond to it.
That can be worked around by polling the SPC700 in other parts of your 65816 code, such as during tile decoding.
Anyway, I'll proceed to lighten the mood by missing the point:
To do GH, you'd need either Sega CD or MSU1 to stream large background and guitar parts. The Nintendo DS version is a 1 Gbit cartridge, and all other versions come on optical disc.
To do TF3, you'd need not only 3D acceleration but also a time machine. Valve can't count to three, and last time I checked, it was making enough money selling hats in TF2 that it doesn't need to make a TF3.
It seems to me that using two HDMA channels, one indirect addressed to read from a buffer (or straight from ROM, if in repeat mode, though bank boundaries might be trouble) and one direct addressed to write the control byte, should save most of the time, as long as the SPC700 can keep up. That way you wouldn't need to repopulate a whole HDMA table every frame.
Did you uncover a reason not to use HDMA, or is it just not working yet?
Khaz wrote:
Then it sounds like you should have enough experience to know better than to try to pick this particular fight. The war between Genesis and SNES has been going on for decades now. Like any war that lasts decades, nobody has emerged a clear winner. There are no fanboys left - they have grown into fanmen.
The Sega Genesis was the first console I ever had. I loved it to death back then and I still do now. If you forced me to pick a side, I would be on YOUR SIDE. The reason I'm against you right now is because you're trying to make a ridiculous point. You're throwing around wild speculation that you believe GH can't be replicated, with nothing that specifically backs that point up.
It's not even about SNES versus Genesis; I owned both consoles back in the day, and people who know me can confirm I always claimed I played my SNES more than my MD, and that my all-time favorite game is Super Metroid. It's about technical misinformation; I think there is nothing worse than that... What annoys me here is that we can read that the 65816 is a beast and even surpasses the 68000 because it takes fewer cycles to execute comparable operations. I just want to explain why we should not compare by cycle, and why the 65816 is definitely a poor choice as a main CPU (for whatever system, actually, not only the SNES).
I know the SNES has other flaws, like the convoluted PPU with split OAM, the sprite size restrictions, and the memory arrangement, but all systems have their flaws (the Megadrive has only 4 palettes to play with, and the sound system has some nice holes too), and honestly I think they can be worked around, at least partially. But here, in the SNES, the CPU is definitely and *by far* the main issue of the whole system. Honestly, I tried to develop on the SNES, but the CPU is just so underpowered you can't properly use the offered graphics features. Accessing memory in banks of 32 KB / 64 KB on a 16-bit system (released in 1990) is just ridiculous and painful for developers; it is as if I were coding on an 8-bit system with boosted graphics and audio hardware: totally unbalanced and very unpleasant.
You can believe, if you want, that GH is possible on the SNES; that is your right, but the truth is that the CPU alone is a good reason not to see it happening. The guys from Treasure left Konami because they wanted more freedom in their development, but also because they felt limited by the SNES CPU. GH relies a lot on the power offered by the 68000, so it definitely would not work on the SNES because of its *slow CPU*, that's it...
If you are interested, here is an interview with the guys who actually worked on GH:
http://megadrive.me/2011/11/03/an-inter ... -treasure/
And a relevant part of it:
Quote:
Q: Konami is a big 3rd party for Nintendo, so why are you now making games for Sega?
A: I’ve always been fascinated with hardware. People are constantly comparing Mega Drive to SNES, saying that the SNES has more colors etc…
But the Mega Drive has a 68000 processor, which is very easy for programmers to work with. I was a programmer for years, making games for the SNES, and I can tell you, the hardware is a pain in the butt. If consumers look at a still shot, they may think the SNES is better, but actually, if you tried to put Gunstar Heroes onto the SNES there would be no way. See those bosses? On the SNES they would slow down, that movement requires sooo much computation. It could only be done on the Sega hardware.
...
as I said the hardware is very easy to work with. All things considered, the 68000 is a very good CPU allowing room for experimentation while the SNES hardware limits you to their design standards. Scaling and rotation can be implemented in the Sega software, forget it on the SNES.
Then feel free to ignore it... and continue to believe it can be done on the SNES.
Quote:
When I say to "back it up", I mean with some serious analysis that warrants consideration. Something like "to reproduce GH you'd need these tile sizes in this video mode, you'd need this much time for sprite routines and this much time to process the AI and this much time for the rest, and due to the much faster way the Genesis does ____ there is no possible way the SNES can do the same job." That would warrant a response. Saying "This specific sample of code is slower on SNES" is totally meaningless, and you being a programmer yourself should know that.
Any program is just about dealing with data: reading it, interpreting it, transforming it, modifying it...
So taking the performance of extremely basic operations like reading and copying data is already a good starting point to evaluate what you can achieve with a CPU. Of course that's not enough; I just say it's a good starting point.
If you want more advanced math to compare these CPUs, then we can go into it, but trust me, you won't like the result.
Stef wrote:
Uh, yes, I'm quite serious. DMA makes a huge difference to the speed of writing/copying blocks of data to either WRAM or VRAM. Since you were doing all your comparisons based on direct page instructions and not considering this advantage I don't think your assessment was fair.
Of course I do know what DMA is (and the MD also has DMA), but the point was to compare the CPUs (see my previous point).
Quote:
I DON'T.
I don't go around trying to say that Game X is too complicated for System Y because that would be a foolish thing to do - you can't prove it and you're just inviting people to prove you wrong. You talk exclusively about CPU power, which again you should know better than. The SNES is more than a 65816. The Genesis is more than a 68000.
OK, you don't; I try... and what do you expect? Do we need to entirely reverse engineer the GH game to see if you can port the engine 1 to 1? Do you at least know what the 68000 CPU can do? Again, we can go further into the calculations, but do we really need to??
psycopathicteen wrote:
The slowdown in Bad Apple is mostly the 65816 waiting for the SPC700 to respond to it.
Too bad to waste CPU time there, but will you try to get back to the HDMA method?
Reading the SPC700 documentation, it looks like you have a 4-byte port for communication between the SPC and the main CPU. When you feed the SPC from HDMA, I guess you do something like write 1 byte of compressed data to port 0 and write a specific value to port 1 to notify the SPC that data is ready; then the SPC can read it and acknowledge by writing 0 back to port 1? From this scheme you can probably build a BRR circular buffer inside the SPC RAM where the DSP will fetch its samples, but I guess synchronizing reads and writes is the whole issue (avoiding reading where writes are occurring).
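The read/write synchronization concern can be illustrated with a generic single-producer/single-consumer ring buffer. This is only the general idea in sketch form; the class, size, and names are made up, not SPC700 code:

```python
# Minimal single-producer/single-consumer circular buffer sketch.
# The producer (main CPU feeding sample data) must never overwrite
# bytes the consumer (DSP-side fetch) has not read yet.
class RingBuffer:
    def __init__(self, size=256):          # size is arbitrary here
        self.buf = bytearray(size)
        self.size = size
        self.write = 0                     # producer index
        self.read = 0                      # consumer index

    def free_space(self):
        # One byte is kept unused so write == read always means "empty".
        return (self.read - self.write - 1) % self.size

    def push(self, byte):
        if self.free_space() == 0:
            return False                   # full: producer must wait
        self.buf[self.write] = byte
        self.write = (self.write + 1) % self.size
        return True

    def pop(self):
        if self.read == self.write:
            return None                    # empty: consumer must wait
        byte = self.buf[self.read]
        self.read = (self.read + 1) % self.size
        return byte
```

Because each side only advances its own index, the producer and consumer never race on the same variable; the "don't read where writes are occurring" rule falls out of the full/empty checks.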
Quote:
Somebody said that I kept the frames in VRAM in GH. I could've done that, because I only used 4 animation frames for enemies, but I used a dynamic animation engine instead because I anticipated eventually running out of VRAM.
Does your dynamic engine re-allocate on each frame, or does it keep tiles in VRAM as long as it can?
Quote:
See those bosses? On the SNES they would slow down, that movement requires sooo much computation. It could only be done on the Sega hardware.
Hahaha, I've seen words like these about every console, from many developers; I call it ego stroking...
"We are so good that we can only prove it on ... (insert your favorite machine)"
Are we really going to start this again?
Stef wrote:
Quote:
Q: Konami is a big 3rd party for Nintendo, so why are you now making games for Sega?
A: I’ve always been fascinated with hardware. People are constantly comparing Mega Drive to SNES, saying that the SNES has more colors etc…
But the Mega Drive has a 68000 processor, which is very easy for programmers to work with. I was a programmer for years, making games for the SNES, and I can tell you, the hardware is a pain in the butt. If consumers look at a still shot, they may think the SNES is better, but actually, if you tried to put Gunstar Heroes onto the SNES there would be no way. See those bosses? On the SNES they would slow down, that movement requires sooo much computation. It could only be done on the Sega hardware.
...
as I said the hardware is very easy to work with. All things considered, the 68000 is a very good CPU allowing room for experimentation while the SNES hardware limits you to their design standards. Scaling and rotation can be implemented in the Sega software, forget it on the SNES.
I don't consider this proof of anything. It just means that this particular programmer, using the particular technique he used in one platform, probably wouldn't be able to replicate the same effects in some other platform with the same performance. It's arrogant to say that something is impossible simply because you haven't figured out a way to do it.
Fortunately, there's more than one way to implement the same idea, and if you design your algorithms around the limitations and strong points of each CPU/system, you're likely to succeed in implementing that idea. Unless we're talking about larger generational gaps, which is not the case of SNES vs. Genesis (although that didn't stop Chinese pirates from porting several 16-bit titles to the NES, with varying degrees of success).
Something that happens to me sometimes is that I spend a lot of time thinking of how to implement something non-trivial, and when I finally find the answer, it becomes my point of reference on how to perform that particular task, and I'll base all my performance estimations off of that. Then comes along someone else with a different point of view, and a new idea on how to do the same thing, and I realize that I didn't have the absolute answer after all. Sometimes it's not even someone else, I often think of alternative ways to implement something out of the blue, and the new way can be even twice as fast as the old solution.
In before someone calls me a SNES fanboy: Out of the 2 systems, the Genesis is my favorite. I grew up with it and I find the overall aesthetics of Genesis games more interesting. But I also like the SNES very much, and I don't consider either system obviously superior to the other.
This is not a question of who's better; Stef cannot accept that some impressive games exist for the SNES or PCE because both have an 8-bit CPU...
I saw some shmups on PCE/SNES that are at least as impressive as TF3/4 or GH in terms of CPU usage and sprites on screen.
TOUKO wrote:
both have an 8-bit CPU...
You mean 1, right?
TOUKO wrote:
I saw some shmups on PCe/snes that are at less, as impressive as TF3/4 or Gh in term of cpu usage and sprites on screen .
If I recall correctly, one of the TF games on the Genesis actually has some slowdown. (Not Gradius III level, but definitely not far enough away to brag.) Anyway, I said that Space Megaforce looked as good as Gunstar Heroes in that there are the same amount of sprites and stuff flying around (and other CPU-taxing stuff), but he complained that it didn't have enough animation and that the game seemed "stiff" because of it. They're spaceships; they don't have arms and limbs. All you need to do is create multiple frames of the ships at different angles, which the game does.
Also, TOUKO, I think you mean "at least", not "at less". It completely changes the meaning of the sentence.
A lot of big name companies used very traditional methods, and were afraid to deviate from the standard.
And programmers didn't necessarily have a lot of experience with the systems or CPUs they were using, which explains that kind of conservative programming.
Being a good programmer allows you to code for practically anything if you have proper documentation, but that doesn't mean you'll become a master at it overnight.
Are you referring to animation, psychopathicteen? Most of the "clever" ideas I thought of (like the sprite VRAM slot thing, even though you were the one who actually turned the idea into code; I'm going to do this, but I've been a bit busy) I assumed were standard. I looked at the DKC games in the bsnes debugger and assumed that all games were as solid from a technical standpoint as they were. Then I looked at other games and saw how they did stupid stuff, like storing the palettes for every enemy in the entire game in CGRAM even though there is only one type of enemy on screen at a time, among other things. I had thought DKC did the same thing I thought of for finding sprite VRAM, but it appears Rare's way was actually simpler.
New post just came in while I was typing:
tokumaru wrote:
Being a good programmer allows you to code for practically anything if you have proper documentation, but that doesn't mean you'll become a master at it overnight.
I've learned that with x86.
Espozo wrote:
If I recall correctly, one of the TF games on the Genesis actually has some slowdown.
I thought all of them did o.o (although I believe TF4 slows down right in the first level, around the area with the large snake-like enemy). Also, a reminder that they can't play PCM while anything else is playing either (even TF4, which is from 1992, by which point that was inexcusable).
tokumaru wrote:
And programmers didn't necessarily have a lot of experience with the systems or CPUs they were using, which explains that kind of conservative programming.
Deadlines usually play a much bigger role (forcing programmers to just stick to something quick that works rather than bothering to come up with good code).
Sik wrote:
I thought all of them did o.o
They probably do, I just never played them.
I was referring to game physics. Konami/Treasure relied heavily on 16.16 physics which is convenient for the 68000, but not for the 65816. On the 65816 it's better to do physics in a signed 8.8 format with the coordinates being in 16.8 format.
I'm not even going to lie: I don't have the slightest clue what "16.16" means. Is it just that the first 16 bits represent pixels and the second 16 are subpixels, so that when you make metasprites you only use the first 16 bits? I guess this would be convenient for the 68000, because it can do some 32-bit instructions. Really though, this doesn't seem hard to fix. Physics for y would only need to be 8.8, because 256 pixels is large enough to cover the height of the screen and hide 32-pixel-tall sprites.
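For reference, that guess is basically right: a 16.16 value keeps whole pixels in the high word and the subpixel fraction in the low word, so one 32-bit add updates both at once. A minimal C sketch of the idea (the names and example numbers here are mine, not from any game):

```c
#include <stdint.h>

/* 16.16 fixed point: the high 16 bits are whole pixels, the low 16
   bits are the subpixel fraction. All names here are illustrative. */
#define FIX16(n) ((int32_t)(n) << 16)   /* 1 pixel == 0x00010000 */

/* Advance a 16.16 position by a 16.16 speed for some frames; a single
   32-bit add per frame updates pixels and subpixels at once. */
int32_t advance(int32_t pos, int32_t speed, int frames)
{
    while (frames--)
        pos += speed;
    return pos;
}

/* Whole-pixel part, e.g. what would get written into a sprite table. */
int16_t pixels(int32_t pos)
{
    return (int16_t)(pos >> 16);
}
```

For example, pixels(advance(FIX16(100), FIX16(1) + FIX16(1) / 2, 4)) is 106: four frames at 1.5 px/frame starting from pixel 100.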
Espozo wrote:
Physics for y would only need to be 8.8, because 256 pixels is large enough to cover the height of the screen and hide 32-pixel-tall sprites.
Ideally, physics happens in level space, not screen space.
Yeah, level space would make 8.8 pointless.
What I do these days with the 68000 though is store coordinates as 16-bit integers and speed as 8.8 fixed point, and then make use of subpixel faking to work around that. Kind of lame, but saves memory, makes things simpler (by being able to ignore subpixels almost everywhere) and should be easily doable on either system really. Heck, on 3rd generation systems this should be easily feasible.
Amusingly, Miniplanets uses 8-bit coordinates (!!). This was to make looping maps easier; I just abuse overflow =P
Quote:
also TOUKO, I think you mean "at least", not at less. I completely changes the meaning of the sentence.
Oops, yes
Quote:
You mean 1, right?
The 65816 is an 8-bit data bus CPU; for me it can't be a true 16-bit, just like the 68k can't be a true 32-bit.
A game is not only about physics; physics is used in few games, and I don't think it was applied to all sprites.
You must count all the branches/tests, variable reads/writes, subroutine calls, interrupts, all this stuff which is far slower on the 68k.
For my PCE stuff, I never used any 32-bit operations; on a 2D system I think it's useless, or at least not really an advantage.
Sik wrote:
What I do these days with the 68000 though is store coordinates as 16-bit integers and speed as 8.8 fixed point, and then make use of subpixel faking to work around that.
That's a good idea! It really isn't any different from 16.16 fixed point, because subpixel coordinates don't really help collision detection and such, but they do help velocity. It actually seems more like an optimization than a compromise, and is obviously useful on the 68000 (as 32-bit instructions are supposed to be slower). Well, that's one area an SNES version of Gunstar Heroes could be faster in (because Treasure was supposedly using 16.16 fixed point, even though it was unnecessary).
Me wrote:
I completely changes the meaning of the sentence.
Dat grammar. I meant to say "it". (You're not the only one who has trouble writing, TOUKO...)
Sik wrote:
What I do these days with the 68000 though is store coordinates as 16-bit integers and speed as 8.8 fixed point
How do you actually add the 8.8 speed to the 16-bit integer coordinate efficiently on the 68k? You probably waste some cycles doing that, no? 16.16 is very practical: you use 32 bits only for the speed addition, and 16 bits when sending the position to VRAM or for collision calculations.
Well, how do you change the accumulator (or whatever it's called here) width on the 68000? On the 65816, it only takes 1 or 2 instructions, and I'm pretty sure those instructions take only 2 cycles, which is the smallest cycle count any instruction has. This wouldn't be the first time different approaches are better for different processors.
Espozo wrote:
Well, how do you change the accumulator (or whatever it's called here) width on the 68000?
I don't think you change this globally; instead, there are different opcodes for the different data sizes. Does that mean that it takes 0 cycles to change register widths?
Stef wrote:
How do you actually add the 8.8 speed to the 16-bit integer coordinate efficiently on the 68k? You probably waste some cycles doing that, no? 16.16 is very practical: you use 32 bits only for the speed addition, and 16 bits when sending the position to VRAM or for collision calculations.
There's a ton of spare time; I'd rather reduce memory usage instead (but then again there's also a lot of spare memory...). The other issue is one that I had a lot with Project MD (which used 16.16): suddenly you can't rely on pixel precision for a lot of things, and that can become an issue, although granted, Project MD made it even worse by accounting for NTSC and PAL speed differences.
As for adding efficiently, huh, first I calculate the subpixel offset for the frame (this is done once when the frame starts):
Code:
moveq #0, d6
move.w (Anim), d7
rept 8
add.b d7, d7
roxr.b #1, d6
endr
move.w d6, (Subpixel)
Then when adding the speed to the position it's just a matter of adding that offset, then bit shifting (d2 = 8.8 speed, d0 = 16-bit position):
Code:
add.w (Subpixel), d2
asr.w #8, d2
add.w d2, d0
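To convince yourself the reversed-counter trick really averages out, here is a rough C model of the same idea (editorial sketch, not Sik's actual code; the 8.8 speed value is just an example):

```c
#include <stdint.h>

/* Reverse the low 8 bits of the frame counter, mirroring the
   add.b/roxr.b pair in the 68000 code above. */
uint8_t bit_reverse8(uint8_t n)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (n & 1));
        n >>= 1;
    }
    return r;
}

/* Move a 16-bit integer position by an 8.8 speed for 'frames' frames,
   using the reversed counter as the per-frame subpixel offset. */
int16_t subpixel_move(int16_t pos, int16_t speed_8_8, int frames)
{
    for (int anim = 0; anim < frames; anim++) {
        int16_t s = (int16_t)(speed_8_8 + bit_reverse8((uint8_t)anim));
        pos += s >> 8;   /* whole pixels moved this frame */
    }
    return pos;
}
```

Over a full 256-frame cycle the reversed counter takes every value 0..255 exactly once, so at speed 0x0180 (1.5 px/frame) the position advances exactly 384 pixels: the fraction is never stored anywhere, yet nothing is lost on average.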
Espozo wrote:
Well, how do you change the accumulator (or whatever it's called here) width on the 68000? On the 65816, it only takes 1 or 2 instructions, and I'm pretty sure those instructions take only 2 cycles, which is the smallest cycle count any instruction has. This wouldn't be the first time different approaches are better for different processors.
The size is part of the instruction, no flags controlling it at all. (Heck, I think the 65816 is the only CPU I know of that does the flags thing; neither the 68k nor the Z80 nor x86 does it.)
Sik wrote:
There's a ton of spare time; I'd rather reduce memory usage instead (but then again there's also a lot of spare memory...). The other issue is one that I had a lot with Project MD (which used 16.16): suddenly you can't rely on pixel precision for a lot of things, and that can become an issue, although granted, Project MD made it even worse by accounting for NTSC and PAL speed differences.
Hehe, I more often think there is plenty of spare memory (64 KB is a lot, except for very specific cases) and so tend to prefer optimizing for speed (reasonably, I mean, nothing insane).
Quote:
As for adding efficiently, huh, first I calculate the subpixel offset for the frame (this is done once when the frame starts):
Code:
moveq #0, d6
move.w (Anim), d7
rept 8
add.b d7, d7
roxr.b #1, d6
endr
move.w d6, (Subpixel)
What does (Anim) contain at first? And about this:
Code:
rept 8
add.b d7, d7
roxr.b #1, d6
endr
It looks like you're trying to reverse the bits from (Anim) and store the result into (Subpixel), right?
Quote:
Then when adding the speed to the position it's just a matter of adding that offset, then bit shifting (d2 = 8.8 speed, d0 = 16-bit position):
Code:
add.w (Subpixel), d2
asr.w #8, d2
add.w d2, d0
I have to admit I don't really get the point of doing that O_o. The asr.w #8 is very taxing... it seems overcomplicated to me, but I am probably missing something. It's OK for a system where you don't have 32-bit operations, but here on the 68k you just need to do:
Code:
add.l d1, d2
where d1 is the 16.16 speed and d2 the 16.16 coordinate (which we can easily extend to 32.16 for world positions with an extra addx). I guess you use the offset information for some other calculations, but shouldn't the offset be different for each object?
Sik wrote:
The size is part of the instruction, no flags controlling it at all. (Heck, I think the 65816 is the only CPU I know of that does the flags thing; neither the 68k nor the Z80 nor x86 does it.)
It's another reason why I dislike this CPU: the designers hardly tried to preserve 6502 compatibility but at the cost of some awful choices (like this insane register size change instruction).
x86 has the weird real mode also... I remember you had to prefix an instruction with 0x66 to make it 32-bit ("mov word" then becomes "mov long"), and in flat mode it was the contrary (i.e., the 0x66 prefix gave you the 16-bit instruction instead of the default 32-bit one). But that is definitely different from the 65816.
Stef wrote:
Hehe, I more often think there is plenty of spare memory (64 KB is a lot, except for very specific cases) and so tend to prefer optimizing for speed (reasonably, I mean, nothing insane).
You also program in C and we all know that GCC for 68000 is... kind of crap. Especially with optimizations disabled (and I see that people don't want to use anything above -O1 out of fear of breaking things, even though if you break any hardware access with optimizations enabled that means your code is probably wrong and you probably misused volatile).
Eh, each to their own =P I just have too much to spare on both ends of the spectrum. Lately I'm literally using RAM as a giant buffer to decompress data into.
Stef wrote:
What does (Anim) contain at first?
Oh, just a generic counter that increments every frame. I normally use it to do the timing of the animations of most things (hence "anim"), so I don't have to waste time giving everything its own counter (it also helps keep everything synchronized).
Anyway, just a value that increments every frame.
Stef wrote:
It looks like you're trying to reverse the bits from (Anim) and store the result into (Subpixel), right?
Yep. Only the bottom byte though (since that's the size of the fractional part in speed values).
Stef wrote:
I have to admit I don't really get the point of doing that O_o. The asr.w #8 is very taxing... it seems overcomplicated to me, but I am probably missing something. It's OK for a system where you don't have 32-bit operations, but here on the 68k you just need to do:
Code:
add.l d1, d2
where d1 is the 16.16 speed and d2 the 16.16 coordinate (which we can easily extend to 32.16 for world positions with an extra addx). I guess you use the offset information for some other calculations, but shouldn't the offset be different for each object?
Then everything needs to account for the possibility of subpixels; that's the issue. Remember this is only done when applying speeds; every other operation just treats positions as 16-bit integers.
Sure, you could probably just take the upper word when doing comparisons (Sonic 3D does this for some situations), but if anything needs to modify the coordinate, you better account for subpixels. And if you do NTSC/PAL speed compensation, you run the risk that something moving at 1px per frame in NTSC may outright skip the position you want in PAL (since it'll move over 1px in some frames). Basically you start going into the same kind of issues you'd get from floating point (progressive loss of precision aside).
Also that bit shift is 22 cycles =| It's not uncommon for memory operations to take longer. This is not a place where you'd do speedcoding either (where every single cycle counts), so programmer sanity is more important here.
(EDIT: oh, also before I forget, the offset is added to the fractional part of the speed, so the final value is actually different for every speed)
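The pixel-skipping issue Sik describes is easy to reproduce. An editorial C sketch (the speed values are examples, not from any real game; the PAL figure is just 60/50 = 1.2 px/frame, truncated to 16.16):

```c
#include <stdint.h>

/* Does a 16.16 position, starting at 0 and advancing by 'speed' per
   frame, ever land on the whole pixel 'target' within 'frames'? */
int visits_pixel(int32_t speed, int16_t target, int frames)
{
    int32_t pos = 0;
    while (frames--) {
        pos += speed;
        if ((int16_t)(pos >> 16) == target)
            return 1;
    }
    return 0;
}
```

At the plain NTSC speed of 1 << 16 (1.0 px/frame) every pixel gets visited, but at the PAL-compensated (6 << 16) / 5 the first ten frames land on pixels 1, 2, 3, 4, 5, 7, 8, 9, 10, 11 and never on pixel 6, exactly the skip described above.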
Stef wrote:
x86 has the weird real mode also... I remember you had to prefix an instruction by 0x33 to set it in 32 bits ("mov word" becomes then "mov long") and in flat mode it was the contrary (i.e, the 0x33 prefix allowed 16 bits instruction instead of default 32 bits). But that is definitely different from the 65816.
Eh, given they only have 8 bits for the instruction, using prefixes to modify them makes sense (Z80 does it as well, and I have no idea why 65816 didn't take that route too).
Also x86 needs prefixes for 64-bit operations, except those prefixes have enough room to hold four flags (making them more useful as they account for all possible variants; the more common 32-bit-only operations use 1 byte while the less common operations use two). There's also the fact that the massive pipelining in those processors means the effect of prefixes on speed is unpredictable (it's not uncommon for prefixes to make things go faster - see RET vs REP RET, which are identical except for how long they take to execute).
Sik wrote:
I have no idea why 65816 didn't take that route too
Probably because the bus is 8-bit, so adding information to opcodes would either restrict the instruction set or slow down execution.
Sik wrote:
Stef wrote:
Hehe, I more often think there is plenty of spare memory (64 KB is a lot, except for very specific cases) and so tend to prefer optimizing for speed (reasonably, I mean, nothing insane).
You also program in C and we all know that GCC for 68000 is... kind of crap. Especially with optimizations disabled (and I see that people don't want to use anything above -O1 out of fear of breaking things, even though if you break any hardware access with optimizations enabled that means your code is probably wrong and you probably misused volatile).
I haven't had trouble using -O2 and -O3, as well as std=c99, which all seem to be uncommon choices for 68000 as a C target. That said, I'm not forgetting volatile, so that's a big part of it.
93143 wrote:
Probably because the bus is 8-bit, so adding information to opcodes would either restrict the instruction set or slow down execution.
That didn't stop the Z80 from doing it.
A prefix is pretty much a single-byte opcode that affects how the following opcode works. If I recall correctly there was a lot of space left in the instruction set (certainly enough that they could afford opcodes for setting the new flags), so it's not like they couldn't do it.
I'm going to guess that they did it this way because that's how decimal mode works as well (as far as I know). Saying that many opcodes would be affected is another possibility, but alternating between 8-bit and 16-bit happens a lot.
mikejmoffitt wrote:
I haven't had trouble using -O2 and -O3, as well as std=c99, which all seem to be uncommon choices for 68000 as a C target. That said, I'm not forgetting volatile, so that's a big part of it.
Just because you add volatile doesn't mean it's working as intended =P (you could put it in the wrong place in a pointer declaration, for example, making the variable holding the pointer volatile rather than the data it points to)
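The placement trap in question, sketched in generic C (editorial example; 0xC00004 is used here only as a sample MMIO address, the Mega Drive VDP control port):

```c
#include <stdint.h>

/* Read the qualifier right-to-left.

   "pointer to volatile uint16_t": every *vdp_ctrl access really
   touches the hardware register; this is what you want for MMIO. */
volatile uint16_t *vdp_ctrl = (volatile uint16_t *)0xC00004;

/* "volatile pointer to uint16_t": only loads/stores of the pointer
   variable itself are volatile; accesses through *bad_ptr may still
   be optimized away or reordered at -O2/-O3. */
uint16_t *volatile bad_ptr;
```

With the second form, GCC is entitled to cache or eliminate the dereferences, which is exactly the "optimizations broke my hardware access" symptom being discussed.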
Stef wrote:
Hehe, I more often think there is plenty of spare memory (64 KB is a lot, except for very specific cases) and so tend to prefer optimizing for speed (reasonably, I mean, nothing insane).
That sounds pretty much like what you have to do on the SNES, except a bit more extreme in this case. If you have the option to use 5x the amount of RAM to make code 2x as fast, that's usually the route you'll take. I'm not sure why they decided to give the SNES 128KB of main RAM rather than more VRAM, considering I couldn't even imagine using more than half that amount of RAM. (That's definitely not the case for VRAM, but to really be useful, they'd at least need to double the number of tiles available to sprites.)
Stef wrote:
It's another reason why i dislike this CPU, the designers hardly tried to preserve 6502 compatibility but at the cost of some awfuls choices (as this insane register size change instruction).
Are you meaning to say that they did try to preserve 6502 compatibility and, because of this, ended up making some awful choices? I don't think there would be a cost to not preserving 6502 compatibility. Anyway, having separate instructions for 8-bit and 16-bit math would probably be better, but I don't mind having to change the accumulator size. I originally thought all processors worked this way, but if you see the GBA asm thread, you'll see I thought other processors were almost exactly like the 65816. Now that I've been looking at some other processors, it appears the 65816 is actually the odd man out.
I guess it just comes down to what you're used to.
Sik wrote:
93143 wrote:
Probably because the bus is 8-bit, so adding information to opcodes would either restrict the instruction set or slow down execution.
That didn't stop the Z80 from doing it.
That may have been a contributing factor to the Z80 being slow as tar. Isn't that an extra bus cycle for every single instruction that needs the prefix?
Quote:
If I recall correctly there was a lot of space left in the instruction set (certainly enough that they could afford opcodes for setting the new flags), so it's not like they couldn't do it.
There is exactly one unused opcode in the 65816 instruction set (wdm), and only three new ones that can affect the data width bits (sep, rep, and xce). Could this be done with only one or two prefix codes?
93143 wrote:
That may have been a contributing factor to the Z80 being slow as tar. Isn't that an extra bus cycle for every single instruction that needs the prefix?
Most instructions don't use prefixes though (especially not the most common ones).
93143 wrote:
There is exactly one unused opcode in the 65816 instruction set (wdm), and only three new ones that can affect the data width bits (sep, rep, and xce). Could this be done with only one or two prefix codes?
Was talking about the 6502 instruction set actually (which is the one 65816 was based on).
Sik wrote:
You also program in C and we all know that GCC for 68000 is... kind of crap. Especially with optimizations disabled (and I see that people don't want to use anything above -O1 out of fear of breaking things, even though if you break any hardware access with optimizations enabled that means your code is probably wrong and you probably misused volatile).
I do mix C and assembly. C is definitely easier to maintain and read; I wanted SGDK to be usable by a lot of people, and having a 100% assembly development library would have closed doors to many potential users who only know C. At least SGDK offers both possibilities... if you prefer to use assembly only, then you're free to do it (you only need to respect the C calling convention). About the GCC compiler, to be honest older GCC versions are not that bad; the 2.x versions were almost good, but they really lack features. In SGDK I'm using GCC 3.4.6, and code generation for the 68k is not really good, but not as awful as it can be with the later 4.x versions. It's true that I'm using -O1 (and some other specific optimization switches) for the "release" profile, but not because -O2 or -O3 break something (as long as your code is correct, the optimization level should not break anything); rather, they do not bring any significant speed boost and they usually make the generated code a lot more convoluted. I tested almost all the optimization switches to find the set producing the fastest code, but that line is commented out in my makefile; again, I prefer to stay with -O1, as the generated code is already pretty good and less convoluted than with higher optimization levels.
Quote:
Eh, each to their own =P I just have too much to spare on both ends of the spectrum. Lately I'm literally using RAM as a giant buffer to decompress data into.
Yeah, I can understand that; the map is indeed an element which can quickly fill the RAM... for instance, I calculated that a Sonic level map does not even fit in RAM, so they have to unpack it in real time. Tile data is another matter, but as you are limited by VRAM bandwidth you don't need to allocate that much for it.
Quote:
Then everything needs to account for the possibility of subpixels; that's the issue. Remember this is only done when applying speeds; every other operation just treats positions as 16-bit integers.
Sure, you could probably just take the upper word when doing comparisons (Sonic 3D does this for some situations), but if anything needs to modify the coordinate, you better account for subpixels. And if you do NTSC/PAL speed compensation, you run the risk that something moving at 1px per frame in NTSC may outright skip the position you want in PAL (since it'll move over 1px in some frames). Basically you start going into the same kind of issues you'd get from floating point (progressive loss of precision aside).
OK, I got the point. Still, honestly, for me it looks simpler to just treat every coordinate/position as 16.16 when it comes to modifying them, and then use the 16-bit integer part only when possible (collision or other stuff). I think it really depends on what you want to do in your game.
For a shmup where you need to deal with many objects (bullets and enemies), the 16.16 coordinate system is really handy and allows you to obtain very fast calculations.
Quote:
Also that bit shift is 22 cycles =| It's not uncommon for memory operations to take longer. This is not a place where you'd do speedcoding either (where every single cycle counts), so programmer sanity is more important here.
(EDIT: oh, also before I forget, the offset is added to the fractional part of the speed, so the final value is actually different for every speed)
Oh, OK, but then we're definitely talking about the quiet case where you really don't need the fastest code.
I understand the method: you add the offset to the speed, so depending on the fractional part of the speed, it will bump the integer part (of the speed) every X frames. Then you only add the integer part of the speed to the integer position, and that's it... now that I'm thinking about it, I remember a discussion about this. And actually, if you really need speed, you can make your method even faster by using a 16.16 speed (d2) and a 16.16 offset (d1):
Code:
moveq #0, d1
move.w (Subpixel), d1
then for each object:
Code:
add.l d1,d2
swap d2
add.w d2,d0
Quote:
Eh, given they only have 8-bits for the instruction, using prefixes to modify them makes sense (Z80 does it as well, and I have no idea why 65816 didn't take that route too).
Also x86 needs prefixes for 64-bit operations, except those prefixes have enough room to hold four flags (making them more useful as they account for all possible variants, the more common 32-bit-only operations use 1 byte while the less common operations use two). There's also the fact that the massive pipelining in those processors means the effect of prefixes on speed is unpredictable (it's not uncommon for prefixes to make things go faster - see RET vs REP RET, which are identical except for how long they take to execute).
Yeah, of course it made sense; they preserved backward compatibility by having two work modes, real (compatibility) and flat (pure 32-bit), and it worked quite well. You were even able to access the new 32-bit instructions from real mode (and vice versa) by using a previously unused opcode as a prefix.
Espozo wrote:
That sounds pretty much like what you have to do on the SNES, except a bit more extreme in this case. If you have the option to use 5x the amount of RAM to make code 2x as fast, that's usually the route you'll take. I'm not sure why they decided to give the SNES 128KB of main RAM rather than more VRAM, considering I couldn't even imagine using more than half that amount of RAM. (That's definitely not the case for VRAM, but to really be useful, they'd at least need to double the number of tiles available to sprites.)
Maybe the marketing department thought it was a good idea to at least have more than the Sega Mega Drive competitor here. Another reason could be the SimCity game: they initially intended to do it on the NES, but then they needed to add a large amount of memory on the cart to store the map information. As the Super Nintendo was almost ready, they realized it would be a better match for their new system, and so Nintendo probably wanted to keep an important amount of WRAM in the new system just to handle that type of game.
Quote:
Are you meaning to say that they did try to preserve 6502 compatibility and, because of this, ended up making some awful choices? I don't think there would be a cost to not preserving 6502 compatibility. Anyway, having separate instructions for 8-bit and 16-bit math would probably be better, but I don't mind having to change the accumulator size. I originally thought all processors worked this way, but if you see the GBA asm thread, you'll see I thought other processors were almost exactly like the 65816. Now that I've been looking at some other processors, it appears the 65816 is actually the odd man out.
I guess it just comes down to what you're used to.
I think they could have just made two modes, 8-bit (6502 compatible) and 16-bit, and that's it. The problem is that the 65816 is not a real 16-bit CPU based on the 6502; it's more a 6502 with 16-bit enhancements: extended A, X and Y registers, extended memory addressing, a new 16-bit ALU... but all the memory accesses are still done as on the 6502, over an 8-bit data bus. I guess they chose that solution to limit cost, but also so they could reuse all the logic for the instruction pipeline and decoding; a real 16-bit data bus would have required many changes there. Still, because of that, this CPU is slow when working in 16-bit mode, so you try to stay in 8-bit mode (and program as on an 8-bit CPU) to maximize performance. And yeah, the 65816 is definitely not a very conventional CPU in the way you have to program it.
I just found out something I didn't know before. Bio Metal is a SlowROM game.
Stef wrote:
this CPU is almost time slower in 16 bits mode
Eh? I'm not quite sure what you mean here.
It's somewhat slower in 16-bit mode. Basically an extra bus cycle to handle the extra byte. The opcodes and addresses are the same length and thus take the same amount of time. It's true that staying in 8-bit mode helps if you have 8-bit jobs to do, because it is faster as well as more convenient. But if you have 16-bit jobs to do it's far faster to do them in 16-bit mode.
Stef wrote:
Another reason could be the SimCity game: they initially intended to do it on the NES, but then they needed to add a large amount of memory on the cart to store the map information. As the Super Nintendo was almost ready, they realized it would be a better match for their new system, and so Nintendo probably wanted to keep an important amount of WRAM in the new system just to handle that type of game.
Except that games like SimCity need to preserve the game state between play sessions, something the internal WRAM simply can't do, so they'd need this memory present in the cart anyway.
This is random, but does anyone know why Sim City takes what feels like an eternity to show you the pictures of the different maps?
Quote:
I just found out something I didn't know before. Bio Metal is a SlowROM game.
I know that Contra III in Japan (Contra Spirits, I think?) was a SlowROM game, but it was FastROM in the US. Did Contra Spirits have slowdown issues?
93143 wrote:
Stef wrote:
this CPU is almost time slower in 16 bits mode
Eh? I'm not quite sure what you mean here.
Me neither :p I just wanted to say that it generally runs slower in 16-bit mode than in 8-bit mode.
Quote:
It's somewhat slower in 16-bit mode. Basically an extra bus cycle to handle the extra byte. The opcodes and addresses are the same length and thus take the same amount of time. It's true that staying in 8-bit mode helps if you have 8-bit jobs to do, because it is faster as well as more convenient. But if you have 16-bit jobs to do it's far faster to do them in 16-bit mode.
Yeah, but switching between different register widths is definitely not practical; I guess that in some critical situations you would even stay in 8-bit mode, as the overhead of the SEP/REP instructions can hide the benefit of using 16-bit registers.
tokumaru wrote:
Except that games like SimCity need to preserve the game state between play sessions, something the internal WRAM simply can't do, so they'd need this memory present in the cart anyway.
This is different; you are speaking about backup RAM, which is just for saving your progress.
When you save/restore your progress you can compress the data a lot to make your map fit in a much smaller space than in "active game" conditions (SimCity's backup RAM is 32 KB, which allows saving up to 3 or 4 cities, if I remember correctly). In game, you need to store the state of the whole map in WRAM uncompressed, and in the case of SimCity you can have a lot of information to store in a single cell of the map (maps are 120x100 cells), so it quickly consumes a lot of WRAM. I would not be surprised if SimCity uses more than 64 KB of WRAM to store the map state.
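A quick back-of-the-envelope check of that estimate; the bytes-per-cell figures here are editorial guesses, not SimCity's actual layout:

```c
#include <stdint.h>

/* Uncompressed in-game map cost, per the 120x100 figure above. */
enum { MAP_W = 120, MAP_H = 100 };   /* 12,000 cells */

uint32_t map_bytes(uint32_t bytes_per_cell)
{
    return (uint32_t)(MAP_W * MAP_H) * bytes_per_cell;
}
```

12,000 cells means roughly 12 KB per byte of per-cell state: at 4 bytes/cell that's about 47 KB, and by 6 bytes/cell you already exceed 64 KB, so the estimate is plausible.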
Espozo wrote:
This is random, but does anyone know why Sim City takes what feels like an eternity to show you the pictures of the different maps?
Because it uses an algorithm to generate the map. This algorithm uses a seed value, so it will always generate the same map for a given seed. You can see it as a sort of *very heavy compression* to store 1000 maps in a few KB (the generation code) =)
Of course you can't really store a specific map with this method; it just gives you 1000 random maps, but always the same ones.
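A toy version of the idea, with illustrative LCG constants (glibc-style, not SimCity's actual generator); the point is only that a given seed always regenerates the identical map:

```c
#include <stdint.h>

/* Seed-based map generation in miniature: the "compressed" form of a
   map is just its seed, since the PRNG stream is fully deterministic. */
enum { TILES = 64 };   /* an 8x8 toy map */

void generate_map(uint32_t seed, char tiles[TILES])
{
    uint32_t state = seed;
    for (int i = 0; i < TILES; i++) {
        state = state * 1103515245u + 12345u;       /* step the PRNG */
        tiles[i] = ((state >> 16) & 1) ? '~' : '#'; /* water or land */
    }
}
```

Generating map #42 twice yields byte-identical terrain, so 1000 maps cost nothing beyond the generation code and the seed numbers.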
Does it actually generate the map while it shows the picture? Would it have been too much memory to store the tilemaps for the pictures in the cartridge?
You know, I wonder if anyone has even looked through all 1,000 maps...
Yeah, it internally generates the map and then shows the picture of it. Storing images for the 1000 maps would have cost a lot of ROM; even if they used 256 bytes to store a single picture (which is optimistic even with good compression), it would still require 256KB just to store the images for every map. A real waste.
Espozo wrote:
I don't think there would be a cost of not trying to preserve 6502 compatibility.
The 65816 CPU was also used in the Apple IIGS and in Commodore 64 upgrade boards. The CPU in a IIGS needed to run software designed for the original Apple II, II Plus, and IIe (6502) and for the IIc and enhanced IIe (65C02).
tokumaru wrote:
Except that games like SimCity need to preserve the game state between play sessions, something the internal WRAM simply can't do, so they'd need this memory present in the cart anyway.
Then it would need two big RAMs, one battery-backed. It would probably have been the only MMC5 game actually using 64K RAM.
Sik wrote:
mikejmoffitt wrote:
I haven't had trouble using -O2 and -O3, as well as std=c99, which all seem to be uncommon choices for 68000 as a C target. That said, I'm not forgetting volatile, so that's a big part of it.
Just because you add volatile doesn't mean it's working as intended =P (you could put it in the wrong place in a pointer declaration, for example, making the variable holding the pointer volatile rather than the data it points to)
This just sounds like programmer problems, not problems with GCC at all.
Stef wrote:
93143 wrote:
Stef wrote:
this CPU is almost time slower in 16 bits mode
Eh? I'm not quite sure what you mean here.
Me neither :p I just wanted to say that it generally runs slower in 16-bit mode than in 8-bit mode.
Quote:
It's somewhat slower in 16-bit mode. Basically an extra bus cycle to handle the extra byte. The opcodes and addresses are the same length and thus take the same amount of time. It's true that staying in 8-bit mode helps if you have 8-bit jobs to do, because it is faster as well as more convenient. But if you have 16-bit jobs to do it's far faster to do them in 16-bit mode.
Yeah, but switching between different register widths is definitely not practical; I guess that in some critical situations you would even stay in 8-bit mode, as the overhead of the SEP/REP instructions can hide the benefit of using 16-bit registers.
I've always done the complete opposite. I use 16-bit mode most of the time, but switch to 8-bit whenever convenient.
Stef wrote:
Yeah, I can understand that; the map can be an element which quickly fills the RAM... for instance I calculated that a Sonic level map does not even fit in RAM, so they have to unpack it in real time. Tile data is another matter, but as you are limited by VRAM bandwidth you don't need to allocate that much for it.
A 256×128 chunk map where each entry is 1 byte would be 32KB, and maps in Sonic games aren't
that large... in fact I think only Sonic 2 has the maps compressed (but that game compresses just about everything that isn't the player sprites). If you use 16×16 chunks then it piles up quickly though, yeah (which also makes me wonder how most NES games handled that, since it seems most games used that size? could be wrong though).
Stef wrote:
For a shmup, where you need to deal with many objects (bullets and enemies), the 16.16 coordinate system is really handy and allows very fast calculations.
In a shmup you rarely need subpixel precision though =P
mikejmoffitt wrote:
This just sounds like programmer problems, not problems with GCC at all.
It is, but because it's easy to get wrong instead what happens is that optimizations are disabled as a quick workaround. Yuck.
Quote:
I always done the complete opposite. I use 16-bit mode most of the time, but switch to 8-bit whenever convenient.
Oh, glad to hear that actually
That does mean that even if 16-bit mode is generally slower, it's still worth using in almost all cases (for convenience as well). Reading some people here, you could think that for them "8 bits is enough" :p
Sik wrote:
A 256×128 chunk map where each entry is 1 byte would be 32KB, and maps in Sonic games aren't that large...
Well, they are not huge but still:
Sonic 1 Green Hill Zone 2: 8448x1536 (1056x192 tilemap)
http://info.sonicretro.org/File:Ghz2.PNG
Sonic 2 Aquatic Ruin Zone 2: 11904x2048 (1488x256 tilemap)
http://www.sonicgalaxy.net/img/maps/gen ... /arz-2.png
If you take the unpacked raw map data (the one you send to VRAM), we obtain maps of about 400 KB for Sonic 1 and about 750 KB for Sonic 2. For that type of game you have to heavily compress the maps; you just can't store them in ROM in basic tilemap format.
Even 8-bit games had to compress them. Sonic 1 on Master System can have maps of about 4096x1024 (512x128 tilemap), which is still about 128 KB of unpacked raw map data. I looked at the compression for the latter (it can be found
here), it actually uses 4x4 tile blocks to encode the map information; each of these 4x4 blocks is referenced by a single byte, so you already reduce the map data size by a factor of 16*2. Then these bytes are RLE packed per level.
I believe the 16-bit version does similar stuff.
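For what it's worth, the sizes quoted in this post check out; here is a quick sketch (assuming the usual 2-byte tilemap entry, as a raw VDP name-table word):

```python
# Sanity check of the raw tilemap sizes quoted above
# (assumption: 2 bytes per tilemap entry, the usual VDP name-table word).

def raw_tilemap_kb(width_tiles, height_tiles, entry_bytes=2):
    """Size of an uncompressed tilemap, in whole kilobytes."""
    return width_tiles * height_tiles * entry_bytes // 1024

print(raw_tilemap_kb(1056, 192))   # Sonic 1 GHZ2 -> 396 KB ("about 400 KB")
print(raw_tilemap_kb(1488, 256))   # Sonic 2 ARZ2 -> 744 KB ("about 750 KB")
print(raw_tilemap_kb(512, 128))    # SMS Sonic 1  -> 128 KB

# The Master System scheme: one byte references a 4x4 block of tiles,
# i.e. 16 entries x 2 bytes = 32 raw bytes per byte, the "16*2" factor.
print(4 * 4 * 2)                   # 32
```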
Quote:
In fact I think only Sonic 2 has the maps compressed (but that game compresses just about everything that isn't the player sprites). If you use 16×16 chunks then it piles up quickly though, yeah (which also makes me wonder how did most NES games handle that, since it seems most games used that size? could be wrong though).
Believe me, a lot more games compress their map data... even using 16x16 chunks is not enough to keep the ROM size low.
The way the NES organizes the name/attribute tables lowers memory consumption, but take the example of Mario 3. A map can be 4096x256; with 16x16 tiles we still require 256*16 = 4 KB of name table + the attribute table... not that much, but then you have to consider the huge number of levels (more than 50). Even at 5 KB per level for the map, we quickly reach 250 or 300 KB of data on the original 384 KB ROM. So definitely, even in some NES games maps had to be compressed.
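The arithmetic in this estimate can be spelled out; a rough sketch (one byte per 16x16 metatile, attribute data ignored, as in the post):

```python
# Rough re-run of the SMB3 estimate above: a 4096x256-pixel map stored as
# 16x16-pixel metatiles, one byte per metatile (attribute data ignored).

map_w_px, map_h_px, metatile = 4096, 256, 16
metatiles = (map_w_px // metatile) * (map_h_px // metatile)
print(metatiles)                # 4096 metatiles = 256 * 16
print(metatiles // 1024)        # 4 KB per level before attributes
print(50 * 5)                   # 50 levels x ~5 KB each = ~250 KB of a 384 KB ROM
```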
Stef wrote:
In a shump you rarely need subpixel precision though =P
I don't really agree with that; you almost always need subpixel accuracy, at least for speed (and so for coordinates).
I believe some SNES games (thinking about Super Aleste or Rendering Ranger) abuse integer calculations in some parts (for faster computations), but then the game becomes insanely fast (player bullet movement or plain enemy movement in Super Aleste) and feels really "artificial". I really dislike those games for that reason; almost everything in the game feels so "rigid" (I don't have the right word to express my feeling here).
Quote:
Reading some people here we could think that for them "8 bits is enough" :p
Wahou Stef, did you eat some clowns this morning??...
I don't see any PCE games that slow down more than 68k ones, even though the PCE doesn't have any Z80 for audio.
16-bit is very useful if the instructions are as fast as or only a bit slower than the 8-bit ones, and this is not the case for the 68k: its 8-bit instructions are a little useless, and no optimisations are available for 8-bit operations.
And I confirm, 8-bit variables are the most used in 2D games, followed by 16-bit.
Why use 16-bit variables if 8 bits are enough??
You think 16-bit all the time because with the 68k you have no reason to use 8 bits.
Quote:
I believe some SNES games (thinking about Super Aleste or Rendering Ranger) abuse of integer calculations in some part (for faster computations) but then the game becomes insanely fast (player bullets movement or plain enemies movement for Super Aleste) and feels really "artificial".
You're probably the only man on earth to have that kind of worry when he's playing a game.
A good thing in the MD is its SAT organized as a linked list; when you need some heavy Z ordering it's very useful and you don't waste CPU cycles rearranging the SAT.
I do have to agree that Rendering Ranger feels rigid, but not because of the coordinates of objects. To me it's mostly because of the animation during the run and gun parts and how everything feels like a picture being suspended by strings. This is fine for the ship sections, but not on land, where there could certainly be more movement. I know the main creator of the game (I don't remember his name right now, but he created Turrican) originally wanted the game to just be a horizontal shooter, but the company he was under said it wouldn't sell and wanted him to make a run and gun, so he did both. This could be why the space shooter sections feel better to me, and I enjoy them more. I know you, Stef, complained about the animation in Rendering Ranger R2 and Space Megaforce, but I feel that this only applies to the run and gun sections, so Space Megaforce is excluded. It's really just a matter of opinion though, like all of this.
I honestly don't feel the difference from the coordinates though, even though I have played both Gunstar Heroes and Rendering Ranger.
TOUKO wrote:
Why using 16 bits variables if 8 bits are enough ??
I think it's like, if 16-bit variables are more convenient, you use them. You aren't going to use two 8-bit loads and stores when you could use one 16-bit one.
Quote:
I do have to agree that Rendering Ranger feels rigid, but not because of the coordinates of objects.
I don't consider RR² a good game in general, but technically it is really impressive, with very good use of the SNES hardware effects.
Quote:
I think it's like if 16 bit variables are more convenient, you use them.
Of course, but not always.
I don't know how SNES coders did their games, but in my case 8-bit is more used.
Quote:
You aren't going to use 2 8bit load and stores when you could use a 16bit one.
In that case you're right.
A quick example: you are using 32x32-pixel sprites, and the sprite pattern address is a 16-bit address.
So if you want to increment it for sprite animation, you must do a $100 increment (which is 256 bytes).
You can easily do an inc pattern_high (8-bit) on a 65xx, where you must do a 16-bit operation on the 68k.
Who's faster??
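The byte arithmetic behind this trick can be sketched quickly (sizes and the $100 step are as stated in the post above; the variable names here are made up):

```python
# Sketch of the trick above: when the pattern address advances in $100
# (256-byte) steps per animation frame, a 65xx can do a single 8-bit 'inc'
# of the high address byte instead of a full 16-bit add.

pattern_lo, pattern_hi = 0x00, 0x40    # pattern address $4000

pattern_hi = (pattern_hi + 1) & 0xFF   # inc pattern_high (one 8-bit operation)
addr = (pattern_hi << 8) | pattern_lo
print(hex(addr))                       # 0x4100: the address advanced by $100
```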
Stef wrote:
Sonic 1 Green Hill Zone 2: 8448x1536 (1056x192 tilemap)
http://info.sonicretro.org/File:Ghz2.PNG
Sonic 2 Aquatic Ruin Zone 2: 11904x2048 (1488x256 tilemap)
http://www.sonicgalaxy.net/img/maps/gen ... /arz-2.png
Dude, nobody in their right mind would store level data directly as 8×8 tiles, you'd use larger chunks instead (;゚ω゚)
For the sake of example: the engine in Pulseman and Magical Taruuruto-kun uses 32×32 chunks. Consider that all of them are more likely to fit in just 8-bit, and you end up using only 3.125% the space in the tilemap. When I gave the 256×128 example, that was with this kind of chunk sizes in mind (256×128 would expand to 8192×4096 pixels using that chunk size - 512×64 would be a more likely size for the same amount of memory, and that's 16384×2048 pixels)
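The 3.125% figure follows directly from the chunk geometry; a small check (assuming 8x8-pixel tiles at 2 bytes per tilemap entry and 32x32-pixel chunks referenced by one byte each, as in the post):

```python
# The 3.125% figure above, spelled out (assumptions: 8x8-pixel tiles at
# 2 bytes per tilemap entry, 32x32-pixel chunks referenced by 1 byte each).

tile_px, chunk_px = 8, 32
tiles_per_chunk = (chunk_px // tile_px) ** 2      # 16 tiles per chunk
raw_bytes_per_chunk = tiles_per_chunk * 2         # 32 bytes of raw tilemap
print(100 / raw_bytes_per_chunk)                  # 3.125 (% of the raw size)

# And the coverage of a 256x128 chunk map with 32x32-pixel chunks:
print(256 * chunk_px, 128 * chunk_px)             # 8192 4096 pixels
```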
Sik wrote:
Stef wrote:
Sonic 1 Green Hill Zone 2: 8448x1536 (1056x192 tilemap)
http://info.sonicretro.org/File:Ghz2.PNG
Sonic 2 Aquatic Ruin Zone 2: 11904x2048 (1488x256 tilemap)
http://www.sonicgalaxy.net/img/maps/gen ... /arz-2.png
Dude, nobody in their right mind would store level data directly as 8×8 tiles, you'd use larger chunks instead (;゚ω゚)
For the sake of example: the engine in Pulseman and Magical Taruuruto-kun uses 32×32 chunks. Consider that all of them are more likely to fit in just 8-bit, and you end up using only 3.125% the space in the tilemap. When I gave the 256×128 example, that was with this kind of chunk sizes in mind (256×128 would expand to 8192×4096 pixels using that chunk size - 512×64 would be a more likely size for the same amount of memory, and that's 16384×2048 pixels)
I think 32x32 is not very convenient for all games (you lose small details here and there), but OK, I didn't know you were calculating with 32x32 blocks and not tiles.
OK, so yeah, with 32x32 chunks the Sonic 2 maps fit in main RAM, but it still eats a nice amount of it =)
Anyway, even the 8-bit version of Sonic 1 compresses its maps, and I am pretty sure many platform games do as well. That makes me think that I really need to implement better ways of handling maps in SGDK :p
Quote:
I do have to agree that Rendering Ranger feels rigid, but not because of the coordinates of objects. It's mostly to me because of the animation during the run and gun parts and how everything feels like it's a picture being suspended by strings. This is fine for the ship sections, but not on land where there could certainly be more movement. I know the main creator of the creator of the game (I don't remember his name right now, but he created Turrican) originally wanted the game to just be a horizontal shooter, but that the company he was under said it wouldn't sell and wanted him to make a run and gun, so he did both. This could be why the space shooter sections feel better to me, and I enjoy them more. I know you, Stef, complained about the animation in Rendering Ranger R2 and Space Megaforce, but I only feel that this applies to the run and gun sections, so Space Megaforce is excluded. It's really just a matter of opinion though, like all of this.
I honestly don't feel the difference from the coordinates though, even though I have played both Gunstar Heroes and Rendering Ranger.
For me there is really something off with both games. Super Aleste is very rigid and synthetic in its physics and (too fast) movements; I really have a bad experience while playing it... RR² is more about the (poor) animation generally speaking, the way ship bullets are handled, almost everything seems lifeless in this game... I prefer some other SNES shmups like R-Type III, Phalanx, Parodius, Twin Bee... even if they sometimes suffer from slowdowns, I think they are really far more interesting to play.
Quote:
A good thing in Md is his SAT organised like a chained list, when you need some heavy Z ordering it's very usefull and you don't waste CPU cycles to arrange SAT .
There are many good things in the MD; too bad we can't say the same about the SNES...
OK, I am trolling, I admit the HDMA is a really neat feature
The irony is that in practice nobody really took advantage of the linked list (although to be fair I never considered the Z ordering case; I'd just sort sprites before putting them on the table - may be faster too, since you sort by object and not by sprite). I think the original purpose of the linked list was to make it easier to do stuff like sprite flickering (Rainbow Islands actually does it!); I know that on the Master System they achieved it by reversing the sprite table every frame, and a linked list would make such a thing easier. The problem is that they did it on a console where there's enough room to make that not really needed anyway...
At least it has the upside that there isn't any need to fill the entire table (as long as there's one entry you're OK). I think the SNES requires you to explicitly set the sprite slots that go unused? (and pretty much every other system, I think the only other one I know the specs of that doesn't do this is Super A'can which lets you define directly how many sprites to show - I don't recall how Neo Geo handled it, though)
Stef wrote:
I prefer some other SNES shmups like R-Type III
Woot woot! Now we're talking! R-Type III is pretty good about not slowing down though. The only times I recall slowdown were whenever the stage is rotating on the first
marathon level (I assume it's some sort of fancy collision detection?) and on advanced mode in a couple of places on levels 5 and 6 (collision detection with enemies and bullets). You know, sorry for me being a bit of a jerk earlier... I'm a bit dramatic.
Stef wrote:
I admit the HDMA is a really neat feature
You mean other CPUs don't have this? How do they do raster effects then? Doesn't the Genesis actually have a horizontal scrolling table for BGs?
Sik wrote:
The irony is that in practice nobody really took advantage of the linked list
It's so unused, that I don't even know what it is.
Espozo wrote:
Stef wrote:
I admit the HDMA is a really neat feature
You mean other CPUs don't have this? How do they do raster effects then?
One way is cycle counting from some known raster point, seen in many NES games. This can be 100% CPU (sprite 0 sync on discrete mappers and MMC1) or assisted by a programmable interval timer on the cartridge (on VRC series or FME-7). Another way is hardware-assisted scanline counting (Atari 2600; NES MMC3; Genesis hblank interrupts).
Quote:
Doesn't the Genesis actually have a horizontal scrolling table for BGs?
I believe so. Any vertical scrolling (hills, etc.) has to be done by reading the scanline counter and writing appropriately.
Quote:
At least it has the upside that there isn't any need to fill the entire table (as long as there's one entry you're OK). I think the SNES requires you to explicitly set the sprite slots that go unused? (and pretty much every other system, I think the only other one I know the specs of that doesn't do this is Super A'can which lets you define directly how many sprites to show - I don't recall how Neo Geo handled it, though)
I did figure out that you can clear out unused sprites with dma, instead of having the CPU manually do it.
I think the MD's chained SAT is a good advantage for beat 'em ups, as Z ordering (or Y ordering in fact) is the most used.
Quote:
You mean other CPUs don't have this? How do they do raster effects than? Doesn't the Genesis actually have a horizontal scrolling table for BGs?
HDMA is not a CPU feature; you can do practically the same with HSYNC interrupts, it just uses more CPU but is very easy to do.
The MD's Hscroll table is just for horizontal scrolling; HDMA and HSYNC interrupts are more flexible, you can do more.
Quote:
Stef wrote:
I prefer some other SNES shmups like R-Type III
Woot woot! Now we're talking! R-Type III is pretty good about not slowing down though. The only times I recall slowdown were whenever the stage is rotating on the first
marathon level (I assume it's some sort of fancy collision detection?) and on advanced mode in a couple of places on levels 5 and 6 (collision detection with enemies and bullets). You know, sorry for me being a bit of a jerk earlier... I'm a bit dramatic.
Indeed, R-Type III does not have much slowdown; I mainly remember the slowdown during the rotation in the first stage (the large laser made it even worse), but it was not that annoying... No worries about the past, I can be a bit aggressive sometimes as well. We are all passionate :p
Quote:
Stef wrote:
I admit the HDMA is a really neat feature
You mean other CPUs don't have this? How do they do raster effects then? Doesn't the Genesis actually have a horizontal scrolling table for BGs?
As Touko pointed out, HDMA is definitely not part of the CPU; it's really a specific feature of the SNES allowing any raster effect without CPU intervention. Normally you have to use the horizontal interrupt for that, and you lose some CPU cycles just to send a few bytes of data to some PPU/VDP registers / color RAM or whatever... With HDMA it's "almost" free (some cycles lost in bus mastering) and thanks to it, you can even stream samples to the SPC (definitely not easy though) without eating the main CPU
Quote:
It's so unused, that I don't even know what it is.
It's just the way the Sega Genesis VDP parses the sprite list: each sprite owns the index of the next sprite to parse (0 ends parsing).
Here is one of my quotes (from Sega-16) which gives some info about it:
Quote:
About the Contra Hard Corps case, actually your sprite table can be up to 127 entries, so that is definitely not surprising to see.
The Sega Megadrive VDP sprite limits in H40 are the following (considered per scanline):
- 80 sprite entries read.
- 20 visible sprites.
- 320 pixels of visible sprites.
The first limitation met interrupts the sprite parsing process.
But now you have to remember the Sega Genesis uses a linked sprite list, which is very convenient for handling sprite priority but also for hiding some sprites.
For instance you can set up the list so the VDP will parse them like this:
SPR0 ------ SPR50-SPR51-SPR52------SPR17-SPR18-SPR19-------SPR110---SPR77...
You can even create a loop in the list; the VDP stops fetching when the first limit is met (20 displayed, 320 pixels or 80 sprites read).
But the important thing is that, as you can see, the link information allows up to 127 sprites to be accessed (0 is a special value to end the list), so you often use a 127-entry SAT. You just know you will almost never display all sprites at the same time, because some will be off screen and so you can bypass them in the list.
I guess in Contra the developers just did not bother to actually end the sprite list after 80 read sprites (by putting a 0 link)... the system will meet its limits before it actually displays every sprite, but that is not a big deal...
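The parsing rules described above can be illustrated with a toy model (a simplification: real hardware masks partially, but the link-following and the three per-scanline limits are the point; the data and function names are made up):

```python
# Toy model of the Mega Drive VDP sprite-list walk described above (H40
# limits per scanline: 80 entries read, 20 sprites shown, 320 sprite pixels).
# Each entry: (width_px, visible_on_line, link_to_next). Link 0 ends the list.

def walk_sat(sat, start=0, max_read=80, max_shown=20, max_pixels=320):
    shown, pixels, read = [], 0, 0
    i = start
    while read < max_read:
        width, visible, link = sat[i]
        read += 1
        if visible and len(shown) < max_shown and pixels + width <= max_pixels:
            shown.append(i)
            pixels += width
        if link == 0:
            break                # a 0 link terminates parsing early
        i = link
    return shown

# A tiny list where entry 0 links to 2, 2 links to 1, and 1 ends the list:
sat = {0: (32, True, 2), 1: (16, True, 0), 2: (32, False, 1)}
print(walk_sat(sat))  # [0, 1] -- entry 2 is parsed but hidden
```

Note how the traversal order (0, 2, 1) is set entirely by the link fields, which is what makes priority reordering and hiding sprites cheap.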
Stef wrote:
HDMA is definitely not a part of the CPU, it's really a specific feature of the SNES allowing any raster effects without CPU intervention.
It depends on how you define "CPU". The DMA controller is on the same S-CPU die as the 65816.
Indeed, if you refer to the CPU as the 5A22... for me the 5A22 is a lot more than a CPU (the 65816 is the CPU) and was specifically made for the SNES. It even includes the A and B bus logic (and so the DMA controller)... really a matter of point of view
Espozo wrote:
You mean other CPUs don't have this? How do they do raster effects then? Doesn't the Genesis actually have a horizontal scrolling table for BGs?
Yeah, you have a table specifying every line (other settings are for the entire screen and for every tile). But the rest of the raster effects (unique vertical scroll per line, palette changes, etc.) have to be done by using the hblank interrupt.
TOUKO wrote:
HDMA is not a CPU feature; you can do practically the same with HSYNC interrupts, it just uses more CPU but is very easy to do.
The MD's Hscroll table is just for horizontal scrolling; HDMA and HSYNC interrupts are more flexible, you can do more.
The problem with using an interrupt is that CPUs are kind of slow to react. HDMA's primary advantage is being able to move in data without having to wait for the CPU to acknowledge the interrupt (which involves not just acknowledging the signal but also setting up the stack and jumping to a routine, as well as then having to undo the stack when returning from that routine later), it just starts directly.
Maybe the main issue with HDMA is that its configurations can be somewhat limited. It's way more than enough for nearly all the effects you may want to try, though, so that's a non-issue in practice.
Quote:
The problem with using an interrupt is that CPUs are kind of slow to react. HDMA's primary advantage is being able to move in data without having to wait for the CPU to acknowledge the interrupt (which involves not just acknowledging the signal but also setting up the stack and jumping to a routine, as well as then having to undo the stack when returning from that routine later), it just starts directly.
You're right, but interrupts have low latency on the 65xx: 14 cycles are needed to take and return from an interrupt.
That's of course far from the 64 cycles needed for the 68k to do the same, but we also must count the interrupt routine itself.
I'm not saying that interrupts are better than HDMA, of course they are not, but they're a good alternative IMO, at least on the Hu6280.
Quote:
It's way more than enough for nearly all the effects you may want to try, though, so that's a non-issue in practice.
I think so too; HDMA was put in the SNES to avoid spending CPU cycles on hsync effects, like the MD's hscroll table.
Interrupts are okay on 65x (the 65816 manual makes a big deal of how it was designed to be super fast at interrupt servicing), but it isn't just the overhead of the interrupt. HDMA is, as the name suggests, a DMA operation, which means it writes much faster than the CPU itself could. So if you're making very heavy use of raster effects, you might not be able to get the job done before HBlank ends unless you also spend the cycles to set up a bunch of DMA transfers. Either way, the IRQ method eats a lot more CPU than HDMA would have.
...dude. I just realized I forgot to account for the mid-screen OBSEL change when I planned my HDMA, and I'm already using all 8 channels in the general case. Fortunately adding a quick dec linecount; beq midframe to the mid-scanline H-IRQ routine doesn't murder the remaining CPU time... If I can manage to avoid using Y during the upper half of the playfield I can use dey; beq midframe instead, but it remains to be seen whether the tiny increase in free CPU time is worth that restriction...
Quote:
Either way, the IRQ method eats a lot more CPU than HDMA would have.
Of course, if you want to use the CPU for transferring many bytes like HDMA does, it's not useful and too slow.
But for changing some registers like H/V scroll or 2-3 background colors, you can easily do it with interrupts without much overhead.
Most PCE games did that:
https://www.youtube.com/watch?v=OHyvX5dzwEI
https://www.youtube.com/watch?v=mlSFGs2TlSo
Axelay demo for SGX:
https://youtu.be/44P56o8qhGE?t=10m5s
But HDMA is still a better and more efficient method for doing that.
The PCE was also the most forgiving of the three platforms when it came to video hardware accesses during active scan, while on the SNES doing anything outside of hblank can be a very bad idea, so the timing matters (as for the MD, well, you can write at any time but you risk quickly running out of bandwidth, which will slow things down).
HDMA also has the advantage that you can queue multiple writes in a row (avoiding any gaps between them), and you can also mix multiple raster effects easily (with an interrupt you would need to check for all the effects you want in every line, which can result in a lot of wasted CPU time - that makes more than one effect kind of problematic).
For the record, setting up DMAs in interrupts is not an issue: you set it up before the interrupt, so all the interrupt has to do is to fire it. That means the DMA will happen right when the interrupt routine starts.
Quote:
PCE was also the most forgiving of the three platforms when it came to video hardware accesses during active scan,
Yes, but unfortunately the CPU is not able to use all the available bandwidth...
However, you can use the fast DMA for VRAM-to-VRAM transfers; you can fake some parallax or do background animations.
You can do a DMA list driven by interrupts, and set the VDC to H64 (12.6 KB / vblank).
Sik wrote:
For the record, setting up DMAs in interrupts is not an issue: you set it up before the interrupt, so all the interrupt has to do is to fire it. That means the DMA will happen right when the interrupt routine starts.
But setting up the DMA is not free, so you have extra work to do regardless, and it can start to eat into total CPU time when you have to do it ~200 times per frame.
Maybe my perspective is a bit skewed; the game I'm attempting to port wasn't designed with the SNES in mind, and my notional gameplay screen design is
very heavy on the raster effects...
Well I was talking more about the timing. Prepare the DMA beforehand where it isn't timing critical yet, and fire it at the right moment.
I think we're on the same page. I've considered problems that needed such an approach before (some of my VBlank scenarios are pretty marginal, so I plan to have $43xx ready to go before turning the screen off). I was just pointing out that the secondary issue is still there - if I'm understanding correctly, resetting a DMA channel on the Super NES cannot possibly take less than 7.5 dots even if all you're doing is refreshing the bottom half of the byte counter, and it adds up fast if you have a lot to do; this is on top of the IRQ overhead.
...I suppose you could get it down to 4.5 dots if you could avoid needing both of the index registers in your main code, but that's a significant restriction and could well waste more time than you save...
Hey Stef, what basic hardware feature was the CPU too slow to use, and what did you do that required you to change banks?
Just the sprites... Putting in 128 hardware sprites when the CPU can hardly handle more than 40 of them is a bit ridiculous.
Also, giving that many graphic modes is a real waste; some of them are not usable because of both the lack of VRAM and the lack of bandwidth.
About the banking issue, it's just that you have to take care about GFX data storage. Just for the Bad Apple demo, I guess you have to avoid having a frame's data span different 32/64 KB banks, or else you have to take care of that in your code and it becomes inefficient. When you are used to 68000 programming you really don't accept dealing with that sort of problem anymore
Stef wrote:
Just the sprites... Put 128 hardware sprites when the CPU can hardly handle more than 40 of them is a bit ridiculous.
Well, metasprites... The DKC games frequently push past 96 sprites, but because of the infuriating setup where there are only 2 sprite sizes at a time, it doesn't seem like it, because the game uses 8x8 and 16x16 sprites instead of 16x16 and 32x32 ones. Platforming games generally don't need that many sprites anyway.
Stef wrote:
Also giving that much graphic modes is a real waste, some of them are not usable because both of the lack of VRAM and bandwidth.
I actually kind of have to agree with you on this one.
I don't think anyone would miss mode 0, or any of the hires modes. I wouldn't mind the hires modes for some things, but the fact that sprites are still at 256x224 resolution is extremely ugly. I would have actually preferred it if the overdraw was halved and the sprites were running at the same resolution as the BGs. I really don't get the horizontal column scrolling in mode 2, as it doesn't seem very useful at all. It would have been better if they had ditched it and devoted the extra 2bpp to something else, like what the mode with the 8bpp and the 2bpp layer did. I think it would have been cool to see 6bpp graphics.
Really though, with the background layers I would only ever really use mode 1, mode 2, mode 3, and mode 7. The rest are pretty much useless.
You know, this would eat VRAM like a monster, but one 8bpp layer at 512x448 would be really cool for title screens.
Do you still have the code for whatever you did with 40 sprites?
Banks $40-$7d and $c0-$ff are all ROM, and contrary to common belief, the 65816 can actually do both linear and segmented memory addressing.
You know psychopathicteen, this is random, but did you ever use a separate routine for metasprites that only use one sprite? This could be useful for the 32x32 explosions in your game.
Espozo wrote:
You know psychopathicteen, this is random, but did you ever use a separate routine for metasprites that only use one sprite? This could be useful for the 32x32 explosions in your game.
Not in a long time. I used to hard code every sprite size into my game, but I stopped doing it, because it just ended up being a sloppy mess.
Well, I mean in the metasprite routine: it checks to see if there is only one sprite, and if so it jumps to the single-sprite routine. When you are programming the objects, you don't need to hardcode anything, because when you first arrive at the metasprite routine it does a simple cmp, beq. This saved a substantial amount of CPU time for me, and I didn't need to adjust anything outside the metasprite code. I guess you're now adding a cmp and a beq, but unless every single object consists of more than one sprite, it's worth putting the cmp and beq in there.
Well, I actually added a load, but it still takes less time, because doing this:
Code:
lda a:MetaspriteCount
cmp #$0001
bne horizontal_flip_check
Is faster than doing this:
Code:
horizontal_flip_check:
stz a:HFlipMask
lda Attributes
bit #$4000
beq vertical_flip_check
lda #$FFFF
sta a:HFlipMask
vertical_flip_check:
stz a:VFlipMask
lda Attributes
bit #$8000
beq metasprite_loop
lda #$FFFF
sta a:VFlipMask
plus, you don't need to jump back into the loop to check whether you've created all the sprites, because you know there's only going to be one. You also don't need to move the sprites around and calculate the center (I actually haven't implemented finding the center yet, but it should be easy.)
Shouldn't the little tidbit I wrote only take 9 cycles? I don't feel like finding how many cycles the other code would take.
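The structure being discussed (one entry point, an early-out for single-sprite objects, flip masks only on the multi-sprite path) can be sketched in pseudocode form; everything here is an illustrative mock-up, not anyone's actual routine:

```python
# Sketch of the single-sprite early-out discussed above: check the sprite
# count once at the routine's entry and skip the flip-mask setup and the
# per-sprite loop when a metasprite is just one hardware sprite.
# All names here are made up for illustration.

def draw_metasprite(meta, oam):
    if len(meta["sprites"]) == 1:          # cmp #$0001 / bne -> single path
        s = meta["sprites"][0]
        oam.append((meta["x"] + s["dx"], meta["y"] + s["dy"], s["tile"]))
        return
    hflip = -1 if meta["attr"] & 0x4000 else 0   # flip masks are only
    vflip = -1 if meta["attr"] & 0x8000 else 0   # needed on the multi path
    for s in meta["sprites"]:
        dx = (s["dx"] ^ hflip) - hflip           # XOR-and-subtract negates
        dy = (s["dy"] ^ vflip) - vflip           # the offset when flipped
        oam.append((meta["x"] + dx, meta["y"] + dy, s["tile"]))

oam = []
draw_metasprite({"x": 100, "y": 80, "attr": 0,
                 "sprites": [{"dx": 0, "dy": 0, "tile": 5}]}, oam)
print(oam)  # [(100, 80, 5)] -- single path, no flip-mask work done
```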
I know what you mean. I was just talking about the last time I used a separate routine for single sprites, back when I was disorganized about sprite routines. It's a good thing that both of your routines have the same entry point, unlike my old primitive routines.
Quote:
When you are used to 68000 programming you really don't accept anymore to deal with that sort of problem
You can say that for any CPU; when you're used to the 65xx you cannot accept that a 16-bit CPU cannot do 8-bit operations faster than 16-bit ones, nor the slow IRQs, slow tests+branching, slow subroutine calls, and so on...
It's not as if the 68k were a perfect CPU...
I'm still waiting for more information about Stef's sprite code. It probably has the same typical mistakes that Konami and Capcom always made:
-Using 8-bit where 16-bit is needed
-Redundant loads and stores
-Overuse of subroutines
-pushing and pulling everything inside the subroutine
hum, a single JSR/RTS takes at least 32 cycles on the 68k
A JSR-RTS pair on a 65816 takes 12 cycles. Per my "gencycles" metric, 32 cycles on a 68000 is like 4 fast and 8 slow cycles on a 65816.
On the Hu6280, JSR/RTS takes 16 cycles, but at 7.16 MHz.
Quote:
32 cycles on a 68000 is like 4 fast and 8 slow cycles on a 65816.
How did you get these values??
With the clock speed difference??
So this is yet another situation where a 68000 with 2x the clock speed of a 65816 is the same speed overall?
Edit: By the way, according to this:
http://wiki.superfamicom.org/snes/show/65816+Reference
a JSR/RTS can take either 12 or 14 cycles, depending on whether the JSR is "absolute long" (is that 24-bit?) or not.
TOUKO wrote:
Quote:
32 cycles on a 68000 is like 4 fast and 8 slow cycles on a 65816.
how did you get these values ??
With the clock speed difference ??
So to be able to compare work per clock on an even playing field, some time ago I invented a unit of time called gencycles. One gencycle is a fraction of a 65816 cycle intended to roughly approximate the period of one 68000 cycle. Each slow access (WRAM, slow ROM, and each byte of DMA) takes 3 gencycles, and each fast access (most I/O ports, fast ROM, and "internal operation") takes 2 gencycles. So you can cycle count a 68000 subroutine, see how many cycles it used, cycle count a 65816 subroutine that does the same thing, count gencycles, and you should get something fairly close to the machines' relative speeds.
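A minimal sketch of that arithmetic in Python (my own illustration; "gencycles" is the unit tepples defines above, and the function name is invented):

```python
# tepples' "gencycle" unit: a slow 65816 access (WRAM, slow ROM, each
# DMA byte) counts as 3 gencycles; a fast access (fast ROM, most I/O
# ports, internal operation) counts as 2.
SLOW_GENCYCLES = 3
FAST_GENCYCLES = 2

def gencycles(fast_accesses, slow_accesses):
    """Total gencycles for a 65816 sequence, to compare against the
    68000 cycle count of an equivalent routine."""
    return FAST_GENCYCLES * fast_accesses + SLOW_GENCYCLES * slow_accesses

# The JSR/RTS example from this thread: 4 fast + 8 slow 65816 cycles
# comes out even with the 68000's ~32 cycles for JSR+RTS.
print(gencycles(4, 8))  # 32
```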
So from what I've seen so far, Espozo is right: the increased CPU frequency and increased cycle counts balance out.
And yes, JSL and RTL have a one-cycle penalty for having to push and pull K (program segment).
Espozo wrote:
I don't think anyone would miss mode 0
Yoshi's Island uses it a lot (you lose on color count but you gain on parallax count, which is nice).
Espozo wrote:
I really don't get the horizontal column scrolling on mode 2, as it doesn't seem very useful at all.
Maybe they were thinking about split screen or parallax, but yeah not sure... Then again I also question the inclusion of the mosaic effect, since it seems its only intended purpose is to do some fancy fading stuff and nothing else.
TOUKO wrote:
you can say that for any CPU, when you're used to 65xx you cannot accept that a 16 bit CPU cannot do 8 bit operations faster than 16
Dunno, I would not mind that: it means the ALU is fast for 16-bit as well. I would mind if 8-bit were slower, however.
tepples wrote:
So from what I've seen so far, Espozo is right: the increased CPU frequency and increased cycle counts balance out.
I just feel like the people at Nintendo aren't idiots. They may have made some questionable design choices with the SNES, but they weren't going to put in a CPU that's only half as fast as the one in a console that came out two years earlier. The thing is that 7.6 looks way better than 3.58 (or 2.68; I really don't get that, to be honest). Something like color depth can be easily measured by just comparing two numbers, but CPU speed is a lot more complex than that, even though it gets just as much attention on spec sheets.
Sik wrote:
Espozo wrote:
I don't think anyone would miss mode 0
Yoshi's Island uses it a lot (you lose on color count but you gain on parallax count, which is nice).
You lose so much to gain so little that it really isn't worth it 99% of the time. Yoshi's Island mostly uses mode 1 anyway, except it does use mode 0 for the opening video thing, and I think it also does in a select number of rooms, like the one with Kamek and the pillars leading to Baby Bowser. One thing that mode 0 could be useful for is something like the "indie" look, where you have roughly NES color capabilities, but with a lot of parallax scrolling.
Sik wrote:
Maybe they were thinking about split screen or parallax, but yeah not sure... Then again I also question the inclusion of the mosaic effect, since it seems its only intended purpose is to do some fancy fading stuff and nothing else.
At least the mosaic effect looks nice for transitions. If the mosaic effect is taxing on the hardware or something, you could definitely get rid of it, though. About the column scrolling, up and down is useful, but there's a weird mode where, when the screen scrolls the width of a tile horizontally, the tilemap appears to shift over a tile horizontally as well. I don't get it at all.
Quote:
So from what I've seen so far, Espozo is right: the increased CPU frequency and increased cycle counts balance out.
Ok, but I take the HuC6280 into account too; for me the 68K's JSR/RTS is just too slow.
Quote:
I would mind if 8-bit was slower, however.
It would have been a shame, really, but the 65816 is close to being as fast in 16-bit as in 8-bit mode.
I just wanted to demonstrate that each CPU has gaps, and it depends on where they are used.
TOUKO wrote:
Ok, but I take the HuC6280 into account too; for me the 68K's JSR/RTS is just too slow.
May want to look at
this. I could have optimized out the jump, but the gain was nearly minimal in exchange for a huge size trade-off.
Maybe it's irrelevant since it's JMP instead of JSR, but still. 68000's biggest weakness is any memory access, which I compare to a modern CPU getting a cache miss on every single access (to give you an idea of how bad it is and why you should keep everything in registers wherever possible).
Of course, apply some common sense: aside from some speedcode, you'll have a lot of spare CPU time anyway unless you're overdoing it, so don't waste too much time optimizing as long as your code is decent and simple enough. And for raster effects, well, remember that most of the time it's just modifying a single value every line (or so), and you can easily optimize the IRQ handler down to just a write if you take into account the autoincrement register =P (this means no other video accesses during active scan, but why would you want to do those anyway?)
Quote:
Maybe it's irrelevant since it's JMP instead of JSR, but still.
No, but it's pretty shabby to be forced to use a JMP rather than the common JSR on a "much more evolved" CPU than the 65xx (as Stef put it); I think that's not really serious
and you can do the same with the HuC6280, where JMP is 4 cycles.
Quote:
Of course, apply some common sense: aside from some speedcode, you'll have a lot of spare CPU time anyway unless you're overdoing it, so don't waste too much time optimizing as long as your code is decent and simple enough. And for raster effects, well, remember that most of the time it's just modifying a single value every line (or so), and you can easily optimize the IRQ handler down to just a write if you take into account the autoincrement register =P
For raster effects you can't assume you only change a single value; modifying scroll X/Y or both, for example (like Axelay did), needs more than one value.
Of course, on each CPU you can work around the weaknesses.
Quote:
(this means no other video accesses during active scan but why would you want to do those anyway?)
Maybe because the HuC6280 has full VRAM access during active display, for example !!
Slow interrupts are why Sega did the HSCROLL table, I think.
If a subroutine ends with JSR followed by RTS, it's common to replace them with a JMP. It's called "tail call optimization", and the Scheme language is entirely designed around it. Then if you enter a subroutine only from one place, it's also common to move the subroutine's body there. This is called "inlining". It's not "cheap" as much as something you don't think about much on a CPU designed for high-level languages because your compiler does it for you.
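The two rewrites described here have the same shape in any language; a hypothetical Python sketch (function names invented for illustration):

```python
def scale(x):
    return x * 2

# Tail call: the "JSR scale / RTS" at the end of the routine can become
# a plain "JMP scale", because scale's own RTS returns to our caller.
def update_tail_call(x):
    x = x + 1
    return scale(x)   # the last action is the call itself

# Inlining: if this is the only caller of scale, the body of scale can
# be moved into the caller, removing the call entirely.
def update_inlined(x):
    x = x + 1
    return x * 2

print(update_tail_call(3), update_inlined(3))  # 8 8
```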
TOUKO wrote:
For raster effects you can't assume you only change a single value; modifying scroll X/Y or both, for example (like Axelay did), needs more than one value.
Horizontal scrolling is done by the VDP itself so no raster effects here. Vertical scrolling is a single 16-bit write for a single plane. If you need to do two planes, it'd be two 32-bit writes (of which one is always the same value), which yeah is slower but still not as bad as it sounds.
There's the issue of where you keep the pointer to the values; you can usually reserve a register for it. I actually use USP for this purpose, amusingly (not as fast, since you need to get the value out of there first, but still better than being locked out of a normal register in the main code).
TOUKO wrote:
Maybe because the HuC6280 has full VRAM access during active display, for example !!
At least the Mega Drive has some room for access in active display (and if your write is not larger than 64-bit, it'll be entirely cushioned by the FIFO). The SNES doesn't even have that (I imagine the only reason why HDMA exists is to allow timing-perfect access to the few values that can be modified during hblank).
TOUKO wrote:
Slow interrupts are why Sega did the HSCROLL table, I think.
That, but also because the effect was becoming extremely common at the time (nearly all parallaxes were starting to rely on it to fake even more planes). Worth noting that there's also a vertical scroll table that allows multiple vertical values on the same line, but the video memory speed meant its granularity was limited (still useful for some vertical parallaxes, though, especially in games like vertical shmups).
Quote:
If you need to do two planes, it'd be two 32-bit writes (of which one is always the same value), which yeah is slower but still not as bad as it sounds.
Of course this is not as dramatic as it seems, but it's not as fast as on the 65xx, and you must dedicate 2 registers just for that effect ..
Quote:
I actually use USP for this purpose, amusingly (not as fast since you need to get the value out of there first but still better than being locked out of a normal register in the main code).
Yeah, if you don't want to reserve 1 or 2 registers, you must load each value from RAM first, which is slower.
Quote:
At least the Mega Drive has some room for access in active display (and if your write is not larger than 64-bit, it'll be entirely cushioned by the FIFO). The SNES doesn't even have that (I imagine the only reason why HDMA exists is to allow timing-perfect access to the few values that can be modified during hblank).
The MD is a really well-designed machine; it's a shame it came with only 4 sub-palettes and "poor" sample capabilities, because it lacks the YM2612 timer interrupt line connected to the Z80/68000.
Quote:
If a subroutine ends with JSR followed by RTS, it's common to replace them with a JMP. It's called "tail call optimization", and the Scheme language is entirely designed around it. Then if you enter a subroutine only from one place, it's also common to move the subroutine's body there. This is called "inlining". It's not "cheap" as much as something you don't think about much on a CPU designed for high-level languages because your compiler does it for you.
Yes it's common when you have
sub0:
.
.
.
jsr sub1
.
.
sub1:
.
.
.
.
jmp sub2
sub2:
.
.
.
.
rts (return directly to sub0)
In a game, when you code on the 65xx you don't replace a single JSR with a JMP just because JSR is too slow.
It may be common on the 68k or Z80, but not on the 65xx, unless you're doing a demo where every cycle counts.
On the other hand, if you need self-modifying code, that's also much faster on the 65xx than on the 68k.
Espozo wrote:
Well, metasprites... The DKC games frequently push past 96 sprites, but because of the infuriating setup to where there are only 2 sprite sizes at a time, it doesn't seem like it, because the game uses 8x8 and 16x16 sized sprites instead of 16x16 and 32x32 sized sprites. Platforming games generally don't need that many sprites anyway.
Yeah, but in this case you only display maybe 10 real sprites, and then you deal with metasprites as we can on the Megadrive. But on the SNES the sprite size restriction requires you to use more sprites per metasprite... Still, I am surprised about the choice in DKC; this game rarely displays small sprites, so it would have made more sense to use the 16x16 + 32x32 sizes. But I guess I don't remember well, and using 8x8 + 16x16 lets you optimize the sprite bandwidth; I think that if I had to code a sprite engine for the SNES I would use this configuration as well... then it's all a matter of optimization. And generally speaking, I prefer to have many small sprites rather than a few big ones.
But to get back to the point and reply to psycopathicteen's question: what I mean is dealing with "real" objects, and so having the CPU handle the state and physics of more than 40 objects at once. I don't remember my code, and it wasn't optimized to its best, but still, what I wanted to do was just a simple vertical shooter with many bullets (8x8) and enemies (16x16) on screen, to take advantage of the 128 sprites offered. All bullets and enemies had their own movement, but I quickly ran into slowdown.
Then I did the same on the Megadrive with simple C code, and it was actually faster than my 65816 asm code... I didn't get very far with that; the sprite allocation was fixed and my main concern was more about handling the objects. I probably still have the code somewhere, but anyway, the point is that it was faster on the Megadrive in C than on the SNES in assembly. For the technical details, I was using 16.8 signed fixed point on the SNES and 16.16 signed fixed point on the MD, but I didn't do anything crazy slow on the SNES (I don't remember exactly; it was something like 5 or 6 years ago).
Stef wrote:
I really don't get the horizontal column scrolling on mode 2, as it doesn't seem very useful at all. It would have been better if they had ditched it and devoted the extra 2bpp to something else, like what the mode with the 8bpp and the 2bpp layer did. I think it would have been cool to see 6bpp graphics.
Really though with the background layers, I would only ever really use mode 1, mode 2, mode 3, and mode 7. The rest are pretty much near useless.
You know, this would eat vram like a monster, but one 8bpp layer at 512x448 would be really cool for title screens.
Yeah, high resolution could be nice for title screens, but honestly, is that really worth a special mode? :p
Also, I don't like how they made the per-tile offset scrolling stuff; it's overcomplicated and again probably generates a lot of CPU overhead when you want to use it for simple effects. Generally I see it used for column tile scrolling (Starfox, Earthworm Jim bonus level, RR), but in this case the way it's done on the Megadrive is definitely smarter and easier; the only drawback is that it works in 2-cell units (probably to not eat too much of the VDP die space, as the VSRAM is directly included in it).
Quote:
-Using 8-bit where 16-bit is needed
-Redundant loads and stores
-Overuse of subroutines
-Pushing and pulling everything inside the subroutine
I'm definitely not an expert in 65816 programming, but generally speaking I do have some experience with assembly programming, and I know the main bottlenecks you can hit. One of the problems of the 65816 is the number of memory accesses you need: you do all the work from memory, and the many load/store operations make things slow very quickly.
Looking at the 65816 instruction cycles, the direct page doesn't have much benefit, as it has a 1-cycle penalty when the low byte is <> 0, and when you use it with the X index it requires 4 cycles at best (8-bit mode), as much as absolute X-indexed access, actually...
So when you perform heavy calculations such as position updates, you will always need *at least* 4 cycles to access 8 bits, or 5 cycles for 16 bits... that's the problem with this CPU: you can never get below that (while the 68000 needs 4 cycles for a 16-bit access). And that's without even speaking of the general overhead from the poor instruction set (CLC, SEP, REP...).
Quote:
May want to look at
this. I could have optimized out the jump, but the gain was nearly minimal in exchange for a huge size trade-off.
Maybe it could be a good comparison point: try to replicate this piece of code on the 65816 and see how long it takes...
Quote:
Maybe it's irrelevant since it's JMP instead of JSR, but still. 68000's biggest weakness is any memory access, which I compare to a modern CPU getting a cache miss on every single access (to give you an idea of how bad it is and why you should keep everything in registers wherever possible).
I think the memory accesses are not that slow; they are slow taken independently, but if you can arrange your data structures to do a single MOVEM to load your registers, then do the computation from registers and store the result back with another MOVEM, it is quite efficient. I do that for some matrix operations, and the final operation count is really good considering we are on a simple 68000
Quote:
Of course, apply some common sense: aside from some speedcode, you'll have a lot of spare CPU time anyway unless you're overdoing it, so don't waste too much time optimizing as long as your code is decent and simple enough.
I do agree with that, it's also why i am using C when i can.
I wonder if psycopathicteen would want to see your 65816 source code and point out obvious problems. Apparently there are a lot of people who try programming 65816 without knowing
peephole optimization.
tepples wrote:
I wonder if psycopathicteen would want to see your 65816 source code and point out obvious problems.
Apparently there are a lot of people who try programming 65816 without knowing peephole optimization.
I agree with that ..
if you code the 65xx like a Z80 or 68K, you'll hit a wall.
Hmm.. I don't really understand when you talk about programming the 65816 as if it were the 68000. There is actually no way of doing that, you know? Though it is actually possible to program the 68000 like a 65xx (it would just be pretty inefficient).
Every CPU has its own strengths, and of course you study the instruction set a bit before trying to code anything. Even the Z80 needs to be coded in a very different way from the 68000, and to be honest I'm quite uncomfortable with that one. But the 65xx has a very simple and limited instruction set; it is not as if you can discover some mysterious tricks...
Tepples> these optimizations are just what I would call "the basic good practices of assembly", nothing tricky here, really :-/ As I said, my code was probably not optimized to its best, but it was not awful either...
Quote:
Every CPU have their own strength and of course you study a bit the instruction set before trying to code anything.
Of course, Stef, but we cannot say whether a CPU is good or bad from that alone, especially if you're more experienced and comfortable with one architecture ..
You cannot look at an ISA and say "oh, it's pretty limited, so it's a bad and inefficient CPU"; everything depends on how and where it's used ..
Quote:
As I said, my code was probably not optimized to its best, but it was not awful either...
And you're wrong: on the 65xx there is a large margin between bad/naively converted code and optimized code ..
And like tepples, I'm curious to see your code; I'm pretty sure you coded it like a 6502..
Stef wrote:
But the 65xx has a very simple and limited instruction set; it is not as if you can discover some mysterious tricks...
It appears the problem is that you are used to CISC processors and not RISC ones. Honestly, I'd say that there are more "mysterious tricks" you can do with RISC processors, due to the fact that there are multiple ways of going about things. I don't know about the 68000, but on the 65816 there are a ton of different ways to do the same thing. For example, you could write
Code:
inx
inx
inx
inx
inx
inx
but this is inefficient. The better thing to do would be to do this:
Code:
txa
clc
adc #$06
tax
I imagine on the 68000 there isn't as much freedom in what you can do. It's like having a box of 24 different markers versus having red, yellow, and blue paint. One processor isn't necessarily better than the other; there's just a different way of doing things. The "peephole optimization" example tepples gave is really good: it doesn't make sense to load the same value twice, whereas on the 68000, if it's anything like the NEC V33/8086, I'm pretty sure loading and storing can be one instruction, so you'd issue the same instruction twice if you were loading the same number. Like I said, the 65816 gives you more freedom, but the 68000 has everything conveniently done for you, at the cost of a little less flexibility. I don't really think one architecture is better than the other; it's just what you're used to.
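As a quick sanity check of the trade-off in the INX example above, here is the arithmetic using the standard 65816 cycle counts (INX, TXA, CLC, TAX, and immediate ADC are each 2 cycles in 8-bit mode); the break-even point is my own observation, not from the thread:

```python
# 65816 cycle counts, 8-bit accumulator/index mode (per the data sheet).
INX = TXA = CLC = TAX = ADC_IMM = 2

repeated_inx = 6 * INX                    # six single increments
add_constant = TXA + CLC + ADC_IMM + TAX  # transfer, clear carry, add #6, transfer back

print(repeated_inx, add_constant)  # 12 8
# At 4 increments both cost 8 cycles; from 5 increments up, the add wins.
```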
Espozo wrote:
{ 6x INX }
but this is inefficient. The better thing to do would be to do this:
{ TXA / CLC / ADC / TAX }
Just as a side note, this is only really more efficient if the A register is a free resource in this context. If the use of TXA means you need an extra LDA/STA, it might be a wash or even a net loss. A lot of assembly optimization is about effective use of registers.
If the value being added is in RAM (zp), you can do this on the HuC6280:
Code:
ldx #low(zp_var)
clc
set
adc #$6
rather than:
lda <zp_var
clc
adc #$6
sta <zp_var
#$6 is added directly at the zp_var location, without touching the A register.
rainwarrior wrote:
Espozo wrote:
{ 6x INX }
but this is inefficient. The better thing to do would be to do this:
{ TXA / CLC / ADC / TAX }
Just as a side note, this is only really more efficient if the A register is a free resource in this context. If the use of TXA means you need an extra LDA/STA, it might be a wash or even a net loss. A lot of assembly optimization is about effective use of registers.
Very true; I actually can't think of a situation where I've run into that problem. Most of the time, I use that trick when preparing for a loop, where nothing I'm going to use again is loaded into the accumulator. If I'm not mistaken, aren't excessive loads and stores the reason C-generated code on the 65816 is so inefficient?
About one thing: does "registers" refer to the accumulator, X, and Y, or does it refer to bytes in RAM? I thought it was RAM, but I've been hearing otherwise. If it's the second, it's good the SNES has more than enough of it.
The term "register" refers to internal storage in the CPU, like A/B/X/Y, the stack pointer, the flags, etc.
Zero page RAM is sometimes thought of as a pseudo-register on the 6502, since it is quicker to access than other external memory.
Conversely, on some RISC architectures there are a lot of general purpose registers, and there are even random-access/indexed usage patterns that might make them seem like pseudo-RAM.
rainwarrior wrote:
like A/B/X/Y
B?
The one exception is the term "register" used to describe an
MMIO register. These are not CPU registers, but rather memory addresses used to control some particular part of the system (ex. $2100 on the SNES). The use of the term "register" here is absolutely acceptable, and anyone with half a clue will understand the difference between a MMIO register and a CPU register.
TL;DR -- when using the term "register", use context.
Espozo wrote:
rainwarrior wrote:
like A/B/X/Y
B?
Probably referring to the bank register on the 65816, commonly referred to as "B" or sometimes "PBR".
Thread is now officially off-topic.
B is the top byte of C, which is usually referred to as A. Technically A is the low byte of C, not the whole thing. Hence opcodes like tcd and xba.
In 8-bit mode the accumulator is A. In 16-bit mode it's C. But we still call it A...
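A tiny Python model of this (my own sketch, not emulator code), showing XBA swapping the B (high) and A (low) bytes of the 16-bit C accumulator:

```python
def xba(c):
    """Model the 65816 XBA instruction: exchange the high byte (B)
    and low byte (A) of the 16-bit accumulator C."""
    a = c & 0x00FF          # low byte: the "A" accumulator
    b = (c >> 8) & 0x00FF   # high byte: the "B" accumulator
    return (a << 8) | b

print(hex(xba(0x12AB)))  # 0xab12
```

Applying it twice gives back the original value, just as two XBAs do on the real CPU.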
koitsu wrote:
The one exception is the term "register" used to describe an
MMIO register. These are not CPU registers, but rather memory addresses used to control some particular part of the system (ex. $2100 on the SNES). The use of the term "register" here is absolutely acceptable, and anyone with half a clue will understand the difference between a MMIO register and a CPU register.
TL;DR -- when using the term "register", use context.
Thanks you for that. (I'm not being sarcastic, it just kind of sounds like it.)
koitsu wrote:
Thread is now officially off-topic.
I have a way of doing that...
Anyway, about the 65816's speed...
koitsu wrote:
The one exception is the term "register" used to describe an
MMIO register. These are not CPU registers, but rather memory addresses used to control some particular part of the system (ex. $2100 on the SNES). The use of the term "register" here is absolutely acceptable, and anyone with half a clue will understand the difference between a MMIO register and a CPU register.
These aren't really different. The term "register" for memory-mapped registers is still referring to some internal storage on the chip. The memory address is just the mapping, the point of access. The "register" part is still internal to the chip.
Doesn't Gunstar Heroes only have about 40 onscreen objects anyway? It just looks like there's more because the sprites are bigger than in a shmup. If it had 80 or 128 objects onscreen you wouldn't be able to move your character around without dying.
Quote:
And you're wrong: on the 65xx there is a large margin between bad/naively converted code and optimized code ..
And like tepples, I'm curious to see your code; I'm pretty sure you coded it like a 6502..
I definitely don't agree with that: there is less optimization margin on a 65xx CPU than on a 68000. A more limited ISA = more limited possibilities = more limited ways of exploiting it... definitely. And even if you can use LUTs and stuff like that, that's also the case on every other CPU. The 68000 offers a lot of possible optimizations; one reason is the large number of registers: you can build the algorithm to fit in registers, and you cannot take that approach on a 65xx-family CPU, for instance.
If you think the 68000 is substantially faster than the 65816 at a particular task assuming (say) a 15:7 clock ratio, post your 68000 code and the best 65816 code you could come up with. Then psycopathicteen and I would probably be willing to help you fix it. Until I see something along these lines, I'm going with "imagined".
And the difference between CPU registers and MMIO is such that I've started calling MMIO addresses "ports" much of the time.
Espozo wrote:
It appears the problem is that you are used to CISC processors, and not RISC ones. Honestly, I'd say that there are more "mysterious tricks" you can do with risc processors, do to the fact that there are multiple ways of going about things.
Are you kidding? I'm well aware of CISC and RISC (I've programmed both ARM and SuperH CPUs) and I don't see the point here. The 65xx is definitely not RISC (even if the limited ISA looks a bit like it).
Quote:
I imagine on the 68000, there isn't as much freedom to what you can do.
Really, you should look at how other CPUs work... almost all CPUs can do that; this is again "basic good practice" of asm programming.
tepples wrote:
If you think the 68000 is substantially faster than the 65816 at a particular task assuming (say) a 15:7 clock ratio, post your 68000 code and the best 65816 code you could come up with. Then psycopathicteen and I would probably be willing to help you fix it. Until I see something along these lines, I'm going with "imagined".
And the difference between CPU registers and MMIO is such that I've started calling MMIO addresses "ports" much of the time.
Ok, I don't have time now, but I will try to do that later
Stef wrote:
Are you kidding? I'm well aware of CISC and RISC (I've programmed both ARM and SuperH CPUs) and I don't see the point here. The 65xx is definitely not RISC (even if the limited ISA looks a bit like it).
Um, thanks? You act like it's so outrageous I said it was a RISC processor when you just said that it's a bit of a wash.
Stef wrote:
Really, you should look at how other CPUs work... almost all CPUs can do that; this is again "basic good practice" of asm programming.
I'm just really not sure why you like to act like "Mr. 65816 know-it-all" when you said yourself that you haven't done very much 65x asm and that you did it about 5-6 years ago.
I looked at ARM 6 and x86 for about a week, does that make me an expert?
Espozo wrote:
B?
I was specifically thinking of the 65816, where the high byte of the 16-bit accumulator register is sometimes called B, such as with the instruction
XBA, though some other CPUs have a B register too (e.g. M6809).
TOUKO wrote:
Of course this is not as dramatic as it seems, but it's not as fast as on the 65xx, and you must dedicate 2 registers just for that effect ..
What second register? o.O
Code:
move.l #$40000010, ($C00004)
move.l (a6)+, ($C00000)
Or if you want the full code (which is where it gets slow, ugh):
Code:
IRQ4:
move.l a6, -(sp)
move.l usp, a6
move.l #$40000010, ($C00004)
move.l (a6)+, ($C00000)
move.l a6, usp
move.l (sp)+, a6
rte
Although if you can afford to waste the register:
Code:
IRQ4:
move.l #$40000010, ($C00004)
move.l (a6)+, ($C00000)
rte
Quote:
What second register? o.O
For the second value, if you also scroll on Y ..
Isn't using (An) slower ??
How many cycles does this take ??
Code:
move.l #$40000010, ($C00004)
move.l (a6)+, ($C00000)
I think with the interrupt overhead it's close to 100 cycles per line
Code:
move.l a6, -(sp)
move.l usp, a6
move.l #$40000010, ($C00004)
move.l (a6)+, ($C00000)
move.l a6, usp
move.l (sp)+, a6
Here, with the interrupt overhead, it's definitely too slow to do every line in a game like Axelay.
Espozo wrote:
Um, thanks? You act like it's so outrageous I said it was a RISC processor when you just said that it's a bit of a wash.
Well, you said that as if I were biased toward CISC CPUs or not used to RISC. Again, honestly, as soon as you have some general knowledge of assembly programming, you quickly understand the logic of a CPU and how to use its instruction set. You can deduce cycles from instructions, and you tend to naturally use fast ways of doing things (at least I do). Of course, with experience you will learn some tricks to gain a few cycles here and there, but that won't make a big difference.
And again, even if the 6502 is not microprogrammed, it is not RISC by itself. Generally RISC has a fixed instruction length (16 or 32 bits) to start with, so instruction fetch time is fixed... that's definitely not the case with the 6502.
Stef wrote:
I'm just really not sure why you like to act like "Mr. 65816 know-it-all" when you said yourself that you haven't done very much 65x asm and that you did it about 5-6 years ago.
I never said I am an expert with the 65816, but you can clearly evaluate a CPU's performance by looking at its ISA; if you have experience with assembly programming, you quickly "see" what can and can't be done... that's all. I started programming in assembly about 20 years ago; it's not as if I'm discovering this today. By the way, did you see Byuu's reply in the animation player topic? He wrote Higan, so I guess he knows his business, and he's the first to claim the 65816 is slow.
Quote:
I looked at ARM 6 and x86 for about a week, does that make me an expert?
I'm afraid not :p but maybe you will quickly become one, who knows
tepples wrote:
If you think the 68000 is substantially faster than the 65816 at a particular task assuming (say) a 15:7 clock ratio, post your 68000 code and the best 65816 code you could come up with. Then psycopathicteen and I would probably be willing to help you fix it. Until I see something along these lines, I'm going with "imagined".
Ok, first, just for fun:
3D transformation and 2D projection. Here are the prototypes of the methods:
Code:
/**
* \brief
* Process 3D transform (rotation and translation) to specified 3D vertices buffer.
*
* \param t
* Transformation object containing rotation and translation parameters.
* \param src
* Source 3D vertices buffer.
* \param dest
* Destination 3D vertices buffer.
* \param numv
* Number of vertices to transform.
*/
void M3D_transform(Transformation3D *t, const Vect3D_f16 *src, Vect3D_f16 *dest, u16 numv);
/**
* \brief
* Process 2D projection to specified 3D vertices buffer (s16 version).
*
* \param src
* Source 3D vertices buffer.
* \param dest
* Destination 2D vertices buffer - s16 format
* \param numv
* Number of vertices to project.
*/
void M3D_project_s16(const Vect3D_f16 *src, Vect2D_s16 *dest, u16 numv);
If you want to look at the structure definitions:
https://github.com/Stephane-D/SGDK/blob ... nc/maths.h
https://github.com/Stephane-D/SGDK/blob ... /maths3D.h
Then the code:
Code:
M3D_transform:
movem.l %d2-%d5/%a2-%a5,-(%sp)
move.l 36(%sp),%a2 | a2 = &transform
move.l 2(%a2),%a0 | a0 = &translation
movem.w (%a0)+,%a3-%a5 | a3 = translation.x a4 = translation.y a5 = translation.z
move.l 40(%sp),%a0 | a0 = src
move.l 44(%sp),%a1 | a1 = dest
lea 10(%a2),%a2 | a2 = &(transform.mat)
move.w 50(%sp),%d5 | d5 = numv
subq.w #1,%d5
jmi .L50
.L48:
movem.w (%a0)+,%d2-%d4 | d2 = sx d3 = sy d4 = sz
move.w (%a2)+,%d0 | d0 = mat.a.x
muls.w %d2,%d0 | d0 = mat.a.x * sx
move.w (%a2)+,%d1 | d1 = mat.a.y
muls.w %d3,%d1 | d1 = mat.a.y * sy
add.l %d1,%d0 | d0 = (mat.a.x * sx) + (mat.a.y * sy)
move.w (%a2)+,%d1 | d1 = mat.a.z
muls.w %d4,%d1 | d1 = mat.a.z * sz
add.l %d1,%d0 | d0 = (mat.a.x * sx) + (mat.a.y * sy) + (mat.a.z * sz)
asr.l #6,%d0
add.w %a3,%d0 | d0 = (mat.a.x * sx) + (mat.a.y * sy) + (mat.a.z * sz) + translation.x
move.w %d0,(%a1)+ | dest++ = (mat.a.x * sx) + (mat.a.y * sy) + (mat.a.z * sz) + translation.x
move.w (%a2)+,%d0 | d0 = mat.b.x
muls.w %d2,%d0 | d0 = mat.b.x * sx
move.w (%a2)+,%d1 | d1 = mat.b.y
muls.w %d3,%d1 | d1 = mat.b.y * sy
add.l %d1,%d0 | d0 = (mat.b.x * sx) + (mat.b.y * sy)
move.w (%a2)+,%d1 | d1 = mat.b.z
muls.w %d4,%d1 | d1 = mat.b.z * sz
add.l %d1,%d0 | d0 = (mat.b.x * sx) + (mat.b.y * sy) + (mat.b.z * sz)
asr.l #6,%d0
add.w %a4,%d0 | d0 = (mat.b.x * sx) + (mat.b.y * sy) + (mat.b.z * sz) + translation.y
move.w %d0,(%a1)+ | dest++ = (mat.b.x * sx) + (mat.b.y * sy) + (mat.b.z * sz) + translation.y
muls.w (%a2)+,%d2 | d2 = mat.c.x * sx
muls.w (%a2)+,%d3 | d3 = mat.c.y * sy
add.l %d3,%d2 | d2 = (mat.c.x * sx) + (mat.c.y * sy)
muls.w (%a2)+,%d4 | d4 = mat.c.z * sz
add.l %d4,%d2 | d2 = (mat.c.x * sx) + (mat.c.y * sy) + (mat.c.z * sz)
asr.l #6,%d2
add.w %a5,%d2 | d2 = (mat.c.x * sx) + (mat.c.y * sy) + (mat.c.z * sz) + translation.z
move.w %d2,(%a1)+ | dest++ = (mat.c.x * sx) + (mat.c.y * sy) + (mat.c.z * sz) + translation.z
lea -18(%a2),%a2 | a2 = &(transform.mat)
dbra %d5,.L48
.L50:
movem.l (%sp)+,%d2-%d5/%a2-%a5
rts
Code:
M3D_project_s16:
movem.l %d2-%d7/%a2-%a3,-(%sp)
move.l 36(%sp),%a0 | a0 = s = src
move.l 40(%sp),%a1 | a1 = d = dst
move.w 46(%sp),%d7 | d7 = i = numv
subq.w #1,%d7
jmi .L42
lea context3D,%a2
move.w (%a2)+,%d5
lsr.w #1,%d5 | d5 = centerX = viewport.x / 2
move.w (%a2)+,%d6
lsr.w #1,%d6 | d6 = centerY = viewport.y / 2
moveq #0,%d4
move.w (%a2),%d4 | d4 = camDist
move.w %d5,%d0
swap %d0
move.w %d6,%d0
move.l %d0,%a2 | a2 = (centerX << 16) | centerY
move.w %d4,%d0
ext.l %d0
swap %d0
asr.l #6,%d0
move.l %d0,%a3 | a3 = camDist << (6 + 4)
| 6 to prepare for division
| 4 for *16 ratio
.L40: | while(i--) {
movem.w (%a0)+,%d0-%d2 | d0 = x = s->x, d1 = y = s->y, d2 = z = s->z
add.w %d4,%d2 | if ((zi = (camDist + z)) > 0)
jle .L38 | {
move.l %a3,%d3
divs.w %d2,%d3 | d3 = scale = fix16Div((camDist << (4 + 2)), camDist + z)
muls.w %d3,%d0
swap %d0
rol.l #4,%d0
add.w %d5,%d0
move.w %d0,(%a1)+ | d->x = centerX + (scale * x)
muls.w %d3,%d1
swap %d1
rol.l #4,%d1
move.w %d6,%d0
sub.w %d1,%d0
move.w %d0,(%a1)+ | d->y = centerY - (scale * y)
dbra %d7,.L40
movem.l (%sp)+,%d2-%d7/%a2-%a3
rts | }
| else
.L38: | {
move.l %a2,(%a1)+ | d->x = centerX
| d->y = centerY
| }
dbra %d7,.L40 | }
.L42:
movem.l (%sp)+,%d2-%d7/%a2-%a3
rts
The multiply and division instructions are reputed to be very slow on the 68000 (and they are); still, I can transform (3D transform + 2D projection) about 10000 vertices per second on the MD 68000.
Vertices are 16-bit fixed point (10.6 signed); the 5A22 has built-in multiply and division units (which are faster than the 68000's mul/div instructions), so it should be doable.
I will try to find something better, as I admit using code with multiplication and division is not really fair to the 65816.
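To make the fixed-point bookkeeping concrete, here is a minimal C sketch of what one output component of M3D_transform computes, assuming SGDK-style 10.6 signed fixed point. The `transform_row` name and standalone typedef are illustrative, not SGDK's actual API.

```c
#include <stdint.h>

typedef int16_t fix16;  /* 10.6 signed fixed point, as in SGDK's Vect3D_f16 */

/* One row of the 3x3 matrix applied to a source vertex, plus translation.
   The product of two 10.6 values is 20.12, so we accumulate in 32 bits and
   shift right by 6 to return to 10.6 -- the asr.l #6 in the 68000 code. */
static fix16 transform_row(fix16 mx, fix16 my, fix16 mz,
                           fix16 sx, fix16 sy, fix16 sz, fix16 t)
{
    int32_t acc = (int32_t)mx * sx + (int32_t)my * sy + (int32_t)mz * sz;
    return (fix16)((acc >> 6) + t);
}
```

With an identity-like row (mx = 1.0 = 64 in 10.6, my = mz = 0), the source coordinate passes through unchanged apart from the translation.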
Stef wrote:
I will try to find something better, as I admit using code with multiplication and division is not really fair to the 65816.
And although this doesn't really have anything to do with the 65816, let's not forget that planar graphics format... I think something like collision detection or other 2D code would make for a fairer comparison.
Quote:
I will try to find something better, as I admit using code with multiplication and division is not really fair to the 65816.
Why? These are 68k instructions; it's fair IMO.
Stef, don't be overconfident
Honestly, I think there are nice ways of optimizing the code on the 5A22, as you can let the multiply and division units work while you do other things on the 65816. Still, the units aren't as powerful as the 68000 instructions: the multiply is only 8x8=16 (unsigned) and the division 16/8=16:16 (unsigned), where the 68000 supports 16x16=32 and 32/16=16:16, both signed and unsigned, so more operations are needed... In my case, as long as we safely keep the 10.6 signed fixed-point information during the computations, I'm OK
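As a sketch of the extra work involved on the SNES side, here is how a 16x16=32 unsigned multiply decomposes into the 8x8=16 products that the 5A22's unit provides. This is plain C for illustration, not real register-level SNES code; `mul16_from_mul8` is a made-up name.

```c
#include <stdint.h>

/* Build a 16x16=32 unsigned multiply from four 8x8=16 partial products:
   a*b = (ah*bh << 16) + ((ah*bl + al*bh) << 8) + al*bl
   On the 5A22 each partial product would be one use of the 8x8 unit. */
static uint32_t mul16_from_mul8(uint16_t a, uint16_t b)
{
    uint8_t al = (uint8_t)(a & 0xFF), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xFF), bh = (uint8_t)(b >> 8);
    uint32_t p0 = (uint32_t)al * bl;   /* low  x low  */
    uint32_t p1 = (uint32_t)al * bh;   /* low  x high */
    uint32_t p2 = (uint32_t)ah * bl;   /* high x low  */
    uint32_t p3 = (uint32_t)ah * bh;   /* high x high */
    return (p3 << 16) + ((p1 + p2) << 8) + p0;
}
```

Four multiplies plus the shifted adds is the cost the 65816 pays for each 16x16 product that the 68000 gets in a single muls/mulu instruction.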
Espozo wrote:
And although this doesn't really have to do with the 65816, let's not forget that planar graphics format... I think something like a collision detection code or other 2D things would be better for a fair comparison.
Of course I don't consider the graphics format; we're just talking about the CPUs here... and as I mentioned to Touko, you can use the external multiply and division units to perform the operation if you want (in which case you can't use Mode 7 anymore, as I believe it uses these units).
Quote:
you can use the multiply and division units from the PPU to perform the operation if you want (in which case you can't use the mode 7 anymore as i believe it uses these units).
In this case it's not really fair, because mul/div on the SNES is not part of the CPU...
For a good comparison, you must fake these instructions by some means, in real time or precalculated, or do what Stef did some other way.
TOUKO wrote:
In this case it's not really fair, because mul/div on the SNES is not part of the CPU...
That depends on what you're actually comparing... if it's 68000 vs. 65816, then yeah, that would be cheating. However, that comparison doesn't make much sense IMO, since we don't even know how fast these CPUs are clocked unless we bring actual systems into the discussion. As I see it, this is about Genesis vs. SNES, so anything the stock hardware has to offer is fair game.
Quote:
That depends on what you're actually comparing... if it's 68000 vs. 65816
Yes, it's 65816 vs 68000, not the entire system; this is why Stef doesn't care about planar/packed display.
An MD vs SNES comparison would also be unfair, because the MD uses a sort of chunky mode vs. the SNES's planar one.
TOUKO wrote:
Quote:
you can use the multiply and division units from the PPU to perform the operation if you want (in which case you can't use the mode 7 anymore as i believe it uses these units).
In this case it's not really fair, because mul/div on the SNES is not part of the CPU...
For a good comparison, you must fake these instructions by some means, in real time or precalculated, or do what Stef did some other way.
Honestly, if you don't use the external multiply and division units, then I think you have no chance of getting comparable execution times :-/
You can try using a LUT, but to keep a reasonable table size you need 8x8 multiply or 8/8 division LUTs, which then require more operations.
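One classic LUT approach is the quarter-square trick, which turns an 8x8 multiply into two table lookups and a subtraction. It works because a*b = f(a+b) - f(a-b) with f(x) = x*x/4, and the floor in the division cancels since a+b and a-b always have the same parity. This is a generic 65xx-family technique, not something from Stef's code; a C sketch with illustrative names:

```c
#include <stdint.h>

static uint16_t qsq[512];   /* f(x) = floor(x*x/4) for x in 0..510 */
static int qsq_ready = 0;

static void qsq_init(void)
{
    for (uint32_t x = 0; x < 512; x++)
        qsq[x] = (uint16_t)(x * x / 4);
    qsq_ready = 1;
}

/* 8x8=16 unsigned multiply via two lookups and one subtraction. */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    if (!qsq_ready) qsq_init();     /* lazy init for this sketch */
    uint16_t hi = qsq[(uint16_t)a + b];
    uint16_t lo = qsq[a >= b ? (uint16_t)(a - b) : (uint16_t)(b - a)];
    return (uint16_t)(hi - lo);
}
```

On a 65xx CPU the 1 KB table (512 16-bit entries) and two indexed loads are cheap; the cost, as Stef notes, is that anything wider than 8x8 must be composed from several of these.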
TOUKO wrote:
Yes, it's 65816 vs 68000, not the entire system; this is why Stef doesn't care about planar/packed display.
OK, and how can you tell who wins if you don't know the speed they're being clocked at?
Quote:
Honestly, if you don't use the external multiply and division units, then I think you have no chance of getting comparable execution times :-/
Maybe, but in that case why not use an external chip, then? The big deal is mainly mul/div.
3D is not my cup of tea; I'll let more experienced coders here do their own versions.
tokumaru wrote:
TOUKO wrote:
Yes, it's 65816 vs 68000, not the entire system; this is why Stef doesn't care about planar/packed display.
OK, and how can you tell who wins if you don't know the speed they're being clocked at?
Of course we compare the MD 68000 (7.67 MHz) and the SNES 65816 (considering fast ROM speed if you want). There are good reasons for these CPUs to run at their specific speeds, so we have to compare them at those clocks. The 15:7 ratio is not correct, as it assumes 3.58 MHz for the 65816, which is the speed only for PPU port and fast ROM accesses; we also have to consider RAM access at 2.68 MHz.
Stef wrote:
Of course we compare the MD 68000 (7.67 MHz) and the SNES 65816 (considering fast ROM speed if you want).
That's exactly my point! If you know which machines have these CPUs, you know what other hardware is available to use (e.g. SNES multiplier). I don't see the point in pretending the CPUs are isolated, we're talking about operations that real games, running on real machines would need, so you have to consider everything the machines can do to achieve the goal, not only the CPU.
I don't care if a CPU can make a billion 3D calculations per second if the video hardware is too slow to render the graphics, because it means that despite the powerful CPU, the console still can't host a proper 3D game. Similarly, another console with a weak CPU could make up for it with an insanely powerful/versatile GPU. It's all about what the systems as a whole can do, because when we're talking about games, CPUs are just a part of the equation, it's pointless to compare just that aspect.
I'm of the opinion that you can only settle the SNES vs. MD debate if you actually try to implement similar games on both platforms and look at the whole picture. You can keep pointlessly comparing CPUs (that are floating in space and connected to nothing), but whatever conclusion you get to will be absolutely no indication of what games you can make with them. Do what you will.
What tokumaru said pretty much just rendered this thread useless, and I agree with him. I know Stef wanted to see a GH port on the SNES, and that would be a good comparison of how the systems stack up against each other, but the truth is, nobody cares enough to make one. Basically, this is, and seems like it always will be, a bunch of useless speculation. Stef thinks the Genesis as a whole is more powerful than the SNES. Great. I think the SNES as a whole is more powerful. Great. If you truly believe one way, why try to convince everybody that you're correct? Just sit knowing that what you think is right is right and other people are wrong. I don't think anyone is ever really going to change their mind about it.
By the way Stef, this wasn't specifically targeted toward you. This actually applies to me quite a bit.
Espozo wrote:
Stef thinks the Genesis as a whole is more powerful than the SNES. Great. I think the SNES as a whole is more powerful.
I honestly don't care which one is more powerful. This is not 1994 anymore, and both consoles are dirt cheap (I, for example, own several versions of both), so it's not like you have to choose. I can understand 90's teens defending their consoles, because few could afford both and nobody likes to admit they own something inferior, but come on, both are great in their own way... if you like retro games, that is - a lot of new gamers will just say both consoles are crap, and that a PS4 is way more powerful.
Nowadays you can literally play both complete libraries for free (not necessarily legally). This is great, because we get to experience the best of both worlds, enjoying what each console does best. Both have great games and nobody can deny that, so you can either waste your time arguing about which console is more powerful or play great games. Your call. If you're a developer, you don't have to prove anything to anyone; just make your game for the platform you think is better suited. If you do feel like you have to prove something, well, go ahead, it's always fun for the rest of us to see "unconventional" homebrews.
Code:
$ ls Gu*tar\ Hero*
To port GH, first you'd need to make a controller. You'd have to choose a three-button layout (Gunstar Freaks), a five in line layout (Gunstar Hero), or a three-by-two layout (Gunstar Hero Live). The Genesis 6-button controller has a three-by-two layout that matches Gunstar Hero Live, while most Super NES controllers other than fighting game sticks have a 4 button diamond.
The all-new plastic gunstar...and its Sega 6-button pad approximation
[/silly]
Espozo wrote:
What tokumaru said pretty much just rendered this thread useless, and I agree with him.
Well, not really... The debate started because we read here and there that it is actually possible to reproduce the most impressive Genesis games on the SNES, and specifically Gunstar Heroes. By saying that, you were claiming the SNES could handle it, and so that the SNES CPU can replicate everything that happens in this game.
To that claim I just replied that you're wrong (not specifically you), because the MD 68000 is more powerful than the SNES 65816, and this game is already quite demanding on the 68000. So yes, basically we wanted to compare the CPU capabilities of these systems... but I admit that we can consider the SNES hardware as a whole when it can assist the CPU, and the mul/div units are there for that: the 65816 doesn't support native multiply or division, so they added specific units for that to the 5A22... When I posted the code, of course I considered using them on the SNES side.
Quote:
I know Stef wanted to see a GH port on the SNES, and that would be a good comparison of how the systems stack up against each other, but the truth is, nobody cares enough to make one.
Of course nobody will attempt a port just for that reason; it's far too much work. But there is no need to port GH to the SNES to accept that the 65816 can't replicate what the 68000 can do. We can post simple code samples (2D collisions, physics calculations, line drawing, whatever...) and compare; I think that is a good way to evaluate the performance of both CPUs.
Quote:
Basically, this is, and seems like it always will be, a bunch of useless speculation. Stef thinks the Genesis as a whole is more powerful than the SNES. Great. I think the SNES as a whole is more powerful.
I never said the Genesis *as a whole* is more powerful. I just said the Sega Genesis has a more powerful CPU *without any doubt*, and this allows it to do a lot more in terms of processing. But the Sega Genesis also displays fewer colors than the SNES *without any doubt*, and the SNES supports hardware effects the Genesis cannot replicate, even in software (such as transparencies).
Stef wrote:
you can use the external multiply and division units to perform the operation if you want (in which case you can't use Mode 7 anymore, as I believe it uses these units).
No, Mode 7 does not use the ALU. The PPU has a separate 16x8 signed multiply capability for that. And if you're not using Mode 7, the CPU can use the PPU multiply hardware - it seems the result appears faster than the CPU can read it, so there's no waiting involved.
If you are using Mode 7, rendering is dead simple. You don't even have to interleave the tilemap bytes, because of how the VRAM gate works. The only problems are (a) you can't use PPU multiply, and (b) you can't render a whole screen because you only have 256 unique tiles; Wolfenstein 3D got around this by using Mode 7 to zoom in on a low-res playfield, resulting in blocky graphics...
By the way, I have an idea for an algorithm to compare the CPUs: simple Bresenham line drawing.
Honestly, I think that is typically a case where the 65xx architecture is at an advantage: byte memory accesses (we consider 8 bits per pixel) and many conditional jumps. To make it even simpler for the 65816, I restrict it to 8-bit coordinates (0-255). I'm not really interested in the initialization code but in the drawing loop, as the bottleneck is there. I really believe this is a case where the 65816 can perform well compared to the 68000, but OK, let's go.
To start, we don't push the optimization too far with insane unrolling or anything like that: just a smart implementation of the Bresenham algorithm with the smallest loop possible!
Code:
drawLine:
init
....
.loop_dx_sup_dy
add.w d5,d0 ; 4 - / check if you need to jump to
bcc.s .mov1 ; 10 - \ the next scanline
adda.l d2,a0 ; +8 - jump to the next scanline
.mov1
move.b d6,(a0)+ ; 8 - plot and go to next pixel
dbra d4,.loop_dx_sup_dy ; 10
jmp .end
.loop_dx_inf_dy
add.w d4,d0 ; 4- / check if you need to jump to
bcc.s .mov2 ; 10 - \ the next pixel in the scanline
addq.l #2,a0 ; +8 - go to next pixel in scanline
.mov2
move.b d6,(a0) ; 8 - plot the pixel
adda.l d2,a0 ; 8 - go to next scanline
dbra d5,.loop_dx_inf_dy ; 10
.end
....
rts
So we need 32 (+8) ~36 cycles per pixel when DX > DY, and 40 (+8) ~44 cycles per pixel when DY > DX, so a mean of 40 cycles per pixel.
I know we can do it with fewer cycles on the 65816, but can it be fewer than 19 cycles, to be as fast clock for clock?
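For reference, the DX > DY loop being cycle-counted above corresponds to something like this C sketch, assuming a linear byte-per-pixel buffer (the same simplification Stef makes, not real MD/SNES VRAM layout). Names are illustrative, and this uses the plain integer form of the error term rather than the carry-flag trick in the 68000 version.

```c
#include <stdint.h>

/* Bresenham inner loop for the DX-major octant: one plot per iteration,
   stepping down a scanline when the error term underflows. */
static void draw_line_dx_major(uint8_t *p, int dx, int dy,
                               int stride, uint8_t color)
{
    int err = dx / 2;              /* classic Bresenham error accumulator */
    for (int i = 0; i <= dx; i++) {
        *p++ = color;              /* plot, then advance one pixel right */
        err -= dy;
        if (err < 0) {             /* time to move to the next scanline */
            err += dx;
            p += stride;
        }
    }
}
```

The 68000 version keeps the error term pre-scaled so that the "step down?" test falls out of the carry flag of a single add, which is exactly what a 65816 version would also do with adc/bcc.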
TOUKO wrote:
for the second value if you scroll on Y ..
Just store both values consecutively in the buffer, then you use the same pointer for both.
TOUKO wrote:
How many cycles does this take??
Code:
move.l #$40000010, ($C00004)
move.l (a6)+, ($C00000)
56 cycles.
TOUKO wrote:
I think with interrupts it's close to 100 cycles/line
Code:
move.l a6, -(sp)
move.l usp, a6
move.l #$40000010, ($C00004)
move.l (a6)+, ($C00000)
move.l a6, usp
move.l (sp)+, a6
152 cycles (acknowledge and RTE included), that's about 31% of the scanline duration.
TOUKO wrote:
Here, with interrupts, it's definitely too slow to do it for each line in a game like Axelay.
You haven't seen Sonic 3D's IRQ handler for the special stages then, which is a monster (and looks like this).
The thing is, when I'm programming, I make liberal use of sprites and objects all the time and I don't usually have issues with slowdown, and I'm not going to tailor my game design ideas to perceived limitations.
Stef wrote:
By saying that, you were pretending the SNES could handle it and so that the SNES CPU can replicate all what happen in this game.
But we don't know for sure whether it could, until we've seen some kind of evidence.
Stef wrote:
i just said the Sega Genesis has a more powerful CPU *without any doubt*
It may be more powerful *without any doubt*, but how much more powerful is debatable. I think they're pretty comparable, and a lot of people here also seem to.
Stef wrote:
We can post simple code samples (2D collisions, physics calculations, line drawing, whatever...) and compare; I think that is a good way to evaluate the performance of both CPUs.
That sounds good to me.
psycopathicteen wrote:
The thing is, when I'm programming, I make liberal use of sprites and objects all the time, and I don't usually have issues with slowdown, and I am not going to tailor my game design ideas for perceived limitations.
This. Honestly, if I've done what I want to do and have optimized the code as much as humanly possible, I'm probably just going to shrug my shoulders and say "oh well!" Otherwise, we probably wouldn't have seen a Gradius III port and Super R-Type. Maybe I'm biased though, because slowdown really doesn't bother me much.
Well, for a code comparison, I have an old collision routine for the SNES that I made back in my good ol' WLA days. It's probably not the best, but it's so simple that I don't see much you could do to optimize it. The numbers beside each instruction are its cycle count. They might be wrong, but I think they're good.
Code:
sep #$20 ;3
lda CollisionWidth1X ;3
clc ;2
adc CollisionWidth2X ;3
sta FinalCollisionWidthX ;3
lda Sprite1PositionX ;3
sec ;2
sbc Sprite2PositionX ;3
cmp FinalCollisionWidthX ;3
bcc _y_check ;2 (3 if the branch is taken)
clc ;2
adc FinalCollisionWidthX ;3
bcc _no_collision ;2 (3 if taken)
_y_check:
lda CollisionWidth1Y ;3
clc ;2
adc CollisionWidth2Y ;3
sta FinalCollisionWidthY ;3
lda Sprite1PositionY ;3
sec ;2
sbc Sprite2PositionY ;3
cmp FinalCollisionWidthY ;3
bcc _collision ;2 (3 if taken)
clc ;2
adc FinalCollisionWidthY ;3
bcc _no_collision ;2 (3 if taken)
_collision:
lda #$01 ;2
rts ;6
_no_collision:
lda #$00 ;2
rts ;6
If you go the longest way possible, that's 73 cycles (CLC, like SEC, is 2 cycles, and the direct page loads assume DL = 0).
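In C terms, the routine above is essentially an interval-overlap test per axis: the sprites collide when the distance between centers is less than the sum of the half-widths on both X and Y. A sketch under that reading (function and parameter names are illustrative):

```c
#include <stdint.h>

/* AABB overlap test: |x1 - x2| < (w1x + w2x) and |y1 - y2| < (w1y + w2y).
   The 65816 version gets the same result with unsigned compares and the
   carry flag instead of an explicit absolute value. */
static int sprites_collide(int16_t x1, int16_t y1, uint8_t w1x, uint8_t w1y,
                           int16_t x2, int16_t y2, uint8_t w2x, uint8_t w2y)
{
    int dx = x1 - x2;
    int dy = y1 - y2;
    if (dx < 0) dx = -dx;
    if (dy < 0) dy = -dy;
    return dx < (w1x + w2x) && dy < (w1y + w2y);
}
```
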
By the way 93143, why did you delete your post?
Oh yeah, here's my reference:
http://wiki.superfamicom.org/snes/show/ ... rence#fn:6
The code I used to calculate coordinates for multi-jointed sprites looks a lot like the 3D matrix transformation code. I'll post it and count cycles when I get the chance. It DOES use the PPU's multiplication registers though.
psycopathicteen wrote:
PPU's multiplication registers
What? I thought the multiplication registers were inside the 5A22. Anyway, Stef said your demo didn't show enough to claim a GH port is possible on the SNES, but from what I've heard, multi-jointed bosses are one of the most complex parts, and you've mostly been able to do those. (There was a weird ball contraption thing at the end of the level, and you made the flipping plasma Grinch thing in Alisha's Adventure.)
Somebody at either Nintendo or Ricoh screwed up and implemented 2 different sets of multiplication registers by mistake. The PPU's multiplication registers are better, but they interfere with Mode 7 somehow.
Never having played Gunstar Heroes before, I went to see what all the fuss was about. Well, it definitely looks like a well made, cool, fast-paced game with lots of enemies, but not something that looks impossible to do on the SNES by any means.
Many of the objects are dumb, they just stand there and shoot, meaning that not much physics is happening every frame. Many objects are flying robots, which don't collide with anything. The huge explosions help give the impression that a lot of action is going on at once, but explosions are as dumb as it gets.
I believe this has to do with the human perception of speed. Most games are slow-paced, probably so that players have enough time to plan their actions and react to hazards, but sometimes a fast-paced game with frenetic action appears, and people get the impression that it's running faster than the others, so it must be using more hardware resources. What people fail to consider, especially people without a technical background, is that moving an object by 16 pixels is just as easy as moving it by 1 pixel: it's one addition either way. Nearly every console runs at 60Hz (or 50Hz), and even the Atari 2600 can animate frenetic action scenes if programmers choose to.
What Gunstar Heroes has going for it is that it combines the fast movements with an above average amount of enemies on screen, but they aren't particularly smart, they're simple objects moving in 2D space. Some areas are exaggerated, and I would expect the SNES to slow down as much as the Genesis does in those parts.
In short, I didn't see anything incredibly amazing that seemed particularly taxing on the CPU. It's just a regular 2D game, with a few raster effects here and there, and fast-moving objects that are sometimes placed in abundance. Oh, and fire. Lots of fire. Just optimize the A.I. routines and physics as much as you can and you should be fine.
Espozo wrote:
By the way 93143, why did you delete your post?
I figured it was premature. I'm already fairly new at 65816 assembly; I'd rather not compound that by posting a half-assed attempt at an algorithm I just learned today and don't understand very well. In fact I've already improved on that attempt...
Stef wrote:
a simple bresenham line drawing code
I don't see a tile structure. Are you plotting to a linear bitmap?
Quote:
You haven't seen Sonic 3D's IRQ handler for the special stages then, which is a monster (and looks like this).
Yes I know, and I like it very much
But IMO it's not comparable to Axelay, for example; Sonic is a bit empty of action, and in that case you can spend 150 cycles on it, it's not a big deal, but it still represents 33600 cycles/frame.
Quote:
but not something that looks impossible to do on the SNES by any means.
I don't know how many times I've said that to Stef, and if the SNES can do it, the PCE can do it too!!!
I think the only things that should be problematic are sprite flickering and game playability (too much stuff on screen for a 256px-wide game)
tokumaru wrote:
Never having played Gunstar Heroes before, I went to see what all the fuss was about. Well, it definitely looks like a well made, cool, fast-paced game with lots of enemies, but not something that looks impossible to do on the SNES by any means.
Many of the objects are dumb, they just stand there and shoot, meaning that not much physics is happening every frame. Many objects are flying robots, which don't collide with anything. The huge explosions help give the impression that a lot of action is going on at once, but explosions are as dumb as it gets.
...
In short, I didn't see anything incredibly amazing that seemed particularly taxing on the CPU. It's just a regular 2D game, with a few raster effects here and there, and fast-moving objects that are sometimes placed in abundance. Oh, and fire. Lots of fire. Just optimize the A.I. routines and physics as much as you can and you should be fine.
I definitely don't agree with that... What is impressive is not the speed of the game, but the tons of things happening at the same time.
Super Aleste is fast, but there is not that much happening at the same time, and/or everything is static and precalculated, which is definitely not the case in GH. The AI is not really smart, but not that dumb either; enemies try to shoot at you or catch you. For me the most impressive part is the detail of the explosion animations; that is what brings "life" to the game and makes it so pleasant to play. Myself, I am not a big fan of the game, but I have to admit the amount of animation happening at the same time is outstanding for a Genesis game, and there is no way to reproduce that level on a SNES. You are totally free to believe it is possible; actually, it is... but with major slowdowns...
Quote:
Stef wrote:
a simple bresenham line drawing code
I don't see a tile structure. Are you plotting to a linear bitmap?
Oh yeah, I really took the simplest route; it's not about drawing a real line on the SNES or the MD, just the algorithm implementation
TOUKO wrote:
Yes i know, and i like it very mush
But IMO it's not comparable to axelay for exemple, sonic is a little bit empty of actions, and in this case you can spend 150 cycles for that, is not a big deal,but it still represents 33600 cycles/frame .
Little action? The bridge is depth-shaded and zoomed in real time; this single zoom is more computation than anything happening in Axelay.
Quote:
I don't know how many times I've said that to Stef, and if the SNES can do it, the PCE can do it too!!!
I think the only things which should be problematic,will be sprites flickering, and game playability (too much stuffs on screen for a 256px wide game)
Yeah I know, you keep repeating that and I keep repeating that you are wrong :p Also, I don't understand why you put the PCE and the SNES together; did you forget the PCE uses a 7.1 MHz 6280? For me it's *really* different from a 3 MHz 65816...
Quote:
Little action? The bridge is depth-shaded and zoomed in real time; this single zoom is more computation than anything happening in Axelay.
You speak as if an entire image were zoomed.
Even if it's done in real time, it concerns 1 or 2 tiles I think (only the roof ones), and seems to be done in 2bpp only (4 colors); the perspective is done simply with Hsync interrupts, and the shadow by changing some colors on the fly, or with H/S.
The tilemap is repetitive, nothing like the dynamic one in Axelay.
All the objects on the roof seem to be sprites, and I think they're precalculated, but maybe I'm wrong.
tokumaru wrote:
Never having played Gunstar Heroes before, I went to see what all the fuss was about. Well, it definitely looks like a well made, cool, fast-paced game with lots of enemies, but not something that looks impossible to do on the SNES by any means.
I have played Gunstar Heroes and I still don't know what the fuss is about. It's just like Contra III with more "dumb" explosions and multi-sprite bosses. If you rapidly switch between weapons with a turbo controller, there's even more firepower. With 2 spread guns for each player, you can get 40 bullets onscreen at once and the game runs fine, meaning that it still has to check collisions against all the bullets.
Stef wrote:
I definitely don't agree with that...
I definitely do.
Stef wrote:
Super Aleste is fast but there is not that much happening at same time and/or every things is static and precalculated which is definitely not the case of GH.
You know, you keep saying that, but how do you know? Have you disassembled each of the games?
Stef wrote:
My self i am not a big fan of that game
If you're not a fan of GH, then I'm not a fan of Gun Force 2.
Stef wrote:
you are totally free to believe that is possible, actually it is... but with majors slowdowns...
Well, at least it can replicate it in some way...
https://www.youtube.com/watch?v=gUz-Qc0c-9Q (I know this isn't fair, it's mostly a joke...)
Espozo wrote:
I have played Gunstar Heroes and I still don't know what the fuss is about. It's just like Contra III with more "dumb" explosions and multi sprite bosses. If you rapidly switch between weapons with a turbo controller, there's even more firepower. With 2 spread guns for each player, you can get 40 bullets onscreen at once and the game runs fine, meaning that it still has to check collisions against all the bullets.
OK, we took Gunstar Heroes because you were talking about Gunstar Heroes, but you can take Contra Hard Corps as well...
I guess you also believe we can port Contra Hard Corps to the SNES, right? Adventures of Batman and Robin should be doable too...
Espozo wrote:
You know, you keep saying that, but how do you know? Have you disassembled each of the games?
Is that your ultimate argument? Do we really need that when the result on screen is so demonstrative? Why are all SNES games so empty, with so much missing action, compared to Megadrive games, if its blazing-fast CPU can handle as much as the MD 68000?
Espozo wrote:
If you're not a fan of GH, then I'm not a fan of Gun Force 2.
Where did I say I like this game? I just said it's a technically impressive game, that's all.
Espozo wrote:
Well, at least it can replicate it in some way...
https://www.youtube.com/watch?v=gUz-Qc0c-9Q (I know this isn't fair, it's mostly a joke...)
For a Chinese clone game it's not that bad :p Graphics are OK, the rest sucks...
Espozo wrote:
It may be more powerful *without any doubt*, but how much more powerful is debatable. I think they're pretty comparable, and a lot of people here also seem to.
Pretty comparable... I'm quoting you
I would estimate the MD 68000 to be about 100% faster (i.e. twice as fast) than the SNES 65816 running with fast ROM (so even more with slow ROM). That is a general estimate; in some cases the difference will be smaller or larger depending on the code.
About your collision code: to be honest I don't really understand it. Are you comparing two static objects? The method can't be used as general-purpose collision code, so it isn't really meaningful. You should consider processing collisions over an array of objects (and so having a loop) instead of a function that handles only a single collision. For instance, you usually want to test the player sprite against enemy bullets or ships, so the method should test at least one object against an array of objects.
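A sketch of the loop structure being suggested here, testing one object against an array in a single pass. The types and field names are illustrative, not from any real engine:

```c
#include <stdint.h>

typedef struct { int16_t x, y; uint8_t w, h; } Object;  /* w,h = half-extents */

/* Test the player against every bullet; return the index of the first
   colliding bullet, or -1 if none collide this frame. */
static int first_hit(const Object *player, const Object *bullets, int n)
{
    for (int i = 0; i < n; i++) {
        int dx = player->x - bullets[i].x;
        int dy = player->y - bullets[i].y;
        if (dx < 0) dx = -dx;
        if (dy < 0) dy = -dy;
        if (dx < player->w + bullets[i].w && dy < player->h + bullets[i].h)
            return i;
    }
    return -1;
}
```

Keeping the player's coordinates and widths in registers (or direct page) across the whole loop is where most of the per-object savings come from on either CPU, compared to calling a one-vs-one routine repeatedly.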
I remember we already played with collision code with Touko on another forum but i can't find the post anymore :-/
Quote:
I remember we already played with collision code with Touko on another forum but i can't find the post anymore :-/
Me too ..
Quote:
Yeah i know, you keep repeating that and i keep repeating you are wrong :p
Stef, I'm not the first guy to tell you that GH is not that impressive, technically speaking, or in the department you call physics; one of these days you'll have to realise that you're wrong
I just wrote out an algorithm on a piece of paper at work, and I calculated 99 cycles for collision. I'm surprised, because it's more than I thought.
psycopathicteen wrote:
I just wrote out an algorithm on a piece of paper at work, and I calculated 99 cycles for collision. I'm surprised, because it's more than I thought.
Not bad; Stef's goes between 93 and 250 cycles, with a sorting part in the main loop.
Mine was 192 cycles for a very generic version; I use some indirect addressing.
It works for all games: you give it the coords of 2 sprites as input and get collision or not as output, very basic.
Stef wrote:
I guess you also believe we can port Contra Hard Corps on SNES right ? Adventures of Batman and Robin should be doable also...
Um, yeah... How is Contra Hard Corps any different than Contra III? It's almost exactly like a Contra III boss rush.
Stef wrote:
That is your ultimate argument ? Do we really need that when the result on screen is so demonstrative ? Why are all SNES games so empty, with missing action, compared to the Megadrive games, if its blazing fast CPU can handle as much as the MD 68000 ?
Since you're so picky with wording, when did I say the 65816 was "blazing fast"?
Stef wrote:
Where i said i like this game ?
Thank you for proving my point. Again, where did I say the 65816 was "blazing fast"? (Don't say I implied it, because you've implied plenty of things that you've later denied.)
Stef wrote:
Graphics are ok
Relatively... There's only one type of enemy at a time because of the limited amount of color palettes.
TOUKO wrote:
one of these days you'll have to realise that you're wrong
I wish...
Again, why do you feel the need to shove your opinion down everyone's throat? I doubt you care about anyone here, so you're not doing anyone a favor by posting your useless speculation. You said that I should learn how to make a logical argument if I went and talked about how strong the SNES is on Sega-16, but so far, you haven't provided one yourself. What you're saying is: the SNES couldn't handle the game you supposedly don't care about but worship, because you said so, not because of any evidence you've provided. You even had the audacity to compare GH to Far Cry 4 at one point. Why act like it is so obvious that GH is a technical masterpiece if no one else here does? tokumaru pretty much hit the nail on the head earlier, saying that there is a lot of fast movement and "dumb" objects like explosions that occupy half the screen. I don't get your motives at all, except blind rage.
Quote:
Pretty comparable... i'm quoting you
What? I don't even get why that's important, or why you are quoting me, or anything about that statement.
TOUKO wrote:
Not bad, Stef's one goes between 93 and 250 cycles, with a sorting part in the main loop.
I was at 192 cycles with a very generic one; I use some indirect addressing.
Mine works for all games: you have 2 sprites' coords as input, and collision or not as output, very basic.
You remember the number of cycles ? I thought we even got lower than that, and I remember I even wrote some pieces of the 65816 code
Espozo wrote:
Um, yeah... How is Contra Hard Corps any different than Contra III? It's almost exactly like a Contra III boss rush.
If you really think this:
https://www.youtube.com/watch?v=-9YFtbCb3y0#t=75
is the same as this:
https://www.youtube.com/watch?v=6lMxDTGrHYQ#t=77
then I think I understand the problem now.
Don't you see a slight difference in the way explosions are handled, for instance?
Stef wrote:
Since you're so picky with wording, when did I say the 65816 was "blazing fast"?
Nice way to avoid replying to the question
I was sarcastic of course...
Quote:
Again, why do you feel the need to shove your opinion down everyone's throat? I doubt you care about anyone here, so you're not doing anyone a favor by posting your useless speculation. You said that I should learn how to make a logical argument if I went and talked about how strong the SNES is on Sega 16, but so far, you haven't provided one yourself. What your saying is: The SNES couldn't handle the game I don't care about but worship because I said so, not because of any evidence you've provided. You even had the audacity to compare GH to Far Cry 4 at one point. Why act like it is so obvious that GH is a technical masterpiece if no one else here does? Tokumaro pretty much hit the nail on the head earlier, saying that there is a lot of fast movements and "dumb" objects like explosions that occupy half the screen. I don't get you're motives at all, except blind rage.
Ok, you know I was just trying hard to explain to you why these CPUs aren't comparable, because you (but not only you) believe (or want to believe) they are. Now I understand you are just a lost fanboy, and whatever the arguments are you will always deny them... some fanboys at least accept to learn.
You should read this (entirely, really):
http://www.smwcentral.net/?p=viewthread&t=14402
Don't you think the dragonboy versus smkdan discussion looks similar to ours? Honestly it is so similar that it's even funny...
Then one question: at the end, in your opinion, do you really think dragonboy is the one who has the truth?
I think there is nothing to add; smkdan said exactly what I said (or the contrary),
word for word! And I think his conclusion applies to you as well:
Quote:
I suggest you actually try to put together some actual homebrew examples since in all honesty the only way you can come to the conclusion that the snes '816 is on a level playing field with the 68k is misinformation and a narrow minded approach of comparing the two
Quote:
You remember the number of cycles ? I thought we even got lower than that, and I remember I even wrote some pieces of the 65816 code
Yes, I was under 93 cycles (maybe 50/60) by doing some things (computing the bounding box) when I move each sprite in the main loop, rather than all the work like I do with my classic one.
But I can admit that for memory copies the 68k is faster; maybe not for small transfers like 32/64 bytes (vs the HuC6280), but surely for large ones.
Stef wrote:
Don't you see a slight difference in the way explosions are handled for instance ?
Yes. They just added velocity to the explosions and might have had some sort of way to figure out what the velocities of the explosions are; they could have possibly indexed a table that has precalculated values or something. I doubt the shrinking explosions would be that hard either. I wouldn't be surprised if it were something like jumping to where the velocity goes in reverse and then the object disappears when it goes back to the central, recorded point. Nothing to brag about really. I am familiar with both games.
Stef wrote:
Nice way to avoid replying to the question
What was the question?
Stef wrote:
Now I understand you are just a lost fanboy, and whatever the arguments are you will always deny them... some fanboys at least accept to learn.
Woah, big man. At least I'm not the one arguing with a 16 year old. Who's the judge to declare who's a fan boy and who's not? By calling me one, I wouldn't be surprised if you're calling a considerable amount of other people here too.
Stef wrote:
I was sarcastic of course...
With you, I don't have any idea.
Stef wrote:
And I think his conclusion applies to you as well :
Couldn't this go the same for you? If you really believe the 68000 is way better, then how about you try to port GH over to the SNES and show us how crippled it is? Unlike you, I don't have 20+ years of experience. You could say that I have no room to talk, but because all you've done so far is make blind observations, I've done the same. This debate hasn't progressed any since it first started, and it's unlikely that it ever will.
Stef wrote:
You should read that (entirely really) :http://www.smwcentral.net/?p=viewthread&t=14402
I will in a while.
So wait, you guys in some other thread had 93 cycles for the 65816, but 250 cycles for the 68000 for collision detection? Yeah, this REALLY proves how much faster the Genesis is.
psycopathicteen wrote:
So wait, you guys in some other thread had 93 cycles for the 65816, but 250 cycles for the 68000 for collision detection? Yeah, this REALLY proves how much faster the Genesis is.
Keep in mind that I'm the blind fan boy.
psycopathicteen wrote:
So wait, you guys in some other thread had 93 cycles for the 65816, but 250 cycles for the 68000 for collision detection? Yeah, this REALLY proves how much faster the Genesis is.
I don't want to play devil's advocate, but his code is between 93 AND 250; it depends on certain circumstances. If they don't occur, it's 93 all the time, just as it could be 250 if they do.
So I'm the blind fan boy again...
(I don't care what you say Stef, but there's no way in hell I'm as dimwitted as dragonboy. Did I call anyone a dickhead?)
Ok, I found Stef's code.
Code:
8 lea 4(a7), a6 ; a6 point on parameters
12 move.l (a6)+, a0 ; a0 = player
12 move.l (a6)+, a1 ; a1 = enemies
12 move.l (a6), d7 ; d7 = numEnemies
4 subq.w #1,d7 ; d7 = numEnemies - 1 (prepare for DBCC instruction)
48
12+16 movem.w (a0)+, d0-d3 ; d0 = player.xmin... d3 = player.ymax
28
.loop ; do {
// TEST on X coordinates
4 move.l a1, a2 ; a2 point on enemy
8 cmp.w (a2)+, d1 ; if (player.xmax < enemy.xmin)
10 blt .no_collid ; no collision
22
8 cmp.w (a2)+, d0 ; if (player.xmin > enemy.xmax)
10 bgt .no_collid ; no collision
18
// TEST on Y coordinates
8 cmp.w (a2)+, d3 ; if (player.ymax < enemy.ymin)
10 blt .no_collid ; no collision
18
8 cmp.w (a2)+, d2 ; if (player.ymin > enemy.ymax)
10 bgt .no_collid ; no collision
18
.collid
4 move #1,d0 ; return 1
16 rts
20
.no_collid
8 add.l #8, a1 ; a1 point on next enemy
10 dbra d7, .loop ; } while (numEnemies--)
18
4+2 move #0,d0 ; return 0
16 rts
22
init: 76
test min: 22
test mean: 40
test max: 76
total per loop: 94
end: 20/22
Hum, there is no box additions in it !!
Cheater!
Here it is with the bounding box computations:
Code:
8 lea 4(a7), a6 ; a6 point on parameters
12 move.l (a6)+, a0 ; a0 = player
12 move.l (a6)+, a1 ; a1 = enemies
12 move.l (a6), d7 ; d7 = numEnemies
4 subq.w #1,d7 ; d7 = numEnemies - 1 (prepare for DBCC instruction)
48
8 move.w (a0)+, d0 ; d0 = player.x
8 move.w (a0)+, d2 ; d2 = player.y
12 move.l (a0)+, a0 ; a0 = player.box
4 move.w d0, d1 ; d1 = player.x
4 move.w d2, d3 ; d3 = player.y
8 add.w (a0)+, d0 ; d0 = player.x + box.xmin = player.xmin
8 add.w (a0)+, d1 ; d1 = player.x + box.xmax = player.xmax
8 add.w (a0)+, d2 ; d2 = player.y + box.ymin = player.ymin
8 add.w (a0)+, d3 ; d3 = player.y + box.ymax = player.ymax
68
.loop ; do {
// TEST on X coordinates
8 move.w (a1)+, d4 ; d4 = enemy.x
8 move.w (a1)+, d5 ; d5 = enemy.y
12 move.l (a1)+, a0 ; a0 = enemy.box
28
4 move.w d4, d6 ; d6 = enemy.x
8 add.w (a0)+, d6 ; d6 = enemy.x + box.xmin = enemy.xmin
4 cmp.w d1, d6 ; if (player.xmax < enemy.xmin)
10 bgt .no_collid ; no collision
26
8 add.w (a0)+, d4 ; d4 = enemy.x + box.xmax = enemy.xmax
4 cmp.w d0, d4 ; if (player.xmin > enemy.xmax)
10 blt .no_collid ; no collision
22
// TEST on Y coordinates
4 move.w d5, d6 ; d6 = enemy.y
8 add.w (a0)+, d6 ; d6 = enemy.y + box.ymin = enemy.ymin
4 cmp.w d3, d6 ; if (player.ymax < enemy.ymin)
10 bgt .no_collid ; no collision
26
8 add.w (a0), d5 ; d5 = enemy.y + box.ymax = enemy.ymax
4 cmp.w d2, d5 ; if (player.ymin > enemy.ymax)
10 blt .no_collid ; no collision
22
.collid
4 move #1,d0 ; return 1
16 rts
.no_collid
10 dbra d7, .loop ; } while (numEnemies--)
4+2 move #0,d0 ; return 0
16 rts
init: 116
loop min: 54
loop mean: 76
loop max: 134
end: 20/22
What do the numbers on the bottom mean? (I guess that that's the number of cycles.)
Quote:
Hum, there is no box additions in it !!
I thought I noticed that too... (never mind, he found one with them.)
What exactly does his code do though? Does it check the players against the enemy projectiles, and then the player projectiles against the enemies?
Quote:
What do the numbers on the bottom mean? (I guess that that's the number of cycles.)
I'll explain it as I understand it:
first is init, loading values into 68k registers;
second is the minimum cycles if there's no collision;
third is the cycles if there's no collision at the first Y test;
fourth is the maximum cycles, collision or not;
fifth: on a collision, add 20/22 cycles for the return from the subroutine.
Quote:
What exactly does his code do though? Does it check the players against the enemy projectiles, and then the player projectiles against the enemies?
Yes, it checks only the player versus enemies and bullets, but you must sort all the sprites to test in your main loop beforehand.
How many enemies does it check against?
Anyway, if it only checks against one and we're including the init part, that's 250 cycles. If you want something comparable on the SNES, you'd have to make it about half, which is 125 cycles. That doesn't sound impossible, and I might try it once I know more about his code, so I don't do too little or too much.
I just checked and 250 cycles × 80 sprites (let's suppose this many, that's the cap you can show on screen) = 20000 cycles. That's 15.64% of a NTSC frame, and it ignores the fact that loop iterations are faster. Also the fact that this routine may not be exactly optimal either.
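Sik's figure is easy to verify; a quick back-of-the-envelope check, assuming the usual ~7.67 MHz NTSC Mega Drive 68000 clock (an assumption, the thread doesn't state the exact value):

```python
MD_CLOCK_HZ = 7_670_454        # NTSC Mega Drive 68000 clock (approximate)
FRAMES_PER_SECOND = 60         # rounding NTSC's ~59.92 Hz refresh

cycles_per_frame = MD_CLOCK_HZ / FRAMES_PER_SECOND  # ~127,841 cycles
worst_case = 250 * 80          # 250 cycles per check, 80 sprites
share = worst_case / cycles_per_frame

print(f"{worst_case} cycles = {share:.2%} of a frame")
```

That reproduces the quoted 15.64%, with the caveats Sik already gives (loop iterations are cheaper than the full-call cost, and the routine isn't necessarily optimal).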
psycopathicteen wrote:
Somebody at either Nintendo or Ricoh screwed up and implemented 2 different sets of multiplication registers by mistake. The PPU's multiplication registers are better though, but they interfere with mode-7 somehow.
Wait, there are two sets? That seems pretty useful in practice actually.
And yeah, the PPU ones interfere with mode 7 because they're using the very same hardware used to do the mode 7 calculations (meaning, you can't use both simultaneously). Sucks, but makes sense, no need to leave them to waste when mode 7 isn't being used.
tokumaru wrote:
In short, I didn't see anything incredibly amazing that seemed particularly taxing on the CPU. It's just a regular 2D game, with a few raster effects here and there, and fast-moving objects that are sometimes placed in abundance. Oh, and fire. Lots of fire. Just optimize the A.I. routines and physics as much as you can and you should be fine.
Alien Soldier is way more interesting in this sense (hell, that's the whole premise of that game), but yeah it's still more looks than actual complexity (there's some pretty clever stuff going on though).
I think the problem is that when it comes to stuff done with the video hardware, huh, both systems can cope with it just fine, it's all about how clever you can get with what you have available. The real struggles are when the CPU is doing everything, the most blatant example being software rendered games (usually 3D ones), and admittedly the Mega Drive tends to wreck the SNES in that area... (at least, without coprocessors)
TOUKO wrote:
I don't know how many times I've said that to Stef, and if the SNES can do it, the PCE can do it too !!!
Until multiple layers come in and they have too much stuff to not risk running into sprite overflow =P (that's probably the biggest weakness of the PCE, at least without the SGX add-on). Sound hardware is quite weak too but with some quick trickery you can get it to do some decent PCM output on multiple channels (much more easily than on the Mega Drive, in fact).
PCE is definitely on par with the rest whenever it comes to sprites, though.
Stef wrote:
If you really think this:
https://www.youtube.com/watch?v=-9YFtbCb3y0#t=75
is the same as this:
https://www.youtube.com/watch?v=6lMxDTGrHYQ#t=77
then I think I understand the problem now.
Don't you see a slight difference in the way explosions are handled, for instance?
Holy shit, I didn't recall the SNES version slowing down that badly (although honestly that talks more about sloppy programming, the SNES definitely should be able to at least not slow down).
Sik wrote:
Holy shit, I didn't recall the SNES version slowing down that badly (although honestly that talks more about sloppy programming, the SNES definitely should be able to at least not slow down).
For some reason, the car exploding at the beginning is very processor intensive.
The rest isn't that bad, but there is slowdown when some of the bosses die.
Code:
collision_loop:
lda {xmax} 4
cmp {xmin},x 5 9
bcc no_collision 2/3 11
lda {xmin} 4 15
cmp {xmax},x 5 20
bcs no_collision 2/3 22
lda {ymax} 4 26
cmp {ymin},x 5 31
bcc no_collision 2/3 33
lda {ymin} 4 37
cmp {ymax},x 5 42
bcs no_collision 2/3 44
lda #$0001 3 47
rts 6 53
no_collision:
txa 2 47
clc 2 49
adc #{object_slot_size} 3 52
tax 2 54
dey 2 56
bne collision_loop 2/3 59
lda #$0000 3
rts 6
53 for a collision, but 59 for a loop in worst case scenario.
Sik wrote:
Holy shit, I didn't recall the SNES version slowing down that badly (although honestly that talks more about sloppy programming, the SNES definitely should be able to at least not slow down).
Slowdown was one of the defining features of SNES action games.
rainwarrior wrote:
Slowdown was one of the defining features of SNES action games.
Some might even say "iconic".
psycopathicteen wrote:
53 for a collision, but 59 for a loop in worst case scenario.
Is this about as functional as Stef's first code?
I think it would be about 70% of the speed of the 7.67 MHz 68000 in this situation, at 59 vs 94 cycles.
I thought Touhou said the 68000 code was 250 cycles?
Espozo wrote:
If you really believe the 68000 is way better, then how about you try to port GH over to the SNES
It'd be hard to get the rights to do that.
Sik wrote:
I just checked and 250 cycles × 80 sprites (let's suppose this many, that's the cap you can show on screen)
Collision is O(n^2) unless you're using a sorting method to narrow the search horizon. Then it's closer to O(n^1.5).
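The "sorting method" tepples alludes to is essentially sweep-and-prune: sort the boxes along one axis, then stop testing against a box as soon as the next candidate starts past its right edge. A minimal sketch (the box format `(xmin, ymin, xmax, ymax)` and names are my own convention, not from the thread):

```python
def sweep_pairs(boxes):
    """Return the list of overlapping index pairs via a 1-axis sweep.

    Boxes are (xmin, ymin, xmax, ymax). Sorting by xmin lets the inner
    loop break early: once a candidate's xmin exceeds the current box's
    xmax, nothing further right can overlap it on x.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    hits = []
    for pos, i in enumerate(order):
        xmax_i = boxes[i][2]
        for j in order[pos + 1:]:
            if boxes[j][0] > xmax_i:
                break  # prune the rest of the sorted list
            # x ranges overlap; confirm overlap on y
            if boxes[i][3] >= boxes[j][1] and boxes[i][1] <= boxes[j][3]:
                hits.append(tuple(sorted((i, j))))
    return hits
```

With objects spread across the screen, most inner loops break after a compare or two, which is where the better-than-quadratic average behaviour comes from.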
tepples wrote:
It'd be hard to get the rights to do that.
No one said it had to be legal...
(What does Treasure even do now anyway?)
psycopathicteen wrote:
I thought Touhou said the 68000 code was 250 cycles?
Touhou, the company?
- Toho: Godzilla company
- Touhou Project: Bullet hell shoot-em-up series for PC by Team Shanghai Alice
- Touko: NESdev member
Toho also published Super Aleste (Space Megaforce), which is actually kinda on topic...
93143 wrote:
Toho also published Super Aleste (Space Megaforce), which is actually kinda on topic...
"Bad animation" and "clunky movements".
Espozo wrote:
For some reason, the car exploding at the beginning is very processor intensive.
The rest isn't that bad, but there is slowdown when some of the bosses die.
Maybe the explosion spawner is too slow?
tepples wrote:
It'd be hard to get the rights to do that.
Gameplay can't be copyrighted (unless it's Tetris), and that's the important part here.
Sik wrote:
Gameplay can't be copyrighted
Treasure's characters, however, can be and are. If you can provide original character designs, levels, and boss patterns for a new game in the same genre, I'd be interested to look at your design document.
Sik wrote:
Espozo wrote:
For some reason, the car exploding at the beginning is very processor intensive.
The rest isn't that bad, but there is slowdown when some of the bosses die.
Maybe the explosion spawner is too slow?
Maybe they thought it wasn't important to spend time profiling and solving a slowdown problem that doesn't affect gameplay, AND/OR maybe they thought the slowdown made the scenes feel more "epic".
tepples wrote:
Sik wrote:
Gameplay can't be copyrighted
Treasure's characters, however, can be and are. If you can provide original character designs, levels, and boss patterns for a new game in the same genre, I'd be interested to look at your design document.
What I meant is that all we need is a game that plays like Gunstar Heroes but we don't need any of the rest (and that "rest" is all the copyrightable stuff). So yeah really, copyright is not a barrier here if one wants to stay legal and still prove the point. (Even better: take advantage of this to adapt the game to the SNES's best abilities.)
Sik wrote:
tepples wrote:
If you can provide original character designs, levels, and boss patterns for a new game in the same genre [as Gunstar Heroes], I'd be interested to look at your design document.
What I meant is that all we need is a game that plays like Gunstar Heroes but we don't need any of the rest
Exactly my point. Design some characters and some levels, and make the best damn run and gun possible on the Super NES.
I would love it if one of you quit your job and spent a year of your life making a game just to prove a point to somebody on the internet.
rainwarrior wrote:
maybe they thought the slowdown made the scenes feel more "epic".
Well, it's true that ever since The Matrix, movies have been suffering a lot of slowdown. There have been some complaints, but most people don't seem to mind.
rainwarrior wrote:
I would love it if one of you quit your job and spent a year of your life making a game just to prove a point to somebody on the internet.
Unfortunately, that doesn't always work. I had a little over a year without a job a few years ago and I made very little progress with my projects.
tepples wrote:
Sik wrote:
tepples wrote:
If you can provide original character designs, levels, and boss patterns for a new game in the same genre [as Gunstar Heroes], I'd be interested to look at your design document.
What I meant is that all we need is a game that plays like Gunstar Heroes but we don't need any of the rest
Exactly my point. Design some characters and some levels, and make the best damn run and gun possible on the Super NES.
Dude the SNES could barely handle SMW with one object onscreen. You'd probably need to change Sonic's color to red to work on the SNES. I know the SNES's 2.78 instructions still couldn't handle it though, so you'd need to make Sonic move so slow that he moves backwards.
If you drop to part-time and spend a year making a game, you get a game. That you can sell. And make money. And get something to put on your CV to get a better job. So you can buy better lizard costumes. At 0:31, my lizard is the Lizard of LEGO.
I doubt I'll make money from Lizard, and my job prospects are already pretty good. I'm just doing it because I wanted to make a game.
Thanks Touko. So here's the 68000 code for a really simple bounding-box collision routine, player versus an array of bullets/enemies:
Code:
8 lea 4(a7), a6 ; a6 point on parameters
12 move.l (a6)+, a0 ; a0 = player
12 move.l (a6)+, a1 ; a1 = enemies
12 move.l (a6), d7 ; d7 = numEnemies
4 subq.w #1,d7 ; d7 = numEnemies - 1 (prepare for DBCC instruction)
48
8 move.w (a0)+, d0 ; d0 = player.x
8 move.w (a0)+, d2 ; d2 = player.y
12 move.l (a0)+, a0 ; a0 = player.box
4 move.w d0, d1 ; d1 = player.x
4 move.w d2, d3 ; d3 = player.y
8 add.w (a0)+, d0 ; d0 = player.x + box.xmin = player.xmin
8 add.w (a0)+, d1 ; d1 = player.x + box.xmax = player.xmax
8 add.w (a0)+, d2 ; d2 = player.y + box.ymin = player.ymin
8 add.w (a0)+, d3 ; d3 = player.y + box.ymax = player.ymax
68
.loop ; do {
// TEST on X coordinates
8 move.w (a1)+, d4 ; d4 = enemy.x
8 move.w (a1)+, d5 ; d5 = enemy.y
12 move.l (a1)+, a0 ; a0 = enemy.box
28
4 move.w d4, d6 ; d6 = enemy.x
8 add.w (a0)+, d6 ; d6 = enemy.x + box.xmin = enemy.xmin
4 cmp.w d1, d6 ; if (player.xmax < enemy.xmin)
10 bgt .no_collid ; no collision
26
8 add.w (a0)+, d4 ; d4 = enemy.x + box.xmax = enemy.xmax
4 cmp.w d0, d4 ; if (player.xmin > enemy.xmax)
10 blt .no_collid ; no collision
22
// TEST on Y coordinates
4 move.w d5, d6 ; d6 = enemy.y
8 add.w (a0)+, d6 ; d6 = enemy.y + box.ymin = enemy.ymin
4 cmp.w d3, d6 ; if (player.ymax < enemy.ymin)
10 bgt .no_collid ; no collision
26
8 add.w (a0), d5 ; d5 = enemy.y + box.ymax = enemy.ymax
4 cmp.w d2, d5 ; if (player.ymin > enemy.ymax)
0 ; no collision
12
.no_collid
10 dbge d7, .loop ; } while (numEnemies--)
4/6+4/2=8 sge d0 ; d0 = 1 if collision, d0 = 0 if no collision
16 rts
.collid
4 move #1,d0 ; return 1
16 rts
init: 116
loop min: 64 (discarded on first test)
loop mean: 86 (discarded after complete X checking)
loop max: 124
end: 20/24
I slightly modified the loop and fixed the numbers (they were under-estimated), so I counted a mean of 86 cycles per tested collision (discarded after X checking), which should be quite realistic. Now it would be nice to do the same for the 65816, with a similar data structure, having player and enemy structures like this:
Code:
{
s16 x;
s16 y;
Box *b;
}
and where Box is
Code:
{
s16 xminOffset;
s16 xmaxOffset;
s16 yminOffset;
s16 ymaxOffset;
}
so basically you need to add player/enemy coordinates to the bounding box offsets.
That seems like a classic implementation to me; the bounding box is a pointer since many objects could share the same one (and so we don't waste memory on it), but if you think that is not fair to the 65816 we can directly include it in the player/enemy object structure or even precalculate it, I don't care about the modalities =)
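As a semantic reference (not a cycle model), Stef's structures and the 68000 listing's flow map to something like this sketch; the class and function names are illustrative, not from his code:

```python
class Box:
    """Shared bounding-box offsets, relative to an object's position."""
    def __init__(self, xmin_off, xmax_off, ymin_off, ymax_off):
        self.xmin_off, self.xmax_off = xmin_off, xmax_off
        self.ymin_off, self.ymax_off = ymin_off, ymax_off

class Obj:
    """Position plus a reference to a (possibly shared) Box."""
    def __init__(self, x, y, box):
        self.x, self.y, self.box = x, y, box

def collide(player, enemies):
    # Init part: compute the player's world-space bounds once
    pxmin = player.x + player.box.xmin_off
    pxmax = player.x + player.box.xmax_off
    pymin = player.y + player.box.ymin_off
    pymax = player.y + player.box.ymax_off
    # Loop part: recompute each enemy's bounds and early-reject per axis
    for e in enemies:
        if pxmax < e.x + e.box.xmin_off: continue
        if pxmin > e.x + e.box.xmax_off: continue
        if pymax < e.y + e.box.ymin_off: continue
        if pymin > e.y + e.box.ymax_off: continue
        return 1
    return 0
```

The shared `Box` is the point of the pointer in the struct: many objects of the same type reuse one set of offsets instead of storing four words each.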
Espozo wrote:
Yes. They just added velocity to the explosions and might have had some sort of way to figure out what the velocities of the explotions are. they could have possibly indexed a table that has precalculated values or something. I doubt the shrinking explosions would be that hard either. I wouldn't be surprised is it were something like jumping to where the velocity goes in reverse and then the object disappears when it goes back to the central, recorded point. Nothing to brag about really. I am familiar with both games.
"They just added..."
Do you realize that what they "just added" is typically what makes the SNES CPU crawl? Do you see how the simple car explosion makes things slow down on the SNES, while the Megadrive handles much more without any trouble? Definitely, this type of animation is heavier to handle than you think: many new objects to spawn at the same time, and then physics/movement to handle...
Quote:
Until multiple layers come in and they have too much stuff to not risk running into sprite overflow =P (that's probably the biggest weakness of the PCE, at least without the SGX add-on). Sound hardware is quite weak too but with some quick trickery you can get it to do some decent PCM output on multiple channels (much more easily than on the Mega Drive, in fact).
PCE is definitely on par with the rest whenever it comes to sprites, though.
Yes, agreed with you; I spoke only about sprites on screen, not the whole game
psycopathicteen wrote:
I meant Touko.
It's 250 cycles if a collision occurs at the first test; it decreases on each loop down to 93 (without bounding calculations). It's 264/260 and 116 with bounding box calculations.
It's not that bad for player vs enemy bullets, where a collision on the first test should be rare, but it can be problematic for player bullets vs enemies.
If you compute the bounding box when each sprite moves, and not in the collision routine, the 65xx one is much faster; if you do it in the routine, the 68k can be much faster, but with an overhead in the main loop for sorting sprites.
On the 68k side you can see that the code is crippled by the slow test/branch instructions.
Quote:
How many enemies does it check against?
Anyway, if it only checks against one and we're including the init part, that's 250 cycles. If you want something comparable on the SNES, you'd have to make it about half, which is 125 cycles. That doesn't sound impossible, and I might try it once I know more about his code, so I don't do too little or too much.
Of course the purpose of the method is to show a case where collision computation can be a bottleneck in the game; if you have only one player versus one enemy to check, you will never run into problems whatever your collision code is.
Here the method checks 1 versus N, typically when you want to compare the player versus enemy bullets and enemy ships (as we have in shmup games). We could even do an N against M version (O(N²) complexity) to compare player bullets against enemies, which is usually where we have the most computation, but the 1 versus N test is already a good start and we can wrap it in a loop to handle the N against M case (not the most optimal, though).
Quote:
I just checked and 250 cycles × 80 sprites (let's suppose this many, that's the cap you can show on screen) = 20000 cycles. That's 15.64% of a NTSC frame, and it ignores the fact that loop iterations are faster. Also the fact that this routine may not be exactly optimal either.
Well, actually that is a case where you only need to compare your ship against the others. As said just previously, for a shmup where your ship can have up to 20 independent bullets displayed at once while 20 enemies are on screen, the computation can be up to 250x20x20 ~ 100000 cycles, which is close to the whole frame time...
Actually, with my code it would be more something like this: (150+100+(100x20))x20 ~ 45000 cycles, which is better (and can be optimized) but still a lot !
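Stef's two estimates reproduce exactly; note the per-call constants (init and per-loop costs) are his rough figures, not measurements:

```python
# Reproducing Stef's bullets-vs-enemies budget: 20 player bullets,
# each tested against 20 enemies.
bullets, enemies = 20, 20

# Naive: a full 250-cycle call per bullet/enemy pair
naive_worst = 250 * bullets * enemies

# His looped version: per bullet, one call overhead (~150), one
# player-bounds setup (~100), then ~100 cycles per enemy in the loop
looped = (150 + 100 + 100 * enemies) * bullets

print(naive_worst, looped)
```

Both come out to the quoted 100,000 and 45,000 cycles, which is why amortizing the init across a loop matters so much more than the per-compare cost.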
Stef wrote:
"They just added..." Do you realize that what they "just added" is typically what makes the SNES CPU crawl? Do you see how the simple car explosion makes things slow down on the SNES, while the Megadrive handles much more without any trouble? Definitely, this type of animation is heavier to handle than you think: many new objects to spawn at the same time, and then physics/movement to handle...
Konami isn't exactly known for their programming skills... (Yes, I know they obviously made Contra Hard Corps.) Any CPU, 68000 or 65816, is going to slow down with bad programming. What needs to be done is that the explosion spawner code needs to be disassembled and looked at. Otherwise, things like this wouldn't be possible:
https://www.youtube.com/watch?v=0w-U-9izrGs (look at the part where there are a lot of ships swarming. The explosions have velocity too.)
Espozo wrote:
Konami isn't exactly known for their programming skills...
Is this sarcastic? Honestly, if in your mind Konami are not good programmers, then I really wonder which team you consider good O_o ?? I admit they could have made it better and avoided the slowdown on the simple car explosion, but that is still relevant to the CPU's capabilities... On the Megadrive there are not only a lot more explosions but also a lot of wreckage accompanying them, without any slowdown; definitely another level.
Quote:
What needs to be done is that the explosion spawner code needs to be disassembled and looked at. Otherwise, things like this wouldn't be possible:
https://www.youtube.com/watch?v=0w-U-9izrGs (look at the part where there are a lot of ships swarming. The explosions have velocity too.)
Oh no, please, not this game again :p There is so much technical weakness in that game, do you really want to talk about it ?
Touko wrote:
I'll explain it as I understand it:
first is init, loading values into 68k registers;
second is the minimum cycles if there's no collision;
third is the cycles if there's no collision at the first Y test;
fourth is the maximum cycles, collision or not;
fifth: on a collision, add 20/22 cycles for the return from the subroutine.
By the way Touko, where is the 65816 version of the code ? We did write it, as I remember...
Quote:
By the way Touko, where is the 65816 version of the code ? We did write it, as I remember...
It was more 6502 than 65816.
All operations were done in 8-bit, and you used only the 6502 opcodes.
If you want the link :
http://www.gamopat-forum.com/t66305-mei ... ouding-box
Stef wrote:
Espozo wrote:
Konami isn't exactly known for their programming skills...
Is this sarcastic? Honestly, if Konami are not good programmers, then I really wonder which team you consider good O_o ??
I could still find a million things wrong with Konami's games just by tracing them with a debugger.
Here, this is what I call an impressive shmup (but maybe I'm a fanboy too):
https://www.youtube.com/watch?v=cu9nfj0X34g
It's a mod of Thunder Force 3 (made by tomaitheous), but the original game was so close already...
psycopathicteen wrote:
I could still find a million things wrong with Konami's games just by tracing them with a debugger.
DO IT! I'd be interested in a list of WTFs you find in well-known Konami games for Super NES. At least it'd act as evidence for "The Super NES doesn't suck; Konami just sucks."
Stef wrote:
Is this sarcastic? Honestly, if in your mind Konami are not good programmers, then I really wonder which team you consider good O_o ?? I admit they could have made it better and avoided the slowdown on the simple car explosion, but that is still relevant to the CPU's capabilities... On the Megadrive there are not only a lot more explosions but also a lot of wreckage accompanying them, without any slowdown; definitely another level.
Contra Hard Corps was also made 2 years later, using a CPU that they were already well familiar with from their arcade boards. Konami may be good at 68000 assembly, but it's an entirely different story when it comes to the 65816. (Gradius III...)
Stef wrote:
Oh no please, not this game again :p There are so many technical weaknesses in that game, do you really want to talk about them?
Exactly. The game wasn't programmed very well.
psycopathicteen wrote:
I could still find a million things wrong with Konami's games just by tracing them with a debugger.
If I'm not mistaken, psychopathicteen already found very backwards and inefficient code handling HiOAM in Gradius III. I'm sure there's other dumb stuff like that in that game.
TOUKO wrote:
Here, this is i call an impressive shmup (but may be i'am a fanboy too) :
I'd watch it if my school didn't have a dumb web filter blocking YouTube... I'll watch it when I get home.
tepples wrote:
DO IT! I'd be interested in a list of WTFs you find in well-known Konami games for Super NES. At least it'd act as evidence for "The Super NES doesn't suck; Konami just sucks."
I second that. I wonder if Konami carried the bad practices from Gradius III over to their other titles. (The two-headed water dragon boss and the Mode 7 tunnel in Super Castlevania IV practically bring the game to a complete stop for some reason. I really don't get the slowdown in the tunnel.)
Quote:
Exactly. The game wasn't programmed very well.
Why?? This game is not the best ever, but it's a good game on average, and technically it's very impressive.
No slowdown, an impressive amount of sprites on screen, and very good use of the SNES's special effects (very well placed).
Its big flaw is the shitty SFX and music, except for level 7.
So, to compare collision code between the 2 CPUs:
Here's code I wrote for the 6502 CPU. It computes 16-bit integer collision of 1 bounding box against N bounding boxes (where N is limited to 30 max, as I use 8-bit indexing only and part of the zero page is also used by the player position). The bounding boxes are also "offset" so they contain only positive values, which makes the code faster (I think that is a classic optimization for this CPU), and everything is assumed to be in zero page. OK, that is cheating a bit, but the goal was to obtain the fastest collision code.
Code:
3 ldy <numEnemies
2 ldx #0
5
.loop
// TEST on X coordinates
3 lda <player.xmax_low
4 cmp <enemies.xmin_low,X
3 lda <player.xmax_high
4 sbc <enemies.xmin_high,X
2 bcc .no_collid_cc
16
3 lda <player.xmin_low
4 cmp <enemies.xmax_low,X
3 lda <player.xmin_high
4 sbc <enemies.xmax_high,X
2 bcs .no_collid_cs
16
// TEST on Y coordinates
3 lda <player.ymax_low
4 cmp <enemies.ymin_low,X
3 lda <player.ymax_high
4 sbc <enemies.ymin_high,X
2 bcc .no_collid_cc
16
3 lda <player.ymin_low
4 cmp <enemies.ymax_low,X
3 lda <player.ymin_high
4 sbc <enemies.ymax_high,X
2 bcs .no_collid_cs
16
.collid
2 lda #1
7 rts
9
.no_collid_cs
2+1 txa
2 adc #7
2 tax
2 dey
3 bne .loop
12
2-1 lda #0
7 rts
8
.no_collid_cc
2+1 txa
2 adc #8
2 tax
2 dey
3 bne .loop
12
2-1 lda #0
7 rts
8
My cycle counting may be a bit off, but here's what I obtain:
- 76 cycles per loop max
- 28 cycles min if discarded on the first test
- 44 cycles if discarded after the complete X test
So I guess we can consider a bit less than 44 cycles for the mean, which is, I believe, a very good score for the 6502.
It would be nice to have someone convert it to use the 65816's 16-bit capabilities and also remove the offsetting, which should then be useless. I actually think we cannot do much faster with the 65816 (in terms of cycle count), but let's see, maybe I'm wrong.
To compare, here's an equivalent 68000 code:
Code:
8 lea 4(a7), a6 ; a6 point on parameters
12 move.l (a6)+, a0 ; a0 = player
12 move.l (a6)+, a1 ; a1 = enemies
12 move.l (a6), d7 ; d7 = numEnemies
4 subq.w #1,d7 ; d7 = numEnemies - 1 (prepare for DBCC instruction)
48
12+16 movem.w (a0)+, d0-d3 ; d0 = player.xmin... d3 = player.ymax
28
.loop ; do {
// TEST on X coordinates
4 move.l a1, a2 ; a2 point on enemy
8 cmp.w (a2)+, d1 ; if (player.xmax < enemy.xmin)
10 blt .no_collid ; no collision
22
8 cmp.w (a2)+, d0 ; if (player.xmin > enemy.xmax)
10 bgt .no_collid ; no collision
18
// TEST on Y coordinates
8 cmp.w (a2)+, d3 ; if (player.ymax < enemy.ymin)
10 blt .no_collid ; no collision
18
8 cmp.w (a2)+, d2 ; if (player.ymin > enemy.ymax)
10 bgt .no_collid ; no collision
18
.collid
4 move #1,d0 ; return 1
16 rts
20
.no_collid
8 add.l #8, a1 ; a1 point on next enemy
10 dbra d7, .loop ; } while (numEnemies--)
18
4+2 move #0,d0 ; return 0
16 rts
22
Note the code could be optimized a bit, but it's already a good start. So, the results:
- 94 cycles per loop max
- 40 cycles min if discarded on the first test
- 58 cycles if discarded after the complete X test
So I guess we can consider a bit less than 58 cycles for the mean, which is actually more cycles than the 6502 code I posted! But remember that the 6502 code cheats a bit (limited to N = 30 max, zero page for everything, offsetting...), and here we want to compare against the 65816, which has 16-bit operation support.
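For reference, the test that both the 6502 and 68000 listings implement is the standard axis-aligned bounding-box rejection; here is a C sketch of the same logic (type and function names are mine, not taken from either listing):

```c
#include <stdbool.h>

typedef struct { int xmin, xmax, ymin, ymax; } Box;

/* One player box against N enemy boxes; each comparison can reject an
   enemy early, just like the early-out branches in the assembly. */
bool collide(const Box *player, const Box *enemies, int numEnemies)
{
    for (int i = 0; i < numEnemies; i++) {
        if (player->xmax < enemies[i].xmin) continue; /* first X test  */
        if (player->xmin > enemies[i].xmax) continue; /* second X test */
        if (player->ymax < enemies[i].ymin) continue; /* first Y test  */
        if (player->ymin > enemies[i].ymax) continue; /* second Y test */
        return true; /* all four tests passed: the boxes overlap */
    }
    return false;
}
```

The per-axis min/max ordering of the tests matches the assembly, so the early-exit behavior (and therefore the "mean" cycle estimates above) corresponds to the same branch pattern.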
Quote:
My cycle counting may be a bit off but here's what i obtain:
Yes; for example, a branching instruction takes 3 cycles if the branch is taken, else 2.
You also forgot the clc before adc and the sec before sbc; they're not required if you're sure the carry is clear or set before doing the adc/sbc.
Quote:
limited to N = 30 max, zero page for everything, offseting.
Yes, but RAM operations only take one more cycle than ZP; not dramatic.
I think the 65816 code should be close to 2 times faster.
TOUKO wrote:
Quote:
My cycle counting may be a bit off but here's what i obtain:
Yes; for example, a branching instruction takes 3 cycles if the branch is taken, else 2.
You also forgot the clc before adc and the sec before sbc; they're not required if you're sure the carry is clear or set before doing the adc/sbc.
The branch is OK, as I add +1 on the txa instruction for that reason (the branch was taken).
Also, the sbc instruction is used exactly for that purpose: the previous cmp instruction sets the C flag, so the sbc correctly handles the 16-bit operation. That's where this 16-bit collision code is really efficient (given that we have a 6502).
Ok, that's right so ..
Quote:
The branch is ok as i add +1 on the txa instruction for that reason (branch was taken).
Yes, but you always count a bne as 3 cycles, even when the branch is not taken.
It's confusing; I think you should count the branch as 2 cycles and your txa as 2+1.
In fact, you treat all the bne, beq, etc. as if the extra cycle for a page-boundary crossing (when the branch is taken) applied all the time.
Let me rephrase that: The
car wasn't programmed very well.
There is a little slowdown when the bosses die, but that doesn't affect gameplay at all. I just wonder if Contra III was still making some of the stupid mistakes that Gradius III and Super Castlevania IV did. Contra Hard Corps is definitely more explosive, but I think that's understandable considering it was made 2 years later. Contra Hard Corps is comparable to Rendering Ranger R2, except that there are faster movements (at least compared to the ground levels. The last ship level is just stupid.) I don't really care too much about Rendering Ranger R2, because to me, it feels more like a tech demo than a game. It isn't too terrible though.
Just thinking, one thing I always found stupid was how Contra III uses a whole BG layer just for the tiny status bar. They could have at least counted the points, or better yet, made it out of sprites.
Edit: My post is very far behind, because I stopped working on it and came back to it later.
Quote:
i think 65816 code should be close to 2 time faster.
Why? I don't have a clue about the 6502.
Quote:
The car wasn't programmed very well.
ah ok sorry for misunderstanding
Quote:
I'm don't really care too much about Rendering Ranger R2, because to me, it feels more like a tech demo than a game
Yes, you're right, and I take this game as an example only because it's interesting on the technical side, and it proves that the SNES can move tons of sprites on screen without slowdown.
Quote:
Why? I don't have a clue about the 6502.
Especially because the 65816 can do 16-bit operations just as fast as the 6502 does 8-bit ones.
Quote:
i think 65816 code should be close to 2 time faster.
Definitely not... I think you can only shave off a limited number of cycles, but you can try to prove me wrong.
Don't forget I took the best case for the 6502; I "almost cheated"...
To be roughly comparable to the 68000 counterpart code, we should at least consider using absolute X-indexed accesses here.
16 bits also brings a bit of penalty here.
Quote:
Definitely not... I think you can only shave off a limited number of cycles, but you can try to prove me wrong.
In your code example, it's the adc and sbc that take most of the cycles; you can cut that in half with native 16-bit operations.
Plus, your code is not optimal, and you can probably use the larger 65816 ISA to spend fewer cycles.
I'll let the 65816 coders here prove what it can do.
I'm not familiar with its instruction set, so I can't write the best possible code.
For the HuC6280 it's clearly not the case; you can do a little better, but nothing exceptional here.
TOUKO wrote:
Quote:
Definitely not... I think you can only shave off a limited number of cycles, but you can try to prove me wrong.
In your code example, it's the adc and sbc that take most of the cycles; you can cut that in half with native 16-bit operations.
Plus, your code is not optimal, and you can probably use the larger 65816 ISA to spend fewer cycles.
I'll let all the 65816 coders here prove what it can do.
For the HuC6280 it's clearly not the case; you can do a little better, but nothing exceptional here.
Except for the 16-bit stuff, the 65816 ISA can't really help here; I will try to convert the code myself...
About the HuC6280, I don't see how it can actually do it in fewer cycles than a 6502; I would even say it can only do worse here!
Quote:
Except for the 16-bit stuff, the 65816 ISA can't really help here; I will try to convert the code myself...
About the HuC6280, I don't see how it can actually do it in fewer cycles than a 6502; I would even say it can only do worse here!
Damn, I always forget that you don't add any box in it
And now the 65816 version, assuming both A and XY are set to 16 bits. I also use DP for the player coordinates, in the optimistic condition (low byte = 0). Also, to make it faster, I have to consider positive-only values again, as I don't see a fast way of handling comparisons of negative values on the 65xx CPUs (the V flag is not affected by CMP ?!?)... otherwise I need to do a combination of CLC + SBC.
Feel free, 65816 gurus, to correct me / improve it:
Code:
5 ldy numEnemies
3 ldx #0
8
.loop
// TEST on X coordinates
4 lda <player.xmax
5(+1) cmp enemies.xmin,X
2 bcc .no_collid_cc
11
4 lda <player.xmin
5(+1) cmp enemies.xmax,X
2 bcs .no_collid_cs
11
// TEST on Y coordinates
4 lda <player.ymax
5(+1) cmp enemies.ymin,X
2 bcc .no_collid_cc
11
4 lda <player.ymin
5(+1) cmp enemies.ymax,X
2 bcs .no_collid_cs
11
.collid
3 lda #1
6 rts
9
.no_collid_cs
2+1 txa
3 adc #7
2 tax
2 dey
3 bne .loop
13
3-1 lda #0
6 rts
8
.no_collid_cc
2+1 txa
3 adc #8
2 tax
2 dey
3 bne .loop
13
3-1 lda #0
6 rts
8
So:
- 57 cycles per loop max
- 24 cycles at min if discarded on first test
- 35 cycles if discarded after complete X test
So we can consider a bit less than 35 cycles for the mean... definitely better than the 6502, but not by as much as I expected.
Then we have to consider that we can get extra penalties when indexed data crosses a page boundary (what does that mean? if X is above 256, does it happen on each access?). Also, the code is not yet comparable to the 68000 version, as it requires positive-only bounds.
But the important thing is that even in this case, where we have many conditional tests and made things favorable to the 65816, if we compare the mean cycle counts obtained, 35 cycles for the 65816 against 58 cycles for the 68000, we can see the MD's 68000 is still faster than the SNES's 65816; even considering fast ROM, the 68000 is about 35% faster here.
Quote:
so we can consider about a bit less than 35 cycles for the mean
Yes, in the best case for the 68k, but if a collision occurs on the first test, the 68k is far behind (it's 192 cycles)...
Now the 65816 version properly handling signed comparison:
Code:
5 ldy numEnemies
3 ldx #0
8
.loop
// TEST on X coordinates
4 lda <player.xmax
2 clc
5(+1) sbc enemies.xmin,X
2 bvs .no_collid
13
4 lda <player.xmin
2 clc
5(+1) sbc enemies.xmax,X
2 bvc .no_collid
13
// TEST on Y coordinates
4 lda <player.ymax
2 clc
5(+1) sbc enemies.ymin,X
2 bvs .no_collid
13
4 lda <player.ymin
2 clc
5(+1) sbc enemies.ymax,X
2 bvc .no_collid
13
.collid
3 lda #1
6 rts
9
.no_collid
2+1 txa
2 clc
3 adc #8
2 tax
2 dey
3 bne .loop
15
3-1 lda #0
6 rts
8
- 67 cycles per loop max
- 28 cycles at min if discarded on first test
- 41 cycles if discarded after the complete X test (~mean)
So 41 cycles against 58 for the 68000: still a clear advantage for the MD's 68000 once clock speed is taken into account, even in this type of case (and I could optimize the 68000 code a bit, but I don't even care).
TOUKO wrote:
Quote:
so we can consider about a bit less than 35 cycles for the mean
Yes, in the best case for the 68k, but if a collision occurs on the first test, the 68k is far behind (it's 192 cycles)...
This doesn't make any sense... The 68000 is faster in every case where you have a lot of computation to do; that is the point!
We don't care about the cases where we are fast enough anyway.
I'm pretty sure cmp does affect the N bit.
I was talking about the V flag (overflow); without it there's no way to compare signed values (or it requires many more operations).
When would you actually need that anyway?
When comparing signed numbers. The only way to do this on the 65xxx is through sbc (which affects v); cmp only affects n, z, and c.
If you're asking "why did you (presumably Stef) design your thing that way?", then that's a discussion of a very different type.
It is on topic, because we're talking about optimizations and we shouldn't waste cycles unless there's a benefit, and I don't see any difference between having a range of -32768 to 32767 and a range of 0 to 65535.
koitsu wrote:
If you're asking "why did you (presumably Stef) design your thing that way?", then that's a discussion of a very different type.
I did both implementations, actually, unsigned and signed... but signed is more generic and should be the way to go in general; objects partially on screen (on the left) have signed values in their bounding boxes. If I want to avoid that, I have to "offset" them.
psycopathicteen wrote:
It is on topic, because were talking about optimizations and we shouldn't waste cycles unless there's a benefeit, and I don't see any difference in having a range of -32768 to 32767 compared to a range of 0 to 65535.
Stef is a professional at examples which have no use in a game or on a 2D machine...
Like doing 50 32-bit velocity computations, you see!!
I remember he estimated the 68k to be 2 times faster than the SNES's CPU; for now, nothing is pointing that way.
TOUKO wrote:
psycopathicteen wrote:
It is on topic, because we're talking about optimizations and we shouldn't waste cycles unless there's a benefit, and I don't see any difference between having a range of -32768 to 32767 and a range of 0 to 65535.
Stef is a professional at examples which have no use in a game or on a 2D machine...
Like doing 50 32-bit velocity computations, you see!!
Really, you always surprise me, Touko... Just a few posts ago you were again referring to the useless case: "the 65816 is faster when we exit on the first test"... Oh great! The 65816 can be faster when there is no need to be =)
And then you tell me this??? Also, can you explain to me why my collision code has no use in a game? Could you make it faster, by the way?
Quote:
I remember he estimated the 68k to be 2 time faster than the snes's CPU, for now nothing is going in that sense.
Are you sure of that? Here we took a piece of code especially favorable to the 65816 (many conditional jumps) and simple 16-bit operations; I also assumed the player bounding box is in DP... and still, for comparable code, we obtained 41 cycles (65816) against 54 cycles (68000). So if we now consider about 3 MHz for the 65816 (with fast ROM) and 7.67 MHz for the 68000, then we are *really* close to the x2 ratio... so I'll let you imagine a case better suited to the 68000.
Quote:
Also, can you explain to me why my collision code has no use in a game? Could you make it faster, by the way?
I was talking about having so many velocities in a game. I took this example from Sega-16, when you were comparing with tomaitheous the efficiency of the 65xxx (which you described as crappy) and the 68k; strangely, it was 24-bit velocity, really useful for a 16-bit game of that era
TOUKO wrote:
Quote:
Also, can you explain to me why my collision code has no use in a game? Could you make it faster, by the way?
I was talking about having so many velocities in a game. I took this example from Sega-16, when you were comparing with tomaitheous the efficiency of the 65xxx (which you described as crappy) and the 68k; strangely, it was 32-bit velocity, really useful for a 16-bit game of that era
You can have 8.8 velocity, but you require at least 16.8 positions, don't you agree? Then you have no choice: you require 2 operations on the 65816. On the 68000 you can go with whatever you want (16.8 + 8.8, 24.8 + 8.8...), but generally 16.16 is just more practical or more efficient, so we use it.
Quote:
You can have 8.8 velocity but at least you require 16.8 positions don't you agree
Yes, if you want, but is it required to add velocity to every single sprite in your game?
It's required for the player and some sprites, and you can easily use the ZP for that.
And even if the 68k is faster, it's by 20/30 cycles, nothing impressive for a "32-bit" CPU; and even if the whole computation is 1000 more cycles on the 65816 (which is far too optimistic), it's not a big deal, nor infeasible.
Is 16.8 less efficient than 16.16? I always say go with the faster option, even if it takes more RAM, especially on the SNES. Anyway, out of curiosity, why did you use sbv in your collision code? I don't even know what that does. The only comparison things I've ever used were: cmp, beq, bne, bcc, and bcs. I don't get the processor status bits.
You know, one way you could avoid the whole register-width fiasco with velocity is to actually make it 3 bytes. I know the SNES can't do 24-bit instructions, but for objects with simple velocity, like 2 pixels and no subpixel stuff, you could just use the upper 2 bytes so you don't have to change the accumulator width.
Quote:
sbv
Where?? I don't see that in his code; bvs and bvc, yes.
He probably meant sbc (v and c are keys next to one another on an American English keyboard).
yes, i think so
Espozo wrote:
Is 16.8 less efficient than 16.16?
On 68000 it is. The 68000 needs aligned accesses but can read either 16-bit half of a 32-bit value and can swap 16-bit portions of the 32-bit register. For 16.8 you'd need to use a slow microcoded rotation.
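The convenience tepples describes can be sketched in C (my illustration, not code from the thread): with 16.16 the whole update is one 32-bit add, and extracting the pixel coordinate is just taking the upper half, which maps to a swap plus a move.w on the 68000:

```c
#include <stdint.h>

/* 16.16 fixed point: one 32-bit add per axis, as on the 68000. */
typedef struct {
    int32_t x;  /* high 16 bits = pixels, low 16 bits = subpixels */
    int32_t vx; /* velocity in the same 16.16 format */
} Object;

static void step(Object *o)
{
    o->x += o->vx;                /* a single add.l on a 68000 */
}

static int16_t pixel_x(const Object *o)
{
    return (int16_t)(o->x >> 16); /* swap + move.w on a 68000 */
}
```

With 16.8 the fractional byte straddles the alignment the 68000 wants, which is why tepples says it would need a slow rotation instead.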
TOUKO wrote:
I don't see that in his code,bvs and bvc yes .
That's what I meant. I'm just stupid.
koitsu wrote:
He probably meant sbc (v and c are keys next to one another on an American English keyboard).
There's no way I could have gotten to this point without knowing how to subtract.
Espozo wrote:
You know, one way you could avoid the whole register width fiasco with velocity is to actually have it be 3 bytes. I know the SNES can't do 24 bit instructions, but for objects with simple velocity like 2 pixels and no sub pixel stuff, you could just use the upper 2 bytes so you don't have to change accumulator width.
Wait, what was I thinking... You don't need to use it only for whole-pixel values... When doing stuff for acceleration, you work with bytes 1 and 2, and when you add it to the coordinates, you use bytes 2 and 3. This uses an extra byte, but the SNES has more than enough RAM. Wait, how would you handle backwards velocity?
Quote:
That's what I mean't I'm just stupid.
don't worry
Quote:
Wait how would you handle backwards velocity?
With a negative value i think .
Random thought: what's the closest thing the SNES has to
Red Zone? (I personally prefer the less impressive indoor levels, but that's beside the point.) Although, ironically, the SNES might have an easier time if we were to recreate the outdoor levels, thanks to Mode 7 and the dedicated multiplication hardware (the indoor levels would still be hell, though).
Espozo wrote:
Otherwise, things like this wouldn't be possible:
https://www.youtube.com/watch?v=0w-U-9izrGs (look at the part where there are a lot of ships swarming. The explosions have velocity too.)
Honestly, that looks pretty stock for a shmup, but then again, I can't expect every shmup to be Recca. But mentioning Recca may also be relevant, because that's a game on the
NES, so whatever it does should be easily doable on both the SNES and the Mega Drive, really...
Espozo wrote:
If I'm not mistaken, psychopathicteen already found a very backwards and inefficient code to handle hioam in Gradius III. I'm sure there's other dumb stuff like that in that game.
Wasn't it Super R-Type? (Which, if I recall correctly, somehow manages to spend most of its CPU time manipulating the sprite table. Ouch.)
tepples wrote:
Espozo wrote:
Is 16.8 less efficient than 16.16?
On 68000 it is. The 68000 needs aligned accesses but can read either 16-bit half of a 32-bit value and can swap 16-bit portions of the 32-bit register. For 16.8 you'd need to use a slow microcoded rotation.
Or treat the 16.8 value as two separate values, which, yeah, I imagine is still slower to fetch, but if it stays in registers and is used often, it could help when using the carry (though you may as well use 16.16 at that point). Incidentally, this also makes 32.16 rather cheap on the 68000, if you need large numbers.
Espozo wrote:
wait, what was I thinking... You don't need to use it only for whole pixel values... When doing stuff for acceleration, you work with bytes 1 and 2, and when you add it to the coordinates, you use bytes 2 and 3. This uses an extra byte, but the SNES has more than enough ram. Wait how would you handle backwards velocity?
I do this:
Code:
lda {x_velocity}
bpl +
clc
adc {x_position_lo}
sta {x_position_lo}
bcs ++
dec {x_position_hi}
bra ++
+;
clc
adc {x_position_lo}
sta {x_position_lo}
bcc +
inc {x_position_hi}
+;
It's about 28 cycles max. It's not too much faster than doing 16.16 format, but the benefit is that it makes it easier to accelerate an object.
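If I read the routine right, in C it amounts to the following (my reconstruction, using the 16.8 layout that comes up later in the thread, with the position held in three bytes): the 16-bit add covers the subpixel byte and the low pixel byte, and the velocity's sign decides whether a carry bumps, or a missing carry borrows from, the high byte:

```c
#include <stdint.h>

/* 16.8 position held as three bytes: pos_hi : pos_low (pixel low byte
   and subpixel byte). Velocity is a signed 8.8 value. */
typedef struct {
    uint16_t pos_low; /* the pair of bytes the 16-bit add touches */
    uint8_t  pos_hi;  /* high pixel byte, inc'd/dec'd separately */
} Pos;

static void add_velocity(Pos *p, int16_t vel)  /* vel is 8.8 signed */
{
    uint32_t sum = (uint32_t)p->pos_low + (uint16_t)vel;
    if (vel < 0) {
        /* negative velocity: no carry out means we borrowed from pos_hi
           (the asm's "bcs ++ / dec {x_position_hi}" path) */
        if (sum <= 0xFFFF)
            p->pos_hi--;
    } else {
        /* positive velocity: a carry out propagates into pos_hi
           (the asm's "bcc + / inc {x_position_hi}" path) */
        if (sum > 0xFFFF)
            p->pos_hi++;
    }
    p->pos_low = (uint16_t)sum;
}
```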
psycopathicteen wrote:
Code:
lda {x_velocity}
bpl +
clc
adc {x_position_lo}
sta {x_position_lo}
bcs ++
dec {x_position_hi}
bra ++
+;
clc
adc {x_position_lo}
sta {x_position_lo}
bcc +
inc {x_position_hi}
+;
It's about 28 cycles max. It's not too much faster than doing 16.16 format, but the benefit is that it makes it easier to accelerate an object.
Why not just add an 8.8 velocity to a 16.8 position? Here I don't see how you handle subpixel velocity. It seems like you are using a plain 8-bit (signed) velocity and adding it to your plain 16-bit position.
Stef wrote:
to your plain 16bit position.
I really don't find 16.8 and plain 16-bit positions any different in games. (I probably couldn't tell you which one a game was using even if I tried.) It doesn't affect velocity, just collision detection.
About how he handles subpixel velocity: he probably just adds the high 8 bits of the 8.8 velocity to the 16-bit position. There's no subpixel positioning, just subpixel velocity.
(By the way, does bpl branch if the highest bit is 0?)
Espozo wrote:
About how he handles sub pixel velocity, he probably just adds the high 8 bits of the 8.8 velocity to the 16 bit position. There's no sub pixel positioning, just velocity.
The only way I've seen to do that involves dithering using a bit-reversed retrace count. Otherwise, there are no visible velocities between stopped and one screen per 4.16 seconds.
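A sketch of the dithering idea (my illustration of the general technique, not code from any actual game): positions stay whole-pixel, and the fractional part of the velocity is compared against a bit-reversed frame counter, so the extra one-pixel steps land on roughly frac/256 of the frames, spread evenly instead of bunched up:

```c
#include <stdint.h>

/* Reverse the 8 bits of the frame counter. */
static uint8_t bitrev8(uint8_t v)
{
    v = (uint8_t)((v >> 4) | (v << 4));
    v = (uint8_t)(((v & 0xCC) >> 2) | ((v & 0x33) << 2));
    v = (uint8_t)(((v & 0xAA) >> 1) | ((v & 0x55) << 1));
    return v;
}

/* Move by whole pixels only; 'frac' (0-255) is the subpixel part of the
   velocity, dithered across frames so the motion averages out right. */
static int step_dithered(int pos, int whole, uint8_t frac, uint8_t frame)
{
    pos += whole;
    if (frac > bitrev8(frame)) /* extra pixel on ~frac/256 of frames */
        pos += 1;
    return pos;
}
```

Because bit reversal is a permutation of 0..255, the extra steps are distributed as evenly as possible over any window of frames, which is what hides the dithering.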
Quote:
(by the way, does bpl branch if the highest bit is 0?)
Yes. The opposite is Body Mass Index/Broadcast Music Inc.
Well, here's a comparison of what psychopathicteen does and what I thought of doing: (both are using direct page)
Code:
Psychopathicteen's code: (16 bit)
lda {x_velocity} ;4
bpl + ;2
clc ;2
adc {x_position_lo} ;4
sta {x_position_lo} ;4
bcs ++ ;2
dec {x_position_hi} ;5
bra ++ ;3
+;
clc ;2
adc {x_position_lo} ;4
sta {x_position_lo} ;4
bcc + ;2
inc {x_position_hi} ;5
+;
bpl with bcc way: 18
bpl with no bcc way: 23
no bpl with bcs way: 18
no bpl with no bcs way: 26
My code: (16 bit)
lda #$???? ;3
sta XVelocity ;4
lda XVelocity+1 ;4
and #%0111111111111111 ;3 ;if velocity is positive
ora #%1000000000000000 ;3 ;if velocity is negative
sta XVelocity+1 ;4
lda XVelocity+1 ;4
clc ;2
adc XPosition ;4
sta XPosition ;4
first part (not in psycopathicteen's code and same for positive or negative): 18
second part: 14
total (velocity doesn't need to be loaded the second time like it is, but it's unlikely it will be back to back like that): 32
Would you mind actually showing how you load velocity, psychopathicteen?
My code is supposed to use 16-bit mode.
psycopathicteen wrote:
My code is supposed to use 16-bit mode.
Oh, my mistake. What's up with the {x_position_lo} and {x_position_hi}, though?
Espozo wrote:
psycopathicteen wrote:
My code is supposed to use 16-bit mode.
Oh, my mistake. What's up with the {x_position_lo} and {x_position_hi}, though?
{x_position_lo} = subpixels
{x_position} = pixels
{x_position_hi} = screen lengths
psycopathicteen wrote:
{x_position_hi} = screen lengths
What? Is that like {x_position} divided by 256? I guess you're using 16.16 fixed point? Mine's just plain 16-bit, but I'd rather devote more CPU time to other things.
Code:
define x_position_lo($04)
define x_position($05)
define x_position_hi($06)
define y_position_lo($08)
define y_position($09)
define y_position_hi($0a)
{x_position} is the high-byte of {x_position_lo}, and {x_position_hi} is the high-byte of {x_position}
So it's 32.16 fixed point? A bit big, no? There's no way any level I'm going to make is going to be over 256 screens long.
No, it's 16.8 fixed point. I use 16-bit mode to process 2 bytes at once.
I just thought of something: because positions for objects are relative to the entire level and not the screen, how would you make objects like bullets disappear? Do you have another set of X and Y positions that are relative to the screen?
By "disappear" do you mean "do not draw them to OAM", or do you mean "despawn them because they are far from the camera and can hit nothing visible"? If the latter, then a lot of games decode the level map in the camera's vicinity to a sliding window for display and collision. I guess objects have collision handlers for hitting the side of that sliding window.
tepples wrote:
If the latter, then a lot of games decode the level map in the camera's vicinity to a sliding window for display and collision. I guess objects have collision handlers for hitting the side of that sliding window.
That's what I mean. Otherwise, you could only fire as many bullets throughout the whole level as the object table can hold. I plan on making it so the camera only scrolls in one direction, like Contra, so I can probably do it for all objects. (How in the world do platform games even save offscreen enemy positions, anyway?)
Espozo wrote:
Do you have another set of x and y positions that are relative to the screen?
Data redundancy is always more error-prone than calculating everything from one value you know is correct.
The screen coordinates for any object can be calculated like this:
Code:
ScreenX = ObjectX - CameraX
ScreenY = ObjectY - CameraY
If the object is away from the camera/screen by more than a certain threshold, you can deactivate it. For objects that have fixed spawning points, you have to make sure these points are outside of the active area too; otherwise you might end up with the awkward situation where enemies appear to vanish.
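That scheme in a few lines of C (a sketch; the names and the margin value are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define SCREEN_W 256
#define SCREEN_H 224
#define MARGIN    64  /* active area extends this far past the screen */

/* Screen position is always derived from the level position, never
   stored separately, so the two can't fall out of sync. */
static int16_t screen_x(int16_t object_x, int16_t camera_x)
{
    return object_x - camera_x;
}

/* Deactivate an object once it leaves the active area around the camera. */
static bool should_deactivate(int16_t sx, int16_t sy)
{
    return sx < -MARGIN || sx >= SCREEN_W + MARGIN ||
           sy < -MARGIN || sy >= SCREEN_H + MARGIN;
}
```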
I was always under the impression that if companies like Capcom and Konami weren't around, SNES games would never have had slowdown, because smaller companies would have had to think for themselves instead of just trusting Capcom and Konami's methods of doing things.
But how did they even know their methods of doing things? It seems like an awful lot of the large companies actually weren't the most efficient programmers.
Quote:
But how did they even know their methods of doing things? It seems like an awful lot of the large companies actually weren't the most efficient programmers.
Because they were used to working on arcade machines, which have more horsepower than you could dream of; that's why you can code like a pig without (almost) any slowdown.
Didn't they just outsource the work to smaller companies without crediting them, anyway? At least for stuff like ports and such (which tend to be the biggest culprits, too). If anything, it seems like that would make matters worse, not better.
TOUKO wrote:
Because they were used to working on arcade machines, which have more horsepower than you could dream of; that's why you can code like a pig without (almost) any slowdown.
It amazes me how 3 enemies can bring a 12 MHz 68000 to its knees. I wonder what the code looks like...
It's very easy to write code that is slow. You should never be surprised by it. Where there is slowdown in any game it is simply because the developer did not treat it as a priority problem.
It has nothing to do with outsourcing, being used to arcade boards, being a large company, or anything of the sort. Nothing at all. The programmers weren't lazy, or incompetent. Look at any development diary and you'll see that they're overworked, tired, scrambling to keep up with the schedule (and late anyway), and struggling not to run out the budget before they have to ship the game. (Half the time, they don't even manage to do that.)
Games have slowdown simply because the person managing the project thought something else was more important. That's it. That's the root cause. Everything else is subservient to that.
rainwarrior wrote:
The programmers weren't lazy, or incompetent.
Then what explains the low frame rate in every single NES game by Micronics? What could have convinced them to deprioritize frame rate to such an extent?
Quote:
Games have slowdown simply because the person managing the project thought something else was more important.
I agree. But sometimes the "something else" is the ability to run on the available hardware. All the software optimization in the world couldn't have made Star Fox run at 60 fps while covering enough of the screen.
tepples wrote:
I agree. But sometimes the "something else" is the ability to run on the available hardware. All the software optimization in the world couldn't have made Star Fox run at 60 fps while covering enough of the screen.
Well, of course; and even setting aside that it would be impossible to program, the DMA bandwidth isn't big enough either.
tepples wrote:
All the software optimization in the world...
Software is not the only avenue for optimization.
This is a short sentence, but it's very important to game development. If you aren't hitting your target framerate, you can change your game design, you can change the art assets, you can fire an underperforming programmer, you can build a better co-processor. There is a whole lot to game performance that doesn't involve software.
And if you're going to say that a more powerful Super-FX chip was too expensive, this is again obviously a prioritization issue (cost is a priority). Meeting your budget is entirely about managing your priorities. Meeting a deadline is a priority (time vs other factors). Every mistake an incompetent programmer makes can be fixed with enough time or money, but
game development is resource management, not building the best possible software.
Quote:
It amazes me how 3 enemies can bring about a 12 MHz 68000 to its knees. I wonder how the code looks...
Don't look at the first CPS2 shmups, then
Quote:
The programmers weren't lazy, or incompetent.
No, but I think they picked up bad programming habits from working on powerful systems.
Maybe these games weren't written in ASM? (Probably.) One thing that really makes me curious is that the first Metal Slug game really doesn't have that much slowdown, and if you were to take a situation from the first game and directly translate it to the second one, there'd be a lot of slowdown. I would have thought they would have just reused the game engine.
This reminds me of how Sega would put three 68000s in an arcade board, then heavily underutilize them (like, run nearly everything in one 68000 and then just a single not-really-time-critical task on the other two). It got me to the point of wondering if the other two 68000s were there just to make bootleg machines too expensive to make.
Why even have 3 relatively good processors when you could just have one monster one? Sega seemed like they were a big fan of multiple processors (Sega Saturn). Sega also seemed like they were big on developing new, more powerful arcade boards when their older systems could have handled the same games just as easily.
(This is kind of random, but didn't Konami make different boards for just about every single new game?)
Espozo wrote:
Why even have 3 relatively good processors when you could just have one monster one?
Xbox 360 has three processors because chipmakers had hit a GHz wall.
Quote:
(This is kind of random, but didn't Konami make different boards for just about every single new game?)
Yes, to discourage arcade operators from converting one game to another by burning EPROMs. It's the same concept as the VRC2/4/6 pin swaps on Famicom.
tepples wrote:
Xbox 360 has three processors because chipmakers had hit a GHz wall.
I'm pretty sure those 3 68000s weren't about "hitting a GHz wall", though.
I guess they just couldn't be clocked any faster before they failed? (This is random, but when I first looked at the processor for the Xbox 360, I thought it said Xeon, not Xenon. I was like, dang!
)
tepples wrote:
Yes, to discourage arcade operators from converting one game to another by burning EPROMs. It's the same concept as the VRC2/4/6 pin swaps on Famicom.
I guess Konami had more than enough money to develop new systems for every game, unlike poor old Irem, which really only had 2 mentionable ones, the M72 and the M92. (There were a couple before those, and there were a ton of slightly changed M72 ones, even to the point that I think some games were compatible across different boards.)
Espozo wrote:
Why even have 3 relatively good processors when you could just have one monster one? Sega seemed like they were a big fan of multiple processors (Sega Saturn).
As I said: it could have made bootleg cabinets harder to make (making bootleg cabinets was extremely common so any deterrent worked). Also could have had something to do with repair service. Let's just say that arcade hardware design was rather shady back then...
Saturn was just the Saturn though. Still not anywhere as bad as the mess that would have been Jupiter.
Espozo wrote:
Sega also seemed like they were big on developing new, more powerful arcade boards when their older system could have handled it just as easily.
Eh dunno,
this is from the same year as Space Harrier (also Space Harrier was an early superscaler game, it's just that not many people know about the later ones as Sega never bothered with them past their initial arcade release).
Sik wrote:
(also Space Harrier was an early superscaler game, it's just that not many people know about the later ones as Sega never bothered with them past their initial arcade release).
They made a good number of systems in the superscaler series. Just about any game on the Sega System 32 probably could have been recreated on the Sega Y Board; you just would have needed to make the backgrounds out of sprites. Sega Sonic the Hedgehog sure as hell isn't using 4 512-color scalable BGs with a seemingly infinite number of 512-color scaling and rotating sprites.
Espozo wrote:
Sega Sonic the Hedgehog sure as hell isn't using 4 512 color scalable BGs with a seemingly infinite number of 512 color scaling and rotating sprites.
MAME seems completely unable to show the list of sprites and I can't find any spritesheet that seems to be consistent with the animations, not to mention some older MAME versions doing really weird stuff like showing the wrong side of a sprite (e.g. Sonic's bottom instead of the top). Makes me wonder what's going on with them.
Also don't forget that many of the objects are quite large or made out of many tiny sprites (especially stuff that breaks), that's something non-blitter sprite systems would have trouble with (e.g. Neo Geo would be pushed to its bandwidth limit) while for the superscaler it's rather easy. Also it can't rotate sprites as far as I know (only do scaling), only the whole framebuffer at once =P
I don't really understand why the Neo Geo is so glorified hardware-wise. It's a monster as far as home consoles go, but compared to other arcade boards at the time, it really isn't anything to brag about. The Sega System 32 was released the same year, and it destroys the Neo Geo in just about every aspect. Of course, the System 32 actually seemed abnormally strong for an arcade system at the time. It's still fascinating to me how the Neo Geo survived for 14 years.
Espozo wrote:
It's still fascinating to me how the Neo Geo survived for 14 years.
If a computing platform sticks around for several years, eventually people figure out how to make efficient code for it. So instead of selling an expensive system board to arcade operators, companies found that they could sell a less expensive game for the Neo Geo MVS or Capcom CPS board they already owned. This made operators more likely to try a game.
See also
"Daddy System" on All The Tropes.
Twilight Princess ran at a higher resolution on the Wii? I don't think that's right. I'm pretty sure they both run at 640x480. I know this sounds ridiculous, but to me, the GameCube often looks more impressive than the Wii does, possibly because developers cared less and often gave the Wii PS2 ports that didn't even look good on the PS2, which was already less powerful than the GameCube. The Wii was in an awkward place graphically between what people call the "6th generation" and "7th generation" consoles, and because it wasn't quite up to snuff with the Xbox 360 and PS3, it was often lumped in with the PS2 because companies didn't want to make 3 versions of the same game. Rogue Squadron II and III look better than just about anything I've ever seen on the Wii, and I have much more respect for the GameCube for using powerful but cost-effective hardware for its time, unlike the original Xbox, which ignored the cost side and still hardly, if at all, looks any better than the GameCube.
Espozo wrote:
I don't really understand why the Neo Geo is so glorified hardware-wise. It's a monster as far as home consoles go, but compared to other arcade boards at the time, it really isn't anything to brag about. The Sega System 32 was released the same year, and it destroys the Neo Geo in just about every aspect. Of course, the System 32 actually seemed abnormally strong for an arcade system at the time. It's still fascinating to me how the Neo Geo survived for 14 years.
Neo Geo was actually pretty weak for an arcade even back when it came out; in fact, if you sum up the video bandwidth in terms of pixels, it's pretty much on par with the Mega Drive (it just happens that the Neo Geo is more flexible about where it goes, but look at how unimpressive the parallax tends to be, if it's even there). The thing is that the Neo Geo 1) saw way more use (and remember SNK was
huge for fighting games) and 2) put a lot of focus on animation, which tends to bring in way more attention, despite the fact that all you need for that is just more memory. (They
did account for that part, though...)
Sik wrote:
Neo Geo was actually pretty weak for an arcade even back when it came out; in fact, if you sum up the video bandwidth in terms of pixels, it's pretty much on par with the Mega Drive (it just happens that the Neo Geo is more flexible about where it goes, but look at how unimpressive the parallax tends to be, if it's even there).
Doesn't "on par" usually mean a little worse, not a little better? The MD had 2 BGs and sprites all across the screen, for 960 pixels. I think I heard that the Neo Geo's overdraw is 96 sprites, so 96 x 16 = 1536 pixels, so the Neo Geo has about 1.6x the bandwidth. Not really much to brag about, especially since the Neo Geo came out 2 years later and was an arcade machine. And yes, it's rare that you ever get more than 2 layers of parallax. I've barely ever even seen 3 layers, and when I do, there are almost no sprites and one of the layers only covers about half the screen horizontally. Likewise, whenever there are a lot of enemies, there's often only one layer.
Something tells me the Neo Geo would run into bandwidth problems here: (Maybe I'm biased though...)
If you think about it, I imagine the CPS1 has about 2x the overdraw for sprites, and it has 3 BG layers, so you could say 384 x 5 = 1920.
Just thinking, what do you think would be a good way to measure the SNES's bandwidth? I'm saying you treat 8 bit pixels as 2 pixels and 4 bit pixels as .5 pixels. In worst case, you'd have 512 pixels for BGs and in best case, you'd have 768. (Including hi-res mode.) You'd then add 272 for sprites, and you'd get 784 to 1040. I'll just find the average, which is 912. Actually worse than the MD.
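The back-of-the-envelope overdraw figures in this and the previous posts are easy to sanity-check. The script below just reproduces the posters' arithmetic; the 320-pixel MD layers, 96 16-pixel Neo Geo sprites, and 384-pixel CPS1 layers are their assumptions, not measured hardware values:

```python
# Per-scanline "pixel fetch" estimates from the posts above.
# These mirror the posters' assumptions, not measured figures.

md = 320 * 2 + 320        # Mega Drive: 2 BG layers + sprites across the screen
neo_geo = 96 * 16         # Neo Geo: ~96 sprites of overdraw, 16 px wide each
cps1 = 384 * 5            # CPS1: ~2x sprite overdraw + 3 BG layers at 384 px

print(md, neo_geo, cps1)      # 960 1536 1920
print(neo_geo / md)           # Neo Geo: ~1.6x the MD figure
```

Dividing the figures gives 1536 / 960 = 1.6, so the Neo Geo's lead over the MD in this simple model is 1.6x rather than 1.5x.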
Espozo wrote:
Twilight Princess ran at a higher resolution on the Wii? I don't think that's right.
It's not right, so you are right
What the link probably means by "resolution" would be native 16:9 support, something you need hacks (homebrew called Swiss or something like that) to achieve in GC games.
Espozo wrote:
The Wii was in an awkward place graphically what people call the "6th generation" and "7th generation" consoles, and because it wasn't quite up to snuff with the Xbox 360 and PS3, it was often in the same camp as the PS2 because companies didn't want to make 3 versions of the same game. Rogue Squadron II and III look better than just about anything I've ever seen on the Wii, and I have a much higher respect for the GameCube for using powerful, but cost effective hardware when it was made, unlike the original Xbox which didn't care about the second thing and still hardly, if at all, looks any better than the GameCube.
The Wii's GPU is essentially the same one found in the GC, just with beefier memory. Aside from that, it's copy+paste hardware. It even shares the same hardware bugs. It's not really a surprise that developers didn't focus on "next-gen" graphical fidelity for the Wii; Nintendo basically forced their hand to stick with previous-gen-like graphics. Coming up with good-looking "cute" or "friendly" games (Tales of Symphonia: DotNW, Rune Factory, etc.) was easy enough, but pushing other art styles was probably more challenging. Even later, graphically impressive games like Xenoblade Chronicles and The Last Story suffer frame rate issues (I kid you not, I dropped down to single digits in TLS during one fight), and both games already target 30 FPS to make their frame rate easier to maintain.
I suppose during the GC-PS2-XBOX era, the difference wasn't as significant, so developers had more incentive to make their games look pretty on the GC. One exception to sub-par looking Wii games would be the Super Mario Galaxy games. Those games pretty much blow every other title I've seen on the Wii out of the water (and then look at them in HD in Dolphin!) But that's likely due to the fact that Nintendo was making it; they made the console, so they can do whatever magic they want
Shonumi wrote:
The Wii's GPU is essentially the same one found in the GC, just with beefier memory.
I can sure as hell tell you that that wouldn't have flown with any prior console they released.
The one reason Nintendo's games on the Wii looked better than everyone else's is that they didn't have to worry about gimped PS2 ports. The GC was strong back in the day, but not in 2006 - 2010 (it died right about after Super Mario Galaxy 2 was released, which is without a doubt the best looking title I've seen on the system. That's not saying a ton though.
)
Espozo wrote:
I can sure as hell tell you that that wouldn't have flown with any prior console they released.
It probably wouldn't have worked in the past. Even with the copy+paste hardware job they did on the DS, they notably improved the graphical prowess of their handhelds. But Nintendo hit the lottery with the Wii. That thing printed money; they had no reason to reconsider their ways until the next hardware generation.
Espozo wrote:
The one reason Nintendo's games on the Wii looked better than everyone else's is that they didn't have to worry about gimped PS2 ports.
Since when has Nintendo ever had to worry about gimped PS2 ports?
Shonumi wrote:
Since when has Nintendo ever had to worry about gimped PS2 ports?
For just about every cross platform game.
Espozo wrote:
Shonumi wrote:
Since when has Nintendo ever had to worry about gimped PS2 ports?
For just about every cross platform game.
Nah, I don't think so. Nintendo (or at least Nintendo fans) knows how much they rely on 1st party titles and only a core handful of 3rd party developers. It would have been a nice bonus if people would stop making bad ports and shovelware, but I don't imagine they lose any sleep over stuff like that.
Remember how they used to have good 3rd party titles back in the SNES days?
*sniff* I know, those were the days
Conversely, Nintendo has always been able to rock 1st and 3rd party on their handhelds. This explains my massive GB/GBA/DS/3DS collection?
Also what topic are we on? Not looking for a split, just noted how this has turned into a catch-all conversation (and I'm not helping).
Well, we were on the Neo Geo's bandwidth... This whole topic is pretty much general discussion for the general discussion thread. (Also, I'm apparently the author of this? I just noticed.)
Shonumi wrote:
What the link probably means by "resolution" would be native 16:9 support, something you need hacks (homebrew called Swiss or something like that) to achieve in GC games.
That or the fact that far more Wii consoles than GameCube consoles were connected to component monitors capable of progressive scan. But I have an N64 game called
GoldenEye 007 that supports 16:9 mode in either of two ways: letterbox or anamorphic. Enable both letterbox and anamorphic options, and it'll fill those fancy new 2560x1080 "scope" monitors.
As for Super NES, it'd be much harder to support selectable anamorphic widescreen because of the lack of sprite scaling. A game would have to include both 8x8 (PAR 32:21) and 8x6 (PAR 8:7) versions of all graphics and then use HDMA on the background scroll registers to use the smaller 8x6 version when set to 4:3 mode.
I know whoever wrote that bit in the link is using terms incorrectly or is mistaken about the Wii in general, but what does progressive scan have to do with saying the Wii had a better resolution than the GC? Not a TV technical person, btw.
Espozo wrote:
Doesn't "on par" usually mean a little worse, not a little better?
Eh, more like "about the same" but yeah, I miscalculated it back then I guess =P Still not that great though, especially since shrinking eats up pixels without giving extra bandwidth (all 16 pixels still get fetched).
Espozo wrote:
The MD had 2 BGs and sprites all across the screen for 960 pixels.
976 actually, since it always fetches an extra column for each tilemap (i.e. 336 pixels instead of 320). Sounds wasteful but don't forget that horizontal scrolling causes an extra partial column to show up.
Espozo wrote:
Just thinking, what do you think would be a good way to measure the SNES's bandwidth? I'm saying you treat 8 bit pixels as 2 pixels and 4 bit pixels as .5 pixels. In worst case, you'd have 512 pixels for BGs and in best case, you'd have 768. (Including hi-res mode.) You'd then add 272 for sprites, and you'd get 784 to 1040. I'll just find the average, which is 912. Actually worse than the MD.
Actually more like 1140 in terms of coverage if you compensate for the lower resolution. Mega Drive in H32 mode is only 800 pixels, after all.
Sik wrote:
Espozo wrote:
The MD had 2 BGs and sprites all across the screen for 960 pixels.
976 actually, since it always fetches an extra column for each tilemap (i.e. 336 pixels instead of 320). Sounds wasteful but don't forget that horizontal scrolling causes an extra partial column to show up.
Espozo wrote:
Just thinking, what do you think would be a good way to measure the SNES's bandwidth? I'm saying you treat 8 bit pixels as 2 pixels and 4 bit pixels as .5 pixels. In worst case, you'd have 512 pixels for BGs and in best case, you'd have 768. (Including hi-res mode.) You'd then add 272 for sprites, and you'd get 784 to 1040. I'll just find the average, which is 912. Actually worse than the MD.
Actually more like 1140 in terms of coverage if you compensate for the lower resolution. Mega Drive in H32 mode is only 800 pixels, after all.
If we compare system "video bandwidth" then we have to consider the bpp as well.
NeoGeo: 96*16*4bpp = 6144 bits / scanline
MD (H40): ((336*2)+320)*4bpp = 3968 bits / scanline
SNES mode 5 (heaviest for bandwidth): (512+8)*4bpp + (512+8)*2bpp + (256+16)*4bpp = 2080 + 1040 + 1088 = 4208 bits / scanline
SNES mode 1 (most used): ((256+8)*2)*4bpp + (256+8)*2bpp + (256+16)*4bpp = 2112 + 528 + 1088 = 3728 bits / scanline
SNES mode 0 (lightest for bandwidth): 3200 bits / scanline
I do not count internal bandwidth requirements for tilemap / sprite table fetch operations; I believe they should be almost equivalent on all systems... But then you have to at least consider the extra stuff brought by the hardware.
The SNES & MD offer scrolling capability and a tilemap organization for BGs which is very practical for ROM usage.
The MD has the bonus of offering column scrolling using dedicated RAM; to get it on the SNES you need to waste a 2bpp plane. Also, the MD offers some access slots to VRAM from the CPU even during display; they are limited, but thanks to the VDP FIFO they are still *very* useful (palette reprogramming, vertical scroll changes).
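For what it's worth, the bits-per-scanline arithmetic above checks out. Here is the same derivation as a quick script; the formulas are copied from the post, so they inherit its assumptions about fetch widths:

```python
# Re-check of the bits-per-scanline figures quoted above (same formulas).

neogeo  = 96 * 16 * 4                                            # sprites only
md_h40  = ((336 * 2) + 320) * 4                                  # 2 planes + sprites
snes_m5 = (512 + 8) * 4 + (512 + 8) * 2 + (256 + 16) * 4         # mode 5
snes_m1 = ((256 + 8) * 2) * 4 + (256 + 8) * 2 + (256 + 16) * 4   # mode 1

print(neogeo, md_h40, snes_m5, snes_m1)   # 6144 3968 4208 3728
```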
If you're going to compare SNES in different modes you should also compare Mega Drive in different modes as well, since the bitrate is completely different in H32 and H40 (as I said, on the Mega Drive you only get 800 pixels in H32 mode, meaning the SNES actually has more pixelrate for the same resolution).
Also the FIFO is not that useful, I mean sure it's useful for stuff like vertical scroll changes which is just one or two words (it will still suffer a delay, but the 68000 won't be stopped), but for things like palette raster effects forget it, the FIFO won't help you once it gets filled (remember there's only room for 4 words, while a palette row takes up 16, and a full palette takes up 64).
Espozo wrote:
Doesn't "on par" usually mean a little worse, not a little better?
It means neither of those things. The idiom "
on par" means it is equivalent.
Theoretical bandwidth comparisons between a tile plane and a sprite system can mislead. Consider a mostly empty scroll layer, largely air but with some terrain in it, such as the playfield of many 16-bit platformers. In a sprite-based system like Neo Geo or the 32-bit and later consoles, the empty parts of the layer do not use bandwidth. In a tile plane, on the other hand, empty parts do use bandwidth.
Sik wrote:
If you're going to compare SNES in different modes you should also compare Mega Drive in different modes as well, since the bitrate is completely different in H32 and H40 (as I said, on the Mega Drive you only get 800 pixels in H32 mode, meaning the SNES actually has more pixelrate for the same resolution).
I compared several modes on the SNES, and I also took the one which uses the most bandwidth. The idea is that in terms of internal logic and RAM speed, the SNES PPU is not any faster than the MD one, even though the system is 2 years newer, which is a bit disappointing...
Quote:
Also the FIFO is not that useful, I mean sure it's useful for stuff like vertical scroll changes which is just one or two words (it will still suffer a delay, but the 68000 won't be stopped), but for things like palette raster effects forget it, the FIFO won't help you once it gets filled (remember there's only room for 4 words, while a palette row takes up 16, and a full palette takes up 64).
Maybe you're too used to the unlimited VRAM access of the PCE, but I think the FIFO is definitely a real savior when it comes to accessing VRAM during active display. The FIFO is 4 words deep, and I agree that is not much, but it still allows you to write 5 words without wait states (1 slot is processed in the time you write the data), and that is really handy for some effects (vertical scroll changes / background color changes).
For a full palette update, indeed, it does not help much, but there is no way to get that done in a single scanline anyway, so IMO it's better not to try to update more than 7-8 words per scanline (using the hblank period) in this particular case.
Stef wrote:
Maybe you're too used to the unlimited VRAM access of the PCE
I've never touched the PCE.
Stef wrote:
but I think the FIFO is definitely a real savior when it comes to accessing VRAM during active display.
Point stands, though, that you simply can't upload much data, and that's heavily limiting if you want to do something fancy (e.g. good luck doing sprite multiplexing; Sonic 2, which does it, has to disable the display to pull it off in a reasonable amount of time, and you can't get many color changes on screen without artifacts).
Stef wrote:
For a full palette update indeed it does not help much but anyway there is no way to get it done in a single scanline so imo it's even better to not try to update more than 7/8 words per scanline (using the hblank period) in this particular case.
There are only 3 slots in hblank, the other 15 are spread over the visible area (and that's H40 mode, it gets worse in H32 mode).
Sik wrote:
I've never touched the PCE.
So I hardly see how you can find it that limiting. Of course VRAM accesses are tricky during the active period, the VDP is hellishly busy, and I think it's already quite nice to have (even limited) access (the SNES is not that permissive here).
Quote:
Point stands, though, that you simply can't upload much data, and that's heavily limiting if you want to do something fancy (e.g. good luck doing sprite multiplexing; Sonic 2, which does it, has to disable the display to pull it off in a reasonable amount of time, and you can't get many color changes on screen without artifacts).
True, sprite multiplexing needs sprite table rewrites because of the internal sprite attribute cache (you can't just change the sprite table address mid-frame)... Or you can use this feature to do some fancy effects (as in Castlevania).
Quote:
There are only 3 slots in hblank, the other 15 are spread over the visible area (and that's H40 mode, it gets worse in H32 mode).
3 fast or 3 slow slots? I'm surprised, because using hblank I was able to push up to 7 or 8 words from the CPU without wait states (maybe because of the FIFO).
Slot = a cycle where a write can go to video memory (VRAM, CRAM, VSRAM); there is no such thing as a "fast slot" =| Sure, the FIFO will probably eat them up and free the 68000, but the writes themselves will still be delayed and will happen mid-screen (which is particularly bad for palette writes, since they cause CRAM dots).
I know what a slot is, but normally during hblank VDP access should be fast; by slow I meant as slow as during the active period (which is slower than the 68k can write). From my calculation, hblank is about 44 68k cycles, so you should have more than 3 full-speed slots, but I guess the VDP uses part of it to complete/prepare the next line's rendering. The problem is that the hblank interrupt doesn't really let you use the full hblank period; most of the time you're better off preparing stuff for the scanline after next instead.
Active lines are busy throughout, HBL doesn't have any extra free time compared to visible area.
That actually explains it... HBlank accesses are not any faster than active-period accesses (that's what I meant by "slow" slots).
Strangely, I was still able to feed about 8 words without any wait state on the 68000 in some code I wrote a long time ago; maybe the code was slow enough to allow 3 slots to be processed while writing them.
Stef wrote:
From my calculation hblank is about 44 68k cycles
44 cycles is what the 68000 takes up in acknowledging an IRQ.
Yeah, but the hblank time is almost identical from what I remember (maybe 46 cycles actually). But that is just the hblank flag timing, not representing the real hblank, I guess. And if hblank does not provide extra slots (I mean fast slots) then it doesn't really matter...
It's then just timing-related, whether your writes occur before the data fetch or not (in which case you update data for the next scanline).
But OK, indeed the hblank period does not provide faster access; you have to sacrifice some sprites for that (by using forced blanking).
Can you please stop talking about "fast slots"? There isn't any such thing, you're just going to cause even more confusion and I think you don't even know what "slot" means (again: it's the dot clocks where a write can go to VRAM/CRAM/VSRAM, nothing else - vblank also has slots at specific spots, for that matter).
Then are there parts of a scanline with "closely spaced slots" and "widely spaced slots"? If so, then my first guess is that "fast slots" are the ones closer together, which clear the FIFO faster.
And this is precisely why I wanted Stef to stop talking about fast slots. (・_・)
There aren't fast slots, period. Stef was referring to the writes that get "eaten" by the FIFO (i.e. those that don't overflow it), since those let the 68000 continue instead of having to wait for a slot. That obviously is completely useless for timing purposes since you still have very little room for actually changing values (worse, a slot refers to an access cycle, so VRAM writes take up two slots instead of one, since VRAM is byte-based). The only advantage is not making the 68000 waste time when doing a write or two (allowing IRQs to return faster), nothing else.
As for the spacing of the slots: there are 15 during active scan, one every 16 pixel column (the columns without slots eat up that cycle for memory refresh instead). During hblank the VDP is way too busy reading sprite data so there's only room for three slots.
EDIT: and just to make it clear, the FIFO can store four 16-bit words, so as long as you don't write more than four words you'll never force the 68000 to wait regardless of when you do it. Writes will still take a while to take effect though.
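The FIFO behaviour described here can be illustrated with a toy model. The 4-word depth and one-access-slot-per-16-pixel-column spacing come from the post; the CPU write interval and the cycle units are arbitrary placeholders, so this is a sketch of the mechanism, not a cycle-accurate model:

```python
# Toy model of the VDP write FIFO: the CPU posts words into a 4-deep queue,
# and the VDP drains one queued word per access slot. Depth and slot spacing
# follow the post above; write_interval is an arbitrary placeholder.

def stall_cycles(n_writes, write_interval=4, fifo_depth=4, slot_period=16):
    """Return the total cycles the CPU spends stalled waiting for FIFO space
    while issuing n_writes back-to-back writes."""
    fifo = 0                   # words currently queued
    t = 0                      # current time, arbitrary cycle units
    next_drain = slot_period   # time of the next VDP access slot
    stalled = 0
    for _ in range(n_writes):
        # Drain every slot that has elapsed since the last write.
        while next_drain <= t:
            if fifo > 0:
                fifo -= 1
            next_drain += slot_period
        if fifo == fifo_depth:             # FIFO full: wait for the next slot
            stalled += next_drain - t
            t = next_drain
            fifo -= 1
            next_drain += slot_period
        fifo += 1                          # post the write
        t += write_interval
    return stalled

print(stall_cycles(4), stall_cycles(5), stall_cycles(8))
```

With these placeholder numbers, the model happens to reproduce Stef's observation: up to 5 writes go through without a wait state (the FIFO absorbs 4, and one slot drains during the burst), while a longer burst starts stalling the CPU.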
Sik wrote:
As for the spacing of the slots: there are 15 during active scan, one every 16 pixel column (the columns without slots eat up that cycle for memory refresh instead). During hblank the VDP is way too busy reading sprite data so there's only room for three slots.
Does this figure (15 in draw, 3 in hblank) apply to both 32-tile and 40-tile modes? Which timing diagram should I be looking at so I don't ask more dumb questions?
Sik wrote:
Can you please stop talking about "fast slots"? There isn't any such thing, you're just going to cause even more confusion and I think you don't even know what "slot" means (again: it's the dot clocks where a write can go to VRAM/CRAM/VSRAM, nothing else - vblank also has slots at specific spots, for that matter).
Thanks Sik, but really, I do know what a slot is... (again). I followed the work Nemesis did on this with a lot of interest:
http://gendev.spritesmind.net/forum/viewtopic.php?t=851
I agree the term "fast slot" is a misnomer, but it was exactly what Tepples described, so at least some people understand this expression:
Quote:
Then are there parts of a scanline with "closely spaced slots" and "widely spaced slots"? If so, then my first guess is that "fast slots" are the ones closer together, which clear the FIFO faster.
There is no point in saying "it gives 3 slots" if you don't know *the period* of these 3 slots. Having 3 consecutive slots and then no slot available for some time is quite different from having 3 slots at regular intervals, as my point was to know how many words you can write from the 68k without wait states (and depending on the case, the result is not the same). The timing of when the writes actually occur is another matter, but it wasn't what interested me here.
tepples wrote:
Does this figure (15 in draw, 3 in hblank) apply to both 32-tile and 40-tile modes? Which timing diagram should I be looking at so I don't ask more dumb questions?
No, H32 mode has fewer slots (I don't recall the exact spacing, but H40 has 18 slots and H32 has 16 during active scan - and 198 and 161 during the vblank period, since it's one slot every two pixels with some spent on refresh).
Stef wrote:
There is no point in saying "it gives 3 slots" if you don't know *the period* of these 3 slots. Having 3 consecutive slots and then no slot available for some time is quite different from having 3 slots at regular intervals, as my point was to know how many words you can write from the 68k without wait states (and depending on the case, the result is not the same). The timing of when the writes actually occur is another matter, but it wasn't what interested me here.
The timing matters because you can't make 7-8 palette writes in a single line and not have garbage pixels; you were arguing that because the FIFO ate them it was safe =|
Does the MD have to go through VSRAM for all kinds of vertical scrolling?
If so, is that why some heavy vertical line-scroll effects are complicated to do?
If I understand correctly, you have 18 slots per line during active display, and it's tight to change the vertical scroll on each line with an interrupt.
I'm not saying it's impossible (because it's not), but it may be difficult?
VSRAM is actually faster than VRAM, though (since it's word-based instead of byte-based), and if you aren't using per-column scrolling (most games don't), you're literally using only two words of it.
Most games didn't do mid-screen vertical scroll changes not because it was hard, but rather because there isn't much need for it in the first place =P Per-line horizontal scrolling is way more useful in practice (especially when it comes to parallax). The biggest problem is not changing the scroll value but calculating the values to use in each line in the first place (and nothing is going to help you here, that's a game logic issue, not a hardware one).
3D racing games used it a lot to do the roads. Maybe not the best example since they don't run at 60FPS, but that's more to do with the fact that they like to use both scroll planes so they can have larger sprites on screen (offloading the largest ones to the second plane) so as you can imagine that takes a toll on performance. It doesn't help that they have to load new graphics on the fly as you move around (OutRunners wastes an extra frame when it needs to load new scenery sprites, for example).
Some games used it for vertical scaling (see: Gunstar Heroes title screen, cat boss in Adventures of Batman & Robin, etc.), but in practice it's not that useful when you don't have horizontal scaling as well so it wasn't used much for that purpose. (and no, Hydrocity's background in Sonic 3 didn't use it, amusingly)
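As for the "calculating the values to use in each line" part mentioned a couple of paragraphs up (the game-logic side of per-line horizontal scrolling), here is a sketch of one common approach; the strip boundaries, speed factors, and function name are made up for illustration, and the math is 16.8 fixed point:

```python
# Per-scanline horizontal scroll table for strip-based parallax.
# Strip layout and speeds are invented for illustration.

LINES = 224  # visible scanlines (NTSC)

def build_hscroll_table(camera_x_16_8,
                        strips=((0, 64, 32),        # far sky: 1/8 speed
                                (64, 160, 128),     # midground: 1/2 speed
                                (160, 224, 256))):  # foreground: 1:1
    """strips = (first_line, end_line, speed); speed is a .8 fixed-point
    fraction of the camera position (256 means the strip moves 1:1)."""
    table = [0] * LINES
    for first, end, speed in strips:
        # camera (16.8) * speed (.8) has 16 fractional bits; shift them off.
        value = (camera_x_16_8 * speed) >> 16
        for line in range(first, end):
            table[line] = value
    return table

table = build_hscroll_table(100 << 8)       # camera at pixel 100
print(table[0], table[100], table[200])     # 12 50 100
```

On real hardware, a table like this would be fed to the VDP's per-line scroll RAM each frame; the point is that the per-line values are pure arithmetic on the camera position, which is why it's a game-logic cost rather than a hardware one.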
OK, I see now, thanks for the explanations
Quote:
The biggest problem is not changing the scroll value but calculating the values to use in each line in the first place (and nothing is going to help you here, that's a game logic issue, not a hardware one).
Yes, of course; I meant difficult because of the game logic combined with the vertical effect, not from a hardware point of view.
rainwarrior wrote:
It's very easy to write code that is slow. You should never be surprised by it. Where there is slowdown in any game it is simply because the developer did not treat it as a priority problem.
It has
nothing to do with outsourcing, being used to arcade boards, being a large company, or anything of the sort.
Nothing at all. The programmers
weren't lazy, or incompetent. Look at any development diary and you'll see that they're overworked, tired, scrambling to keep up with the schedule (and late anyway), and struggling not to run out the budget before they have to ship the game. (
Half the time, they don't even manage to do that.)
Games have slowdown simply because the person managing the project thought something else was more important. That's it. That's the root cause. Everything else is subservient to that.
How is 32-bit physics calculation less time consuming for a programmer than using 16-bit physics?
More precision means less time spent debugging problems caused by overflow or underflow.
Aside from accidental overflow, often more than 16 bits of precision is necessary for the kind of physics you want to do. (e.g. I wanted at least 17 bits for the character in my NES game, which I've implemented as 24 bit.)
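The "more than 16 bits" point can be shown with a toy calculation (the 12.4 layout and the 4200-pixel level here are my own illustration, not from any actual game): once some bits are spent on subpixels, a 16-bit position runs out of integer range fast, while a 24-bit one does not.

```c
#include <stdint.h>

/* 12.4 fixed point in 16 bits leaves only 12 integer bits, so any
   position past 4096 pixels wraps around. Widening to 24 bits (20.4)
   fixes it at the cost of multi-word arithmetic on an 8/16-bit CPU. */
static uint16_t move16(uint16_t pos, uint16_t delta) { return pos + delta; }
static uint32_t move24(uint32_t pos, uint32_t delta) { return pos + delta; }
```

Starting at pixel 4000 and moving 200 pixels right, the 16-bit version wraps around to pixel 104 while the 24-bit version correctly lands on 4200.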
rainwarrior wrote:
Aside from accidental overflow, often more than 16 bits of precision is necessary for the kind of physics you want to do.
Are you talking like 16.16, or just 16? I think 16.16 should be way more than enough, and you probably won't be able to tell it apart from 16.8.
For the record, I for some reason feel like sharing that I am using 16.8 for velocity, and the .8 at the end gets disregarded so it's just 16 for things like collision detection. This is kind of random, but one thing I've always wanted to do is have a bit of a physics-like thing in a game, where if there is an explosion and someone is nearby, they will go flying through the air. I see several problems with something like this, like how the thing would know which direction to fly through the air relative to the center of the explosion. I guess you would do a standard bounding box collision and then look at the center of the object in the blast radius relative to the center of the explosion? This kind of reminds me of how you'd do something like homing missiles.
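The knockback idea described above can be sketched roughly like this (all names and the falloff rule are hypothetical, not from any real game): after a bounding-box check, compare the object's center to the explosion's center and push the object away along each axis, weaker with distance, using integer-only math in the spirit of 16-bit-era code.

```c
#include <stdlib.h>

typedef struct { int x, y, vx, vy; } Object;

/* Push an object away from an explosion at (ex, ey). The sign of the
   center-to-center delta picks the direction on each axis; a linear
   falloff gives full force at the center and zero at the edge. */
static void apply_blast(Object *o, int ex, int ey, int radius, int force)
{
    int dx = o->x - ex;
    int dy = o->y - ey;
    if (abs(dx) > radius || abs(dy) > radius)
        return;                                /* outside the blast box */
    int fx = force * (radius - abs(dx)) / radius;
    int fy = force * (radius - abs(dy)) / radius;
    o->vx += (dx >= 0) ? fx : -fx;
    o->vy += (dy >= 0) ? fy : -fy;
}
```

A real game would probably want a lookup table instead of the division, but the structure (box test, per-axis delta, signed push) is the part that matters.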
I use 16.8 most of the time.
Espozo wrote:
Are you talking like 16.16, or just 16? I think 16.16 should be way more than enough, and you probably won't be able to tell it apart from 16.8
16.16 is 32-bit. This was directly in response to:
psycopathicteen wrote:
How is 32-bit physics calculation less time consuming for a programmer than using 16-bit physics?
Sik wrote:
The timing matters because you can't make 7~8 palette writes in a single line and not have garbage pixels, you were arguing that because the FIFO ate them that it was safe =|
What i meant by *safe* is that your writes aren't lost whatever happens, which is already a good point... and the FIFO is definitely very useful, because without it we would have a painful penalty on any VDP access during the active period; even a simple vertical scroll change would have cost a lot more CPU time (in the worst case).
I do know it does not make the writes into the internal VDP hardware any faster, and that you can't update a whole palette fast enough during the active period, but imo that is not that painful, just schedule it over several scanlines... and if you really want to update a complete palette (16 colors) quickly, you can always sacrifice some sprites: disable the display right after the last column and blast a CRAM DMA. I never tried it but i'm almost certain we can transfer 16 colors without too much damage (in terms of sprites).
I know the VDP timings and yes there's some serious damage in terms of sprite count actually, and that's with carefully timed code. (Overdrive even relies on this) Better leave that for situations where you can get away without sprites (like the credits in Puggsy, where palette changes are used to draw the scaled sprites).
Mickey Mania does it as well (disabling the VDP during hblank) during the moose hunter run sequence, and it still displays sprites (some are clipped for that specific reason; hopefully that is an intended effect in that situation).
I didn't know about the effect in the Puggsy credits, this game is awesome on many points actually... And the credits are definitely impressive (just saw it on youtube, looks like it inspired one of the Overdrive effects
)
Stef wrote:
(just saw it on youtube, look like it inspired one of the Overdrive effect
)
Nope, that was just a coincidence, although when Titan was struggling in Revision to get it working (since timing wasn't fast enough) and I was brought in at last minute I did point out that Puggsy did something similar... sadly there wasn't enough time to disassemble the game and find the relevant code to see how that was handled =P In the end I did some rushed blind programming (as in Fusion would show a blank screen, I just went by the assumption it'd work on the real thing - which it did), and by rushed I mean "we were literally testing with a flashcart while flashing the other in order to get testing done as quickly as possible" (yep, pipelined bugtesting). Sadly later it turned out there was another bug elsewhere so we had to can it =/ (it was for that last minute bugfixing rush that I was permanently brought into Titan though)
In the long term that was probably for the best since Overdrive was
horrible back then. Most of the best effects you see in the demo (the checkerboard, the rotator, the cube, etc.) weren't in the Revision version.
Sik wrote:
Nope, that was just a coincidence, although when Titan was struggling in Revision to get it working (since timing wasn't fast enough) and I was brought in at last minute I did point out that Puggsy did something similar... sadly there wasn't enough time to disassemble the game and find the relevant code to see how that was handled =P
Hehe, anyway the fun is in trying to understand how they did it and reproducing it.
Too bad you were that short on time...
Quote:
In the long term that was probably for the best since Overdrive was horrible back then. Most of the best effects you see in the demo (the checkerboard, the rotator, the cube, etc.) weren't in the Revision version.
I admit those are the effects i prefer as well, i still wonder how the 3D cube is done. In the NTSC version the cube is even bigger and still animated at 60 FPS. Given how the effect seems to rely on timing i guess it uses some tricks.
The cube relies on the scroll planes actually =/ (and the cube is bigger in PAL too, 1.1 isn't NTSC-only in case you wonder) Pretty sure there was an explanation of how it works in some site... if not huh just look at it while disabling individual layers and it becomes obvious, honestly. Trickier ones are the checkerboard (software rendering of both background and text), the rotation (abuses autoincrement), the 512 color part (the timing of just about everything, it's a good way to test whether a clone has hardware based off the real thing)...
And yeah I don't think I could have disassembled a game and then tracked down the raster effect interrupt handler from its credits code in less than an hour (that's literally how little there was left for the compo deadline, I'm not exaggerating).
Sik wrote:
The cube relies on the scroll planes actually =/ (and the cube is bigger in PAL too, 1.1 isn't NTSC-only in case you wonder) Pretty sure there was an explanation of how it works in some site... if not huh just look at it while disabling individual layers and it becomes obvious, honestly.
It was the first assumption i made for the cube stuff, using the scroll plane the way some racing games draw the road...
But then you should be limited to 2 faces per line or something like that, and at some point (when the cube morphs into a tetrahedron) you can have up to 3 colors / faces per scanline, which makes too many combinations to fit in the scrollmap. But maybe i missed something... i will try to play with an emulator to see the trick
Quote:
Trickier ones are the checkerboard (software rendering of both background and text), the rotation (abuses autoincrement), the 512 color part (the timing of just about everything, it's a good way to test whether a clone has hardware based off the real thing)...
The checkerboard is my favorite effect, still not sure how it is done, but well, for this one i just want to look at it
We can see the rotation uses some approximations in the calculation, as the pixels aren't always perfectly aligned, but it's definitely impressive to see it happening
Quote:
the 512 color part
For this one i expected to see the famous CRAM blast bitmap mode, but you preferred the "more conventional" method and the result is actually quite nice (i especially like the transition effect in how the screen appears/disappears).
How many palette entries are written per scanline, by the way?
Quote:
And yeah I don't think I could have disassembled a game and then tracked down the raster effect interrupt handler from its credits code in less than an hour (that's literally how little there was left for the compo deadline, I'm not exaggerating).
1 hour ?
It was a rush compo actually :p
Stef wrote:
It was the first assumption i did for the cube stuff, using scroll plan as some racer game draws the road...
But then, you should be limited to 2 faces per line or something like that and at some point (when the cube morph to tetrahedron) then you can have up to 3 colors / faces per scanline which make too much combinations to fit in the scrollmap. But maybe i missed something... i will try to play with emulator to see the trick
There are a few cases where there are too many faces in a line and it glitches up, luckily those areas are small enough to be overlooked.
Stef wrote:
The checkerboard is my preferred effect, still not sure how it is done but well for this one i just want to look at it
CPU grunt, pretty much =/ The VDP helps with stretching it vertically, but horizontally the rendering is pretty much the 68000 at work. All of this while the 68000 rotates the text as well... (there's a reason why the text animation looks wonky)
Stef wrote:
For this one i expected to see the famous CRAM blast bitmap mode, but you preferred the "more conventional" method and the result is actually quite nice (i specially like the transition effect in how the screen appear/disappear).
We were originally going to have both - the old low-resolution phantom bitmap (covering half the screen, with the other half being a parallax with some text) and then the high resolution one. We ditched the former because Sigflup beat us to having it in a demo so we needed to brag about doing something better =| (oerg wasn't happy about it, and technically it was my fault since I was the one suggesting to release the original idea since something like it was already being discussed in SpritesMind and I didn't want the credit to go to the wrong person when it was eventually figured out)
Stef wrote:
How much palette entries are wrote per scanline by the way ?
A whole palette per line. The amount of sprites being processed goes down to about 30 per line (too few and the message around the planet flickers, too many and the "your emu suxx" message appears).
Stef wrote:
1 hour ?
It was a rush compo actually :p
I did say I was brought in at last minute =P
Doesn't 16.8 velocities take just as much CPU resources as 16.16 velocities anyway? That's why I use 8.8 velocities with 16.8 coordinates, since it only needs to deal with big numbers once, and can do all the other calculations as 16-bit numbers.
BTW, is the 8.8 + 16.8 code really that hard to understand? It adds the 8.8 velocity to the 8.8 part of the coordinates, and then adjusts the top byte of the 16.8 coordinates. I waste a byte of memory after the 16.8 coordinates to avoid changing accumulator sizes.
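The scheme described above can be sketched in C (a rough model of the layout psycopathicteen describes; the struct and function names are made up): the full 8.8 velocity is added to the low 16 bits of the coordinate (8 integer bits plus 8 subpixel bits), and only a carry or borrow ever touches the high byte, so the "big" number is handled once per update.

```c
#include <stdint.h>

typedef struct {
    uint16_t lo;   /* low 8 integer bits + 8 subpixel bits */
    uint8_t  hi;   /* high 8 integer bits */
} Coord16_8;

/* Add an 8.8 velocity to a 16.8 coordinate: 16-bit add on the low
   word, then propagate carry (positive vel) or borrow (negative vel)
   into the top byte. */
static void add_velocity(Coord16_8 *c, int16_t vel)
{
    uint16_t old = c->lo;
    c->lo = (uint16_t)(old + (uint16_t)vel);
    if (vel >= 0 && c->lo < old) c->hi++;   /* carry out of the low word */
    if (vel <  0 && c->lo > old) c->hi--;   /* borrow from the top byte */
}

/* Whole-pixel position, e.g. for collision detection. */
static uint16_t pixels(const Coord16_8 *c)
{
    return (uint16_t)((uint16_t)c->hi << 8 | (c->lo >> 8));
}
```

For example, a coordinate at 255.5 pixels plus a velocity of +1.0 carries into the high byte and lands on 256.5; subtracting 1.5 borrows back down to 255.0.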
I think the argument here was that 16.16 is best for the 68000 and 16.8 is best for the 65816.
Sik wrote:
I think the argument here was that 16.16 is best for the 68000 and 16.8 is best for the 65816.
I'm still thoroughly convinced that it really wouldn't make a difference to have sub pixel coordinates or not. (you definitely want it for velocity though.) Really, the only thing to do with the coordinates is collision detection, which I don't think needs to do any sub pixel stuff because if you think about it, the boundary of the hit box would vary by one pixel, even though there probably isn't going to be a situation where that level of accuracy is needed. It is good for the 68000 which can conveniently use 32 bit numbers, but on the 65816, I wouldn't even bother.
Espozo wrote:
I'm still thoroughly convinced that it really wouldn't make a difference to have sub pixel coordinates or not. (you definitely want it for velocity though.)
But if you don't use subpixel coordinates, you don't get the benefit of subpixel velocity, because it can't accumulate across frames. The only other way is to fake it like Sik does.
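93143's accumulation point is easy to demonstrate numerically (a minimal sketch, assuming 8.8 fixed point and a velocity of 0.5 px/frame): an integer-only position truncates the fraction away every frame and never moves, while a subpixel position carries it over and advances one pixel every two frames.

```c
#include <stdint.h>

/* Move for a number of frames at 0.5 px/frame (128 in 8.8 fixed point)
   and return the resulting whole-pixel position. */
static int run_fixed(int frames)
{
    int16_t pos = 0;                      /* 8.8: fraction carries over */
    for (int i = 0; i < frames; i++) pos += 128;
    return pos >> 8;
}

static int run_integer(int frames)
{
    int16_t pos = 0;                      /* whole pixels only */
    for (int i = 0; i < frames; i++) pos += 128 >> 8;   /* 0.5 -> 0 */
    return pos;
}
```

After four frames the fixed-point object has moved two pixels; the integer-only one is still where it started.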
93143 wrote:
Espozo wrote:
I'm still thoroughly convinced that it really wouldn't make a difference to have sub pixel coordinates or not. (you definitely want it for velocity though.)
But if you don't use subpixel coordinates, you don't get the benefit of subpixel velocity, because it can't accumulate across frames. The only other way is to fake it like Sik does.
That's what I'd do.
Sik wrote:
There are a few cases where there are too many faces in a line and it glitches up, luckily those areas are small enough to be overlooked.
Ah ok, so there is this limitation, still pretty impressive to see it running at 60 FPS and that big
I made my own 3D flat rendering engine and of course it just can't display at 60 FPS because of the required bandwidth to just fill the rendering area :-/
Sik wrote:
CPU grunt, pretty much =/ The VDP helps with stretching it vertically, but horizontally the rendering is pretty much the 68000 at work. All of this while the 68000 rotates the text as well... (there's a reason why the text animation looks wonky)
I looked back at the effect and it appears the whole checkerboard effect is rendered in a single plane. I don't see many repeated patterns, so i really wonder how the 68000 can feed that much data into VRAM each frame (and that is even with the vertically assisted stretching, which can't always work depending on the situation). I wonder if there is a trick in the color arrangement: using one plane with 4 colors immediately made me think of 1 bit being used per color, then we could somehow precalculate different combinations... I need to investigate with a debugger :p
Quote:
We were originally going to have both - the old low-resolution phantom bitmap (covering half the screen, with the other half being a parallax with some text) and then the high resolution one. We ditched the former because Sigflup beat us to having it in a demo so we needed to brag about doing something better =| (oerg wasn't happy about it, and technically it was my fault since I was the one suggesting to release the original idea since something like it was already being discussed in SpritesMind and I didn't want the credit to go to the wrong person when it was eventually figured out)
Yeah i remember the discussion on SpritesMind, still, having both could have been cool as well
Actually the method was even known by Sega staff, the "Blast Processing" slogan actually comes from that, which is why I intentionally used the "blast" word here :p
Quote:
A whole palette per line. The amount of sprites being processed goes down to about 30 per line (too few and the message around the planet flickers, too many and the "your emu suxx" message appears).
Ok so transferring 16 colors per scanline eats a bit more than half of the sprite capacity, good to know
Actually, in games which are not CPU or sprite intensive (like RPGs) that is a workable solution to have a much more colorful background (even if it requires tricky timing, once you have it worked out it's done).
Stef wrote:
I looked back at the effect and it appears the whole checkerboard effect is rendered in a single plane. I don't see many repeated patterns, so i really wonder how the 68000 can feed that much data into VRAM each frame (and that is even with the vertically assisted stretching, which can't always work depending on the situation). I wonder if there is a trick in the color arrangement: using one plane with 4 colors immediately made me think of 1 bit being used per color, then we could somehow precalculate different combinations... I need to investigate with a debugger :p
I'll just ask Kabuto...
Stef wrote:
Ok so transferring 16 colors per scanline eats a bit more than half of the sprite capacity, good to know
Actually, in games which are not CPU or sprite intensive (like RPGs) that is a workable solution to have a much more colorful background (even if it requires tricky timing, once you have it worked out it's done).
Yeah would have to be not CPU intensive (and not sprite loading intensive either) because Overdrive does a CPU spin instead of relying on hblank interrupts (i.e. CPU is busy looping through the entire active scan in order to get the right timing). Maybe it's more doable when the palette changes only happen at some points on screen, then you can just spin around for a few lines instead of the entire screen.
Sik wrote:
Yeah would have to be not CPU intensive (and not sprite loading intensive either) because Overdrive does a CPU spin instead of relying on hblank interrupts (i.e. CPU is busy looping through the entire active scan in order to get the right timing). Maybe it's more doable when the palette changes only happen at some points on screen, then you can just spin around for a few lines instead of the entire screen.
Yeah, it's why i specified that; requiring such accurate timing means you have to spend CPU time waiting for the correct moment, and if you want to update 1 palette on every scanline then you really have to spend the whole active period on it.
Doing it every 8th scanline would already free up a lot of CPU time (busy ~2 scanlines, then free for 6).
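The budget behind that suggestion works out like this (a back-of-the-envelope sketch; the 224-line figure is NTSC's active display, the 2-lines-per-update cost is the estimate from the post above): spinning through every line costs the whole active scan, while updating every 8th line costs about a quarter of it.

```c
/* Scanlines the CPU spends busy-waiting per frame, given how often a
   palette update happens (period, in lines) and how many lines each
   update ties up. */
static int busy_lines(int active, int period, int cost_per_update)
{
    return (active / period) * cost_per_update;
}
```

With 224 active lines, an update every line (1 line busy each) eats all 224; an update every 8th line at ~2 busy lines each eats 56, i.e. about 25% of the active scan.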