Hi.
The NES is obviously a very limited system, but the one resource I pay the least attention to is code size. I think I'm approaching 6K of code, and I basically just have a guy running around in a room.
Do you guys care at all about code size, or do you just add extra pages of ROM whenever you run out?
What do you allow yourself to use macros for (16-bit math, etc.)? Do you often unroll loops? Any best practices I should be aware of?
Thanks.
-Mat
When I did homebrew for the TI-83, code size was paramount, especially when you had 25K of user RAM for all your programs.
When to add pages
With very few exceptions, sizes of PRG ROM and CHR ROM on the NES are powers of two. When you add more pages of ROM, you always double it, and if you double it too many times, you could exceed how much you had planned to pay per cartridge for replication. A few of these thresholds are especially painful.
For PRG ROM size:
- 32K to more than 32K, as you now have to include PRG bank switching hardware on the PCB and decide what goes in the fixed bank and what goes elsewhere.
- 64K to more than 64K, as you're now ineligible for the main category of the NESdev Compo (and thus exposure on the Action 53 anthologies to build an audience for your future solo cartridge releases).
- 256K to more than 256K, as many popular discrete mappers top out at that much, as does the MMC1 if CHR ROM is used.
- 512K to more than 512K, as you hit the limit of the PowerPak and many ASIC mappers.
For sizes of other things:
- 8K CHR ROM to more than 8K, as you now have to include CHR bank switching hardware on the PCB and figure out what will be displayed alongside what.
- Roughly 6K of DPCM samples to bigger than 6K, as you're now putting serious pressure on your fixed bank.
- 16K of enemy movement code to more than 16K, as you now have to split that across several banks and give each enemy type not only a movement routine entry point but also a movement routine bank.
- 16K of music code and data to more than 16K, as you now have to move a lot of music code into the fixed bank so that it can access sequence data in multiple banks.
- 2K of RAM to more than 2K, as you need to include WRAM and decoding hardware on the PCB.
- 32 bits of state preserved from one play session of your campaign to the next to more than 32, as you need to switch from an 8-character password to battery-backed RAM or self-flashability in order to save players' sanity.
- 8K of WRAM to more than 8K, as only a few well-known mappers are known to support that: MMC1 with CHR RAM, MMC5, and FME-7.
The program in Super Mario Bros. came to just under 32K. As ShaneM discovered, there are a few parts of the program where tricky code-golf optimizations were made, and a few parts that were left unoptimized. Nintendo's engineers optimized the code for size just enough that it came in under 32K.
Time-space tradeoffs
Moderate unrolling (factors of 4 to 16 or so) is also beneficial in time-critical code, such as video memory update routines. If you're trying to fit into 16K of PRG ROM for NROM-128, you probably don't have that much stuff to push to video memory each frame, so you can get away with a less unrolled update loop than something that pushes the limits of video memory bandwidth the way Battletoads does.
Subroutine call overhead on 6502 is 12 cycles: 6 for JSR and 6 for RTS. If a routine is called only a few times per 29780-cycle frame, this overhead may not amount to much.
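For a concrete sense of scale, here's a sketch of that tradeoff for a 16-bit add (variable and label names are made up; zero-page operands assumed):

Code:
; Inlined: 13 bytes and 20 cycles at every call site.
clc
lda score
adc bonus
sta score
lda score+1
adc bonus+1
sta score+1

; As a subroutine: 3 bytes per call site, but 12 extra
; cycles (6 for jsr, 6 for rts) added to every call.
jsr add_bonus

add_bonus:
clc
lda score
adc bonus
sta score
lda score+1
adc bonus+1
sta score+1
rts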
It's also the resource I pay the least attention to. Making my code smaller affects my players' perception of the game less than making my code faster (dropped frames vs... an extra second of download on 56K?). (Before anyone gets pedantic: I deal in ROMs, so cartridge costs aren't a factor.)
I do turn a lot of jumps into branches (which can only tie or make code slower) to save bytes. (There's almost always a flag you can branch on instead of using a jmp.) I do end up caring a little bit about data size, but I just find that sort of compression fun.
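A sketch of the kind of thing I mean (labels made up): right after storing a freshly loaded value, you know the state of the Z flag, so an always-taken branch can stand in for a jmp:

Code:
lda #0
sta scroll_x   ; lda #0 set Z, and sta doesn't touch flags
beq next_state ; always taken: one byte shorter than jmp,
               ; same 3 cycles unless the branch crosses a page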
I use macros to unroll loops, and in place of subroutines that are called often (to avoid the jsr/rts speed hit). I don't use 'em much for 16-bit math, since that usually makes optimizations harder to see. (Say you have a 16-bit add macro with a clc baked in, but you know the carry will be clear in some places where you use it. Or you know it will be set, and can just adjust the constant. Or whatever.)
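For reference, the kind of macro being described, sketched in ca65 syntax (the assembler choice and names here are illustrative, not anyone's actual code):

Code:
.macro add16 dest, src
clc            ; baked in, whether the call site needs it or not
lda dest
adc src
sta dest
lda dest+1
adc src+1
sta dest+1
.endmacro
; If you already know C is clear here, that clc wastes 2 cycles;
; if you know C is set and src is a constant, you could drop the
; clc and subtract one from the constant instead.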
If you posted some code, it'd be easier to give tips based on what you're actually doing. My most general tip is that the carry flag is super useful for all kinds of optimizations. Relatively few instructions change it, so you can rely on it not changing (and branch based on a value it held) for a while. You can also set things up so that on carry set you branch to a subtraction (no need for sec), and on carry clear you fall through and add (no need for clc). It's two cycles to set or clear, and again... it doesn't really change. So you can use it as a return value from a subroutine, and do a lot of other stuff before you actually use the value.
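A sketch of both tricks (all names are hypothetical). First, using carry as a long-lived return value; second, branching on it to pick add or subtract with no sec/clc:

Code:
jsr check_collision ; returns carry set = hit
lda frame_count     ; lda/and/sta don't touch C
and #$01
sta blink_phase
bcs handle_hit      ; carry survived all of the above

; carry can also encode direction: set = moving left
lda pos_lo
bcs moving_left
adc speed           ; C known clear here, so no clc
jmp store_pos
moving_left:
sbc speed           ; C known set here, so no sec
store_pos:
sta pos_lo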
You can also check out this thread:
http://atariage.com/forums/topic/71120- ... ler-hacks/
And this one:
http://www.atariage.com/forums/topic/11 ... -by-seven/
And this wiki article:
https://wiki.nesdev.com/w/index.php/Syn ... structions (Reverse subtract is nice.)
As well as here:
http://codebase64.org/doku.php?id=base:6502_6510_maths
for some cute examples of code to get your brain turning.
tl;dr: Worry the most about getting the game done, honestly. Use whatever resources are at your disposal for that, because none of the rest matters if no one will ever play it.
Me personally, I always try to use the "least" amount of code possible for anything on Megaman Odyssey. I like to say that I'm the master of "optimizing routines", because I've been doing it for so many years.
The game rarely ever has any lag frames "during" gameplay, and I've tested certain areas and situations with FCEUX movie recordings and the lag counter, countless times.
The exception is the Pyro Man level, because the SNES/Genesis-style water IRQ is called like 30 times a frame.
But I'm always so obsessed with finding or creating as many little shortcuts as possible, even though I know it won't ever matter to anyone in the world.
My game's size is currently 512 KB of graphics and 512 KB of code... I have more than 50% free space on graphics, and probably about 3 or 4 "2000 byte banks" still available on the code side before I must upgrade to 1 MB.
I use the MMC5 mapper, so the limit is 1 MB on both. I've never had to go past 512 KB yet.
I normally consider PRG-ROM to be a cheap resource, so I'll often resort to using unrolled loops and look-up tables if that results in performance improvements. However, that doesn't mean I don't care about code size, because even though 512KB of PRG-ROM (a "common" limit for programs that are not meant to be part of compilations) is a lot to fill, the NES can still only see 32KB at a time, and I really don't want to be switching banks back and forth during critical parts of my game engine.
Like other coders here, I do a lot of small optimizations: turning jumps into branches, removing redundant SECs and CLCs, and so on. I always try to draw attention to these optimizations with comments, so that I know to be careful when editing code around them. Another thing I do is use as few variables as possible in any given logic block, to avoid excessive loading and storing. I also spend some time optimizing loops: avoiding special handling of the first or last iterations (which would require repeated code) and finding the optimal ending conditions (e.g. counting down instead of up to avoid a comparison), even if that means tweaking the inner logic a bit.
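The counting-down trick looks like this in practice (a sketch with made-up labels), saving a 2-byte, 2-cycle compare on every pass:

Code:
; Counting up needs an explicit compare:
ldx #0
copy_up:
lda src,x
sta dst,x
inx
cpx #16
bne copy_up

; Counting down gets the loop test for free from dex
; (works for counts up to 128, since bpl tests bit 7;
; the copy order is reversed, which is fine here):
ldx #15
copy_down:
lda src,x
sta dst,x
dex
bpl copy_down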
tokumaru wrote:
I really don't want to be switching banks back and forth during critical parts of my game engine.
This is really the main driver behind keeping code size down, for me. If I can keep all of the code and data that relate to each other inside the same bank, everything is a lot simpler. MMC1 also makes you want to avoid bank switching when you can, because of how many cycles it ends up taking.
On top of doing small optimizations, I tend to use subroutines quite heavily. It doesn't matter if the code is very specific - if there's a block of code I use multiple times and cycles aren't very important, I make it a subroutine.
Another thing I do with subroutines a lot is to give them a "default" input, that can be overridden. For example, I have a "Display Enemy" which just takes a starting tile number, which is the most common case. "Display Enemy" just prepares an input for and drops into executing "Display Enemy Custom" which takes four arbitrary tile numbers, and I only have to store one copy of the pretty lengthy subroutine. Super Mario Bros does something similar, giving some routines multiple places to enter them that feed something else with different data depending on which entry point you used.
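A sketch of that fall-through pattern (the labels, variables, and consecutive-tile layout are all made up for illustration):

Code:
; Common case: caller passes only the starting tile in A
display_enemy:
sta tile0
clc
adc #1         ; assumes consecutive tile numbers that
sta tile1      ; don't wrap past $FF, so C stays clear
adc #1
sta tile2
adc #1
sta tile3
; fall through into the general version

; General case: caller has filled tile0..tile3 itself
display_enemy_custom:
; ...one copy of the lengthy drawing code...
rts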
Personally I am very careful about code size, and I know I've disagreed with tokumaru on that quite a few times.
Optimizing for speed only makes sense if you come close to using 100% of the CPU, while optimizing for size makes sense whenever you are close to a 2^n PRG-ROM limit and don't want to go to the next 2^n size.
Limiting code size also makes the most sense if you want to stay within 32KB and avoid bankswitching altogether. (Yes, there are ways to get more without bankswitching, but they're non-canonical.) As soon as you implement any bankswitching, I don't think it makes much difference whether you use, say, 64KB or 128KB, but each mapper has its canonical limits.
bleubleu wrote:
What do you allow yourself to use macros for (16bit math, etc.) ?
In most cases, macros should assemble exactly the same thing you would have written out by hand, adding no extra length of executable code nor execution time, only hiding the ugly internal details so you don't have to look at them every time you use them. Since they can make your code more concise and readable, you might even find optimization possibilities that weren't otherwise obvious. I also get fewer bugs when I make heavy use of macros, and the ones I do get tend to be easier to find and fix. Call me the macro junkie. Maybe that should have been my forum name.
Kasumi has a valid point about situations like whether or not a CLC is needed before a 16-bit add in a macro; but you can also have, for example, two different macros that do the same thing except that one has the leading CLC and the other does not, and give them the same name except that the less-used one gets a trailing _ or something like that. That's not very common though, and of course just because you have a macro doesn't mean you can't still do it the non-macro way if you want to.
I do nestable 6502 program flow control structures in macros too. See http://wilsonminesco.com/StructureMacros/ . One of the simplest examples might be
Code:
CMP #14
IF_EQ ; clear enough that it really needs no comments
<actions>
<actions>
<actions>
END_IF
and the IF_EQ assembles a BNE down to the END_IF, exactly as you would write by hand, but it doesn't need the label. The END_IF is only used by the assembler, and it does not lay down any code. Again, it can be nested too, meaning a second IF_EQ...END_IF pair could be inside the first one, and another one inside of that one, etc., and the assembler will make each branch go to the right place. There are lots of different forms of these, even in the IFs group, like IF_BIT ACIA_STAT_REG, 3, IS_SET, and of course other structures besides IFs, like BEGIN...WHILE...REPEAT, FOR...NEXT (including a 16-bit one), CASE, etc.
Code size is something to be careful about; it does sneak up on you.
At the start I just write code; I don't care about how big or slow it is. I just get it to do "the thing". Once I see the thing and play with it, I can get it to the point where I'm convinced it will "stay". Then I make it "sensible": anything that can be looped, tabled, etc. gets looped and tabled. I save size/speed optimisations for when they become critical and I can see how everything has to work in the mostly complete code.
As for "this clc is not needed", "this jump can be a bne", etc., I leave those to the tass optimizer to find; it will find all of them in a second, and it does love to show off.
To avoid the three weeks of optimisation at the end of the project (which is risky, because you don't have time to test everything), I use BDD6502 and take a week or so here and there to do an optimisation and code-cleaning pass, to break it up a bit.
Macros are nice, but they are mermaids... they sing a sweet song and send you to your doom if you are not careful. If you make them "safe" it's mostly OK, but you have to really plan them properly and understand how they work. I did have a lot of macros, but I found they tend to make the code less readable and maintainable after a while: ADCB_W, ADCBX_W, IFBLT, IFBLTE, BAGTE, etc. So I've evolved it into a syntactic sugar + optimizer system, which then gets the tass optimizer as well. After using it every day for the last 4 months, it's to the point where I can't even be bothered to write a small test case the old way. It still has a lot of things it doesn't do that it needs to, though. Sigh...
Just wanted to show this, if I may: sometimes I'll have insanely long sections of nothing but JSRs, to do many different things in terms of enemy/boss movement patterns, like this.
Without all these JSRs, it would have been like 20x longer.
fun discussion topic
How much bank granularity is there with existing mappers? If I start with a 32kB bank, things will go smoothly, but if I pass that limit and have to rely on bank switching to get around it, trying to fit existing code into separate 16kB banks would be really frustrating.
The 3 most common layouts:
32K (e.g. GNROM, AOROM, BNROM, Color Dreams, MMC1)
$8000-$FFFF: 32K switchable window
16K (e.g. UNROM, MMC1)
$8000-$BFFF: 16K switchable window
$C000-$FFFF: fixed to last 16K
2x8K (e.g. MMC3)
$8000-$9FFF: 8K switchable window
$A000-$BFFF: 8K switchable window
$C000-$FFFF: fixed to last 16K
And specialized layouts:
2x8K reconfigured for DPCM switching (e.g. MMC3, VRC4)
$8000-$9FFF: fixed to second-to-last 8K
$A000-$BFFF: 8K switchable window
$C000-$DFFF: 8K switchable window
$E000-$FFFF: fixed to last 8K
16K+8K (e.g. VRC6)
$8000-$BFFF: 16K switchable window
$C000-$DFFF: 8K switchable window
$E000-$FFFF: fixed to last 8K
3x8K (e.g. FME-7, RAMBO-1)
$8000-$9FFF: 8K switchable window
$A000-$BFFF: 8K switchable window
$C000-$DFFF: 8K switchable window
$E000-$FFFF: fixed to last 8K
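As a concrete example of driving one of these windows, here's the two-write sequence MMC3 uses to map an 8K PRG bank into $8000-$9FFF (a sketch of the documented register interface, in PRG mode 0):

Code:
lda #6      ; bank select: R6 controls $8000-$9FFF in PRG mode 0
sta $8000
lda #5      ; map 8K PRG bank number 5 into that window
sta $8001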
One reason for having a fixed bank is that on NES, the program, data, and DPCM sample bank are all $00 (in 65816 terms). So the mapper needs to subdivide the space so that a subroutine in one bank can access data in another bank.
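In practice this usually means a small trampoline in the fixed bank. Here's a hedged sketch for a UNROM-style mapper (cur_bank, func_ptr, and bank_table are made-up names; bank_table is the usual identity table so the write matches the ROM contents and avoids bus conflicts):

Code:
; Lives in the fixed $C000-$FFFF bank, so it's always mapped.
; Call with X = target bank, func_ptr = target address.
far_call:
lda cur_bank
pha                ; save the caller's bank
stx cur_bank
txa
sta bank_table,x   ; switch to the target bank
jsr go
pla
sta cur_bank       ; restore the caller's bank
tax
sta bank_table,x
rts
go:
jmp (func_ptr)     ; func_ptr is a zero-page vector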
Oziphantom wrote:
Macros are nice, but they are mermaids.. they sing a sweet song and send you to your doom if you are not careful. If you make the "safe" its mostly ok. But you have to really plan to properly and understand how they work etc. I did have a lot of macros but I found they tend to make the code less readable and maintainable after a while. ADCB_W, ADCBX_W, IFBLT, IFBLTE, BAGTE etc.
Take a different approach. Instead of using cryptic names, make it really clear what they're doing, and use the parameters to make the line read like a sentence. If your ADCB_W means "do a double-precision (16-bit) add-with-carry of B and W," you could change the macro name to something like _16bit_ADC, and make the line say, for example,
Code:
_16bit_ADC B, _and, W ; B=B+W
(Unfortunately the assembler requires separating parameters with a comma, which is why there's a comma after the _and.) The "_and" (with the underscore or other character to keep the assembler from confusing it with the mnemonic) is an equate that does not actually get used by the macro. It's only there to make things more readable to humans. The comment clarifies where the answer goes. So this would assemble the same as
Code:
CLC
LDA B
ADC W
STA B
LDA B+1
ADC W+1
STA B+1
The same macro can be used to add different variables which you specify in the parameters, rather than being confined to B and W. Conditional assembly in the macro definition can do optimizations if necessary. Some assemblers let you say in essence, "If there's a fourth parameter, do the following;" so you could use the same macro to add more than just two numbers, and you could invoke it something like this:
Code:
_16bit_ADC B, W, _and, offset3 ; B=B+W+offset3
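In ca65, for instance, the optional third addend could be sketched with .ifnblank (the macro name and syntax details here are my own illustration, not necessarily how the assembler in question does it):

Code:
.macro _16bit_ADC dest, addend, extra
CLC
LDA dest
ADC addend
STA dest
LDA dest+1
ADC addend+1
STA dest+1
.ifnblank extra    ; assembled only when a third parameter is given
CLC
LDA dest
ADC extra
STA dest
LDA dest+1
ADC extra+1
STA dest+1
.endif
.endmacro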
If your IFBLT means "if: branch if less than," and only assembles a BMI, it's not really clarifying or shortening anything. How about something like this instead, where a portion is skipped if the N flag is set:
Code:
IF_POSITIVE ; Negative result above causes it to skip the following lines.
<do_stuff>
<do_stuff>
<do_stuff>
END_IF
or to branch back to the beginning of a loop as long as the result is negative:
Code:
BEGIN ; (Or name it "DO" if you like)
<do_stuff>
<do_stuff>
<do_stuff>
UNTIL_POSITIVE
Then you don't even need a label (although you can still use one if you want to).
Thanks for all the answers. The IF_XXX macros are very interesting. I also find beq/bmi/bpl are often hard to follow.
I should have realized that such an open-ended question would send the discussion in many directions. I'll try to be more specific.
Where do you draw the line between inlining something (macros) and making it a subroutine? For example, what about a 16-bit addition (which is ~20 cycles or so)? Do you accept paying 12 cycles (jsr/rts) for it just for the sake of reducing code size, or do you use a macro? Where is the cutoff?
-Mat
bleubleu wrote:
Where do you draw the line between inlining something (macros) and making it a subroutine? For example, what about a 16-bit addition (which is ~20 cycles or so)? Do you accept paying 12 cycles (jsr/rts) for it just for the sake of reducing code size, or do you use a macro? Where is the cutoff?
You're forgetting about the extra code required for parameter passing if it's a subroutine. That can easily tip the scales in favor of inlining the code when it comes to short snippets like that.
Ah. To answer the more specific question, I usually only make things subroutines when they need ,x or ,y. (So making two objects collide, reading collision data.) I always take the byte hit rather than the cycle hit on 16bit math. My games do a lot of it, because they scroll in both directions.
I'll also make subroutines if it's a generic thing that's likely to make branching over it go out of range.
bleubleu wrote:
Thanks for all the answers. The IF_XXX macros are very interesting. I also find beq/bmi/bpl are often hard to follow.
I should have realized that such an open-ended question would send the discussion in many directions. I'll try to be more specific.
Where do you draw the line between inlining something (macros) and making it a subroutine? For example, what about a 16-bit addition (which is ~20 cycles or so)? Do you accept paying 12 cycles (jsr/rts) for it just for the sake of reducing code size, or do you use a macro? Where is the cutoff?
It depends, of course, on the performance requirements at that point, and what straightlining would cost in terms of memory. But consider the following from my web page on macros:
As you write an assembly-language program, you may see repeating patterns. If it's exactly the same all the time, you can make it a subroutine. That incurs a 12-clock performance penalty for the subroutine call (JSR) and return (RTS), but program memory is saved because the code for the subroutine is not repeated over and over.
There will be other times however where the repeating pattern is the same but internal details are not, so you can't just use a JSR. The differences from one occurrence to another might be an operand, a string or other data, an address, a condition, etc.. It would be helpful to be able to tell the assembler, "Do this sequence here; except when you get down to this part, substitute-in such-and-such," or, "under such-and-such condition, assemble this alternate code." That's where it's time for a macro.
bleubleu wrote:
Thanks for all the answers. The IF_XXX macros are very interesting. I also find beq/bmi/bpl are often hard to follow.
I should have realized that such an open-ended question would send the discussion in many directions. I'll try to be more specific.
Mat, at this point you cannot blame yourself. Trust me, there is no forum on the internet that loves a tangent more than this one. You can start with the most specific of cases, one that strictly defines what you mean, and it will still end up 5 tangents away. I mean, you can ask about tile maps on the NES and end up talking about the pros and cons of the Megadrive using a Z80 as a sound processor (OK, to be fair, it has never quite gone that far, but the 'to the MD' has happened).
bleubleu wrote:
Where do you draw the line between inlining something (macros) and making it a subroutine? For example, what about a 16-bit addition (which is ~20 cycles or so)? Do you accept paying 12 cycles (jsr/rts) for it just for the sake of reducing code size, or do you use a macro? Where is the cutoff?
-Mat
For when and why, you have to basically (mentally) profile the code.
Do I add this value to a lot of things? No? Inline it.
Do I add this value to a few things that are within 256 bytes of each other, in a lot of places? Yes? Make it a function.
Do I add this value a lot of times per frame, but only in 2 places? No, inline it.
At some point making a function pays off, i.e. you pay 3 bytes per call plus a 1-byte static cost for the rts, and eventually it starts to win depending on how big the function is. However, you might also need to set up an LDX and/or LDY and/or LDA, which raises the cost, so you need more and more call sites to get a payoff. Something the size of a 16-bit add rarely pays off in the zero-page case; outside zero page it starts to be worth it a bit more.
Your limits may also affect the cost analysis. For example, in a 4K game, if it saves you 2 bytes, it's worth it. If you use compression, inline it; the compressor will be able to instance it at a lower cost. Have you gone over a bank and want to get back into one?
How much of The Legend of Zelda is code and how much is data?
Code size has not been a problem for me, thanks to various optimizations (both those that improve speed and those that improve size), use of unofficial opcodes, etc. I also use tail call optimization, and have modified the ppMCK driver to use it; tail calls improve both speed and size. The level data and so on usually take up more space than the code, even though the data is compressed.
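On the 6502, a tail call is just a trailing jsr/rts pair collapsed into a jmp, which saves 1 byte and 9 cycles per call (labels hypothetical):

Code:
; Before:
update_player:
jsr move_player
jsr animate_player
rts

; After: animate_player's own rts returns
; straight to update_player's caller.
update_player:
jsr move_player
jmp animate_player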
I use macros mainly to auto-fill tables and deal with the banking and so on. I may also sometimes use macros for short stuff that doesn't need to be a subroutine and that benefits from the improved speed, although sometimes a macro won't do because the optimization crosses the macro's borders. Some macros for this purpose may take parameters. For example, in a Z-machine implementation, one macro has a short (and, due to the mapper, fast) sequence of instructions to read the next byte of the interpreted program, but it takes parameters to control which register to use and which instruction loads it, so that you don't have to move the value to another register afterward, or save registers beforehand, etc.
I do not generally use macros to form loops and so on though.
psycopathicteen wrote:
How much of The Legend of Zelda is code and how much is data?
Quick rough calculation:
Approximately 25% padding
Approximately 40% code (estimated by histogram)
Remaining 35% assumed to be data
I think the ratio of code to data depends a lot on the scope of a game. The bigger your game is, the less of it will be code, proportionally. (There are some rare exceptions, e.g. games with a lot of procedural generation, but I think it's more or less universally true on the NES.)
My answer to OP's question though is, unhelpfully and probably unsurprisingly, "as careful as I need to be." I've made a bunch of ROMs where code size didn't matter at all (i.e. my goals were drastically smaller than the available size), and several where it mattered a lot. The answer to this question absolutely depends on the situation it is applied to.
rainwarrior wrote:
My answer to OP's question though is, unhelpfully and probably unsurprisingly, "as careful as I need to be." I've made a bunch of ROMs where code size didn't matter at all (i.e. my goals were drastically smaller than the available size), and several where it mattered a lot. The answer to this question absolutely depends on the situation it is applied to.
This hits the nail on the head. Sometimes you have to optimize for space in one part of your code, and optimize for speed (at the expense of space) in another, in the same codebase.
More than either of those, though, my default is to optimize for ease-of-development. That means optimizing for readability and getting-it-done. I don't tend to optimize for size or speed until I have to.
lidnariq wrote:
psycopathicteen wrote:
How much of The Legend of Zelda is code and how much is data?
Quick rough calculation:
Approximately 25% padding
Approximately 40% code (estimated by histogram)
Remaining 35% assumed to be data
If Zelda is a 128kB game, wouldn't that be like 3 or 4 banks of code? That must be a bankswitching nightmare. Why does Zelda need so much more ROM space than Super Mario Bros?
gauauu wrote:
More than either of those, though, my default is to optimize for ease-of-development. That means optimizing for readability and getting-it-done. I don't tend to optimize for size or speed until I have to.
Yes, I would say part of effective budgeting is leaving yourself room to optimize later if you need to.
If you know what you're doing, in a lot of cases you can work faster by making the code inefficient, which at the same time gives you space that you can reclaim with more work later. If you placed your bets wisely, you won't need to do that work later anyway, but otherwise it will help tremendously when you do run out of space and have this intentional buffer of inefficient code to take up. Like most things, this takes experience to be able to do well.
The other thing is that it's very worth estimating your needs up front. Try and guess how much code you need, how much data you need, how much ROM space you have, etc. Plan this out roughly up front. You won't know all the details but try to guess. Revise your estimates periodically as you go along (or better yet, have your tools automatically generate some stats for you). It's a lot easier to deal with the space crunch if you can see it coming earlier on.
...and in your planning, leave yourself some extra space! It's easier to add a little more at the end than it is to try and scale back a project that's overbudget.
psycopathicteen wrote:
lidnariq wrote:
psycopathicteen wrote:
How much of The Legend of Zelda is code and how much is data?
Quick rough calculation:
Approximately 25% padding
Approximately 40% code (estimated by histogram)
Remaining 35% assumed to be data
If Zelda is a 128kB game, wouldn't that be like 3 or 4 banks of code? That must be a bankswitching nightmare. Why does Zelda need so much more ROM space than Super Mario Bros?
It was ported from the FDS, and my guess is 128K was the next size up they could find for ROM chips. As for the overall size, the enemy mechanics are a good deal more complicated than what SMB deals with, and there's a fair amount of text. Also, it's a CHR-RAM game, so all of the tiles are stored in the PRG ROM.
Loose memories from running it through IDA a bunch suggest there might be even more than 25% unused space, but it's been a while since I looked at it.
The bankswitching isn't that much of a mess, but they also have a sizeable chunk of code they copy over to the SRAM area, as the 3 saves don't need the full 8KB.