I've come pretty far in my general purpose NMI routine, and each NMI I've got:
64 tile row update
30 tile column update
16 byte attribute row update
8 byte attribute column update
32 entry palette update
5 tiles of CHR RAM (80 bytes)
And it all fits just under the limit of Vblank (2240 cycles every time). I was thinking this was pretty good and doable (especially considering I don't need to dink around with extended Vblank) but there's one problem. I forgot about sprite DMA
!
So I was just wondering what exactly happens if you enable sprite DMA late? Does it have to at least be in forced Vblank? In another project I don't do the transfer in natural Vblank, and it works in FCEUXD, Nintendulator, and Nestopia. Perhaps that's because the screen is off and it's at the bottom (I'm pretty sure it is). Or is it just like you want to enable it in Vblank so sprites with Y coords at the top of the screen are shown? Thanks in advance.
PPU is constantly accessing OAM during rendering, so you cannot DMA during rendering or Bad Things (tm) happen.
If you disable rendering ($2001) or do it all inside VBlank you're fine (but be careful not to let DMA spill outside of VBlank -- it takes 513 cycles and all of those cycles must be outside of rendering).
Bigfoot has different sprites on the top and bottom halves of the screen, separated by a few scanlines of forced blanking.
There are ways to turn the rendering off early or on late, most of them depending on sprite 0 (a status bar at the top makes this a LOT easier) or a scanline counter (which needs your game to be put on a torn-up T*ROM board). Show me a screenshot or mock-up of your project, and I can give you some ideas.
Garage Cart #1's menu shows 128 sprites. Just forced blanking, but there were some minor issues with the sprite-ram addressing which was explored on the forum not that long ago.
Okay, so as long as rendering is off when I do it, I should be fine, right? I guess I'm going to have to stick with extended Vblank for NTSC if I want to update more than 2 CHR tiles (actually it's 3 cycles too long). But this is okay, since I use vertical mirroring and there are glitches at the top and bottom of the screen. If I disable rendering for the first 8 scanlines, I could have both Sprite DMA and 2 more tile updates. And those 8 scanlines would act kind of like BG clipping, but on the top of the screen instead of on the left.
And this is for ChateauLevania, which plays like Super Metroid or the new CastleVania games where the level is one big map. My status bar is made of sprites, though I might reconsider now that I think about it. But I don't really want to deal with timed code when rendering is enabled (though I suppose I could use MMC3's scanline counter just for the status bar. I would probably have to do Vblank updates, enable rendering for the status bar, turn on 8x8 sprites, force pattern tables to be left-BG and sprites-right, then wait for MMC3's counter which will turn off the screen, enable 8x16 sprites, do a DMA transfer, then turn it back on and set the scroll).
Sorry, kind of off topic, but for MMC3's scanline counter to work, does sprite rendering and BG rendering have to be enabled?
I'd say stay clear of extended VBlank until you really have to. This simplifies things a lot. You might consider introducing some sort of priority system for your VBlank routine so that you can get as many updates as you'd like, only some will take priority over some others, and you get rid of extanded VBlank.
Quote:
Sorry, kind of off topic, but for MMC3's scanline counter to work, does sprite rendering and BG rendering have to be enabled
Yes, both must be enabled (yes that counter is really meh).
Bregalad wrote:
I'd say stay clear of extended VBlank until you really have to. This simplifies things a lot. You might consider introducing some sort of priority system for your VBlank routine so that you can get as many updates as you'd like, only some will take priority over some others, and you get rid of extanded VBlank.
Now that you mention it, I might actually want to do that. The reason I made all routines take the same amount of time is so I could have extended Vblank without using any scanline counters (it would end on the same scanline each time so I would know which one I'm on and adjust the scroll accordingly).
But I don't think extended Vblank would cause
that many complications. Wouldn't you just use a simple formula to set the scroll midframe, assuming that the exact same amount of scanlines are blanked each time? What's more complicated about it than a status bar?
Celius wrote:
And this is for ChateauLevania, which plays like Super Metroid or the new CastleVania games where the level is one big map.
If you scroll only horizontally or only vertically at any given point, like Metroid 1, you can switch mirroring modes based on the shape of the walls around the camera at any given moment. (SMB3 uses horizontal mirroring for horizontal levels and vertical mirroring for vertical levels.) Or you can use 1-screen mirroring if you can get A*ROM, S*ROM, or TLSROM, putting the status bar in one page and the playfield in the other.
Quote:
My status bar is made of sprites, though I might reconsider now that I think about it.
Metroidvania games have been done both ways. Metroid 1 used sprites like Mega Man series and SMB2; the CV games for NES used a top bar like SMB1 and both Zelda games.
Quote:
But I don't really want to deal with timed code when rendering is enabled
As long as you can test in Nintendulator NTSC, Nintendulator PAL, Nestopia NTSC, and Nestopia PAL, timed code isn't
that hard. Here's what you do at the end of your vertical blank code:
- Wait for the vertical blank to end (sprite 0 hit bit to become 0).
- Copy OAM.
- Do some unrolled, fixed-size copies such as the leading edge of the scroll.
- Set the scroll position to the status bar using the 6-5-5-6 write pattern.
- Turn on rendering.
- Wait for sprite 0 hit bit to become 1 to trigger switching from the status bar to the playfield.
- Set the scroll position to the playfield using the 6-5-5-6 write pattern.
But if you're targeting both NTSC and PAL, notice that all this will make your PAL letterbox even bigger. But you do have a lot more vblank to play with, so you might just want to push the status bar against the top border in PAL builds without extending vblank.
My game scrolls in 2 directions at once quite often. My game isn't much like NES CVs or Metroid 1, it's more like new versions. That's partially my goal with this game, to have it be a step up from those games (though topping CV3 is in my opinion a tall order).
My status bar is currently made of 10 8x16 sprites to display health, mp, subweapon, and hearts amount. I still have plenty of sprites to work with, so using 10 isn't that big of a deal.
The main reason I don't want to do a status bar I suppose isn't because of timed code complications, but more the complications with the name table. Repositioning it all the time would be complicated, and would require more Vblank updates than just map decoding.
So what if I did this:
1. Disable rendering
2. Did 3073 cycles worth of PPU updates (what I calculated previously + DMA and 2 more CHR RAM tiles)
3. Set scroll with 6-5-5-6 pattern
4. Enable rendering
Am I missing something? That doesn't sound so complicated to me.
And also, I remember reading that you had to write the Y then X scroll for $2005 when doing midframe updates. Is this correct?
EDIT: Oh, and I do plan to make a PAL version of the game. But with PAL, these issues don't exist. 70 scanlines of Vblank? Yeah. No dinking around with extended Vblank there. With that I could update like 32 more tiles without extended Vblank.
It sounds like the extended VBlank method could work like it does for battletoads, but I've felt like it has been for a long time heavily overrated in this board.
Unrolled loops can do wonders when it is about gaining speed. I'd probably stay clear of extended VBlank if possible, but if your game engine really needs badly that much updates then you'd probably want to go the extanded VBlank way.
It will add more complications to your code as you'll have to have everything taking the exact same number of cycles and complicated scrolling updates, probably using some timed code and that stuff.
Again I don't want to do much publicity but I wrote a doccument that talks a bit about that stuff and gives tips on how to do timed code, so you shouldn't fear this, but again if you can avoid the extended VBlank you'd want to go that way.
I see no problem having the status bar with sprites, but yeah if you would switch to BG and not want 1-screen mirroring, you'd probably have to "move" the status bar when scrolling vertically which could add some complications and more updates, or do it the Kirby way which I find clever, but relies on the fact that the status bar's vertical size is constant.
I've already written code that takes an exact amount of cycles every time. Even when the routine is "disabled" because the information to be put on screen isn't all the way updated, it still uses up the same amount of cycles. I guess I don't know why I said the timing would be an issue, because it's not.
This is also why I do 64 tile writes for a row and 30 for a column. It updates the entire column and the entire row so there aren't any name-table division checks that would cause the code to run at a different rate. Though it's up to the programmer to position the tiles in the array that's copied exactly right, which isn't so hard. It actually makes some things easier.
And I could save some cycles by switching to zero page, but the palette, attribute and tile arrays take up $98 bytes, so I don't want to waste that much zero page. Unfortunately, I can't put the dynamic tile writes there because for each tile drawing routine it draws the tiles from ROM or RAM. If I tell the routine to draw from RAM, it has to be absolute in order to take the same amount of time as drawing from ROM.
What about putting stuff in stack page and use a unrolled string of "pla/sta $2007" ?
I can't see the benefits of doing that. PLA takes 4 cycles, just as lda Absolute,x takes 4 cycles (I never cross page boundaries). Also, I'd have to change $2006 to point to different sections. Currently, each tile update takes 160 cycles (it can read from ROM or RAM, and the number also includes the time it takes for JSR/RTS and setting up pointers and bankswitching), which really isn't bad. All the other routines read from RAM. Plus, you'd have to make sure you're not destroying addresses when putting stuff on the stack and that would be variably complex.
EDIT: So I've implemented extended Vblank for a simple demo. It extends Vblank like 10 or 11 scanlines, with no status bar, but it doesn't look really weird or anything I don't think. In my opinion, it doesn't look that skewed compared to left clipping. I think with the top and bottom clipped, it's so small that you don't notice. But if the top and bottom aren't clipped (like in Nintendulator), you'll definitely notice it more.
http://www.freewebs.com/the_bott/ExtendedTest.nes
You can move around with the control pad and it does that cool Metal Storm effect with fake parallax.
Here's the deal. This is for NTSC only. NTSC TVs mostly have the top and bottom clipped off. And if they are shown, it will look like it does in Nintendulator. The only benefit I have from that would be that it hides vertical mirroring glitches just like left clipping does for horizontal mirroring. Otherwise it looks kind of skewed. However, for PAL there would be no extended Vblank, so there would be no skewing. So would I be safe extending Vblank to this length?
Also, this allows me to update 10 CHR RAM tiles instead of just 5. So I think this engine gives me a lot of flexibility which I'm thinking heavily outweighs the cost of having it slightly skewed looking.
Well it seems to work fine under Nestopia so I guess it's all right.
You are right that there is not that much advantages in pla / sta $2007 chains, but Battletoads did that so I was under impression this has some advantage.
You can also do strings of lda immediate / sta $2007 in RAM for fastest possible speed, but it needs 5 times more byte in the RAM buffer, and to acess it you have to multiply the index by 5 and add 1.
At first I thought about doing strings of immediate LDAs and STAs for PPU updates, but when doing 310 writes to the PPU, that would take way too much code and also time (for copying ROM to RAM for tiles). Here's how my current code works. There's code executed in RAM (though for the demo it's in ROM) that does:
lda #PPUhigh
ldy #PPUlow
ldx #BankNumber
stx $8001 ;I use MMC3. This selects the 8k bank that holds the desired CHR data to be mapped to $8000-$AFFF.
ldx #TileInPage
jsr $xxxx
And 9 more times for the next 9 tiles. When it does JSR $xxxx, it goes to a routine that does something like:
sta $2006
sty $2006
lda $8000,x
sta $2007
lda $8001,x
sta $2007
lda $8002,x
sta $2007
...
lda $800E,x
sta $2007
lda $800F,x
sta $2007
rts
There's a routine for each page from $7F00-$AF00. It takes up a pretty big amount of ROM, but I have it to spare. Before coming to this routine, you load X with any multiple of $10. If the selected page is $8000 and X is $00, it will put the tile in that starts at $8000, if X is $10, it will put the tile in starting at $8010, etc. So I don't have to mess around with copying ROM to RAM to put it in $2007. I can just put ROM to $2007 (only for CHR updates though).
$7F00-$7FFF is for dynamic tiles. I could make this any absolute address and it wouldn't have to be a page long (though it'd obviously have to be at least 16 bytes). I figure allowing for 16 dynamic tiles is good for now.
Then for the rest I do:
jsr UpdatePalette
lda #PPUHigh
ldy #PPULow
jsr UpdateTileRow
lda #PPUHigh
ldy #PPULow
jsr UpdateAttributeRow
lda #PPUHigh
ldy #PPULow
jsr UpdateTileColumn
ldy #PPUHigh
lda #PPULow
jsr UpdateAttributeColumn
That's self-explanitory pretty much. I update an entire attribute/tile row/column every time. By that I mean I update 64 tiles and 16 attribute bytes for the rows. That way I don't do any checks which saves time.
And what's also great about this is for all of these updates, all I have to do to "lock" the routine (if the info is partially updated on an interrupt) is change the high byte of the JSR $xxxx for CHR updates and the low bytes for the other routines. In the case that addresses are partially updated and there's an NMI, it will simply go to a routine that wastes the same amount of time that the updating routine would take. I know this is kind of crappy, but I'm willing to make the sacrifice, and besides almost always each update will need to happen each frame.
Bregalad wrote:
I see no problem having the status bar with sprites, but yeah if you would switch to BG and not want 1-screen mirroring, you'd probably have to "move" the status bar when scrolling vertically which could add some complications and more updates, or do it the Kirby way which I find clever, but relies on the fact that the status bar's vertical size is constant.
Are the following observations correct? In other words, is this "the kirby way" as you put it? :
I was looking at Kirby's Adventure and also SMB3 today in FCEUXD, and I think they use a similar technique for the status bar. It appears that vertical scrolling in both games (in primarily horizontal levels anyway) is constrained to the height of two nametables which I think would imply that you could just keep the status bar in one place and never move it, right? After your split point, you just point the PPU to where the status bar is in one of your nametables?
By the way, thank you for posting your raster effects doc here. It helped me get a head start on learning the sprite 0 hit technique.