Super Cars / $2004 reads

by Bregalad on 2008-04-18 (#32778)

Super Cars relies on $2004 reads to time its code. As if this wasn't enough, it also uses the $B3 illegal opcode (seems like it's LAX ($xx),Y). Is this the most insane licensed NES game?

What especially interests me is how it reads $2004. As far as I know, accessing $2007 during rendering is really bad, because not only does it screw up the counters, it can also make the PPU try to read and write at the same time, and things like that.
However, reading $2004 seems to do nothing special: it doesn't increment any counter, so it's possible to read it during rendering without affecting anything badly, right? At least Super Cars and Micro Machines do this.

What I'm more curious about is how we know what is read back. Super Cars seems to load the first byte of the SPR-RAM buffer and stay stuck in a loop until it differs from the value of $2004. After sprite DMA, it makes sense that you read the first OAM byte again: you wrote all 256 of them, from 0 up to 255, so you read the first one back. The address in $2003 is changed during rendering, so reading $2004 will likely output a different byte, and that's how Super Cars times its code to know when VBlank has ended. As far as I know, it's also possible to wait for the sprite-zero hit flag to be cleared for a similar effect.

I ask myself whether it would be possible to abuse this $2004 trick to do much better tricks than this. More specifically, you would put a dummy sprite somewhere on the screen, and it doesn't have to be sprite zero; any number would do. By repeatedly reading $2004 and waiting for it to match your dummy sprite's Y position, is it possible to reliably wait for that Y position? I guess it should be possible.

That would allow simple cartridges to time their code without relying only on sprite zero hits anymore, so if that works then it's possible to do "pseudo sprite zero hits" more than once per frame, with a totally transparent sprite and/or with no background enabled.

Also, if $2003 modifies itself in such a way that you can also read OAM at indexes that are not multiples of 4, this would be much trickier, as you could read an X position, palette byte or tile byte as well. You would have to be sure that your "hit sprite" has a Y position smaller than that of every sprite that potentially has an X position, palette byte or tile number matching its Y position. Sounds like a headache.

Re: Super Cars / $2004 reads
by Disch on 2008-04-18 (#32779)

Bregalad wrote:

More specifically, you would put a dummy sprite somewhere on the screen, and it doesn't have to be sprite zero; any number would do. By repeatedly reading $2004 and waiting for it to match your dummy sprite's Y position, is it possible to reliably wait for that Y position? I guess it should be possible.

Not really... because:

1) Every sprite's Y position is examined every scanline (at least until 8 sprites are found -- then things get weirder). So at best you could time to a certain point within a scanline... but would not be able to pick which scanline (at least not without Sprite-0 hit or something like that -- but if you're doing that then you don't really need any more timing mechanisms)

2) Things besides Y coords are examined when sprites are found to be in range. Attributes, tile numbers, and X positions can all be read back from $2004 acting as potential false positives for your dummy sprite.

3) The PPU moves much faster than the CPU, making catching a single sprite value every frame pretty much impossible. LDA $2004 takes 4 CPU cycles -- in that time the PPU has already passed through 4*3/2 = 6 OAM fetches. So even if you get around the above 2 problems somehow, you'd still only have a 1/6 chance of actually connecting with your dummy sprite value (it doesn't stay on the bus very long)
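
The arithmetic in point 3 can be written out as a small Python sketch. It assumes the simplified NTSC model used in that point (3 PPU dots per CPU cycle, 2 dots per OAM fetch during evaluation); the function name is only illustrative.

```python
# Simplified NTSC model: 3 PPU dots per CPU cycle, 2 dots per OAM fetch
# during sprite evaluation (both figures are taken from point 3 above).
DOTS_PER_CPU_CYCLE = 3
DOTS_PER_OAM_FETCH = 2


def oam_fetches_during(cpu_cycles):
    """OAM fetches the PPU goes through while the CPU spends cpu_cycles."""
    return cpu_cycles * DOTS_PER_CPU_CYCLE / DOTS_PER_OAM_FETCH


lda_abs_cycles = 4  # LDA $2004 (absolute addressing) takes 4 CPU cycles
fetches = oam_fetches_during(lda_abs_cycles)
print(fetches)      # fetches that go by during one LDA $2004
print(1 / fetches)  # rough chance that one read lands on one given fetch
```

The same function gives 10.5 fetches for a 7-cycle compare-and-branch loop, which is the polling rate discussed later in the thread.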

by dvdmth on 2008-04-18 (#32789)

Reading $2004 to detect the end of V-Blank can be useful if you're in a situation where the sprite 0 hit flag isn't always available as an indicator (either because sprite 0 isn't guaranteed to hit or because you have the sprite and/or BG layer disabled on the previous frame). Reading $2004 will return the value $FF on the first 64 PPU cycles of every active scanline (including the dummy line), so unless OAM[0] contains the value $FF, this will provide a very reliable way to tell when rendering starts.

If conditions are right, you can also detect H-Blank by looking for $FF starting in the middle of a scanline (after the initial 64 cycles). If there are no sprites on the following line, and assuming there are no $FF values as Y coordinates, you will see $FF starting at cycle 257 of each line and will continue to see it until cycle 319. If I recall, this is how Micro Machines depends on $2004 reads during active display.

Writes to $2007 during display cause major problems, but reading from $2007 will only corrupt the remainder of the current scanline. Some of Nintendo's own titles, including Zelda and SMB3, read from $2007 during raster effects (apparently the developers thought you had to read $2007 to get the scroll registers to latch, which isn't true).

by Bregalad on 2008-04-18 (#32791)

Oh, thanks for pointing that out. So yeah, reading $2004 can be useful for detecting the start of the frame, but during the frame there isn't much use for it. Detecting HBlank maybe, but this isn't really needed anyway, as you usually position the sprite 0 hit and/or adjust the timing of the code so that writes to PPU registers are done at the correct time.

However, you could count scanlines like that without any timed code. A loop that waits for $2004 to be $ff on 2 consecutive reads, then waits for it to differ from $ff, and repeats while incrementing a counter can reliably count the scanlines (on the condition that no Y-Pos is $ff, which isn't hard to arrange since $ff isn't on screen anyway). The probability that 2 sprites both use tile $ff and are both output via the $2004 buffer on 2 consecutive reads is really low.
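
The counting loop described above can be modelled in Python on a list of values read from $2004. This is only a sketch of the logic, not 6502 code; the read sequence in the example is made up, and `count_scanlines` is a hypothetical name.

```python
def count_scanlines(reads):
    """Count scanline starts in a sequence of $2004 read values.

    A line is counted when two consecutive reads return $FF; the counter is
    re-armed only after a read that is not $FF.
    """
    count = 0
    prev_ff = False
    armed = True  # True while waiting for the $FF run that starts a line
    for value in reads:
        is_ff = value == 0xFF
        if armed and is_ff and prev_ff:
            count += 1
            armed = False
        elif not armed and not is_ff:
            armed = True
        prev_ff = is_ff
    return count


# Made-up reads for one scanline: three $FF reads at the start of the line,
# then other OAM bytes, including a stray single $FF (a tile number, say).
line = [0xFF, 0xFF, 0xFF, 0x20, 0xFF, 0x41, 0x10]
print(count_scanlines(line * 3))
```

Requiring two $FF reads in a row is what keeps the stray single $FF from being counted as a new line.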

by tokumaru on 2008-04-18 (#32794)

I thought the same thing, about the scanline counting I mean. This might be easier and more reliable than timed code if all you're doing is raster effects. Not really useful in an actual game though.

I'm still trying to think of a possible way to use $2004 to detect scanlines... that would be really useful!

by blargg on 2008-04-18 (#32798)

Right, scanline counting wouldn't be of any practical use since it would require timed code between the start of VBL or sprite hit, and the time you started counting scanlines. If you have timed code, you can just time the scanlines as well. Maybe if your timed code varied by a small number of clocks, it would be helpful.

by Disch on 2008-04-18 (#32802)

I just thought of a way to make this work.

Rather than hooking the Y coord, which is next to impossible, perhaps you could hook the X+attribute+tile fetches.

Consider the following:

1) OAM Y coords are fetched
2) If Y coord is in range, tile, attribute and X coord are fetched successively
3) Next Sprite's Y coord is fetched afterward
4) Process repeats

Since tile+X+attribute fetches are all consecutive, this gives you a 6 dot window instead of a 2 dot window to catch the desired value. Not much of an improvement on its own, but if you have 8 identical sprites on the same line, this gives you 6*8=48 matching dots spread over a 64 dot span (the other 16 dots being the Y coord fetches -- so the window has gaps)

If the tile ID, X coord, and attribute fetches are all the same key value -- catching this window is a good possibility. Finding HBlank from here could involve the Micro Machines style "look for FF" or whatever it does, but I don't know how reliable that'd be.

Blah... I'm not organizing my thoughts very well in this post... but basically here's my idea

- reserve a key value in OAM. "$E3" is probably good because this is something the attribute byte can represent (doesn't use any unimplemented bits)

- reserve 8 sprites (not just 1) to be the marker sprites

- make sure these sprites are stored consecutively in OAM (like reserve the last 8 sprites or something)

- give all these sprites an X position of $E3, set their attributes to $E3, and set them to draw tile $E3, and put them all on the same Y coord (note: Y cannot be $E3!)

- don't let ANY OTHER sprite have any value of E3. This means you can't let sprites have X coords of E3.. nor can they use tile E3, etc, etc. Doing so may allow false positives.

Then... at any time $2004 returns $E3, you know it's fetching one of your key tiles, which means you found the desired scanline. And since there are multiple sprites, the window is bigger, so you'll be able to reliably catch the sprite every time.

Note that you can't use the Y coord as the key value. It may seem like a good idea since this would have the window be gap-less, but since Y is fetched on every scanline, it would lead to false positives. You also can't use values the attribute bits can't represent. $E4 is no good, for example, since this would be stored in OAM (and read back) as $E0.
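
The attribute restriction can be checked with a mask. This sketch assumes, as stated above, that the three unimplemented attribute bits (bits 2-4) always read back as 0; the function names are illustrative.

```python
ATTR_IMPLEMENTED_BITS = 0xE3  # bits 7-5 and 1-0; bits 4-2 are unimplemented


def attr_readback(value):
    """Value read back from $2004 after storing `value` as an attribute byte."""
    return value & ATTR_IMPLEMENTED_BITS


def is_attribute_friendly(key):
    """True if `key` survives a round trip through an attribute byte."""
    return attr_readback(key) == key


for key in (0xE3, 0xE4, 0xFE):
    print(f"${key:02X} -> ${attr_readback(key):02X}", is_attribute_friendly(key))
```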

Actually I don't even think you'd need 8 marker sprites...

Code:
    LDA #$E3
loop:
    CMP $2004   ; 4 cycles (6 fetches)
    BNE loop    ; 3 cycles (4.5 fetches)

So you'd be polling every 10.5 fetches -- which means you would only need 3 marker sprites to ensure you don't skip over the window.

I'm too lazy to write a test ROM to try it out ... but in theory it should work. I don't think it'd be any better than sprite-0 hit for most things, though. The only way I can see this being a good option is for turning the PPU on late (to extend VBlank).

You could do that by putting the marker sprites near the top of the screen, then after you finish VBlank, enable sprite rendering only (leave BG rendering disabled) -- then poll $2004 to find the scanline and once it's found, turn BG rendering on.

It's also repeatable. That is, you can put marker tiles on multiple scanlines and do the check several times (but they would have to be at least 9+ scanlines apart, or 17+ if you're using 8x16 sprites).

by Bregalad on 2008-04-19 (#32807)

Oh yeah. This is interesting, but terribly impractical. Even if it worked, the limitations are horrible. Not being able to use tile $e3 is okay since it would most likely be transparent, but not being able to use attribute $e3 and X-Pos $e3 for real sprites seems like a real limitation. Okay, you could cheaply substitute X-Pos $e4 for $e3, hoping nobody notices. But for attributes, that's more of a pain. However, you could just make all your sprites always appear in front of the BG no matter what (most games do this), so the attribute will never be $e3, because that's behind the background.

So we can kind of work around those limitations, but then if you reserve the last n sprites (4 or 8?) for this, they can be bypassed by normal sprites on the same scanline, and wouldn't be fetched. So they have to have front priority, and if there are 8 of them they "hide" other sprites (this could actually be useful in some cases).

So yeah, if that works it could be useful in a case or two, but it's really hard to work with, since it imposes limitations on valid sprites.

Quote:
You could do that by putting the marker sprites near the top of the screen, then after you finish VBlank, enable sprite rendering only (leave BG rendering disabled) -- then poll $2004 to find the scanline and once it's found, turn BG rendering on.

Yeah, but then you couldn't write to $2007 during that time, which is as far as I know the only real reason to "extend VBlank". To hide scrolling glitches maybe? Bleh, not worth the trouble mentioned above.

by tokumaru on 2008-04-19 (#32814)

In my opinion, these limitations are no worse than, say, the ones MMC3's scanline counter imposes on you. I think this idea has potential, especially if it's possible with only 3 sprites, meaning you can still have some sprite action at the same point.

Any tile could be made transparent, so the number used for the tile index shouldn't matter. And since the tile is transparent anyway, the attributes do not matter either. The only serious limitation is the X coordinate. If it was possible to pick an unlikely attribute combination that placed the sprites near the left or right edge of the screen, that would be great.

A value of 0 would be ideal for the X coordinate, but preventing the use of unflipped sprites in front of the background using palette 0 could be a problem... On the other hand, since it's just one of the palettes, there might be a way to avoid a situation like this, through careful use of that palette. It could, for example, be used for a HUD or such things, but not be available for general use by game objects.

by Disch on 2008-04-19 (#32816)

Bregalad wrote:
Yeah, but then you couldn't write to $2007 during that time, which is as far as I know the only real reason to "extend VBlank". To hide scrolling glitches maybe? Bleh, not worth the trouble mentioned above.

The thing that makes turning the PPU on late hard is that the scroll gets funked up unless you set it explicitly with crafty $2006 writes and turn it on at a very specific scanline. This can't normally be done without timed code because you have no way of knowing what scanline the PPU is on, and thus can't adjust your scroll appropriately.

However with this method, you can use the sprite trick to find the desired scanline, then fix the scroll and turn on the BG, letting you turn the PPU on late without any scroll glitches.

Example: You want to turn the PPU on 20 scanlines late. If you give yourself 2 scanlines padding, this means you can leave the PPU off for 18 extra scanlines:

- turn PPU off in VBlank
- allow your writes to spill out past end of VBlank (but don't go past scanline 18)
- once your writes are complete, turn on sprite rendering (but not BG rendering)
- begin looking for marker sprites
- once marker sprites are found, you know what scanline you're on... so you can now adjust your scroll and turn BG rendering on.
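
To put rough numbers on the example above, here is a small Python sketch of how much CPU time the extra scanlines buy. It assumes standard NTSC timing (341 PPU dots per scanline, 3 dots per CPU cycle); the names are illustrative.

```python
DOTS_PER_SCANLINE = 341   # NTSC PPU dots per scanline
DOTS_PER_CPU_CYCLE = 3    # NTSC PPU dots per CPU cycle


def cpu_cycles_in(scanlines):
    """Whole CPU cycles that elapse during the given number of scanlines."""
    return scanlines * DOTS_PER_SCANLINE // DOTS_PER_CPU_CYCLE


late_lines = 20      # rendering starts 20 scanlines late (from the example)
padding_lines = 2    # safety margin before the marker scanline
extra_lines = late_lines - padding_lines

print(extra_lines, cpu_cycles_in(extra_lines))      # extra time for PPU writes
print(padding_lines, cpu_cycles_in(padding_lines))  # slack before the marker
```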

To elaborate further, you would be able to use $E3 for X/attributes for sprites below your marker sprites (kind of like how you can break the MMC3 rules once you're no longer using the IRQ counter -- once you already found your marker tiles, you can break the "can't use $E3 rule").

$E3 would still be a forbidden Y coord for any sprite, though, simply because all the Y coords are fetched every scanline

You might be able to use an attribute un-friendly value like $FE as the marker value. This would leave more gaps in your window since you'd only be able to catch X coord and tile fetches, so to compensate you'd have to have more marker sprites (probably all 8). I don't feel like doing the math to figure out the minimum though.

Also I'm not so sure about the 3 sprite thing anymore -- you might need more. I may have been tired when I came up with those numbers. I'll double-check my math and repost later.

EDIT:

I was right -- 3 sprites isn't enough. I must've been tired.

Here is some crap I did in notepad:

--------------------------------------------------

Y = Y coord fetch
A = attribute fetch
T = tile fetch
X = X coord fetch
- = bad/unimportant fetch
O = OK to catch here
u = OK to catch here only if attribute friendly value used

CMP $2004
BNE loop = 7 cycles = 21 dots

Trying to catch marker during sprite evaluation (dots 64-255):

Code:
YYTTAAXXYYTTAAXXYYTTAAXXYYTTAAXXYYTTAAXXYYTTAAXXYYTTAAXXYYTTAAXX <-- fetches
--OOuuOO--OOuuOO--OOuuOO--OOuuOO--OOuuOO--OOuuOO--OOuuOO--OOuuOO
-                    u                    O                      <-- check every 21 dots
 -                    O                    O                     <-- and every possible sync sequence
  O                    O                    u
   O                    -                    u
    u                    -                    O
     u                    O                    O
      O                    O                    -
       O                    u                    -
        -                    u                    O
         -                    O                    O
          O                    O                    u
           O                    -                    u
            u                    -                    O
             u                    O                    O
              O                    O                    -
               O                    u                    -
                -                    u                    O
                 -                    O                    O
                  O                    O                    u
                   O                    -                    u
                    u                    -                    O

This means:
- would need minimum of 5 marker sprites for attribute friendly key values ($E3). This is true because there is always an 'O' or 'u' in the first two checks. And the first two checks examine 5 sprites.

- would need all 8 marker sprites for attribute unfriendly key values ($FE). This is true because all three checks are needed (sometimes an 'O' isn't found until the 3rd check), and the third check requires all 8 sprites.
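
The two counts above can be reproduced by brute force. This Python sketch uses only the legend row of the diagram and the 21-dot polling period; it is a simplified model (marker sprites back to back, polls exactly 21 dots apart, any starting phase), not a hardware test, and `min_marker_sprites` is a hypothetical helper.

```python
# Legend row from the diagram above, one character per PPU dot:
# '-' = Y fetch (useless), 'O' = tile or X fetch, 'u' = attribute fetch.
LEGEND = "--OOuuOO"
POLL_PERIOD = 21  # CMP $2004 + BNE = 7 CPU cycles = 21 NTSC dots


def min_marker_sprites(good, legend=LEGEND, period=POLL_PERIOD, max_sprites=8):
    """Smallest number of consecutive marker sprites that every poll phase hits.

    Returns None if even max_sprites marker sprites can be missed.
    """
    for n in range(1, max_sprites + 1):
        window = n * len(legend)  # dots covered by n marker sprites
        # 'first' is the dot of the first poll at or after the window start.
        if all(
            any(legend[t % len(legend)] in good
                for t in range(first, window, period))
            for first in range(period)
        ):
            return n
    return None


print(min_marker_sprites(good="Ou"))  # attribute-friendly key such as $E3
print(min_marker_sprites(good="O"))   # attribute-unfriendly key such as $FE
```

Under these assumptions the search gives 5 and 8, the same numbers read off the diagram.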

---------------------------------------------------

Trying to catch marker during sprite fetches (dots 256-319):

Code:
YTAXXXXXYTAXXXXXYTAXXXXXYTAXXXXXYTAXXXXXYTAXXXXXYTAXXXXXYTAXXXXX
-OuOOOOO-OuOOOOO-OuOOOOO-OuOOOOO-OuOOOOO-OuOOOOO-OuOOOOO-OuOOOOO
-                    O                    u
 O                    O                    O
  u                    O                    O
   O                    -                    O
    O                    O                    O
     O                    u                    O
      O                    O                    -
       O                    O                    O
        -                    O                    u
         O                    O                    O
          u                    O                    O
           O                    -                    O
            O                    O                    O
             O                    u                    O
              O                    O                    -
               O                    O                    O
                -                    O                    u
                 O                    O                    O
                  u                    O                    O
                   O                    -                    O
                    O                    O                    O

This means:
- Only 5 marker sprites needed, even with attribute unfriendly value!
- Additionally, the X coord is always caught! So don't need to rely on a tile or attribute fetch!
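
The same kind of brute-force check can be run on the sprite fetch pattern, this time choosing which fetch types count as a catch. Again this is a simplified Python model built from the diagram (8 dots per in-range sprite, polls every 21 dots), with hypothetical names.

```python
FETCHES = "YTAXXXXX"  # fetch type per dot for each in-range sprite (see diagram)
POLL_PERIOD = 21      # CMP $2004 + BNE = 7 CPU cycles = 21 NTSC dots


def min_marker_sprites(caught, fetches=FETCHES, period=POLL_PERIOD, max_sprites=8):
    """Fewest marker sprites so that some poll always lands on a `caught` fetch."""
    for n in range(1, max_sprites + 1):
        window = n * len(fetches)  # dots covered by n marker sprites
        if all(
            any(fetches[t % len(fetches)] in caught
                for t in range(first, window, period))
            for first in range(period)  # dot of the first poll in the window
        ):
            return n
    return None


print(min_marker_sprites(caught="TAX"))  # attribute-friendly key
print(min_marker_sprites(caught="TX"))   # attribute-unfriendly key
print(min_marker_sprites(caught="X"))    # relying on the X coord alone
```

In this model all three cases need 5 sprites, which matches both observations above: the attribute fetch is not needed, and neither is the tile fetch.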

The problem with catching during this part of the scanline is that the other part is run first, so you may catch your desired value there rather than here, which would make finding HBlank more difficult with timed code alone -- you'd need another method.

-----------------------------------------------

Once you find your scanline, you can sync to HBlank by looking for two consecutive $FF values read from $2004 (since the first 64 dots in a scanline will give $FF). Time wasted between these checks should be minimized to avoid muckups. So something like the following code could be used:

Code:
    LDA #$FE        ; marker value
    LDX #$FF        ; FF to sync with HBlank

find_marker:
    CMP $2004
    BNE find_marker

find_hblank:
    CPX $2004
    BNE find_hblank
    CPX $2004
    BNE find_hblank

    ; here, you are between dots 36 and 57 on your desired scanline.
    ; you can then use some timed code to find HBlank, where you'd reset your scroll
    ; and turn BG rendering on

Since catching marker during sprite fetches doesn't require the key value to be attribute friendly, $FE seems like the ideal choice, since it won't conflict with attributes, it's not a good X coord (sprites that would have that coord would be mostly invisible anyway), and it will never be a Y coord ($F0 or $F8 are just as good to hide off-screen sprites)

One thing to note about this HBlank sync method is that $FF can be read back during dots 256-319 if less than 8 sprites are found to be in range on the scanline. Since this increases the sync window, I would recommend you have 8 marker sprites just to fill this out so that you won't ever get $FF until the start of the next scanline. This will make HBlank syncing easier.

Just remember the rules:

1) you need 5 marker sprites (though I recommend 8 to make HBlank syncing more reliable)
2) All marker sprites must be consecutive in OAM*
3) All marker sprites must be found to be in-range. Don't let them get bumped off by visible sprites with higher priority!
4) Use $FE as a key value -- give all your marker sprites an X coord of $FE
5) Marker sprites attribute and tile can be anything (but you might as well use $FE as the tile and have it be transparent)
6) Never use $FE as a Y coord for any sprite
7) Don't use $FE as a Tile number or X coord of any sprite until you're done with all these checks for the frame
8) Refrain from using $FF as a Y coord to hide off-screen sprites to make HBlank syncing more reliable (don't want Y coords to be read back as false positives). $F0 or $F8 are just as good (in fact... better!)
9) Marker sprites should be relatively high priority (low sprite numbers). Sprites 1-9 are good.

*rule 2 isn't really true. It's enough that all marker sprites are found to be in range consecutively (that is, you can have sprites 1,2,3,4,6 be marker sprites, as long as sprite 5 is not in-range on the scanline). But that's a hypertechnicality. It's best to just have them consecutive... so I would say treat that like a rule.

With this new info -- the restrictions don't seem nearly as bad. Attribute conflicts were the biggest problem, IMO, and now that that's out of the way, the biggest problem is making sure sprites don't end up with an X coord of $FE. If this actually works (read: I didn't test any of this, it's all still theoretical at this point... until someone actually makes a ROM to try this) I could see this being very useful.

EDIT AGAIN:

Crap -- just realized that you could get false positives if marker sprites are found to be in range as a 9th+ sprite (which could happen because the PPU gets funky after 8 sprites are found). To avoid this... I added rule 9 above.

ONE LAST EDIT:

I figured this is given, but I should mention it anyway.

This stuff would be applicable to NTSC only. You could do a similar thing on PAL, but due to the slower CPU clock you'd probably need more than 5 marker sprites and some other timing stuff would be different.

by blargg on 2008-04-19 (#32818)

Disch, very nifty scheme. Seems to work with only 3 sprites set aside for it. Plus, a "key" of $00 instead of $E3 seems to work just as well. With the last 3 sprites at line Y, I'm able to have all the other sprites at Y+1; the other 3 bytes of all 64 sprites are filled with the key value. I haven't written a thorough test, but this one at least runs for 255 frames and adjusts the delay before starting polling for each frame. It looks like for some values of Y, the timing can vary up to around 50 PPU clocks, probably when enabling sprite rendering at more pathological times during the scanline (my test waits a scanline after enabling sprite rendering, before polling, otherwise I get much worse glitches).

Let's discuss this on #nesdev on EFNet, if anyone's around...

by Bregalad on 2008-04-19 (#32819)

Quote:
To elaborate further, you would be able to use $E3 for X/attributes for sprites below your marker sprites (kind of like how you can break the MMC3 rules once you're no longer using the IRQ counter -- once you already found your marker tiles, you can break the "can't use $E3 rule").

I'm not that sure. All 256 sprites are fetched every scanline, no matter if you're reading $2004 or not. So any sprite below the marker ones would also be fetched.

Also, your timing diagrams are great, you did an excellent job! However, one sad thing is that they are only valid on NTSC, as PAL doesn't have 3 dots per cycle but 3.2; in other words, while the CPU runs 5 cycles the PPU runs 16 pixels, not 15 as on NTSC. So the 7-cycle loop would be a non-integer number of PPU cycles (22 + 2/5), completely screwing the thing up. The only way to get an integer number of pixels on PAL is to have a loop whose cycle count is a multiple of 5. Since a 5-cycle loop sounds impossible, we'd have to make a 10-cycle loop that reads $2004 and catches a value every 32 PPU cycles, which is every 16 fetches. I don't know if 8 sprites are enough at all in that case.

Also I cannot try anything on hardware because I have no EPROM programmer nor EPROMs; someone should probably test this with a PowerPak (since it's not mapper related, it should be trustworthy).
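
The PAL figures in this paragraph can be checked with exact fractions. This Python sketch assumes only the 16 dots per 5 CPU cycles ratio given above and 2 dots per OAM fetch; the function name is illustrative.

```python
from fractions import Fraction

PAL_DOTS_PER_CPU_CYCLE = Fraction(16, 5)  # 3.2 PPU dots per CPU cycle
DOTS_PER_OAM_FETCH = 2


def dots_between_polls(loop_cycles):
    """PPU dots that pass during one iteration of a polling loop on PAL."""
    return loop_cycles * PAL_DOTS_PER_CPU_CYCLE


for cycles in (5, 7, 10):
    dots = dots_between_polls(cycles)
    whole = dots.denominator == 1  # True when the loop spans a whole number of dots
    print(cycles, float(dots), whole, float(dots / DOTS_PER_OAM_FETCH))
```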

by tokumaru on 2008-04-19 (#32823)

Bregalad wrote:
I'm not that sure. All 256 sprites are fetched every scanline, no matter if you're reading $2004 or not. So any sprite below the marker ones would also be fetched.

It's 64 sprites, not 256... =) Anyway, only the Y coordinate of every sprite is fetched every scanline, the other values are only fetched if the Y coordinate indicates it's in range for the next scanline. So it's safe to use the key value for X coordinates after the tricky spot.

Anyway, I think this is great. It probably is just a matter of using a value that doesn't affect your X coordinates very much. In some aspects this is better than sprite 0 hits, such as reusability, but there's the big disadvantage of wasting up to 8 sprites, instead of only one.

by blargg on 2008-04-19 (#32827)

My tests focus on using this to delay enabling PPU rendering. For that, 3 sprites work very reliably, and 2 sprites work decently (a little more variance in timing). I haven't tried it for mid-rendered-frame switches yet.

by lft on 2008-04-21 (#32877)

Bregalad wrote:
However, one sad thing is that they are only valid on NTSC, as PAL doesn't have 3 dots per cycle but 3.2; in other words, while the CPU runs 5 cycles the PPU runs 16 pixels, not 15 as on NTSC. So the 7-cycle loop would be a non-integer number of PPU cycles (22 + 2/5), completely screwing the thing up. The only way to get an integer number of pixels on PAL is to have a loop whose cycle count is a multiple of 5. Since a 5-cycle loop sounds impossible, we'd have to make a 10-cycle loop that reads $2004 and catches a value every 32 PPU cycles, which is every 16 fetches. I don't know if 8 sprites are enough at all in that case.

But what if you write the loop like this:
Code:
        lda #$fe
find_marker:
        cmp $2004
        nop
        beq found
        cmp $2004
        bne find_marker
found:  ...

The loop takes 15 cycles, which is divisible by 5, but the maximum delay between bus peeks is 8 cycles. This way, 8 sprites might be sufficient on a PAL system as well. Anyone feel like working out a new diagram? =)
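
The cycle counts quoted for this loop can be tallied in Python. The sketch assumes standard 6502 timings (CMP absolute 4 cycles, NOP 2, branch not taken 2, branch taken 3 with no page crossing) and that the $2004 read happens on the last cycle of each CMP; the data layout is just for illustration.

```python
# (instruction, CPU cycles on the "not found" path, reads $2004?)
LOOP = [
    ("cmp $2004", 4, True),
    ("nop", 2, False),
    ("beq found", 2, False),        # not taken while the marker is not seen
    ("cmp $2004", 4, True),
    ("bne find_marker", 3, False),  # taken, assuming no page crossing
]


def read_gaps(loop):
    """Return (cycles per iteration, list of gaps between $2004 reads)."""
    total = sum(cycles for _, cycles, _ in loop)
    read_times = []
    elapsed = 0
    for _, cycles, reads in loop:
        elapsed += cycles
        if reads:
            read_times.append(elapsed)  # the read is the instruction's last cycle
    gaps = [b - a for a, b in zip(read_times, read_times[1:])]
    gaps.append(read_times[0] + total - read_times[-1])  # wrap to next iteration
    return total, gaps


print(read_gaps(LOOP))
```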

by blargg on 2008-04-21 (#32879)

This loop is 10 clocks, also a multiple of 5:
Code:
loop:   bit 0       ; 3
        cmp $2004   ; 4
        bne loop    ; 3

by lft on 2008-04-21 (#32882)

blargg wrote:
This loop is 10 clocks, also a multiple of 5:
Code:
loop:   bit 0       ; 3
        cmp $2004   ; 4
        bne loop    ; 3

That's true, but it takes 10 cycles between every read, so it will require more sprites in order to guarantee a hit, compared to my loop. But we'd have to draw a timing diagram for the PAL case to find out if it crosses the 8-sprite limit or not, and that means work...

by Disch on 2008-04-21 (#32889)

I don't see why having there be an integer number of dots between polls is beneficial. It doesn't matter that 3.2 isn't a whole number -- it's still best to have the smallest gap between polls. That gives you the best chance of hitting a window of any size.

The only number that would be bad is a multiple of 8 (in which case if you miss one sprite, you'll miss all of them because you'll hit each sprite at the same point). Since 22.4 (3.2 * 7) is not divisible by 8, I don't see a problem.

Because the PAL CPU is slower, you'll probably need at least 1 more sprite to get steady results. Then again blargg is apparently getting steady results with only 3 sprites when I calculated you'd need 5 sprites -- so perhaps I'm being overly cautious.

I'm quite sure that even in a worst case scenario, though, using all 8 sprites and the normal 7 CPU cycle loop will work just fine even on PAL.
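
One way to probe this claim is to rerun the evaluation diagram with PAL timing. The Python sketch below is a simplified model with several assumptions: the PAL CPU cycle is 16 master clocks and the PPU dot is 5 (the 3.2 ratio), evaluation has the same YYTTAAXX layout as on NTSC, marker sprites are back to back, and the poll can start at any master-clock phase. It has not been checked against hardware.

```python
CPU_DIV = 16  # PAL master clocks per CPU cycle
PPU_DIV = 5   # PAL master clocks per PPU dot (16 / 5 = 3.2 dots per cycle)
LEGEND = "--OOuuOO"  # evaluation legend from the NTSC diagram, one char per dot


def min_marker_sprites_pal(good, loop_cycles=7, legend=LEGEND, max_sprites=8):
    """Fewest marker sprites every poll phase hits, or None if 8 can be missed."""
    period = loop_cycles * CPU_DIV  # master clocks between $2004 reads
    for n in range(1, max_sprites + 1):
        window = n * len(legend) * PPU_DIV  # marker window in master clocks
        if all(
            any(legend[(t // PPU_DIV) % len(legend)] in good
                for t in range(first, window, period))
            for first in range(period)  # master clock of first poll in window
        ):
            return n
    return None


print(min_marker_sprites_pal(good="Ou"))  # attribute-friendly key such as $E3
print(min_marker_sprites_pal(good="O"))   # attribute-unfriendly key such as $FE
```

In this model the 7-cycle loop needs all 8 sprites with an attribute-friendly key, and an attribute-unfriendly key can still be missed during evaluation (the function returns None).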

by blargg on 2008-04-21 (#32891)

Disch wrote:
Then again blargg is apparently getting steady results with only 3 sprites when I calculated you'd need 5 sprites -- so perhaps I'm being overly cautious.

Three sprites at the end pass for all cases. This is at the end of evaluation. The --##- line shows which bytes are the key (#) and not (-). The R line shows when the $2004 reads occur. Slide it around and one always falls on a #.
Code:
   v-----evaluation-------vv------rendering-------v
...YYTTAAXXYYTTAAXXYYTTAAXXYTAXXXXXYTAXXXXXYTAXXXXXYFFFFFFFYFFFFFFFYF...
...--######--######--######-#######-#######-#######------------------...
   R                    R                    R

by Disch on 2008-04-21 (#32893)

Rendering doesn't occur immediately after evaluation of those sprites, though. Especially not if only 3 sprites on the line are found. There'd be a gap between when sprite 63 is evaluated and when the first in-range data is fetched.

Though I'm sure it still syncs up in a similar fashion.

by Bregalad on 2008-04-22 (#32918)

Does what you all say apply with the marker byte in X Pos + Attribute + Tile #, or with it only in Tile #?