Polygon filling..

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.

by doynax on 2008-03-28 (#32100)

For the last few days I've been toying with some code for drawing polygons in order to create an eventual 3d demo. Anyway things are beginning to get hairy (being a novice to both NES hacking and 3d graphics doesn't help either) so I guess I wanted to ask whether it's already been done and I'm wasting my time, or if it's worth putting in the effort. There's Elite of course and that tank demo (anyone got a live link for the code?) which are most of the way there already, but that's about all I've found. Then again how do you come up with search terms for this kind of thing..

Also do you guys have any thoughts on how to increase the vertical blanking time? Unfortunately I kind of need both CHR-RAM (duh) and SRAM, which according to the wiki seems to limit me to MMC1, among the popular configurations anyway, and MMC1 lacks raster interrupts. An ugly hack would be to cut only the top part of the screen and time the transfer code so as to avoid crossing into the visible section. On the bright side the DMC channel seems to be a fairly high precision interrupt source; would this be a feasible alternative to MMC3 IRQs?

Oh and are the NES emulators really as bad as they seem when it comes to low-level CPU emulation details or have I just had bad luck so far? E.g. Nintendulator doesn't support the SBX instruction, NesterJ doesn't implement RMW dummy writes, FCEUXD SP doesn't update the memory value after ISC/DCP, etc. Nestopia is the only emulator left which still runs my damned code :(

by Bregalad on 2008-03-28 (#32101)

Yeah, very few 3D things are available on the NES right now so you can innovate a lot in this department.
Most games do their 3D with raster tricks (racing games).

The only "real-3D" engines I know of are Elite and that Tank Demo; you can find the sources on the main nesdev page (nesdev.com), but they're crappily documented. I remember another demo that did really basic stuff at decent speed because it used only nametable tricks (no CHRRAM), so that helps a lot. You can use MMC3, MMC5, FME7, .... mappers that can produce accurate interrupts (although MMC3 is a bit crappy), and all of them "officially" have SRAM. MMC1 has no external interrupts, but you can do a lot of things with timed code, and even more with a DMC interrupts + sprite zero hit combo.
The same goes for discrete logic mappers that don't officially support SRAM: you can emulate them with SRAM and even add SRAM on real hardware by adding just a simple 74HC08 AND gate.

Quote:
Oh and are the NES emulators really as bad as they seem when it comes to low-level CPU emulation details or have I just had bad luck so far? E.g. Nintendulator doesn't support the SBX instruction, NesterJ doesn't implement RMW dummy writes, FCEUXD SP doesn't update the memory value after ISC/DCP, etc. Nestopia is the only emulator left which still runs my damned code

Accuracy among emulators is really variable, but I'd suggest not using dark undocumented opcodes. Do they really save you bytes or execution time?

by doynax on 2008-03-28 (#32103)

Bregalad wrote:
The only "real-3D" engines I know of are Elite and that Tank Demo; you can find the sources on the main nesdev page (nesdev.com), but they're crappily documented.
The download link seems to be dead and there's nothing on archive.org, still I don't imagine that there's all that much to learn from it aside from the fact that this kind of thing is at all feasible.

Quote:
You can use MMC3, MMC5, FME7, .... mappers that can produce accurate interrupts (although MMC3 is a bit crappy), and all of them "officially" have SRAM.
MMC5 doesn't seem to have CHRRAM though, and MMC3 cartridges with both SRAM and CHRRAM are said to be "rare" according to the wiki (only used in obscure Japanese titles or something like that). Wiring up RAM rather than ROM shouldn't be all that hard if the mapper itself stays the same, but I'm suspicious of the fact that (almost) no one seems to have done it. At any rate I'd like to stick to basic bog-standard hardware if at all possible, preferably easily tested by people on this board with real hardware equipment.

Quote:
MMC1 has no external interrupts, but you can do a lot of things with timed code, and even more with a DMC interrupts + sprite zero hit combo.
That's what I figured. I'm wary of pitfalls though. Running into interrupt hardware problems halfway through a project is not my idea of a fun experience (anyone remember the dog-slow Commodore serial bus?).

Quote:
Accuracy among emulators is really variable, but I'd suggest not using dark undocumented opcodes. Do they really save you bytes or execution time?
Loads of them actually. My line drawing inner loops use SBX extensively for instance, and LAX and ISC save another few cycles and bytes in the outer loops. We're talking about something like 32k of unrolled code, so space is a factor here.

by tepples on 2008-03-28 (#32104)

Bregalad wrote:
The same goes for discrete logic mappers that don't officially support SRAM: you can emulate them with SRAM and even add SRAM on real hardware by adding just a simple 74HC08 AND gate.

Do you care to add a rewiring guide to the wiki to add PRG RAM to A*ROM/B*ROM/U*ROM boards, so that we know exactly which hardware the iNES board descriptors really specify?

And as for the undocumented opcodes, do any games licensed by Nintendo use them?

by Roth on 2008-03-28 (#32105)

doynax wrote:
The download link seems to be dead...

The links worked fine for me actually:

http://www.iancgbell.clara.net/nestank/

On that page you can grab the source for the tank game, and on this page:

http://www.iancgbell.clara.net/elite/nes/index.htm

You can get the Elite ROM, but I don't see a source anywhere.

by doynax on 2008-03-28 (#32106)

Roth wrote:
doynax wrote:
The download link seems to be dead...
The links worked fine for me actually:
Works for me too, now.. Must have been a temporary problem.
Thanks.

Re: Polygon filling..
by tokumaru on 2008-03-28 (#32107)

doynax wrote:
Unfortunately I kind of need both CHR-RAM (duh) and SRAM which according to the wiki seems to limit me to MMC1

I'm sure the use of CHR-RAM is not as "duh" as you think. Since you'll be working with real 3D and filled polygons, it might be wise to reduce your resolution so that you have any chance of a decent framerate.

CHR-ROM does a pretty decent job emulating large "pixels". If you divide each tile in 4 large pixels, it's possible to have all combinations with the 4 colors fit inside the 256 tiles you have. You can even double the vertical resolution by drawing the image to both name tables (making a 64x120 "pixels" image), and squeeze it inside a single screen using interrupts or timed code (although this will take away time that would otherwise be used to compute the next frame).
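
The arithmetic behind "all combinations fit" is 4 colors to the power of 4 big pixels, which is exactly 256. A minimal Python sketch of how such a tile set could be indexed and generated offline follows. The quadrant order and the function names are assumptions made for illustration; nothing in the thread specifies them.

```python
# Sketch: one CHR-ROM tile per combination of four 4x4 "big pixels".
# Assumed quadrant order: top-left, top-right, bottom-left, bottom-right.

def tile_index(tl, tr, bl, br):
    """Pack four 2-bit colors into a tile number 0..255."""
    for c in (tl, tr, bl, br):
        if not 0 <= c <= 3:
            raise ValueError("colors must be 0..3")
    return (tl << 6) | (tr << 4) | (bl << 2) | br

def tile_pattern(index):
    """Return the 16 bytes of the 2bpp tile: 8 bytes of plane 0, then plane 1."""
    tl, tr = (index >> 6) & 3, (index >> 4) & 3
    bl, br = (index >> 2) & 3, index & 3
    plane0, plane1 = [], []
    for row in range(8):
        left, right = (tl, tr) if row < 4 else (bl, br)
        # Each big pixel covers 4 of the 8 pixels in a row; bit 7 is leftmost.
        plane0.append((0xF0 if left & 1 else 0) | (0x0F if right & 1 else 0))
        plane1.append((0xF0 if left & 2 else 0) | (0x0F if right & 2 else 0))
    return bytes(plane0 + plane1)

# The whole 4 KB pattern table, ready to be written into a CHR-ROM image.
chr_data = b"".join(tile_pattern(i) for i in range(256))
```

With this layout, plotting a big pixel only means recomputing one nametable byte, which is why updates are cheap compared with rewriting CHR-RAM.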

If you use less than 4 colors it might even be possible to fit more pixels inside each tile, increasing the resolution.

I really think this is a better option than CHR-RAM, which would be pretty slow to update, as opposed to the name tables.

Re: Polygon filling..
by doynax on 2008-03-28 (#32108)

tokumaru wrote:
I'm sure the use of CHR-RAM is not as "duh" as you think. Since you'll be working with real 3D and filled polygons, it might be wise to reduce your resolution so that you have any chance of a decent framerate.

CHR-ROM does a pretty decent job emulating large "pixels". If you divide each tile in 4 large pixels, it's possible to have all combinations with the 4 colors fit inside the 256 tiles you have. You can even double the vertical resolution by drawing the image to both name tables (making a 64x120 "pixels" image), and squeeze it inside a single screen using interrupts or timed code (although this will take away time that would otherwise be used to compute the next frame).

If you use less than 4 colors it might even be possible to fit more pixels inside each tile, increasing the resolution.

I really think this is a better option than CHR-RAM, which would be pretty slow to update, as opposed to the name tables.
All true, but frankly 4x4 pixel effects just look a bit too chunky for me (though they're actually quite popular on the C64). I think I'd rather attempt a 1x1 effect at a reduced resolution or frequency instead.
At any rate my back-of-the-envelope calculations seem to suggest that something like a simple spinning cube should be manageable in a 160x160 window at half frame rate (here in PAL-land anyway). Of course I might easily have misjudged things badly so we'll see how it turns out..

by Bregalad on 2008-03-28 (#32109)

Quote:
CHR-ROM does a pretty decent job emulating large "pixels". If you divide each tile in 4 large pixels, it's possible to have all combinations with the 4 colors fit inside the 256 tiles you have. You can even double the vertical resolution by drawing the image to both name tables (making a 64x120 "pixels" image), and squeeze it inside a single screen using interrupts or timed code (although this will take away time that would otherwise be used to compute the next frame).

Heheh, I kind of love these kinds of tricks. Also, if anyone uses that method, the data for it is already in high-hopes.nes so you don't need to draw the 256 tiles manually.
Actual use of this would be hard without CHR ROM midframe bankswitching + 4 screen mirroring, as it uses two pattern tables and two nametables.

I use 4x4 "pixels" graphics in my game, not for 3D, just to say "Game Over" in big letters in a cool fashion (along with some other text as well). It would normally use 16 tiles (monochrome), but I don't use all 16. At least those "pixels" are really big, and I doubt a game could look any good with BG drawn like that.

Quote:
Do you care to add a rewiring guide to the wiki to add PRG RAM to A*ROM/B*ROM/U*ROM boards, so that we know exactly which hardware the iNES board descriptors really specify?

Oh, I may do that, but I don't know where it belongs, since I don't know of any Nintendo-named board which has SRAM wired up with a discrete chip, except Family Basic, which is a special case. ANDing PRG A13, PRG A14, /ROMSEL and M2 and connecting the output to SRAM /CE doesn't seem a complicated task, though.

Quote:
MMC5 doesn't seem to have CHRRAM though, and MMC3 cartridges with both SRAM and CHRRAM are said to be "rare" according to the wiki (only used in obscure Japanese titles or something like that).

You're almost right, but even worse, MMC3+CHRRAM+RAM is used in Final Fantasy III, which was an obscure Japanese title when it was released and is now part of one of the most well-known series, so many avid collectors want this cart. A couple of other actually obscure Japanese titles use the same config too.

by MottZilla on 2008-03-28 (#32111)

I was going to mention Final Fantasy 3, as many people have converted MMC3 boards to use CHRRAM for making repros of FF3. So it can and has been done.

by doynax on 2008-03-30 (#32152)

Okay, so I seem to have hit a snag with the interrupt handling..

How exactly is the DMC clocked and when are the interrupts triggered? In Nestopia an interrupt triggered from the NMI handler jitters by like three whole scanlines, while in FCEU there's just the expected few cycles from normal interrupt latency. Hopefully I've just forgotten to reset something, or Nestopia is the inaccurate emulator, but I also seem to recall reading somewhere that DMC IRQs were kind of unpredictable..

Switching to an MMC3 mapper wouldn't solve my problem either since it counts the PPU's memory accesses and I want to use it to reactivate a blanked screen. Apparently VRC carts can perform real cycle counting but I kind of doubt anyone has the hardware setup for testing with that chip and I still want to avoid non-standard hardware if possible.

Actually the PPU itself also exhibits some rather interesting behavior. When activating the display mid-screen it starts showing the third line of whatever tile row the address register points to. And this happens in both emulators.
Now I haven't seen anything like a guide to PPU raster tricks or a useful model of its internal behavior, so I guess I'm going to have to dig around in some emulator sources *shudder*. Which one is the most accurate when it comes to this kind of thing?

Meh.. Does anyone here have any ideas on how to show a stable vertically centered window with the top and bottom parts blanked?

by tepples on 2008-03-30 (#32155)

"Third row" behavior might have something to do with the fact that the upper nibble you wrote to PPUADDR was 2xxx. Try writing 0xxx, 1xxx, or 3xxx to see what other rows you can trigger. Then see "The skinny on NES scrolling" to learn what's really going on.

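The reason the upper nibble matters is that, while rendering, the PPU treats its 15-bit address register as a packed set of scroll fields. A rough Python sketch of that decoding follows, using the bit layout described in "The skinny on NES scrolling"; the function name is made up for illustration.

```python
def decode_ppu_addr(addr):
    """Split a PPU address into the scroll fields it doubles as during rendering."""
    return {
        "fine_y": (addr >> 12) & 0x7,     # pixel row within the tile
        "nametable": (addr >> 10) & 0x3,  # which of the four nametables
        "coarse_y": (addr >> 5) & 0x1F,   # tile row
        "coarse_x": addr & 0x1F,          # tile column
    }

for addr in (0x0000, 0x1000, 0x2000, 0x3000):
    print(hex(addr), decode_ppu_addr(addr)["fine_y"])
```

Writing $2xxx gives a fine Y of 2, which is the third pixel row of the tile, while $0xxx starts at the top row.
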
by doynax on 2008-03-30 (#32156)

tepples wrote:
"Third row" behavior might have something to do with the fact that the upper nibble you wrote to PPUADDR was 2xxx. Try writing 0xxx, 1xxx, or 3xxx to see what other rows you can trigger. Then see "The skinny on NES scrolling" to learn what's really going on.
Thank you, that does seem to work =)
With that I can display the first row in its entirety, plus with a second DMC interrupt at frequency $f and length 3 I've covered my 176 pixel window perfectly.

Now if only the DMC is (or can be made to be) stable on hardware..

by doynax on 2008-03-30 (#32159)

From studying Nestopia's code it appears as if they simply tell the DMC to run for x cycles at the end of some emulation quantum, and trigger whatever IRQs are needed from within that loop. Not very precise..

Anyway I put together a small test ROM which I'd be eternally grateful if someone could test on hardware for me.

Here are the results on the emulators I happened to have lying around:
FCEU: 7 cycles
Nestopia: 3 scanlines
Nintendulator: 1 scanline
NesterJ: stable
Nesticle: blank screen (*gasp*)
http://www.minoan.ath.cx/~doynax/6502/blanking.zip

by doynax on 2008-03-31 (#32173)

Damn it.. I just realized I have another serious problem.

If the display itself screws up the PPU address register then I have to restore the address when switching back to the graphics transfer 'task'. Of course there's no way to actually read back the current address so I'll have to try to infer it from the return address and register/variable states inside of the IRQ handler.
I suppose it can be done fairly efficiently with some tables and some code, sort of a manual version of the implicit exception handling chains used by most C++ compilers, but it'll also most definitely be nasty and error-prone.

And I thought the interrupt-based approach would be easier than cycle counting.. =(

Oh, and I'm still looking for someone to run the test ROM. I'll convert it to use a mapper, CHR-ROM or whatever if that would help.

by _cgtr_ on 2008-03-31 (#32177)

I just ran your test rom on the real hardware.

Nintendulator's output is the closest match to the real hardware.

by doynax on 2008-03-31 (#32180)

_cgtr_ wrote:
I just ran your test rom on the real hardware.

Nintendulator's output is the closest match to the real hardware.
I've rechecked Nintendulator and it seems to shift by multiple lines, regardless of what I stated earlier, though its emulation is definitely cycle-exact which is ominous. So do you mean that it shifts by a single line, which could probably be fixed by waiting a few extra cycles before turning on the display, or more than one, in which case I'm probably screwed?
In case of the former I've prepared a second test where you can control the delay with the controller, which hopefully should make it possible to stabilize it to never shift vertically.

Anyway, your help is much appreciated. Now I think I'll go further check Nintendulator's code some more to see if I can figure out what's going on.

http://minoan.ath.cx/~doynax/6502/blanking2.zip

by Disch on 2008-03-31 (#32181)

The DMC is constantly running, even when disabled. There's no way to make it stop, really. Instead you just tell it to not fetch bytes and to be silent. However internally its period divider still clocks itself, and its shifter keeps shifting.

Because of this, whenever you "turn on" the DMC it doesn't start actually playing until it finishes its current cycle. Which, depending on the set frequency, could be up to several hundred CPU cycles! This means that you have a very large window where the IRQ could happen, making precise timing pretty much impossible. This is why games only use DMC IRQs as a rough timer, then use sprite-0 hit to fine-tune their timing.

Even at the fastest DMC playback speed, there's 54 cycles between delta shifts. This means you have about a 54*8 = 432 cycle window that the IRQ will fall in (several scanlines!). And it only gets worse with slower playback speeds.
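
To put the "several scanlines" figure in numbers, here is a small Python sketch converting that window into NTSC scanlines. It assumes the usual NTSC ratio of 341 PPU dots per scanline and 3 dots per CPU cycle; the function name is illustrative only.

```python
NTSC_CPU_CYCLES_PER_SCANLINE = 341 / 3  # 341 PPU dots per line, 3 dots per CPU cycle

def dmc_irq_window(cycles_per_bit, bits_per_byte=8):
    """Worst-case IRQ uncertainty in CPU cycles and in NTSC scanlines."""
    cycles = cycles_per_bit * bits_per_byte
    return cycles, cycles / NTSC_CPU_CYCLES_PER_SCANLINE

cycles, lines = dmc_irq_window(54)
print(cycles, round(lines, 1))  # 432 3.8
```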

by doynax on 2008-04-01 (#32193)

Disch wrote:
Even at the fastest DMC playback speed, there's 54 cycles between delta shifts. This means you have about a 54*8 = 432 cycle window that the IRQ will fall in (several scanlines!). And it only gets worse with slower playback speeds.
Gaah! I give up, there just aren't any decent timing values for a centered screen. Not if I want to minimize the amount of busy-waiting (i.e. use high frequencies) and support both PAL and NTSC anyway.

Time to go to plan B: Timed vblank code..

by Disch on 2008-04-01 (#32196)

You can still use DMC IRQs, just fall back to sprite 0 for fine-tuning the timing.

You could make this work by having a chunk of your nametable nothing but blank (transparent) tiles except for a small marker tile a few scanlines above where you want your actual mini-screen to start. Put sprite 0 on top of that marker tile and wait for sprite-0 to hit after the DMC IRQ triggers.

Of course this means:

1) you will need to have the PPU on for the top portion of the screen (can't make PPU accesses during that time). Although you can still turn the PPU off and make accesses after you turn the PPU off for the bottom portion of the screen.

2) You will need to keep a large chunk of one of the nametables empty so that garbage isn't rendered for the top portion. But if the actual display window is smaller, then I don't see this as being a real problem.

3) A small marker tile will be visible above the main window. But note you can get around this by palette manipulation. If you have the BG tile have an entry in its palette that's the same as the backdrop color, and give sprite-0 low priority, then the marker tile will be invisible.

I would think that waiting 8 or so scanlines for sprite 0 hit is far more preferable than waiting for 70 or so in timed loops.

by doynax on 2008-04-01 (#32197)

Disch wrote:
You can still use DMC IRQs, just fall back to sprite 0 for fine-tuning the timing.

You could make this work by having a chunk of your nametable nothing but blank (transparent) tiles except for a small marker tile a few scanlines above where you want your actual mini-screen to start. Put sprite 0 on top of that marker tile and wait for sprite-0 to hit after the DMC IRQ triggers.
Actually I tried that method. You can even blank the top part of the screen as long as you turn it on again right before the sprite 0 hit loop. Then just scroll to the visible portion of the nametable and you get a nice stable display.
The real problem is that the DMC timing granularity is way too high, and by an unlucky coincidence there aren't any reasonable timing values for setting up a 20-22 character high screen. It sort of works on NTSC, but there isn't anything even close on PAL (which happens to be my main target).
Of course I could use zero length samples with frequency $f to get an IRQ every 400 cycles, but that would waste about 10% of the blanking period on interrupt handling.

Disch wrote:
I would think that waiting 8 or so scanlines for sprite 0 hit is far more preferable than waiting for 70 or so in timed loops.
I meant timed code as in rewriting the VRAM upload code to run for a predictable number of cycles and calculate ahead of time just how many tiles I can afford to transfer per interrupt. Then just wait for whatever remainder you've got left and turn on the screen.
This means that I have to split anything unpredictable out of the vblank, costing me a bunch of cycles and memory, but at least it'll be a tad more NTSC friendly since the vertical blanking portion itself will run faster. Plus I won't need any code to figure out the vblank code's PPU address after an interrupt, which ought to be far trickier to write than the cycle counting..

by Disch on 2008-04-01 (#32198)

Quote:
The real problem is that the DMC timing granularity is way too high, and by an unlucky coincidence there aren't any reasonable timing values for setting up a 20-22 character high screen. It sort of works on NTSC, but there isn't anything even close on PAL (which happens to be my main target).

By my math, it looks like PAL actually works out much better. But maybe I fudged up somewhere. Here's what I came up with:

For a 20 character (160 scanline) screen to be centered, you want 40 scanlines above and below the image. On NTSC this works out to needing to wait 20+1+40 = 61 scanlines from NMI, and on PAL this is 70+1+40=111 scanlines.

I cut 9 scanlines from that so that there's wiggleroom (52 lines NTSC, 102 PAL). In CPU cycles this worked out to:

NTSC: 5910
PAL: 10869
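
These figures can be reproduced with a few lines of Python. The sketch assumes 341 PPU dots per scanline with 3 dots per CPU cycle on NTSC and 3.2 on PAL; the helper name is invented for illustration.

```python
from fractions import Fraction

NTSC_CYCLES_PER_LINE = Fraction(341, 3)      # about 113.67 CPU cycles
PAL_CYCLES_PER_LINE = Fraction(341 * 5, 16)  # 341 / 3.2, about 106.56 CPU cycles

def wait_from_nmi(vblank_lines, border_lines, margin_lines, cycles_per_line):
    """Scanlines and whole CPU cycles to wait from NMI, minus a safety margin."""
    lines = vblank_lines + 1 + border_lines - margin_lines  # +1 for the pre-render line
    return lines, int(lines * cycles_per_line)

print(wait_from_nmi(20, 40, 9, NTSC_CYCLES_PER_LINE))  # (52, 5910)
print(wait_from_nmi(70, 40, 9, PAL_CYCLES_PER_LINE))   # (102, 10869)
```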

For PAL, DMC freq $D goes 624 cycles/byte, and $F goes 400 cycles/byte. This worked out to:

Code:
624 * 17 = 10608 <-- freq=D, len=1
+ 400 * 1 = 11008 <-- freq=F, len=0

11008 cycles is a little over, but it should be within the given wiggleroom. Of course you'd want to prefix this with a freq=F, len=0 to keep the window small. So you'd probably end up doing this:

Code:
NMI
freq=F, len=0
IRQ (should happen virtually instantly)
freq=D, len=1
IRQ
freq=F, len=0
IRQ
wait for sprite 0

The first IRQ should happen right away as long as the DMC wasn't still playing something from last frame (which it shouldn't be).... unless my understanding of the DMC is all wrong. Either way it's worth trying.
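
As a sanity check on the PAL chain, here is a short Python sketch of the same sum. It relies on the DMC length register encoding of (value * 16 + 1) bytes, which is why len=1 means 17 bytes, and it uses only the two cycles-per-byte figures quoted above. The helper names are hypothetical, and the initial resynchronising freq=F, len=0 step is left out because its duration depends on where the DMC happens to be in its cycle.

```python
PAL_CYCLES_PER_BYTE = {0xD: 624, 0xF: 400}  # figures quoted in the post

def dmc_sample_bytes(length_reg):
    """The DMC length register encodes length_reg * 16 + 1 bytes."""
    return length_reg * 16 + 1

def chain_cycles(steps):
    """Total nominal duration of a chain of (freq, length_reg) samples."""
    return sum(PAL_CYCLES_PER_BYTE[freq] * dmc_sample_bytes(length)
               for freq, length in steps)

target = 10869  # PAL wait computed above, margin already subtracted
total = chain_cycles([(0xD, 1), (0xF, 0)])
print(total, total - target)  # 11008 139
```

The 139 cycle overshoot is a little more than one PAL scanline, so it eats only a small part of the 9 scanline margin.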

For NTSC the numbers were much uglier. Any 17 length went too long.. even the highest frequency. So the best thing I could find was going with really low frequencies:

Code:
3184 * 1 = 3184 <-- freq=0, len=0
+ 2720 * 1 = 5904 <-- freq=3, len=0

But again, if you prefix that with the freq=F bit, the window should stay relatively small (432 cycles).

by doynax on 2008-04-01 (#32201)

Well, your math seems sound but it's not what I'm actually seeing ;)
With a single $0d interrupt of length $01 triggered from the start of the frame NMI I'm getting up to 24 lines of busy-waiting (~16% of vblank) until the 40th scanline, and I can't imagine the emulators would be (consistently) off by that much. On the other hand $0d is very nearly perfect on NTSC. Feel free to check it out yourself if you want to.
Besides I could hardly afford to waste 10 scanlines as "wriggle room" in any case. I'm going to be uploading up to 240 tiles (maybe 160 on average) per rendered frame, plus the nametable, so every cycle is precious.
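
A rough Python sketch of that budget follows. Several numbers in it are assumptions rather than measurements: a 160-line visible window, a PAL frame of 312 scanlines, and an unrolled load/store pair costing 8 CPU cycles per byte. It also ignores the nametable upload and any interrupt overhead.

```python
from fractions import Fraction

PAL_LINES_PER_FRAME = 312
PAL_CYCLES_PER_LINE = Fraction(341 * 5, 16)  # 341 dots / 3.2 dots per CPU cycle
BYTES_PER_TILE = 16
CYCLES_PER_BYTE = 8  # assumed cost of one unrolled load + PPU store

def blank_cycles_per_frame(visible_lines):
    """CPU cycles per frame during which rendering is off."""
    return int((PAL_LINES_PER_FRAME - visible_lines) * PAL_CYCLES_PER_LINE)

def upload_cycles(tiles):
    """CPU cycles needed to copy the given number of tiles to CHR-RAM."""
    return tiles * BYTES_PER_TILE * CYCLES_PER_BYTE

budget = blank_cycles_per_frame(160)
for tiles in (160, 240):
    need = upload_cycles(tiles)
    frames = -(-need // budget)  # ceiling division
    print(tiles, need, budget, frames)
```

Under these assumptions even the 240-tile worst case fits in two frames of extended blanking, which is in line with a half frame rate target, but with very little slack.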

As I see it I've got the option of either stringing together a chain of $0f interrupts of length zero (with a longer "seed" interrupt), just letting the screen be off-center, using a VRC mapper or trying my hand at timed code. And right now the timed code seems like the cleanest solution; with a decent mapper I could even start the blanking period without any busy-waiting and use DMC for its intended purpose.

edit:
By the way there's a third way of setting the DMC interrupts. Instead of starting the first from the NMI handler you could let the bottom one trigger the top, and constantly calibrate things by way of sprite-0 hit testing and 9-sprites-on-a-line tests (I'm assuming that flag is accurate enough for timing anyway). But I didn't have any more luck with that method when I tried it on an emulator with stable DMC resets.

by doynax on 2008-04-02 (#32222)

Whee! My new predictable transfer method (that is it takes a predictable number of cycles, I haven't gotten around to writing the code which makes the predictions yet) actually fills some polygons now =)
Of course there's nothing "3d" about it yet or any form of animation either for that matter, so I can still only hope my performance estimates are in the right ballpark, but laying down some vertices manually and watching it place characters is oddly exciting.

That is it draws this fugly little "ship":

Using this character set:

by Bregalad on 2008-04-03 (#32228)

Looks promising. Keep the work up.

by doynax on 2008-04-04 (#32238)

Does DMC DMA always steal the same two(?) cycles? That is, can it block the processor arbitrarily, does it wait for the bus to become available, or is it more complicated than that?

For instance the C64's 6510 can't just be halted in the middle of a write period, that is when performing consecutive writes, so the VIC starts trying to stop the processor three cycles early (the worst case is an interrupt/BRK with 3 write cycles) and actually halts it at the first read. Meaning that up to three cycles are truly lost, something which can be exploited in stable raster code by placing an RMW instruction right before the DMA to reclaim two of the cycles..

I ask because I'd like to be able to predict how many cycles will be stolen over a span of several thousand cycles when running DMC. Something which would be a nightmare if it works anything like on the C64. I'm thinking the APU has the same issues but unlike a graphics chip it can afford to fetch the new data a few cycles late, but it's not something I want to be surprised by later on and for all I know the APU might use an even more convoluted scheme.

edit: The wiki claims DMC DMA takes four cycles, which would correspond to three cycles of waiting for a write period to settle plus one cycle for actually fetching the sample. Has anyone put together a test ROM which compares the timings of a sequence of NOPs to that of RMW instructions? Then again other sources have claimed all sorts of numbers for the DMA thievery so I'm not panicking just yet..

by doynax on 2008-04-04 (#32242)

I wrote a test ROM myself only to discover that Nintendulator appears to follow this theory. Well sort of, the numbers don't quite add up to what I'd have expected but an lsr $ff is faster than a sta $ffff,x which in turn is faster than a lda ($ff),y.
I'm uploading the damned thing anyway just because I spent so much time on it ;)

So I guess I'm going to have to figure out a transfer loop where the stores don't "resonate" with the DMC frequency. For instance a simple unrolled lda abs,x / sta ppu_write sequence takes eight cycles per byte, which obviously is a factor in all the DMC frequencies, and thus the DMA would either always steal three or four cycles. In my case that would correspond to 30 or so cycles of variance, more than enough to throw off the sensitive raster code.
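
The "resonance" can be checked numerically. The DMC fetches one byte per 8 output bits, so the interval between DMA fetches is 8 times the rate period and is therefore always a multiple of an 8-cycle loop. The Python sketch below uses the commonly documented NTSC period table, which is not taken from this thread and should be treated as an assumption.

```python
# Commonly documented NTSC DMC periods in CPU cycles per output bit, rates $0..$F.
NTSC_DMC_PERIODS = [428, 380, 340, 320, 286, 254, 226, 214,
                    190, 160, 142, 128, 106, 84, 72, 54]

def fetch_interval(period):
    """CPU cycles between two DMA fetches: one byte per 8 output bits."""
    return period * 8

def phase_drift(loop_cycles, period):
    """How far the DMA slides against the loop from one fetch to the next."""
    return fetch_interval(period) % loop_cycles

print([phase_drift(8, p) for p in NTSC_DMC_PERIODS])  # all zeros: always the same phase
print(phase_drift(7, 54))  # 5: with a 7-cycle loop the fetch lands on a different phase each time
```

A drift of zero means every fetch interrupts the loop at the same instruction boundary, so the per-fetch cost is the same every time and the total error accumulates instead of averaging out.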

Would anyone blame me if I took the easy way out and switched to an MMC3 setup? I'm still caught up in these damned raster tricks when all I really want to do is try out the polygon filling code and play with some 3d math.

by Celius on 2008-04-04 (#32257)

This is always something that I've found interesting.

I was at one point thinking of making a wireframe engine for the NES. This can be done with an altered raycasting equation as well as a reliable line-drawing routine.

I made a sample of this for Qbasic. I don't know how most wireframe 3D works, but I use the raycasting equations to find the placements of points with X,Y, and Z coords on the screen. These points are points where lines intersect. So after finding those points, I draw lines between them and I have wireframe 3D.
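
The post doesn't give the actual equations, so the following Python sketch substitutes a plain perspective divide, which is one common way of placing 3D points on the screen. The focal length, screen centre and cube data are arbitrary values chosen for illustration.

```python
def project(x, y, z, focal=128, cx=128, cy=120):
    """Project a camera-space point onto the screen; None if it is behind the viewer."""
    if z <= 0:
        return None
    return cx + focal * x // z, cy - focal * y // z

# Eight corners of a cube centred 96 units in front of the camera.
corners = [(x, y, 96 + z) for x in (-32, 32) for y in (-32, 32) for z in (-32, 32)]
# An edge joins two corners that differ in exactly one coordinate.
edges = [(a, b) for a in range(8) for b in range(a + 1, 8)
         if sum(p != q for p, q in zip(corners[a], corners[b])) == 1]

points = [project(*c) for c in corners]
lines = [(points[a], points[b]) for a, b in edges]  # feed these to a line routine
```

On a 6502 the division would typically be replaced by a reciprocal table or a logarithm table, but the structure stays the same: project the endpoints, then draw lines between them.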

But if you can get that done, and combine it with what you're doing, you'll have some pretty nice looking 3D.

This might even be cool to have "CG" in NES games. Now you've made me really interested in this. I would love to have 3D stuff in my games. And it would be awesome to have colored 3D instead of just wireframe.

by Bregalad on 2008-04-05 (#32264)

Quote:
This might even be cool to have "CG" in NES games.

Yeah, I've always thought about that too. Not that they have to be 3D calculations or anything, just "pre-rendered" stuff that would look impossible on the NES before one sees it.
By using pre-defined CHR banks, a lot of raster tricks and sprites, I guess there is a lot of stuff to be done in that department.
Quote:
Well sort of, the numbers don't quite add up to what I'd have expected but an lsr $ff is faster than a sta $ffff,x which in turn is faster than a lda ($ff),y.

If I remember right lsr $ff would be 5 cycles, sta $ffff,X would be 4 cycles if X=0 and 5 if X is different from zero (because the lower byte is $ff; if it were $00 the instruction would always be 4 cycles regardless of the value of X, and I highly recommend this if you're doing timed stuff), and lda [$ff],Y is 5 cycles, 6 if a page boundary is crossed. So here the fastest seems to be sta $ffff,X

And why do you use the DMC at all? To play music or for interrupts? If it is the former, just play music with no DMC, and if it is the latter, I thought you were going to bypass interrupt techniques.

EDIT: Look here for timing: http://www.6502.org/tutorials/6502opcodes.html
However I'm pretty sure they got ora $xx and and $xx wrong; they say 2 cycles but that's impossible since the CPU fetches at least 3 bytes for these instructions. And eor $xx is 3 cycles, and I don't see why it would be any different.

by doynax on 2008-04-05 (#32268)

Bregalad wrote:
Quote:
This might even be cool to have "CG" in NES games.

Yeah, I've always thought about that too. Not that they have to be 3D calculations or anything, just "pre-rendered" stuff that would look impossible on the NES before one sees it.
By using pre-defined CHR banks, a lot of raster tricks and sprites, I guess there is a lot of stuff to be done in that department.
There are a number of NeoGeo games which display prerendered CGI sequences, mapped directly (uncompressed) from CHR ROM. Of course there aren't many 1000+ megabit NES cartridges around but something along those lines is certainly feasible with clever compression. For instance you could precalculate many of the time-consuming details when drawing a real 3D sequence (i.e. transformations, clipping/overlapping, line slopes and even some of the CHR allocation details, etc..) and still get relatively small movies. Think something along the lines of the animations from Flashback..
Hell one of my all-time favorite C64 demos is a line art animation (http://noname.c64.org/csdb/release/?id=4358).

Quote:
Quote:
Well sort of, the numbers don't quite add up to what I'd have expected but an lsr $ff is faster than a sta $ffff,x which in turn is faster than a lda ($ff),y.
If I remember right lsr $ff would be 5 cycles, sta $ffff,X would be 4 cycles if X=0 and 5 if X different than zero (because the lower byte is $ff, if it would be $00 the instruction would always be 4 cyles regardless of the value of X, and I highly recommand this if you're doing timed stuff), and lda [$ff],Y is 5 cycles, 6 if a page boundary is crossed. So here the faster seems sta $ffff,X
An absolute indexed LDA run in four cycles by first adding up the index and the low-byte of the address then reading from the resulting address, without page wrapping, in the fourth cycle. Then if it detects a carry it's forced to hijack the ALU for another cycle to increment the high-byte and read from the real address (or something functionally equivalent, though I suspect that it always carries into the high-byte but does so while already reading the dummy byte). Naturally trying to apply this optimization to a store, i.e. adding dummy stores to what could potentially be the wrong address, would be disastrous so it's forced to always performs dummy-reads instead and uses up the full five cycles regardless page crosses.
You can even exploit these dummy reads in some cases. Say you wanted to write to every other byte of VRAM, for instance: then you could use a STA $20ff,x with x=8, making the dummy read access $2007 (the low byte wraps to $07 before the high byte is fixed up) and the real write land on another mirror of it at $2107. Combined with RMW reads and dummy-writes you can access the same register up to four times in a single instruction.
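To make the address arithmetic concrete, here is a minimal Python sketch of the commonly documented model for absolute,X accesses. The helper names are made up for illustration, and it models only addresses and cycle counts, not a real CPU. In this model the STA $20ff,x example with x=8 performs its dummy read at $2007 and its write at $2107, both of which decode to the same PPU register.

```python
def abs_x_access(base, x, is_store):
    """Model an absolute,X access (hypothetical helper, not real emulator code).

    Returns (cycles, dummy_read_address_or_None, final_address).
    """
    low_sum = (base & 0xFF) + x
    # Address used before the high byte has been corrected for a carry.
    unfixed = (base & 0xFF00) | (low_sum & 0xFF)
    final = (base + x) & 0xFFFF
    page_crossed = low_sum > 0xFF
    if is_store:
        # Stores always spend the extra cycle on a dummy read.
        return 5, unfixed, final
    if page_crossed:
        # Loads only pay for the fix-up when the low byte carried.
        return 5, unfixed, final
    return 4, None, final


def ppu_register(address):
    """PPU registers at $2000-$3FFF repeat every 8 bytes."""
    assert 0x2000 <= address <= 0x3FFF
    return 0x2000 | (address & 7)


cycles, dummy, final = abs_x_access(0x20FF, 8, is_store=True)
print(cycles, hex(dummy), hex(final))
```

A load such as LDA $ffff,x with x=0 comes out as four cycles with no dummy read, while the store pays five regardless.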

My point above was that all three versions should take 5 cycles (assuming that the zeropage indirection doesn't cross a page) yet DMC DMA will be able to block some longer than others resulting in varying effective code speeds. A very nasty thing when you're trying to write timed raster code.

Quote:
And why do you use the DMC at all ? To play music or for interrupts ? If it is the former, just play music with no DMC, and if it is the latter, I thought you used to bypass interrupt techniques.
Yes, I'm trying to (ab-)use it to generate raster interrupts in order to prolong the vblank period. Performing the kind of 3D I have in mind involves transferring large quantities of tiles per frame so the normal blanking period just doesn't cut it, not even on a PAL machine.

Quote:
Look here for timing : http://www.6502.org/tutorials/6502opcodes.html
However I'm pretty sure they got ora $xx and and $xx wrong, they say 2 cycles but that's impossible since the CPU fetches at least 3 bytes for these instructions. And eor $xx is 3 cycles, and I don't see why it would be any different.
True, they most certainly take three. Personally I prefer AAY64's opcode lists (http://unusedino.de/ec64/technical/aay/c64/bmain.htm) or Graham's tables (http://www.oxyron.de/html/opcodes02.html) when I need to look up some detail of the 6502 instruction set. The AAY one in particular covers just about everything that's useful to me and is comfortably hyper-linked, and I have yet to discover a single error in either document.

by Celius on 2008-04-05 (#32277)

Pre-Rendered seems like it would take a ton of space. I don't know how much space polygon-fill definitions would take, but my wireframe plan has 6 bytes (Well, probably 12 if I want 16-bit coordinates) per line-def. There are 3 things that define each end-point. The X, Y, and Z coordinates. Pre-defined graphics would take a ton of space compared to this. Even if you didn't calculate the movements, and had pre-defined screens made up of these lines/fills, it would save you a lot of space.

In my game, I think I'm going to have to extend Vblank by 16 scanlines (8 at the top, and 8 at the bottom), and I'm planning to do this with the MMC3. Its scanline counter is kind of a shindig, but it's better than nothing. Extend Vblank is all you're trying to do, right? Then you should use a scanline counter.

by doynax on 2008-04-05 (#32282)

Celius wrote:
Pre-Rendered seems like it would take a ton of space. I don't know how much space polygon-fill definitions would take, but my wireframe plan has 6 bytes (Well, probably 12 if I want 16-bit coordinates) per line-def. There are 3 things that define each end-point. The X, Y, and Z coordinates. Pre-defined graphics would take a ton of space compared to this. Even if you didn't calculate the movements, and had pre-defined screens made up of these lines/fills, it would save you a lot of space.
Naturally pure coordinate definitions are far smaller for anything but the most complex of objects. But by sharing tiles between frames and using lossy compression (i.e. not requiring the tiles to be a perfect match) you can get decent results. Frankly the name tables are more of a problem than the patterns but even they can be compressed by attempting to only record what's changed between frames, plus all the usual techniques of course. I played around with vector quantization a while back and believe me you can get some surprisingly good results with these kinds of techniques..
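The "only record what's changed between frames" idea for name tables can be sketched in a few lines of Python. This is only an illustration under assumed names (nametable_delta and apply_delta are hypothetical); a real encoder would pack the offsets and run-lengths far more tightly.

```python
def nametable_delta(prev, curr):
    """Return (offset, tile) pairs for every cell that differs between frames."""
    return [(i, new) for i, (old, new) in enumerate(zip(prev, curr)) if old != new]


def apply_delta(prev, delta):
    """Rebuild the next frame's name table from the previous one plus a delta."""
    out = list(prev)
    for offset, tile in delta:
        out[offset] = tile
    return out


frame_a = [0] * 960                 # 32x30 name table, all blank
frame_b = list(frame_a)
frame_b[100] = 7                    # two cells change in the next frame
frame_b[101] = 8
delta = nametable_delta(frame_a, frame_b)
print(delta)
```

Only the changed cells are stored, so a frame that differs in two cells costs two entries instead of 960 bytes.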
Quote:
In my game, I think I'm going to have to extend Vblank by 16 scanlines (8 at the top, and 8 at the bottom), and I'm planning to do this with the MMC3. Its scanline counter is kind of a shindig, but it's better than nothing. Extend Vblank is all you're trying to do, right? Then you should use a scanline counter.
Shutting off the PPU at the bottom edge is easy enough with the MMC3's raster IRQs. The problem is extending the blanking into the top border since you need precise timing to reactivate the screen, yet the raster IRQs won't work here since the MMC3 uses the PPU's memory accesses to orient itself.
Now the easy way to solve this is to use sprite-0 hit testing to find the start of the screen (and keep a blank area of the name table to display while waiting for the "hit"). Things get a bit trickier if you occasionally want to return early without using up the full vblank period, but if it's only eight lines we're talking about then you might as well add a single special case for when you're going to be finished before the end of the real blanking period. The important point is that the code should be designed never to run past the blanking area, because trying to abort it with interrupts is very nearly impossible..

In my case I wanted to try to avoid using MMC3 since carts with both PRG and CHR RAM are so rare, thus I have to try to abuse DMC IRQs as a raster interrupt source (something it plainly never was designed for). Also I want to be able to exit vblank arbitrarily since I'm extending it so far.

by Celius on 2008-04-05 (#32284)

Like was mentioned before, FF3j has that combination. I was able to purchase that cart for $8. It was in kind of poor condition, and it didn't have a box/manual, so it was fine.

Well, I suppose if shutting it off on top is a problem, you can shut it off for 16 scanlines on bottom. It's pretty noticeable, but it's an option.

It might even be possible to have some pre-defined line segments to mess around with. But I think most of the time, it would be good just to have a fast line-drawing routine.

This is pretty hard to deal with when you have name tables and pattern tables. If the CHR bank was the size of the name table, it wouldn't be a problem at all. It gives me a headache to think about polygon filling with name tables in mind.

Working with sprites is probably easier in a lot of cases, since you have pixel-based coordinates.

by doynax on 2008-04-05 (#32286)

Celius wrote:
Like was mentioned before, FF3j has that combination. I was able to purchase that cart for $8. It was in kind of poor condition, and it didn't have a box/manual, so it was fine.
It's as much about authenticity, i.e. using a hardware setup I might have been allowed to play with if I had to write this thing back in the day, as being able to test things easily. Plus a good bit of general stubbornness at this point, though I must confess that I'm leaning more towards taking the easy way out every day..

Quote:
Well, I suppose if shutting it off on top is a problem, you can shut it off for 16 scanlines on bottom. It's pretty noticeable, but it's an option.
16 lines looks awfully skewed though. My current PAL timing is 12 pixels off (6 really but the differences get sort of doubled) and that looks pretty bad.. For a game I'd try to keep within 8 or so extra lines if I could.

Celius wrote:
This is pretty hard to deal with when you have name tables and pattern tables. If the CHR bank was the size of the name table, it wouldn't be a problem at all. It gives me a headache to think about polygon filling with name tables in mind.
Line drawing to a character screen isn't all that hard, not in theory anyway. Just keep a copy of the nametable in RAM which starts out as all blanks and then whenever you cross a tile boundary when drawing the lines you check if the cell has already been allocated a unique tile (in which case you reuse it) or if you have to place a new one. Polygon filling is a bit trickier of course but not really by much, and the basic principles are the same..
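The tile allocation scheme described above can be sketched roughly as follows. This is a Python illustration rather than 6502 code, the class and method names are hypothetical, and for simplicity it looks up the tile on every pixel instead of only when a tile boundary is crossed.

```python
BLANK = 0  # tile 0 is reserved as the shared blank tile


class TileScreen:
    """RAM copy of a name table plus pattern data allocated on demand."""

    def __init__(self, cols=32, rows=30, max_tiles=256):
        self.cols = cols
        self.nametable = [BLANK] * (cols * rows)  # starts out all blank
        self.patterns = {}                        # tile index -> 8 row bytes
        self.next_tile = 1
        self.max_tiles = max_tiles

    def tile_at(self, x, y):
        """Return the unique tile for the cell at pixel (x, y), allocating one if needed."""
        cell = (y >> 3) * self.cols + (x >> 3)
        tile = self.nametable[cell]
        if tile == BLANK:
            if self.next_tile >= self.max_tiles:
                raise RuntimeError("out of pattern tiles")
            tile = self.next_tile
            self.next_tile += 1
            self.nametable[cell] = tile
            self.patterns[tile] = [0] * 8
        return tile

    def plot(self, x, y):
        """Set one pixel in a single bitplane of the cell's tile."""
        rows = self.patterns[self.tile_at(x, y)]
        rows[y & 7] |= 0x80 >> (x & 7)


screen = TileScreen()
screen.plot(0, 0)
screen.plot(7, 7)   # same cell, so the tile is reused
screen.plot(8, 0)   # next cell, so a new tile is allocated
print(screen.nametable[:2], screen.next_tile)
```

Cells that no line ever touches keep pointing at the blank tile, which is what keeps the tile count manageable.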
I suppose it *might* even be possible to reuse identical line segments by hashing together the entry and exit coordinates, plus various other bookkeeping and lookup logic per tile. Or even using a bunch of predefined tiles and just handling the parts where lines meet or end dynamically. But the precision necessary for it to actually be able to reuse anything would certainly be depressingly low..

Then again you could create a monochrome bitmapped 184x176 screen with double buffering by packing two tiles into the separate bitplanes of a single pattern and using palettes tweaked to only show one at a time. Without double buffering (and with a mid-screen raster IRQ to switch character sets) it would fill the screen with room to spare. Hardly ideal but it's the route I'd take if I wanted to write a Mario Paint clone for a standard CHR RAM cart..
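A quick Python check of the numbers, plus a sketch of the palette trick. The function names and the colour values ($0F for black, $30 for white) are assumptions chosen for illustration.

```python
def patterns_needed(width, height, tiles_per_pattern=2):
    """Number of 16-byte patterns needed when each pattern holds several 1-bit tiles."""
    tiles = (width // 8) * (height // 8)
    return -(-tiles // tiles_per_pattern)  # ceiling division


def palette_for_plane(plane, bg=0x0F, fg=0x30):
    """4-entry palette in which only the chosen bitplane (0 or 1) is visible.

    A pixel's colour index is plane0_bit + 2 * plane1_bit, so mapping every
    index with the chosen bit set to fg hides the other plane entirely.
    """
    return [fg if (index >> plane) & 1 else bg for index in range(4)]


print(patterns_needed(184, 176))           # 23 x 22 tiles, two per pattern
print(palette_for_plane(0), palette_for_plane(1))
```

184x176 works out to 506 tiles, or 253 patterns, which just fits in one 256-tile pattern table. Flipping buffers is then only a palette change.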

by Celius on 2008-04-05 (#32287)

You know, I just thought about something. If you were to just make "CG" for the NES, you wouldn't even have to do 3D calculations, if you defined intersection coordinates for each frame. Since it would be pre-defined anyway, you could just use X and Y coords. However, if you're looking for not pre-defined 3D looking things, you'd need a Z coordinate.

I'd need to think about line-drawing for a while. But this really makes me want to have some cheesy "CG" movie in my game. But I see what you're saying about keeping a copy of the name table in RAM. It's a pretty simple idea, it's just that it might take forever to actually draw the lines.

by doynax on 2008-04-05 (#32297)

Celius wrote:
You know, I just thought about something. If you were to just make "CG" for the NES, you wouldn't even have to do 3D calculations, if you defined intersection coordinates for each frame. Since it would be pre-defined anyway, you could just use X and Y coords. However, if you're looking for not pre-defined 3D looking things, you'd need a Z coordinate.
Exactly, anything short of the actual line drawing can easily be precalculated to great effect.

Quote:
But I see what you're saying about keeping a copy of the name table in RAM. It's a pretty simple idea, it's just that it might take forever to actually draw the lines
Believe me the overhead of clearing and sending the entire screen to the PPU easily outweighs having to deal with tiles. It's just a matter of optimizing the hell out of the code ;)
And regardless whether or not my implementation proves to be fast enough Elite certainly manages to pull it off.

by Bregalad on 2008-04-05 (#32299)

I guess it should be possible to make CG by using big pixels for background/stuff that needs little detail and (2D) sprites for parts that need detail (such as the heads of characters). That way it can be done in all games with both CHRROM and CHRRAM techniques.

However, real 3D rendering would rock. I guess it should be possible to render simple 3D for objects while the background would still be technically 2D or pseudo 3D.

by Celius on 2008-04-05 (#32315)

I tried to make a PSet function a while back, and it ended up taking about 100 cycles. That's way too long.

But I know I can probably optimize it. Trust me, if I can get my line routine working, that's going to be pretty much half the work for pre-calculated CG. The other half is filling the polygons, which seems like a big headache. Was that a big headache, doynax?

But yeah, that's all that is to doing CG, if you want real-time 3D, you're 2/3 of the way there. The next step would just be making X,Y,Z coords into X and Y coords. I don't know if there's a better way to go about this.

If you want, I can post my math used (Even though it includes some raw trigonometry) for my Qbasic wireframe simulator up here.

by doynax on 2008-04-05 (#32318)

Celius wrote:
I tried to make a PSet function a while back, and it ended up taking about 100 cycles. That's way too long.
Right, plotting a random pixel takes a huge amount of time. The trick is not to redo all that work per pixel because once you've dealt with that first one then figuring how to plot the next is a great deal easier. Then there are about a million dirty tricks you can do to speed things up, and I'd certainly love to brag about some of my wilder bithacks, but I seriously doubt much of it would be useful for much beyond this particular application.

Quote:
But I know I can probably optimize it. Trust me, if I can get my line routine working, that's going to be pretty much half the work for pre-calculated CG. The other half is filling the polygons, which seems like a big headache. Was that a big headache, doynax?
No, not really (well, I suppose it caused its fair share of problems but that had more to do with unpredictable timing than the logic itself). The trick is to not actually draw polygons as such but rather only draw all the outlines and then fill in the gaps in a single pass. Look up a technique called XOR-filling if you want the details, but suffice to say that it comes down to using EOR rather than LDA when sending the tiles to the PPU. Also we're talking full-on concave n-gons with self intersection here, not just those puny triangles today's GPUs deal with ;)
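For anyone who hasn't met XOR-filling before, here is a minimal Python sketch of the general idea, not the demo's actual code. It assumes a 1-bit buffer and fills down each column; the function name and the half-open edge convention are choices made for this illustration.

```python
def xor_fill(width, height, polygon):
    """Draw polygon outlines with XOR, then fill each column with a running XOR.

    polygon is a list of integer (x, y) vertices. Each edge toggles exactly one
    pixel per column in the half-open range [x0, x1), so every column inside
    the shape ends up with an even number of toggles.
    """
    buf = [[0] * width for _ in range(height)]
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        if x0 == x1:
            continue  # vertical edges contribute nothing to a column-wise fill
        if x0 > x1:
            x0, y0, x1, y1 = x1, y1, x0, y0
        for x in range(x0, x1):
            y = y0 + (y1 - y0) * (x - x0) // (x1 - x0)
            buf[y][x] ^= 1  # XOR so that coinciding edges cancel out
    # Fill pass: the running XOR flips at every outline pixel it meets.
    for x in range(width):
        acc = 0
        for y in range(height):
            acc ^= buf[y][x]
            buf[y][x] = acc
    return buf


rect = xor_fill(8, 8, [(2, 1), (6, 1), (6, 5), (2, 5)])
tri = xor_fill(8, 9, [(0, 0), (8, 0), (0, 8)])
print(sum(map(sum, rect)), sum(map(sum, tri)))
```

The fill pass is where the "EOR rather than LDA" remark comes in: the running XOR can be accumulated while the bytes are being copied out, so the fill costs almost nothing on top of the transfer. It also shows why a single misplaced outline pixel corrupts everything after it in that column.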

Quote:
But yeah, that's all that is to doing CG, if you want real-time 3D, you're 2/3 of the way there. The next step would just be making X,Y,Z coords into X and Y coords. I don't know if there's a better way to go about this.
I figure I'm pretty much stuck with a normal 16-bit perspective division. Perhaps there'll be enough free ROM space to set up some huge logarithm or reciprocal tables, but I honestly doubt it'll be a bottleneck.
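As a rough picture of the reciprocal-table alternative, here is a Python sketch of a perspective projection that replaces the division with a table lookup and a multiply. Everything about it is an assumption for illustration: the 8-bit depth range, the 16-bit fixed-point scale, the focal length and the screen centre.

```python
RECIP_BITS = 16
# RECIP[z] approximates 65536 / z; entry 0 is unused padding.
RECIP = [0] + [(1 << RECIP_BITS) // z for z in range(1, 256)]


def project(x, y, z, focal=128, cx=128, cy=120):
    """Project a 3D point to screen coordinates without dividing at run time."""
    if not 0 < z < 256:
        raise ValueError("z must be in front of the camera and fit the table")
    scale = focal * RECIP[z]           # roughly focal / z in 16.16 fixed point
    sx = cx + ((x * scale) >> RECIP_BITS)
    sy = cy - ((y * scale) >> RECIP_BITS)
    return sx, sy


print(project(64, 32, 128))
print(project(10, 10, 64))
```

A 255-entry table of 16-bit values is about half a kilobyte; a wider depth range would need a correspondingly larger table or a logarithm-based scheme.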

by tokumaru on 2008-04-05 (#32319)

Celius wrote:
If you were to just make "CG" for the NES, you wouldn't even have to do 3D calculations, if you defined intersection coordinates for each frame. Since it would be pre-defined anyway, you could just use X and Y coords.

Do you know Another World/Out of this World and Flashback? Both have cool animated sequences achieved through a technique that can't be much different than what you described. It's really cool how it's possible to have those long animated sequences, because 2D points need so little storage space.

Sadly, there aren't NES versions of those games, so I can't say how good it'd perform on our good old system.

by doynax on 2008-04-05 (#32321)

tokumaru wrote:
Do you know Another World/Out of this World and Flashback? Both have cool animated sequences achieved through a technique that can't be much different than what you described. It's really cool how it's possible to have those long animated sequences, because 2D points need so little storage space.
Looking at Another World it's staggering just how much pure atmosphere and, well, storytelling the guy managed to squeeze into a few measly polygons. And that most paradoxical feeling of freedom, in a totally linear game where you spend 95% of the time repeating the same twitch sequence over and over again..
All in all it's easily one of the most fascinating computer games of all time.

But I fear it would be unfeasible on the NES without a lot of custom hardware. Feel free to prove me wrong though ;)

by Celius on 2008-04-05 (#32322)

Yeah, I've played those, I think they are really cool. They have pretty good graphics. I think Out of this World has better graphics though.

I never even thought about doing pre-defined 3D. Not in the sense where you have pre-defined images, but pre-defined X,Y coords based off of X, Y, and Z coords. When I find a way to do this, I'll definitely be having some movies in my games.

I suppose filling large polygons wouldn't be so hard. I was thinking that it would be cool to find a way (I'll look into this AFTER I look into filling polygons regularly) to fill polygons with a mixed color (Like the checkerboard technique where you alternate between colors every other pixel to create the illusion of another color), but that would probably be a lot of work.

It might even be easier to do the opposite in many of these cases. Instead of filling the polygons, delete what lies outside the borders. But that would assume that polygons aren't connected, which is preposterous.

But I'll look up XOR filling. It sounds pretty interesting.

by Bregalad on 2008-04-06 (#32328)

If you're going to make CGs, why deal with polygons at all ? I mean you can use them to create your CG, but once they're done you can "convert" images to NES without using any polygons. Just have parts that need detail use sprites, like people, even if that means one full sprite per frame, and parts that need less detail should fit in 256 tiles at once using BG.

I guess drawing a line is really easy, you know. Basically if you want a line that passes through two points, which I'll call (x1;y1) and (x2;y2), then math tells you that the equation of your line is y=m*x+(y1-m*x1), where m=(y1-y2)/(x1-x2) (the slope, i.e. the derivative, of your line). Then if you want the line to be a segment, you should just restrict it to the (x;y) coords where x1<=x<=x2 and y lies between y1 and y2 (assuming you reorder your points so that x1 is smaller than x2, etc....)

Of course if you actually want to make it fast, you'll need a trick or two. I'd suggest first calculating the slope of your line (y1-y2)/(x1-x2), but if you want to get rid of the division, just calculate y1-y2 and x1-x2 separately. Then start on the first point, increment your Y coordinate by a fractional multiple of (y1-y2) and your X coordinate by a fractional multiple of (x1-x2) and repeat until you reach the end point.

by doynax on 2008-04-06 (#32330)

Bregalad wrote:
Of course if you actually want to make it fast, you'll need a trick or two. I'd suggest first calculating the slope of your line (y1-y2)/(x1-x2), but if you want to get rid of the division, just calculate y1-y2 and x1-x2 separately. Then start on the first point, increment your Y coordinate by a fractional multiple of (y1-y2) and your X coordinate by a fractional multiple of (x1-x2) and repeat until you reach the end point.
The classic way to avoid the division is to use Bresenham's algorithm, which keeps the slope as a fraction (an integer numerator and denominator) all the way through rather than ever dividing anything. Bresenham doesn't suffer from any precision problems either, and it tends to look pretty cheesy if the slope is off-by-one so the lines don't actually line up (which can easily happen with 8-bit precision unless you're careful).
Of course the Bresenham innerloop isn't quite as fast as working with a precomputed fractional slope, since with the latter you can just add the fractional value to a running error accumulator and see if it overflowed to determine whether to jump to the next pixel on the minor axis. Besides that you often get the slope as a by-product of something else anyway, like clipping.
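The two approaches can be compared side by side in a short Python sketch. Both functions are illustrative only: bresenham is the usual integer formulation, and fractional_line is a hypothetical x-major routine that assumes 0 <= dy < dx and an 8-bit slope fraction.

```python
def bresenham(x0, y0, x1, y1):
    """Integer-only line: the error term tracks the slope exactly."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx - dy
    points = []
    while True:
        points.append((x0, y0))
        if (x0, y0) == (x1, y1):
            return points
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            x0 += sx
        if e2 < dx:
            err += dx
            y0 += sy


def fractional_line(x0, y0, x1, y1):
    """x-major line using a precomputed 8-bit slope and a carry test per pixel."""
    dx, dy = x1 - x0, y1 - y0
    assert 0 <= dy < dx
    slope = (dy << 8) // dx    # the single division, truncated to 8 bits
    acc = 0x80                 # start half-way so the line rounds to nearest
    points, y = [], y0
    for x in range(x0, x1 + 1):
        points.append((x, y))
        acc += slope
        if acc > 0xFF:         # "carry set": step on the minor axis
            acc &= 0xFF
            y += 1
    return points


print(bresenham(0, 0, 5, 2))
print(fractional_line(0, 0, 200, 199)[-1])
```

On the short line both give the same pixels, but on the long, steep-ish one the truncated 8-bit slope ends a pixel short at (200, 198) while Bresenham lands exactly on (200, 199). That is the kind of mismatch that breaks lines which are supposed to meet.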

And then there's my method of cheating by precalculating everything..

by tokumaru on 2008-04-06 (#32331)

Hey, Bresenham is THE way to draw lines. I can't think why anyone would use anything else... It's really simple to implement in 6502 ASM too. Actually modifying the pattern tables is the hard part.

by doynax on 2008-04-06 (#32334)

tokumaru wrote:
Hey, Bresenham is THE way to draw lines. I can't think why anyone would use anything else... It's really simple to implement in 6502 ASM too. Actually modifying the pattern tables is the hard part.
It certainly is a beautiful algorithm but it's hardly the most efficient way to do things. On the 6502 you really need the accumulator both to deal with the Bresenham error term and to combine the pixels, which involves at least a pair of transfers per pixel on the minor axis. Plus it's tricky to avoid the second "correcting" addition to get the error term positive again, especially if the loop is already unrolled on other things.
I've managed to get my "calculation" down to a single two-cycle SBX #$?? per iteration by precalculating the slopes and employing a bithack or two..

As for the pattern table the key to efficiency is to unroll the loop for the entire tile, that is have 8x8 pieces of code to deal with all the possible pixels within a tile. Naturally this wastes a great deal of ROM space and just keeping the branches in range can be a nightmare but the performance gained is easily worth it.

(And, yes, I'm inordinately proud of my line drawing code. So sue me..)

by tokumaru on 2008-04-06 (#32335)

Oh, I didn't mean it was good for your demo. You certainly have made the right choices. I don't think there is a better way to handle that sort of rendering than using a lot of tables. When I was planning a raycaster for the NES (can't say if I'll ever make it) it also used a great deal of pre-calculated values.

I meant that, as a general line-drawing algorithm, Bresenham was better than anything that included divisions, for example.

by Celius on 2008-04-06 (#32337)

Yeah, actually, I think it might just be a good idea to not try to PSet directly into the pattern tables. I would take a certain section of RAM and dedicate it to holding copies of the lines. The most calculating you'd have to do would be for the placement on the name tables. Actually, it'd be like doing the PSet function on the name tables with tiles. I can see how this is possible. The only thing I don't get at this point is how to assure that you won't have repeating tiles.

Bregalad wrote:
If you're going to make CGs, why deal with polygons at all ? I mean you can use them to create your CG, but once they're done you can "convert" images to NES without using any polygons. Just have parts that need detail use sprites, like people, even if that means one full sprite per frame, and parts that need less detail should fit in 256 tiles at once using BG.

Because complete pre-definition takes up too much space. If you can calculate it, you don't have to use nearly as much space. 6 bytes for one line. It's 16 for one tile! If there was a line 256 pixels long, that would be 32 tiles. That is 512 bytes! Why waste all that space when you can just use 6 bytes and a little math? Oh, and if you're wondering why you'd want to use crappy looking 3D compared to nice/colorful/detailed sprites, it's because it's funny! And also, if people would have been playing the game back in the day, they would have crapped their pants in awe of the 3D graphics.
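The storage comparison above is easy to check with a couple of lines of Python. The constants come straight from the post (6 bytes per line definition, 16 bytes per tile); the function name is just for illustration, and it only covers the horizontal-line case used in the example.

```python
BYTES_PER_TILE = 16      # 8 rows x 2 bitplanes
BYTES_PER_LINE_DEF = 6   # two end points with 8-bit X, Y and Z each


def tiles_for_horizontal_line(length_px):
    """Tiles touched by a horizontal line starting on a tile boundary."""
    return -(-length_px // 8)  # ceiling division


tiles = tiles_for_horizontal_line(256)
prerendered_bytes = tiles * BYTES_PER_TILE
print(tiles, prerendered_bytes, BYTES_PER_LINE_DEF)
```

A diagonal line of the same length would touch even more tiles, so the gap only widens.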

I don't know how many bytes it would be for filling at this point, but I'm sure it's nothing big. For tiles, the most it would be is as many tiles as there are lines + 1 (For a completely solid tile).

Well, it seems that the amount of space required to do this is 5k (1k for a copy of the name table, and 4k for the copy of the BG part of the pattern table). Guess what the really good news is for me. In my game, I use SRAM to hold saved game information. However, it only takes up a little more than 3k! So the really great news is that I can make up for that "little more than 3K" by predefining the solid tiles. That makes me excited.

Another thing that worries me is drawing a line in the same 8-pixel field as another line. But I'm sure it's possible.

EDIT: Oh, and the Bresenham line algorithm is really the way to go, I think. Since the only multiplication/division it requires is by multiples of 2.

by doynax on 2008-04-07 (#32397)

I think I've finally implemented a working raster "system", that is making sure that the char and nametable transfers can run properly during an extended blanking period, and ironed out the remaining line drawing bugs and fill-convention issues. You see the drawback with XOR-filling is that off-by-one errors can screw up the rest of the (vertical) scanline, so you have to be damned sure that your lines actually reach their proper destinations..

So without further ado I reveal to you the very first proof-of-concept demo of filled polygons for the NES: http://www.minoan.ath.cx/~doynax/6502/polynes.nes
To run this you'll need an emulator with a fairly complete implementation of illegal opcodes, which seems to exclude everything except Nestopia out-of-the-box (though at least FCEU can be trivially patched). Hardware-wise it's just a 128k MMC1 cart with SRAM and CHR-RAM (Zelda without battery backing in other words), but I have no illusions that it'll work on a real console just yet. Oh, and it's written for NTSC since I wanted to target the lowest common denominator.

Now there's still a great deal of code cleanup/polish, compatibility testing and optimization left to do before I can move on to actual 3D work. But with a bit of luck it shouldn't be too long now (yeah, right..)

Alas, my only regret is that I failed to provoke anyone into stating that polygon filling is impossible ;)

by tokumaru on 2008-04-07 (#32400)

Pretty smooth animation, but I get most of the frames all garbled, very few ones are glitchless... is that normal?

doynax wrote:
Alas, my only regret is that I failed to provoke anyone into stating that polygon filling is impossible

Ahhh... proving people wrong is more exciting, huh? Don't worry, you seem to be doing a great work anyway! =)

I'm one that believes that the NES is capable of much more than what has already been done with it. It's just a matter of figuring out HOW.

by doynax on 2008-04-07 (#32402)

tokumaru wrote:
Pretty smooth animation, but I get most of the frames all garbled, very few ones are glitchless... is that normal?
No.. But then I don't expect much in the way of compatibility yet either. It's supposed to work, without glitches even, in Nestopia 1.37 and my patched FCEUXD-SP version. Which happen to be the only emulators I know of which can run it to any degree at all. What did you try it in?
tokumaru wrote:
doynax wrote:
Alas, my only regret is that I failed to provoke anyone into stating that polygon filling is impossible ;)
Ahhh... proving people wrong is more exciting, huh? Don't worry, you seem to be doing a great work anyway! =)
It's not so much proving people wrong (however enjoyable that is) as doing "the impossible" that is a great motivator.

by tokumaru on 2008-04-07 (#32404)

doynax wrote:
What did you try it in?

Nestopia 1.37, weird...

doynax wrote:
It's not so much proving people wrong (however enjoyable that is) as doing "the impossible" that is a great motivator.

I get it, but if you did it, then it wasn't impossible. So, if you think about it, the fact that nobody said it's impossible is a sign that they have faith in you... =)

by doynax on 2008-04-07 (#32405)

tokumaru wrote:
doynax wrote:
What did you try it in?

Nestopia 1.37, weird...
I have an idea.. You haven't disabled the 8-sprite limit by any chance, have you? I'm using it for timing so disabling it screws things up pretty badly.
Quote:
I get it, but if you did it, then it wasn't impossible. So, if you think about it, the fact that nobody said it's impossible is a sign that they have faith in you... =)
Or, more likely, that they've never heard of me but don't particularly care in either case :)

by tokumaru on 2008-04-07 (#32407)

doynax wrote:
I have an idea.. You haven't disabled the 8-sprite limit by any chance, have you?

That's not it... when I disable it, the screen jumps around, so it becomes worse. With the limit enabled, the tiles are just messed up, but they are in the correct places.

EDIT: Most frames are a variation of this:

by doynax on 2008-04-07 (#32408)

tokumaru wrote:
doynax wrote:
I have an idea.. You haven't disabled the 8-sprite limit by any chance, have you?

that's not it... when I disable it, the screen jumps around, so it becomes worse. With the limit enabled, the tiles are just messed up, but they are in the correct places.
Okay, I'm stumped..
I've tried downloading a fresh (Win32) version of Nestopia 1.37 and fetching my own ROM from the server. And it works out-of-the-box.
Worse, I can't find any setting to tweak besides the sprite limit that fucks it up. Even PAL mode actually works, albeit slowly.

Any chance you could send me your emulator + ROM set just as a sanity test?

by Celius on 2008-04-07 (#32409)

Very nice demo! It really gives me hope that I can also implement 3D in my games.

Now, why do you deal with illegal opcodes? Just curious.

And I really like that it's NTSC, so that really means it's possible, since that has a really short Vblank time.

I have to say, I am quite envious =). I still have a lot of work to do in order to get polygon filling (I basically am at square 1). Seriously, you are 2/3 of the way to a 3D game. All that's left is turning X,Y,Z coords into X, Y coords, and you have it.

You have put me on a mission to make polygon filling on my games. But congratulations on that demo, it looks really good.

EDIT: It worked for me in Nestopia 1.09.

by tokumaru on 2008-04-07 (#32410)

Maybe I'm doing something wrong... I'll get a fresh copy too, I'll let you know. Maybe the ROM got corrupted somehow?

by doynax on 2008-04-07 (#32411)

tokumaru wrote:
Maybe I'm doing something wrong... I'll get a fresh copy too, I'll let you know. Maybe the ROM got corrupted somehow?
Possibly, but I seriously doubt a corrupted ROM would've worked even this well. What you have looks just like the initial EOR bytes were corrupted (not initialized to zero, or similarly screwed up by a previous frame). Odd..
edit: I just realized that you're running the slightly earlier version with a square and a slightly smaller size which I originally uploaded. So I suppose that the problem must have had something to do with that. I guess it must be some bug in the line drawer which was only exposed by the early version and if you waited just long enough, or something along those lines, but I'll be damned if I can see how that could be as I've been using Nestopia as my main emulator all along.
Celius wrote:
Now, why do you deal with illegal opcodes? Just curious.
Simply because they save a lot of time and space (plus they're often just plain convenient). I could certainly get by without them but it would slow things down.

Want to load X from an indirect zeropage address? Use LAX (zp),y
Perhaps you want carry to be cleared after a right shift. Just use ASR #%11111110.
Let's say you want to subtract 5 from X and don't mind clobbering A. Then use TXA / SBX #5.
Need to decrement a counter and compare it to a final value in some outerloop? Why not write LDA #limit / DCP counter.
And so forth..
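For readers who haven't used these opcodes, their commonly documented behaviour can be modelled in a few lines of Python. This is a simplified sketch: the CPU class is a stand-in invented for illustration, the negative flag is left out (and zero is only tracked for DCP), and the mnemonics follow the naming used in this thread.

```python
class CPU:
    """Tiny stand-in for the registers and flags used by the examples."""

    def __init__(self):
        self.a = 0
        self.x = 0
        self.carry = False
        self.zero = False
        self.mem = bytearray(0x10000)


def lax(cpu, value):
    """LAX: load the same value into both A and X."""
    cpu.a = cpu.x = value


def asr(cpu, imm):
    """ASR #imm (also called ALR): A = (A AND imm) >> 1, carry = bit shifted out."""
    masked = cpu.a & imm
    cpu.carry = bool(masked & 1)
    cpu.a = masked >> 1


def sbx(cpu, imm):
    """SBX #imm (also called AXS): X = (A AND X) - imm, carry set like CMP."""
    result = (cpu.a & cpu.x) - imm
    cpu.carry = result >= 0
    cpu.x = result & 0xFF


def dcp(cpu, address):
    """DCP: decrement memory, then compare A against the new value."""
    cpu.mem[address] = (cpu.mem[address] - 1) & 0xFF
    diff = cpu.a - cpu.mem[address]
    cpu.carry = diff >= 0
    cpu.zero = (diff & 0xFF) == 0


cpu = CPU()
cpu.x = 20
cpu.a = cpu.x          # TXA
sbx(cpu, 5)            # X = 20 - 5
print(cpu.x, cpu.carry)
```

With a mask whose bit 0 is clear, such as %11111110, ASR can never shift a one into carry, which is the "carry cleared after a right shift" trick. DCP leaves the zero flag set exactly when the decremented counter equals the limit held in A.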

Quote:
And I really like that it's NTSC, so that really means it's possible, since that has a really short Vblank time.
Actually the visible area is only 160 scanlines, the rest is just one great extended blanking period (by some 400%). Using only the standard 20-lines wouldn't be even remotely enough, sorry..
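The "some 400%" figure can be sanity-checked with a little arithmetic, sketched here in Python. It assumes the usual NTSC figure of 262 scanlines per frame and counts every non-displayed line as blanking, which is a simplification; the function names are made up.

```python
NTSC_SCANLINES = 262      # total scanlines per NTSC frame
STANDARD_VBLANK = 20      # the normal vertical blanking period


def blanking_lines(visible_lines):
    """Scanlines per frame during which nothing is being displayed."""
    return NTSC_SCANLINES - visible_lines


def extension_percent(visible_lines):
    """How much longer the blanking is than the standard 20 lines, in percent."""
    extra = blanking_lines(visible_lines) - STANDARD_VBLANK
    return 100 * extra / STANDARD_VBLANK


print(blanking_lines(160), extension_percent(160))
```

With 160 visible lines that leaves 102 lines of blanking, roughly five times the standard period.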

Quote:
I have to say, I am quite envious =). I still have a lot of work to do in order to get polygon filling (I basically am at square 1). Seriously, you are 2/3 of the way to a 3D game. All that's left is turning X,Y,Z coords into X, Y coords, and you have it.
Keep in mind that this method is far from general-purpose as it uses up enormous amounts of RAM and ROM, plus it has issues with overlapping objects in particular. But something similar could certainly be used for cut-scenes.

by tokumaru on 2008-04-07 (#32417)

Got the new ROM, both work fine in Nestopia 1.35, but are screwed up in 1.37 (even the one with the triangle). There's got to be something wrong with my Nestopia, since nobody else has seen those effects. I'll just download it again and check.

EDIT: Found it, I had some stupid cheat enabled... that probably screwed up your logic. Sorry about that, but I would expect an emulator to make the cheats game-specific. Sorry again for wasting your time... X-[

by Celius on 2008-04-07 (#32425)

doynax wrote:
Want to load X from an indirect zeropage address? Use LAX (zp),y
Perhaps you want carry to be cleared after a right shift. Just use ASR #%11111110.
Let's say you want to subtract 5 from X and don't mind clobbering A. Then use TXA / SBX #5.
Need to decrement a counter and compare it to a final value in some outerloop? Why not write LDA #limit / DCP counter.
And so forth..

Quote:
And I really like that it's NTSC, so that really means it's possible, since that has a really short Vblank time.
Actually the visible area is only 160 scanlines, the rest is just one great extended blanking period (by some 400%). Using only the standard 20-lines wouldn't be even remotely enough, sorry..

Umm... I have NEVER heard of those! Wow, that's really weird. How does that work? And why are they illegal?

That's okay that it takes that long. Still, this is really cool, and you could do a lot with this.

And why does it use an enormous amount of ROM? Is there a bunch of pre-defined stuff?

by tepples on 2008-04-07 (#32434)

I don't think "illegal" is the right word, as there's no law against using them.[1] Perhaps "undocumented" is the right word.

A decent-size portion of the 6502 die is a gate array that decodes the opcodes. But it's full of "don't care" values for undocumented opcodes, and these "don't care" values result in a partial decoding "between" documented opcodes. This often means the behaviors of multiple opcodes get stacked together, resulting in the strange behaviors.

Wikipedia tells more.

[1] There might be a law that indirectly bans them in the context of a monopoly holder's business practice. If Nintendo holds a copyright on 10NES, along with a de facto monopoly on video games, along with a refusal to approve any game whose program uses undocumented opcodes, then the undocumented opcodes are in fact illegal. But otherwise, no.

by doynax on 2008-04-08 (#32445)

Celius wrote:
Umm... I have NEVER heard of those! Wow, that's really weird. How does that work? And why are they illegal?
The drawback of the illegal instructions is of course that many emulators, NES clones and assemblers don't support them or only have partial support.
Anyway as tepples said they're combinations of existing instructions' operations and addressing modes, introduced by accident rather than design. As such some have quite complicated or even unstable behavior. Furthermore different authors have given them different names, or names based on what their behavior was originally thought to be, which turned out wrong.
The opcode lists I posted earlier (AAY64, Graham's opcode list) cover many of the details. Finding uses for these often involves a bit of ingenuity as well as adapting your code to fit them. At first almost all of them seemed rather pointless to me, but I've since found uses for most of them along the way, until nowadays I wouldn't even want to write 6502 code without them.

Here's a non-obvious but illustrative example of the kind of ways you may find to (ab-)use them:
Let's say you want to initialize an array to an ascending sequence of numbers (say 0, 1, 2, 3, 4, 5, 6 and 7) as fast as possible. The obvious way to do this is to start out with zero and use INX/INY to get to the next number. This being a single-byte, two-cycle instruction is as good as it gets, right? I mean don't you have to generate each new number somehow anyway? Not with illegals.. ;)
You see there is this instruction called SAX which stores the A ANDed together with X into memory. You can thus store three possible values directly from these registers, that is A and X themselves as well as the combined AND.
Code:
LDA #%00000001
LDX #%11111110
CLC
SAX v0 ;; %0000 = %0001 & %1110
STA v1 ;; %0001
ADC #%00000010
SAX v2 ;; %0010 = %0011 & %1110
STA v3 ;; %0011
ADC #%00000010
SAX v4 ;; %0100 = %0101 & %1110
STA v5 ;; %0101
ADC #%00000010
SAX v6 ;; %0110 = %0111 & %1110
STA v7 ;; %0111
Cute, eh? And there is an even trickier version which uses SBX to avoid clearing carry.
The point is that successfully using the illegals often comes down to realizing some particular property of these instructions which you just can't get from the normal ones and then finding a way to exploit it.
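The listing above can be checked off-target with a small C model. This is only a sketch: `sax` and `fill_ascending` are made-up names, and the model assumes SAX stores A AND X without changing either register or any flag.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of SAX: the value written to memory is A AND X. */
static uint8_t sax(uint8_t a, uint8_t x) { return a & x; }

/* Fill out[0..7] with 0..7 the way the listing does: one SAX and one
   STA per ADC, so each addition yields two consecutive values. */
void fill_ascending(uint8_t out[8])
{
    assert(out);
    uint8_t a = 0x01;            /* LDA #%00000001 */
    const uint8_t x = 0xFE;      /* LDX #%11111110 */
    for (int i = 0; i < 8; i += 2) {
        out[i]     = sax(a, x);  /* SAX: bit 0 masked off, the even value */
        out[i + 1] = a;          /* STA: the odd value */
        a = (uint8_t)(a + 2);    /* ADC #2 (the 6502 code skips the last one) */
    }
}
```

The mask in X only ever removes bit 0, which is why a single ADC #2 is enough to produce the next pair.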

Quote:
And why does it use an enormous amount of ROM? Is there a bunch of pre-defined stuff?
I've mostly spent it on large amounts of unrolled code. There's very nearly 32k of line drawing loops, and I've completely unrolled the code which clears the character set and tilemaps so there's a 24k sequence of STAs in there somewhere, plus there's this big 12k atan2 table for directly finding out the angle of a point and another 5k of precalculated line slopes, and a few other things. About 96k in all by now I should think, but most of it ought to be avoidable without losing too much performance, I just didn't see a reason not to use what space I have for optimization purposes whenever possible.
Your main problem with using these techniques in a game environment would be the RAM usage. You see, since uploading new graphics to the PPU usually takes more than one frame you have to be able to work on the next one simultaneously, something which I've chosen to do by double buffering the in-RAM versions of the character set and tilemaps. So combine that with a few other necessary tables and I'm now down to only 320 bytes of contiguous RAM free.
Then again how much game data do you really need to store when showing the cut-scenes between levels?

by doynax on 2008-04-08 (#32452)

Here are some patches for Nintendulator and FCEUXD-SP to get this damnable demo to run. This gives me some hope for hardware compatibility as I really hadn't expected Nintendulator to be fine with my raster code without some heavy tweaking.

Nintendulator (0.965 beta) needs implementations of the ANC/ASR/ARR and SBX illegals added to CPU.c
Code:
22a23,24
> #define ILLEGALWARNING 0
>
1009a1012
> # if ILLEGALWARNING
1010a1014
> # endif
1014a1019
> # if ILLEGALWARNING
1015a1021
> # endif
1018a1025
> # if ILLEGALWARNING
1019a1027
> # endif
1034a1043
> # if ILLEGALWARNING
1035a1045
> # endif
1052a1063
> # if ILLEGALWARNING
1053a1065
> # endif
1068a1081
> # if ILLEGALWARNING
1069a1083
> # endif
1087a1102
> # if ILLEGALWARNING
1088a1104
> # endif
1092a1109
> # if ILLEGALWARNING
1093a1111
> # endif
1103a1122
> # if ILLEGALWARNING
1104a1124
> # endif
1119a1140
> # if ILLEGALWARNING
1120a1142
> # endif
1138a1161
> # if ILLEGALWARNING
1139a1163
> # endif
1153a1178,1253
> //////// new illegals ////////
> static __forceinline void IV_ANC (void)
> { /* AND + copy bit-7 to carry */
> # if ILLEGALWARNING
> EI.DbgOut(_T("Invalid opcode $%02X (ANC) encountered at $%04X"),Opcode,OpAddr);
> # endif
> CPU_MemGet(CalcAddr);
> __asm
> {
>    mov al,CPU.LastRead
>    and CPU.A,al
>    setz CPU.FZ
>    sets CPU.FN
>    sets CPU.FC
> }
> }
> static __forceinline void IV_ASR (void)
> { /* AND + LSR */
> # if ILLEGALWARNING
> EI.DbgOut(_T("Invalid opcode $%02X (ASR) encountered at $%04X"),Opcode,OpAddr);
> # endif
> CPU_MemGet(CalcAddr);
> __asm
> {
>    mov al,CPU.LastRead
>    and al,CPU.A
>    shr al,1
>    mov CPU.A,al
>    setc CPU.FC
>    mov CPU.FN,0
>    setz CPU.FZ
> }
> }
> static __forceinline void IV_ARR (void)
> { /* AND + ROR */
> # if ILLEGALWARNING
> EI.DbgOut(_T("Invalid opcode $%02X (ARR) encountered at $%04X"),Opcode,OpAddr);
> # endif
> CPU_MemGet(CalcAddr);
> __asm
> {
>    mov al,CPU.LastRead
>    and al,CPU.A
>
>    mov ah,CPU.FC
>    mov CPU.FN,ah
>    add ah,0xFF      ; x86 CF = old 6502 carry; must follow the AND, which clears CF
>    rcr al,1
>    mov CPU.A,al
>    setc CPU.FC
>    test al,0xFF
>    setz CPU.FZ
> }
> }
> static __forceinline void IV_SBX (void )
> { /* X = (X & A) - #imm */
> # if ILLEGALWARNING
> EI.DbgOut(_T("Invalid opcode $%02X (SBX) encountered at $%04X"),Opcode,OpAddr);
> # endif
> CPU_MemGet(CalcAddr);
> __asm
> {
>    cmp CPU.FC,1
>
>    mov al,CPU.X
>    and al,CPU.A
>    sbb al,CPU.LastRead
>    mov CPU.X,al
>
>    setnc CPU.FC
>    setz CPU.FZ
>    sets CPU.FN
>    seto CPU.FV
> }
> }
>
1188,1191c1288,1291
< case 0x03:AM_INX(); IV_SLO();break;case 0x13:AM_INYW(); IV_SLO();break;case 0x0B:AM_IMM(); IV_UNK();break;case 0x1B:AM_ABYW(); IV_SLO();break;case 0x07:AM_ZPG(); IV_SLO();break;case 0x17:AM_ZPX(); IV_SLO();break;case 0x0F:AM_ABS(); IV_SLO();break;case 0x1F:AM_ABXW(); IV_SLO();break;
< case 0x23:AM_INX(); IV_RLA();break;case 0x33:AM_INYW(); IV_RLA();break;case 0x2B:AM_IMM(); IV_UNK();break;case 0x3B:AM_ABYW(); IV_RLA();break;case 0x27:AM_ZPG(); IV_RLA();break;case 0x37:AM_ZPX(); IV_RLA();break;case 0x2F:AM_ABS(); IV_RLA();break;case 0x3F:AM_ABXW(); IV_RLA();break;
< case 0x43:AM_INX(); IV_SRE();break;case 0x53:AM_INYW(); IV_SRE();break;case 0x4B:AM_IMM(); IV_UNK();break;case 0x5B:AM_ABYW(); IV_SRE();break;case 0x47:AM_ZPG(); IV_SRE();break;case 0x57:AM_ZPX(); IV_SRE();break;case 0x4F:AM_ABS(); IV_SRE();break;case 0x5F:AM_ABXW(); IV_SRE();break;
< case 0x63:AM_INX(); IV_RRA();break;case 0x73:AM_INYW(); IV_RRA();break;case 0x6B:AM_IMM(); IV_UNK();break;case 0x7B:AM_ABYW(); IV_RRA();break;case 0x67:AM_ZPG(); IV_RRA();break;case 0x77:AM_ZPX(); IV_RRA();break;case 0x6F:AM_ABS(); IV_RRA();break;case 0x7F:AM_ABXW(); IV_RRA();break;
---
> case 0x03:AM_INX(); IV_SLO();break;case 0x13:AM_INYW(); IV_SLO();break;case 0x0B:AM_IMM(); IV_ANC();break;case 0x1B:AM_ABYW(); IV_SLO();break;case 0x07:AM_ZPG(); IV_SLO();break;case 0x17:AM_ZPX(); IV_SLO();break;case 0x0F:AM_ABS(); IV_SLO();break;case 0x1F:AM_ABXW(); IV_SLO();break;
> case 0x23:AM_INX(); IV_RLA();break;case 0x33:AM_INYW(); IV_RLA();break;case 0x2B:AM_IMM(); IV_ANC();break;case 0x3B:AM_ABYW(); IV_RLA();break;case 0x27:AM_ZPG(); IV_RLA();break;case 0x37:AM_ZPX(); IV_RLA();break;case 0x2F:AM_ABS(); IV_RLA();break;case 0x3F:AM_ABXW(); IV_RLA();break;
> case 0x43:AM_INX(); IV_SRE();break;case 0x53:AM_INYW(); IV_SRE();break;case 0x4B:AM_IMM(); IV_ASR();break;case 0x5B:AM_ABYW(); IV_SRE();break;case 0x47:AM_ZPG(); IV_SRE();break;case 0x57:AM_ZPX(); IV_SRE();break;case 0x4F:AM_ABS(); IV_SRE();break;case 0x5F:AM_ABXW(); IV_SRE();break;
> case 0x63:AM_INX(); IV_RRA();break;case 0x73:AM_INYW(); IV_RRA();break;case 0x6B:AM_IMM(); IV_ARR();break;case 0x7B:AM_ABYW(); IV_RRA();break;case 0x67:AM_ZPG(); IV_RRA();break;case 0x77:AM_ZPX(); IV_RRA();break;case 0x6F:AM_ABS(); IV_RRA();break;case 0x7F:AM_ABXW(); IV_RRA();break;
1194c1294
< case 0xC3:AM_INX(); IV_DCP();break;case 0xD3:AM_INYW(); IV_DCP();break;case 0xCB:AM_IMM(); IV_UNK();break;case 0xDB:AM_ABYW(); IV_DCP();break;case 0xC7:AM_ZPG(); IV_DCP();break;case 0xD7:AM_ZPX(); IV_DCP();break;case 0xCF:AM_ABS(); IV_DCP();break;case 0xDF:AM_ABXW(); IV_DCP();break;
---
> case 0xC3:AM_INX(); IV_DCP();break;case 0xD3:AM_INYW(); IV_DCP();break;case 0xCB:AM_IMM(); IV_SBX();break;case 0xDB:AM_ABYW(); IV_DCP();break;case 0xC7:AM_ZPG(); IV_DCP();break;case 0xD7:AM_ZPX(); IV_DCP();break;case 0xCF:AM_ABS(); IV_DCP();break;case 0xDF:AM_ABXW(); IV_DCP();break;
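For readers without an x86 assembler at hand, three of these opcodes can be sketched in portable C. The `Cpu` struct and the `op_*` names are invented for this example and are not Nintendulator's; only the N, Z and C flags are modelled, following the behavior given in the commonly circulated opcode lists, and ARR is left out because its flag rules are more involved.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t a, x; uint8_t c, z, n; } Cpu;  /* hypothetical minimal state */

static void set_nz(Cpu *cpu, uint8_t v)
{
    assert(cpu);
    cpu->z = (v == 0);
    cpu->n = (v >> 7) & 1;
}

/* ANC #imm: A = A AND imm, then carry = bit 7 of the result. */
void op_anc(Cpu *cpu, uint8_t imm)
{
    cpu->a &= imm;
    set_nz(cpu, cpu->a);
    cpu->c = cpu->n;
}

/* ASR #imm: A = (A AND imm) >> 1, carry = the bit shifted out. */
void op_asr(Cpu *cpu, uint8_t imm)
{
    uint8_t t = cpu->a & imm;
    cpu->c = t & 1;
    cpu->a = t >> 1;
    set_nz(cpu, cpu->a);
}

/* SBX #imm: X = (A AND X) - imm, carry set when no borrow occurs (like CMP). */
void op_sbx(Cpu *cpu, uint8_t imm)
{
    uint8_t t = cpu->a & cpu->x;
    cpu->c = (t >= imm);
    cpu->x = (uint8_t)(t - imm);
    set_nz(cpu, cpu->x);
}
```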

For FCEUXD-SP (1.07) I merely had to change the addressing modes of ISC and DCP in ops.c:
Code:
352,358c352,358
< case 0xC7: LD_ZP(DEC;CMP);
< case 0xD7: LD_ZPX(DEC;CMP);
< case 0xCF: LD_AB(DEC;CMP);
< case 0xDF: LD_ABX(DEC;CMP);
< case 0xDB: LD_ABY(DEC;CMP);
< case 0xC3: LD_IX(DEC;CMP);
< case 0xD3: LD_IY(DEC;CMP);
---
> case 0xC7: RMW_ZP(DEC;CMP);
> case 0xD7: RMW_ZPX(DEC;CMP);
> case 0xCF: RMW_AB(DEC;CMP);
> case 0xDF: RMW_ABX(DEC;CMP);
> case 0xDB: RMW_ABY(DEC;CMP);
> case 0xC3: RMW_IX(DEC;CMP);
> case 0xD3: RMW_IY(DEC;CMP);
361,367c361,367
< case 0xE7: LD_ZP(INC;SBC);
< case 0xF7: LD_ZPX(INC;SBC);
< case 0xEF: LD_AB(INC;SBC);
< case 0xFF: LD_ABX(INC;SBC);
< case 0xFB: LD_ABY(INC;SBC);
< case 0xE3: LD_IX(INC;SBC);
< case 0xF3: LD_IY(INC;SBC);
---
> case 0xE7: RMW_ZP(INC;SBC);
> case 0xF7: RMW_ZPX(INC;SBC);
> case 0xEF: RMW_AB(INC;SBC);
> case 0xFF: RMW_ABX(INC;SBC);
> case 0xFB: RMW_ABY(INC;SBC);
> case 0xE3: RMW_IX(INC;SBC);
> case 0xF3: RMW_IY(INC;SBC);
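What the LD_ to RMW_ change restores is the write-back. A rough C model of DCP makes the difference visible; the `Cpu` struct and `op_dcp` are hypothetical names for this sketch, not FCEUXD-SP code.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t a; uint8_t c, z, n; } Cpu;   /* hypothetical minimal state */

/* DCP addr: decrement the memory operand, write it back, then compare
   A against the new value exactly as CMP would. */
void op_dcp(Cpu *cpu, uint8_t *mem)
{
    assert(cpu && mem);
    *mem = (uint8_t)(*mem - 1);              /* the store a load-only handler skips */
    uint8_t diff = (uint8_t)(cpu->a - *mem);
    cpu->c = (cpu->a >= *mem);
    cpu->z = (diff == 0);
    cpu->n = (diff >> 7) & 1;
}
```

With A holding the limit, each call counts the variable down and sets Z when it reaches the limit, which is the LDA #limit / DCP counter loop idiom. Without the write-back the counter never changes and such a loop never ends.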

by Celius on 2008-04-08 (#32478)

It's unfortunate many emulators don't support these opcodes. Some of these would be really useful. And it's really dumb since most people will be playing games we make on emulators, and if it doesn't work, they probably won't go download an updated version that might get it working. Like XAA seems like I would use it a lot. It's really too bad.

But if you're just interested in developing for play on the real thing, I suppose you can do these. I'm interested in playing on the real thing, but I'm also interested in the ability to play it on an emulator.

But does Nestopia support all of the illegal opcodes?

by tepples on 2008-04-08 (#32480)

The ultimate goal of any NESdev project is to produce a program that runs on a Nintendo Family Computer or Nintendo Entertainment System. Perhaps if there were a way to recompile 6502 assembly language into C, with high-level emulation of the PPU, it might be possible to make one binary that runs on PC hardware running Windows and another binary that runs on NES hardware (and good emulators). It's a bit easier to make a cross-platform game on the GBA or DS because its 32-bit CPU more readily accepts compiled C.

by doynax on 2008-04-08 (#32481)

Celius wrote:
It's unfortunate many emulators don't support these opcodes. Some of these would be really useful. And it's really dumb since most people will be playing games we make on emulators, and if it doesn't work, they probably won't go download an updated version that might get it working. Like XAA seems like I would use it a lot. It's really too bad.

But if you're just interested in developing for play on the real thing, I suppose you can do these. I'm interested in playing on the real thing, but I'm also interested in the ability to play it on an emulator.
The reason that the emulators have such poor support for them is precisely that so little NES software uses them. No one would dream of releasing a C64 emulator without them because everyone and their grandmother are using them. So let's start including them in our NES games and demos and force things to change. Then again some people still use Nesticle so maybe things will never change..
Also you forgot to read the footnote on XAA. It's highly unstable and unpredictable so you couldn't have used it in any event.

Quote:
But does Nestopia support all of the illegal opcodes?
I don't know, but it has supported everything I've tried so far which ought to be a good cross-section of them all. If anything useful is missing I'm betting it's the predictable "unstable" ones, i.e. those that AND with the page number once in a blue moon.

by Roth on 2008-04-08 (#32482)

If these were applied correctly to emus, I believe that would mean the Chinese Biohazard would be playable! Trying to save state just to get through to see how it was done altogether is a real pain.

by doynax on 2008-04-09 (#32537)

I've been looking into the transformations, i.e. the 3D side of things, and I fear writing them off as trivial may have been premature (not to mention arrogant).
A lot of people can probably do this sort of thing in their sleep but I've never written any 3D code on a 'limited' system before so I don't know all the tricks yet. Unless you count fooling around in QBasic on my 386 that is, though come to think of it I had performance issues back then too (until I finally figured out that floating point was evil anyway).

Essentially I've got to:
work out a rotation matrix (from Euler angles)
multiply said matrix with the model's vertices
divide each vertex by Z for perspective correction
perform back-face culling
and light the polygons

Working out the rotation matrix may only have to be done once per frame but it's still a significant part of the work when you're dealing with simple objects. Working out the combined rotations you get a matrix built from products and sums of sines and cosines of the three rotation angles. That is something like this, where sx is the sine of the angle of rotation along the X axis and so forth:
Code:
[ sx*sy*sz + cx*cz, sx*cy, cx*sz - sx*sy*cz ]
[ cx*sy*sz - sx*cz, cx*cy, -cx*sy*cz - sx*sz ]
[ -cy*sz, sy, cy*cz ]

Naturally I'll make use of (co)sine tables but there's still an awful lot of multiplications in there and few common factors, so if I want full 16-bit precision for the vertices later on then we'll need 24-bit precision here. Luckily I remembered to check my high school formula collection and it so happens that it lists some highly useful identities for multiplying sines. For instance: 2 * cos x * cos y = cos(x - y) + cos(x + y)
Now divide this by two and substitute z for x and we have the final entry of the matrix (cy*cz), with only additions, subtractions and table lookups *and* without losing any precision at that. Neat :)

Then you've got to multiply the vertices with the matrix. This ought to be fast and easy for fixed objects. Just design objects with lots of symmetry and nice integral coordinates then work out all of these constant multiplications by hand. With a simple cube you'd just add or subtract each of the basis vectors to get those eight points.
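As a sketch of the cube case: with vertices at (±1, ±1, ±1), every rotated vertex is a signed sum of the three matrix columns, so no multiplication is needed. The `Mat3` and `Vec3` types, the function name and the vertex numbering are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int16_t m[3][3]; } Mat3;   /* m[row][col], fixed point */
typedef struct { int16_t x, y, z; } Vec3;

/* Vertex i of the cube has coordinate +1 on axis c when bit c of i is
   set and -1 otherwise, so each output row is +/- col0 +/- col1 +/- col2. */
void rotate_unit_cube(const Mat3 *rot, Vec3 out[8])
{
    assert(rot && out);
    for (int i = 0; i < 8; i++) {
        int16_t v[3];
        for (int r = 0; r < 3; r++) {
            int acc = 0;
            for (int c = 0; c < 3; c++)
                acc += (i & (1 << c)) ? rot->m[r][c] : -rot->m[r][c];
            v[r] = (int16_t)acc;
        }
        out[i].x = v[0];
        out[i].y = v[1];
        out[i].z = v[2];
    }
}
```

On the 6502 the same idea would presumably be unrolled by hand, sharing partial sums between vertices.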

The perspective correction is worse though and 8-bit precision will definitely not cut it. I've reserved a full 16k bank for the transformations' code and tables, so perhaps a set of logarithm and exponential tables might (just barely) have enough precision. Or perhaps a big reciprocal table and combine four of those cute square table multiplications to get a 16-bit division. A classic division algorithm with a bit of unrolling might even be fast enough, but in the past I've always been able to avoid those by cheating.
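The "square table multiplication" mentioned here is usually the quarter-square identity a*b = floor((a+b)^2/4) - floor((a-b)^2/4). A minimal sketch for unsigned 8-bit operands follows; the table layout and names are assumptions for the example, and a 16-bit product would be built from four such partial products.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t quarter_sq[511];   /* floor(n*n/4) for n = 0..510 */

void init_quarter_squares(void)
{
    for (uint32_t n = 0; n < 511; n++)
        quarter_sq[n] = (uint16_t)((n * n) / 4);
}

/* Two table lookups and a subtraction instead of a multiply. The result
   is exact because a+b and a-b always have the same parity, so the
   fractions dropped by the two floors cancel. */
uint16_t mul8(uint8_t a, uint8_t b)
{
    unsigned sum  = (unsigned)a + b;
    unsigned diff = a > b ? (unsigned)(a - b) : (unsigned)(b - a);
    assert(sum < 511);
    return (uint16_t)(quarter_sq[sum] - quarter_sq[diff]);
}
```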

I'm most uncertain about the culling and lighting though. They're basically the same thing, i.e. working out the angle between polygon and the camera or light source, in fact if I let the camera be the light source they are the same. IIRC I'd be taking the dot product between the light/camera vector (0,0,1) and the surface normal, so I ought to be able to work out the normals in advance and rotate them just like the rest of the vertices yet only calculate the Z components since we're just going to toss the rest away anyway.

I'm not so sure about this part though. Aren't you supposed to do culling and lighting in screen space or something? XOR-filling cannot handle any overdraw so back facing polygons damn well better be hidden. Plus making the lighting look good when all you've got is a four color palette might get messy. And I don't even want to think about handling non-trivial (i.e. concave) objects with potential overdraw. I suppose early software renderers and 3D hardware without Z-buffering (like the PSX) must have dealt with it somehow.

At any rate I'd love to hear some suggestions from people with real experience in 3D programming.

by ReaperSMS on 2008-04-09 (#32539)

Considered using MMC5? or are you already using it? The multiplier would probably be handy.

Pretty much everything comes down to a ton of dot products. If you can keep the space around, you probably want to try and avoid using eulers for all rotations, as there are a variety of issues with them. The downside is that quats are probably a bit of a no-go, though there are some pretty speedy sqrt and rsqrt algorithms out there.

Vertex lighting is usually computed in view space before the perspective divide, and then the colors are interpolated across. If you're just going with directional face lighting though, it doesn't really matter that much. As you mentioned, if you tie the light to the camera, all you care about is the viewspace Z component of the transformed normal, which is one dot product.

Culling usually happens after the perspective divide, as part of the actual triangle setup. Computing the gradients requires finding out the screenspace area of the triangle, and if that is negative, it's backfacing. You probably want to cull quite a bit earlier than that though.

I would suggest perhaps ignoring lighting, since you don't have all that many colors to work with anyways.

Correct culling is a matter of determining which side of the face the camera is on, and is the sign of the dot product between the face normal and a vector from a point on the face to the camera. The bad news is this means you have to transform the normal and at least one of the points into view space to determine if it is visible.
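A minimal sketch of that test, assuming the camera sits at the view-space origin so that the vector from a point on the face to the camera is just the negated point. The type and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t x, y, z; } Vec3;

/* Visible when the face normal points towards the camera:
   dot(normal, camera - p) > 0 with the camera at (0, 0, 0). */
int face_visible(const Vec3 *normal, const Vec3 *p)
{
    assert(normal && p);
    int32_t d = -(normal->x * p->x + normal->y * p->y + normal->z * p->z);
    return d > 0;
}
```

A face at (10, 0, 10) with normal (-1, 0, 0) passes this test even though the Z component of its normal is zero, which is the case a Z-only check gets wrong.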

How are you handling clipping? Punting on it with a "don't hand down geometry that needs clipping" policy?

by Zepper on 2008-04-09 (#32540)

-...

by doynax on 2008-04-09 (#32541)

ReaperSMS wrote:
Considered using MMC5? or are you already using it? The multiplier would probably be handy.
MMC5 would certainly be nice, especially the reliable scanline counter, but it's just too obscure and complicated to use if I want to have any real chance of getting things tested on hardware. Plus it doesn't support CHR-RAM, or at least no official games seem to use it.

Quote:
Pretty much everything comes down to a ton of dot products. If you can keep the space around, you probably want to try and avoid using eulers for all rotations, as there are a variety of issues with them. The downside is that quats are probably a bit of a no-go, though there are some pretty speedy sqrt and rsqrt algorithms out there.
Normally that would be true but I don't actually want to do anything with the numbers, just animate simple spinning objects. In other words I just want to be able to let an object rotate by some variable amount along each axis each frame. So stability and such are non-issues.
Of course if you've got a faster alternative method for setting this up then I'd be happy to hear it.

by doynax on 2008-04-09 (#32542)

ReaperSMS wrote:
Culling usually happens after the perspective divide, as part of the actual triangle setup. Computing the gradients requires finding out the screenspace area of the triangle, and if that is negative, it's backfacing. You probably want to cull quite a bit earlier than that though.
I'll only have a single low-polygon object at a fixed point in the middle of the screen so there isn't much to gain from early culling. Plus with XOR-filling I'm actually drawing the outlines rather than the polygons themselves which involves a bit of special handling..
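For readers who haven't met XOR-filling: a sketch of the classic fill pass as used in C64 demos, which may differ in detail from the code in this demo. It assumes each byte holds eight horizontal pixels of one row, and that every edge has been XOR-plotted with exactly one pixel per pixel column it crosses.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 8   /* rows in one tile column for this example */

/* A running XOR down the column toggles the fill state at every edge
   pixel, turning "edge ... edge" into a solid vertical span. */
void xor_fill_column(uint8_t col[ROWS])
{
    assert(col);
    uint8_t acc = 0;
    for (int y = 0; y < ROWS; y++) {
        acc ^= col[y];
        col[y] = acc;
    }
}
```

Because a second edge toggles the state back off, two overlapping polygons cancel each other out where they overlap, which is why overdraw cannot simply be painted over.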

Quote:
I would suggest perhaps ignoring lighting, since you don't have all that many colors to work with anyways.
Perhaps but I've seen it done before in many C64 demos and it can look pretty good. Especially with a dynamic palette.

Quote:
Correct culling is a matter of determining which side of the face the camera is on, and is the sign of the dot product between the face normal and a vector from a point on the face to the camera. The bad news is this means you have to transform the normal and at least one of the points into view space to determine if it is visible.
View space as in with perspective correction and everything? In that case couldn't I just use a dot product to find the orientation of the polygon on screen and avoid transforming the normal?

Quote:
How are you handling clipping? Punting on it with a "don't hand down geometry that needs clipping" policy?
There is no clipping between objects as there will only be one and I'm hoping to avoid clipping against the screen edges as doing it efficiently would complicate some timing-sensitive raster code, it is certainly doable though, especially if I'm willing to waste tiles along the upper edge of the screen.
Anyway that was pretty much what I wanted to ask. With XOR-filling overdraw just won't render correctly so if I want to draw non-trivial objects I'm going to have to clip things manually not just draw them in back-to-front order. The question is how can I efficiently determine which polygons might have to be clipped, and what's the most efficient way of performing the clipping itself?

Please note if I ever actually wanted to use this thing in a NES game then it'd be as a way to compress cut-scenes, and then I'd naturally have everything pre-calculated and just store a list of polygons to draw. This whole business is just for me to prove I can do the "3D" bit on the NES as well ;)

by ReaperSMS on 2008-04-09 (#32544)

Clipping and culling are two different things. If you restrict things to convex polygons, and decree that no objects shall intersect, then backface culling is all you need.

Culling is the elimination of entire polygons, due to them being outside the view frustum or backfacing. Usually pretty fast.

Clipping is modifying the polygon to fit within a boundary, such as the view frustum. It involves chopping edges apart and adding new ones, and tends to be slow.

Most clipping you can actually ignore, as long as you have some sort of scissoring happening at the screen edges in your rasterizer. The only time you *absolutely* need to clip is if a polygon intersects the near plane, as without that the perspective divide explodes.

Coordinate spaces:

Object space: space the vertices are modeled in. Origin is usually the center or some other handy spot on the object (such as the feet)

World space: space objects and the camera are placed in, relatively arbitrary

Camera/View space: space relative to the camera. The viewpoint is at the origin, looking down the positive or negative Z axis generally. Handy for lighting, as the View vector for a particular vertex is simply the negative of the position.

Clip space: 4D homogeneous space, result of taking view space vertices and transforming them through the projection matrix. Usually X and Y range from +/-W, Z from 0 to W or 0 to -W. Handy for clipping, as your clip planes are usually X = -W, X = W, Y = -W, Y = W, etc.

Normalized Device Coordinate space: the result of perspective dividing clipspace verts. X and Y cover +/- 1, Z usually runs from 0-1.

Screen space: NDC coordinates moved through the viewport transformation, which is generally just a scale and offset to get coordinates in 0-w and 0-h

Usually the transforms get broken down into Model->View (OpenGL MODELVIEW), View->Clip (PROJECTION), and NDC->Screen (Viewport)

As for just spinning a particular object, if you don't care about numeric precision eventually squashing or stretching it in odd ways, you can just store the transformation matrix, and do a single matrix multiply to rotate that matrix each frame to get the rotation. It might work out to fewer operations than constructing it from the eulers every time.
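A sketch of that incremental approach, assuming a fixed-point scale of 64 for 1.0 and a hypothetical `Mat3` type. Each frame the stored orientation is replaced by step * orientation.

```c
#include <assert.h>
#include <stdint.h>

#define ONE 64   /* assumed fixed-point scale: 1.0 == 64 */

typedef struct { int16_t m[3][3]; } Mat3;

/* out = a * b for 3x3 fixed-point matrices (27 multiplications).
   out may alias a or b because the result is built in a temporary. */
void mat3_mul(const Mat3 *a, const Mat3 *b, Mat3 *out)
{
    assert(a && b && out);
    Mat3 t;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++) {
            int32_t acc = 0;
            for (int k = 0; k < 3; k++)
                acc += (int32_t)a->m[r][k] * b->m[k][c];
            t.m[r][c] = (int16_t)(acc / ONE);   /* rescaling truncates; this is where drift creeps in */
        }
    *out = t;
}
```

With an exact step such as a quarter turn nothing is lost, but for arbitrary small angles the truncation accumulates frame after frame, which is the squashing and stretching mentioned above.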

by blargg on 2008-04-10 (#32547)

tepples wrote:
I don't think "illegal" is the right word, as there's no law against using them. Perhaps "undocumented" is the right word.

Both are misleading, because they are documented, and they don't cause any hardware trap as illegal would suggest (except the several that halt the processor). I use the term "unofficial", since they aren't described in the official 6502 manuals from various manufacturers.

by doynax on 2008-04-10 (#32548)

ReaperSMS wrote:
Clipping and culling are two different things. If you restrict things to convex polygons, and decree that no objects shall intersect, then backface culling is all you need.
Right. The question is how would I deal with concave objects? Traditionally you could have used, say, a BSP tree to draw in the correct order but I've got to avoid overdraw altogether and clipping polygons against each other is quite expensive.
I think I'll take the easy way out and skip that part of it.

Quote:
Coordinate spaces:

.
.
.

Screen space: NDC coordinates moved through the viewport transformation, which is generally just a scale and offset to get coordinates in 0-w and 0-h
That's a useful list, I was less than certain about the proper names for things.
In that case I've got to do back-face culling after the perspective correction, which seems pretty reasonable since a surface nearly parallel to the Z axis would get its far side turned inwards and thus be hidden. So I suppose I'm stuck with doing a pair of multiplications per-surface for the test.
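The pair of multiplications is the 2D cross product of two projected edges. A minimal sketch, with hypothetical names and the assumption that front faces are wound so the signed area comes out positive:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int16_t x, y; } Point2;

/* Twice the signed screen-space area of triangle (a, b, c). */
int32_t signed_area2(const Point2 *a, const Point2 *b, const Point2 *c)
{
    assert(a && b && c);
    return (int32_t)(b->x - a->x) * (c->y - a->y)
         - (int32_t)(b->y - a->y) * (c->x - a->x);
}

/* Edge-on faces (zero area) are treated as hidden as well. */
int is_back_facing(const Point2 *a, const Point2 *b, const Point2 *c)
{
    return signed_area2(a, b, c) <= 0;
}
```

Only three projected vertices per surface are needed, so no normals have to be rotated for this test.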

Quote:
As for just spinning a particular object, if you don't care about numeric precision eventually squashing or stretching it in odd ways, you can just store the transformation matrix, and do a single matrix multiply to rotate that matrix each frame to get the rotation. It might work out to fewer operations than constructing it from the eulers every time.
Hm.. As far as I can see a matrix-by-matrix multiplication is 27 multiplications, while the original contains only 16 to begin with. And since it only contains multiplications of sines and cosines I believe I can do a bit of magic algebra to reduce everything to additions, subtractions and lookups in the sine tables.

by Bregalad on 2008-04-10 (#32552)

Oh my, 3D rendering really needs a lot of calculations, even for really simple objects.
I can't even imagine how much calculation it needs for modern games that actually can look almost real with one million or so geometric shapes per screen.

You can get fast multiplication by having a 2-entry multiplication table, but man that takes space !

by Bananmos on 2008-04-10 (#32561)

You've probably considered this already, but if you just want to spin something, and you don't care about the downsides of Euler angles (gimbal lock, etc) then there's no real reason to use three Euler angles as I see it.

Two will do just fine, and can get your 3D object to any arbitrary position. That will make it easier to compute the rotation matrix.

by doynax on 2008-04-10 (#32563)

Bananmos wrote:
Two will do just fine, and can get your 3D object to any arbitrary position. That will make it easier to compute the rotation matrix.
Plus the transformations themselves can take advantage of it and cut out a few multiplications. I'm going to try to make a full 3-axis rotation work first though.

At any rate building the matrix now "only" takes 20x 24-bit additions/subtractions, eight or so 8-bit additions/subtractions plus a bunch of table lookups. Okay, we're still talking about a few hundred cycles but compared to say a matrix-by-matrix multiplication it's nothing. I suppose this is a standard trick but I've never seen it mentioned before and I think it's a rather neat hack.

Here wsin is just a sine table, hsin/hcos are (co)sine tables divided by two and qsin/qcos are the same sines divided by four.
Code:
com1 = qsin(x - y + z) - qsin(x + y + z) + hcos(x + z);
com2 = qsin(x + y - z) - qsin(x - y - z) + hcos(x - z);
com3 = qcos(x + y - z) - qcos(x - y - z) - hsin(x - z);
com4 = qcos(x + y + z) - qcos(x - y + z) + hsin(x + z);

mat.xx = com1 + com2;
mat.xy = hsin(x - y) + hsin(x + y);
mat.xz = com3 + com4;
mat.yx = com3 - com4;
mat.yy = hcos(x - y) + hcos(x + y);
mat.yz = com1 - com2;
mat.zx = hsin(y - z) - hsin(y + z);
mat.zy = wsin(y);
mat.zz = hcos(y - z) + hcos(y + z);

And this works nicely in my prototype. Damn but it just doesn't feel right having the line drawing reduced to 30-lines of C code when the 6502 implementation took 5000 lines of assembly code and uses up more than 48k of ROM space.
Anyway I may have sort of accidentally gotten sidetracked with writing a fire cube for a while there, but I'm back to trying to work out the minimum necessary precision now..
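The table sums can be cross-checked against the plain products of sines and cosines that they expand to. The sketch below is not the prototype itself: it uses doubles and radians instead of table indices, a small Taylor-series sine so that no math library is needed, and invented names throughout.

```c
#include <assert.h>

#define PI 3.141592653589793

/* Taylor-series sine; accurate to well below 1e-12 after range reduction. */
static double sine(double a)
{
    while (a >  PI) a -= 2 * PI;
    while (a < -PI) a += 2 * PI;
    double term = a, sum = a;
    for (int k = 1; k <= 12; k++) {
        term *= -a * a / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sum;
}
static double cosine(double a) { return sine(a + PI / 2); }

/* Stand-ins for the tables: whole, half and quarter amplitude. */
static double wsin(double a) { return sine(a); }
static double hsin(double a) { return sine(a) / 2; }
static double hcos(double a) { return cosine(a) / 2; }
static double qsin(double a) { return sine(a) / 4; }
static double qcos(double a) { return cosine(a) / 4; }

typedef struct { double e[3][3]; } Mat3;   /* rows x, y, z; columns x, y, z */

/* The sums from the listing: lookups, additions and subtractions only. */
Mat3 build_from_tables(double x, double y, double z)
{
    double com1 = qsin(x - y + z) - qsin(x + y + z) + hcos(x + z);
    double com2 = qsin(x + y - z) - qsin(x - y - z) + hcos(x - z);
    double com3 = qcos(x + y - z) - qcos(x - y - z) - hsin(x - z);
    double com4 = qcos(x + y + z) - qcos(x - y + z) + hsin(x + z);
    Mat3 m = {{
        { com1 + com2,               hsin(x - y) + hsin(x + y), com3 + com4 },
        { com3 - com4,               hcos(x - y) + hcos(x + y), com1 - com2 },
        { hsin(y - z) - hsin(y + z), wsin(y),                   hcos(y - z) + hcos(y + z) }
    }};
    return m;
}

/* The products those sums expand to, for comparison. */
Mat3 build_from_products(double x, double y, double z)
{
    double sx = sine(x), sy = sine(y), sz = sine(z);
    double cx = cosine(x), cy = cosine(y), cz = cosine(z);
    Mat3 m = {{
        { sx*sy*sz + cx*cz, sx*cy,  cx*sz - sx*sy*cz },
        { cx*sy*sz - sx*cz, cx*cy, -cx*sy*cz - sx*sz },
        { -cy*sz,           sy,     cy*cz }
    }};
    return m;
}

/* Largest element-wise difference between two matrices. */
double max_abs_difference(const Mat3 *a, const Mat3 *b)
{
    assert(a && b);
    double worst = 0;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++) {
            double d = a->e[r][c] - b->e[r][c];
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

Comparing the two builders for a few arbitrary angles is a cheap way to catch a sign slip before committing the sums to 6502 code.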

by ReaperSMS on 2008-04-10 (#32567)

The trick to fast matrix multiplies is knowing what's in your matrices, and thus where all the zeroes are.

If you think the basic transforms are bad, you should see all the stuff that goes on after you get the thing projected. Perspective-correct interpolation is a lovely mess.

by doynax on 2008-04-10 (#32570)

ReaperSMS wrote:
The trick to fast matrix multiplies is knowing what's in your matrices, and thus where all the zeroes are.
Ahh, if only I had any zeros (or ones) to play with..
Another trick is to realize that for a fixed model you're actually multiplying by constants so you can unroll these multiplications and use nice whole numbers to speed things up significantly. Which means that we can get all the way from Euler angles right up to the perspective correction without performing any (general-purpose) multiplications.

On a related note I've been playing with the precision and oddly enough it appears that 8-bits is more than sufficient for the matrix when dealing with simple objects. Since the vertex coordinates are so clean (i.e. plus or minus ones for a cube) the perspective division is the only place where I lose much in the way of precision. Which means that I can probably make the division table-based using the usual logarithm trick :)
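The logarithm trick replaces x / z with exp(log x - log z), i.e. two lookups, a subtraction and a third lookup. The sketch below assumes 8-bit operands and 32 table steps per octave; both figures and all names are picked for illustration only, and the demo's real table sizes are not stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define STEPS   32             /* assumed resolution: table steps per octave */
#define LOG_MAX (8 * STEPS)    /* covers values below 2^8 */

static uint8_t  log_tab[256];      /* log_tab[v] ~ STEPS * log2(v), v = 1..255 */
static uint16_t exp_tab[LOG_MAX];  /* exp_tab[i] ~ 2^(i / STEPS), rounded */

void init_log_tables(void)
{
    const double r = 1.0218971486541166;   /* 2^(1/32) */
    double pw[LOG_MAX];
    double p = 1.0;
    for (int i = 0; i < LOG_MAX; i++) {
        pw[i] = p;
        exp_tab[i] = (uint16_t)(p + 0.5);
        p *= r;
    }
    /* For each value pick the table step whose power of two is nearest. */
    for (int v = 1; v < 256; v++) {
        int best = 0;
        double best_err = v - pw[0];
        for (int i = 1; i < LOG_MAX; i++) {
            double err = pw[i] > v ? pw[i] - v : v - pw[i];
            if (err < best_err) { best = i; best_err = err; }
        }
        log_tab[v] = (uint8_t)best;
    }
}

/* Approximate x / z; quotients below one come out as zero. */
uint16_t log_divide(uint8_t x, uint8_t z)
{
    assert(z != 0);
    if (x == 0) return 0;
    int d = (int)log_tab[x] - (int)log_tab[z];
    return d < 0 ? 0 : exp_tab[d];
}
```

At this resolution each lookup is off by roughly one percent at most, so the quotient is only good to a few percent; more steps per octave buy precision at the cost of table space.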

Quote:
If you think the basic transforms are bad, you should see all the stuff that goes on after you get the thing projected. Perspective-correct interpolation is a lovely mess.
Believe me, I know. I wrote this texture mapper (with sort of speedy bilinear filtering!) a while back and had to deal with that whole bit of nastiness. It took a weekend with pen and paper to work things out..

by doynax on 2008-04-11 (#32614)

Behold!
(to be run in Nestopia, like last time.)

Okay, admittedly it's still of little or no practical use but I'm ecstatic about finishing what I set out to do in the first place anyway.
I just can't believe I managed to get the whole math bit working so easily (relatively speaking..), I suppose the time spent writing the C prototype paid its way many times over in the end. The entire thing runs in less than 2500 cycles, and that's without any insane optimizations or big precision cuts. Oh, and I'm working on some actually interesting objects but even figuring out the code and data behind a cube stretched my spatial intelligence to its limits.

By the way I've been thinking about putting together a small demo part from it all. Perhaps someone with a bit of musical talent could be persuaded to compose something for me? I don't have any particular preferences aside from that it can't waste too much raster time or use DMC samples.

by dXtr on 2008-04-11 (#32617)

that's really impressive. great work
Going to try this on my powerpak tomorrow

by tokumaru on 2008-04-11 (#32618)

Looks cool, but there are 2 things bothering me...

First, the lighting. It should make things cooler, but with so few colors, the effect is betraying you. Color transitions are too abrupt, so it looks like the cube is flashing or something. Maybe it'd look better if each face simply had its own color. Or you could add some dithering to virtually increase the number of colors you have. The simple checkerboard dithering pattern might not be hard to implement with XOR-filling.

The other problem is distortion, it looks like the cube is too close to the camera, so the perspective is distorting the image too much. That should be simple to fix, though, just move the cube away from the camera.

by doynax on 2008-04-11 (#32620)

dXtr wrote:
Going to try this on my powerpak tomorrow :)
That would be most excellent. I'll put together a PAL version for you when I wake up properly.

tokumaru wrote:
Looks cool, but there are 2 things bothering me...

First, the lighting. It should make things cooler, but with so few colors, the effect is betraying you. Color transitions are too abrupt, so it looks like the cube is flashing or something. Maybe it'd look better if each face simply had its own color. Or you could add some dithering to virtually increase the number of colors you have. The simple checkerboard dithering pattern might not be hard to implement with XOR-filling.

The other problem is distortion, it looks like the cube is too close to the camera, so the perspective is distorting the image too much. That should be simple to fix, though, just move the cube away from the camera.
Heh, you managed to zero in on the two effects that are mostly just there because, well, they're supposed to be there. A lot of 3D demos have been known to cheat by using a simple parallel perspective so I exaggerated the effect a bit.
The lighting is worse though and still manages to look remarkably crappy regardless of how much I try to tweak it. I've been toying with the idea of allocating the palette entries dynamically based on what's actually visible but I'm not sure how well it'd work out for more complex images. Perhaps I just ought to scrap it like you said.

by dXtr on 2008-04-13 (#32683)

tried polygon2.nes with no success. (probably because it was a PAL NES?)

by doynax on 2008-04-14 (#32694)

dXtr wrote:
tried polygon2.nes with no success. (probably because it was a PAL NES?)
Possibly, although I doubt it since the timing ought to be correct anyway. Actually it should be more compatible since it won't have to deal with the tricky case of starting the display somewhere in the upper border.

No, most likely it's caused by some PPU or APU abuse of mine. Perhaps the sprite overflow testing has come back to bite me, or DMC interrupts don't quite work as they're supposed to, or one of the illegal opcodes turned out to be unstable after all. Would you be willing to run a suite of test ROMs for the various tricks I'm employing?
I don't like relying on the charity of others to do my debugging for me but I don't see that I have any other alternative. The PowerPak seems to be sold out and installing a CopyNES requires some serious soldering, so that's right out. And besides my NES has a bad cartridge port to begin with and I've only got so much money to spend on this sort of thing..

As for the project's progress I've been trying to add some (more) interesting objects to the demo, but I've run into some annoying issues along the way. Consider a dodecahedron for instance: you can't just scale the basis vectors by some nice and clean factors to calculate the vertex coordinates, rather it uses irrational numbers like the golden ratio. And there isn't a three-coloring of it, yet dynamic lighting always comes out looking horrible no matter how I tweak things.

by Celius on 2008-04-15 (#32735)

Wow, this is really cool! I agree with tokumaru that it looks distorted. It almost looks like it's just a really big cube moving really fast. But it really does look like a cube that's really close to the camera

Dithering seems like it'd be kind of hard to implement. Well, you would probably end up with lots of borders where the checkered pixels don't alternate correctly.

I really need to find out more about XOR filling, this is just too cool to not have in a game.

by doynax on 2008-04-15 (#32737)

Celius wrote:
Wow, this is really cool! I agree with tokumaru that it looks distorted. It almost looks like it's just a really big cube moving really fast. But it really does look like a cube that's really close to the camera
Am I the only one who thinks it looks kind of cool? ;)

Quote:
Dithering seems like it'd be kind of hard to implement. Well, you would probably end up with lots of borders where the checkered pixels don't alternate correctly.
Normally it wouldn't be that hard with XOR filling. By filling the odd and even scanlines independently of each other you could still get a nice dither pattern while only drawing lines (albeit twice as thick lines). However it would easily triple the code size of the line drawers and slow them down a great deal besides, but the real kicker is that you kind of have to write to VRAM in a straight fashion while this technique wants to process the odd/even bytes separately, thus necessitating a separate EOR pass rather than doing it in-line.
Well.. You could dither horizontally only, sort of stripe it that is, but it'd look pretty bad.

I figure the only viable option to get more colors is to make better use of the palette. For instance a cube only ever has three sides visible at a time, so they could each have their own palette entry. For more complex objects you might set up the four background palettes such that each successive one replaces the darkest color of the previous palette with a new lightest color and be careful never to cross more than four consecutive shades within a single 16x16 block. At least, I think I might be able to pull that off in realtime..
Quote:
I really need to find out more about XOR filling, this is just too cool to not have in a game.
There doesn't seem to be much information available on it on the net. I could send you some simple example code in C if you want, but frankly it's quite straightforward once you get your head wrapped around it.

Imagine that you've got one big monochrome bitmap to fill. Then the code will simply walk through the screen column-by-column, XORing an accumulator with the pixels already written there and storing the new values as you go. On the 6502 this would be done one whole 8-pixel byte at a time (and this is why we're filling vertically rather than horizontally).

In other words:
Code:
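/* screen[x][y] holds one monochrome pixel; the outlines are already drawn */
for(x = 0; x < 256; ++x) {
    pixel acc = 0;               /* every column starts out blank */
    for(y = 0; y < 240; ++y) {
        acc ^= screen[x][y];     /* an outline pixel flips the fill state */
        screen[x][y] = acc;      /* store the running fill state */
    }
}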

Now imagine that before filling you had drawn two parallel horizontal lines, one right above the other, on an otherwise clear screen. What happens in this loop is then that the parts above the line are cleared (since we start with acc = 0), then once we reach the upper line our accumulator gets flipped and we start filling pixels all the way down to the lower line where it gets flipped once again back to blank pixels.
This extends naturally to arbitrary polygon outlines and such too, except that you'll have to take care never to overdraw anything or you'll see that interference effect I had in my first demo. And handling four colors rather than two is a simple matter of doing the same with each bitplane separately.
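
To illustrate the "one whole 8-pixel byte at a time" remark, here is a small sketch of the same fill on a packed bitmap, with the two-horizontal-lines case as a usage example. The bitmap size, the column-major layout and the helper names are assumptions chosen for the sketch, not the layout used in the demo.

```c
#include <stdint.h>
#include <string.h>

#define COLS   4                        /* 4 byte columns = 32 pixels wide (assumed) */
#define HEIGHT 16

static uint8_t bitmap[COLS][HEIGHT];    /* bitmap[c][y]: 8 pixels, MSB is leftmost */

void clear_bitmap(void) { memset(bitmap, 0, sizeof bitmap); }

/* Toggle one outline pixel. */
void toggle_pixel(int x, int y) {
    bitmap[x >> 3][y] ^= (uint8_t)(0x80 >> (x & 7));
}

int get_pixel(int x, int y) {
    return (bitmap[x >> 3][y] >> (7 - (x & 7))) & 1;
}

/* XOR-fill downwards, eight pixel columns per accumulator byte. */
void xor_fill(void) {
    for (int c = 0; c < COLS; ++c) {
        uint8_t acc = 0;
        for (int y = 0; y < HEIGHT; ++y) {
            acc ^= bitmap[c][y];
            bitmap[c][y] = acc;
        }
    }
}
```

Toggling the pixels x = 5..19 on rows 3 and 9 and then calling `xor_fill()` leaves rows 3 through 8 set in those columns and everything else clear, matching the description above: the upper line starts the fill and the lower line ends it.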

The main thing to worry about is the fill convention because if anything is a bit off then the errors will propagate all the way down the screen. The key here is to only draw each horizontal pixel along the line exactly once, except for either the leftmost or rightmost pixel which has to be excluded. This applies even to y-major lines where you'd only draw the first pixel in each vertical run (so they won't actually be filled lines).
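
A minimal sketch of such a fill convention, assuming the rightmost column of every edge is the one excluded and using one byte per pixel for readability. The function names are hypothetical, and the per-column division is only there for clarity; a real 6502 line drawer would step incrementally.

```c
#include <stdint.h>
#include <string.h>

#define WIDTH  16
#define HEIGHT 16

static uint8_t screen[WIDTH][HEIGHT];   /* screen[x][y], one pixel per byte */

void clear_screen(void) { memset(screen, 0, sizeof screen); }

/* Toggle exactly one pixel in every column x0 <= x < x1 (rightmost excluded).
 * Vertical edges cover no columns and therefore draw nothing. */
void xor_edge(int x0, int y0, int x1, int y1) {
    if (x0 > x1) {                      /* always walk left to right */
        int t = x0; x0 = x1; x1 = t;
        t = y0; y0 = y1; y1 = t;
    }
    for (int x = x0; x < x1; ++x) {
        int y = y0 + (y1 - y0) * (x - x0) / (x1 - x0);
        screen[x][y] ^= 1;
    }
}

/* Column-by-column XOR fill. */
void xor_fill(void) {
    for (int x = 0; x < WIDTH; ++x) {
        uint8_t acc = 0;
        for (int y = 0; y < HEIGHT; ++y) {
            acc ^= screen[x][y];
            screen[x][y] = acc;
        }
    }
}
```

Drawing the three edges of a triangle this way puts either zero or two toggles in every column, so the fill always switches off again before the bottom of the screen. That is a handy sanity check: after filling a closed outline, the last scanline should be completely blank.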

Now there are any number of complications to doing this on the NES but you'd better experiment with the basics in a high-level language first anyway.

Re: Polygon filling..
by jargon on 2008-04-16 (#32746)

tokumaru wrote:
CHR-ROM does a pretty decent job emulating large "pixels". If you divide each tile in 4 large pixels, it's possible to have all combinations with the 4 colors fit inside the 256 tiles you have. You can even double the vertical resolution by drawing the image to both name tables (making a 64x120 "pixels" image), and squeeze it inside a single screen using interrupts or timed code (although this will take away time that would otherwise be used to compute the next frame).

If you use less than 4 colors it might even be possible to fit more pixels inside each tile, increasing the resolution.

I really think this is a better option than CHR-RAM, which would be pretty slow to update, as opposed to the name tables.

split 8x8 tiles into 4 columns of 2 pixels wide by 8 pixels tall for left-to-right palette permutations 0,1,2,3 through 3,2,1,0 and use scan line interrupt every 2 pixel rows to change palette per unique tile.

this will allow you 32 tiles per row to swap of the permutations:
0123 0132 0213 0231
0312 0321 1023 1032
2013 2031 3012 3021
3102 3120 3201 3210

which is 16 unique tiles total. by using the 2px scan line method you recurse that once.

this allows 256 unique tiles using only 16 CHR tiles with a display resolution of 128x120.

this provides a bitch load more memory for lookups by placing it in unused name table space.

remember, for drawing a triangle, always split a render into the upper and lower portion at the corner vertically between the other two, then render those two triangles top half first, also check whether the corners are clockwise or counter-clockwise in order to deduce faster if that surface is even a visible side, i usually use clockwise for polygons on the inner surface of a polyhedron.
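
The clockwise/counter-clockwise check mentioned above is commonly done with the sign of a 2D cross product of the projected corners. A small sketch, where the function names and the choice of clockwise as the visible orientation are assumptions for illustration:

```c
/* Twice the signed area of the projected triangle. With y growing downwards
 * (screen coordinates), a positive result means the corners run clockwise
 * as seen on screen. */
long signed_area2(int x0, int y0, int x1, int y1, int x2, int y2) {
    return (long)(x1 - x0) * (y2 - y0) - (long)(x2 - x0) * (y1 - y0);
}

/* Assumed convention for this sketch: visible faces are clockwise.
 * Flip the comparison if the model uses the opposite winding. */
int is_front_facing(int x0, int y0, int x1, int y1, int x2, int y2) {
    return signed_area2(x0, y0, x1, y1, x2, y2) > 0;
}
```

For example, the corners (0,0), (10,0), (0,10) run clockwise on a y-down screen and give a positive area, while listing them in the opposite order gives a negative one, so that face would be skipped.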

beyond that i use simple z-order split of existing polyhedrons in event two overlap.

creating a multiplication matrix for the 3 axis using sums of binary-split increments of rotation from within a look-up table instead of raw cos/sin works much more effectively.

beyond that, if you decide to implement textures, i highly advise to not take perspective into account for textures and simply render them as if they were an orthogonal re-orientation of each triangle, as the NES would suck CPU cycles like crazy taking that into account.

my advice for perspective is to use exponential projection as i described here:
http://nesdev.com/bbs/viewtopic.php?p=32745#32745

i only left out of that description that is best to simplify the look-up table so that only each depth plane is used in which the scalar comes to a full integer when translated.

Re: Polygon filling..
by tokumaru on 2008-04-16 (#32747)

jargon wrote:
split 8x8 tiles into 4 columns of 2 pixels wide by 8 pixels tall for left-to-right palette permutations 0,1,2,3 through 3,2,1,0 and use scan line interrupt every 2 pixel rows to change palette per unique tile.

Except that changing the palette mid-frame is not such a simple task. There is not enough time in HBlank to modify a lot of bytes, especially considering that you have to set the scroll again after you've done it.

To change the palette you'd have to use a bit of the visible scanline, reducing your horizontal resolution. Also, when working with the palette when rendering is turned off, the color currently pointed by the PPU gets displayed, so you'd have color glitches while updating the palette.

Re: Polygon filling..
by doynax on 2008-04-16 (#32750)

jargon wrote:
remember, for drawing a triangle, always split a render into the upper and lower portion at the corner vertically between the other two, then render those two triangles top half first, also check whether the corners are clockwise or counter-clockwise in order to deduce faster if that surface is even a visible side, i usually use clockwise for polygons on the inner surface of a polyhedron.
Right.. Except I use XOR-filling rather than traditional polygon scan conversion. And that changes a lot of things.

Quote:
beyond that i use simple z-order split of existing polyhedrons in event two overlap.
As in drawing the polygons in back-to-front order? As I've stated earlier XOR-filling can't handle overlapping polygons, it's the main drawback of the method. I would have to do actual clipping to display overlapping or concave objects, and I have no idea how to do that with reasonable performance (well, aside from precalculating everything).

Quote:
creating a multiplication matrix for the 3 axis using sums of binary-split increments of rotation from within a look-up table instead of raw cos/sin works much more effectively.
Sorry, but I don't understand what you're trying to tell me.
I'm using precalculated (co)sine tables, and the trigonometric product rules to avoid multiplications, if that's what you mean. At any rate the whole thing is running in less than 500 cycles for 16-bit precision so it's not really a bottleneck anymore.
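
The product rules referred to here are presumably the product-to-sum identities, e.g. sin(a)cos(b) = (sin(a+b) + sin(a-b)) / 2, which turn each product of two table values into two lookups, an addition and a shift. A sketch under that assumption, with a 256-step angle and a 16-bit table; the scale `AMP` and all names are made up, and the table is generated at startup only to keep the example self-contained.

```c
#include <stdint.h>

#define AMP 16384                       /* fixed-point scale of the table (assumed) */

static int16_t sin_tab[256];            /* sin_tab[a] ~ AMP * sin(2*pi*a/256) */

/* Build the table without libm by rotating a unit vector one step at a time.
 * On the NES the table would simply be stored in ROM. */
void init_sin_tab(void) {
    const double c = 0.9996988186962042;    /* cos(2*pi/256) */
    const double s = 0.024541228522912288;  /* sin(2*pi/256) */
    double x = 1.0, y = 0.0;                /* (cos, sin) of the current angle */
    for (int a = 0; a < 256; ++a) {
        double v = AMP * y;
        sin_tab[a] = (int16_t)(v >= 0 ? v + 0.5 : v - 0.5);
        double nx = x * c - y * s;
        y = x * s + y * c;
        x = nx;
    }
}

static int16_t sin8(uint8_t a) { return sin_tab[a]; }
static int16_t cos8(uint8_t a) { return sin_tab[(uint8_t)(a + 64)]; }

/* AMP * sin(a) * cos(b), using sin(a)cos(b) = (sin(a+b) + sin(a-b)) / 2 */
int16_t sin_cos(uint8_t a, uint8_t b) {
    return (int16_t)((sin8((uint8_t)(a + b)) + sin8((uint8_t)(a - b))) / 2);
}

/* AMP * cos(a) * cos(b), using cos(a)cos(b) = (cos(a+b) + cos(a-b)) / 2 */
int16_t cos_cos(uint8_t a, uint8_t b) {
    return (int16_t)((cos8((uint8_t)(a + b)) + cos8((uint8_t)(a - b))) / 2);
}
```

Since the entries of a rotation matrix built from Euler angles are sums of such products, they can be assembled from table lookups alone, with the 8-bit angle arithmetic wrapping around for free.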

Quote:
beyond that, if you decide to implement textures, i highly advise to not take perspective into account for textures and simply render them as if they were an orthogonal re-orientation of each triangle, as the NES would suck CPU cycles like crazy taking that into account.
Right, no.. Just plain flat-shaded polygons for me. I don't even want to think about how slow texture mapping would get..

Quote:
my advice for perspective is to use exponential projection as i described here:
http://nesdev.com/bbs/viewtopic.php?p=32745#32745

i only left out of that description that is best to simplify the look-up table so that only each depth plane is used in which the scalar comes to a full integer when translated.
Again, I don't quite follow you. However I am using logarithm and exponential tables for the perspective division.

by Bregalad on 2008-04-16 (#32751)

There is just one thing with XOR filling... The way you described it at least :
- Two vertical consecutive pixels should never be both set, as it would normally be with all lines where dy>dx
- All hidden lines should effectively be hidden, else they'll affect the filling
- No polygon should be on the top of the screen else the whole column will be inverted.

Did I get it right ? I'm not the one using it anyway, but I guess it's kind of interesting. And congratulations for your multiplication matrix. Mastering trigonometric identities (there are hundreds of them !) is necessary to speed things up.

by doynax on 2008-04-16 (#32752)

Bregalad wrote:
There is just one thing with XOR filling... The way you described it at least :
- Two vertical consecutive pixels should never be both set, as it would normally be with all lines where dy>dx
- All hidden lines should effectively be hidden, else they'll affect the filling
- No polygon should be on the top of the screen else the whole column will be inverted.

Did I get it right ?
Yeah, that's pretty much it =)
The top edge is usually handled by drawing a horizontal line along the clipped part of the polygon. Naturally you wouldn't draw anything when clipping the sides or bottom though.

Quote:
I'm not the one using it anyway, but I guess it's kind of interesting. And congratulations for your multiplication matrix. Mastering trigonometric identities (there are hundreds of them !) is necessary to speed things up.
I'm just surprised that I've never seen that trick before, but presumably everyone is using floating point these days so the multiplications don't matter.
Another neat application of the same identities is to easily generate sine waves of different amplitudes from a single table. It's the sort of trick that might save you some ROM space if you're working on a 32k game ;)
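
One way to read this: since cos(b)sin(x) = (sin(x+b) + sin(x-b)) / 2, averaging two phase-shifted lookups yields a sine wave whose amplitude is scaled by cos(b). The sketch below assumes that interpretation and uses a deliberately tiny 16-entry table; the table size and the name `scaled_sin` are illustrative only.

```c
#include <stdint.h>

/* 127 * sin(2*pi*i/16); a tiny table to keep the sketch short */
static const int8_t sin16[16] = {
      0,  49,  90,  117,  127,  117,  90,  49,
      0, -49, -90, -117, -127, -117, -90, -49
};

/* sin(x) scaled by cos(b), using cos(b)sin(x) = (sin(x+b) + sin(x-b)) / 2.
 * b = 0 gives full amplitude, b = 4 (a quarter turn) gives zero. */
int scaled_sin(unsigned x, unsigned b) {
    return (sin16[(x + b) & 15] + sin16[(x - b) & 15]) / 2;
}
```

The available amplitudes are limited to cos(b) for the table's own angle steps, so a larger table gives a finer choice of amplitudes as well as a smoother wave.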

by Bregalad on 2008-04-16 (#32759)

Quote:
I'm just surprised that I've never seen that trick before, but presumably everyone is using floating point these days so the multiplications don't matter.

Yeah, but one day will come when the power of consumer electronics will stop doubling each year, but the demand will continue to increase as usual, and this may happen sooner than some believe. That day, people who know little dirty tricks will be able to survive fine, but people who have grown used to apparently-clean-but-in-reality-dirty very high level languages will be screwed and will break down.
Quote:
Another neat application of the same identities is to easily generate sines waves of different amplitudes from a single table. It's the sort of trick that might save you some ROM space if you're working on a 32k game

Yeah, if I were to do such a trick the first idea I'd have would be to use shift instructions to get multiples of the table entries.

by Anders_A on 2008-04-27 (#33065)

doynax wrote:
Behold!

Wow! That's amazing. If you get that thing to run on a real NES it'll be the coolest thing made for the NES to date.

I find the distorted perspective cool. 3D on retro machines is usually very "vanilla", and it adds a bit of flavour. I'd love to see a version with one palette entry reserved for each of the visible faces of the cube and a nice color ramp for shading.

I've been thinking about doing something like this, but considering all the tricks you've mentioned in this thread I never thought of and the speed you got it running at there is no chance in hell I would have been able to pull it off.

Kudos!

by thefox on 2008-04-27 (#33070)

Very cool! Nice to finally see some real progress in the NES demo department. If you need help debugging it so it works on the real deal, I might be able to help (I have a PowerPak and a decent amount of NES development experience).

by jargon on 2008-04-27 (#33074)

this is my goal with my game engine:

by doynax on 2008-04-28 (#33080)

Here's a new demo: http://doynax.googlepages.com/polynes3.nes
Lately I haven't been working as much as I might have on this. But at least I've reworked the transformation code for better precision, added a couple models and fixed a few bugs.

Anders_A wrote:
I'd love to see a version with one palette entry reserved for each of the visible faces of the cube and a nice color ramp for shading.
I'm working on it but updating the NES palette wasn't as easy as I thought it'd be. Apparently it has to be written during hblank or the true vblank period, not just in a blanked part of the screen.

Anders_A wrote:
I've been thinking about doing something like this, but considering all the tricks you've mentioned in this thread I never thought of and the speed you got it running at there is no chance in hell I would have been able to pull it off.
I may have made it seem harder than it really is, you know.. ;)

thefox wrote:
If you need help debugging it so it works on the real deal, I might be able to help (I have a PowerPak and decent amount of NES development experience).
I'd love to have your help with testing it on hardware. There's still a quadrillion emulator-visible bugs left to deal with but I'll send you a PM (or something) when I get around to writing some test ROMs.

I'm still searching for a musician by the way, or just a donation of an unreleased tune for that matter. Any ideas where I might find such a creature?

by Bregalad on 2008-04-28 (#33083)

Honestly, your third demo is amazing, way better than the other 2. Seeing all regular polygons fitting a simple demo like this is amazing, and it actually looks 3D unlike the other two.

And yeah, the palette has to be uploaded during VBlank. Memblers and I tried to upload it during HBlank, but even for changing only color 0, this is a lot of bother and while it's fun to deal with this, in the best case you see glitches on the rightmost 16 pixels or so.

by hap on 2008-04-28 (#33095)

Yeah, looks good
as for the music, you could ask on the NES Music subforum.

by Celius on 2008-04-28 (#33097)

Yes this looks very good. However, there's still some weird distortion for the triangle faced ones.

And I didn't know you couldn't update the palette in a blanked period of the screen. What happens if you do?

by Zepper on 2008-04-28 (#33099)

doynax, you're pretty impressive, my congrats dude! Awesome work, it's a cool 3D demo!

By the way, what unofficial opcodes do you use?

by doynax on 2008-04-28 (#33101)

Bregalad wrote:
And yeah, the palette has to be uploaded during VBlank. Memblers and I tried to upload it during HBlank, but even for changing only color 0, this is a lot of bother and while it's fun to deal with this, in the best case you see glitches on the rightmost 16 pixels or so.
I only need to update three colors, and it did seem to work fairly well for me. By initially placing the PPU address at $3f00 and then writing the other three entries in rapid succession (i.e. preloading values into A/X/Y and using an indexed dummy read to skip over the background color) the "buggy" span is reduced to nine clocks plus whatever variance you've got in your synchronization method. The annoying part is that Nestopia and Nintendulator set the sprite overflow flag at very different parts of the scanline (and I wouldn't be the least bit surprised if real hardware is different still), so I'd have to use some other timing mechanism to get hblank uploads to work.
At any rate I figure it just isn't worth the compatibility nightmare since splitting up the character upload loop and inserting the palette code somewhere in the middle would work nicely, it'll just mean a shitload of extra work for me..

Celius wrote:
Yes this looks very good. However, there's still some weird distortion for the triangle faced ones.
I know.. I'll just have to tweak the tetrahedron and octahedron somehow without sacrificing the depth for the rest of the models. Not that there's anything wrong with it per se, the objects are just very large and/or very close to the viewer.

Celius wrote:
And I didn't know you couldn't update the palette in a blanked period of the screen. What happens if you do?
When you point the PPU at the palette RAM it decides to render the palette entry it's currently hovering over for some reason.
Sometimes I wonder whether at some stage the NES design team sat down and brainstormed how to create the nastiest, most insidious system ever for writing raster code..

Fx3 wrote:
By the way, what unofficial opcodes do you use?
Let's see.. There's LAX, SAX, SBX, DCP, ISC, ASR, ARR, ANC and a few illegal NOPs. That ought to about cover it I think, but I wouldn't be too surprised if an immediate LAX or SLO/SRE/RLA/RRA instruction snuck in there as well. Plus the halt instructions for debugging of course but that doesn't really count.
If you're trying to decide which instructions to implement in your emulator then I suggest supporting everything except perhaps for the unstable ones (though even those have occasional uses). With some decent documentation and a good sample implementation (VICE for instance) it shouldn't take too long.

by Anders_A on 2008-04-28 (#33105)

doynax wrote:
The annoying part is that Nestopia and Nintendulator set the sprite overflow flag at very different parts of the scanline (and I wouldn't be the least bit surprised if real hardware is different still),

You really need to test on the real NES as there are no emulators as accurate as those for the C64. I have my copynes stashed away at the moment, as I didn't want to have an extra computer set up just for the compatible parallel port, but I just ordered a USB copynes. I'll be glad to help you test stuff out when I get it. It'll be PAL only though.

doynax wrote:
Sometimes I wonder whether at some stage the NES design team sat down and brainstormed how to create the nastiest, most insidious system ever for writing raster code..

Yeah, that truly sucks. You can't do much more than scrolling or changing the color emphasis bits.

Quote:
Let's see.. There's LAX, SAX, SBX, DCP, ISC, ASR, ARR, ANC and a few illegal NOPs. That ought to about cover it I think, but I wouldn't be too surprised if an immediate LAX or SLO/SRE/RLA/RRA instruction snuck in there as well. Plus the halt instructions for debugging of course but that doesn't really count.
If you're trying to decide which instructions to implement in your emulator then I suggest supporting everything except perhaps for the unstable ones (though even those have occasional uses). With some decent documentation and a good sample implementation (VICE for instance) it shouldn't take too long.

Are you certain all of these work on the NES? The NES CPU is a modified 6502 after all. Couldn't those modifications alter the behaviour of the undocumented opcodes?

by tepples on 2008-04-29 (#33113)

I seem to remember reading that the only modification from the original 6502 to the 2A03's CPU core was erasing the part of the chip that had the decimal mode circuitry, as Ricoh couldn't afford to license that from MOS Technology. That wouldn't have changed the values of the don't cares in the decoder gate array, which is where most of the extra instructions come from. But anything that involves $EE or $EF or $11 or other values where bit 4 is prominent might have changed behavior, as those values seem to originate from decimal mode.

by Anders_A on 2008-04-29 (#33114)

Found it!

I had a vague recollection of this thread yesterday when I wrote the above but couldn't find it.

As long as all the opcodes you use are the ones marked XX or uu in the matrix, they have been tested on the real NES by Kevin and should work as expected.

by Celius on 2008-08-24 (#36322)

Sorry to dig this old thread up, but it's relevant, and there's no point starting another one.

So I have a line routine, and I've made a wireframe model. I've also made my line routine XOR-fill friendly, so there aren't terrible errors in attempting to fill the polygons. However, there is a problem I'm having.

When there are two or more polygons stacked on top of each other, the XOR Fill colors every other one. I see why this is, I'm just not sure how to avoid this. If I could somehow fill each polygon separately, then my question would be answered. The only thing is, I'm not sure how I'd go about doing that. Thoughts?