Still working on the APU - things are coming along quite well actually. I need to go pick up a 3.5mm-to-3.5mm cable from RadioShack or somewhere today so I can post some audio samples that you can listen to. The sound isn't quite right yet, but hey, I'm just happy to hear something!
Anyway, I am about to start implementing the DMC and was reading up on the Wiki documentation. It all seemed to make sense to me but something caught my attention.
From Wiki (http://wiki.nesdev.com/w/index.php/APU_DMC):
Quote:
The 6502 cannot be pulled off of the bus normally. The 2A03 DMC gets around this by pulling RDY low internally. This causes the CPU to pause during the next read cycle, until RDY goes high again. The DMC unit holds RDY low for 4 cycles. The first three cycles it idles, as the CPU could have just started an interrupt cycle, and thus be writing for 3 consecutive cycles (and thus ignoring RDY). On the fourth cycle, the DMC unit drives the next sample address onto the address lines, and reads that byte from memory. It then drives RDY high again, and the CPU picks up where it left off.
This matters, because it can interfere with the expected operation of the controller registers, reads of the PPU status register, and CPU VRAM or SPR reads if they happen to occur in the same cycle that the DMC unit pulls RDY low.
I have 2 questions regarding this:
1) If a sprite DMA transfer is already in progress (and therefore already in control of the bus and already deasserting the RDY signal on the CPU), does a DMC DMA operation override (interrupt) the sprite DMA process or does the DMC wait for the entire sprite RAM transfer to finish before taking control of the bus?
2) About the mention of the DMC waiting 4 CPU cycles before taking control of the bus...Does the sprite DMA transfer module do the same thing (i.e. wait 4 cycles to ensure that the CPU has finished with its last operation)? I ask because...umm...I currently don't wait for the CPU to finish its current operation before starting the sprite DMA xfer. I just pull RDY low, take control of the bus, and start the transfer. LOL, I'm thinking now that could be a bad thing.
Thanks!!
2) DMC DMA has to wait for up to three writes in a row before stopping the 6502 off the bus, which might happen on a BRK instruction, IRQ, or NMI. Sprite DMA doesn't have to wait as long because it always occurs immediately after STA/STX/STY $4014, instructions that produce only one write. (Games are expected not to use read-write-write instructions like INC when accessing $4014.)
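As a rough illustration of why the number of consecutive writes matters, here is a minimal C sketch of a simplified model in which the CPU ignores RDY on write cycles and halts on the first read cycle. The names `cycle_kind` and `cycles_until_halt` are made up for this example and do not come from any emulator.

```c
#include <assert.h>
#include <stddef.h>

/* Kind of bus access the CPU performs on a given cycle. */
typedef enum { CYCLE_READ, CYCLE_WRITE } cycle_kind;

/*
 * Given the CPU's upcoming bus cycles, starting with the cycle on which RDY
 * is pulled low, return how many cycles complete before the CPU halts.
 * Simplified model (an assumption of this sketch): RDY is ignored on write
 * cycles and honoured on the first read cycle.
 * Returns n if no read occurs within the window.
 */
size_t cycles_until_halt(const cycle_kind *cycles, size_t n) {
    assert(cycles != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (cycles[i] == CYCLE_READ)
            return i;
    }
    return n;
}
```

With three writes in a row (the BRK/IRQ/NMI case) the CPU runs three more cycles before it stops; after the single write of STA $4014 it stops after one.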
Awesome, thanks tepples! I think my current implementation is OK then.
Anybody for #1 ?
Does this mean the CPU could crash if an IRQ/NMI were to interrupt a sta/stx/sty $4014 instruction?
IRQ and NMI have the same read pattern as the BRK instruction, which means they start with two reads to fetch (and discard) the opcode.
Nobody knows the answer to my first question? There must be somebody...
:'(
Pleeeeease...lol.
I understand your frustration, but I don't own a logic analyzer with which to watch the NES address bus while it executes a test program.
No need for a logic analyzer, since it should be CPU-observable behavior. You just need to have an IRQ or NMI interrupt said instruction at various positions and see what happens.
Based on the following tests, DMC DMA adds 4 cycles normally, 3 if it lands on a CPU write, 2 if it lands on the $4014 write or during OAM DMA, 1 if on the next-to-next-to-last DMA cycle, 3 if on the last DMA cycle. The test ROMs here print the below outputs, and verify that they match what's expected.
sprdma_and_dmc_dma.zip
They also verify that the bytes copied to OAM match what is expected, so DMC DMA isn't corrupting the data. Further, the DMC sample playing during the test is all $55 bytes, so if the DMC DMA read were corrupted, it'd be audible. I recorded output and don't see any corruption.
This test has DMC DMA occur at each cycle in a piece of code, and prints how many cycles the code took, including any extra cycles the DMA added. For example, this code generates the output after it:
Code:
sta $100 ; 4
lda $100 ; 4
sta $100 ; 4
sta $100 ; 4
T+ Clocks (decimal)
00 20
01 20
02 20
03 19
04 20
05 20
06 20
07 20
08 20
09 20
0A 20
0B 19
0C 20
0D 20
0E 20
0F 19
The code should take 16 cycles, but DMA adds four. However, when it lands on the write cycles of the three STA instructions, it only takes three. You can clearly see the pattern of the STA-LDA-STA-STA in the result, confirming that it's really measuring something useful.
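The table above can be reproduced with a few lines of C. This is only a sketch of the rule as stated in the post (4 added cycles, 3 when the DMA lands on a write), and it assumes the T+ value is simply the index of the cycle the DMA lands on. The names `is_write_cycle` and `clocks_with_dmc_dma` are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TEST_CODE_CYCLES 16

/* true where the test code performs a write: the fourth cycle of each STA. */
static const bool is_write_cycle[TEST_CODE_CYCLES] = {
    false, false, false, true,   /* sta $100 */
    false, false, false, false,  /* lda $100 */
    false, false, false, true,   /* sta $100 */
    false, false, false, true,   /* sta $100 */
};

/* Total clocks for the sequence when the DMC DMA lands on cycle t. */
int clocks_with_dmc_dma(int t) {
    assert(t >= 0 && t < TEST_CODE_CYCLES);
    int added = is_write_cycle[t] ? 3 : 4;
    return TEST_CODE_CYCLES + added;
}
```

Calling `clocks_with_dmc_dma` for t = 0x00 through 0x0F gives 19 at 03, 0B and 0F and 20 everywhere else, which is the STA-LDA-STA-STA pattern in the printed output.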
Now, the code that tests sprite DMA:
Code:
lda #$07 ; 2
sta $4014 ; 4 + 513/514
sta $100 ; 4
T+ Clocks (decimal)
00 527 +4 LDA #$07 ; 2
01 528 +4
02 527 +4 STA $4014 ; 4 + 513/514
03 528 +4
04 527 +4
05 526 +2
06 525 +2
07 526 +2
08 525 +2
09 526 +2
0A 525 +2
0B 526 +2
0C 525 +2
0D 526 +2
0E 525 +2
0F 526 +2
...
200 525 +2
201 526 +2
202 525 +2
203 526 +2
204 524 +1 DMA next-to-next-to-last cycle
205 525 +1 DMA next-to-next-to-last cycle
206 526 +3 DMA last cycle
207 527 +3 DMA last cycle
208 527 +4 STA $100 second cycle
209 528 +4 STA $100 second cycle
20A 526 +3 STA $100 fourth cycle
20B 527 +3 STA $100 fourth cycle
I've manually listed the number of DMA cycles added (clocks-523/524), and what instruction is executing. The main snag is that sprite DMA takes 513 OR 514 cycles, depending on whether it's started on an even or odd 2A03 cycle. I'm assuming this is very similar to $4017 writes being delayed a cycle if on an odd 2A03 cycle.
The way this test works, the test code begins on even/odd 2A03 cycles based on the time it has arranged the DMC DMA to occur. This complicates things. At the end of OAM DMA and after, it means that DMC DMA is only hitting every other cycle of the test code. You can see this in the STA $100 after OAM DMA, where DMC DMA takes three cycles for two different times. This is because both times it's landing on the fourth cycle of STA $100 (I tried other instruction sequences to be sure of this, and it checks out).
Maybe someone with a logic analyzer can see what's really going on. The above is about as much as you're going to get with a CPU test alone.
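For emulator authors, the measured cases can be written down as a lookup. The C sketch below only encodes the numbers reported in this post; the enum and function names are hypothetical, and cases the test could not hit (it only lands on every other cycle near the end of OAM DMA) are not covered.

```c
#include <assert.h>

/* Where the DMC DMA lands, following the cases measured in the post. */
typedef enum {
    DMC_ON_CPU_READ,            /* ordinary CPU read cycle          */
    DMC_ON_CPU_WRITE,           /* ordinary CPU write cycle         */
    DMC_ON_4014_WRITE,          /* the write cycle of STA $4014     */
    DMC_DURING_OAM_DMA,         /* somewhere inside the sprite DMA  */
    DMC_ON_OAM_DMA_THIRD_LAST,  /* next-to-next-to-last DMA cycle   */
    DMC_ON_OAM_DMA_LAST         /* last DMA cycle                   */
} dmc_dma_slot;

/* Number of cycles the DMC DMA adds, as reported by the test ROM output. */
int dmc_dma_added_cycles(dmc_dma_slot slot) {
    switch (slot) {
    case DMC_ON_CPU_READ:           return 4;
    case DMC_ON_CPU_WRITE:          return 3;
    case DMC_ON_4014_WRITE:         return 2;
    case DMC_DURING_OAM_DMA:        return 2;
    case DMC_ON_OAM_DMA_THIRD_LAST: return 1;
    case DMC_ON_OAM_DMA_LAST:       return 3;
    }
    assert(!"unknown dmc_dma_slot value");
    return 4;
}
```

An emulator would still have to work out which slot applies on a given cycle; this only captures the resulting counts.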
Fascinating stuff.
This suggests a shared DMA unit, running on a 2-cycle period, writes happening in the first period, reads happening in the last. There's probably a start flag that suppresses the address bus switchover for the first 1.5 cycles.
Random brainfart guess as to what the sequence and priorities look like:
(all sequences happen in parallel for each cycle, all inputs are read before updating outputs)
bus_request is asserted by 4014 writes and DMC intermediate buffer empty & DMC active
4014 writes set spr_page to data, spr_flag to 0, and spr_byte to 0
Code:
C0:
  if (bus_request) bus_grant <= 1;
  if (bus_grant)
    if (spr_flag)
      addr <= {spr_page, spr_byte}
      data <= spr_data
      spr_byte <= spr_byte + 1
C1:
  bus_request <= (4014 written | spr_flag) | (~bus_grant & DMC enabled & buffer empty)
  if (bus_grant)
    if (DMC)
      addr <= DMC_addr
      DMC_data <= data
    else if (SPR)
      addr <= {spr_page, spr_byte}
      spr_data <= data
      spr_flag <= 1
The above is likely completely incorrect when interrupts factor in, and the cycles may be flipped. The important bit would be that there's a bus request DMA cycle, followed by the working DMA cycle(s). DMC takes priority over the read, and the SDMA can tell when its data isn't valid. Request cycles can be flagged as late as the end of C0.
4014 writes that land on C0 will assert bus_request for C1, and bus_grant will kick in at the next C0. The CPU will be stalled out by RDY in C1 if it's a read, or continue on its merry way if the 4014 access was RMW (it looks at RDY late in the cycle).
Once granted, DMC will read its data. If this is during a SDMA, the grant will already be flagged, so the request is hidden, and there's only a 2-cycle overhead due to stalling the SDMA one DMA cycle. If the DMC pops up near the last-ish SDMA cycle, specifically when SDMA doesn't have to read any more, it will only delay things one cycle, through extending RDY past the end of SDMA.
Holy cow Blargg! Thank you so much! What have I done to deserve such generosity??? I need to look over your post in much greater detail to really understand everything you're saying - I'm heading into work at the moment. I actually just got a nice Tektronix logic analyzer so maybe with your test software we can finally put an end to this mystery once and for all. First, I need to understand the clock cycling as well as you apparently do. Then I could even provide the analyzer traces here on NesDev (or my site).
In fact, now that I think about it, I'm wondering if there are any other long standing questions that could only be answered by a logic analyzer that I might be able to answer while I have the whole thing hooked up? Does anyone know of any? Maybe this would be an opportunity for me to give something back to the NesDev community when they have helped me so much...
Thanks again!
Jonathon
blargg wrote:
Based on the following tests, DMC DMA adds 4 cycles normally, 3 if it lands on a CPU write, 2 if it lands on the $4014 write or during OAM DMA, 1 if on the next-to-next-to-last DMA cycle, 3 if on the last DMA cycle. The test ROMs here print the below outputs, and verify that they match what's expected.
Okay this one leaves me with a big WTF...
First of all my output doesn't match anything in Blargg's description.
Blargg can you provide the source? Is the description you provided in this thread what should be seen on the screen?
It relies on DMC timing and operation being correct. I'll have to finish updating my APU tests and releasing them. This DMC DMA during sprite DMA is one of the hardest-core tests I've written, depending on many other things being perfect.
blargg wrote:
It relies on DMC timing and operation being correct. I'll have to finish updating my APU tests and releasing them. This DMC DMA during sprite DMA is one of the hardest-core tests I've written, depending on many other things being perfect.
Ok so the fact that I don't [yet] have a completely cycle-accurate APU implementation is what is causing those ridiculously large cycle counts?
After I found your APU test suite a couple days ago I rearchitected my APU from a PPU-frame based one to one that is at least running in its own 'frame' field [independent of CPU or PPU 'frames']. However, I still do one 240Hz frame of the APU at once (ie, generate 735/4 samples of sound based on the current state running forward in time).
I think I'll go see if I can get it down to the cycle...
If your APU isn't running on a per cycle basis, how could it possibly be doing the DMA cycle stealing correctly?
ReaperSMS wrote:
If your APU isn't running on a per cycle basis, how could it possibly be doing the DMA cycle stealing correctly?
Yes my APU isn't cycle accurate yet but my sprite DMA is accounted for, so I was expecting numbers for the sprite DMA portion that somewhat matched. I didn't realize the sprite/DMC DMA were used to test each other. I thought it was testing whether or not the # of DMA cycles was correct for *either*.
I also noticed there's only a couple of ROM bytes different between the .nes file and the _512.nes file. What is different about the two variants? I get the same results for both...
Yeah, it's kind of crazy: it's using the DMC to count the number of cycles the sequence of code takes, while at the same time having DMC DMA occur at a particular point in the code.
The _512 variant just has the DMC DMA occur 512 clocks later in the test code, so you add $200 to the time column to see the real offset. The only difference in the ROM would be a couple of delay values, and the CRC.
EDIT: newer APU tests on the wiki.
blargg wrote:
It relies on DMC timing and operation being correct. I'll have to finish updating my APU tests and releasing them. This DMC DMA during sprite DMA is one of the hardest-core tests I've written, depending on many other things being perfect.
Okay I've rearchitected my CPU/APU to be cycle-accurate and in sync with each other [run 3 PPU cycles, one CPU cycle, one APU cycle, lather, rinse, repeat]. I'm nearly there, just for some reason I can't get the logic right to have DMC DMA take only two cycles on the write-cycle (5 in screenshot above) of the STA $4014!
I pass both sets (PAL/NTSC) of old APU tests, only failing the IRQ timing test by one CPU cycle...[still scratching head on this one as indicated in a prior post!]. The new tests all pass individually (running the rom_singles) but for some reason I get failure code 16 on test 7 (dmc_basics) if I run the "all-in-one" test ROM. Apparently there's some condition in my APU that is cleared by a ROM reset that causes the tests to individually give a passing result but not when combined and mapper-switched in and run from a non-reset state.
Curiously I was failing the dmc_rates test both in rom_singles and in the "all-in-one" until I changed my timer divider from a "count up from zero to period then reset to zero" to a "count down from period to zero then reset to period" implementation. With the count up approach I pass everything but the dmc_rates test (either as rom_singles or in the "all-in-one"). I believe this has to do with the way the DMC is being "filled" and started by the rate test so that the whole length of the actual tested period can be measured. In the count up approach it takes too long to get to the newly set period, which causes a "rate too long" error.
So you have sprite DMA taking 513/514 cycles, depending on whether the $4014 write is on an even or odd cycle? Have you tried having DMA take two cycles if the write cycle occurs one cycle before DMA?, or something like that?
I have no way of running the multi-test ROMs, as my devcart only has 32K PRG. I provide those only because I know many people don't want to write any automated testing support into their emulator, and would thus find the singles tedious. I'd much rather just provide the singles, and not deal with building the multi-ROMs. It's possible that a multi-test won't pass on a NES. If you pass the singles, you're good, period. Before releasing I test all the singles by loading them and powering the NES off then back on, to be sure they are solid. Maybe we can find someone to verify my multi-tests with a PowerPak whenever I need to release them.
Quote:
Curiously I was failing the dmc_rates test both in rom_singles and in the "all-in-one" until I changed my timer divider from a "count up from zero to period then reset to zero" to a "count down from period to zero then reset to period" implementation.
I'll have to add a specific test for this, since as far as I know none of the APU counters count up; the only time the period for a timer is examined is when it's being reloaded, not every time it's clocked. It's good at least one caught this.
BTW, your screenshots would be less imposing if you just took it of the NES screen and nothing more. BTW, it looks like your NES window is stretched vertically by about 5 or 6 pixels, judging by the frequency of stretched characters.
blargg wrote:
So you have sprite DMA taking 513/514 cycles, depending on whether the $4014 write is on an even or odd cycle?
Yes.
blargg wrote:
Have you tried having DMA take two cycles if the write cycle occurs one cycle before DMA?, or something like that?
Yes that's what my logic does but I'm still tracing through to see why it doesn't do that on the write cycle.
blargg wrote:
I have no way of running the multi-test ROMs, as my devcart only has 32K PRG. ... It's possible that a multi-test won't pass on a NES. If you pass the singles, you're good, period.
That's good to hear but it'd be even better to see the results of the "all-in-one" on a real NES! Perhaps I'll remove the "all-in-one"'s from my test ROM status.
blargg wrote:
Before releasing I test all the singles by loading them and powering the NES off then back on, to be sure they are solid. Maybe we can find someone to verify my multi-tests with a PowerPak whenever I need to release them.
I need to get me one of them PowerPak's.
blargg wrote:
I'll have to add a specific test for this, since as far as I know none of the APU counters count up; the only time the period for a timer is examined is when it's being reloaded, not every time it's clocked. It's good at least one caught this.
The only time I examine the period is when I am checking to see if I need to emit a clock (and hence reload the counter). Not sure why count-up/count-down made a difference yet, but I'll keep thinking on it.
blargg wrote:
BTW, your screenshots would be less imposing if you just took it of the NES screen and nothing more. BTW, it looks like your NES window is stretched vertically by about 5 or 6 pixels, judging by the frequency of stretched characters.
Yes, essial put in some logic to stretch the emulation window that doesn't quite work right at minimal. I haven't bothered to figure that part out yet. I was kind of hoping someone with more OpenGL knowledge than myself could step in and help get it right.
Maybe I'm not understanding what you mean by count-up/count-down. There are two basic ways to do a timer: decrement count until it reaches zero, then emit clock and reload with period -or- increment counter from zero and compare with period; once it is >= period, emit clock and reset to 0. The latter is not how any of the APU timers work, and is what I took you to mean by count-up. I can't think of what else you could have meant...
blargg wrote:
I have no way of running the multi-test ROMs, as my devcart only has 32K PRG. I provide those only because I know many people don't want to write any automated testing support into their emulator, and would thus find the singles tedious.
As I understand it, the support for automated testing wouldn't be too much harder than the support for a Famicombox or a PlayChoice 10.
Quote:
Before releasing I test all the singles by loading them and powering the NES off then back on, to be sure they are solid. Maybe we can find someone to verify my multi-tests with a PowerPak whenever I need to release them.
I could run a multi-test on an NTSC NES-001 for you.
blargg wrote:
Maybe I'm not understanding what you mean by count-up/count-down. There are two basic ways to do a timer: decrement count until it reaches zero, then emit clock and reload with period -or- increment counter from zero and compare with period; once it is >= period, emit clock and reset to 0. The latter is not how any of the APU timers work, and is what I took you to mean by count-up. I can't think of what else you could have meant...
Your definition of count-up/count-down is equivalent to mine. The latter was my implementation that was failing the dmc_rates test. The former is the implementation that I switched to that started passing the dmc_rates test, but then I stopped passing the "all-in-one" test with that implementation.
tepples wrote:
I could run a multi-test on an NTSC NES-001 for you.
Really? It'd be nice to know the pass/fail status of the all-in-one. If it passes, then there's still work to do for me.
Pardon my incompetence, but I can't find exactly to which ROMs you're referring.
tepples wrote:
Pardon my incompetence, but I can't find exactly to which ROMs you're referring.
http://wiki.nesdev.com/w/index.php/Emulator_tests#APU
The first one.
Quote:
apu_test tests many aspects of the APU that are visible to the CPU.
"All 8 tests passed" on NES+PowerPak and Nestopia 1.40.
NESICIDE wrote:
The only time I examine the period is when I am checking to see if I need to emit a clock (and hence reload the counter). Not sure why count-up/count-down made a difference yet, but I'll keep thinking on it.
Imagine the following situation with both implementations: period of some timer is 100, and timer has just emitted a clock. Then period is changed to 50. With count-down, the next clock will be 100 input clocks later. With count-up, it will only be 50 clocks later.
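The scenario can be checked with a small C sketch of both timer styles. The struct and function names are invented for the example; the count-down version mirrors the "decrement, emit at zero, reload with period" scheme described earlier in the thread, and the count-up version compares against the period on every clock.

```c
#include <assert.h>

/* Both timers keep a period and a running counter. */
typedef struct { unsigned period, counter; } down_timer;
typedef struct { unsigned period, counter; } up_timer;

/* Count-down: the period is only looked at when the counter is reloaded.
 * Returns 1 when an output clock is emitted. */
int down_timer_clock(down_timer *t) {
    assert(t->period > 0 && t->counter > 0);
    if (--t->counter == 0) {
        t->counter = t->period;
        return 1;
    }
    return 0;
}

/* Count-up: the period is compared against on every input clock.
 * Returns 1 when an output clock is emitted. */
int up_timer_clock(up_timer *t) {
    assert(t->period > 0);
    if (++t->counter >= t->period) {
        t->counter = 0;
        return 1;
    }
    return 0;
}

/* Input clocks needed until the next output clock (timers passed by value). */
unsigned down_clocks_until_emit(down_timer t) {
    unsigned n = 0;
    do { ++n; } while (!down_timer_clock(&t));
    return n;
}

unsigned up_clocks_until_emit(up_timer t) {
    unsigned n = 0;
    do { ++n; } while (!up_timer_clock(&t));
    return n;
}
```

Starting each timer in its "just emitted a clock" state with period 100 and then changing the period to 50 gives 100 input clocks for the count-down timer and 50 for the count-up timer.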
tepples wrote:
Quote:
apu_test tests many aspects of the APU that are visible to the CPU.
"All 8 tests passed" on NES+PowerPak and Nestopia 1.40.
Thanks tepples! I had Nestopia stuck in PAL mode, so I was getting failures there but didn't spend time trying to figure out why until you pointed out the above.
tepples wrote:
Quote:
apu_test tests many aspects of the APU that are visible to the CPU.
"All 8 tests passed" on NES+PowerPak and Nestopia 1.40.
And now NESICIDE. Yay!
I took a closer look at my execution trace and noticed that with the sprdma_and_dmc_dma.nes test ROM, the DMA was occurring in the wrong place relative to the CPU cycle. I do an APU cycle on each CPU memory access [read, write, DMA]. If I run the APU cycle *after* the memory access then I get a failure both in the apu_tests.nes and in sprdma_and_dmc_dma.nes. If I run the APU cycle *before* the memory access then I pass both.
Blargg, I removed the chaff from the image...just for you!
I still fail the second (+512) sprdma_and_dmc_dma.nes test though. I take it this is supposed to show the effect of DMC DMA on the opposite end of the sprite DMA event (ie. the last 16 or so sprite DMA cycles?) It seems to be failing on the first 16 cycles though so I'm not sure.
105 test ROMs down [passing], 31 more to go [failing to some degree].
<completely aside>Does anyone know if Chris Covell's RasterDemo3*.nes (*=,a,b,c,...) work on a real NES?
- How did you manage to get this test to pass? I mean, do you control the number of "eaten" CPU cycles by the DMC DMA?
Zepper wrote:
- How did you manage to get this test to pass? I mean, do you control the number of "eaten" CPU cycles by the DMC DMA?
Yes. My PPU/CPU/APU are all cycle driven. Most emulators (Nestopia 1.40, for one) do the sprite DMA all in one big chunk (a quick for loop) on any write to $4014 and just let the cycle accounting mechanisms take care of the held-off CPU time. My PPU does the sprite DMA inline with its cycle emulation, one cycle at a time, for 513 or 514 cycles. A write to $4014 simply sets up the internal state, the cycle-by-cycle emulate method does the rest.
Thus, whenever a DMC DMA occurs, I know whether or not I'm in the middle of a sprite DMA or not. And, because the PPU actually does the 513/514 cycle DMA one cycle at a time, I know exactly where in the sprite DMA I am when the DMC DMA occurs. My CPU also provides the necessary "read/fetch, or write" information about the most recent cycle (useful, of course, only if the PPU isn't in the middle of a sprite DMA). From there it's just a few easy checks to see whether to have 3, 2, or 1 wait state cycles before the DMC DMA cycle.
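A per-cycle sprite DMA of the kind described here might look roughly like the C sketch below. It is not the NESICIDE code: the `oam_dma` struct and both functions are hypothetical, the sketch assumes the extra 514th cycle is added when the transfer starts on an odd cycle (the thread does not say which parity), and it writes straight into an OAM array instead of going through $2004.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     active;
    unsigned idle_left;  /* 1 or 2 idle cycles before the first read */
    unsigned index;      /* 0..255, byte currently being copied      */
    bool     have_byte;  /* true between a read and its write        */
    uint8_t  latch;      /* byte read on the previous cycle          */
    uint8_t  page;       /* value written to $4014                   */
} oam_dma;

/* Called on the $4014 write: only sets up state, copies nothing yet. */
void oam_dma_start(oam_dma *d, uint8_t page, bool on_odd_cycle) {
    d->active = true;
    d->idle_left = on_odd_cycle ? 2 : 1;  /* parity rule is an assumption */
    d->index = 0;
    d->have_byte = false;
    d->latch = 0;
    d->page = page;
}

/* Performs exactly one DMA cycle. Returns true while the transfer is
 * still in progress after this cycle. cpu_mem is a flat 64K array. */
bool oam_dma_step(oam_dma *d, const uint8_t *cpu_mem, uint8_t *oam) {
    assert(d->active);
    if (d->idle_left > 0) {
        --d->idle_left;                               /* idle cycle  */
    } else if (!d->have_byte) {
        d->latch = cpu_mem[((unsigned)d->page << 8) | d->index];
        d->have_byte = true;                          /* read cycle  */
    } else {
        oam[d->index] = d->latch;                     /* write cycle */
        d->have_byte = false;
        if (++d->index == 256)
            d->active = false;
    }
    return d->active;
}
```

Because the state records whether the next cycle is a read or a write and which byte is in flight, a DMC fetch arriving mid-transfer can tell exactly where in the sprite DMA it has landed.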
I now pass both of the sprdma_and_dmc_dma test ROMs and have shifted attention over to trying to pass the $4016/DMC effect test ROMs.
My code is all up on Gitorious if you're curious.
Quote:
And, because the PPU actually does the 513/514 cycle DMA one cycle at a time, I know exactly where in the sprite DMA I am when the DMC DMA occurs.
The 2A03 does the DMA, not the PPU, as far as I know. I'm pretty sure the PPU just sees it as a series of $2004 writes every other CPU cycle.
NESICIDE wrote:
From there it's just a few easy checks to see whether to have 3, 2, or 1 wait state cycles before the DMC DMA cycle.
- That's my question, what checks do you do? I couldn't locate anything in your code.
Zepper wrote:
NESICIDE wrote:
From there it's just a few easy checks to see whether to have 3, 2, or 1 wait state cycles before the DMC DMA cycle.
- That's my question, what checks do you do? I couldn't locate anything in your code.
@Zepper: Look for calls to C6502::STEALCYCLES and calls to C6502::DMA. Also look at CPPU::PPU(unsigned short addr, unsigned char data) for IOSPRITEDMA (where $4014 is written and the transfer is set up).
@blargg: I agree the DMA cycles originate from the 6502...I just left the implementation in the PPU because I originally just did the quick 256-byte for-loop copy there. I only do a DMA or STEALCYCLES call once every 3 or 3.2 PPU cycles. I'll look into moving it to be more correct.
EDIT: files are emulator/cnes6502.h and .cpp, emulator/cnesppu.h and .cpp, and emulator/cnesapu.h and .cpp.
I'm struggling with an APU/SDL sync problem at the moment tho so even though the APU is "flawless" according to blargg's tests it sounds like poop warmed over. I really hate the SDL callback interface. Not flexible *at all*!
Hey all, this is a bit of a bump but I think it's worth it. I recently posed my question #1 (from wayyyyy back in the original post that created this thread) to Kevtris over PM. We had a few back and forths, but this is the ultimate answer that he provided:
kevtris wrote:
In my implementation, the sprite DMA takes precedence as it must, to prevent graphics corruption. The sample DMA is not stalled though and it will attempt to fetch a sample if it needs one, which happens to be sprite data at the time.
Be that as it may, I have not heard any audible artifacts from this happening even when it plays a byte of sprite data as sample data; probably because the samples are really noisy as-is, and 8 samples doesn't make much difference because the rest of the waveform is noisy as hell.
A test would be to play a continuous DPCM stream of 00 or ff to peg the DAC counter at one of the ends, then sprite DMA the opposite.
like if you DPCM 00h's sprite DMA ff's and watch the output. if it blips then I am doing it right. if not, either sprite fetches are getting screwed, or DPCM is being deferred or injecting extra reads (doubtful).
I think this is incredibly interesting. Essentially what Kevtris is saying is that a DMC DMA operation will never be stalled when an OAM DMA operation is in progress - nor will an OAM DMA operation be stalled as a result of an in-progress DMC DMA operation. The DMC DMA (at least in his design) will just receive corrupted data (which will actually be sprite data) whenever an OAM DMA xfer needs to occur. Given how accurate Kevtris' emulator has been touted to be, especially for the most bizarre and picky games, I would expect that this is the correct implementation. Not only that but for some reason it just seems to me like the kind of thing Nintendo would do to save engineering time and money.
So it's pretty funny because the short, one-word answer to my original question #1 is: Neither.
Anyway, I just wanted to share that bit of info and make sure that it got out into the community. Kevtris said he was cool with me sharing his answer. I'm curious if anyone has any additional thoughts on it.
Pz!
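To get a feel for how big the "blip" in the test kevtris proposes would be, here is a small C sketch of the DMC output counter consuming sample bytes. It assumes the usual DMC output rule (each 1 bit adds 2, each 0 bit subtracts 2, clamped so the 7-bit counter stays in range); the function name `dmc_play_byte` is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shift one sample byte through the DMC output unit, least significant bit
 * first. A 1 bit adds 2 to the 7-bit counter and a 0 bit subtracts 2; steps
 * that would leave the 0..127 range are skipped.
 * Returns the counter value after all 8 bits.
 */
uint8_t dmc_play_byte(uint8_t counter, uint8_t sample) {
    assert(counter <= 127);
    for (int bit = 0; bit < 8; ++bit) {
        if (sample & 1) {
            if (counter < 126)
                counter += 2;
        } else {
            if (counter > 1)
                counter -= 2;
        }
        sample >>= 1;
    }
    return counter;
}
```

Starting from 64, four bytes of $00 peg the counter at 0. A single stray $FF byte (standing in for sprite data) then lifts it to 16, and the next $00 byte brings it back to 0, so one corrupted byte shows up as a short bump of 16 steps out of 127.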
Bump, has the behavior (corrupted DMC samples during sprite DMA) been confirmed?
Hiya dwedit,
I'm pretty positive that Kevtris' implementation is correct. I tested the following three different implementations using Duck Hunt:
Method 1: Stop OAM DMA while DMC DMA is occurring
Result: duck/dog sprites visually corrupted on screen whenever "Quack!/Arff!" sounds occurred
Method 2: Stop DMC DMA while OAM DMA is occurring
Result: heard audible distortion in "Quack!/Arff!" sounds
Method 3: Do not pause DMC DMA if OAM DMA occurs, but DMC receives sprite data while OAM DMA is occurring (this is kevtris' method)
Result: no audible or visual distortion whatsoever
I also tested it with Kung Fu and received similar (and worse) results. Kung Fu will eventually freeze if you use method 1. Likely because it uses the DMC channel much more frequently than Duck Hunt.
Hope that helps!
After re-reading the thread, Blargg mentioned that DMC used two cycles if it happened during a 4014 sprite dma transfer. So does it interrupt the sprite DMA to perform the DMC read? What's it doing during those two cycles?
Why would stopping Sprite DMA to do a DMC read corrupt the sprites?
Also wondering because Nintendulator doesn't pass the timing tests for DMC reads during sprite DMA.
Dwedit wrote:
After re-reading the thread, Blargg mentioned that DMC used two cycles if it happened during a 4014 sprite dma transfer. So does it interrupt the sprite DMA to perform the DMC read? What's it doing during those two cycles?
Why would stopping Sprite DMA to do a DMC read corrupt the sprites?
Also wondering because Nintendulator doesn't pass the timing tests for DMC reads during sprite DMA.
I finally found the post I wanted to reply to with this topic post. I plan to implement what I'm observing from my Visual2A03 trials in my emulator. As you said, it does indeed interrupt the sprite DMA to perform the DMC read...for two cycles. The read/write/read beat of the sprite DMA is kept such that the DMC read occurs where the sprite read from the sprite memory page would otherwise occur. Then in the cycle that should be a write to 2004 there's just a CPU read from the PC value before the sprite DMA started. Then the sprite DMA picks up where it left off.
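The beat described above can be sketched as a trace of bus slots. This C example is a simplified model, not Visual2A03 output: it ignores the idle cycles at the start of the transfer and the special cases near its end, and all names (`slot_kind`, `oam_dma_trace`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    SLOT_SPRITE_READ,   /* read from the sprite page                 */
    SLOT_SPRITE_WRITE,  /* write to $2004                            */
    SLOT_DMC_READ,      /* DMC fetch taking over a sprite read slot  */
    SLOT_DUMMY_READ     /* CPU read from PC in the would-be write    */
} slot_kind;

/*
 * Fill trace[] with the bus activity for copying `bytes` sprite bytes when a
 * DMC fetch becomes pending at cycle `dmc_at` (negative means never).
 * The read/write alternation is kept: the DMC read waits for the next read
 * slot and the following write slot becomes a dummy read.
 * Returns the number of cycles used.
 */
size_t oam_dma_trace(unsigned bytes, long dmc_at, slot_kind *trace, size_t cap) {
    size_t n = 0;
    unsigned copied = 0;
    bool dmc_done = false;
    while (copied < bytes) {
        assert(n + 2 <= cap);
        if (!dmc_done && dmc_at >= 0 && (long)n >= dmc_at) {
            trace[n++] = SLOT_DMC_READ;
            trace[n++] = SLOT_DUMMY_READ;
            dmc_done = true;
        } else {
            trace[n++] = SLOT_SPRITE_READ;
            trace[n++] = SLOT_SPRITE_WRITE;
            ++copied;
        }
    }
    return n;
}
```

Copying 4 bytes takes 8 slots without a DMC fetch and 10 with one, which matches the two added cycles measured for a DMC DMA in the middle of a sprite DMA.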
Is source code available for the sprdma_and_dmc_dma test roms?
Looks like it was an off-the-cuff test. I've located the source and will package it the next time I'm on my old Mac.
blargg wrote:
Looks like it was an off-the-cuff test. I've located the source and will package it the next time I'm on my old Mac.
Would be appreciated. Could PM it too if you have it and feel like it - off-the-cuff state is better than nothing.
Haven't traced through the code to figure out what's going on yet, but for some reason the output ends up way off. I pass all the apu_test, apu_reset, and cpu_interrupts_v2 tests (and all the ppu_vbl_nmi tests except 07-nmi_on_timing.nes (off by 1-2 ticks - might be some analog thing going on there)).
I end up with
Code:
tests/sprdma_and_dmc_dma/sprdma_and_dmc_dma.nes FAILED
T+ Clocks (decimal)
00 3768
01 3767
02 3768
03 3767
04 3768
05 3767
06 3766
07 3765
08 3766
09 3765
0A 3766
0B 3765
0C 3766
0D 3765
0E 3766
0F 3765
7461977F
SPRDMA and DMC DMA
Failed
tests/sprdma_and_dmc_dma/sprdma_and_dmc_dma_512.nes FAILED
T+ Clocks (decimal)
00 3766
01 3765
02 3766
03 3765
04 3766
05 3765
06 3766
07 3767
08 3768
09 3767
0A 3768
0B 3767
0C 3768
0D 3767
0E 3768
0F 3767
2EA11D4D
SPRDMA and DMC DMA
Failed
Here's the DMC code in case anyone can spot any obvious bugs/misunderstandings. The sample loading timing is ballparked at the moment and misses some corner cases, though I don't think by enough to warrant the huge error.
channel_updated is for sample generation and can be ignored.
Registers:
Code:
// $4010
void write_dmc_reg_0(uint8_t value) {
    static uint16_t const dmc_ntsc_periods[] =
        { 428, 380, 340, 320, 286, 254, 226, 214, 190, 160, 142, 128, 106, 84, 72, 54 };
    if (!(dmc_irq_enabled = value & 0x80)) {
        dmc_irq = false;
        update_irq_status();
    }
    dmc_loop_sample = value & 0x40;
    dmc_period = dmc_ntsc_periods[value & 0x0F];
}
// $4011
void write_dmc_reg_1(uint8_t value) {
    unsigned const old_dmc_counter = dmc_counter;
    dmc_counter = value & 0x7F;
    if (dmc_counter != old_dmc_counter)
        channel_updated = true;
}
// $4012
void write_dmc_reg_2(uint8_t value) {
    dmc_sample_start_addr = 0x4000 | (value << 6);
}
// $4013
void write_dmc_reg_3(uint8_t value) {
    dmc_sample_len = (value << 4) + 1;
}
Clocking:
Code:
void tick_apu() {
    ...
    if (--dmc_period_cnt == 0) {
        dmc_period_cnt = dmc_period;
        clock_dmc();
    }
    ...
}
static void clock_dmc() {
    if (dmc_bits_remaining > 0) {
        if (dmc_sample_buffer & 1) {
            if (dmc_counter < 126) {
                dmc_counter += 2;
                channel_updated = true;
            }
        }
        else
            if (dmc_counter > 1) {
                dmc_counter -= 2;
                channel_updated = true;
            }
        dmc_sample_buffer >>= 1;
        if (--dmc_bits_remaining == 0 && dmc_bytes_remaining > 0)
            load_dmc_sample_byte();
    }
}
Status:
Code:
// $4015
uint8_t read_apu_status() {
    uint8_t const res =
        (dmc_irq << 7) |
        (frame_irq << 6) |
        (cpu_data_bus & 0x20) | // Open bus
        ((dmc_bytes_remaining > 0) << 4) |
        ((noise_len_cnt > 0) << 3) |
        ((tri_len_cnt > 0) << 2) |
        ((pulse[1].len_cnt > 0) << 1) |
        (pulse[0].len_cnt > 0);
    frame_irq = false;
    update_irq_status();
    return res;
}
// $4015
void write_apu_status(uint8_t value) {
    ...
    // We need to clear the DMC IRQ before handling the DMC enable/disable in
    // case a one-byte sample is loaded below, which will immediately fire a
    // DMC IRQ
    dmc_irq = false;
    update_irq_status();
    // DMC enable bit. We model DMC enabled/disabled through the number of
    // sample bytes that remain (greater than zero => enabled).
    if (!(value & 0x10))
        dmc_bytes_remaining = 0;
    else {
        if (dmc_bytes_remaining == 0) {
            dmc_sample_cur_addr = dmc_sample_start_addr;
            dmc_bytes_remaining = dmc_sample_len;
            // If a sample byte is currently being played, the sample is
            // restarted only after it has finished
            if (dmc_bits_remaining == 0)
                load_dmc_sample_byte();
        }
    }
    ...
}
Sample loading:
Code:
static void load_dmc_sample_byte() {
    // Timing: http://forums.nesdev.com/viewtopic.php?p=62690#p62690
    // TODO: Open bus?
    assert(dmc_bytes_remaining > 0);
    assert(dmc_bits_remaining == 0);
    dmc_sample_buffer = prg(dmc_sample_cur_addr);
    // We use tick() since the PPU as well as the rest of the APU should
    // keep ticking during the fetch.
    // TODO: Is this done before or after the IRQ is generated? (Should be
    // invisible though.)
    unsigned const delay = doing_oam_dma ? 2 : doing_read ? 4 : 3;
    for (unsigned i = 0; i < delay; ++i) tick();
    dmc_sample_cur_addr = (dmc_sample_cur_addr + 1) & 0x7FFF;
    // Putting this after the delay ensures that we can't get a recursive
    // invocation of load_dmc_sample_byte(), since it can only be called
    // through dmc_bits_remaining going from 1 to 0 while the CPU is stalled
    dmc_bits_remaining = 8;
    if (--dmc_bytes_remaining == 0) {
        if (dmc_loop_sample) {
            dmc_sample_cur_addr = dmc_sample_start_addr;
            dmc_bytes_remaining = dmc_sample_len;
        }
        else
            if (dmc_irq_enabled) {
                dmc_irq = true;
                update_irq_status();
            }
    }
}
Still a bit confused as to what this test is doing and expects to happen at different points.
For example, the test_ routine, which gets called 10 times to test different sample loading locations, starts like
Code:
test_:
jsr print_a
pha
eor #$FF
pha
setb $4012,<((dmc_sample-$C000)/$40)
jsr pre_test
jsr time_code_begin
; Start DMC
setb $4015,$10 ; fill sample buffer
setb $4015,dma*$10
...
time_code_begin is just an alias for begin_dmc_timer, and dma is 1, meaning #$10 gets written to $4015 twice.
Does this code expect the first write to $4015 to immediately start a new sample? When I get there, the last byte of the sample used for synchronization in begin_dmc_timer is still playing, meaning the first write to $4015 queues up another sample and the second write to $4015 has no effect at all.
Similarly, there's a call to time_code_end (alias for end_dmc_timer) at the end of test_. end_dmc_timer starts with
Code:
.align 64
end_dmc_timer:
; Restart
lda #$1F
sta SNDCHN
nop
sta SNDCHN
; Rough sync
ldy #-$45
@coarse:
nop
lda #$10
bne :+
: dey
bit SNDCHN
bne @coarse
; DO NOT write to memory. It affects timing.
; Fine sync
ldx #-$2
@sync:
....
When this routine gets called the sample has already finished playing with some margin. Is this expected? What's the significance of -$45 as a constant?
I believe this repeatedly does sprite DMA and has a DMC DMA read occur at various relative times, then shows how long the sprite DMA took, in cycles.
Yup, got that much. Trying to understand how the test code itself works and what assumptions it makes though since the output is so off despite passing all the APU tests.
I added more comments in sync_dmc.s and dmc_timer.s:
sprdma_and_dmc_dma2.zip
blargg wrote:
I added more comments in sync_dmc.s and dmc_timer.s:
sprdma_and_dmc_dma2.zip
Thanks. Was occupied for a while, but I'll start looking into it now.
There is something I don't get in end_dmc_timer, which might also be a clue to what's wrong with my code/understanding. The coarse sync part looks as follows:
Code:
; Returns in XA number of cycles elapsed since call to
; time_code_begin, MOD dmc_timer_modulo. Unreliable if
; result is dmc_timer_max or greater.
.align 64
end_dmc_timer:
; The arbitrary starting X and Y values for the
; loops merely set an adjustment added to the
; final count.
; Restart sample, which will immediately
; finish since nothing's playing, then
; start again which will ensure the flag
; stays set until the second one begins.
; This means that bit 4 of SNDCHN will be set
; a fixed amount of time after begin_dmc_timer
; completed.
lda #$1F
sta SNDCHN
nop
sta SNDCHN
; Coarse sync
; Get within a few cycles of when DMC sample finishes.
; Keep a count since each iter is 16 cycles.
ldy #-$45
@coarse:
; 16 cycles/iter
nop
lda #$10
bne :+
: dey
bit SNDCHN
bne @coarse
...
ldy #-$45 is the same as ldy #187, and y will be decremented every 16 cycles until one sample byte has played.
Since the period is set to 428, it should take roughly between 428*7 and 428*8 cycles to play one sample byte, depending on the current alignment with the DMC timer. However, (428*7)/16 = 187.25 and (428*8)/16 = 214, meaning y will either end up very small or underflow. For y << 4 | x to have the expected value of ~528 y would have to be about 32 though, which seems impossible.
Any ideas? Is this some off-by-one error?
OK, I've given in and done a careful timing analysis so we can see exactly how this works (and thankfully it all squares away):
Code:
begin_dmc_timer:
php
jsr sync_dmc
sync_dmc:
...
bit SNDCHN ; reads just as bit 4 is cleared
; new bit cycle begins at current high rate
bne @sync ; 3
; -1
pla ; 4
rts ; 6
pha ; 3
lda #$00 ; 2
sta $4010 ; 4 switch to lowest rate for remaining 7 bits
pla ; 4
nop ; 2
plp ; 4
rts ; 6
... code
jsr end_dmc_timer ; 6
end_dmc_timer:
lda #$1F ; 2
sta SNDCHN ; 4 starts immediately, thus causes DMA read now
; 4 DMC DMA read
nop ; 2
sta SNDCHN ; 4
ldy #-$45 ; 2
@coarse:
; 16 cycles/iter
nop ; 2
lda #$10 ; 2
bne :+ ; 3
: dey ; 2
bit SNDCHN ; 4 reads 74+timed code cycles after earlier BIT SNDCHN
bne @coarse ; 3
The DMC timer's rate doesn't change until the next bit, so the first bit is 54 cycles, and the remaining 7 bits are 428 cycles. So SNDCHN bit 4 will be cleared 54+428*7=3050 cycles later. Thus if the loop finds it clear the first time through, it's been at minimum 3050 cycles since the earlier BIT SNDCHN, or 2976 cycles of user code between calls. 2976/16 = 186, so Y should be 186 if the coarse loop never iterates. -$45 is 187, and the DEY always executes at least once, so this gives the correct Y.
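For anyone who wants to re-check the arithmetic in the paragraph above, here it is as a small C sketch. The function and constant names are made up for illustration and are not from the test source; the 74-cycle overhead is simply taken from the cycle listing above.

```c
#include <stdint.h>

enum {
    FAST_PERIOD   = 54,   /* rate still in effect for the first bit */
    SLOW_PERIOD   = 428,  /* lowest rate, used for the remaining 7 bits */
    OVERHEAD      = 74,   /* fixed cycles between the two BIT SNDCHN reads */
    COARSE_CYCLES = 16    /* cycles per coarse loop iteration */
};

/* Cycles from the sync read until SNDCHN bit 4 is cleared. */
static unsigned cycles_until_flag_clears(void) {
    return FAST_PERIOD + 7 * SLOW_PERIOD;
}

/* Smallest amount of user code for which the coarse loop finds the
   flag already clear on its first pass. */
static unsigned min_user_cycles(void) {
    return cycles_until_flag_clears() - OVERHEAD;
}

/* Y after "ldy #-$45" followed by the one DEY that always executes. */
static unsigned coarse_y_after_single_pass(void) {
    uint8_t y = (uint8_t)(-0x45);  /* wraps to 187 in 8 bits */
    y--;
    return y;
}
```

With these numbers, min_user_cycles() / COARSE_CYCLES and coarse_y_after_single_pass() should agree, which is the point of the -$45 constant.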
Code:
; -1
ldx #-$2 ; 2
@sync:
lda #$1F ; 2
sta SNDCHN ; 4
; 4 DMC DMA
lda #179 ; 3402 delay
: nop
nop
nop
nop
nop
nop
sec
sbc #1
bne :-
inx ; 2
lda #$10 ; 2
bit SNDCHN ; 4 reads 3424 cycles after BIT SNDCHN in coarse loop
; then every 3423 cycles
beq @sync ; 3
The fine sync loop reads at the same relative time to the DMC sample ending as the coarse loop, thus it always reads after the sample ended on the first iteration and loops back at least once. If the user code took 2976 cycles, it will find the sample not ended on the second iteration and exit the loop, leaving X at 0 as desired.
If the user code took one cycle more, the two loops would have read one cycle later, and the fine sync loop would have run one time more, leaving X one greater at 1.
I think I used negative initial values for X and Y because I tuned this empirically; I started out with 0 for X and Y, timed some zero-cycle user code, then took the resulting value, broke it into the high 8 bits and low 4 bits, negated these, and put them in Y and X, so that I'd get 0. I never did the above careful timing analysis, because in my experience it's time-consuming and can easily overlook something critical, while empirical testing and tuning is efficient and reliable since it involves writing edge-case tests to find the stable range. It was fun to verify this by careful analysis though, so thanks for the opportunity.
But wouldn't the coarse loop in end_dmc_timer always iterate between (428*7)/16 = 187.25 and (428*8)/16 = 214 times, regardless of timing? end_dmc_timer starts a new sample, and that should set the number of remaining bits to 8, meaning it'll have to go through 8 DMC clocks before loading the final sample byte and clearing SNDCHN bit 4.
Iterating between 187 and 214 times makes y way out of range compared to the expected value, so I must be missing something.
Quote:
end_dmc_timer starts a new sample, and that should set the number of remaining bits to 8, meaning it'll have to go through 8 DMC clocks before loading the final sample byte and clearing SNDCHN bit 4.
I think we've found the problem
APU DMC wrote:
When an output cycle ends, a new cycle is started as follows:
* The bits-remaining counter is loaded with 8.
* If the sample buffer is empty, then the silence flag is set; otherwise, the silence flag is cleared and the sample buffer is emptied into the shift register.
When the timer outputs a clock, the following actions occur in order:
[...]
* The bits-remaining counter is decremented. If it becomes zero, a new cycle is started.
Nothing can interrupt a cycle; every cycle runs to completion before a new cycle is started.
If there's no sample byte loaded in time, it'll output eight silence bits, even if you get a sample byte loaded just after the first silence bit begins.
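A minimal C sketch of the output unit as the wiki text quoted above describes it might look like this. All names here (DmcOutput, dmc_fill_buffer and so on) are hypothetical, the memory reader and IRQ logic are left out, and the power-on state is an assumption; it is only meant to show that the bits-remaining counter runs during silence too.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the DMC output unit only. */
typedef struct {
    uint8_t shift_reg;
    uint8_t bits_remaining;  /* counts down on every timer clock */
    bool    silence;
    uint8_t sample_buffer;
    bool    buffer_full;
    uint8_t level;           /* 7-bit output level, 0..127 */
} DmcOutput;

static void dmc_output_init(DmcOutput *d) {
    d->shift_reg = 0;
    d->bits_remaining = 8;   /* assumed starting state */
    d->silence = true;
    d->sample_buffer = 0;
    d->buffer_full = false;
    d->level = 0;
}

/* Called when the memory reader delivers a byte. */
static void dmc_fill_buffer(DmcOutput *d, uint8_t byte) {
    d->sample_buffer = byte;
    d->buffer_full = true;
}

/* Start of a new output cycle. */
static void dmc_start_cycle(DmcOutput *d) {
    d->bits_remaining = 8;
    if (!d->buffer_full) {
        d->silence = true;
    } else {
        d->silence = false;
        d->shift_reg = d->sample_buffer;
        d->buffer_full = false;  /* a real APU would request a refill here */
    }
}

/* Called each time the DMC timer outputs a clock. */
static void dmc_timer_clock(DmcOutput *d) {
    if (!d->silence) {
        if (d->shift_reg & 1) {
            if (d->level <= 125) d->level += 2;
        } else {
            if (d->level >= 2) d->level -= 2;
        }
    }
    d->shift_reg >>= 1;
    /* The counter is decremented whether or not the unit is silent. */
    if (--d->bits_remaining == 0)
        dmc_start_cycle(d);
}
```

In this model a byte that arrives one clock into a silent cycle sits in the buffer for the remaining seven clocks before it starts playing.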
Just to further verify, when I change end_dmc_timer to load Y and X with zero before the loops and uncomment the jsr print_y/jsr print_x lines, and run this,
Code:
jsr time_code_begin
jsr time_code_end
I get 45 02. So for zero user cycles, $45 needs to be subtracted from Y at that point, and $02 from X. So it's equivalent to just load Y with -$45 and X with -$02 before the loops.
(And can I get an amen! I've finally begun porting my console development programming setup to my Linux box, and was able to do this testing just now without having to power up my old Mac. Very convenient now.)
Ahh, that explains it. Hadn't realized silence on the DPCM channel worked like that.
So the bits remaining count is updated each DMC clock regardless of whether a sample is playing or not, and the DPCM can only transition from silent to playing at the boundary after a "silent sample byte"?
Thanks!
Quote:
So the bits remaining count is updated each DMC clock regardless of whether a sample is playing or not, and the DPCM can only transition from silent to playing at the boundary after a "silent sample byte"?
Right.
A test for that would be nice if you ever update the apu_test tests. I pass all of them without implementing it.
Started digging into the circuitry a bit, and it looks like the 2A03 handles rdy for OAM DMA and PCM reads in a pretty clever way. Rather than just pulling rdy low for a fixed safe "minimum" time, it first pulls it low and then waits for a CPU read. Once it sees a read, it knows rdy must have kicked in, and moves on to doing the transfer.
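As a rough illustration of the "pull RDY low, then wait for a read" idea, here is a tiny C sketch. It is not derived from the circuit itself; the function name and the array-of-cycles representation are assumptions made up for this example. It relies only on the point from the wiki that the CPU ignores RDY during write cycles.

```c
#include <stdbool.h>
#include <stddef.h>

/* is_write[i] tells whether the CPU's i-th bus cycle after RDY is
   pulled low is a write. Returns the index of the first read cycle,
   which is where the CPU actually halts and the DMA unit can take
   over, or n if no read occurs in the given window. */
static size_t first_halt_cycle(const bool *is_write, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (!is_write[i])
            return i;
    }
    return n;
}
```

For the three consecutive writes of an interrupt sequence followed by a read, this gives index 3, and for an immediate read it gives 0, so no fixed worst-case wait is needed.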
In case it's helpful to future people pulling their hair out over this one:
To pass, it's important that the first $4014 write lands on an even cycle (and so only adds a single dummy cycle). Since the test synchronizes to the DMC, getting this right depends on the DMC clocks happening on even cycles too, which in turn depends on the power-on value of the DMC timer (the one counting down and getting reloaded with the period when it reaches 0). On the real thing, I think the power-on value is 428 (or the equivalent at least - it uses a linear feedback shift register), though only the even/oddness should matter for this test.
If you get 528,527,528,527,... instead of 527,528,527,528,..., that's likely the problem.
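The parity dependence can be sketched in C like this. The function name is hypothetical, and which cycles count as "even" depends on how your emulator numbers CPU cycles; this follows the convention in the post above, where a $4014 write on an even cycle adds a single dummy cycle and an odd one adds an extra alignment cycle.

```c
#include <stdint.h>

/* Number of cycles the CPU is stalled by a sprite DMA started by a
   $4014 write landing on the given CPU cycle (DMC DMA not included). */
static unsigned oam_dma_stall_cycles(uint64_t cpu_cycle) {
    unsigned dummy = (cpu_cycle & 1) ? 2 : 1;  /* odd: one extra alignment cycle */
    return dummy + 256 * 2;                    /* 256 read/write pairs */
}
```

If your cycle counter starts off by one relative to the DMC timer, the two results swap, which would produce the alternating pattern in the wrong order.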
I'm getting an error in line 05: I get 528, where it should be 526. All the others are correct. Any help, please?
EDIT: found the problem. I was setting the number of DMC stealing cycles _after_ a STA $4014 write, when it should be set _before_ it.