Adding features to discrete mapper with multipurposed CIC

by infiniteneslives on 2017-03-09 (#190805)

Starting a more in depth conversation about an idea I recently shared in discussion on methods for parallax techniques.

So the CIC is the 'necessary evil' of a ~50cent chip that must be included on any game seeking to be published in 72pin form. The current popular choice is the attiny13, selected by both Jim's cool and Krikzz. They utilize the 4MHz CIC clock as the mcu's clock source and instruction cycle counting to ensure proper timing of communications with the main board CIC lock. The attiny13 is a rather easy choice to target for a CIC solution. It is effectively the lowest cost Microchip (AVR/PIC) solution that has adequate i/o and nvm to store the current region.

For several different NES/SNES projects I've been considering an alternate CIC solution that might allow the 50cent budget of the CIC to go towards a more powerful chip that would be capable of being dual tasked with CIC comms and some sort of mapper interfacing. The attiny13 doesn't really have the io nor time to spare for dual tasking.

Looking over Krikzz's implementation, after the first few cycles where the current stream is determined, the main loop spins for 260+ cycles (65 usec) between each Din/Dout transaction. I took some quick logic analyzer captures to confirm there is ~75usec between each transaction. So the attiny13 operating at 4MHz is only spending ~15% of its time processing CIC communications. With no other mcu timers available and nothing better to do with its time, the attiny13 has no choice other than cycle counting. Trying to make use of the 2 remaining i/o during that underutilized time would be rather challenging with exact cycle counting and no options for interrupts etc.
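As a quick sanity check of the numbers above, here is a minimal sketch of the utilization arithmetic. It only uses the figures quoted in this post (260 idle cycles, 4MHz clock, ~75usec spacing); the function and constant names are made up for illustration.

```python
# Rough utilization estimate for the cycle-counted attiny13 key.
# All figures are the ones quoted in the post; nothing here is measured.
CLOCK_HZ = 4_000_000      # CIC clock used as the mcu clock
IDLE_CYCLES = 260         # cycles the main loop spins between transactions
PERIOD_USEC = 75.0        # approximate spacing seen on the logic analyzer


def idle_usec(cycles: int, clock_hz: int) -> float:
    """Convert a cycle count to microseconds at the given clock."""
    return cycles * 1_000_000 / clock_hz


def busy_fraction(idle_us: float, period_us: float) -> float:
    """Fraction of each period that is not spent in the idle spin loop."""
    return (period_us - idle_us) / period_us


idle = idle_usec(IDLE_CYCLES, CLOCK_HZ)
busy = busy_fraction(idle, PERIOD_USEC)
print(f"idle {idle:.0f} usec, busy {busy:.0%}")
```

With these inputs the busy share comes out a little under the ~15% quoted, which is close enough given that both the 260 and the 75 are approximate.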

So I started entertaining the idea of other chips available on the market. For this purpose I discounted all other microchip offerings as they will be more expensive than the attiny13. With those options discounted, and only the requirement of 5v supply and NVM/eeprom available there are a couple interesting options.

Cypress's CY8C4013SXI-400 is an interesting option with its 16MHz cortex M0, but it only has 5 i/o. I'm tempted to target it considering it's cheaper than the attiny13 in volume, but the lack of spare i/o significantly limits what's possible. The recent SROM cracking that might double the flash for free is an interesting bonus however..

The two most interesting prospects I found were STM offerings. The STM32F030F4 with its 48MHz cortex M0 was quoted to me with better pricing than the attiny13. Even though it's a 3v part, it has enough 5v tolerant i/o to get the CIC job done. This is a nice option for a cartridge with 3v mapper logic. And the 48MHz M0 certainly has enough power to have only a small portion of its time consumed by CIC communications. I'm still kind of interested in tasking one of the STM32F0 family parts with both CIC and mapper tasks. But for the purpose of this conversation, it's not a viable solution due to too few 5v tolerant i/o.

That brings me to what I've deemed the 'winner': the STM8S003F4.
It's a rather basic 8 bit mcu with common features including:

16MHz & 128kHz internal RC oscillators
8KByte flash, 1KByte SRAM, 128Byte eeprom
Nested vector interrupt controller including external interrupts
2x 16bit adv/gen purp counters, 1x 8bit basic counter
16x GPIO
UART, SPI, I2C interfaces

This all comes at a significant price drop compared to the attiny13 in volume. The price alone is enough to motivate me to create my own CIC implementation using the STM8. But there's a considerable amount of extra hardware getting left unused if only tasked to be a CIC. If only being tasked as a CIC, the easiest solution would probably be to clock the mcu externally by the 4MHz CIC clock, then implement the CIC in much the same method that the attiny13 did. But that's not much fun, and I think I can do better than that.

My goal is to run at 16MHz and use one of the timers to interrupt the mcu every ~75usec to handle the next CIC transaction. Doing that means the CIC CLK signal is pretty useless. But one would have to take care to keep aligned with the CIC's clock. The interrupt would have to come early and maybe poll the Din pin when it's expected to be high to sense how far the mcu has drifted and correct its internal timer. I'm expecting that worst case 15usec out of every 75usec will be utilized for CIC transfers. That's more time than Krikzz is utilizing with the attiny13, and we're running 4x faster. Maybe this solution could get to 5usec or less. Either way it doesn't matter too much; it's still some portion of time where the mcu MUST prioritize CIC transfers.

Now comes the question: if this extra hardware is going to be utilized by the NES somehow, the CPU has to have a means to interface with the mcu. This is not a simple feat with the expectation of being free. I argued to myself that all the hoops that would need to be jumped through would put one off to the point where you'd want to simply invest a couple dollars in a mapper more capable than a discrete mapper.

Here's the pinout and port numbering with some preliminary assignments I've come up with:

Code:

                          _________________  _______________
NES CPU D3              -| PD4/UART_CLK    \/           PD3 |-   CPU D2
NES CPU D4/UART TX      -| PD5/UART_TX                  PD2 |-   CPU D1
NES CPU D5/UART RX      -| PD6/UART_RX              ISP/PD1 |-   CPU D0
                        -| /RST                    MISO/PC7 |-    SPI?
MAPPER REG BIT          -| PA1/OSCIN               MOSI/PC6 |-    SPI?
CIC Din                 -| PA2/OSCOUT               SCK/PC5 |-    SPI?
                        -| VSS                          PC4 |-   NES /IRQ
                        -| Vcap                         PC3 |-  CIC Dout?
                        -| VDD                      SCL/PB4 |-   NES CPU R/W
SPI? CICrst?            -| PA3/SPI_NSS              SDA/PB5 |-   NES A13
                          ---------------------------------

The simplest method I could come up with would be to designate an unused mapper register bit to have the NES signal/interrupt the mcu that it wants its attention to start communicating with it. However this mcu interrupt must have a lower priority than CIC comms. Assuming a '377 is being used for the mapper reg we probably have an unused bit; even a BNROM utilizing a '161 has an unused bit if the PRG-ROM is <= 256KByte.

When the mapper bit is set (presumably $8000.7) the mcu would be instructed to start listening to CPU writes when CPU A13 is high. This maps mcu's register to $6000-7FFF, but also maps/overlaps the PPU $2000-3FFF. This was the fewest number of pins I could come up with for decoding that seems reasonable. Moving the mcu reg bits to SPI's PC bits would give more bits for decoding and potentially decoding CPU A14 perhaps. But A13 seems sufficient as it blocks writes to RAM and the APU which seems helpful to me. The user would have to take care to not accidentally write to the PPU and mcu at the same time, but one should already be very deliberate when writing to the PPU.
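To make the decoding trade-off concrete, here is a small sketch of what "select on A13 high" matches across the CPU address space. The function names are hypothetical, and the model ignores R/W and everything else on the bus; it only shows which 8KByte regions alias.

```python
def mcu_selected(addr: int, mapper_bit_set: bool = True) -> bool:
    """True when the mcu would see a CPU access, decoding only A13.

    mapper_bit_set models the proposed $8000.7 enable bit (an assumption
    taken from the post, not a tested design).
    """
    return mapper_bit_set and bool(addr & 0x2000)


def aliased_regions():
    """List the 8 KByte CPU regions (start, end) where A13 is high."""
    return [(base, base + 0x1FFF)
            for base in range(0x0000, 0x10000, 0x2000)
            if mcu_selected(base)]


for start, end in aliased_regions():
    print(f"${start:04X}-${end:04X}")
```

Besides $6000-7FFF and the PPU range at $2000-3FFF, A13 alone also matches $A000-BFFF and $E000-FFFF, so writes aimed at the mapper register in those ranges would be seen by the mcu too unless the enable bit is cleared first. RAM at $0000-1FFF and the APU at $4000-5FFF are excluded, as described.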

My proposed pin assignments would allow for 4bit nibble wide read/writes at a minimum. If one wasn't looking to utilize the UART then the entirety of PORT D could be used for 6bit wide accesses.

There is a problem though, as we can't be certain the mcu is always able to listen to writes to $6000. The mcu could be currently interrupted by CIC comms, which must have a higher priority. I can't think of a very clean way to get around this without adding dedicated logic. Maybe the simplest idea is to have the NES set the mcu interrupt bit $8000.7, then the mcu waits for the upcoming CIC comm to complete. Once done, it interrupts the NES CPU, which uses its interrupt routine to complete the transfer. The NES would have the maximum time (~60usec) to complete the transfer. This is probably a preferred solution if the NES CPU is looking to make big transfers to the mcu. Maybe a big transfer would be verified by reading back a checksum.

Another idea might be to write to $6000, but require the value to be read back from the mcu before being certain it stuck. This would probably be a preferred solution for small transfers as we typically have >80% chance the mcu is listening.
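The write-then-read-back idea can be sketched from the NES side as a simple retry loop. Everything below is hypothetical: `FakeMcu` is a stand-in that drops writes while "busy" with CIC comms, and reads are assumed to always succeed, which a real implementation could not take for granted.

```python
class FakeMcu:
    """Hypothetical stand-in for the mcu register at $6000.

    busy_pattern lists, per write, whether the mcu is tied up with CIC
    comms. Once the pattern runs out the mcu is assumed to be listening.
    """

    def __init__(self, busy_pattern):
        self._busy = iter(busy_pattern)
        self.value = None            # nothing latched yet

    def write(self, nibble: int) -> None:
        if not next(self._busy, False):
            self.value = nibble & 0x0F   # latched only when listening

    def read(self):
        return self.value


def write_verified(mcu, nibble: int, max_tries: int = 4) -> int:
    """Write, read back, and retry until the value sticks.

    Returns the number of attempts used; raises if it never stuck.
    """
    for attempt in range(1, max_tries + 1):
        mcu.write(nibble)
        if mcu.read() == (nibble & 0x0F):
            return attempt
    raise RuntimeError("mcu never accepted the write")
```

One caveat with this sketch: if the register already holds the value being written, the read-back passes even when the write was dropped. A real protocol would need something extra, such as a toggling sequence bit, to tell a stale value from a fresh one.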

You could maybe combine the two ideas to remove the need to use NES /IRQs for each transfer. Maybe the NES can simply read from the mcu at $6000 after setting the $8000.7 mcu interrupt bit, and the mcu provides a designated value if there is sufficient time to write a nibble or two before the next CIC xfr.

Anyway, that's my idea and here's the place to toss out any other probably better ideas you guys might have. My primary goal for such an interface is that it's effectively free being able to be implemented with wires alone. It's not out of the question to add logic gates to implement the idea, but personally I'm not interested in doing so. Start adding a gate here or there and it's no longer free. I'm not even sure I have the pcb space currently to support routing the signals I've proposed. I'll probably have to re-route a large portion of my current design to make room for the CIC to be placed closer to the PRG side of the board.

As far as ideas of what could be done with something like this, it's up to the imagination. The mcu probably isn't going to be fast enough to implement any sort of CHR effects like finer bankswitching or anything. Even permitting selectable NT mirroring is sorta out of the question as you'd need to add more logic.

As mentioned in my other post, unfortunately this mcu doesn't have any external pins available to clock the internal counters. So you'd have to utilize the internal 16MHz/128kHz oscillators for an IRQ timer. The SPI bus is open on my pinout proposal above, and things could be shifted around to make the I2C bus available instead. This could potentially be connected to a large serial flash rom for lots of rom storage. But it's not going to be as fast as one might like due to the limitations put on transfers. One of the pins could be routed to EXP6 to implement some basic expansion sound perhaps. You could even get crazy and utilize the UART interface to connect a cheapo BT/WiFi module, but if you're interested in that, it doesn't make much sense to restrict your budget to a discrete mapper..

Anyway, my guess is chances are this idea won't go anywhere, but it's fun to talk about. At this point I can say I'm going to do everything I can to migrate to the STM8 for my NES/SNES CIC solutions for the benefit of my other designs. So from that point the hardware will be sitting idle waiting to be put to good use.

Re: Adding features to discrete mapper with multipurposed CI
by Dwedit on 2017-03-11 (#190887)

Heh, once you have a 16MHz Thumb processor, why even bother making a NES game at all. You can run the game logic on the coprocessor instead.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-03-11 (#190896)

Except in this case, the problem with going the Hayazashi Nidan Morita Shogi 2 route of putting game logic on an ARM coprocessor is that everything has to be timed to ensure the CIC key logic executes at the appropriate time.

Re: Adding features to discrete mapper with multipurposed CI
by calima on 2017-03-11 (#190913)

Does the CIC keep asking at runtime? Wikipedia makes it sound like it only checks at startup.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-03-11 (#190918)

calima wrote:

Does the CIC keep asking at runtime?

Put a licensed Game Pak in a stock NES-001 Control Deck, turn it on, and pull out the Game Pak. The game will freeze. If the power light also starts blinking, this means the lock chip requires the key chip to continue to communicate. I just tried it with Bionic Commando, and yes, it starts blinking. That's why you need to perform the pin 4 modification to use TapeDump on a front-loader.

calima wrote:

Wikipedia makes it sound like it only checks at startup.

A charge pump-driven stunner may need to run only during the opening copyright screen, but a proper clone needs to communicate continuously.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-11 (#190920)

The NES CIC exchanges another bit of the 2^160 period random number generator output approximately every second. The lock will immediately start the reboot loop on the NES if it fails.

Re: Adding features to discrete mapper with multipurposed CI
by rainwarrior on 2017-03-11 (#190921)

tepples wrote:

A charge pump-driven stunner may need to run only during the opening copyright screen, but a proper clone needs to communicate continuously.

Does the stunner approach actually halt/crash the internal CIC until the next reset or something?

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-03-11 (#190978)

lidnariq wrote:

The NES CIC exchanges another bit of the 2^160 period random number generator output approximately every second. The lock will immediately start the reboot loop on the NES if it fails.

The lock and key exchange the next bit approximately every 75usec. If the key doesn't produce the proper value, the lock panics and stops paying attention to the key. Once it panics it sits in a loop resetting the console every other second.

The actual time between exchanged bits of the pseudorandom number actually depends on the precise current value of the currently calculated/"mangled" number. But the lock and key are operating through the "mangle" calculations in lock step. So each takes the same exact amount of time to perform the calculation of the next bit.

Been digging through segher's disassembly to get a better understanding of how the lock behaves exactly and understand what it's expecting from the key. Good news is from what I can tell the lock doesn't pay any attention to the key's output aside from the precise moment every ~75usec. So I'm thinking this could be abused to alleviate the problems with the CIC mcu needing to prioritize CIC comms over everything else. Appears one could simply output the next stream value early. Appears the key could just toggle the Dout line to whatever the upcoming value is instead of providing a precisely timed 750nsec pulse of the current bit as the original does.

That means this implementation can give the NES CPU priority so long as the NES isn't streaming data for longer than ~50usec.

Additionally I confirmed what I was seeing on the scope with the code. The lock sets its output high for 8 cycles after resetting the key. So I'm thinking this could be used to sense reset instead of the CIC reset pin. So with a little effort this can be minimized down to the mcu only needing 2 io for CIC comms.

Dwedit wrote:

Heh, once you have a 16MHz Thumb processor, why even bother making a NES game at all. You can run the game logic on the coprocessor instead.

The biggest problem for running game logic on a low cost ARM coprocessor is providing enough rom for the mcu, and an adequate interface between the mcu and PPU.

Re: Adding features to discrete mapper with multipurposed CI
by elseyf on 2017-03-14 (#191139)

This sounds interesting. Is the CIC Firmware going to be open source, and if so, is it supplied in a manner that it could be easily worked into a custom program running on the mcu?
I see this as a free means to do some basic DRM protection for a new homebrew game: the mcu could interface with some additional external memory (namely SPI flash) which contains encrypted data, and would decode it when requested (in this case the mcu is kind of a black box). Best would be to encrypt program data and only supply to the NES the code which is currently needed, but this is left for the programmer to decide. The interface could use the NES IRQ signal to halt the NES from transferring data when the mcu is supposed to act as a CIC; this would allow at least for a seamless interface.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-15 (#191226)

I find discussions about DRM to be somewhere between "boring" and "annoying". Regardless, the biggest problem is that because necessarily no emulator already supports whatever scheme you'd want to add, and because the NES is so limited, any DRM you add stands a very good chance of breaking any existing timing-sensitive code (and you'll be surprised at how much NES code becomes timing sensitive...) so then you need to reimplement the DRM scheme in your own in-house debugging emulator fork.

Now, the part I actually find interesting:

I spent a little time thinking about this, and I believe a modern PIC with some Configurable Logic Cells (and maybe the Peripheral Pin Select) could actually subsume all of this, adding GNROM banking, mirroring control, and an IRQ. The down sides are a serial interface to programming the mapper, and not being able to write to Flash....

A CLC would let us detect writes (/ROMSEL OR R/W); the PPS lets us forward that to the SPI module clock input (or else use an external pin). The CPU is responsible for taking the received bytes and relaying them to the various pins (≈GNROM), configuring another CLC (mirroring control), and the timer (IRQ).

Meanwhile, another CLC and a timer clocked by the CIC clock lets us actually use the timer hardware to copy the already-calculated CIC output stream at exactly the right moment. The CPU's involvement is minimal, leaving more than enough time for ... whatever's left.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-03-15 (#191230)

lidnariq wrote:

I spent a little time thinking about this, and I believe a modern PIC with some Configurable Logic Cells (and maybe the Peripheral Pin Select) could actually subsume all of this, adding GNROM banking, mirroring control, and an IRQ. The down sides are a serial interface to programming the mapper

Could the serial interface be made compatible with a subset of the MMC1?

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-15 (#191233)

The PIC's serial ports (MSSP/EUSART) really want to deserialize writes 8 (or 9) bits at a time, so I think that counts as "possible but not trivially so".

All of the PICs with three or four CLCs only have four CLC pin inputs, and to emulate an MMC1 we really need to pay attention to five pins: A14, A13, D0, /ROMSEL, & R/W. But I think we could probably actually get that by cleverly using the timer 1 gate function. Don't get IRQs if it's an MMC1 subset, though, and there's no good reason to be compatible with the MMC1 if one does add IRQs.

It probably mostly depends on whether the hardware can manage all of the CIC timing without the CPU involvement, and the CPU "just" needs to calculate the next bit and next time.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-03-23 (#191768)

elseyf wrote:

Is the CIC Firmware going to be open source, and if so, is it supplied in a manner that it could be easily worked into a custom program running on the mcu?

My goal would be to make it available for community use so people could utilize it for whatever they pleased, such as your DRM idea if they wanted. So at a minimum I would provide the source code on request. Not sure what licensing option I will choose when the time comes to release something that's fully functional. But I'd like to share it with anyone that would put it to good use.

lidnariq wrote:

The configurable logic is an interesting feature that could help out quite a bit. The biggest deterrent personally is that it isn't 'free' in comparison to the stm8 solution. Looks like the lowest cost part with a CLC, eeprom, [EDIT: and enough spare i/o] is the attiny814 (appears they're called CCL on avr cores). The attiny814 is rather generous with 12 i/o, and comparably priced to the attiny13. It only has one logic cell though, but that's enough to bit bang mapper writes with the benefit of dedicated logic. The attiny814 does have a little speed boost with its 20MHz internal osc compared to the stm8s003's 16MHz. But if you're running the core off of the 4MHz CIC clock then that's irrelevant.

The cheapest offering with 4x CLC's looks to be the 16F15344. In 3k volume, microchip's claiming 63cent pricing which isn't bad considering one could potentially integrate all the discrete mapper logic chips into the PIC. But it's going to be hard/impossible to reach that volume without leveraging all of my CIC consumption into a single solution. Either way this is even further from free in my personal quest. But it would be a pretty decent solution for someone looking to make a custom dedicated solution.

Quote:

It probably mostly depends on whether the hardware can manage all of the CIC timing without the CPU involvement, and the CPU "just" needs to calculate the next bit and next time.

Assuming my analysis of segher's disassembly holds true, my idea of simply outputting the next bit anytime during the 79usec between bit exchanges removes the CPU from time sensitive involvement.

infiniteneslives wrote:

Additionally I confirmed what I was seeing on the scope with the code. The lock sets its output high for 8 cycles after resetting the key. So I'm thinking this could be used to sense reset instead of the CIC reset pin. So with a little effort this can be minimized down to the mcu only needing 2 io for CIC comms.

So I no longer see this as viable. I see in segher's disassembly (@ $007 below) where the lock appears to be setting PORT0.0 (Lock Dout, Key Din) shortly after reset, but before initializing the data stream. I was operating off of corrupt brain memory when I thought I had seen this in my logic analyzer captures. Lock Dout never goes high during this time, the first time it goes high after reset is during the stream ID nibble exchange.

Code:

         ;; LOCK START
05f: 75      lbmi 1   ; H := 1

         ; forever {
06f: 55      in      ;   A := P0
077: 66      ska 2   ;   if A.2 = 0
07b: f3      t 073   ;      last
07d: 21      lbli 1
03e: 31      ldi 1
01f: 70      ad
04f: 43      xd      ;   [1:1]++
067: ef      t 06f   ; }

073: 30      ldi 0
079: 20      lbli 0
03c: 4a      s      ; [1:0] := 0
05e: 32      ldi 2
02f: 21      lbli 1
057: 46      out   ; P1 := 2   // reset host and key
06b: 00      nop
075: 30      ldi 0
03a: 46      out   ; P1 := 0   // run key
01d: 20      lbli 0   ; L := 0
00e: 31      ldi 1
007: 46      out   ; P0 = 1   // *** SHOULD BE SETTING LOCK Dout high (Key Din)
043: 3e      ldi e   ; A := e

         ; while A <> 0 {
061: 01      adi 1   ;   A++
030: e1      t 061   ; }
058: 00      nop
06c: ae      t 02e   ; goto 02e

         ;; KEY START
         ; (L = 0, A = P0)
076: 65      ska 1   ; if A.1 = 0   // if not test mode
03b: 8c      t 00c   ;   goto 00c
05d: 00      nop

         ;; INIT LOCK, OR KEY IN TEST MODE
         ; (L = 0)
02e: 47      out0   ; P0 := 0            //  *** Here it appears to clear PORT.0 which looks to be set @007
017: 7d 00   tml 200   ; call 200   // init stream
065: 7d a0   tml 320   ; call 320   // magic
019: d1      t 051   ; goto 051

What I had seen in the analyzer captures was Din and Dout being held high prior to the reset signal. I had assumed that the Lock was setting its Dout prior to resetting the key, but it's actually the key setting both Din and Dout prior to being reset, as it ends up in panic/die on initial boot prior to being reset by the lock. I removed the key from the circuit completely and confirmed the Lock was holding Din & Dout low prior to resetting the Key. So it appears the CIC's Din/Dout pins are bidirectional.

Does anyone know or have a good guess as to what type of drivers are on these PORT pins? I had assumed they were unidirectional push-pull. But this apparently isn't the case. The lock and key both clear their PORT0.1 "data input" pins during normal transactions. And when they set their PORT0.0 pin, the logic 1 beats out the other chip's logic 0. So my guess is outputting a logic 0 is effectively a pull-down, and outputting logic 1 is pushed sourcing current to override the other chip's logic 0 pull-down.

This makes sense when considering PORT0.2 for the lock's seed capacitor. The lock will set the seed pin whenever it sets its data out pin during transactions, and then read the seed pin if it ends up getting reset. So I guess it's aiding in getting a random seed on a warm reset utilizing bidirectionality on the PORT0.2 pin.

So CIC reset appears to be a vital input. I thought about other ways to get around requiring the CIC reset signal, but there's no way to log data in and determine proper timing in adequate time, as the key must output its interpretation of the stream ID being even/odd as the first bit transaction post stream ID transfer. You could figure it out in time if there were always at least two bits set in the stream ID, but that obviously won't work.. [EDIT: too bad the CIC reset signal is active high, otherwise one could just route it to NRESET on the stm8. Guess one could still do this to save an i/o, but you'd need to add an inverter.]

Anyway, I've got a firm enough grasp on the mangle algorithm and timing of everything so I'm going to start working on my own implementation targeting the stm8s003. I'll start by taking the easy way out and clocking the mcu core with the 4Mhz CIC clock and ensure timing by cycle counting.

One thing I didn't realize until working through segher's disassembly and looking over my analyzer captures is that the CIC cpu core is actually running at 1Mhz. Appears to be 4 clock cycles per machine cycle. I haven't seen this mentioned anywhere else in all my research. Although I should have gathered this when peeking at Krikzz's solution, but I didn't have a good enough grasp on all the timing at that time. My discussion below uses usec and cycles interchangeably.
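Since the rest of this post treats machine cycles and usec as the same thing, here is a tiny sketch of that conversion, plus what the same interval looks like in timer ticks on a faster mcu. The 4 clocks per machine cycle figure is my reading of the captures as stated above; the 16MHz timer clock and the function names are assumptions for illustration.

```python
CIC_CLOCK_HZ = 4_000_000
CLOCKS_PER_MACHINE_CYCLE = 4      # per the post's reading of the captures


def machine_cycles_to_usec(cycles: int) -> float:
    """CIC machine cycles to microseconds (1 cycle = 1 usec at 4MHz / 4)."""
    return cycles * CLOCKS_PER_MACHINE_CYCLE * 1_000_000 / CIC_CLOCK_HZ


def machine_cycles_to_ticks(cycles: int, timer_hz: int = 16_000_000) -> int:
    """Same interval expressed in ticks of a hypothetical mcu timer clock."""
    return cycles * CLOCKS_PER_MACHINE_CYCLE * timer_hz // CIC_CLOCK_HZ


print(machine_cycles_to_usec(79), machine_cycles_to_ticks(79))
```

So a 79 cycle bit spacing would be 79usec, or 1264 ticks of an undivided 16MHz timer, which fits comfortably in a 16bit counter.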

The Skinny on CIC transactions and timing
I'll share my little breakdown of the timing and transactions for anyone curious. thefox's tengen translated to C is the best high level reference, but it took awhile looking at segher's disassembly before I could fully understand what's happening and wrap my head around the timing of everything.

Negative time: Lock determines 4bit seed value "stream ID" 0-15 prior to resetting key
Time 0 usec: Lock resets Key
Time 33 usec: Lock outputs bit3 of stream ID
Time 48 usec: Lock outputs bit0 of stream ID
Time 63 usec: Lock outputs bit1 of stream ID
Time 78 usec: Lock outputs bit2 of stream ID
-This 4bit stream ID becomes the first nibble of the key's table nibble 1.

Time 201usec:
main loop starts: first task is to transfer 1-15 bits of current table's least significant bits, final task is to perform mangle on tables, repeat..

The first transaction always transfers all 15 bits; subsequent transactions are 1-15 bits, determined by nibble 7's value on re-entry of the main loop.
"effective main loop time 0": lock and key transfer their table's least significant bit of nibble 1.
-Check what the other chip sent and confirm it matched expected value

Every 79 cycles the next LSbit of ram is transferred and checked; if it doesn't match the expected value, die/panic.

Once all bits are transferred perform mangle of both the chip's own ram table, and the table it's calculating for the other chip to know what to expect.
-Number of mangles to be performed on the table is based on nibble 15 on entry of mangle calculation plus 1. So each table will get mangled 1-15 times.
-The time it takes to perform a table mangle calculation is either 78 or 84 cycles/usec depending on if the sum of nibble 2 + nibble 3 + 1 is > 16 or not on entry of mangle calculation.
-Each table also takes another 29 cycles/usec to process separate from the mangle calc loop. 2 tables = 58 cycles/usec.

The mangle calculation time varies based on ram values on entry. The theoretic minimum would be if both tables only performed one mangle calc @ 78usec + 29usec for table = 107usec * 2 tables = 214 usec. I would estimate the average mangle calculation time to be ~80usec * 15 + 58 = ~1250usec.

Now that the new value of each table is calculated the main loop restarts.

My thoughts out loud:
What follows is a bit of me thinking out loud on this whole idea of multitasking the CIC. So at least my idea is publicly documented so someone else can seek this out further if they'd like to in the event I don't end up doing so myself.. So in a solution where the CIC is being multitasked, and we're allowed to output the next bit early instead of a precise 3usec pulse, this is my general plan after initialization and first transaction is complete:

1) Set a timer for mangle calculation time. We can quickly determine how many mangles will be performed by looking at nibble 15 of each table. We just don't know how long each mangle will take without performing each iteration of calculations. Each mangle will be 78 or 84 usec, so let's assume all calcs take the min 78usec and set our timer based on that, plus the 2x 29cycle table time.

2) Perform the mangle calculations on both tables tallying up each calc that was 84 cycles instead of 78. Once all mangles are done, increment the timer by 6cycles times our tally. Now we'll get an interrupt from the timer when it's the right time to start transacting data.

3) We can now work ahead a little to try and make the transactions easier/faster. Perhaps it would be helpful to transfer the upcoming data transaction into a temporary condensed location separate from the lock/key tables' LSbits. This would also allow us to perform the mangle calculations for the next iteration early. This improves upon my thought above as we could figure out the mangle time and perform all the mangle calculations one iteration in advance.

4) Perform next transaction getting interrupt every 79cycles to output the next bit in the stream. Once transaction is complete load counter with predetermined mangle calculation time which we've already calculated in advance from the step prior. Go back to step 2 and repeat forever...
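Steps 1 and 2 can be sketched as a small calculation: preload the timer assuming every mangle takes the fast path, then bump it by 6 cycles for each slow one found while actually doing the math. The 78/84/29 cycle figures are the ones from my breakdown above; the function name and the way slow mangles are passed in are hypothetical.

```python
FAST_MANGLE = 78      # cycles (usec) for a mangle on the fast path
SLOW_MANGLE = 84      # cycles (usec) for a mangle on the slow path
TABLE_OVERHEAD = 29   # per-table cycles outside the mangle loop


def plan_mangle_timer(lock_mangles: int, key_mangles: int, slow_flags) -> int:
    """Return the total mangle-phase time in cycles.

    slow_flags holds one bool per mangle actually performed (both tables),
    True when that iteration took 84 cycles instead of 78.
    """
    flags = list(slow_flags)
    if len(flags) != lock_mangles + key_mangles:
        raise ValueError("need one flag per mangle performed")

    # Step 1: optimistic preload, every mangle assumed to take 78 cycles.
    timer = (lock_mangles + key_mangles) * FAST_MANGLE + 2 * TABLE_OVERHEAD

    # Step 2: tally the slow iterations and extend the timer by 6 each.
    tally = sum(1 for slow in flags if slow)
    return timer + tally * (SLOW_MANGLE - FAST_MANGLE)


# One fast mangle per table reproduces the theoretical minimum of 214.
print(plan_mangle_timer(1, 1, [False, False]))
```

On real hardware the correction in step 2 would be applied to a running timer's compare value rather than returned from a function, so it has to land before the optimistic deadline expires.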

The above is time referenced for the time that the lock will latch the key's bit. However we want to give ourselves some slack time. Since we can output the next bit early, it's better to reference our time 'zero' to 79usec *before* the lock latches the key's bit. So we need a 79usec timer to interrupt us just *after* the lock has retrieved its data. So long as the timer automatically reloads and is set up to interrupt us again exactly 79 cycles later, we don't need this "CIC transaction timer" interrupt to have the highest priority. The idea is for the timer to simply be keeping time for us. The NES CPU can have higher interrupt priority than the CIC, so long as the NES CPU doesn't take more than ~77usec of our time in one burst.
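The slack argument boils down to one inequality, sketched below. The 79usec period is from the timing breakdown; the 2usec needed to actually update Dout is an assumed figure picked so the limit lands at the ~77usec mentioned, and the function name is made up.

```python
PERIOD_USEC = 79.0    # spacing between bit exchanges, per the breakdown


def bit_ready_in_time(nes_burst_usec: float, service_usec: float = 2.0) -> bool:
    """True if the next bit is on Dout before the lock latches it.

    The timer fires just after the lock latched the previous bit, so the
    whole period is available. A NES burst delays the CIC handler, and
    the handler then needs service_usec (assumed) to update Dout.
    """
    return nes_burst_usec + service_usec < PERIOD_USEC


print(bit_ready_in_time(50.0), bit_ready_in_time(77.0))
```

This only models a single burst per period. Back-to-back NES bursts that straddle two timer interrupts would need their own analysis.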

Well that's enough rambling for now... Time to start writing some code!

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-23 (#191774)

infiniteneslives wrote:

Does anyone know or have a good guess as to what type of drivers are on these PORT pins? I had assumed they were unidirectional push-pull. But this apparently isn't the case. The lock and key both clear their PORT0.1 "data input" pins during normal transactions. And when they set their PORT0.0 pin, the logic 1 beats out the other chip's logic 0. So my guess is outputting a logic 0 is effectively a pull-down, and outputting logic 1 is pushed sourcing current to override the other chip's logic 0 pull-down.

I could have sworn that the lock/key were properly crossed over?

I just re-tested with a continuity meter and it certainly seems to be the case that pins 1/2 connect to pins 2/1 ...

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-03-23 (#191778)

lidnariq wrote:

infiniteneslives wrote:

I could have sworn that the lock/key were properly crossed over?

I just re-tested with a continuity meter and it certainly seems to be the case that pins 1/2 connect to pins 2/1 ...

Yes of course, I wasn't questioning that. Sorry not sure what you think I meant, or where I'm not being clear.

My point was that the CIC chip can drive an output logic 1 on both PORT0.0 (data out pin 1) & PORT0.1 (data in pin 2). And it can read them as inputs as well. So it seems all (or ones of concern anyway) pins are bidirectional. But there's no sort of direction register as you would find on a modern mcu like a DDR port on an AVR. The CIC can simply always read the input and set the output. Each CIC writes a logic 0 to its data input (pin 2, PORT0.1), in the same instruction that it's setting logic 0/1 to its data output (pin 1, PORT0.0).

Since the data in/out are crossed over as you pointed out, it would appear the CIC always outputs a logic 0 on its input pin, and the other CIC is free to "also drive" that same line to a logic 1 without causing conflicts. Since this doesn't create an issue I'm left to conclude that a logic 0 doesn't actually sink current, and instead is just a pulldown. Presumably the other chip sources current when driving a logic 1, which overcomes the first chip's weak pulldown. So it kinda resembles the opposite of open drain, so open source I guess? Like I2C but inverted..?
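To put that guess in concrete terms, here is a tiny C model of one shared line. It assumes exactly what is described above (a written 1 is driven strongly, a written 0 is only a weak pulldown); none of this comes from a data sheet, and the function name is just for illustration.

```c
#include <stdbool.h>

/* Model of one shared line between the two chips.
 * Assumption: a chip writing 1 sources current strongly, a chip writing 0
 * is only a weak pull-down, so any strong high wins. */
static bool line_level(bool chip_a_writes_1, bool chip_b_writes_1)
{
    /* The line is low only when nobody is driving it high. */
    return chip_a_writes_1 || chip_b_writes_1;
}
```

So under this model the line simply carries the OR of whatever the two chips write to it.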

It just doesn't seem familiar to me, it doesn't appear to be a traditional RTL, TTL, or CMOS driver. Maybe I'm missing something though. [EDIT: unless RTL with a pulldown instead of a pull-up resistor was commonplace?]

My primary curiosity is whether some confusion on my part about the I/O drivers might explain why the lock isn't driving its data output pin high shortly after it resets the key, as it would seem to be trying to do in the code I pointed out. It clearly has no problems driving a logic 1 on that same pin only a few cycles later when transmitting the stream ID. But I'm also not convinced that segher's disassembly is 100% accurate, it's not like we have a legit data sheet for the sharp mcu.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-23 (#191780)

infiniteneslives wrote:

Sorry not sure what you think I meant, or where I'm not being clear.

I was confused about utilization, rather than hardware specifics.

Quote:

unless RTL with a pulldown instead of a pull-up resistor was common place?

I feel like I remember seeing something to that effect somewhere in the RE thread ...?

There is a die shot of the 3193 on Visual6502, and it kinda looks like there's very strong drivers one way and very weak ones the other.

It's also possible it could be something like the 8051, which also lacks a DDR/TRIS register: it drives the output pins strongly for the first 2 instruction cycles (24 master clock cycles) after one of the output registers is written. I think. Something like that, anyway.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-25 (#191824)

I realized another thing about the PICs with four CLCs, pertaining to MMC1 emulation...

They can emit a CLC output to more than one pin. And that lets us emulate a 74'32 or a 74'08, and that lets us get the UNROM/m180 banking styles of the MMC1.

To set up UNROM-style banking, we know that each PRG pin will either be always 1, or CPUA14, depending on the value written to the banking register.
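As a sanity check on the logic, here's what each PRG pin works out to, written as plain C. This is only a model of the gate behaviour (OR for the 74'32 / UNROM case, AND for the 74'08 / m180 case), not PIC configuration code, and the function names are invented for the sketch.

```c
#include <stdbool.h>

/* UNROM style (74'32, OR): a bank bit of 1 forces the pin high,
 * a bank bit of 0 passes CPU A14 through. */
static bool prg_pin_unrom(bool bank_bit, bool cpu_a14)
{
    return bank_bit || cpu_a14;
}

/* Mapper 180 style (74'08, AND): a bank bit of 0 forces the pin low,
 * a bank bit of 1 passes CPU A14 through. */
static bool prg_pin_m180(bool bank_bit, bool cpu_a14)
{
    return bank_bit && cpu_a14;
}

/* All PRG bank pins at once for the UNROM case; mask selects how many
 * bank bits exist (e.g. 0x7 for three pins). */
static unsigned prg_bank_unrom(unsigned reg, bool cpu_a14, unsigned mask)
{
    return (reg | (cpu_a14 ? mask : 0u)) & mask;
}
```

With CPU A14 high, every UNROM pin reads 1, which selects the fixed last bank; with it low, the pins follow the banking register.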

In fact, this lets us get MMC1/VRC1 style CHR banking, too; we now need two CLCs to emit PPUA12 and NOT PPUA12, and the pin multiplexers let us control whether a given output is 0, 1, PPUA12, or NOT PPUA12.

Unfortunately, we now have overwhelmingly run out of CLCs: between needing to latch CPUA13, CPUA14, CPUD0, CPUD7 if PRG banking mode can change, buffer one of PPUA10/A11 for mirroring control, buffer CPUA14 for PRG banking control, and buffer both PPUA12 and NOT PPUA12 for CHR banking...

I think this is the last tangent I have on this specific topic. If I do write an MMC1 subset in a PIC it'll get its own thread.

Re: Adding features to discrete mapper with multipurposed CI
by Dwedit on 2017-03-26 (#191858)

Okay, one other idea...
Using the flash memory for small save files? Then you can pretend you have an EEPROM chip or something.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-03-26 (#191871)

Dwedit wrote:

Using the flash memory for small save files?

Yeah that would certainly be possible. The stm8 has 128Bytes of eeprom, which could be made dedicated to the game. There would undoubtedly be some portion of the 8KB flash program memory left unused as well. Both eeprom and flash are single byte erasable/programmable to make things simple.

Having save data in the CIC has the benefit of being much easier to program save operations compared to putting save data in flash PRG-ROM. You don't need to copy and execute the save routine from SRAM. The save routine wouldn't differ between mappers. It doesn't require added logic gates. One of the biggest benefits is you don't have to erase in 4KB sector chunks, and give up PRG-ROM code space to save operations. The whole operation can be abstracted to r/w eeprom byte number commands.
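To illustrate the "r/w eeprom byte number" idea, here is a minimal sketch of what the mcu side of such commands might look like. The command names are hypothetical and a plain array stands in for the real eeprom; the only number taken from above is the 128 byte size.

```c
#include <stdint.h>
#include <stdbool.h>

#define SAVE_SIZE 128u  /* eeprom size of the stm8 mentioned above */

/* Stand-in for the real eeprom (assumption: byte addressable). */
static uint8_t save_eeprom[SAVE_SIZE];

/* Hypothetical "write byte number N" command.
 * Returns false when the byte number is out of range. */
static bool save_write(uint8_t byte_num, uint8_t value)
{
    if (byte_num >= SAVE_SIZE) return false;
    save_eeprom[byte_num] = value;
    return true;
}

/* Hypothetical "read byte number N" command. */
static bool save_read(uint8_t byte_num, uint8_t *value)
{
    if (byte_num >= SAVE_SIZE) return false;
    *value = save_eeprom[byte_num];
    return true;
}
```

The game would only ever deal in byte numbers and values, so the same save routine could be reused no matter which mapper sits next to the CIC.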

This all of course requires a means for the NES CPU to both read from and write to the CIC. Reads will probably be a bit more of a challenge.

One of my goals is to try and come up with a setup that's primarily mapper independent. Trying to come up with a specific (GNROM/MMC1/whatever) modeled mapper kind of dooms the idea to never being adopted IMO. If one comes up with a nonstandard mapper chances are slim that someone besides yourself will adopt the mapper. But if a mapper independent expansion like this were created, then someone could tack it onto the mapper of their choice whenever they decide to make use of it.

I like your ideas on the CLC solutions lidnariq. If you further investigate the CLC options I would be interested to see it in action. I need to spend more time investigating how they work exactly. It's nice to see low cost mcu offerings being made with available programmable logic finally. Hopefully the trend continues and maybe someday we'll get a hefty chunk of programmable logic on the order of 20-32 macrocells sub $1.

For my own purposes I'm not married to the stm8, it just happens to have set the bar on minimum pricing I was able to find for CIC min specs. Beyond that I heavily value using one part for as many of the designs I produce as possible. The fewer chips I have to maintain inventory of the better for both pricing and time. And the fewer footprints I need to support the simpler the PCB layout. These are some of the biggest reasons Microchip's CLC offerings are unappealing for me. That said, if someone had a game they needed help producing a PCB for that implemented CLCs I would entertain the idea.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-26 (#191884)

infiniteneslives wrote:

Reads will probably be a bit more of a challenge.

Within the available hardware, i.e. without adding a "real" port like Memblers' squeedo, I can't think of anything less awful than "do/don't bankswitch depending on bit in EEPROM"

Quote:

One of my goals is to try and come up with an setup that's primarily mapper independent.

I suppose the value in taking something that starts as a simple-ish discrete logic mapper and adding mirroring control and/or an IRQ is that a subset can be tested in any emulator.

Quote:

I like your ideas on the CLC solutions lidnariq. If you further investigate the CLC options I would be interested to see it in action. I need to spend more time investigating how they work exactly. It's nice to see low cost mcu offerings being made with available programmable logic finally. Hopefully the trend continues and maybe someday we'll get a hefty chunk of programmable logic on the order of 20-32 macrocells sub $1.

When my current projects are done I'll probably actually look into making a PIC-as-MMC1-subset...

And then something that actually is a better match for the hardware.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-03-26 (#191887)

lidnariq wrote:

infiniteneslives wrote:

Reads will probably be a bit more of a challenge.

Within the available hardware, i.e. without adding a "real" port like Memblers' squeedo, I can't think of anything less awful than "do/don't bankswitch depending on bit in EEPROM"

Yeah something crazy like that would be one way to do it I suppose. Perhaps another unconventional way would be to come up with a specific sequence of IRQs to communicate data back to NES CPU.

If I can pull off an asynchronous CIC where the mcu is running off its internal 16Mhz oscillator instead of the 4Mhz CIC clock, I'm fairly confident I'll be able to decode and output data straight from mcu gpio to the databus.

One way to get around decoding would be to have an mcu interrupt on the '161/'377 mapper bit and be prepared to output data on the bus exactly N cycles afterwards. So the NES would be required to have LDA $5000 immediately after STA $8000 for example. If extra time is needed in the mcu interrupt routine to retrieve data, maybe just require a specific number of NOPs between STA $8000 and LDA $5000. Doing that might even allow the mcu to be running at 4MHz on the CIC clk.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-03-28 (#192080)

So been digging into the code and started implementing things with the stm8.

infiniteneslives wrote:

Quote:

It probably mostly depends on whether the hardware can manage all of the CIC timing without the CPU involvement, and the CPU "just" needs to calculate the next bit and next time.

Assuming my analysis of segher's disassembly holds true, my idea of simply outputting the next bit anytime during the 79usec between bit exchanges removes the CPU from time sensitive involvement.

I've found this plan to output data early won't work. Even though in segher's disassembly, the lock only verifies the Key's Data output once every 79usec, my console is checking it twice. I've shifted the key's output around to determine the thresholds. The Lock appears to check that the Key's data out is low 6cyc/usec prior to setting its output (Lock Data in). This agrees with the tengen disassembly where data input is checked low 5/6cyc before setting Data out; if data in is high it dies/panics.

Segher doesn't seem to mention which of the two NTSC CICs his disassembly is for. Perhaps it's a 6113, and the one in my console is probably a 3193. I haven't bothered to take a screwdriver to my console to confirm. One of the few posts that kevtris didn't delete from the RE thread reports that the only combo of 6113/3193 as lock/key that doesn't work is when 6113 is used as the lock, and the 3193 is used as the key. Perhaps this extra Data input check requiring data to be low has something to do with it.

Beyond that, since it's now fairly safe to assume that segher's disassembly is not the code running on my console, that may explain why I don't see the Lock setting Data out for ~9cyc immediately after reset as shown in segher's disassembly.

I was able to determine that there is a roughly 6.7usec window of when the key must transition from low to high when outputting a logic 1. Using the rising edge of Lock Dout as a reference:
-Key Dout must be low 4.9usec beforehand.
-Key Dout must be high 1.8usec afterwards for a logic 1 bit transfer.
-After the transfer, the key is able to leave its output high all the way up until ~5.5usec before the next bit transfer.

So that's not necessarily terrible news, as a 6.7usec time allowance equates to 26 mcu cycles when running at 4Mhz, and 107 cycles when running at 16Mhz. So there's some hope for servicing the CIC LOCK, and NES CPU both during a single 6usec window of time esp when running at higher frequencies.
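The window and the cycle counts above can be checked with a few lines of C. The two bounds are the measured values from the list above; treating them as inclusive limits is an assumption of this sketch, and the function names are just for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define KEY_LOW_BEFORE_NS 4900  /* key Dout must still be low this long before the lock edge */
#define KEY_HIGH_AFTER_NS 1800  /* key Dout must be high by this long after the lock edge */

/* edge_ns: time of the key's low-to-high transition relative to the
 * rising edge of lock Dout (negative means before the lock edge). */
static bool key_edge_in_window(int32_t edge_ns)
{
    return edge_ns >= -KEY_LOW_BEFORE_NS && edge_ns <= KEY_HIGH_AFTER_NS;
}

/* Whole mcu cycles that fit inside the 6.7usec window at a given core
 * clock in MHz (rounded down). */
static uint32_t window_cycles(uint32_t clk_mhz)
{
    uint32_t window_ns = KEY_LOW_BEFORE_NS + KEY_HIGH_AFTER_NS; /* 6700 ns */
    return (window_ns * clk_mhz) / 1000u;
}
```

window_cycles(4) gives 26 and window_cycles(16) gives 107, matching the figures quoted above.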

While working on my stm8 implementation, I realized there is a rather vital feature that is necessary from a mcu being dual tasked by the CIC and NES CPU. It's a feature that's not available on most AVR's, I'm not sure about PICs as I'm not as familiar with them. If the CIC 4Mhz clock is to be used as a source for the mcu core, the mcu needs to have the ability to switch to its internal oscillator in application. Because one can't count on there always being a 4Mhz cic clock on toploader and clone consoles. On most AVR's the clock is selected via fuse bits which can't be modified by the application code (an external programmer is needed). So in that situation you'd have no choice but to only run the mcu core off the internal oscillator to keep NES CPU services functional. Perhaps one method around this would be to use the 4Mhz CIC clock to feed a counter input instead.

The stm8 is rather flexible with its clock source selection and will even switch itself back to the internal oscillator if the external clock source fails. It always starts up on the internal osc, and all clock selection modes are available to be programmed in application.

One thing I'm finding to be a bit of an annoyance with the stm8 is that cycle counting the instructions with its 3 stage pipeline isn't very straightforward. Extra cycles get added beyond the execution cycle count when the instruction prefetch buffer needs to be flushed, etc. It seems alignment of my instructions also has an effect, presumably due to the 32bit (4 byte) wide prefetch buffer. I've found the most practical means to ensure proper cycle counting is to verify my 'estimates' with actual measurements with the logic analyzer. Then tweak as necessary and ensure that all conditional variances are verified as well before moving on to the next operation. So that's a bit annoying but seems to be stable thus far after lots of tweaking and verification.

So far I've captured the stream ID, and output the first 15 transfer bits successfully. Now the challenge of properly timing the mangle calculation. If the cycle counting gets to be too much of a headache I might just switch over to running only on the internal 16Mhz oscillator and using timer counters and interrupts to handle all the timing as I envisioned for a dual tasked version. Shouldn't be too hard to correct for drift by sampling the Lock's output once per transaction.
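The drift correction mentioned at the end could look something like this. It's only a sketch of the idea, assuming a free-running timer clocked from the nominal 16Mhz internal oscillator and a capture of the timer count when the Lock's output edge arrives; the names are hypothetical.

```c
#include <stdint.h>

#define BIT_PERIOD_TICKS 1264  /* 79 usec at a nominal 16 MHz */

/* expected_tick: timer count at which the lock's edge was expected.
 * observed_tick: timer count actually captured when the edge arrived.
 * Returns the timer period to use for the next bit. */
static int32_t next_period_ticks(int32_t expected_tick, int32_t observed_tick)
{
    /* Positive error: the lock's edge came later than expected, so the
     * next period is stretched by the same amount (and vice versa). */
    int32_t error = observed_tick - expected_tick;
    return BIT_PERIOD_TICKS + error;
}
```

Since the correction is applied once per transaction, the error never has long to build up between samples.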

Re: Adding features to discrete mapper with multipurposed CI
by Myask on 2017-03-28 (#192081)

infiniteneslives wrote:

I haven't seen this mentioned anywhere else in all my research.

in said RE thread, Quietust wrote:

Such a data logger has already been made by kevtris, and the data stream is extremely sparse - short bursts of '1' (each 3 clocks long) followed by huge spans of '0' (ranging from 76 to well over ten thousand).

Note that the CIC is clocked at 4MHz, but the clock is effectively divided by 4, resulting in two 1MHz data streams.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-29 (#192113)

infiniteneslives wrote:

I'm not sure about PICs as I'm not as familiar with them. If the CIC 4Mhz clock is to be used as a source for the mcu core, the mcu needs to have the ability to switch to its internal oscillator in application. Because one can't count on there always being a 4Mhz cic clock on toploader and clone consoles. On most AVR's the clock is selected via fuse bits which can't be modified by the application code (an external programmer is needed). So in that situation you'd have no choice but to only run the mcu core off the internal oscillator to keep NES CPU services functional. Perhaps one method around this would be to use the 4Mhz CIC clock to feed a counter input instead.

PICs give you enough rope here.
1- There's the "Fail-Safe Clock Monitor" fuse setting, which will switch to an internal oscillator if the external one fails. It's handled automatically in hardware.
2- In many models, you can switch clock as needed at run-time anyway, but there may be a significant delay if the target oscillator isn't already running
3- In most models, you can measure the frequency of an external clock source (Capture every 16 rising edges), and-
3b- In some models you can tune the frequency of the internal clock source (Roughly 0.1-0.3% precision)

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-03-29 (#192114)

Yeah I kinda forgot about that thread till lidnariq brought it up again. Read through it all again and still some useful info in there. It's really disappointing to see all the kevtris posts and a few others getting deleted.

I only assume bunnyboy asked that of kevtris as part of his terms to purchase his PIC implementation, making it private in hopes of making things more challenging for others. It was sad at the time to see a community project meant for everyone's benefit get monopolized for their profit alone. Thankfully there was enough other documentation left around to allow others to make their own implementation. But now it's kinda sad just because we lost a bit of history that was contained in that thread. Suppose if I really cared I should reach out to bunnyboy/kevtris and see if the discussions were saved and can be revived now that the monopoly has been dead for a few years.

Quote:

PICs give you enough rope

That's good to hear. Was always annoyed with how easy it is to brick an AVR due to clock issues. Glad to see they're the outlier here. I suppose in their minds making clock selection nonselectable in application could help prevent the application from bricking itself, at the worthy cost of lack of flexibility.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-03-29 (#192115)

infiniteneslives wrote:

Was always annoyed with how easy it is to brick an AVR due to clock issues.

That's mostly the result of the Arduino community releasing AVRs with bootloaders that do self-flashing, though, rather than using "real" in-circuit programming.

To be fair, in comparison to Microchip's programming interface, Atmel's is a horrific pile of variations and (for larger parts) occupies all the pins, so that decision is fairly justifiable...

Re: Adding features to discrete mapper with multipurposed CI
by Myask on 2017-03-29 (#192135)

infiniteneslives wrote:

Ugh. Yeah, that might be nice. (Not all of Kevtris's posts were blanked, though.)

I also wonder if phpbb ~~was always at war with Eastasia~~ has revision history of posts.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-04-01 (#192396)

Well I got my synchronous implementation working with the STM8 core running on the 4Mhz CIC clock. Still have to implement region switching, and warm reset, but that'll be cake.

The cycle counting had me chasing my tail for quite a while as SDCC kept relocating my code. The exact same source code was often compiling to different cycle timing depending on how the wind blew at SDCC. With the variance in cycles per instruction due to the STM8's 3 stage pipeline and prefetch buffer, the compiler's relocation of my code was maddening. Eventually I figured out how to force the location of my code to remain constant. Then tweaking the exact cycles became rather easy when I started trading 2xNOPs (1 byte each) for an equivalent NOP such as "ADD A, #0" which is still 2 bytes, but only one execution cycle. That way I could adjust the cycle count without relocating the code that followed my delay.
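The NOP trading works out to simple arithmetic, sketched here in C. It only covers nominal execution cycles (1 cycle per 1 byte NOP, 1 cycle per 2 byte filler such as "ADD A, #0"); pipeline and prefetch effects are ignored, so real timing would still need to be verified on a logic analyzer. The struct and function names are made up for the example.

```c
#include <stdbool.h>

struct pad_plan {
    unsigned nop_pairs;          /* 2 bytes, 2 cycles each */
    unsigned one_cycle_fillers;  /* 2 bytes, 1 cycle each, e.g. "ADD A, #0" */
};

/* bytes:  fixed size of the delay slot, must be even so code after it
 *         never moves.
 * cycles: nominal delay wanted, anywhere from bytes/2 up to bytes. */
static bool plan_padding(unsigned bytes, unsigned cycles, struct pad_plan *out)
{
    if (bytes % 2u != 0u || cycles < bytes / 2u || cycles > bytes) {
        return false;
    }
    out->one_cycle_fillers = bytes - cycles;  /* each filler saves one cycle */
    out->nop_pairs = cycles - bytes / 2u;     /* the rest stays as NOP pairs */
    return true;
}
```

For example, an 8 byte slot can be tuned to 6 cycles with two NOP pairs and two fillers, all without changing the address of anything that follows.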

After learning about the weak pulldowns on the original CIC's i/o, I started dual purposing the CIC RESET pin as a debug pin. Worked pretty well, to aid the timing analysis when my logic analyzer only had 3 channels working till I repaired it. This also means that one can practically get by with only dedicating 2-3 mcu i/o to CIC use. The reset pin is required, but after getting the reset sync from the lock during startup, the mcu can use it as a dedicated output pin. The mcu will be able to detect if the user hits reset on the console because the data in stream will stop, at which point the mcu can set the reset pin back to an input and wait for the reset sync from the lock again. So that's a nifty way to regain another mcu i/o pin.
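The reset pin sharing described above boils down to a two state machine. Here's a rough C sketch of it; the state and parameter names are hypothetical, and how the sync pulse and the stopped stream are actually detected is left out.

```c
#include <stdbool.h>

enum reset_pin_mode {
    PIN_INPUT_WAIT_SYNC,  /* pin is an input, waiting for the lock's reset sync */
    PIN_OUTPUT_REUSED     /* pin is an mcu output (debug or other use) */
};

/* One step of the state machine.
 * sync_seen:    the lock's reset sync pulse was just detected.
 * stream_alive: the lock's data stream is still arriving on data in. */
static enum reset_pin_mode reset_pin_step(enum reset_pin_mode mode,
                                          bool sync_seen, bool stream_alive)
{
    switch (mode) {
    case PIN_INPUT_WAIT_SYNC:
        return sync_seen ? PIN_OUTPUT_REUSED : PIN_INPUT_WAIT_SYNC;
    case PIN_OUTPUT_REUSED:
        /* A stopped stream means the user hit reset on the console. */
        return stream_alive ? PIN_OUTPUT_REUSED : PIN_INPUT_WAIT_SYNC;
    }
    return PIN_INPUT_WAIT_SYNC;
}
```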

There's only a couple i/o pins I had planned to dedicate to mcu outputs though. Perhaps SPI MOSI would be the best choice to dual purpose with CIC reset. However I think one might be able to get by tying the IRQ pin to CIC RESET, assuming the console's IRQ pullup doesn't overpower the CIC's pull-down. Better yet, just tie CIC DATA IN with CIC RESET, the resulting signal should be the ORing of the two signals. Doesn't get much more perfect than that! The sync from the RESET gets embedded directly into the LOCK's DATA OUT signal! When the stream stops coming from the LOCK you simply wait for the reset sync pulse.

I just did a quick check with my logic analyzer with CIC RESET tied to LOCK DATA OUT, and the resulting signal was ORed properly. So back down to only needing 3 CIC i/o pins for a synchronous cycle counted solution. That will only be 2 pins for an asynchronous setup where the mcu runs off its own 16+Mhz oscillator, cutting out CIC CLOCK.

I'll probably take a break for awhile and focus on some other projects for a bit. At some point I'll come back around and tinker with the timers trying out my async idea. Had a simple idea for a proof of concept to actually implement the discrete mapper flipflop in the STM8 while running the CIC at the same time. I'm not sure it can be pulled off, but looking at the timing requirements I think it's feasible. I don't think it would be a good idea or of much benefit to implement the mapper register in that way. But if that can be done with stability, then that would be a pretty good proving point for the hardware and underlying code.

Re: Adding features to discrete mapper with multipurposed CI
by Memblers on 2017-04-01 (#192411)

I don't have anything useful to add to the discussion, just wanted to say that I think this is a pretty cool endeavor however it turns out. I would gladly consider licensing code that does something like this for some future mappers, assuming it didn't increase my unit cost weighed against the features provided (of course, its utility greatly depends on what else ends up on the board).

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-05-19 (#196187)

Thanks Memblers, I'm not sure how far down this rabbit hole I'm going to go, but we'll see!

Today I accidentally stumbled on what might be my new favorite logic IC. Can't believe I didn't know or think to check if such an animal existed till now. The 74LVC1G97 is basically like a micro-PLD with 'configurable' gate functions. Effectively a 5cent mux, which isn't that far off from what a switch costs in volume. Adding this guy to a discrete mapper would allow selectable H/V mirroring. It would take some sort of interface to be able to toggle mirroring on the fly during game play obviously. But at a minimum, I'm looking to make it so I can remove the mirroring switch from my discrete mapper design. Then the kazzo can have the ability to twiddle an eeprom bit in the STM8 CIC, and the CIC can set H/V mirroring at boot time. No need to open the case and toggle a switch anymore!
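For reference, the mux function needed for selectable mirroring is just this, written as a C model. It assumes the usual NES arrangement where CIRAM A10 follows PPU A10 for vertical mirroring and PPU A11 for horizontal mirroring; the select input would come from the mcu, and the function name is made up for the sketch.

```c
#include <stdint.h>
#include <stdbool.h>

/* 2:1 mux feeding CIRAM A10.
 * horizontal = false: vertical mirroring, CIRAM A10 = PPU A10.
 * horizontal = true:  horizontal mirroring, CIRAM A10 = PPU A11. */
static bool ciram_a10(uint16_t ppu_addr, bool horizontal)
{
    bool a10 = ((ppu_addr >> 10) & 1u) != 0u;
    bool a11 = ((ppu_addr >> 11) & 1u) != 0u;
    return horizontal ? a11 : a10;
}
```

So the single gate only needs PPU A10, PPU A11 and one select line as inputs.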

The fact it's also 5v tolerant, and can be configured as any number of gates is pretty legit. Means I only have to stock one IC for dozens of potential uses, and also simplifies the assembly process with only one part to pick from when it comes to a single gate.

Re: Adding features to discrete mapper with multipurposed CI
by Memblers on 2017-05-19 (#196204)

Oh, that is nice, I didn't know about these. I rarely peeked into Digikey's "multi-purpose logic" category, assuming it would be oddball stuff. But that definitely is handy. I was about to use a 1G32 OR gate in something, seems kind of foolish now since the 1G97 will do the same and more for the same price. The only cost is connecting one extra pin to VCC/GND, and that's nothing. Looks like there is also 1G57, 58, 96, 98, 99 with other assorted functions. Cool find.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-05-28 (#196830)

Been making some progress on my CIC implementation this week. I've got a fully functioning multi-region CIC implementation running on the STM8, fully tested with all 4 region CICs. The current implementation has the mcu core being clocked by the 4Mhz CIC clock, so the core is synchronous and uses cycle counting to ensure timing. Code changes became much simpler once I realized that pipeline placement/timing of code that followed my new changes wouldn't be affected if my changes were always in multiples of 4 bytes.

Learned a few things I was asking myself earlier in this thread, so I wanted to share that info. Perhaps at some point I'll make my investigation more complete and post differences between 6113 and 3193 CIC versions on the RE thread.

Segher's disassembly of the NES CIC seems to be a 'cartridge CIC' KEY only 6113 version. When looking through his disassembly I was confused as to how it could all be working if that code was running on the console's LOCK. In his disassembly, the LOCK sets its output high shortly after resetting the KEY. And if you read the disassembly from the key perspective, that Dout pulse after reset sends the KEY off to perform the 'magic' code that has an unknown opcode. If the key were to go perform that code, its timing would end up being off (behind) regardless of what that unknown instruction performs.

As discussed in my earlier post in this thread, I wasn't seeing the LOCK take its Dout high after reset with the logic analyzer. Which means a key running the code in segher's disassembly doesn't go off and perform that magic code, so its timing is fine and everything works. At the time I wondered why Dout wasn't getting set high after resetting the Key as the disassembly showed. Just recently I socketed the CIC on my console, and replaced the 3193 CIC (always in consoles as LOCK), with a 6113 which is only found in cartridges. When watching signals with the logic analyzer, lo and behold the LOCK was setting Dout high after resetting the KEY just as segher's disassembly shows. And the stream ID is delayed as one would expect due to the magic code being run. If I put the 3193 in the cartridge, nothing works, just as kevtris found.

Knowing all this, it now makes sense as to why the only combo of 6113-3193 that doesn't work is 6113-LOCK, 3193-KEY. This also explains why the KEY's Dout can't be too early, even though segher's disassembly suggests that it can be. The 3193 panics if KEY's Dout is high more than a couple cycles before the bit is supposed to be transferred. The 6113 doesn't make this check, but the 3193 must.

Beyond that, I tested out my idea of dual purposing an mcu pin with the reset and KEY Din signals. Wiring them together for ORing worked for the most part. Everything looked good on the logic analyzer, and my implementation was able to pick up on the reset signal embedded into the KEY Din stream. But for some reason it wasn't the most stable, as pressing reset multiple times would cause problems. If I kept tapping reset it would come back again and start working, but eventually fall out again as I kept tapping reset. I'm really not sure what the problem was. Everything looked good in the LA captures, but for some reason it was falling out after a few bit transfers at times. I should probably have connected up to the oscope to get a better visual of the signals, but my first guess is that there is too much tension between the ORed output drivers on the console CIC. I wondered if using a proper logic OR gate would resolve the issue, but then came up with a better idea. I ended up wire ORing the KEY Dout and reset signals together. This ended up working great, so the mcu just changes its DDR for the pin after the reset signal is latched. With this trick cutting the mcu i/o down to 3 pins, it was every bit as stable as my 4 pin version.

So the next step is to give an asynchronous implementation a try, ditching the 4Mhz CIC clock and letting the mcu run off its internal 16Mhz oscillator. I suppose it would be best to ensure there aren't any issues with the whole idea of having the NES CPU read and write directly from the mcu's pins as a mapper register. Without that working, there isn't much value in an asynchronous STM8 CIC being dual purposed as a discrete mapper 'co-processor'.

Even without the asynchronous solution, the current implementation will be of good use for my non-discrete mapper designs. Having a mcu at my disposal for boot time tasks can be rather useful effectively providing NVM to a CPLD which doesn't have any internal flash memory available. This will be helpful for things like my VRC board to work for all variants without needing to reconfigure the CPLD. Will also allow an MMC1 board to work as multiple different configurations that are normally incompatible with each other such as SEROM, SXROM, SNROM, SKROM etc.

Although I've been eyeing the stm32f030 as a dual purpose CIC as of late too, and there's quite a few reasons it's a better choice than the STM8 depending on one's goals. The STM8 still looks to be the better choice for expanding a discrete mapper. But when looking to expand the abilities of an ASIC mapper, the STM32's features start to shine.

First reason is that the STM32 has more 5v tolerant i/o than the STM8 when powered by 3v. And powering the dual purposed CIC by 3v becomes desirable when the CPLD isn't 5v tolerant itself. Rather not need to worry about level shifting signals between the mcu and CPLD.
The other nice feature of the STM32 is that the PLL is able to multiply the external clock input. The PLL can multiply the external clock from 4Mhz to the max core freq of 48Mhz, this can't be done with the STM8. So the STM32 can be fed with 4Mhz CIC clock, and have a synchronous timer ensuring proper CIC transaction timing. This avoids timing drift issues between the mcu and console CIC.
Lastly, the STM32 just has a lot more horsepower with its 48Mhz 32bit cortex M0 core, compared to the 16Mhz 8bit STM8. Implementing semi-complex synths, high speed external interfaces (SDcard, USB, Bluetooth, Wifi, etc), are all more possible with the STM32. Additionally the stm32 has better tool chains available to it in terms of free C compilers and libraries.

So I'm not sure where I'll take this next. But that's where I'm currently at. I'll probably have to shelve this project for a bit while I focus on some other high priority projects. Now that I've got a basic STM8 CIC working I can convert all my designs away from the attiny13, which was my original goal.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-06-01 (#197164)

Having a hard time putting down this whole idea of a discrete mapper with CIC mapper expansion.. The infinite number of possibilities that could be unlocked without increasing the BOM cost by a single cent is hard to keep myself from day dreaming about. I've come up with what I think is a fairly clean way to handle mcu mapper register reads and writes. But until someone comes to me and wants to write software targeting this idea, I have a hard time motivating myself to devote the time to fully developing this idea. On top of that, the idea of implementing this in an emulator on anything but a highly abstracted level sounds like living hell. So for now, I'll just document my idea here in public as best I can. Good chance I'll use it as reference in the future if there is outside interest in the event this idea does become a reality. Or perhaps someone else would like to take my idea and run with it, which I'm perfectly fine with.

infiniteneslives wrote:

My proposed pin assignments would allow for 4bit nibble wide read/writes at a minimum. If one wasn't looking to utilize the UART then the entirety of PORT D could be used for 6bit wide accesses.

There is a problem though as we can't be certain the mcu is always able to listen to writes to $6000. The mcu could be currently interrupted by CIC comms which must have a higher priority. I can't think of a very clean way to get around this without adding dedicated logic.

Now that I'm more familiar with the STM8 and the CIC's requirements, I've got a better idea of how to handle R/W accesses from the NES CPU. The key comes from making the mcu register r/w interrupt higher priority than the CIC comm timer. This is possible because of the relatively large 6.7usec window we have to output a CIC stream bit; let's just call it a 5usec window to be conservative. With that large of a window, there is time to service a potential CIC transfer inside the mcu mapper register r/w ISR *if* there's an explicit definition of how to r/w the mcu mapper reg.

My hardware proposal is still similar to my original idea of dedicating one of the discrete mapper flipflop bits to interrupt the mcu. Let's say $8000.7 for discussion's sake. The NES CPU must set this bit, then r/w from the mcu register, and clear $8000.7 in rapid succession. If we explicitly state how this instruction sequence is to be performed, it also provides the benefit of simplifying address decoding. We can actually create a large number of mcu registers effectively decoded by NES CPU address, while only utilizing CPU R/W and CPU D0-3 as mcu inputs. But how??

This is my NES CPU instruction sequence proposal on how to write a byte to the mcu:

Code:

pseudo code as preparation for write is not timing sensitive, just trying to illustrate idea:
-load A with byte that would like to write to mcu 
-transfer A to Y register (Y will contain the lower nibble to write to mcu reg)
-shift A register to the right 4 times (places upper nibble of mcu reg value in bits 3-0)
-transfer A to X register (X will contain the upper nibble that'll be written to mcu reg, but it's placed in bits 3-0 as that's all the mcu sees)
-load A with current bank of discrete mapper register
-set bit 7 of A (this bit being set @ $8000.7 will interrupt the CIC mcu for mapper r/w)

;now that everything's prepared, perform the mapper write:
STA $8000   ;write to discrete mapper with bit 7 set
STY $5000   ;write lower nibble to mcu
STX $5000   ;write upper nibble to mcu
AND #$7F    ;clear bit 7 so we can disable mcu's interrupt
STA $8000   ;write to discrete mapper with bit 7 clear, CIC mcu interrupt complete

So we're defining the exact sequence of what must be done whenever $8000.7 is set. We know a STY and STX instruction will follow immediately after $8000.7 is set. And the NES CPU won't waste any time clearing $8000.7 once the write is complete. This creates very specific timing constraints from the CIC mcu's perspective.
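To make the nibble convention concrete, here is a minimal Python model of the write path. It assumes, as the routine above does, that Y carries the byte whose low nibble is wanted, X carries the high nibble shifted down into bits 3-0, and the mcu only sees D3-0 on each store. The function names are hypothetical and only serve the illustration.

```python
def split_for_mcu_write(value):
    """Return (y, x) the way the 6502 prep steps would leave them."""
    y = value & 0xFF          # TAY: Y holds the whole byte, mcu only sees bits 3-0
    x = (value >> 4) & 0x0F   # LSR x4 then TAX: upper nibble moved into bits 3-0
    return y, x


def mcu_reassemble(y_bus, x_bus):
    """Rebuild the byte from the two D3-0 samples the mcu latched."""
    low = y_bus & 0x0F        # sampled during the STY write cycle
    high = x_bus & 0x0F       # sampled during the STX write cycle
    return (high << 4) | low


y, x = split_for_mcu_write(0xA7)
print(hex(mcu_reassemble(y, x)))
```

The upper bits of Y never reach the mcu, so the 6502 side doesn't need to mask them off before the store.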

And since we know exactly what NES CPU instructions are being used, we can simplify address decoding by sniffing the CPU data bus alone. The mcu doesn't need to decode any CPU address lines with this trick, but it has visibility of whichever CPU data pins it's connected to. For this implementation, I've chosen to only connect the STM8 to the lower nibble CPU D0-3. The tssop-20 STM8 doesn't have a full 8bit wide GPIO port pinned out, the most it gives is 6 pins with PORT D1-6.

By sniffing D0-3 during the STX/STY instructions, we can glean CPU A11-8, and CPU A3-0 for the upcoming write cycle. We can afford to cut out CPU A13/14 for mcu decoding purposes, as the mcu is no longer listening for $6000-7FFF as in my original idea. Here the mcu can only decode CPU A11-8 & A3-0. But that's pretty legit as it gives us a potential for 256 mcu mapper registers to work with. To be clear, CPU A15-A12 aren't actually being decoded with this implementation. Selection of $5000-5FFF for the location of the mcu register is arbitrary; that's simply a convenient address space which doesn't conflict with anything else in the NES CPU memory map.
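A small sketch of that decode, assuming the mcu samples D3-0 once while the ADL operand byte is on the bus and once while the ADH operand byte is on the bus. The helper names are made up for illustration. It also shows the aliasing that follows from not decoding A15-12 and A7-4.

```python
def sniffed_nibbles(address):
    """Low nibbles of the two operand bytes (ADL, ADH) of an absolute address."""
    adl = address & 0xFF
    adh = (address >> 8) & 0xFF
    return adl & 0x0F, adh & 0x0F


def decode_register_index(adl_nibble, adh_nibble):
    """Combine CPU A11-8 (high) and CPU A3-0 (low) into a 0-255 register index."""
    return ((adh_nibble & 0x0F) << 4) | (adl_nibble & 0x0F)


for addr in (0x5A03, 0x6A03, 0x5A13):
    print(hex(addr), hex(decode_register_index(*sniffed_nibbles(addr))))
```

All three addresses in the loop land on the same register index, which is why sticking to the $5x0x convention matters on the 6502 side.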

Since the mcu can sniff D3-0 during the STY/STX opcode fetch, it can differentiate between STY/STX with the lower nibble of opcodes $8C/$8E. So we don't have to require both the upper and lower nibble always be written, nor in a specific order. We just have to pick a convention of X/Y being Hi/Lo nibbles and stick with it.

The requirement to clear $8000.7 asap comes from the fact we can't tie up too much of the mcu's time, as the $8000.7 interrupt is getting set as the mcu's highest priority interrupt. So we have to free it so it can get back to CIC mangle calculations and such in its main thread. While there's an abundant amount of time that could allow for more than 1 byte to be written at once, things get complex quick trying to define a larger r/w routine with explicitly defined timing to provide the mcu.

If one had an application where larger transfers were desired, my idea about requesting the CIC mcu to interrupt the NES CPU when there's a sufficiently large period of time that CIC comms can be ignored is the better solution and would be relatively easy to implement. We need a means to transfer a single byte before we can solve the KByte transfer solution.

Before we get too far, I want to come up with a definition of our convention for mapper register reads. I'm expecting that this can be pulled off somehow, although details on the best way to do this didn't start to come to me until I started thinking about how the mcu ISR would work. The CIC mcu ISR for register r/w gets tricky quick. There are a lot of things it needs to ensure and they're all timing sensitive. One of the biggest issues becomes accounting for the 5 cycle jitter for when the ISR starts executing. Putting more burden on the ISR with tasks like determining if the 6502 is reading or writing really starts to become a challenge. This hardware definition doesn't give us much option to use separate ISRs for reads and writes. The only good way to have separate R/W ISRs would be to devote another discrete mapper flipflop bit, one for reads, one for writes. I don't much like that idea though, we may not have bits to spare.

Here's my KISS solution that combines the NES CPU mcu register reads into the same routine with writes:

Code:

pseudo code as preparation for write is not timing sensitive, just trying to illustrate idea:
1) load A with byte that would like to write to mcu (can skip to step 5 if only care about reading from mcu)
2) transfer A to Y register (Y will contain the lower nibble to write to mcu reg)
3) shift A register to the right 4 times (places upper nibble of mcu reg value in bits 3-0)
4) transfer A to X register (X will contain the upper nibble that'll be written to mcu reg, but it's placed in bits 3-0 as that's all the mcu sees)
5) load A with current bank of discrete mapper register
6) set bit 7 of A (this bit being set @ $8000.7 will interrupt the CIC mcu for mapper r/w)

;now that everything's prepared, perform the mapper write:
STA $8000   ;write to discrete mapper with bit 7 set
STY $5x0x   ;write lower nibble to mcu
STX $5x0x   ;write upper nibble to mcu
;write complete, now read back the old value that was in the mcu register
LDY $5x0x   ;read old value from mcu register (lower nibble)
LDX $5x0x   ;read old value from mcu register (upper nibble)
AND #$7F    ;clear bit 7 so we can disable mcu's interrupt
STA $8000   ;write to discrete mapper with bit 7 clear, CIC mcu interrupt complete

;At this point we've effectively completed a SWAP operation between X/Y registers lower nibbles and mcu mapper register $5x0x

This may seem a little confusing as to why we're writing, and then reading. And what if you didn't want to overwrite the value of a register, and only wanted to read it? My thought is that the mcu register definitions would overcome this issue. We've got up to 256 registers to work with, so just define them as read only or write only as needed. The NES CPU code you're writing probably only cares about a read or a write, but by using a swap operation we can kill two birds (read & write) with one stone (mcu ISR).
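A minimal sketch of how the swap could look from the mcu's side, assuming a flat 256-entry register file. The class name, the choice to ignore writes to read-only registers, and the choice to return 0 from write-only registers are all assumptions made for the illustration, not part of the proposal.

```python
class McuRegisterFile:
    """Toy model of the swap-style access: every access writes, then reads back."""

    def __init__(self, read_only=(), write_only=()):
        self.regs = [0] * 256
        self.read_only = set(read_only)
        self.write_only = set(write_only)

    def swap(self, index, value):
        old = self.regs[index]
        if index not in self.read_only:
            self.regs[index] = value & 0xFF   # the STY/STX half of the routine
        if index in self.write_only:
            return 0                          # assumed: nothing meaningful to read
        return old                            # the LDY/LDX half of the routine


regs = McuRegisterFile(read_only={0x10}, write_only={0x20})
regs.regs[0x10] = 0x5A                        # e.g. a status value the mcu maintains
print(hex(regs.swap(0x10, 0xFF)))             # read-only: the written value is ignored
print(hex(regs.swap(0x00, 0x12)), hex(regs.swap(0x00, 0x34)))
```

With this split, 6502 code that only wants to read can skip the prep steps entirely, since whatever happens to be in X/Y is discarded by a read-only register.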

Additionally I'm going to discard my earlier idea that the mcu will decode STY/STX by sniffing the opcode. As I get into the details of the ISR, the more that we can simplify with convention of the 6502's r/w routine, the easier life is for the STM8. So for discussion's sake we'll require the sequence of STY-STX-LDY-LDX as laid out by the routine above. Additionally, we'll effectively require that routine to be copy pasted into 6502 assembly code, with the only permitted change being the mcu register address. The x's in $5x0x denote address nibbles that can be modified. But the addresses of all four load/stores must match. The mcu ISR isn't going to have time to decode each and every one and adapt on the fly. If the 6502's read/write routine is running in rom, this definition would require a separate routine for each register. That may not be an issue if only using a few registers. A more versatile way would be to execute the routine from SRAM and use self modifying code to change the absolute address of the STY-STX-LDY-LDX instructions prior to executing the read/write routine.
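To illustrate the self modifying approach, here is a Python sketch that builds the machine code bytes of the routine above and patches the four absolute operands for a chosen register. The opcodes are the standard 6502 absolute-mode encodings; the template layout, offsets, and function name are assumptions for the example, and a real SRAM copy would also need whatever follows the final STA (an RTS, for instance).

```python
ROUTINE_TEMPLATE = bytes([
    0x8D, 0x00, 0x80,   # STA $8000  (bit 7 set, interrupts the mcu)
    0x8C, 0x00, 0x50,   # STY $5x0x  (lower nibble write)
    0x8E, 0x00, 0x50,   # STX $5x0x  (upper nibble write)
    0xAC, 0x00, 0x50,   # LDY $5x0x  (lower nibble read)
    0xAE, 0x00, 0x50,   # LDX $5x0x  (upper nibble read)
    0x29, 0x7F,         # AND #$7F   (clear bit 7)
    0x8D, 0x00, 0x80,   # STA $8000  (bit 7 clear, mcu interrupt complete)
])

# Offset of the low operand byte of STY, STX, LDY and LDX within the template.
OPERAND_OFFSETS = (4, 7, 10, 13)


def patch_routine(register_index):
    """Return a copy of the routine targeting mcu register 0-255 at $5x0x."""
    if not 0 <= register_index <= 0xFF:
        raise ValueError("register index must fit in 8 bits")
    adl = register_index & 0x0F           # CPU A3-0
    adh = 0x50 | (register_index >> 4)    # $5x page, CPU A11-8
    routine = bytearray(ROUTINE_TEMPLATE)
    for offset in OPERAND_OFFSETS:
        routine[offset] = adl
        routine[offset + 1] = adh
    return bytes(routine)


print(patch_routine(0xA3).hex(" "))
```

Patching all four operands from one index keeps the "all addresses must match" rule from being broken by accident.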

Now to try and explain how all this would work from the CIC mcu's perspective... Now that we've got explicitly defined timing of bus operations from the time the mcu receives its $8000.7 interrupt, we can utilize cycle counting within the mcu mapper r/w ISR to latch address and data from the NES CPU. But since this ISR is designed to be of higher priority than the dedicated CIC comm ISR, the mapper r/w ISR must also handle necessary CIC comms should they be needed while it's running.

I've gotten into the details of how the STM8 CIC KEY would run asynchronous from the console's LOCK in previous posts in this thread. The basic idea is that there's an mcu timer which is used for counting down to when the next CIC transfer needs to occur. My plan is to use TIM2 for this purpose which in reality can only count up, but math can turn that around. The timer's ISR will account for drift of the clocks by polling LOCK's Dout when expected to be high. That ISR will also set/clear KEY Dout as necessary, but it's a lower priority routine than this mcu register r/w ISR I'm about to discuss.
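One way the "math can turn that around" part might work is to preload the up counter so that it wraps exactly when the next CIC transfer is due. The sketch below models a 16bit up counter in Python; the preload scheme and function names are assumptions for illustration, not necessarily how the actual firmware does it.

```python
COUNTER_MODULUS = 0x10000   # 16bit up counter wraps from $FFFF to $0000


def preload_for(ticks_until_due):
    """Counter value to load so the overflow lands ticks_until_due ticks from now."""
    if not 1 <= ticks_until_due <= 0xFFFF:
        raise ValueError("interval must fit in the 16bit counter")
    return (COUNTER_MODULUS - ticks_until_due) & 0xFFFF


def ticks_remaining(counter_value):
    """Ticks left before the up counter overflows, i.e. the countdown view."""
    return (COUNTER_MODULUS - counter_value) & 0xFFFF


start = preload_for(80)                    # 80 ticks = 5usec at 16Mhz, no prescale
print(hex(start), ticks_remaining(start))  # countdown starts at 80
print(ticks_remaining(start + 30))         # 30 ticks later, 50 remain
```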

The CIC mcu is running at 16Mhz with 62.5nsec period, and the NES is running at 1.79Mhz with a period of 559nsec (assuming worst case NTSC). So there are ~8.9 STM8 cycles per 6502 cycle. And we've got a window of 5usec that a CIC bit must be output when needed. That CIC window equates to ~8.9 cycles on the 6502, and 80 cycles on STM8. So it looks as though we've got plenty of time to get everything done if our ISR is smart enough.
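The figures above can be reproduced with a few lines of arithmetic. This sketch assumes the nominal NTSC CPU clock of 1,789,773 Hz for the "1.79Mhz" figure; the constant and function names are only for illustration.

```python
STM8_HZ = 16_000_000      # STM8 core clock from the post
NES_CPU_HZ = 1_789_773    # assumed nominal NTSC CPU clock (~1.79Mhz)
CIC_WINDOW_US = 5         # conservative CIC output window from the post


def period_ns(hz):
    """Clock period in nanoseconds."""
    return 1e9 / hz


def cycles_in_window(hz, window_us):
    """How many clock cycles fit in a window given in microseconds."""
    return hz * window_us / 1e6


stm8_per_6502 = STM8_HZ / NES_CPU_HZ
print(period_ns(STM8_HZ), round(period_ns(NES_CPU_HZ)))
print(round(stm8_per_6502, 1))
print(cycles_in_window(STM8_HZ, CIC_WINDOW_US),
      round(cycles_in_window(NES_CPU_HZ, CIC_WINDOW_US), 1))
```

It is a coincidence of the chosen 5usec window that the STM8-per-6502 ratio and the number of 6502 cycles in the window both come out near 8.9.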

Here's some pseudo code and STM8 assembly to give a timeline of how I picture the ISR working; cycle numbers on the left are STM8 cycles. I'm sure there are some errors in the exact timing of everything, but this gets the idea across.

Code:

0: NES CPU sets $8000.7 to trigger ISR (6502 end of STA $8000 cycle T3)

1-6: complete instruction in execute cycle (1-6 cycles)  -Ooof!  we'll have to account for that potential jitter...

2/7-11/16: push registers to stack (9 cycles)

8-17: jump to ISR (docs not explicit on # of cycles, assuming it's 1 cycle like the JMP instruction)

9-18: start executing ISR 
       Oops!  The 6502 has executed ~1-2 cycles by this point..  
       We don't have a good way to ensure we can sniff T0 & T1 of the first STX/STY, which means we don't know if it's STX/STY
       One possible solution would be to define that a NOP is required between STA $8000 and STX/STY.
          -No one likes wasting time!  And this still doesn't solve the jitter issue.
       Another would be to just make it convention that STY is first, however ADL in T1 (our ability to sniff CPU A3-0) may have passed us by.
          -This is the reason I made the decision to nix the ability to handle different orders of STX/STY, and require all addresses to match.
       We also have to account for the 5 cycle ISR latency jitter, and get aligned with the 6502.
       We could let the ISR spin polling CPU R/W and align itself when it goes low.
          -This is half of the reason why STore is first, and LoaD is second.
          -Other half of reason is logically this is only way X/Y registers can be preserved during a SWAP.
       Perhaps it's for the best that STY T0 & T1 have passed us by as we didn't yet have a way to account for ISR jitter anyway

So at this point we know PRG R/W will go low around cycle 27, but we're somewhere between cycle 9-18 and don't know where..
Additionally every ~80 STM8 (or ~8 6502) cycles we need to check the CIC comm timer and output a bit if necessary.

;spin until R/W low for STY T3
rw_still_high:
BTJT    rw_port, #rw_bit, rw_still_high    ;2/3 cyc

STY cycle T3 starts around STM8 cycle 27
Now we've accounted for jitter and we're ~29 STM8 cycles since 6502 set $8000.7

;Delay a few cycles until CPU D3-0 should be valid for STY T3
30:   NOP, NOP...

;Latch CPU D0-3 for STY T3
33:   MOV  low_wr_data, data_port

6502 is about to go from STY T3 to STX T0 (occurs at STM8 cycle ~36) , this is a good time to handle a CIC comm if needed.

;STM8 assembly rough idea of how to check if it's time to output a CIC comm (~6 STM8 cycles total)
LDW     X, TIM2_CNTR        ;2cyc
SUBW    X, #$FFF0           ;2cyc
JRMI    no_comm_needed      ;1/2cyc
MOV     Dout_port, out_val  ;1cyc
no_comm_needed:

"reset" count for CIC comm window.

We're in the middle of STX T0 currently.  Delay until can sniff ADL from STX T1
NOP, NOP...

;Latch CPU D0-3 for STX T1 to sniff CPU A3-0
50:   MOV  low_addr, data_port

;Delay till STX T2 to sniff CPU A11-8
NOP, NOP...
59:   MOV  high_addr, data_port

;Delay till STX T3 to latch upper nibble of mcu register write
NOP, NOP...
68:   MOV  high_wr_data, data_port

69-99:
All data has been latched for mcu register write, we also know CPU A11-8 & A3-0 for upcoming register read.
We'll assume that the register address can be mapped to a fixed block of STM8 SRAM.
During this time we'll consume a few STM8 cycles to piece together latched high_addr:low_addr
and map that to an STM8 address we can set the X register to point to.
Copy, shift, and mask that the lower nibble of data into data_output_port for upcoming LDY T3
Copy, shift, and mask that the upper nibble of data into SRAM for quick access when time for LDX T3.

Perhaps 30 cycles isn't enough time to handle all that, but it should be for simple tasks.
Worst case require a NOP inserted between STX-LDY if needed.
Even better idea: move AND #$7F instruction between STores and LoaD instructions!

100:
register read lower nibble already stored in data port output register.
Set port register DDR to enable register data to drive 6502 data bus D3-0
Delay while 6502 is latching read for LDY T3
107:
disable data port output drivers with mcu DDR

108:
It's been ~72 cycles since we checked if a CIC comm was needed.  Perfect time to check again.

Copy prepared SRAM byte back in cycles 69-99 from SRAM to data port output register

Delay till LDX T3

136:
enable data port output DDR
Delay while 6502 is latching read for LDX T3
142:
disable data port output drivers with mcu DDR

Need to wait for NES CPU to clear $8000.7
This will take 6 cycles on 6502, STM8 can't return from interrupt until complete to prevent re-entry.
Should perform some more CIC comm timer checks during this time.
STM8 IRET takes a whopping 11 cycles, worst case a CIC comm timer interrupt occurs during that IRET.
Need to ensure adequate time for CIC comm timer interrupt to handle a comm that's needed as this ISR returns.
Additionally this routine left KEY Data high if a comm was needed.
Need to ensure the CIC comm timer ISR will clean up after this routine and clear Dout when no longer needed.

;return back to main thread where CIC mangle operations can continue.
;or whatever request made by the 6502 via this routine can be performed.
IRET

Phew, there you have it! So this mcu register r/w routine would hold the highest interrupt priority on the STM8, with the CIC comm timer at second priority. Any other interrupts would have to have a lower priority than these two, and the STM8 must be set to nested interrupt management mode. That way higher priority interrupts are able to interrupt lower priority ones, ensuring mcu register r/w are always serviced and no CIC comms are missed. Beyond all this one just needs to ensure the mcu isn't overworked and that it has adequate time to complete CIC mangle calculations.

The biggest risk for this would be if the NES programmer were to perform multiple mcu register r/w operations back to back. Would have to do some worst case analysis on the time required for mangle calculations. This entire ISR is ~200 STM8 cycles, which is only ~13usec, that's a relatively small amount of time on the scale of the CIC timing and calculations.

My current mangle table routine is ~100 STM8 instructions, isn't well optimized, takes ~42usec to perform the mangle, and spins for ~30usec waiting for the console CIC to perform its calculations. So that's somewhere around 60% cpu utilization during the most intensive calculations. During bit transfers, the CIC timer xfr ISR should only need ~5-10% cpu utilization tops. A conservative estimate would be that 70% of CIC time is mangle calc, and 30% is bit transfers. That weighted average STM8 cpu utilization comes out to ~50% which I consider a rather conservative estimate.

A practical rule that would keep from over utilization would be to require ~20usec (~35 NES CPU cycles) between mcu register accesses.

EDIT: My original estimate was flawed in that it neglected to calculate the fact that an asynchronous STM8 CIC would be running at 16Mhz, not 4Mhz. Here's a better cpu utilization estimate:

Code:

Mangle calculation: 
STM8: 100 instructions = 10.5usec
CIC average mangle 80usec = 13% CPU utilization during mangle calculations

Bit transfers:
Estimate timer ISR to run for ~5usec average maximum (time that pulse is high plus drift trimming)
CIC period of bit transfers 79usec = 6% utilization during bit transfers

Average number of bit transfers = 8, total 8 * 79usec = 632usec
Average number of mangle calcs = 16, total 16 * 80usec = 1280usec
% time bit transfers = 632 / 1912 = 33%
% time mangle calc  = 1280 / 1912 = 67%

Weighted utilization:
bit transfers 6% * 33% = 1.98%
mangle calc 13% * 67% = 8.71%
total utilization 1.98% + 8.71% = 10.7%

So in reality the CIC operations only utilize ~10% of the STM8's processing time. The register read/write ISR is ~12.5usec. Given the time the 6502 has to spend processing the data going in and out, it's going to have a hard time overloading the STM8 with r/w accesses alone. In practice one might set a rule to not let that 12.5usec ISR exceed 75% of the STM8's utilization. That would equate to providing the STM8 with ~4usec (~8 NES CPU cycles) between mcu register accesses. That's only a couple of instructions, which isn't really enough to do anything worthwhile between accesses. In practice I wouldn't expect the STM8 cpu utilization to become an issue until it started being tasked with compute intensive tasks such as sound synthesis, or large UART data transfers perhaps? Those tasks would be made a lower priority than register accesses and CIC comms, so at least they wouldn't risk locking up the console/CIC.
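The estimate can be checked without rounding the intermediate percentages. The sketch below takes the numbers from the code block above as defaults (10.5usec busy per 80usec mangle, 5usec busy per 79usec bit transfer, 16 mangles and 8 transfers); the function names are hypothetical. Computed this way the result is a little under 11%, consistent with the ~10% figure.

```python
def cic_utilization(mangle_busy_us=10.5, mangle_period_us=80.0, mangle_count=16,
                    xfer_busy_us=5.0, xfer_period_us=79.0, xfer_count=8):
    """Fraction of STM8 time spent on CIC work, weighted by time in each phase."""
    busy = mangle_busy_us * mangle_count + xfer_busy_us * xfer_count
    total = mangle_period_us * mangle_count + xfer_period_us * xfer_count
    return busy / total


def min_gap_us(isr_us, max_isr_share):
    """Idle time needed after each ISR run so the ISR stays under max_isr_share."""
    return isr_us * (1.0 - max_isr_share) / max_isr_share


print(round(cic_utilization() * 100, 1), "% CIC utilization")
print(round(min_gap_us(12.5, 0.75), 2), "usec between register accesses")
```

The second function is just the 75% rule rearranged: with a 12.5usec ISR, a gap of a bit over 4usec between accesses keeps the ISR at three quarters of the total.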

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-18 (#200419)

Been sinking my teeth into USB lately, finally got a pretty good grasp of the protocol and everything necessary to implement it. Got me thinking about another possible use for this project. I believe it would be within reason for the STM8 to act as a USB 1.1 host to simple devices like a mouse/keyboard. My idea would be implemented somewhat similarly to V-usb with bitbanging on the i/o, requiring very few external components.

I couldn't easily get google to tell me what version of USB the majority of keyboards/mice utilize, so I checked a more recent Dell one I have sitting around and it reported 1.1 in the device descriptor. I'm thinking it would make sense that the majority of them use 1.1 in efforts to be more compatible, and no real need for 2.0 speeds.

In reality though, supporting a PS/2 protocol would be *MUCH* less effort, and probably be more stable/compatible. That and USB to PS/2 converters are cheap if one doesn't have a PS/2 keyboard/mouse. But that wouldn't have the same cool factor and push one of the lowest cost mcu's on the market (STM8) to its highest limits!!

The real annoyance with a cartridge providing support for an external peripheral is making a connector accessible for plugging into, and all that can't really be considered 'free' anymore. At which point a $1-3 bluetooth module starts to make more sense. Bluetooth would have the benefit of getting as simple as protocols get for the CIC mcu with SPI. But has the drawback of compatibility with whatever device the user happens to own which I can only assume would be a nightmare. Perhaps that's not the case though, I've never tinkered with BT as a developer, only have my user experiences to taint my impression of BT. I started to look into it awhile back and the annoyances with lack of compatibility between BT versions was enough of a deterrent..

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-07-18 (#200420)

USB HID devices are apparently supposed to be always at 1.5Mbit/s (USB1.0) speeds.

If there's a hub in the way, the hub might retime it to 12Mbit/s speeds.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-18 (#200421)

lidnariq wrote:

USB HID devices are apparently supposed to be always at 1.5Mbit/s (USB1.0) speeds.

Curious if you have a link/quote on that, suppose I should just look at the standard but I always second guess what I'm reading with those things.. If I've learned anything about USB so far it's that the standard is more like a guideline, and things work differently in practice than documents suggest.

USB HID devices are certainly not required to be 1.5mbit; plenty of projects/products utilize mcu's like the stm32 which only support 12mbit. They then take advantage of the HID class to get around driver requirements. So I can't see much in the way of one making a 12mbit USB keyboard. I'd assume keyboards with built in hubs step up the speed as you mention. It makes sense that keyboard manufacturers would only utilize 1.5mbit for compatibility/cost reasons, but IDK what keyboard/mouse manufacturers are really doing...

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-07-18 (#200422)

Hm. I know I had read that somewhere, but I can't find any source corroborating it.

And certainly any of the high speed (1kHz) mice must run at at least 12Mbit/s (although they might default to a lower rate?), so...

Probably best to forget I said that.

Re: Adding features to discrete mapper with multipurposed CI
by na_th_an on 2017-07-26 (#200925)

infiniteneslives wrote:

I've come up with what I think is a fairly clean way to handle mcu mapper register reads and writes. But until someone comes to me and wants to write software targeting this idea, I have a hard time motivating myself to devote the time to fully developing this idea. On top of that, the idea of implementing this in an emulator on anything but a highly abstracted level sounds like living hell.

What exactly do you need? You know I can write games

What kind of features would it support? PRG/CHR banking? 16K/32K? CHR banking? 8K, more granularity? Or I have missed it completely and it has nothing to do with this?

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-26 (#200957)

na_th_an wrote:

infiniteneslives wrote:

What exactly do you need? You know I can write games

Really I just need some interest from someone who would like to develop games/tools targeting this hardware to give me some motivation. I know personally I'm not likely to get past a demo proving the hardware functional. The other thing that's needed/helpful is if the developer/someone else is willing to add emulator support themselves if they feel it's necessary. I can provide a flashable prototype cartridge & programmer for quick build testing on real hardware. I assume creating low level emu support which will actually emulate the limits of the STM8 core properly will be a challenge for even the most seasoned emu developer. So if one wanted to target this hardware, it would be best to test on real hardware frequently anyway.

na_th_an wrote:

What kind of features would it support? PRG/CHR banking? 16K/32K? CHR banking? 8K, more granularity? Or I have missed it completely and it has nothing to do with this?

As for what features are possible, the specifics are up in the air at the moment. Should things progress from here, I would likely focus on implementing the features said developer(s) were most interested in. This STM8 "CICOprocessor" is only capable of certain types of features if we don't equip the board with any extra logic to help it out. The original idea here is to have the ability to add the features below to nearly any discrete mapper. I only need one mapper register flipflop bit to perform what I have in mind. So the developer could pick from something like BNROM, AxROM, colordreams, or UxROM including any homebrew variants.

The idea of this CICOprocessor is for it to be an expansion that could be added to nearly any mapper. We don't necessarily need to limit it to discrete mappers, I only had the idea to target discrete mappers as it provided these features without adding expensive hardware. The CICOprocessor isn't well suited for bank switching tasks, so the memory banking would be left up to whatever mapper choice the developer made. Choosing discrete mappers still limits you to 16/32KB PRG banking, and 8KB CHR banking effectively. A desire for finer banking effectively necessitates the addition of a CPLD on the board, and doing that opens a can of worms. Putting a CPLD on the board allows the mcu to run at 3v, at which point I'd like to target the STM32 instead and we've lost our sense of scope. So for now I'd like to focus on what features are possible for expansion of discrete mappers.

Potential features the CICOprocessor is capable of adding to a discrete mapper
Keep in mind, CICOprocessor registers would be limited to 4bits in size due to it only having access to D0-3.

~~or 128Khz LSI~~

solderless expansion port dongle

nibble register interface

I've already converted most of my NES designs to the STM8 CIC and am planning to release my boot features once I complete my rewrite of the inlretro/kazzo software release. It's about time for me to lay out a new discrete mapper board for my next PCB order. I might be able to pull off a direct swap from attiny13 to STM8 on my discrete design, but with all this I'm tempted to rearrange everything and start from scratch since I'll need a new set of stencils anyway. If I start from scratch I should be able to make space for routing CPU signals to the CIC as I've proposed so far. I plan to add the H/V mirroring mux so I can axe the toggle switch. I will probably add a cap and 1-2 resistors to provide a PWM DAC allowing synth support. Beyond that, I will probably route the SPI/UART signals to the edge of the board for prototyping or direct soldering of a proposed module/wiring to the PCB. Whether or not these things will be made use of in the near future I'm not sure, I've got quite a few irons in the fire right now.. This minimalist CICOprocessor idea is fun, but without external motivation it's not very high on my development task list.

Re: Adding features to discrete mapper with multipurposed CI
by FrankenGraphics on 2017-07-26 (#200961)

Quote:

2) Expansion sound

Some thoughts... I suspect composing music for a wholly new "sound chip" might be daunting. You'd either need to:
1) develop a branch of FT supporting it so you can write music with quick aural feedback
2) edit, write to cart, play on hardware, rinse and repeat
3) compose in a notation program or midi editor, then convert that to the specific format, by hand or script or both, or
4) do it all in theory then punch in the data (which usually turns out rather primitive, which would go against the purpose of the feature).

Then there's need of a music playing engine capable of playing it, so there's still that to do.

Implementing one of the sound chips (or parts of one) currently supported by FT might help. Actually, it'd be attractive.

Another possibility with the expansion sound is not having sfx interrupt bgm notes. You could select from an options menu whether to use internal channels or external for sfx (depending on if you have the dongle or not). If the ext sound is interfaced as the internal ones would, the sfx part of the sound engine wouldn't need to be much different for the two options (which goes against the "supported by FT" argument as far as BGM goes, but for SFX, it would be fine).

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-26 (#200964)

You bring up great points FrankenGraphics. Personally I'm not really motivated to write the FT support much like my lack of interest for writing emulator support. So this would be up to the developer/others to overcome, although I'm all for making design choices that aid in their effort.

FrankenGraphics wrote:

Implementing one of the sound chips (or parts of one) currently supported by FT might help. Actually, it'd be attractive.

This was my initial thought. Having the cartridge mimic the likes of VRC6, MMC5/2A03, or sunsoft5b would be within reason. Only armed with the multitasked STM8 CIC and PWM DAC, not sure we can afford to get too caught up in the details of replicating the originals with near perfection. With the minimalist goals one would have to accept the CICOprocessor synth for what it is flaws/quirks/personality and all.

Quote:

Then there's need of a music playing engine capable of playing it.

This would be an issue even if the CICOprocessor synth is to mimic existing synths as we only have the 4bit registers to work with. Not sure how many sound engines support 'standard' expansion audio right now anyway. That said the audio registers should be able to be arranged in a somewhat similar fashion to the originals if desired.

Quote:

Another possibility with the expansion sound is not having sfx interrupt bgm notes.

This would be a good option if the creator wanted a minimal experience difference for consoles without expansion sound support. I'm no musician/composer but I don't think it would be the worst thing in the world if only backup/background voices were placed on the expansion synth. Gimmick! for example plays most sfx on the synth which everyone notices when missing. But there are also some voices on the synth which I didn't even realize were missing without the synth for the longest time. Personally, I think Gimmick's songs stand alone pretty well even when the synth is missing. But the trained ear, one who recognizes the fullness that's missing, will appreciate the extra channels. My point/opinion is, the music doesn't have to fall apart when a channel is missing in order to be appreciated for when it is present.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-07-26 (#200965)

Arbitrarily assuming that the desired features are "some kind of IRQ" and "some kind of sound", I suppose it's worth asking just how much synth can fit?

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-26 (#200982)

lidnariq wrote:

Arbitrarily assuming that the desired features are "some kind of IRQ" and "some kind of sound", I suppose it's worth asking just how much synth can fit?

Yes that's the bigger question. My estimates put CIC operations at ~10% utilization of the STM8 core, so there should be a fair amount of CPU time available for audio synthesis/mixing. In practice I'm sure it'll come down to a trade off between number of channels and desired sample rate. I've yet to get my hands dirty with audio synthesis on these mcu's but have my sights set on doing so. I'll have to report back once I've got some hard data.

FWIW I had to take back my mention of 128KHz LSI being a possible clock source for timers. We're effectively limited to fmaster = 16Mhz with prescalers. There are 3 timer counters available, here's my current plan for each:

TIM1 16bit advanced control timer: up/down counter with auto-reload. Prescaler set to any integer from 1-65536. This being the only up/down counter available makes it the best candidate for PWM DAC providing center aligned PWM mode. This counter also has best pinout to GPIO. This is the most capable timer on chip, it shouldn't be needed for CIC operations, so we get to task it as desired.

TIM2 16bit general purpose timer: up counter with auto-reload. Prescaler any power of 2 from 1-32768. This timer should be adequate for CIC timing management. But it sure would be nice to have a 16bit counter for an IRQ timer if TIM1 is being put to work as a PWM DAC. TIM2's pinout isn't as great, with a few outputs mapped to what's been chosen for CPU D0-3. There are two other outputs available which conflict with the SPI pins, but one of those is slave select which could be mapped elsewhere via software.

TIM4 8bit basic timer: up counter with auto-reload. Prescaler any power of 2 from 1-128. This counter doesn't have any channel outputs pinned to GPIO. Being an 8bit counter makes it more challenging to use for CIC timing as the theoretical max mangle time is ~2.7msec = 44k clocks at 16Mhz. That theoretical max doesn't even fit in a 16bit counter without prescaling/rollovers. One should be able to pull off using this smaller timer for CIC operations but it will certainly be more challenging and require more CPU time counting rollovers. One solution to fit the CIC in TIM4 would be to set at a higher prescale during mangle timing, and finer prescale between bit transfers.

Suppose I'll make it my goal to pull off CIC timing using TIM4, and leave TIM1 & TIM2 for sound & IRQ timer.

Re: Adding features to discrete mapper with multipurposed CI
by FrankenGraphics on 2017-07-27 (#200989)

infiniteneslives wrote:

not sure we can afford to get too caught up in the details of replicating the originals with near perfection. With the minimalist goals one would have to accept the CICOprocessor synth for what it is flaws/quirks/personality and all.

This shouldn't be a problem, i think. If we can use the current version of FT to give us a rough but good enough idea, and then verify how it actually sounds on hardware now and then, that's much easier than needing to verify it on HW after each edit. It should be ok in most cases that FT won't be What You Hear Is What You Get; it's still usable for the sake of composing. If it's close enough, interface-wise and sound-wise, it wouldn't be as dependent on true FT support. What's left then is making a converter and the engine itself. If FT support would arrive eventually, it would be a great addition, but using the synth wouldn't be dependent on it this way. So, i'd like to propose to have the synth making conversion as straightforward as possible. The more that is, so to speak, one for one (not taking into account the register size), the better. If it sounds a bit different, that's not too much of a worry.

If you as a composer/dev plan on using these "expanded" carts, you must have a kazzo anyway, so the hardware requirement is already met.
If you're a team, though, you must convince the composer (or yourself if you're the composer) to get a kazzo for testing at home. But it's not that much of a step, i think? I mean, it's a one-time 20-30 usd + shipping depending on version.

Quote:

I'm no musician/composer but I don't think it would be the worst thing in the world if only backup/background voices were placed on the expansion synth.

I agree. A composer should be able to write music that would carry the idea on its own for the internal sound, and use ext. sound as support. Overtones (playing a harmony at a low volume), drum support (like snare tone while internal tri is still playing bass), extra echoes, chorus, and the occasional extra harmony. Or go all in, and add an "only works with dongle" sticker, if you really want to.

If the tuning on one, several or all notes isn't 100% perfect relative to that of the internal (which has its own temperament), you might even exploit that as a chorus effect without having to fine-pitch bend it.

Re: Adding features to discrete mapper with multipurposed CI
by na_th_an on 2017-07-27 (#201026)

As far as modifying emulators to provide support, I'm afraid all I can do is to dully simulate some features, such as the ability to switch mirroring by software. It's quite a task.

The different features you mention sound great. The extra sound channels, for example, or the aforementioned mirroring toggle. I could design a simple UNROM64 game which would change mirroring mid game, for example, but I can't really go any further as, as I said, my abilities are quite limited when it comes to providing emulator support.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-27 (#201028)

na_th_an wrote:

but I can't really go any further as, as I said, my abilities are quite limited when it comes to providing emulator support.

To be clear, this statement has the caveat assumption that full emulator support is required for your development. If one were willing to test builds primarily on hardware that would be a way to get around full emulator support. I'm willing to provide development hardware kits at little to no cost. Lots of hardware testing will be necessary anyway especially early on while the mapper is still in 'beta' form. Typically emulator authors are more interested in supporting new mapper features when there is already a game that utilizes the mapper.

Another way around emulator support might be to test and develop on a similar mapper that's already supported by emus. But only utilize the mapper in a way that 'emulates' the target discrete mapper + CICOprocessor. That would simplify porting the mapper specific read/write routines over to the new mapper. FME7/Sunsoft5 might be a good choice especially if the emu supports CHR-RAM and the end target is UNROM + CICOprocessor. Even better if the emu supports >8KB CHR-RAM. FME7/Sunsoft5 can emulate UNROM banking, and has selectable mirroring, timer based IRQs, along with audio expansion.

Re: Adding features to discrete mapper with multipurposed CI
by na_th_an on 2017-07-28 (#201084)

For something as simple as simulating the H/V mirroring switch in software, I can modify simple emulators such as Nester. Fceux should be easy to modify as well, as I understand the code I've studied (but I can't get it to compile no matter what I try - I'll try to address that issue later in the proper subforum, btw). It's just a behaviour simulation rather than true emulation. I would trap whatever you have to do from the game code to perform the switch, and order the emulator to act accordingly.

And I can always target, as you said, FME7 and perform the required changes to turn it into a UNROM+CICO.

I mean - I wouldn't need actual hardware to test while developing. I can always finish the software and send it to you once I have tested it in emulators, so there's no need for expensive overseas shipments

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-29 (#201181)

na_th_an wrote:

I mean - I wouldn't need actual hardware to test while developing. I can always finish the software and send it to you once I have tested it in emulators, so there's no need for expensive overseas shipments

I would actually prefer to put the hardware in your hands if you were taking the time to target the CICOprocessor. Would make the "build - test - report - rebuild" process much easier for both of us. The shipping costs are insignificant.

So I'll take this discussion as there being notable interest in my crazy CICOprocessor idea. I'm rather thankful I took the time to keep detailed notes in this thread about how I plan to execute everything. Being close to 2 months since I presented my nibble register interface I had pretty much forgot all the specifics on my idea..

I'll do my best to make progress on this effort sooner vice later and make progress reports in this thread.

Re: Adding features to discrete mapper with multipurposed CI
by na_th_an on 2017-07-31 (#201302)

Just the addition to H/V mirroring switching and the IRQ counter to simple discrete logic mappers is a plus. I'm sure many programmers target a more expensive ASIC board just for one of those features.

Re: Adding features to discrete mapper with multipurposed CI
by calima on 2017-07-31 (#201315)

Don't forget single screen.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-07-31 (#201317)

calima wrote:

Don't forget single screen.

While single screen is possible, it would require the CICO to drive CIRAM A10 with one of its pins directly. That is incompatible with the tiny mux idea I plan to implement with software selectable H/V. Because of that, and the fact single screen AxROM style mirroring is a trivial addition to any discrete mapper, I don't think using the CICO for single screen is worthwhile.

It's not that one couldn't have single screen and CICO on the same board. You just can't have selectable H/V/single via software all at once without adding more logic chips.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-08-16 (#202638)

Quick little update. I finally ditched SDCC (C compiler) for the STM8. I should have never bothered with C in the first place with the STM8. I thought I would take advantage of C for simplifying the initialization code and everything. And when I realized there wasn't really an option to include asm files in a SDCC build I took the cheap way out and wrote the entire CIC operations with inline assembly. The inline assembly is pretty annoying to work with but I made it work.

I just migrated everything over to pure assembly and have been using naken_asm which has been great. I optimized everything in the process and became aware just how poor SDCC was.. My seed initialization routine ended up compiling into a horrendous mess. Hand writing that ram init routine alone cut my code by about half.

In the end I went from ~2.5KB to just over 1KB with my synchronous NES implementation by migrating init code from C to assembly. There's still room for more optimizations that would easily get me well under 1KB. When I move on to my asynchronous implementation, I expect the code to shrink by a fair amount as a decent number of timing NOP's will be removed. But some extra code will be needed to handle the timer operations too.

So in the end I'm expecting the actual CIC code to consume 1KB or less of the 8KB available on the STM8. Leaves a pretty decent program flash budget for all these potential features.

Starting to get a rough idea on how I plan to manage getting by with an 8bit TIM4 alone to handle the CIC timing. I expect that running without a prescaler will be helpful/necessary for more precise timing. So that will require software to count rollovers, but that code can be mid-low priority so I think it'll work okay.

But for now I've got to focus on getting a synchronous SNES CIC up and running. Once that's done, I'll start chipping away at an asynchronous NES CIC and some proof of concept with the nibble registers for adding features!

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-08-27 (#203358)

Have something of an update on this project... Perhaps I'm getting a little too deep for most people's reading interests. But my previous posts like this were rather helpful for my own idea development and later reference. I'll go ahead and give the "Way Too Long; Not Going To Read" version first and if you're up for some light reading you can continue...

WTL;NGTR:
Recently got SNES CIC implemented on STM8, but had issues with stability due to mcu clock source. That helped motivate me to start a more in depth planning of an async STM8 CIC which the NES CICOp project also requires. I ramble about multiplication of large numbers and my plan to keep timing calibrated. Discover that the targeted stm8s003f3 does indeed have GPIO available for clocking internal timer 1 "TIM1". This discovery opens up viability and/or additional features for the NES CICOp project I previously didn't think possible such as legit PPU scanline counting.

SNES STM8 implementation problems:
I recently got my SNES CIC implementation running with the STM8. The first board/chip I used for testing works great. I've let it run for hours and it would run strong over night. While attempting to prototype a new design I hacked a STM8 onto a breakout board and glued it onto the backside of an old SNES flash board I had sitting around and used wires to connect all the pins. Unfortunately that setup was very flakey, and the CIC would drop out after ~1-30sec.

I tinkered around a bit, trying to determine the cause. Added extra capacitors to the breakout board as it was only powered from a pair of small wires, but that didn't help. I was a little skeptical of supply noise anyway considering the core is internally regulated to 1.8v with its own external cap. I moved the CIC clock supply wire around from back side to front side of the board where it was more exposed, and that seemed to make the issue worse. I set my logic analyzer up to watch the CIC signals and debug pin when it dropped out. Found that the STM8 appeared to be resetting sometimes mid-stream. Other times it was making errors during the mangle calc, too many/few mangles, etc. Got in with the debugger to read the reset cause and found that the times it reset appeared to be due to illegal opcode execution. So it seems that the CPU was faulting by mis-reading instruction data. Depending on how it was mis-read it would result in a valid opcode that caused erroneous mangle calc, or an invalid opcode causing the STM8 to reset. Bummer...

I later tried another board where the STM8 was closer to the ideal setup with it being well powered and as close to the connector as possible without all the lengthy wires of the previous setup. This improved matters, but it would still fall out after a few hours of play. The setup was very similar to my first which has never fallen out. So perhaps some chips are more sensitive than others, I've only sampled 3 so far but with ~50% having problems I definitely need a solution.

The CIC clock is relatively clean looking at the oscope shot, and the STM8 datasheet doesn't give much for external clock specifications. Calls for "about 50% duty cycle", I measured 53% which is pretty close.. The datasheet goes so far as to say square, triangle, and sine wave signals are acceptable clocks. So while the rise/fall times of 14/21nsec are pretty slow, they're a far cry from sine/triangle rise fall times..

I first tried buffering the clock through a single NOR gate I had sitting around (scope shot). That seemed to fix everything. I haven't run it over night yet, but the second hack of a board with the breakout board ran for hours with no problems when it wouldn't even run for 1min previously. The NOR gate tightened up the rise/fall times to ~2.8nsec, and also introduced some ringing. The clock is inverted due to the NOR function, so the duty cycle became 56% which makes sense considering the virgin clock has a slower fall time.

Curious what would happen if I slowed the clock edges, I tried adding a 20pF and separately a 220pF cap between the clock and ground. That only exacerbated the issue, the 3rd board which typically lasted a few hours only lasted about a minute with the 220pF cap.

So I'm still not 100% sure what's going on here, ST doesn't give much of a spec for the external clock and I'm only running at 3.1Mhz which is at the low end of the 0-16Mhz spec. I never had this issue when working on the NES, and I had some pretty godawful wiring setups with 5-6inch wires going from the cart to the dev board in the beginning. Still need to do some more testing, but adding a logic gate as a clock buffer seems to be the best fix at the moment.

Asynchronous CIC implementation planning:
All that brought me back around to my idea of having an asynchronous CIC implementation that doesn't have the cart's mcu CPU core run off the 3-4Mhz CIC clock signal. One potential fix to the problem above is to cut the clock out of the equation completely! Certainly not an easy feat, but with the added motivation from the "NES CICOp" I figured I may as well give it my best shot.

Looking at the numbers, an async SNES CIC is going to be quite a bit more challenging than NES due to the clock running at only ~75% of the speed, and 3x as many mangle calcs. So the STM8 needs to be much more accurate with its timing to meet the same ~3usec output window because it's counting "in the dark" for about 4 times as long compared to the NES CIC. So if it can be pulled off with the SNES, then NES shouldn't be a problem at all.

I took a closer look at how the STM8 timers work, and thankfully the prescalers are able to be changed on the fly. So targeting the simplest 8bit counter TIM4 looks hopeful. I can set the prescaler to its max and divide by 128, which gets a max count of 2.048msec with 16Mhz HSI clock. The max theoretical time between bit transfers on the SNES is ~10.5msec, so software will only have to count 5 TIM4 rollovers at most, and at the last rollover, the prescaler can be tuned down to divide by 1 for fine tuning just prior to bit transferring. This allows long time periods to be measured with high precision (no jitter), but the timing difference between the STM8 HSI and CIC clock must be well calibrated to get the accuracy we need along with that precision.
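
As a rough check of those TIM4 numbers, here is a minimal C sketch that works out the rollover period for a given prescaler and how many full rollovers fit in a delay. The function and macro names are made up for illustration; only the 16Mhz HSI clock, the 8bit counter, the divide by 128 prescaler and the 10.5msec figure come from the post.

```c
#include <stdint.h>

#define HSI_HZ       16000000UL  /* internal oscillator feeding TIM4 */
#define TIM4_COUNTS  256UL       /* an 8bit counter rolls over every 256 counts */

/* Time for one full TIM4 rollover, in usec, at a given prescaler (1..128). */
uint32_t tim4_rollover_usec(uint32_t prescale)
{
    return (TIM4_COUNTS * prescale) / (HSI_HZ / 1000000UL);
}

/* Number of full rollovers that fit inside delay_usec. */
uint32_t tim4_full_rollovers(uint32_t delay_usec, uint32_t prescale)
{
    return delay_usec / tim4_rollover_usec(prescale);
}

/* What is left after the full rollovers; this is the part that a finer
   prescaler setting would have to cover. */
uint32_t tim4_leftover_usec(uint32_t delay_usec, uint32_t prescale)
{
    return delay_usec % tim4_rollover_usec(prescale);
}
```

With the ~10.5msec SNES worst case, tim4_full_rollovers(10500, 128) comes out to 5 with a few hundred usec left over, in line with the estimate of 5 rollovers at most.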

I determined the calibration needs to allow for 0.01% tuning steps which equates to 1usec steps for the 10.5msec max theoretical SNES mangle time. NES only has a max theoretical mangle time of 2.7msec, so 1usec steps would only require 0.037% tuning steps. In binary, 1/128K gives us 0.00076% trim steps which should be more than adequate.

For a max tune step, the STM8 HSI is spec'd to be 1% accurate with factory tuning at 25C, and 5% across the temp range. If we go up to binary 1/32 step that gives a max tune of +/- 6.2% which should be enough. That means we need a 13bit calibration factor for +/- 6.2% range with 0.00076% step size. Could add a few more bits to round off to 15-16bits but it's probably overkill..
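
To sanity check that calibration format, here is a small C sketch. It assumes the factor is a sign plus a 13bit magnitude whose least significant bit is worth 1/128K (2^-17) and whose top bit is worth 1/32; the names are hypothetical and nothing here is STM8 specific. It keeps everything in integer math by working in parts per billion.

```c
#include <stdint.h>

#define CAL_LSB_SHIFT  17   /* one step = 1/128K = 2^-17 of the delay */
#define CAL_MAG_BITS   13   /* magnitude bits: 2^-5 (1/32) down to 2^-17 */

/* Largest magnitude the 13bit factor can hold. */
#define CAL_MAG_MAX  ((1UL << CAL_MAG_BITS) - 1UL)

/* Size of one trim step, in parts per billion. */
uint32_t cal_step_ppb(void)
{
    return (uint32_t)(1000000000ULL >> CAL_LSB_SHIFT);
}

/* Full scale trim range (one direction), in parts per billion. */
uint32_t cal_range_ppb(void)
{
    return (uint32_t)((CAL_MAG_MAX * 1000000000ULL) >> CAL_LSB_SHIFT);
}

/* How far a single trim step moves a delay of delay_usec, in nsec. */
uint32_t cal_step_error_nsec(uint32_t delay_usec)
{
    return (uint32_t)(((uint64_t)delay_usec * 1000ULL) >> CAL_LSB_SHIFT);
}
```

Under those assumptions a single step moves even the longest 10.5msec SNES delay by well under the 1usec target, and the full range sits just under 1/16 of the delay.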

The delay count requires 14bits to measure up to 10.5msec in 1usec step size. But having a few extra bits for fractions of 1usec will be beneficial to keep us from adding jitter between timing events. The NTSC SNES CIC machine cycle is 1.3usec after all, so that fraction becomes a pain as rounding errors add up over time. Adding 4 more bits for fractions of 1usec allows us to get down to the smallest step size of the 16Mhz counter.

So in total there's 18bits of delay count to be multiplied by a 13bit calibration factor to determine a delay offset. The STM8 thankfully has an 8bit hardware multiplier. 18b * 13b factors produce a 31bit product. With an 8bit multiplier that equates to 6 multiply operations, and ~7 summations to get the final product. The result gets truncated down to a 15bit offset which then gets signed depending on pos/neg calibration factor. That signed offset then gets added to the desired delay for the final timer count value.
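
The multiply described above can be sketched in C as six 8bit x 8bit partial products summed at the right byte offsets. This is only an illustration of the arithmetic, not the actual STM8 assembly: mul8x8 stands in for the hardware multiply, the names are hypothetical, and the right shift by 17 assumes the calibration LSB is worth 1/128K and that the delay is held in 1/16usec units.

```c
#include <stdint.h>

/* One 8bit x 8bit -> 16bit multiply, standing in for the hardware MUL. */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    return (uint16_t)((uint16_t)a * (uint16_t)b);
}

/* 18bit delay (3 bytes) times 13bit factor (2 bytes): 3 x 2 = 6 partial
   products, each shifted by 8 bits per byte position and summed. */
uint32_t mul18x13(uint32_t delay18, uint16_t cal13)
{
    const uint8_t d[3] = {
        (uint8_t)(delay18 & 0xFF),
        (uint8_t)((delay18 >> 8) & 0xFF),
        (uint8_t)((delay18 >> 16) & 0x03)
    };
    const uint8_t c[2] = {
        (uint8_t)(cal13 & 0xFF),
        (uint8_t)((cal13 >> 8) & 0x1F)
    };
    uint32_t product = 0;
    int i, j;

    for (i = 0; i < 3; i++)
        for (j = 0; j < 2; j++)
            product += (uint32_t)mul8x8(d[i], c[j]) << (8 * (i + j));
    return product;
}

/* Scale the product back down to delay units and apply it with the sign
   of the calibration factor (assumed to be kept separately). */
uint32_t apply_calibration(uint32_t delay18, uint16_t cal13, int cal_is_negative)
{
    uint32_t offset = mul18x13(delay18, cal13) >> 17;
    return cal_is_negative ? delay18 - offset : delay18 + offset;
}
```

For example, a 10.5msec delay is 168000 counts of 1/16usec; with a magnitude of 1311 (about 1%) the offset works out to 1680 counts, so the corrected delay is 169680 or 166320 counts depending on sign.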

My plan is to then use TIM4 in coarse count mode (8usec steps) until 8-16usec of the delay remain. For the final fine delay TIM4 will get switched to fine mode (62.5nsec steps). At the end of that delay the next bit will be output to the LOCK. While that 8-16usec fine count is occurring, a fixed ~8usec time delay will get pre-loaded into TIM4 for the end of bit transfer data clearing and calibration routine. At the end of that routine TIM4 will be set up to start counting down to the next bit transfer.
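
One way to read that coarse/fine plan is as a split of the total delay into 8usec coarse counts plus a final fine count that always lands between 8 and 16usec, which is the range that still fits the 8bit counter at 62.5nsec per count. The C sketch below shows that split; the struct and function names are hypothetical, and the delay is assumed to be given in 62.5nsec ticks.

```c
#include <stdint.h>

#define COARSE_TICKS 128UL  /* divide by 128: one TIM4 count = 8usec = 128 ticks */

typedef struct {
    uint32_t coarse_counts;  /* TIM4 counts at 8usec each (software tracks rollovers) */
    uint8_t  fine_ticks;     /* final TIM4 counts at 62.5nsec each */
} tim4_split_t;

tim4_split_t tim4_split_delay(uint32_t delay_ticks)
{
    tim4_split_t s;

    if (delay_ticks < 2UL * COARSE_TICKS) {
        /* Short enough to count entirely in fine mode. */
        s.coarse_counts = 0;
        s.fine_ticks = (uint8_t)delay_ticks;
    } else {
        /* Hold back one coarse step so the remainder is 128..255 ticks. */
        s.coarse_counts = delay_ticks / COARSE_TICKS - 1UL;
        s.fine_ticks = (uint8_t)(delay_ticks - s.coarse_counts * COARSE_TICKS);
    }
    return s;
}
```

For a 10.5msec delay (168000 ticks) this gives 1311 coarse counts followed by 192 fine ticks, i.e. a final 12usec counted at full resolution.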

Since only the rising edge of the bit transfer is timing sensitive, the STM8 can use the falling edge of the LOCK's output bit (assuming it's expected to be a 1) as a timing adjust/cal point. TIM4 will be counting up since the expected rising edge, an interrupt can be enabled for the falling edge of LOCK's data. That GPIO isr will then read TIM4 value and compare it to the expected ~4usec pulse width. If it's beyond a tolerance I'm thinking that simply adding/subtracting ~1bit from the calibration factor will account for drift. Everything has to be pretty close to correct timing if we're still alive, so only minor adjustments should be needed to correct for rounding errors and slow drifts in HSI/CIC frequency.
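
The drift correction idea could look something like the following C sketch. It is a guess at the shape of the routine rather than working CIC code: the signed calibration value, the tolerance parameter, the +/- 8191 limit (13bit magnitude) and the sign convention (a pulse that measures long in HSI ticks means HSI is running fast, so delays get stretched) are all assumptions made for illustration.

```c
#include <stdint.h>

#define CAL_LIMIT 8191  /* assumed 13bit magnitude limit */

/* Nudge the calibration factor by one step when the measured LOCK pulse
   width (in TIM4 ticks) is outside the tolerance around the expected
   width, e.g. ~4usec = 64 ticks at 62.5nsec per tick. */
int16_t cal_trim(int16_t cal, uint8_t measured_ticks,
                 uint8_t expected_ticks, uint8_t tolerance)
{
    int16_t error = (int16_t)((int16_t)measured_ticks - (int16_t)expected_ticks);

    if (error > (int16_t)tolerance && cal < CAL_LIMIT)
        cal++;          /* pulse looked long: lengthen future delays */
    else if (error < -(int16_t)tolerance && cal > -CAL_LIMIT)
        cal--;          /* pulse looked short: shorten future delays */
    return cal;
}
```

Moving only one step per bit transfer keeps a single noisy measurement from throwing the timing off, which fits the expectation that only minor adjustments are needed while the CIC is still in sync.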

Learning more about STM8 interrupts:
Getting a little deeper into the STM8 I've realized there's a decent way to remove the 1-5cycle jitter from when an interrupt routine starts executing by using "wait for interrupt" opcode which pre-stacks the processor status, and freezes the CPU until an interrupt occurs. With that there's only 1-2 cycle jitter due to timing edge of interrupt and execution of isr instructions. So I'm planning to make use of that.

Additionally I'm realizing an async SNES CIC is even more of a pain as the PAL CIC runs at 3.57Mhz compared to 3.08Mhz NTSC CIC. So it's 1.12usec per PAL CIC machine cycle, and 1.3usec per NTSC CIC machine cycle. So while the machine cycle count is identical between PAL/NTSC SNES CIC, the actual time differs due to operating frequency. So all the timing delays would have to be adjusted to have a multiregion SNES CIC with an asynchronous implementation.
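
To make the PAL/NTSC difference concrete, here is a tiny C sketch converting CIC machine cycles to time for a given CIC clock. The 4 clocks per machine cycle figure is an assumption inferred from the numbers above (1.3usec at 3.08Mhz, 1.12usec at 3.57Mhz), and the function name is made up.

```c
#include <stdint.h>

#define CLOCKS_PER_MACHINE_CYCLE 4ULL  /* assumed, inferred from 1.3usec at 3.08Mhz */

/* Duration of a number of CIC machine cycles, in nsec (rounded down). */
uint32_t cic_cycles_to_nsec(uint32_t machine_cycles, uint32_t cic_clk_hz)
{
    return (uint32_t)((machine_cycles * CLOCKS_PER_MACHINE_CYCLE * 1000000000ULL)
                      / cic_clk_hz);
}
```

The same 1000 machine cycle delay comes out roughly 178usec longer at 3.08Mhz than at 3.57Mhz, which is why a purely time based async implementation would need a separate delay table (or scale factor) per clock frequency.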

QUESTION on NES CIC clocking in other regions:
I don't think that's the case for NES though. My PAL-A "Mattel" NES is running its CIC at 4Mhz just like NTSC. I don't have a PAL-B, Comboy, nor other Asian/Aussie NES variants. I only have a PAL-B CIC, and Comboy CIC yanked from cartridges which I place in my CIC socketed NTSC NES for testing. Since PAL-A is 4Mhz like NTSC, I'm hopeful all others are as well. If anyone has more info on that I'd appreciate it! Even just having confirmation that a PAL-B console runs its CIC at 4Mhz clock frequency would be good to know.

Discovering STM8's TIM1 has external clock pins available:
So aside from the struggles with my SNES implementation and the motivation it helped provide to making progress on an async solution, I've become more familiar with some of the STM8's details. Namely I'm better understanding how the timers work, and good news is I misunderstood TIM1's abilities previously. I was rather disappointed when I thought that there were no external clock sources (pins) available to clock any of the timers. My understanding was that "ETR" pins were the ones that could be used to clock counters. And with the 20pin package the TIM1_ETR pin is unfortunately not pinned out. While I was right about the ETR pin, TIM1 is able to use any of the 4 input pins as a clock source to the counter as well. TIM2 (the other 16bit counter on chip) however does not have this ability. Both TIM2 and TIM4 must be clocked from fMASTER which we need to be running on HSI 16Mhz to allow for multitasking the CICOp.

Learning this, I'm planning to have my SNES implementation use CIC CLK to allow TIM1 to count CIC cycles exactly. So TIM1 will be synchronous with the LOCK's CIC CLK, but the STM8 core itself won't be. I presume that'll be enough to get around issues I had with STM8 core stability when using CIC CLK as an external CPU core clock source. This also resolves the annoyance of PAL & NTSC SNES CIC's running at different frequencies.

What this means for the NES CICOp project:
This realization is good news for the NES CICOp project though. Worst case, the NES CICOp can also clock TIM1 with CIC CLK 4Mhz, while allowing the core to operate on 16Mhz HSI. Most of my prior proposed features would still be viable with this setup. However TIM1 is the most advanced timer on chip, it sure would be nice to have available for PWM DAC audio synthesis, or counting a cartridge signal with the "newly discovered" TIM1 clock inputs. In the end I still think it's possible to handle NES CIC timing with TIM4 solely, so TIM1 has ability to add even more features I previously didn't think were possible.

So there are 4 pins (PC3, PC4, PC6, & PC7) which can be used for TIM1 clock sources that I didn't previously realize. That really opens up opportunities for more interesting PPU A12, A13, (or PPU /RD?) counting, or a more exact CPU cycle counter with M2. I can't really think of any other signals on the connector that would be worth counting, chime in if you have other interesting ideas.

Two of those Port C pins used for TIM1 inputs also map to the SPI pins, but dropping SPI bus support isn't really a big loss. It's an I/O hog anyway with its 4 pins. PC3 can also be mapped to TLI "top level interrupt" which is an NMI for the STM8 core. I'm thinking this would be the perfect use for the mapper interrupt pin. That would allow the mapper interrupt to be non-maskable which is exactly what we're going for. While all I/O's can be used as configurable priority interrupts, there's only one interrupt vector per port (4 ports total on this device). So allowing the mapper interrupt pin to be separable from other GPIO interrupts aids in ensuring that mapper nibble writes aren't missed or delayed. That would leave 3 TIM1 pins available, 2 could be used as input, and the 3rd as an output (2A03 IRQ). That would allow TIM1 clock source to be selectable between two chosen signals at run time.

The real limitation with using TIM1 as a counter for external signals is that TIM1 was also the timer planned to be tasked as a PWM DAC for sound synthesis. Reason being that TIM1 can perform center aligned PWM generation which improves PWM DAC fidelity. But if edge aligned PWM is acceptable, then the PWM DAC could get switched to TIM2 which can only be clocked by 16Mhz.

Perhaps there isn't as much interest in the CICOp synth since it's not compatible on all consoles and requires an external dongle or console modification. On top of that, having TIM1 count external signals is a pretty powerful feature addition. Arguably the TIM1 counter feature outweighs the increased fidelity gained with center aligned PWM. I've yet to get anywhere close enough to measure/compare the difference in fidelity. So with that my plan is to focus TIM1 on counting external signals and TIM2 for PWM DAC. If a specific project greatly values center aligned PWM, and is willing to give up TIM1 counting features then they can make that trade assuming I can build that flexibility into the PCB layout.

In the end I still have to prove my concept of using TIM4 for CIC timing asynchronously. If I'm unable to pull that off, TIM1 will end up getting consumed to handle CIC timing synchronously. That would leave TIM2 & TIM4 available for PWM DAC, and 2A03 timer but hey that's still something!

Phew... Well things are getting pretty complicated here, but overall good news and some progress being made on this project. Part of me wonders if it just might be worth upgrading to the LQFP-32 package to make pin assignments simpler. But have to resist that temptation and do more with less!

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-08-27 (#203370)

infiniteneslives wrote:

Additionally I'm realizing an async SNES CIC is even more of a pain as the PAL CIC runs at 3.57Mhz compared to 3.08Mhz NTSC CIC. So it's 1.12usec per PAL CIC machine cycle, and 1.3usec per NTSC CIC machine cycle.

Hold on a sec. NTSC SNES consoles come in both 4MHz (ceramic resonator, SHVC-CPU-01) and 3MHz (APU 24.576MHz÷8) versions.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-08-27 (#203371)

lidnariq wrote:

infiniteneslives wrote:

Hold on a sec. NTSC SNES consoles come in both 4MHz (ceramic resonator, SHVC-CPU-01) and 3MHz (APU 24.576MHz÷8) versions.

Oh, Well that's good to know! I'm glad mine happened to have been 3.07Mhz version otherwise I might have gleefully assumed all NES/SNES CIC were 4Mhz. Guess it might not have been an issue as I prob would have stuck with a sync solution with traditional CPU cycle counting. But, I didn't realize the differences between NTSC versions, possible I wouldn't have seen these instability issues with a 4Mhz and gotten burned when shipping to a 3.07Mhz flake like mine...? I'll have to try and hunt down a 4Mhz SHVC-CPU-01 with ceramic resonator for testing. I've got a couple SNES jr's somewhere, but I'm guessing those are 3.07Mhz APU/8 as that sounds cheaper.

All this is even more reason to keep TIM1 CIC CLK cycle counting for the STM8. A purely async solution for SNES would be quite the PITA. Thankfully/Hopefully that doesn't seem to be the case for NES as I'm still hopeful all versions are 4Mhz..? I've only got NTSC and PAL-A to test with.

EDIT: Just pulled out one of my SNES jr's and its CIC Clock frequency is 3.57Mhz similar to my PAL SNES (apparently German and IDK if it's 1-2chip). So if 1chip SNES CIC's run at 4Mhz, then there are a total of THREE different CIC clock frequencies for NTSC..?

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-19 (#204827)

Slight update..

Every time I think I've got an understanding on these STM8 timers I'm proved wrong shortly afterwards... But I suppose I can't be too surprised considering the timer portions of the STM8 reference manual span a whopping 119 pages! As with many things it's not until you finally start writing code for a piece of hardware that you actually start to learn its true behavior. Good news is the STM8 timers are nearly identical to the STM32 timers so I won't be starting from square one when I get more involved with the STM32 for other projects.

So I got the basic SNES CIC operating by use of TIM1 for time keeping purposes. Still technically a synchronous implementation, but the mcu core itself is asynchronous, just the timer is CIC LOCK synchronous. This allowed the STM8 core to be clocked by internal 16Mhz oscillator. This implementation is much more stable than my previous one where the core was clocked by 3-4Mhz CIC CLK signal. It was also significantly easier as cycle counting wasn't needed at all. Although that opinion is a little biased considering it was my 3rd NES/SNES CIC implementation so all my mangle calculation bugs had been previously worked out.

If nothing else, I would strongly recommend for anyone looking to make their own CIC implementation to utilize an on chip timer/counter to ensure proper CIC timing. Especially if it has an auto-reload feature making it easy to change the rollover value of the counter without interrupt jitter. My implementation is fairly simple, but it took a bit to realize the best way to operate the timer. It's best to have the reload value preloaded, that way the next event "count down" value is automatically loaded into the counter on its next reload. The main trouble then is that one needs to know what the upcoming delay should be prior to the event starting. This is easy for bit transfers, but mangle timing calculations aren't as straightforward as they vary based on the data. My solution was to determine the approximate min mangle time based on number of mangles needed, and assuming shortest mangle time; then loading that into the timer prior to performing mangle calculations. Subsequently during the mangle calc I kept track of how many mangle calcs were the longer 'overflow' version. Once the mangle calc is complete, simply do the math to figure out the added delay needed for exact mangle calc timing.
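
The two step mangle timing approach described above might be sketched in C like this. The per-pass cycle counts are placeholders, since the real CIC figures aren't given in this thread, and the struct and function names are hypothetical; the point is only the split between a value that can be preloaded before the calc and a top-up that is known afterwards.

```c
#include <stdint.h>

typedef struct {
    uint16_t short_cycles;           /* CIC clocks for the shortest mangle pass (placeholder) */
    uint16_t overflow_extra_cycles;  /* extra clocks for the longer 'overflow' pass (placeholder) */
} mangle_timing_t;

/* Value to preload into the timer before the mangle calc starts:
   assume every pass takes the short path. */
uint32_t mangle_min_delay(const mangle_timing_t *t, uint8_t passes)
{
    return (uint32_t)t->short_cycles * passes;
}

/* Added delay to schedule once the calc is done and the number of
   'overflow' passes has been counted. */
uint32_t mangle_added_delay(const mangle_timing_t *t, uint8_t overflow_passes)
{
    return (uint32_t)t->overflow_extra_cycles * overflow_passes;
}
```

The sum of the two values is the exact mangle time, but only the first one has to be known before the timer's reload register is written.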

Having the mcu core asynchronous makes coding a breeze in comparison to cycle counting especially on a prefetched 3-stage pipeline! Wish I had gone this route from the beginning. There are benefits all around including condensed program code, and lower power consumption as the mcu can spend a large amount of time sleeping waiting for the timer interrupt. This also allows for trickery in region detection that isn't possible in a sync core. Most regions can be detected on the fly in real time as you can poll the LOCK's data prior to outputting data as the KEY. And now there's free time to adjust the seed accordingly since the mcu core is operating so much faster. Not that traditional means of region switching and saving last known good region to eeprom is that burdensome, but it's cool to have the ability to sense the region on the fly. And of course now the mcu core is operating an order of magnitude faster than the LOCK CIC, it has free time for multitasking fancy new features so long as CIC timings are sufficiently prioritized.

Biggest limitation I realized is that NOT ALL TIM1_CHx inputs can be used for clock inputs to TIM1 on the STM8. The basic diagram doesn't make this clear and so my previous presumption is wrong. The only available TIM1 external clock pin options are ETR, TIM1_CH1, and TIM1_CH2. The ETR pin is unfortunately not pinned out on the 20pin packages, but TIM1_CH1 & TIM1_CH2 are available. TIM1_CH1 & TIM1_CH2 are mostly just as capable as each other for timer clock sources. Only difference is that CH1 can be set to clock TIM1 on all edges (both rising and falling), CH2 doesn't have that setting available. Both CH1 & CH2 can clock TIM1 on rising *or* falling edges, and there's also a nice filter ability where it must be high/low for so long before the edge is actually 'valid' and TIM1 gets clocked. In reality the edges of TIM1_CH1/CH2 aren't fed directly to TIM1 counter. Instead, the internal oscillator's clock is passed to TIM1 following adequately filtered edges of TIM1_CH1/2 pins. So with a mcu core frequency of 16Mhz, an external clock source must be high/low for more than ~63nsec to be a viable clock source. For the slow poke that NES signals are, that's not an issue, and probably for the best especially since signals like PPU A12 are so noisy.
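As a quick sanity check on the ~63nsec figure, here is a small Python sketch of the arithmetic. The NTSC M2 frequency, the 50% duty cycle, and the `filter_samples` knob are assumptions added for illustration; they are not from the post:

```python
F_MASTER_HZ = 16_000_000   # STM8 core clock from the post
NTSC_M2_HZ = 1_789_773     # NTSC 6502 clock (assumption, not from the post)


def min_valid_pulse_ns(f_master_hz, filter_samples=1):
    """Shortest high/low time that survives resynchronisation to the core clock.
    filter_samples is a hypothetical knob standing in for the optional filter."""
    return filter_samples * 1e9 / f_master_hz


def phase_ns(signal_hz, duty=0.5):
    """High and low times of a signal, assuming the given duty cycle."""
    period = 1e9 / signal_hz
    return period * duty, period * (1 - duty)


limit = min_valid_pulse_ns(F_MASTER_HZ)
high, low = phase_ns(NTSC_M2_HZ)
print(f"min pulse {limit:.1f} ns, M2 high {high:.0f} ns, low {low:.0f} ns")
```

Even with a lopsided duty cycle, each phase of a signal in the 1-2 MHz range is several times longer than one 16Mhz core period.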

Anyway, I'm mostly done tinkering with the SNES CIC now, so can get back to work on the NES CICOprocessor! Having the SNES implementation is good proof of concept that worst case the CICOp can use externally clocked TIM1 to ensure proper CIC timing. But that's worst case as it requires TIM1/CIC timing to have highest interrupt priority. With that worst case the STM8 can't be guaranteed to catch/respond to 6502 requests. I'm still holding out that I can pull off CIC timings with the meager 8bit internally clocked TIM4 alone; and still prioritize mapper requests over CIC timing with enough trickery and convention with the mapper register read/write protocol as previously laid out.

So the only real update to the CICOp abilities is that we've only got 2 pins to choose from for TIM1 clock sources, compared to the 4 previously thought. Curious if anyone has interesting ideas for useful signal choices. PPU A12 and CPU M2 are the most traditional choices to provide the choice between MMC3 style scanline counter and FME7 style CPU cycle counters. The only other signal of interest that comes to my mind is PPU A13 or PPU /RD for increased precision compared to PPU A12, and removal of MMC3 style pattern table and 8x16 sprite restrictions. With only 2 pins available what would you guys choose for TIM1 clock sources?

Crazy to think about unlocking features like this for the "low cost" of hardware development time. Considering the cost, it's possible the CICOp might even make sense for expanding the abilities of a MMC1-3 scale mapper...

I'm getting low on my discrete mapper board inventory so it's time for me to draft the first iteration of my CICOp plans into the PCB design for my next board order. I might try to include some jumpers to select TIM1 clock sources at assembly time, but the number of jumpers is already a bit higher than I'd like to see...

One other nifty thing I've managed to work out recently is the STM8 in-circuit programming SWIM (single wire interface module) protocol with my latest kazzo/inlretro build. It's pretty slick being only 1 signal, and the entire STM8 memory map is available to read/write from via SWIM, along with ability to control the STM8 core via the debug module. When the chip's read out protection is set, debug module, flash, and eeprom are locked, but SRAM and all periphery registers are still available. So it can actually become a slick means of i/o expansion since all the STM8's gpio registers can be accessed via a single wire even on a virgin/locked STM8. This is nifty for open/short testing to find PCB assembly flaws. Also might try to abuse this for accessing on board CPLD's JTAG signals without needing a single byte of code executing on the STM8. That is especially helpful for famicom carts where there's effectively no card edge pins to spare, now for the low cost of an STM8 I can indirectly add a bunch of pins to the card edge!

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-19 (#204830)

infiniteneslives wrote:

So the only real update to the CICOp abilities is that we've only got 2 pins to choose from for TIM1 clock sources, compared to the 4 previously thought. Curious if anyone has interesting ideas for useful signal choices. PPU A12 and CPU M2 are the most traditional choices to provide the choice between MMC3 style scanline counter and FME7 style CPU cycle counters. The only other signal of interest that comes to my mind is PPU A13 or PPU /RD for increased precision compared to PPU A12, and removal of MMC3 style pattern table and 8x16 sprite restrictions. With only 2 pins available what would you guys choose for TIM1 clock sources?

On the one hand, I suspect that the precision available from using PPU/RD would be lost in the 6502's variable IRQ latency. On the other hand, since your hardware can count both rising and falling edges, using PPU/RD lets you directly set IRQ coordinates to a specific (X,Y) location on screen.

One of the API niceties of the MMC3 IRQ is that it doesn't let you do anything obviously wrong regarding IRQ latency: it'll always trigger at the same X position. PPUA13 and M2-based IRQs can easily be used wrong and slip a few cycles in response.

I do kinda wonder whether the VRC4's M2-based IRQ prescaler is a better choice, though. Having to rely on the screen being enabled in order to get IRQs feels like a silly constraint.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-09-19 (#204831)

Two things that break an M2/(341/3) prescaler like that of the VRC4/6 are PAL NES and the Hi-Def NES's overclocking feature.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-19 (#204836)

As always, great points guys!

Quote:

It lacks grace, but perhaps one way to overcome the 6502 variable IRQ latency would be to have a "double fire" mode. Have the first IRQ be variable and get the 6502 spinning on some less variable code. Either way one is really going to have to work hard to get precise interrupts though, so perhaps PPU/RD counting isn't that valuable..

Quote:

I do kinda wonder whether the VRC4's M2-based IRQ prescaler is a better choice, though. Having to rely on the screen being enabled in order to get IRQs feels like a silly constraint.

Seems like inclusion of M2 as one of the two signals is an obvious choice for the first pin. I'm leaning towards PPU A12 for the second one simply because it's more traditional and thus more likely to be adopted. If it's not too much of a mess, perhaps include a jumper to swap PPU A12 out for PPU/RD if there's a desire to get fancy. Chances are such jumper is unlikely to get utilized on the first board rev though..

Quote:

Two things that break an M2/(341/3) prescaler like that of the VRC4/6 are PAL NES

The nice thing about the CICOp is that it could actually support various prescaler modes to compensate for this. A software selectable prescaler setting could choose between no prescaler, VRC4 "NTSC traditional" 113-2/3, or VRC4 "PAL alternate" 106-9/16 (341 dots / 3.2 dots per CPU cycle), etc.
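A selectable fractional prescaler can be done in integer math with an accumulator. The sketch below is Python for illustration only; it follows the subtract-and-reload scheme the VRC4 prescaler is generally described as using, and the PAL ratio is my own arithmetic (341 dots divided by 3.2 dots per CPU cycle = 1705/16), so treat both as assumptions:

```python
def count_scanline_clocks(m2_cycles, num, den):
    """Fractional divide-by-(num/den) in integer math: subtract den on every
    M2 cycle and emit one clock each time the accumulator runs out.
    Returns how many clocks were emitted over m2_cycles."""
    acc = num
    clocks = 0
    for _ in range(m2_cycles):
        acc -= den
        if acc <= 0:
            acc += num
            clocks += 1
    return clocks


NTSC = (341, 3)    # 341/3 = 113-2/3 CPU cycles per scanline
PAL = (1705, 16)   # assumption: 341 / 3.2 = 1705/16 CPU cycles per scanline

print(count_scanline_clocks(341, *NTSC))    # 3 scanline clocks in 341 cycles
print(count_scanline_clocks(1705, *PAL))    # 16 scanline clocks in 1705 cycles
```

Switching modes is then just a matter of loading a different (num, den) pair, and "no prescaler" is the degenerate pair (1, 1).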

Quote:

and the Hi-Def NES's overclocking feature.

I'm not familiar with the specifics of Hi-Def NES's overclocking. Perhaps this is the wrong attitude, but from my (admittedly biased) perspective it's their job to replicate (or not change the behavior of) the original console. I enjoy supporting clone consoles whenever reasonably possible. But trying to support them all, including ones which have yet to be created, is futile. So I try not to lose much sleep over it nor let it have a strong influence on design choices. If there's something that I can add to improve compatibility and a means for me to test it I'm open to the idea though.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-20 (#204856)

infiniteneslives wrote:

My ridiculous "pipe dream" IRQ system for the NES involves requesting a specific X/Y location for the IRQ to fire, asserting the IRQ early, and an injected clockslide to get the IRQ to start with CPU cycle precision...

This pretty clearly is out of scope for the CICoprocessor

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-27 (#205285)

Made a little more progress!

Mostly have pinout nailed down for my first board version. Came to realization that I really just need two STM8 footprints on the board for now. Reason being that for my standard discrete 'non-CICOp' boards the goal is to have the STM8 increase user friendliness with fewer jumpers for things like PRG-ROM and CHR-RAM/ROM size trimming. Where the CICOp demands pins like CPU D0-3, and signals useful for all the hoped features. So it looks like I've got enough room for the two separate footprints with their varying pinouts. One perk of that is that it creates a bit of a backup plan if things go south with implementation of async CIC timing using TIM4 alone. Could actually have two STM8 on board, one for acting as a CIC, and the other as a dedicated co-processor. Certainly doesn't meet the goal of minimal hardware if it actually comes down to that last resort. But considering the STM8 is one of the lowest cost mcus on the market having a second one if it's actually getting well utilized isn't that crazy and much cheaper than a CPLD.

So here's the planned pinout:
PA1 & PA2: CIC Din, Dout, and wire ORed CIC Reset

PA3: CPU R/W

PB5: PRG-ROM /WE pin to support flash writes without '139 logic gate on UxROM (jumper makes optional for alt func)
PB4 & PB5 alternate function: I2C bus available pinned out to female header. Could support RTC, etc.

PC3 TLI: this is an external NMI pin for the STM8, using this for the mapper bit so any other PORTC pins can trigger interrupts with a separate isr. Using TLI gives a little extra insurance that CICOp register read/writes have highest priority.

PC4 & PC5: TIM1 & TIM2 output channels. Came up with nifty way to arrange four SMT resistor pads in a square with the mcu pins in opposite corners. IRQ and PWM DAC signals are in other two corners. So placing the resistors horizontally will map TIM1 to 6502 /IRQ (support scanline/CPU cycle counting), and TIM2 to PWM DAC (edge aligned PWM). Mounting resistors vertically maps TIM1 to PWM DAC (center aligned), and TIM2 to /IRQ (async timer only). I was about to give up on option for center aligned PWM DAC till I realized this trick and 0ohm resistor for /IRQ kept routing simple.

PC6: TIM1_CH2 6502 m2 clock for cycle counting with TIM1
PC7: TIM1_CH1 PPU clock, jumper selectable between (default) PPU A12 and (alt) PPU /RD

PD1-4: CPU D0-3

PD5: H/V mirroring control with small MUX gate
PD6: Debug output trying to fit in an LED if I can
PD5 & PD6: can be dual purposed as they're routed to female header for UART to support low cost BT/WiFi modules to be added on.

So in the end giving up on the SPI bus in favor of a scanline/CPU cycle counters gave some breathing room for the pinout. In the end I'm not even sure if I need CPU R/W but not much benefit to leave it out as there's already pins to spare.

Current layout supports CICOp on a decent variety of discrete mappers. Really the only thing that's needed is a spare mapper bit to interrupt the STM8 for CICOp register access. Planning support for UxROM w/512KB PRG-ROM or less, BxROM with 512KB PRG-ROM or less, CNROM with 64KB CHR-ROM or less, and Colordreams with 256KB PRG-ROM or less (or 64KB CHR-ROM or less).

Haven't even started laying traces for the board yet, but current rat nest and component density looks manageable.

As with most things like this I typically realize dumb i/o assignment choices when I actually start writing firmware for the design. So I whipped up a little prototype board with the STM8 on a breakout board. Wrote a little NES test rom and the CICOp register access isr and have successfully transferred data between the 6502 and CICOprocessor!

Realizing some flaws and limitations to my original proposal:

Code:

;now that everything's prepared, perform the mapper write:
STA $8000   ;write to discrete mapper with bit 7 set
STY $5x0x   ;write lower nibble to mcu
STX $5x0x   ;write upper nibble to mcu
;write complete, now read back the old value that was in the mcu register
LDY $5x0x   ;read old value from mcu register (lower nibble)
LDX $5x0x   ;read old value from mcu register (upper nibble)
AND #$7F    ;clear bit 7 so we can disable mcu's interrupt
STA $8000   ;write to discrete mapper with bit 7 clear, CIC mcu interrupt complete

26 cycles of timing sensitive code total

Firstly, the STY $5x0x and STX $5x0x opcodes aren't the best choice because they don't make the CICOp register being accessed variable unless the routine is self modifying.

Secondly it's limiting because we're running out of registers in this routine. I'm trying to keep it as short as possible, and using A to maintain the current bank is a bit of a waste of instruction time and 6502 registers.

Lastly I had assumed the mapper wasn't subject to bus conflicts. But ensuring that may require extra hardware on board, so coming up with a bus conflict compatible routine is helpful. I got around this by simply requiring that a specific bank is always active during this routine, so these values can now be hard coded and align with a bank table.

Remember the CICOp doesn't decode CPU addresses, it's merely snooping the CPU bus during opcode fetching. So really for the 6502 to give data to the CICOp we only need an instruction that presents info on the CPU data bus at some point. Something like STA $5000, X with the register offset in X doesn't work as X is never present on the CPU data bus. Looking at other options I landed on ZP addressing with STA (ZP), Y. This works because the 6502 fetches the ZP bytes from sram so we can sniff their value to glean which CICOp register is being accessed. And while it appears annoying that Y gets consumed/zeroed, its value is actually moot. So Y can be any value as the CICOp can't even see it.
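To make the bus-sniffing idea concrete, here is a small Python model of what sits on the data bus during each cycle of STA (zp),Y on a stock 6502, and what a snooper wired only to D0-D3 could recover. The function names and the choice of which pointer nibble is the "high" register nibble are assumptions for illustration:

```python
def sta_indirect_y_bus(zp_addr, ptr_lo, ptr_hi, a):
    """Data-bus byte on each of the 6 cycles of STA (zp),Y.
    None marks the dummy read, whose value depends on what is mapped there."""
    return [
        0x91,      # cycle 1: opcode fetch
        zp_addr,   # cycle 2: operand fetch, zero-page address of the pointer
        ptr_lo,    # cycle 3: pointer low byte read from zero page
        ptr_hi,    # cycle 4: pointer high byte read from zero page
        None,      # cycle 5: dummy read while the index is added
        a,         # cycle 6: the actual write, A is driven onto the bus
    ]


def snoop(bus):
    """What a snooper on D0-D3 only could recover (low nibbles)."""
    reg = ((bus[3] & 0x0F) << 4) | (bus[2] & 0x0F)   # assumed H:L ordering
    data_lo = bus[5] & 0x0F
    return reg, data_lo


# Pointer $530A selects register $3A; A carries the low data nibble.
bus = sta_indirect_y_bus(0x04, ptr_lo=0x0A, ptr_hi=0x53, a=0x7C)
print(snoop(bus))   # (58, 12), i.e. register $3A and data nibble $C
```

Note that Y never appears in the list: it only affects the address bus, which is why its value doesn't matter to the snooper.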

Having learned a bit more about STM8 interrupts, I also realized that the TLI interrupt is edge sensitive, not level sensitive. So the priority I had to clear $8000.7 mapper bit ASAP is of no value. We can simply clear, then set the TLI mapper bit to create the needed rising/falling edge for the interrupt.

So with all that, this is what I came up with:

Code:

       ;cicop_reg is a ZP pointer that needs to be initialized to $5x0x
       ; where the 'x' values denote the CICOp register number being accessed
       ; the '5' and '0' in $5x0x are actually "don't care" as the CICOp can't see their values.
       ; The address in cicop_reg just needs to be an empty address space that won't conflict with anything else.
       ;Lower nibbles of A and X must contain the value (byte) that is desired to be written to the CICOp register

        ldy     #CICOP_BANK_DIS
        sty     CICOP_ADDR_DIS  ;8C 00 C0
        ldy     #CICOP_BANK_EN  ;A0 80

        ;trigger CICOp to start transfer operation:
        sty     CICOP_ADDR_EN
        ;8C 80 C0

        ;first allow CICOp to sniff reg number, and write low nibble of data
        sta     (cicop_reg), y          ;y doesn't actually matter CICOp can't see it
        ;91 04

        ;now the CICOp has register H:L from ZP sniffing, and data L from A lower nibble

        ;write upper nibble, contained in lower nibble of X
        stx     cicop_reg               ;doesn't actually matter what ZP byte is written to
        ;86 04
        ;cicop_reg gets stomped could use different ZP byte to avoid this

        ;now the CICOp has data H from sniffing store of X to ZP

        ;now time to read data from CICOp
        ;the data is returned in specific order,
        ;the CICOp doesn't know what register (X,Y,A) the 6502 is loading into

        ldx     CICOP_PORT      ;data L
        ;AE 00 50
        ldy     CICOP_PORT      ;data H
        ;AC 00 50
        lda     CICOP_PORT      ;data E (error/verification)
        ;AD 00 50

;END timing sensitive code.  The CICOp has now freed itself to go back to whatever it was doing
;don't actually have to clear mapper enable bit.  CICOp won't trigger until next rising edge.

25 cycles total of timing sensitive code.

So by getting a little tricky with choice of instructions I was able to reduce the number of timing sensitive cycles by 1 cycle, while also transferring 1 more nibble of 6502 read data! Granted that cycle count doesn't include all the preparation to get all the data loaded into ZP, A, & X, and also verify the transfer was successful. But those portions can be tailored/optimized by the user if desired. Would be reasonable to have separate read/write routines on the NES. Also possible some CICOp registers could be defined to have fewer read nibbles, and the STM8 would simply bail from the isr once write data was received depending on the register number. Or perhaps some registers are read/write but the actual value doesn't matter, in the case of something like an IRQ acknowledge/clear register for example.
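The 26 and 25 cycle totals can be re-added from standard 6502 instruction timings. This is just a Python tally for illustration; the short mnemonic keys are my own shorthand, and the second routine is counted from the triggering store onward as in the listing:

```python
# Standard NMOS 6502 cycle counts for the addressing modes used in both routines.
CYCLES = {
    "sta abs": 4, "sty abs": 4, "stx abs": 4,
    "lda abs": 4, "ldx abs": 4, "ldy abs": 4,
    "and imm": 2, "sta (zp),y": 6, "stx zp": 3,
}

# First routine: STA $8000 through the final STA $8000.
original = ["sta abs", "sty abs", "stx abs", "ldy abs", "ldx abs",
            "and imm", "sta abs"]

# Second routine: from the triggering STY through the third load.
revised = ["sty abs", "sta (zp),y", "stx zp",
           "ldx abs", "ldy abs", "lda abs"]


def total(seq):
    """Sum the cycle counts of a list of instruction/addressing-mode keys."""
    return sum(CYCLES[op] for op in seq)


print(total(original), total(revised))   # 26 25
```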

I feel a lot more confident about the robustness of this routine now. I'm also a bigger fan of always performing both read and write to CICOp. If the register is defined as a write only register, then the last 3 LoaD instructions can actually be used for verification on the 6502 side to ensure that the transaction was successful. The CICOp can repeat back the exact data byte that it heard from the 6502. My thought for now is the 3rd LoaD could be an xor of the two nibbles of the register number. That would help verify that it also sniffed the register number correctly. But in the end this data could be defined as whatever we'd like later on. Appear to have a pretty rock solid transfer routine between the 6502 and CICOp for both reads and writes which is good enough for now!
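One way the echo-back check for a write-only register might look, sketched in Python for illustration. The function names are hypothetical, and the check nibble follows the xor idea floated above rather than any finalized definition:

```python
def cicop_reply(reg, data):
    """Nibbles a write-only register could echo back: data L, data H, check."""
    check = ((reg >> 4) ^ reg) & 0x0F   # xor of the two register-number nibbles
    return data & 0x0F, (data >> 4) & 0x0F, check


def verify(reg, data, reply):
    """6502-side check. Only the low nibble of each loaded byte is compared,
    since the CICOp drives D0-D3 only and the upper bits are whatever the
    bus happens to hold."""
    return tuple(n & 0x0F for n in reply) == cicop_reply(reg, data)


print(cicop_reply(0x3A, 0x7C))                    # (12, 7, 9)
print(verify(0x3A, 0x7C, (0xFC, 0xF7, 0xF9)))     # True
print(verify(0x3A, 0x7C, (0xFC, 0xF7, 0xF8)))     # False
```

On the real 6502 side this would be a few AND/CMP instructions after the timing sensitive section, so it doesn't add to the cycle-critical budget.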

Implementing things on the STM8 side ended up working pretty well. In the end I don't even actually poll CPU R/W level to try and remove STM8's 5 cycle isr latency variation. It didn't actually help as polling creates its own jitter. And in practice, the STM8 isr latency variation isn't that bad: it's only 3 cycles (187nsec).

Reason for that is the 1-6 cycles advertised in the STM8 programming manual for time needed to complete the current instruction assumes worst case. It assumes the STM8 may be executing FAR instructions, and this core doesn't even have any FAR address space. So that cuts us down to 1-5 cycles. However there are really only 3 instructions that are 5 cycles, and we can easily avoid them. They are CALL subroutine with indirect pointer, and LoaD/STore Word with indirect pointer. These 3 instructions are pretty easily avoided as immediate addressing is typically all that's needed for them. So that reduces the number of cycles needed to complete the current STM8 instruction to 1-4 cycles, which is only 3 cycles of jitter. In practice I verified this with dozens of logic analyzer captures, so it all checks out.
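The jitter numbers follow directly from the 16Mhz core clock. A small Python check, where the NTSC 6502 clock used for comparison is an assumption brought in from outside the post:

```python
F_MASTER_HZ = 16_000_000   # STM8 core clock
NTSC_M2_HZ = 1_789_773     # NTSC 6502 clock (assumption for comparison)


def isr_jitter_ns(min_cycles, max_cycles, f_hz=F_MASTER_HZ):
    """Latency spread caused by finishing the instruction in progress."""
    return (max_cycles - min_cycles) * 1e9 / f_hz


worst = isr_jitter_ns(1, 6)      # datasheet worst case, 5 cycles of spread
actual = isr_jitter_ns(1, 4)     # after avoiding FAR and the 5-cycle opcodes
cpu_cycle_ns = 1e9 / NTSC_M2_HZ

print(worst, actual)             # 312.5 187.5
print(round(actual / cpu_cycle_ns, 2), "of one NTSC 6502 cycle")
```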

One idea I had to remove jitter would be to set an STM8 GPIO interrupt on the CPU R/W pin, and the first step of the TLI isr would be to enable the CPU R/W interrupt and then WFI wait for interrupt. This would remove most of the jitter, but I fear this would suffer from incompatibility with various console versions. The edge timing of CPU R/W in relation to M2 could easily vary on clones, AVS, etc. And this current setup bases all its timing off the mapper register bit setting which should be pretty reliable. My only real concern for timing is drift of the STM8 internal oscillator, may have to trim the oscillator if temperature variation becomes an issue.

For the 6502 being as slow as it is, 187nsec of isr jitter is manageable. With properly timed and cycle counted STM8 code I was able to reliably capture and present all necessary data for the CICOp data transfer routine I just laid out above. My current tests are a simple one time check at boot/reset that's printed to the screen. Planning on running some automated tests to really exercise the routines. But early tests look good on original NTSC front loader and a portable clone I keep handy. A separate routine will be needed for PAL support, but there is enough free time early in the isr to branch between separate PAL/NTSC routines with their own fixed cycle counted timing. NTSC is worst case (faster), so I'm not too concerned about PAL.

One idea I had was for the STM8 to verify that CPU R/W was low when it should be for added robustness. One concern I realized when writing this is that extreme caution needs to be exercised to ensure the NES doesn't get interrupted in the middle of this routine. If it did, the STM8 will blindly output data onto the bus when it comes time for the last load instructions. If the 6502 isn't executing this code because it was interrupted that could easily cause a CPU crash. The only real way to guarantee that is for NMI's to be turned off during this routine along with disabling interrupts. That's probably the best call for early NES development using the CICOp until one is certain that an NMI won't occur mid transfer.

I still need to draft up an async NES CIC implementation using TIM4 alone for CIC timing so we can dedicate TIM1 to scanline/CPU cycle counting or center aligned PWM DAC. But there's quite a bit of dead/nop time in my current isr to allow for CIC transfers in the middle of the isr. So I've got a clearer outlook on my timing constraints that the async CIC will have to cater to. That part is still a decent challenge and will require lots of testing, but with successful 6502-CICOp read & write data transfers under my belt I'm pretty confident.

Now that I've got a good means to communicate between the 6502 and CICOp it's time for the fun stuff. Next up is trying out some audio synthesis and PWM DAC experiments, along with scanline/CPU cycle counting tests!

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-27 (#205289)

infiniteneslives wrote:

Code:

   sty   CICOP_ADDR_EN   ; 8c 80 c0 80 <--data bus contents
   sta   (cicop_reg), y  ; 91 04 ll hh xx xx
   stx   cicop_reg       ; 86 04 ee
   ldx   CICOP_PORT      ; ae xx xx LL
   ldy   CICOP_PORT      ; ac xx xx HH
   lda   CICOP_PORT      ; ad xx xx EE

To make sure I understand:

The CICOp literally waits for the rising edge of the latch, then waits the exact number of ns for the "ll", "hh", and "ee" bytes to come along on the data bus, optionally (probably) waits the exact number of ns for the subsequent "LL", "HH", and "EE" bytes to come along, and drives the data bus at those times?

Seems ridiculous, clever, and fragile, but also like you've got a good handle on it.

infiniteneslives wrote:

IRQ and PWM DAC signals are in other two corners. So placing the resistors horizontally will map TIM1 to 6502 /IRQ (support scanline/CPU cycle counting), and TIM2 to PWM DAC (edge aligned PWM). Mounting resistors vertically maps TIM1 to PWM DAC (center aligned), and TIM2 to /IRQ (async timer only). I was about to give up on option for center aligned PWM DAC till I realized this trick and 0ohm resistor for /IRQ kept routing simple.

Does center-aligned PWM get you better bit depth? I can't imagine there'd be an audible difference otherwise...

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-28 (#205321)

Quote:

The CICOp literally waits for the rising edge of the latch, then waits the exact number of ns for the "ll", "hh", and "ee" bytes to come along on the data bus, optionally (probably) waits the exact number of ns for the subsequent "LL", "HH", and "EE" bytes to come along, and drives the data bus at those times?

That's correct. Although to be clear/pedantic the CICOp receives a non-maskable interrupt when the mapper register bit ($8000.7 for example) toggles from clear to set, it's not waiting/polling for the register bit. Additionally the CICOp only has connection to 6502 CPU D0-3, so it's waiting for "lower nibbles" to come along on the data bus, not "bytes".

Quote:

Seems ridiculous, clever, and fragile, but also like you've got a good handle on it.

Haha thanks! Yeah as it turns out the STM8's timing for reading data off the 6502 bus is tighter because we have to wait until data is valid. Outputting data on the 6502 bus has significantly more slack time since it can be output early without concern. So 6502 STores have tighter timing constraints than LoaDs from the CICOp perspective. That ends up working in our benefit as the STA/STX naturally have to come before LDA/LDX/LDY. For that reason it's less sensitive to STM8/CICOp's oscillator drift adding up to timing error towards the end of the data transfer during LDA/LDX/LDY time.

The biggest fragility to this setup in my mind is ensuring that the 6502 doesn't get interrupted in the middle of this data transfer routine. There is room for a little more protection from this. Ensuring that CPU R/W is low when it should be during STA/STX helps the CICOp verify the 6502 is performing the expected routine. I also realized that it wouldn't be too hard to verify the lower nibble when the LDA/X/Y CICOP_PORT opcodes and operands are getting fetched by the 6502. Doing that would help assure that the 6502 didn't get interrupted mid routine and the CICOp is mostly safe to output data on the bus in the cycles that follow. I'm not sure if all that is necessarily worth the added complexity/bytes from the STM8 isr side or not, but it's an idea I'll have to keep in my back pocket if it becomes a problem. In the end I don't feel it's too much to ask of the NES code to ensure it doesn't get interrupted mid transfer.

Really it depends on how much access to the CICOp registers is needed. In reality most features only require the NES to write to the CICOp registers, so cutting out or reducing the LoaDs at the end of the transfer routine is an option. Once the CICOp register transfer routines are time tested and proven, perhaps it makes sense to completely remove the LoaD read back portion. They really aren't necessary for features like audio synthesis, switchable mirroring or scanline/CPU counters. The ability to read CICOp registers is really only needed for features like access to save eeprom, or external peripheral ports. Having the rule that rendering needs to be turned off for access to save eeprom shouldn't be a problem IMO. This mostly becomes a concern if looking to actually task the CICOp with digital processing for things like multiplication, division, or other math functions. I don't see data processing as the CICOp's prime feature set, but it's certainly capable of it if handled appropriately and the register interface didn't diminish returns too greatly.

Of course the real fix for fragility would be to add dedicated hardware/glue logic between 6502 and CICOp, but that detracts from the goal of minimal added hardware. Once that leap is made, I would also question if the STM8S003F3 is the best choice. Spending dollars on logic makes upgrading the mcu to a bigger STM8 or an STM32 tempting. So for now it helps to limit myself to the bare minimum and effectively no added parts to the board. I'm enjoying pushing this little STM8 to its limit and seeing what it's capable of, tricks and all. If nothing else I'm learning a lot in the process and will better appreciate more capable hardware in future projects.

Quote:

Does center-aligned PWM get you better bit depth? I can't imagine there'd be an audible difference otherwise...

Great question, I do not yet actually have any physical experience with PWM DACs. I've only recently learned about them and considered them viable. Most of my research has been reading up on what looks to be great info on open music labs. That article goes into pretty good detail of everything while also helping to realize the practical implications of PWM DAC design choices. Note that in the article they refer to center-aligned PWM as "phase correct PWM", and edge aligned PWM as "fast PWM". Here's the best answer I have for you quoting from that article:

openmusiclabs wrote:

Finally, once you have your topology and frequency selected, you can see what the bit depth trade-off is for using Fast PWM or Phase Correct PWM. Unless the signals you are generating are going to be very low frequency, it is almost always better to sacrifice a bit or two of resolution for the reduced distortion that Phase Correct PWM gives you.

Considering that we're only aspiring for "retro 8bit" audio synthesis, and not CD quality audio, I'm not sure how much the distortion will be a problem. Any distortion may simply become the CICOp's quirk and it's own characteristic sound, as the goal is not to replicate some other synth's sound perfectly. My initial goal is to mimic VRC6 audio as it's one of the simpler synths, and has a 6bit linear DAC. Getting close to VRC6 capability and resolution seems fairly achievable and a good starting point.

Lots of good info in openmusiclab's sketch as well including a helpful table for PWM DAC frequency, bit resolution, etc. Conveniently their table is at the same 16Mhz the CICOp will be operating at. My current plan/goal is to run at a PWM frequency of 31.3Khz which equates to the synth engine running every 512 STM8 instruction cycles. This equates to a 9bit edge aligned, and 8bit center aligned PWM DAC. From a STM8 CPU utilization standpoint I have a hard time expecting that the multitasked STM8 can handle more than that. But I also have not yet implemented the synth to get a good sense of how many calculations will actually need to be performed for each update. I expect the compute resources will ultimately be the limit of how many channels, or what forms each channel can take. While I may be able to create multiple different shapes and voice types, there will likely be limits as to how many can be active at one time. Also expecting square channels to be less compute intensive than triangle/saw, may help to limit volume range/step size as well. I'll be posting all the synthesis code publicly when I start digging into that, expecting there will be lots to discuss..
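The frequency and bit-depth figures fall out of the 512-cycle update period. A short Python check of that arithmetic, assuming the timer is clocked directly at the 16Mhz core rate with no prescaler:

```python
import math

F_MASTER_HZ = 16_000_000   # STM8 core clock, timer assumed unprescaled


def pwm_options(ticks_per_update):
    """PWM frequency and resolution for a given update period in timer ticks."""
    f_pwm = F_MASTER_HZ / ticks_per_update
    edge_bits = math.log2(ticks_per_update)         # counter counts up only
    center_bits = math.log2(ticks_per_update / 2)   # counter counts up, then down
    return f_pwm, edge_bits, center_bits


print(pwm_options(512))   # (31250.0, 9.0, 8.0)
```

Center aligned mode spends half the period counting back down, so the same update rate costs one bit of resolution.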

The other limit that I may start getting close to is memory resources of the STM8. This low end version has 1KByte of SRAM, 8KByte of program flash, and 128-640Bytes of eeprom. My current CIC implementation is around 1KB, there's some room for optimization, but I'm also expecting my async solution using TIM4 to require more code to support frequency drift correction. My current CICOp register transfer TLI isr is ~600Bytes, but I haven't made any effort to reduce its size as it's pretty chock-full of NOPs. Adding support for PAL could easily double that code size. I'm expecting to be able to keep the total register transfer isr well under 1KB though. So should have at least 4-5KB of flash to support CICOp features which seems like a reasonable budget. But big things like look up tables for audio synthesis could eat up large chunks of our flash budget if sights get set too high..

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-28 (#205323)

openmusiclabs wrote:

They explain in more detail on this page, but I'm ... not convinced of their math.

Humans can't usually hear phase; we can hear amplitude, and we can hear beat frequencies due to differences in phase, and we can hear group delay... but the amount of group delay we're talking about here is less than one sample period. And most of the error component that they identify in their math seems to be the portion that's above the source sample rate, and hopefully inaudible anyway. And the more I play with the math, the more it looks to me like the differences in phase are actually due to using a higher effective sample rate (and commensurate lower bit depth) than due to the symmetry of the waveform.

If I try generating a 5-bit PWM sine wave with 32 underlying samples (1024 total record rate) using 'fast' PWM, and a 4-bit PWM sine wave with 64 underlying samples (i.e. still 1024) using centered PWM, I actually do get a graph that's similar to their figures 6 and 7: DC component is the same, fundamental is the same, more energy at the 2nd and 3rd harmonics is present in 'fast' PWM and more energy is present at every higher harmonic in 'centered' PWM. (This corresponds to "Ratio of Signal Frequency to PWM frequency" of 0.03)

So, it looks like they're right.

On the other hand, if I manually load those 1024-sample loops that I generated and listen to them? I find the harmonic distortion of the centered PWM to be more grating. Here's two I made. They're set to a sample rate of 96kHz so that 96kHz÷1024 = something audible; be careful that your hardware actually supports this or that you're using a good-enough resampler to be assured that the sound differences are due to the method rather than the resampling.

Attachment:

File comment: f_sample = 96kHz
f_PWM = 3kHz
f_sine = 94Hz

center-and-fast-pwm-wavs.zip [759 Bytes]

Quote:

big things like look up tables for audio synthesis could eat up large chunks of our flash budget if sights get set too high..

You might be able to do (edit: hooked on phonics worked for me) something like the OPL2/OPL3 and just use a logsin and exp table...

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-28 (#205325)

Thanks for taking the time to break things down to something I can better understand. I saw that page, but that type of analog analysis is certainly not my strong suit. Not a whole lot of motivation to take the time to understand the math as I'm not even sure I can make heads or tails of how everything translates to audio quality I can make sense of in the end anyways..

Interesting to hear you find the distortion of fast/edge PWM less grating to the ear. I have no clue what my PC hardware is capable of, probably safe to say it's not capable. I can hear a difference, but have a hard time deciding what's better/worse on the ears to be honest.

Any input, criticisms, or suggestions on the audio synthesis is more than welcome. I'll do my best to explain and share what I'm doing as things move along in hopes that anyone who's interested can contribute or criticize my especially noobish judgements. I'm mostly learning as I go when it comes to the audio/analog arena.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-09-28 (#205328)

You don't have to lose a bit of precision by using symmetric PWM.

Say you have a sample period of 32 cycles, and the signal level is 13 out of 32. With single-ended PWM, the signal would be low 19 cycles and high 13. With symmetric PWM, it'd be low 10, high 13, and low 9. This gives a maximum phase variation of one-half cycle and mostly decorrelated (only with the level's least significant bit), which is an improvement over single-ended PWM's one-half sample period of variation correlated with the level.
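
A small sketch of that 13-out-of-32 example, in case it helps to see the run lengths computed. The function names are made up for illustration, and putting the extra low cycle at the front is just the convention the numbers above imply.

```python
def single_ended(period, level):
    """Edge-aligned PWM: one low run followed by one high run."""
    return period - level, level

def symmetric(period, level):
    """Centered PWM: the high run sits in the middle of the period.

    When the low time is odd, the extra cycle goes to the leading low run.
    """
    low_total = period - level
    leading = (low_total + 1) // 2
    trailing = low_total // 2
    return leading, level, trailing

print(single_ended(32, 13))  # (19, 13)
print(symmetric(32, 13))     # (10, 13, 9)
```

The high time is the same 13 cycles either way, so no resolution is lost; only the position of the pulse within the period changes.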

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-28 (#205330)

tepples wrote:

You don't have to lose a bit of precision by using symmetric PWM.

So I'm not sure I follow what you're saying exactly.. Practically speaking, to do what you're saying the output compare value of the PWM channel would need to be adjusted at the top and bottom of the counter, instead of just at the bottom. So your suggestion would be to adjust the output compare value to something like 100 on the way up, and then 101 on the way down, which would average to a pulse width equivalent to setting the PWM output compare to 100.5, effectively gaining a bit of resolution back?

I can't think of a practical limitation of why one couldn't do that, unless the timer hardware on the selected mcu couldn't produce an update interrupt at both the top and bottom of the counter. That said, I'm pretty sure the STM8 does permit you to do something like that. The only real drawback would be the added complexity/length (however minor) for the isr being executed every counter update event. Every little bit of execution time that can be reduced for that isr running every ~1k instruction cycles the better. 10 extra cycles equates to another 1% of CPU utilization in this proposed case.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-28 (#205331)

Depends on the hardware implementation, unfortunately.

—

I couldn't leave well enough alone, so I played around with both PWM implementations using PureData for a bit. After trying and failing to get anything more sophisticated than its built-in lowpass filter object (lop~) to work (no biquad~, no lop8_butt~) ... my conclusion:

(Implementation: 12.88MHz bit rate; 30720Hz PWM rate; sine waves from 100Hz to 1kHz)

Both kinds of PWM have comparable amounts of THD (-65dB(fast) vs -66dB(sym) @ f_sine/f_PWM=1/300; -57dB(fast) vs -60dB(sym) @ f_sine/f_PWM=1/100; -39.5 dB(both) @ f_sine/f_PWM=1/30) in the audible band. I'm hard pressed to support their assertion that symmetric PWM is more valuable than an extra bit of depth; the math and sim for real-world values I saw don't support that kind of unequivocal statement. Each bit of depth lost raises your noise floor by about 6dB; none of my tests have shown symmetric PWM to be enough better to warrant it.

"Symmetric" PWM does seem to put more energy at higher overtones than "fast" PWM, and which is more objectionable is a matter of personal taste and context.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-09-28 (#205332)

infiniteneslives wrote:

Practically speaking, to do what you're saying the output compare value of the PWM channel would need to be adjusted at the top and bottom of the counter, instead of just at the bottom. So your suggestion would be to adjust the output compare value to something like 100 on the way up, and then 101 on the way down, which would average to a pulse width equivalent to setting the PWM output compare to 100.5, effectively gaining a bit of resolution back?

Correct. I thought symmetric PWM was always defined this way, with the actual comparison being against a triangle wave that takes on a distinct value for all steps, such as 0, 2, 4, 6, 8, ..., 124, 126, 127, 125, 123, ..., 5, 3, 1.
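
To illustrate why the distinct-valued triangle keeps the full resolution, here is a quick sketch. The helper names are invented, and the plain up/down counter is included only as an assumed point of comparison for hardware that counts each value twice per period.

```python
def distinct_triangle(bits):
    """Carrier 0, 2, 4, ..., 126, 127, 125, ..., 3, 1 (for bits=7)."""
    top = 1 << bits
    return list(range(0, top, 2)) + list(range(top - 1, 0, -2))

def up_down_counter(bits):
    """Same period length, but every value appears once up and once down."""
    half = 1 << (bits - 1)
    return list(range(half)) + list(range(half - 1, -1, -1))

def high_cycles(level, carrier):
    """Output is high on every step where the carrier is below the level."""
    return sum(1 for c in carrier if c < level)

tri = distinct_triangle(7)
print(high_cycles(13, tri), high_cycles(14, tri))  # 13 14
print(high_cycles(13, up_down_counter(7)))         # 26, only even widths possible
```

Both carriers take 128 steps per period, but only the first one can produce every pulse width from 0 to 128.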

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-28 (#205333)

Again that depends on the hardware :/

Specifically in the case of the ATmega32's PWM hardware, the symmetric PWM mode always counts like the NES triangle: 0,1,2,3,...3,2,1,0,&c.

You are able to get interrupts right after each direction change, though, so it should still be possible to get higher bandwidth.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-28 (#205335)

That, and you actually need an up-down timer that supports center aligned PWM at your disposal. In my specific case with the STM8S003, only TIM1 is capable of up-down counting. TIM1 is also the only counter available to be clocked externally, making it the only counter available for scanline/CPU cycle counting. I was initially disappointed to give up center aligned PWM capability when all I was going off of was openmusiclabs. But with lidnariq's analysis and input on the matter I really don't feel this way anymore.

I am still curious to experiment and compare center and edge aligned PWM DACs to try and see how they actually compare in practice. Part of the thing is depending on the synth, there isn't necessarily much one can do with more bits of resolution. If mimicking or designing something similar to the VRC6, it only has 6bits of resolution by definition. Short of adding more channels, having a 9bit vs 6bit DAC doesn't allow for any improvement. However, there could still be a benefit from using a center aligned 6bit vs an edge aligned 6bit PWM DAC.

The benefit to be had with higher resolution DACs for sinusoidal voices or samples is there, but at this point I'm not even certain that's within the STM8's abilities... We'll see though, all the knowledge and experience that comes with this will be helpful for planning of more capable projects.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-28 (#205336)

Yeah, the big possibility for using extra depth would be for running an FM synthesizer. I'm certain it'd involve the logsin and exp tables, but I haven't yet bothered to sit down and figure out how to string together subtraction and the two tables to get an OPL out of it.

edit: this post quotes a very long thread here that says:

Olli Niemitalo wrote:

out = exp(logsin(phase2 + exp(logsin(phase1) + gain1)) + gain2)
[...]
Exponential table:

x = 0..255, y = round((power(2, x/256)-1)*1024)
[...]
Log-sin table:

x = 0..255, y = round(-log(sin((x+0.5)*pi/256/2))/log(2)*256)
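
For a sense of the flash cost, the two table formulas quoted above can be evaluated directly. This is only a sketch: the table names are my own, and storing each entry as 16 bits is an assumption about how one might lay them out on the STM8.

```python
import math

# Exponential table: x = 0..255, y = round((2^(x/256) - 1) * 1024)
EXP_TABLE = [round((2 ** (x / 256) - 1) * 1024) for x in range(256)]

# Log-sin table: x = 0..255, y = round(-log2(sin((x + 0.5) * pi / 256 / 2)) * 256)
LOGSIN_TABLE = [
    round(-math.log(math.sin((x + 0.5) * math.pi / 256 / 2)) / math.log(2) * 256)
    for x in range(256)
]

# Assumed storage: two bytes per entry for both tables.
flash_bytes = 2 * (len(EXP_TABLE) + len(LOGSIN_TABLE))
print(max(EXP_TABLE), max(LOGSIN_TABLE), flash_bytes)
```

The exp entries fit in 10 bits and the log-sin entries in 12 bits, so both tables together come to about 1KB if stored as 16-bit words.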

edit2: and NukeyKT appears to have explicitly implemented a software FM synthesizer that uses the logsin/exp lookup tables for Chocolate Doom and a few other projects.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-28 (#205338)

infiniteneslives wrote:

it turns out the STM8's timing for reading data off the 6502 bus is tighter because we have to wait until data is valid. Outputting data on the 6502 bus has significantly more slack time since it can be output early without concern. So 6502 STores have tighter timing constraints than LoaDs from the CICOp perspective.

Tangenting back to this tiny bit...

The 6502 is actually driving the data bus during the entirety of the write cycles. So the "ll" and "hh" nybbles have to wait for the RAM's delayed output from taking M2 into account, but "ee" doesn't.

Re: Adding features to discrete mapper with multipurposed CI
by FrankenGraphics on 2017-09-29 (#205346)

Disclaimer: i don't know the first thing about DACs, so this whole post may be based on false assumptions about the output of the DAC. If the output is center-aligned; ignore the post. Anyway...

I wonder what happens when you mix center-aligned waves with 0-to-pos or 0-to-neg waves? I imagine it might be audible for two sine waves in ~unison or octaves, and in more cases with waves having more complex overtones. I also wonder what it means for the speaker - wouldn't one in this case force its peak to peak range to widen to 150% (with an asymmetric center position relative to the relaxed point), assuming both waves have the same amplitude? If that's true and the speaker can't handle it, we might get distortion, or even wear.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-29 (#205359)

Quote:

The 6502 is actually driving the data bus during the entirety of the write cycles. So the "ll" and "hh" nybbles have to wait for the RAM's delayed output from taking M2 into account, but "ee" doesn't.

That's interesting, because I expected that to be the case and initially allowed latching of data earlier when data was sourced from the 6502. At the time it didn't seem to be getting valid data unless I waited until late in the cycle. Entirely possible I had some other problem going on at the moment instead. I also never actually monitored the data bus with scope/logic analyzer, I was only watching the mapper bit, m2, CPU R/W, and my timing debug pin.

The other thing I didn't take into account is that the open bus after the cycle has ended, should still contain valid data until m2 goes high again. But based on your point, this is only true if the 6502 is not driving the data bus on the subsequent cycle. I need to spend some more time analyzing the bus and testing edge cases to ensure the read is best placed.

Quote:

I wonder what happens when you mix center-aligned waves with 0-to-pos or 0-to-neg waves? I imagine it might be audible for two sine waves in ~unison or octaves, and in more cases with waves having more complex overtones. I also wonder what it means for the speaker - wouldn't one in this case force its peak to peak range to widen to 150% (with an asymmetric center position relative to the relaxed point), assuming both waves have the same amplitude? If that's true and the speaker can't handle it, we might get distortion, or even wear.

I don't really understand what you're asking; it sounds like you're concerned about constructive and destructive interference? I don't see how that relates to center/edge aligned PWM DAC though.. With regard to your concern about the speaker, I don't understand how you're proposing one could exceed 100% output.

Re: Adding features to discrete mapper with multipurposed CI
by FrankenGraphics on 2017-09-29 (#205361)

It can't; I got seriously confused. :oops:

I somehow got the idea alignment was in the amplitude domain (ie if waves swing from +V to -V or from +V to 0v), not the frequency (like this picture illustrates).

It seems these two modes would interfere differently when in unison with the internal APU squares, but since their phase isn't synced to begin with, that difference might be lost anyway.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-09-29 (#205362)

I foresee some intermodulation distortion between two sines with the "fast" PWM.

I also foresee trouble if DMC cycle stealing happens while the CPU is accessing the CICO.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-29 (#205368)

infiniteneslives wrote:

Well, I should say that we know that the CPU drives the data bus on write cycles during both φ1 and φ2... but that doesn't say anything about how much time it takes for it to get to valid.

We know that Nintendo's original 2A03letterless exported a 3/4 duty cycle and the 2A03E/G/H use a 5/8 duty cycle; it's possible that they had issues where worst-case conditions took longer than 140ns to get to valid.

tepples wrote:

I foresee some intermodulation distortion between two sines with the "fast" PWM.

I agree, there should be something. Time to throw that in the sim again.

Sim conditions: F_bit=12.88MHz, F_pwm=30720Hz, F_sine1 = 3072Hz, F_sine2 = 3000Hz. Intermodulation products are expected to show up at 3144Hz and 2928Hz... but I don't see any. At all.

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-09-29 (#205370)

What does the sim show for, say, 2000 and 5000 or 2000 and 4900 Hz?

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-09-29 (#205371)

tepples wrote:

What does the sim show for, say, 2000 and 5000 or 2000 and 4900 Hz?

2kHz & 5kHz: FastPWM shows spurious emissions at 3720, 5720, and 6720Hz, but given those offsets of 720Hz, those aren't intermodulation between just the sine waves.

2kHz & 4.9kHz: FastPWM shows spurious emissions at 4206, 5100, 6200, and 7100 Hz. Not clear on the math generating these.

In both sets, the spurious emission peaks are 40dB to 60dB below the intended signal, over a noise floor that's 10-20dB quieter; we're still only talking about ~1LSB of noise.

(Like I previously said, it's not that centered PWM isn't better: it clearly is. It's just not obviously consistently enough better to warrant the loss of bit depth.)

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-09-29 (#205372)

Thanks for running them.

Now on to how DMC cycle stealing interacts with a scheme to blindly put stuff on data bus for reading at 4, 8, and 12 cycles after the final write.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-09-29 (#205373)

tepples wrote:

I foresee some intermodulation distortion between two sines with the "fast" PWM.

For the CICOp and its set limitations I'll probably be limited to simpler waveforms (square, triangle, saw) anyway. If it does get a sine channel, it likely won't be more than one. I am curious to see how well a PWM DAC (perhaps even dual-PWM) can perform on a more capable mcu in comparison to a hardware DAC. In the end, the added cost of an mcu with built in DAC isn't too much. So using a PWM DAC is best placed in the most cost sensitive/limited projects.

Thanks for the PWM analysis guys! It is nice to hear that an extra bit of edge aligned resolution stands the chance to make up for the loss of center aligned PWM!

Quote:

I also foresee trouble if DMC cycle stealing happens while the CPU is accessing the CICO.

Ahh yeah. I completely forgot about that DMC guy..

Quote:

Now on to how DMC cycle stealing interacts with a scheme to blindly put stuff on data bus for reading at 4, 8, and 12 cycles after the final write.

I think we can cover this case, so here goes:

So assuming this to be accurate enough for our situation:

Quote:

Likely internal implementation of the read

The following is speculation, and thus not necessarily 100% accurate. It does accurately predict observed behavior.

The 6502 cannot be pulled off of the bus normally. The 2A03 DMC gets around this by pulling RDY low internally. This causes the CPU to pause during the next read cycle, until RDY goes high again. The DMC unit holds RDY low for 4 cycles. The first three cycles it idles, as the CPU could have just started an interrupt cycle, and thus be writing for 3 consecutive cycles (and thus ignoring RDY). On the fourth cycle, the DMC unit drives the next sample address onto the address lines, and reads that byte from memory. It then drives RDY high again, and the CPU picks up where it left off.

This matters because on NTSC NES and Famicom, it can interfere with the expected operation of any register where reads have a side effect: the controller registers ($4016 and $4017), reads of the PPU status register ($2002), and reads of VRAM/VROM data ($2007) if they happen to occur in the same cycle that the DMC unit pulls RDY low.

For the controller registers, this can cause an extra rising clock edge to occur, and thus shift an extra bit out. For the others, the PPU will see multiple reads, which will cause extra increments of the address latches, or clear the vblank flag.

This problem has been fixed on the 2A07 and PAL NES is exempt of this bug.

For reference, here's my CICOp routine:

Code:

   sty   CICOP_ADDR_EN   ; 8c [bank num] [bank table low] [bank table high]
   sta   (cicop_reg), y  ; 91 [ZP byte num] [x:regL] [x:regH] xx [x:wrL]
   stx   cicop_reg       ; 86 [ZP byte num] [x:wrH]
   ldx   CICOP_PORT      ; ae [addr L] [addr H] [x:rdL]
   ldy   CICOP_PORT      ; ac [addr L] [addr H] [x:rdH]
   lda   CICOP_PORT      ; ad [addr L] [addr H] [x:rdE]

The 6502 gets stalled on the next read cycle following the DMC pulling RDY low. And once the DMC is done with its fetch/stealing the 6502 re-performs the read that was stalled. What I'm uncertain of is which of the reads actually gets caught by the 6502. I would guess the initial stalled read is executed, but the 6502 doesn't catch it. It's the second post-stall read that's actually caught (otherwise it probably wouldn't be done)..?

[edit sorry I've got some opcode name and cycle number errors here.. Think they're fixed now. Realizing there are some other cases I'm not detecting and differentiating between, but I think there's room to cover them]

If the stall were to occur on any of the read cycles during STA (T0-T4) or STX (T0-T1), the CICOp could sense the stall as CPU R/W wouldn't be low during the expected write cycle (STA T5 and STX T2) due to the DMC stall & fetch. It would also be known that the CPU stalled for 4 cycles, so the CICOp could potentially insert that delay into its routine.

If the stall were to start on the final write cycle of STA, the write would occur normally. But the 3 cycle stall could be sensed as CPU R/W wouldn't be low at the expected time for STX. It would be known that the CPU was stalled for 3 cycles, so similarly the CICOp could delay its routine by 3 cycles.

If the stall were to start on the final write cycle of STX, the write would occur normally. But the opcode and address fetch of LDX wouldn't be present on the bus. The CICOp could sense this case by checking for the LDX opcode being present on the bus @ T0, provided the written data from STX didn't also equal '0xE'. T1 would continue as open bus, and the DMC would hijack the bus on T2. So the CICOp could stop itself/delay outputting data on the bus for LDX T3. This case also stalls the CPU by 3 cycles; the CICOp could try to delay itself, but needs to differentiate this case from the ones below that have a 4 cycle stall.

The final 3 LDX/LDY/LDA all have similar behavior. The case that really needs to be covered is if the stall were to start during T0 when the opcode was being fetched. In this case the DMC will hijack the bus on the last cycle (T3) when the CICOp is also planning on driving the bus. This condition could be detected by verifying low address of CICOP_PORT being on the bus for T2. The CICOp could recover by delaying 4 cycles.

If the stall were to start during T1/T2 of LDX/LDY/LDA we're mostly safe to output data as the DMC will hijack the bus after the CICOp drives data to a stalled CPU. A stall during T2 could be sensed by CICOP_PORT high addr (T3) not being present. To support that, A0-3 of CICOP_PORT needs to differ from A8-11 (and also not equal 0xE, 0xC, 0xD to differ from LDX/LDY/LDA opcode) which is simple enough with a chosen address of something like $5A05.

If the stall were to occur on T3, the final LDX/Y/A cycle, there isn't a great way to sense that until T0 of the subsequent LDY/LDA. In that case the opcode/operand wouldn't be present for the next load. This can be easily caught for the first two loads (LDX/LDY), and the CICOp routine delayed accordingly. But not for the final one, assuming I'm correct that the initial stalled read isn't caught, and the subsequent second read is what's caught. This isn't necessarily an issue as this final read is intended to be the verification read. So reading back open bus would simply be a false failure. However, we could define the 6502 instruction that follows the current routine to allow for detection of this condition so the final nibble can be delayed and resent.
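
To make the STA/STX cases a bit more concrete, here's a toy model of the routine's R/W timeline and how a DMC stall shifts the write cycles the CICOp is watching for. This is a simplified sketch, not a cycle-accurate 2A03 model: it assumes RDY is held low for a 4 cycle window, that write cycles ignore RDY, and that reads inside the window simply don't advance. The leading STY is left out, and all names are made up.

```python
# (mnemonic, total cycles, index of the write cycle or None)
ROUTINE = [
    ("STA (zp),y", 6, 5),
    ("STX zp", 3, 2),
    ("LDX abs", 4, None),
    ("LDY abs", 4, None),
    ("LDA abs", 4, None),
]

def rw_timeline(routine=ROUTINE):
    """One entry per CPU cycle: 'W' for a write cycle, 'R' for a read."""
    line = []
    for _name, cycles, write_at in routine:
        line += ["W" if t == write_at else "R" for t in range(cycles)]
    return line

def with_dmc_stall(line, start):
    """Insert '-' for each cycle the CPU is held by a stall starting at `start`."""
    out, i, cycle = [], 0, 0
    while i < len(line):
        rdy_low = start <= cycle < start + 4
        if rdy_low and line[i] == "R":
            out.append("-")       # CPU held; this read is retried later
        else:
            out.append(line[i])   # writes go through even with RDY low
            i += 1
        cycle += 1
    return out

def write_cycles(line):
    """Cycle numbers where CPU R/W would be seen low."""
    return [n for n, c in enumerate(line) if c == "W"]

base = rw_timeline()
print(write_cycles(base))                     # [5, 8]
print(write_cycles(with_dmc_stall(base, 0)))  # [9, 12]  stall on a read: +4
print(write_cycles(with_dmc_stall(base, 5)))  # [5, 11]  stall on STA's write: +3
```

In this model the 4 vs 3 cycle difference falls out of whether the window opens on a read or on the write itself, which is the distinction the CICOp would need to make.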

Phew.. Well in the end doing all this doesn't seem too unreasonable given the number of nops currently in my CICOp isr. It certainly gets complicated quick though..

Now I'm curious how hard it would be to detect and protect if the 6502 were interrupted mid routine. It certainly wouldn't be as easy to transparently recover as there's no telling how long the 6502 interrupt will last. The above proposals will probably cover an interrupt prior to completion of STA/STX with CPU R/W prior to latching data. But I'll have to check into this more, hopefully there's a way to detect this case enough to prevent from outputting data on the bus when it shouldn't be thus preventing a crash. If that's possible, then one could set a flag to denote that a CICOp register "transfer is in progress" prior to this routine. Then the NMI/IRQ routine could update that flag to "current transfer failed". Once the 6502 returns to the CICOp routine that got interrupted, it would make a check at the end and reperform the transfer if "current transfer failed" was true.
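
The flag handshake could look roughly like the following. It's written as a Python simulation purely to show the control flow; on the NES this would be a zero page byte plus a few instructions in the NMI/IRQ handlers. The flag values, function names, and the retry limit are all hypothetical.

```python
IDLE, IN_PROGRESS, FAILED = range(3)
state = {"flag": IDLE}

def interrupt_handler():
    """What the NMI/IRQ routine would do on entry."""
    if state["flag"] == IN_PROGRESS:
        state["flag"] = FAILED   # tell the interrupted routine to redo it

def cicop_write(do_transfer, max_tries=4):
    """Run the transfer, repeating it if an interrupt landed mid-routine.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        state["flag"] = IN_PROGRESS
        do_transfer()
        if state["flag"] == IN_PROGRESS:   # nobody marked it as failed
            state["flag"] = IDLE
            return attempt
    state["flag"] = IDLE
    return None

# Example: the first attempt gets interrupted, the second goes through.
attempts = []
def interrupted_once():
    attempts.append(1)
    if len(attempts) == 1:
        interrupt_handler()

print(cicop_write(interrupted_once))  # 2
```

This only works if the CICOp itself refuses to drive the bus once it notices the interrupt, otherwise the retry never gets a chance to run.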

Re: Adding features to discrete mapper with multipurposed CI
by FrankenGraphics on 2017-09-29 (#205376)

infiniteneslives wrote:

For the CICOp and its set limitations I'll probably be limited to simpler waveforms (square, triangle, saw) anyway. If it does get a sine channel, it likely won't be more than one.

I don't think i'd sweat it to get a sine form in - unless it was routed to modulate another waveform (hardware automated modulation, yay!) and had a range dipping into the subsonic. On its own, it adds little: Maybe some extra punch, low end for bass notes, organ overtone or air as a double/follow to another richer channel, or use it for drum synthesis where it'd fill a role on its own. But the question then becomes how one would "what you hear is what you get" it when composing. If always used in unison/octaves with another track or strictly used for kicks, toms or snare support, maybe one could live without hearing it between assembly/compile tests more easily.

Sines might also have a worse 'wanted sound' vs. digital artifacts ratio if the resolution is low, which limits its use somewhat further.

Basically, it has its distinct and possibility expanding uses, but they're somewhat limited if the choice stands between one more tri, saw* or pulse channel (potentially with finer pwm steps than the internal APU) on one hand, and a sine channel on the other.

*Saws are also less flexible than pulses when they're without a filter so in that regard they're a bit limited in use just like the sine, but at least the saw function is simple.

===

EDIT: More on saws... here's one idea on how to make them more versatile.

Assume each channel is a pulsewidth variable square channel. They all have a control bit. when 0, everything is normal. When 1, the wave will be normal when low, but output the remainder of a saw function when high.

With a wide enough pulse and the control bit set to 1, that's a saw or something very close by. Then by changing the pulse width, you could control the morph between the two: pulse and saw.
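
One possible reading of that idea, sketched out. The interpretation here is an assumption: a falling saw runs underneath for the whole period, and with the control bit set the "high" part of the pulse shows the saw's current value instead of a flat peak. The 16-step period and 4-bit peak are arbitrary picks, and `morph_wave` is a made-up name.

```python
def morph_wave(steps, high_len, saw_mode, peak=15):
    """One period of a pulse whose high portion is optionally a saw segment.

    steps    -- samples per period
    high_len -- how many of those samples belong to the 'high' part
    saw_mode -- the control bit: False = plain pulse, True = saw while high
    """
    out = []
    for i in range(steps):
        if i >= high_len:
            out.append(0)                                  # low part is unchanged
        elif saw_mode:
            out.append(peak - (i * (peak + 1)) // steps)   # falling saw value
        else:
            out.append(peak)                               # ordinary pulse
    return out

print(morph_wave(16, 8, False))   # plain 50% square
print(morph_wave(16, 8, True))    # saw for the first half, then low
print(morph_wave(16, 16, True))   # full-width pulse: a complete saw
```

Sweeping `high_len` with the control bit set then moves between a narrow spike and a full saw using the same duty register the pulse already has.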

Re: Adding features to discrete mapper with multipurposed CI
by tepples on 2017-09-29 (#205379)

Or you could express saw, triangle, and pulse in terms of rise time, high time, fall time, and low time.

Saw: rise=0%, high=0%, fall=100%, low=0%
Triangle: rise=50%, high=0%, fall=50%, low=0%
Square: rise=0%, high=50%, fall=0%, low=50%
1/8 duty pulse: rise=0%, high=12%, fall=0%, low=88%
Approximate sine: rise=33%, high=17%, fall=33%, low=17%
Filter: Interpolate a waveform toward approximate sine
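
A rough sketch of how those four percentages could drive a generator. The name `segment_wave`, the 16-step period, and the 4-bit output range are assumptions for illustration; the float math is only there for clarity and would want to be replaced with fixed-point increments on a small mcu.

```python
def segment_wave(steps, rise, high, fall, low, peak=15):
    """Build one period from rise/high/fall/low percentages (must sum to 100)."""
    assert rise + high + fall + low == 100
    out = []
    for i in range(steps):
        p = 100 * i / steps                  # position in the period, in percent
        if p < rise:
            level = peak * p / rise          # ramping up
        elif p < rise + high:
            level = peak                     # holding high
        elif p < rise + high + fall:
            level = peak * (1 - (p - rise - high) / fall)  # ramping down
        else:
            level = 0                        # holding low
        out.append(round(level))
    return out

print(segment_wave(16, 0, 50, 0, 50))    # square
print(segment_wave(16, 0, 12, 0, 88))    # 1/8 duty pulse
print(segment_wave(16, 50, 0, 50, 0))    # triangle
print(segment_wave(16, 33, 17, 33, 17))  # approximate sine
```

The per-sample branches and divides are the part that could get expensive if this had to run inside every PWM update.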

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-10-01 (#205449)

lidnariq wrote:

2kHz & 5kHz: [...] 2kHz & 4.9kHz: FastPWM

I realized that I had set up my test wrong. I wasn't windowing my FFTs, and 30720 Hz for the PWM modulator wasn't an integer factor of the FFT size.

When I redid the tests with other choices of PWM frequency that were integer multiples of 192000Hz÷131072, then no spurious emissions showed up that were obvious results of intermodulation, at all.
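
For reference, checking whether a frequency lands exactly on an FFT bin is a one-liner. This sketch takes the 192000Hz sample rate and 131072-point FFT from the figures above; the function names and the 30000Hz example are just illustrative picks.

```python
from fractions import Fraction

def bin_spacing(sample_rate=192000, fft_size=131072):
    """Frequency step between adjacent FFT bins, as an exact fraction."""
    return Fraction(sample_rate, fft_size)

def is_bin_aligned(freq_hz, sample_rate=192000, fft_size=131072):
    """True if freq_hz is an integer multiple of the bin spacing."""
    bins = Fraction(freq_hz) / bin_spacing(sample_rate, fft_size)
    return bins.denominator == 1

print(float(bin_spacing()))    # 1.46484375 Hz per bin
print(is_bin_aligned(30720))   # False: falls between bins, so energy leaks
print(is_bin_aligned(30000))   # True: lands exactly on bin 20480
```

Using exact fractions avoids floating point fuzz deciding whether something is "close enough" to an integer bin.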

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-10-01 (#205470)

Quote:

Now I'm curious how hard it would be to detect and protect if the 6502 were interrupted mid routine.

A quick glance at this, and detecting that the 6502 hit an IRQ/NMI mid transfer routine looks rather simple. The easiest detection would be on T2 (cycle 3) of any instruction: CPU R/W will be low for the first push while the 6502 is processing its interrupt. STX ZP is the only instruction I currently have which also has CPU R/W low on T2. In the end absolute addressing would be fine and delay CPU R/W of that instruction to T3; I only chose ZP addressing because it saved a cycle. The CICOp may need a little more time if it's making all of these checks anyway..

I'm not sure what if anything is on the data bus during the first two interrupt cycles while internal operations are being performed. I presume the data bus is open and retaining data of the last executed instruction, detecting lack of fetched STX opcode and ZP operand would allow STX to keep ZP addressing.

While the CICOp could protect against the 6502 getting interrupted mid transfer routine, it couldn't recover/delay like the DMC case. The chances of a DMC collision should be relatively low. So to keep the CICOp transfer routine simpler I'll probably opt to require any interrupted transfers to be retried by use of flags in NES code as I described in my last post. Any features/protections added in the transfer routine will bloat the isr code on the CICOp which may need to be doubled to support PAL. Still have to implement all that, but once done I'd expect the CICOp transfer routine to be bullet proof!

Thanks for the input on what may be desirable audio channel features guys. I can't speak too intelligibly on this front as I've yet to write any supporting code. But with the synth engine running each PWM update cycle (expecting every 1024 STM8 cycles), it'll be necessary to keep execution time low. I'm thinking that will favor more of the modal type settings as FrankenGraphics proposed, compared to what I expect to be calculation heavy settings of Tepples' proposal. But we'll see!

Thinking I'll call the CICOp register transfer routine good enough for now, and circle back to it later to support DMC/NMI/IRQ collisions. So next up I'm planning some basic synth & PWM tests with a single square wave. I'm concerned about what the exact output/mixing circuitry will require. If the output of the PWM DAC is too loud compared to the NES APU I may need an extra voltage divider resistor I'm thinking..? I'm also unsure about a series output DC decoupling cap. Or perhaps it'll be too quiet and require an opamp output buffer if the PWM DAC can't source enough current. Also unsure how much I can change the PWM DAC component values to help the situation. My experience level is low with these analog areas so I never trust myself until fully prototyped..

Aside from that I would also like to try out some basic scanline & CPU cycle counting before ordering the first batch of boards.

Re: Adding features to discrete mapper with multipurposed CI
by Memblers on 2017-10-01 (#205471)

I was gonna say about the audio stuff, is there any reason to not have the waveforms in RAM? It doesn't need a lot, you can get pretty good capability even with just 32 bytes. You can still have predefined waveforms in ROM if you wanted, then just copy it into RAM. From there, the NES program could modify them or upload entirely new ones. That's what my synth does.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-10-01 (#205472)

Yeah putting a wave table in RAM would certainly be an option. Something like that shouldn't require much for calculations at run time. Modeling after the FDS wave table may also help out with composition tools.

The only reason against it might be the not so speedy interface of the CICOp registers, but the wave tables wouldn't have to be frequently updated to be useful.

Re: Adding features to discrete mapper with multipurposed CI
by FrankenGraphics on 2017-10-02 (#205479)

If it can handle writes at 60hz, it would be transparent compared to fds.

A bit crazy software implementation idea:
Using the irq and any board with an updatable wavetable synth, one *could* update the waveform being used twice per frame, leading to very smooth (comparatively speaking) wave transitions. That's a bit new sonic territory for the NES. PWM, saw-to-sinusoid "filters" and various other morphs would be more useful than otherwise. But you'd either have to sacrifice the irq for that or make sure it happens anyway at a point relatively opposite of the general music update during a frame's full cycle. If it varies a bit up and down the scanlines, it's not really a problem. It'd be pretty useful emulating attack portions of various instruments, above all, but also for "synthy" synth sounds.

The extra time-domain granularity has to be inserted around export-time, mostly out of convenience.
One way is to keep doubles of the instruments and their macro strings, and change the most significant digit in the instrument row to the "hi-res" versions before exporting; discarding unused instruments in the process.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-10-02 (#205485)

FrankenGraphics wrote:

If it can handle writes at 60hz, it would be transparent compared to fds.

Yeah it can easily handle multiple writes per frame. Bigger question is how much time the NES wants to spend making those writes. Current CICOp register transfer routine is 21cycles for 1Byte written, and read back verified, plus some amount of overhead for preparing for the write. I will likely have to limit the number of CICOp register writes to reserve enough processing resources for CIC operations, but several writes per frame should be fine. Once I have async CIC operations implemented I'll have a better idea of what the practical limitations will be.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-10-03 (#205536)

Well, it's happened again: I got around to writing some code and taking a closer look at the STM8 register settings, and I found a little gem.

TIM2 is the mid grade counter on chip I've dedicated to the PWM DAC. It's only capable of up counting, and thus has no center aligned PWM mode of operation. HOWEVER there is an output mode that toggles the output pin on timer compare match. Taking advantage of this, one can effectively create a center aligned PWM if you're willing to dedicate the CPU resources to update the compare value on each timer overflow. I'm not yet sure I've got the resources to pull this off in my case, we'll see.

To make this work, all that needs to be done is invert the DAC value every other output. We can also take advantage of Tepples' tip to keep from losing a bit of resolution; it's the best of both worlds with this trick if there are CPU resources to spare.

Here's how I'm thinking this would work:
-Assuming a 16MHz counter clock, 31kHz PWM frequency, and a counter top value of 255 (0xFF), this would normally give an 8-bit DAC, but we can gain a bit back for 9 bits if the down count doesn't have to equal the up count.

Start with PWM output clear. Say the desired output value for this cycle is 100.5

1)First update cycle is "odd" operation and thus going to simulate down counting. By convention let's round down on odd/down cycles, so this half's output is 100. Being an odd cycle we subtract 100 from the top value 255. We place this difference of 155 in the compare register. When reached the output PWM pin toggles high.

2)Next output cycle occurs, this one is "even" normal up counting. Round up this time from 100.5 to 101. The 101 value is simply placed in the compare register. When reached the output PWM pin toggles low.

3) Calculate next DAC output value and go to step 1.
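The steps above can be sketched as a small helper that splits a 9-bit value into the two compare register values. This is only a model of the arithmetic in the post, not STM8 code; `compare_values` and `dac9` are hypothetical names, and the desired level of 100.5 is represented as the 9-bit integer 201.

```python
TOP = 255  # counter top value from the post (0xFF)

def compare_values(dac9):
    """Split a 9-bit DAC value into (odd/down, even/up) compare values.

    Odd cycles simulate down counting: round down, then subtract from TOP.
    Even cycles count up normally: take the remainder (rounded-up half).
    """
    if not 0 <= dac9 <= 2 * TOP:
        raise ValueError("dac9 out of range")
    down = dac9 // 2      # odd half rounds down, e.g. 100
    up = dac9 - down      # even half rounds up, e.g. 101
    return TOP - down, up

print(compare_values(201))  # (155, 101)
```

The first value goes in the compare register for the odd cycle (pin toggles high on match), the second for the even cycle (pin toggles low on match).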

While this idea works, the added interrupt for step 2 will be costly for the STM8. Interrupts cost a whopping 20 cycles (9 pushes, 2 jump isr, 9 pops). This scenario requires two interrupts per 512 clock update cycle. That's 4% of CPU resources for the added interrupt entry/exit, and it would probably need another 5-10 cycles to support the added complexity running each ISR, so that's another ~4%, for a total CPU resource cost of 8%. That doesn't seem worthwhile at this point when we've got better things to do with that CPU time. So while this whole trick is viable, I've now sufficiently talked myself out of it...
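The percentages above can be checked with a quick calculation. This sketch reads "another 5-10 cycles ... each isr" as 5-10 extra cycles in both ISRs of the 512-clock update, which is an assumption; `isr_overhead` is a hypothetical name.

```python
def isr_overhead(entry_exit=20, extra_work=(5, 10),
                 isrs_per_update=2, update_cycles=512):
    """Return (added entry/exit, extra work low, extra work high) as fractions.

    entry_exit: cycles for one added interrupt entry/exit (9 + 2 + 9).
    extra_work: assumed range of added cycles inside each ISR.
    """
    added_entry = entry_exit / update_cycles
    low = isrs_per_update * extra_work[0] / update_cycles
    high = isrs_per_update * extra_work[1] / update_cycles
    return added_entry, low, high

entry, low, high = isr_overhead()
print(f"added entry/exit: {entry * 100:.1f}%")
print(f"extra ISR work:   {low * 100:.1f}% to {high * 100:.1f}%")
print(f"total:            {(entry + low) * 100:.1f}% to {(entry + high) * 100:.1f}%")
```

Under that reading the 8% total corresponds to the upper end of the 5-10 cycle range.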

I sure wish all those excessive register pushes for STM8 ISRs weren't there. It's much more convenient how the 6502 only pushes PC & SR. Leave it to the ISR to decide what registers are worth preserving! The worst part is that the built-in pushes aren't any faster than manual pushing; all they're saving is code bytes.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-10-04 (#205549)

One little caveat about the "just use a count-up / count-down PWM and alternate direction every reload" — it effectively halves your sample rate (and adds one to the bit depth), producing modulation noise at IRQ frequency/2.

15.5kHz is still high enough that it's likely not an issue, but it's worth keeping in mind.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-10-04 (#205550)

lidnariq wrote:

Yeah, my initial thought was to run the synth engine on each count "direction" to keep from changing the sample rate. But that doubles the synth computation load in order to keep the same PWM frequency. With limited timer clock speed and limited CPU resources, there's no way to get around some sort of trade-off when going from fast/edge to center/phase correct. And with your conclusion that giving up 1 bit of resolution isn't worth gaining center alignment, I question whether center aligned is ever worthwhile in a system operating near its constraints.

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2017-10-09 (#205738)

Quick little update: I've been testing out the PWM DAC with a simple middle C square wave. The current PWM DAC is the standard 3.9k 4.7nF (8.7kHz cutoff) low pass filter fed directly to EXP6. Fed through the 'standard' 47k EXP audio resistor, I was able to achieve a volume comparable to one of the 2a03's squares at full volume. Things went according to plan here: with a DAC setting of ~12%, the CICOp square was audibly similar to the 2a03 square in volume and timbre. I haven't come up with a means to objectively compare the two, but using my ears alone the CICOp may have sounded a little "warmer", though honestly I could be mistaking that based on a slight difference in volume. In any event I'm happy with the preliminary performance of this simple setup!
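The 8.7kHz figure follows from the usual first-order RC cutoff formula, f = 1 / (2πRC). A minimal check, with `rc_cutoff_hz` as a hypothetical helper name:

```python
import math

def rc_cutoff_hz(r_ohms, c_farads):
    """Return the -3 dB cutoff of a first-order RC low pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

fc = rc_cutoff_hz(3.9e3, 4.7e-9)  # 3.9k resistor, 4.7nF capacitor
print(f"cutoff is about {fc / 1000:.1f} kHz")
```

This treats the filter as unloaded; the 47k resistor to the audio mix is ignored here, which is a simplification.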

I took a handful of different scope measurements, still blows me away that the PWM DAC sounds as good as it does despite how it looks! [EDIT: all measurements taken with 10x probe]

Realizing a limitation of running at PWM frequency of 31Khz is the min period resolution of 32usec which I assume will prove troublesome for keeping higher pitched notes in tune. I may be able to pull off 62Khz, but I'm wondering if this could be made up for by counting fractional steps and then rounding each period, where the average of something like 4 cycles would be in tune, effectively providing 8usec period resolution. I have a feeling a hack like that has some (audible) drawback but I don't really know.

Re: Adding features to discrete mapper with multipurposed CI
by lidnariq on 2017-10-09 (#205739)

infiniteneslives wrote:

Realizing a limitation of running at PWM frequency of 31Khz is the min period resolution of 32usec which I assume will prove troublesome for keeping higher pitched notes in tune.

It isn't necessary to have pitches be integer divisors of your sample rate.

It's true that square waves (or anything else with more higher frequency content) will start having audible aliasing artifacts if you just use nearest-neighbor=sample-and-hold resampling, but that can be fixed or worked around in a variety of ways.

Quote:

I'm wondering if this could be made up for by counting fractional steps and then rounding each period. Where the average of something like 4 cycles would be in tune effectively providing 8usec period resolution. Have a feeling a hack like that has some (audible) drawback but I don't really know.

That's actually literally how the Namco 163 works. The waveform position there is 8.16 fixed point (and the pitch is 2.16 fixed point). The SNES does something similar (pitch is 2.12 fixed point), but it adds an interpolator ("Gaussian") to reduce aliasing noise (and everything else high frequency, oops)
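A minimal sketch of the fractional-step idea: a fixed-point phase accumulator that advances once per output sample and wraps once per waveform period. It does not reproduce the N163 or SNES register layouts; the 16 fractional bits, the 31250 Hz sample rate (16 MHz / 512) and the function names are all assumptions for illustration.

```python
FRAC_BITS = 16  # assumed accumulator width

def phase_increment(freq_hz, sample_rate_hz):
    """Return the per-sample phase step for the requested pitch."""
    return round(freq_hz * (1 << FRAC_BITS) / sample_rate_hz)

def wrap_intervals(increment, num_samples):
    """Return the sample counts between successive accumulator wraps.

    Each wrap marks the start of a new waveform period, so these are the
    individual period lengths measured in output samples.
    """
    mask = (1 << FRAC_BITS) - 1
    phase = 0
    last_wrap = None
    intervals = []
    for n in range(num_samples):
        phase += increment
        if phase > mask:          # accumulator overflowed: new period
            phase &= mask
            if last_wrap is not None:
                intervals.append(n - last_wrap)
            last_wrap = n
    return intervals

inc = phase_increment(2000, 31250)
periods = wrap_intervals(inc, 31250)
print(sorted(set(periods)), sum(periods) / len(periods))
```

A 2 kHz tone at this sample rate wants 15.625 samples per period. Individual periods come out as 15 or 16 samples, and their average lands very close to the ideal, which is the "rounding each period" behaviour described in the quote. The jitter between the two lengths is where the aliasing mentioned above comes from.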

Re: Adding features to discrete mapper with multipurposed CI
by rainwarrior on 2017-10-12 (#205833)

Yeah, with accumulator based tones (e.g. N163, FDS, SID) it's not a divider of the clock frequency, and your low pitch precision ends up at low frequencies instead of high.

It's the inverse of the clock divider approach (e.g. 2A03, VRC6, AY).

I think the main difference is that the divider gets a lot more range with less bits in the register, and only has to increment instead of doing a full add. (Also at 31 kHz you'll have audible aliasing, but probably acceptable. No worse than N163 in 4 channel mode.)
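To make the precision comparison concrete, this sketch measures how far the pitch moves, in cents, for one register step near a low note and a high note. The divider uses the 2A03 pulse formula f = CPU / (16 * (t + 1)); the accumulator parameters (31250 Hz sample rate, 16-bit phase) and the function names are assumptions.

```python
import math

CPU_HZ = 1789773   # NTSC 2A03 CPU clock
FS_HZ = 31250      # assumed accumulator update rate
ACC_BITS = 16      # assumed accumulator width

def cents(f_a, f_b):
    """Return the interval from f_a to f_b in cents."""
    return 1200.0 * math.log2(f_b / f_a)

def divider_step_cents(target_hz):
    """Pitch change from one period step of a 2A03-style pulse divider."""
    period = round(CPU_HZ / (16 * target_hz)) - 1
    f0 = CPU_HZ / (16 * (period + 1))
    f1 = CPU_HZ / (16 * (period + 2))
    return abs(cents(f0, f1))

def accumulator_step_cents(target_hz):
    """Pitch change from one increment step of a phase accumulator."""
    inc = round(target_hz * (1 << ACC_BITS) / FS_HZ)
    f0 = inc * FS_HZ / (1 << ACC_BITS)
    f1 = (inc + 1) * FS_HZ / (1 << ACC_BITS)
    return abs(cents(f0, f1))

for hz in (55, 3520):
    print(f"{hz:5d} Hz: divider {divider_step_cents(hz):6.2f} cents/step, "
          f"accumulator {accumulator_step_cents(hz):6.2f} cents/step")
```

With these numbers the divider is fine-grained at 55 Hz and coarse at 3520 Hz, while the accumulator is the reverse, matching the "inverse" relationship described above.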

Re: Adding features to discrete mapper with multipurposed CI
by infiniteneslives on 2018-03-17 (#215434)

Wanted to take a min and post a bit of an update on where I'm at with this project so far.

Firstly in STM8 news, ST bested themselves with the addition of a soic-8 package version with the STM8S001 (datasheet). It's the same silicon as the STM8S003, but only 5 GPIO. The pinout is rather interesting, it completely cuts out the /RESET pin, and of the 5GPIO many of the pins have multiple GPIO bonded to the same package pin. So there's a decent number of peripherals still available despite the low pin count, but you have to be cautious to not enable more than one driver for a given package pin.

Anyway, the soic-8 package gives up most of its pins and now beats out the stm8s003 in price, but it's a pretty minor difference. The soic-8 package allowed for easier migration from the attiny13 for some of my designs, as the footprint didn't have to change. I included it in my latest discrete mapper board design and have started using it in production. None of that really has much to do with this project though. At most, with the SOIC-8 package only ~2 i/o pins are left over for added features, which isn't enough to do anything on the level of what the CICOprocessor is targeting. I'm using those 2 spare pins for mirroring control with a MUX, and for PRG-ROM /WE control to separate mapper writes from PRG-ROM writes without an EXP0 pin, which will be extra helpful for a 60pin famicom version I'm hoping to wrap up soon.

I successfully converted my NES CIC design over to the TIM1 counter method of keeping track of CIC cycles as I had already done with SNES. I also successfully combined the CIC RESET pin with KEY DOUT. It turned out that it works better to combine RESET with the KEY's data output instead of the LOCK's output. The reason is that wire ORing the pins results in a lower Voh from the original CIC in the console. In my experimentation, this reduction is more significant with the SNES CIC than the NES for whatever reason. The reduction in Voh can be enough to get too close to the Vih of the cartridge CIC for comfort. The mcu's CMOS driver has no problem driving close to the full 5v on two wire ORed pins of the console though. The only important inputs for the cartridge CIC are the RESET pulse and the stream ID on LOCK DOUT. So those come in on separate mcu pins, but then the mcu drives its Dout on the combined RESET/KEY_DOUT pin.

So that's all good news for the CICOp's plan to only have 2 GPIO used for CIC RESET, KEY_IN, & KEY_OUT. Next step is to migrate CIC timing from using TIM1 clocked externally from CIC_CLK, to TIM4 being clocked asynchronously by the 16Mhz HSI. Getting my drift calibration factor working will be the biggest challenge with this. But I think I can pull it off leaving TIM1 available for scanline/CPU cycle counting.

The fact that the stm8 core is always running off the internal HSI greatly simplifies communicating with it when it's plugged into something like my programmer. I've been able to implement some less impressive features I had planned. Basically the cartridge's circuit board is now serialized via the CIC with some reasonable security. I have the STM8's read out protection enabled so an external programmer can't access flash, eeprom, or CPU registers. But it does still have access to RAM, which my programmer is able to read out via the SWIM interface. Currently when the STM8 boots up I have it copy any data that may be of interest to the bottom of the stack. So it doesn't really have dedicated RAM allocated to it, and it'll be visible so long as the stack hasn't been heavily used since boot. So I've got a string that's able to be read out by the programmer which includes things like the CIC build name and version, the PCB's version, and special text where I've placed things like "LizardV1" or "LizardV2" based on the homebrew game and its build version. For a little extra flair, I've been able to include a 96bit guaranteed unique ID number. I can't think of a great use for it, but if someone wanted to keep a registry of boards utilized for a limited edition to allow for later verification they could. I'm able to easily come up with a unique ID even though the STM8S001/3 doesn't have "96bit unique chip ID" advertised as a feature like the STM8S103. In the end the stm8s103 is the same silicon as the stm8s001/3, so the unique chip ID (fablot, wafer number & X/Y location) info is all there; may as well do something potentially useful with it!

Beyond that I do copy over a "copyright Infinite NES Lives LLC" string into RAM as well.

Coming up with detection algorithms for all the different boards I've made has always been a daunting task. So that feature is handy for the programmer to have a simple means of determining board & mapper info etc. The copyright message also provides a sizable string to read out and verify all is well with SWIM communications.

One other gotcha that I somehow didn't pick up until recently is that the "True open drain" GPIO pins lack a P driver completely. I had realized they didn't have pull-up resistors, but didn't realize they had no ability to drive the pin high until recently. So that limits what can be done with those pins a little, but shouldn't be too big of an issue now that I'm aware of it.

Moving forward, lidnariq has finally got me interested in the greenpak. Now that there's a tssop-20 package available which can also be reconfigured, I'm starting to think the SLG46824 could make for a nice pairing with the CICOprocessor. I didn't have much planned for the I2C pins on the CICOp, so giving it the ability to configure and possibly communicate with the mapper is an interesting thought. I still haven't fully wrapped my head around the greenpak's abilities, nor am I certain of its costs. It looks like the price may be comparable to the amount of logic needed for UNROM512, which means it may be within reach of this project's minimal cost goals. I messaged Dialog for a quote and will probably pick up a devkit to tinker around with soon.