First, I'm a reader and admirer of the people of this forum. While I'm not a NES developer myself (I do some DS) I'm tickled by the intricacies of old console hardware. Reverse engineering, and pushing the limits of these old consoles is fascinating to my nerdy brain.
Tokumaru's implementation and enhancement of the Codemasters tile compression scheme especially interested me, and when CartCollector published the "NESdev Corpus", I just had to try to compete (all in good spirit).
My compressor gets these results (when splitting the input file into 8KB blocks, compressed independently):
Code:
                 ACK (8KB)  Tokumaru     ZIP  Original
battlekid.chr        30087     29881   27306     47488
bombsweeper.chr       2147      2416    2317      8192
cmc80s.chr            8530      8730    7025     16384
dpadhero.chr          8484      8381    7977     24576
geminim.chr           1571      1605    1621      4096
gemventure.chr        1956      2093    1816      4096
lj65.chr              1855      2027    1806      4096
manhole.chr           4061      3807    4124      8192
neotoxin.chr         10438     10992   10355     24576
nesnake2.chr          3347      4914    3026      8192
quantdisco.chr       12902     13165   12977     32768
sackofflour.chr      10594     10541   10191     32768
solarwars.chr        16081     16160   15568     32768
sting.chr             1894      2048    1752      6144
Total               113947    116760  107861    254336
smb.chr (SMB1)        5235         -       -      8192
So it currently sits somewhere in between ZIP/DEFLATE and Tokumaru's encoder. I think it's possible to eke more out of it, and I have some ideas brewing.
I decided to mostly devise my own scheme. It's mainly based on single-bitplane 8x8 patterns and blocks of 8 of these (so 4 full tiles per block). The resulting byte list is split into 4-bit nibbles, which are Huffman encoded. The reason for using 4-bit symbols is that it only requires a 48-byte RAM table on the decompression end of things, and it didn't really hurt the ratio much. I've tried to stay within the hardware limits of the NES, but like I said I've only read about it, and had to guess from what little I know of 6502 asm as to what is viable. No idea what the decompression time might be.
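To illustrate the nibble idea (this is my own Python sketch, not the actual C# encoder): splitting the byte stream into 4-bit halves caps the Huffman alphabet at 16 symbols, which is what keeps the decoder's RAM tables tiny.

```python
from collections import Counter
import heapq

def to_nibbles(data):
    """Split each byte into its high and low 4-bit halves."""
    out = []
    for b in data:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out

def huffman_code_lengths(symbols):
    """Return {symbol: code length} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    # Heap entries: (weight, tiebreak, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                     # degenerate single-symbol case
        return {s: 1 for s in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# A nibble alphabet can never exceed 16 symbols, so the decode-side
# tables stay small no matter what the input looks like.
lengths = huffman_code_lengths(to_nibbles(bytes(range(256))))
assert len(lengths) <= 16
```

With at most 16 symbols, even a naive decode table stays within a few dozen bytes, which is in the ballpark of the 48-byte figure above.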
Source code (MIT licensed, purely C# right now) can be found at:
http://code.google.com/p/artwork-compression-kit
Hope it can be of use!
Huffman on 4x1 pixel slices of one bitplane? Great minds think alike. Nintendo had the exact same idea for compressing the Nintendo logo in the Game Boy Advance BIOS.
Cool, it managed to beat my (well, Codemasters') scheme in several cases! =) We just need to code the decompression routine in 6502 ASM.
Tepples, that's cool.
Can't really claim it was something that I carefully planned. Was mostly figuring out a way to make the Huffman decoding cheaper, and it turned out to work well.
Tokumaru, if someone were to pick up writing a 6502 version that would be awesome. I'd have to learn the platform first if I were to do it myself.
In the long run I'd love to have encoder variations for SNES, SMS, MD etc. I'm not sure where I'm going with this, it's mostly for fun and to learn.
I've documented the format (actually it's layered: an outer container for multiple banks, a Huffman container that could be omitted or replaced and the inner NES-specific compression). Hope it's readable:
https://code.google.com/p/artwork-compr ... /ACKFormat
I'm sure I've overlooked many possible improvements, so any comments are welcome.
Very interesting format. I imagine that the decompression code would be quite large/slow, because of all the modes and the moving back and forth between Huffman and NES-format decompression.
Normally I'd take a shot at writing decompression code, but now I'm looking for something different in a compression scheme. I want something that is simple and fast to decompress, even if the compression ratios are not that great.
Fair enough.
I think it's probably not that bad, and might take a stab at it. Would you mind at all if I used your decoder/test source code as a starting point? I am versed in assembly, just not 6502, and can probably learn the platform as I go.
cadahl wrote:
Would you mind at all if I used your decoder/test source code as a starting point?
Sure, go ahead. Let us know of the results, OK?
I've so far ported NesDecoder and have it decompressing the SMB CHR correctly. Next thing up is the HuffmanDecoder.
I'm getting the hang of assembly on this thing, but I've had a couple of headaches, e.g. confusion over which instructions affect which flags, and the various 6502 docs seem to disagree over which instructions exist on which variety of the CPU.
http://www.masswerk.at/6502/6502_instruction_set.html says BIT has only ZP and absolute addressing, but immediate operand (BIT #$80) works fine at least in FCEU for me.
cadahl wrote:
http://www.masswerk.at/6502/6502_instruction_set.html says BIT has only ZP and absolute addressing, but immediate operand (BIT #$80) works fine at least in FCEU for me.
That shouldn't work. What assembler are you using? Maybe the command got assembled to something else and by coincidence the flags are being set as you expect them to.
FCEU and its "sons" are not terribly accurate either. Nintendulator and Nestopia are the most accurate emulators around, so be sure to always test on them, and on actual consoles when possible. Also, look at Nintendulator's or FCEUX's debugger to see what your instruction actually assembled into (locate the command in the disassembly window).
The NES uses a standard 6502 (without decimal mode); ignore documents that talk about the 65C02 and other processors based on it. The document you linked to appears to be correct.
Ah, thanks. I'll see if I can run those, using Linux so I tend to prefer native emulators. In this case finding a workaround is easy, I can roll bit 7 into carry instead, but I really want to be compatible.
If you're interested, here are my efforts for now:
http://code.google.com/p/artwork-compression-kit/source/browse/trunk/src/deackNES/deack.asm
I'm sure I'm making all sorts of mistakes and missing opportunities for optimization, but it seems to decode rather instantly (still need to see if the Huffman hurts badly though).
Edit: I'm using ASM6, same as you afaik (except I compiled it from source).
After looking at it, albeit briefly, a couple things come to mind:
Column transposing is going to be rather slow, or need an excessively large LUT.
It's probably better for the group header to have the bits stored chunky rather than planar, as it cuts down on the amount of RMW rotates needed. The code would probably benefit from extracting all 8 modes at once with unrolled special case code.
The two things you mention bother me as well.
Column mode is pretty nasty. I had plans to see if I could keep it row-oriented and bunch the unique columns toward one end of each byte instead. Decoding would be more ALU-oriented, and while it would still take 8 bytes like a raw pattern, it might skew the value distribution enough for the Huffman backend to pay for the extra symbols.
Now that I'm using 4 bit symbols rather than 8 bit, I'm not sure if it's worth trying. Worst case it would be a loss vs just using Mode 0.
I probably will change the pattern mode encoding, I didn't consider the cost of RMW.
On another note, is FCEUX the recommended debugger? I've tried FCEU and FCEUXSP for some light ROM disassembly before. The latter seemed more complete.
I think FCEUX is the only one that's still being updated. Its debugging features are great, but the fact that they are built on top of a not-so-accurate emulator doesn't help. See if you can get Nintendulator to run, even if it means using Wine or a virtual machine (and even if it's slow), because you should test your code on an accurate emulator before releasing it. Ideally you should test on hardware too.
And if your computer/OS isn't powerful enough for Nintendulator, a PowerPak is probably cheaper than a new computer. At least it was for me and my Eee PC laptop.
I actually got Nestopia to compile for Linux, and my test failed spectacularly. It turned out to be the BIT #$80, easily replaced with a shift and BCC. It now seems fine.
Interestingly Nintendulator didn't fail the BIT #$80 version, looked identical to FCEU. I think I'll stick to Nestopia now that I have it running.
Edit: Btw, is loop unrolling preferred? I know ROM space is tight, but there's no I-cache and even if I JSR to somewhere expensive in a loop I'd save a steady number of cycles by not doing loop management. If it also allows me to use cheaper addressing, all the better.
If you moved the y/ptr increment around a bit, you could avoid the need for the bit or the asl, as lda will set the N flag appropriately.
Ah. I think I'll leave that alone for now, I replaced the ASL with "CMP #$80 / BCC" because I do need A unaffected. It's not a hotspot and this, I think, is easier to read. Thanks though. Very thankful for any hints you guys can provide.
One thing I'm wondering, and I think I haven't seen anything on this: are there costs other than the obvious cycle counts when setting up PPU addressing with the $2006 register? Penalties for switching between reading and writing?
As a PC programmer I'm used to graphics hardware being very far from the CPU with many deep pipelines in between, so almost anything other than streaming data to the "PPU" is expensive.
cadahl wrote:
One thing I'm wondering, and I think I haven't seen anything on this, are there costs other than the obvious cycle counts when setting up PPU addressing with the $2006 register?
You have to either wait for vertical blanking or force a blank screen. The codec I've been working on decodes four tiles per frame to an unused part of the stack so that I can decompress with the screen turned on if necessary. That way, I can do the decompression during draw time and blast it out to the PPU with a partially unrolled loop during vertical blanking.
Quote:
Penalties for switching between reading and writing?
Reading from VRAM has numerous caveats:
- You usually have to seek (two writes to $2006) before you start to read or write.
- You have to discard the first byte you read due to PPU pipeline delay.
- You have to do it all during vertical or forced blanking.
- If you're playing sampled sound, the reading will occasionally skip a byte.
But these caveats were still not enough to stop plenty of CNROM games from using one-third of CHR ROM for arbitrary data storage.
Thanks tepples.
I implemented the Huffman decoding and... man, is it ever slow. I should have listened to Tokumaru's warning.
It currently takes (guessing) about a second to decompress a bank full of tiles, that's almost useless IMHO.
De-HuffmanEncode:
http://code.google.com/p/artwork-compression-kit/source/browse/trunk/src/deackNES/deack_huffman.asm?spec=svn14&r=14
(Huffman_Init and Huffman_ReadByte are the starting points in that file)
De-NesEncode, optimized and fixed since last time:
http://code.google.com/p/artwork-compression-kit/source/browse/trunk/src/deackNES/deack.asm?spec=svn14&r=14
Starting to think it's better to drop Huffman and see if Rice/Golomb/Elias is much faster for variable bit-length coding, or leave it byte-oriented. (The other part seems fast enough.)
My decoder was very slow too, I favored code size rather than speed. I believe these complex schemes will always be significantly slower, and that's why I'm pursuing something "lighter".
tokumaru wrote:
My decoder was very slow too, I favored code size rather than speed. I believe these complex schemes will always be significantly slower, and that's why I'm pursuing something "lighter".
Other than RLE? What sort of characteristics are you after? (random access etc)
cadahl wrote:
Other than RLE? What sort of characteristics are you after? (random access etc)
I'd like something aware that it's dealing with 2bpp bitmaps, and could take advantage of more kinds of spatial redundancy than RLE can.
It would probably have to be something byte-aligned, because reading from a stream of bits seems to be a huge bottleneck.
I'm open to ideas if you have any. =)
Well there's always the NesEncoder part of my scheme. Without the Huffman coding, SMB.chr is around 5800 bytes instead of 5200 but that's still fairly good starting from 8192 bytes.
Yes, it's very interesting.
Hey cadahl, I'm taking a look at your compression scheme and I have a question: Does the "group repeat" feature ever get used? I find it hard to believe that someone would repeat 4 whole tiles in the same tileset, especially considering they'd have to be aligned to a 4-tile boundary. So I was wondering if you have any statistics about how often it's used in the tilesets you tested. I wonder if it wouldn't be better to just give the first pattern in the group the option to use all 8 compression modes too.
Very good question. To be honest, those tend to be used mostly for large empty areas. But there's another case, when you have multiple pages of tiles in order to animate (please correct me here if I'm mistaken), you tend to have lots of duplicated tiles for the static stuff. This allows a group to refer back up to an offset of one bank.
Now.. I got the idea from the corpus .chr's... but maybe when you use compression you'd never use bank/page flipping in that way. So that feature could be useless in practice. And I agree it'd be nicer without it.
Edit: clarified wording. I'll read this over again tomorrow. Been chasing a deadline.
tokumaru wrote:
Does the "group repeat" feature ever get used?
To elaborate on what cadahl said: There are big blocks of X's representing unused tiles in a lot of NROM-128 games, even official ones. LJ65 has a few blocks of X's in one pattern table and a completely blank second pattern table. Concentration Room has over a kilobyte of X's on titlegfx.chr because I'm still figuring out how I'll do the cut scenes.
To use CHR compression, you need to use CHR RAM. There are two ways to use bankswitching with CHR RAM: either switch between two 4 KiB pages (much like in Dr. Mario) or use more than 8 KiB of CHR RAM. In a case like Dr. Mario, a program would just load the basic tiles to both pages. And only a couple boards take more than 8 KiB of CHR RAM, one of them being CPROM used for Videomation. You'd probably just want to decompress the CHR once, blast it over all pages, and then overwrite those few tiles that need animated.
cadahl wrote:
Very good question. To be honest, those tend to be used mostly for large empty areas in a bank.
It makes sense to have empty areas in a game that uses CHR-ROM, but in order to use your compression the game must use CHR-RAM, in which case it would be better to compress just the tiles which are actually used. It would make no sense to compress empty tiles.
Quote:
But there's another case, when you have multiple banks in order to animate, you tend to have lots of duplicated tiles for the static stuff. This allows a group to refer back up to an offset of one bank.
That makes sense for CHR-ROM, but not for CHR-RAM. Since you can update as many tiles as you want you'd hardly waste time redrawing tiles that look the same. Also, if I understood it correctly, you can only copy data from the same bank, that has already been decompressed, so even if the tiles were repeated in different banks you wouldn't be able to make use of that because the banks are encoded individually, isn't that right?
We often test compression schemes and measure their efficiency by compressing whole sets of 512 tiles, but not many commercial games compress their tiles like that. They often use smaller sets, and combine them as necessary depending on the level or whatever. For example, it makes sense to have a block with the main character's tiles, a block for each level containing the background graphics and sprites used in that level, a block for common items, things like that.
The Codemasters codec for example, uses only 1 byte to indicate how many tiles are in a block, because it was never intended for encoding a large number of tiles in the same block.
tokumaru wrote:
It makes sense to have empty areas in a game that uses CHR-ROM, but in order to use your compression the game must use CHR-RAM, in which case it would be better to compress just the tiles which are actually used. It would make no sense to compress empty tiles.
Good point. Maybe a front-end with skip lists, an address table or something might be useful? Or it could be entirely manual.
Quote:
Quote:
... up to an offset of one bank.
That makes sense for CHR-ROM, but not for CHR-RAM. Since you can update as many tiles as you want you'd hardly waste time redrawing tiles that look the same. Also, if I understood it correctly, you can only copy data from the same bank, that has already been decompressed, so even if the tiles were repeated in different banks you wouldn't be able to make use of that because the banks are encoded individually, isn't that right?
Yep, the offset is up to -8192 bytes, so you can reach from the last to the first pattern in a bank (But not past the beginning since the encoder only knows this bank.)
Quote:
...smaller sets, and combine them as necessary...
Now that you say it, it makes all the sense in the world, with CHR-RAM being about fine-grained control. (At least I guess RAM was steep compared to ROM back then, so you needed a *good* reason to use it?)
Plenty to think about. I think I'll make this less bank-oriented, and I have another idea to pursue: using previously decoded tiles as "predictors" and encoding the EOR residual only.
There are plenty of cases where tiles are similar but not identical. I think the decoder would be fast enough even if it 1) copies a previous pattern, 2) EORs the residual error on top (encoded with current modes) and 3) writes the pattern to VRAM. I'll experiment with the encoder and see what wins can be had.
I imagined something like using mode 7 to indicate that the encoded tile should be EOR'ed with a previous one.
In this case, another byte would be necessary: 3 bits would indicate the encoding mode and 5 bits would point to a previous pattern. If mode 7 is used again, the next byte would contain the actual compression mode, and the two 5-bit references would form a 10-bit reference, enough to select any pattern in the 8KB space.
I don't know if that's the best idea, because the 2 bytes necessary to point to a distant pattern may be a lot of overhead.
Another thing I thought about was making use of the fact that mode 0 (raw pattern) would never be used to encode the residual error. I can't think of anything right now, but that's something to keep in mind.
Another idea I had was to use a command to redefine the common patterns, to make the whole scheme more adaptive.
All good ideas! I spent a couple hours testing based on that and arrived at something good, I think. I removed group repeats, lifted the restrictions on pattern 0 in a group, and added mode 7. This mode is "EOR", but the EOR residual is actually never stored, because it doesn't help as such. Instead, a good reference pattern is found (via the EOR difference) and only the differing rows are stored. So:
Mode 7 header 1: xppppppp (7 bits of offset, -1..-128 patterns)
Mode 7 header 2: 01234567 (per row, which ones differ from the reference)
...then one byte for each set bit in header 2.
Allowing more than 7 bit offset is only of marginal benefit. I did a graph of where the best matches fall in the range of offsets, and it's a strong 1/x-like curve with smaller offsets dominating. Even 6 bits would do well. I figure bit 7 can be used for extensions.
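Here's the Mode 7 layout as I read it, sketched in Python (my own code and naming, not from the ACK source; the exact bit assignment in header 2 is an assumption):

```python
def encode_mode7(pattern, ref, offset):
    """pattern, ref: 8 row bytes each; offset: 1..128 patterns back.
    Bit numbering in header 2 is assumed (bit 7 = row 0)."""
    assert 1 <= offset <= 128 and len(pattern) == len(ref) == 8
    h1 = (offset - 1) & 0x7F              # bit 7 left clear for extensions
    h2 = 0
    rows = []
    for i in range(8):
        if pattern[i] != ref[i]:
            h2 |= 0x80 >> i               # mark this row as differing
            rows.append(pattern[i])
    return bytes([h1, h2] + rows)

def decode_mode7(stream, pos, ref):
    """Returns (pattern, new position, back-offset)."""
    h1, h2 = stream[pos], stream[pos + 1]
    pos += 2
    out = []
    for i in range(8):
        if h2 & (0x80 >> i):              # differing row: read it
            out.append(stream[pos])
            pos += 1
        else:                             # unchanged row: copy reference
            out.append(ref[i])
    return bytes(out), pos, (h1 & 0x7F) + 1
```

A pattern differing from its reference in two rows costs 4 bytes this way, versus 8 for a raw pattern.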
Got the following numbers now (still without Huffman, this is the corpus+smb.chr):
Code:
30643
2331
8319
9210
1583
1726
1920
4462
11265
5093
13843
12013
16917
2016
5570
Total ratio of 47.7% against the original data, not bad. I changed the group header so the pattern modes are stored chunkily; not sure if it helped the decoding:
h1: 00011122
h2: 23334445 (for patterns 0-7)
h3: 55666777
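For illustration, the chunky packing above can be sketched like this (my own Python, not from the ACK source): eight 3-bit modes packed MSB-first into three bytes, matching the h1/h2/h3 digit diagram.

```python
def pack_modes(modes):
    """Pack eight 3-bit modes MSB-first into 3 header bytes."""
    assert len(modes) == 8 and all(0 <= m < 8 for m in modes)
    bits = 0
    for m in modes:
        bits = (bits << 3) | m            # 24 bits total
    return bytes([(bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF])

def unpack_modes(h1, h2, h3):
    """Inverse of pack_modes, for patterns 0-7 of a group."""
    bits = (h1 << 16) | (h2 << 8) | h3
    return [(bits >> (21 - 3 * i)) & 7 for i in range(8)]
```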
Edit: NES decoder here:
http://code.google.com/p/artwork-compression-kit/source/browse/trunk/src/deackNES/deack.asm?spec=svn18&r=17
Personally, I'd rather keep the pattern modes planar, and use a piece of code like this to read each mode:
Code:
asl Header0     ;next plane-0 bit goes into carry
bne +Skip       ;buffer not empty yet
;READ 3 NEW BYTES AND PUT THEM INTO Header0, Header1 and Header2
sec             ;might not be necessary if reading the bytes doesn't change the carry
rol Header0     ;sentinel back in at bit 0, carry = bit 7 of the new byte
+Skip:
lda #$00
rol             ;A = plane-0 bit
asl Header1
rol             ;A = plane-0 and plane-1 bits
asl Header2
rol             ;A = complete 3-bit mode
You just have to initialize Header0 with $80 (code for empty buffer) and the code above will take care of loading new bytes when necessary. It's much shorter than the huge decoding you have there, and I don't think there's any significant difference in speed.
Still checking out the changes you've made, I didn't quite understand it from the explanation on your post.
Thanks, going to try to understand the code you posted.
This is what's done for mode 7 in the encoder:
Code:
- For every pattern P:
  - Try modes with zero pattern cost first.
  - Speculatively calculate RCR (Mode 1).
  - Speculatively calculate RD (Mode 7).
  - Pick RCR, RD or RAW based on which uses fewer bytes.

Calculate RD:
- For every pattern P' from P-1 to P-128:
  - Keep the best candidate P'' sharing the most rows with P.
- Store offset to P'' in header 1.
- Store the rows of P that differ from P'' and mark their bits in header 2.
So it's performing a brute force search 128 patterns back. And the "new rows" in P compared to P'' are stored, plus the headers.
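The search could be sketched in Python roughly like this (my code, not the actual C# encoder; the cost model is simplified to the two header bytes plus one byte per differing row):

```python
def best_rd_reference(patterns, p):
    """Brute-force RD search: scan up to 128 earlier patterns and return
    (cost in bytes, back-offset) for the cheapest reference, or None when
    there are no earlier patterns. `patterns` is a list of 8-byte rows."""
    best = None
    for offset in range(1, min(p, 128) + 1):
        ref = patterns[p - offset]
        diff_rows = sum(1 for a, b in zip(patterns[p], ref) if a != b)
        cost = 2 + diff_rows              # header 1 + header 2 + diff bytes
        if best is None or cost < best[0]:
            best = (cost, offset)
    return best
```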
Edit: You know what, I think I've been confusing you (and myself) by insisting on calling it "EOR mode" though the EOR result isn't stored. Even in the encoder, byte comparisons can be used instead. The mode could be called "row differences" or something. Sorry about that. It started out as an EOR-based idea, but halfway through I noted that the only reason to store the EOR is if fewer set bits cost less to store somehow. Maybe with a backend entropy encoder that could be true. Anyway I changed the pseudo code above and updated the doc:
http://code.google.com/p/artwork-compre ... =NESFormat
Hope it makes sense now.
Edit 2: Reverted the decoder to read "planar" modes again:
http://code.google.com/p/artwork-compre ... svn21&r=21
Hm. I'm not sure I understand your mode scheme. What if bit 2 of the modes are all 0? Then the byte reading code is triggered.
Seems like this would need an exception rule along the lines of "the last pattern always has bit 2 set" or something to work. Then you'd know to load when header0 turned to 0. Otherwise I don't see how to distinguish an "emptied" set of headers from all possible bit patterns of 1..8 remaining modes. This is meant to get rid of the "group" loop, right? That'd be nice.
Be sure to let me know if I'm being dense.
cadahl wrote:
What if bit 2 of the modes are all 0? Then the byte reading code is triggered.
That's why we initialize the variable to $80, to keep it from triggering. Bit 7 will prevent the byte from being $00 until that set bit is shifted out. Note in my code that when Header0 becomes $00 the bit that is shifted out is not used, and when the new value of Header0 is loaded this bit is inserted back into the variable at bit 0 (hence "rol Header0" and the "sec" that might not be necessary), as the real bit of information comes out from the other end.
So by initializing the variable to $80, the very first time we use it it will be detected as empty, and 3 new bytes will be loaded. And it will happen again whenever all 8 bits of each byte are used.
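For anyone who wants to poke at the trick outside an emulator, here's a Python rendering of the same sentinel buffer (my translation of the 6502 snippet, semantics as I understand them, so treat it as a sketch):

```python
class PlanarModeReader:
    """Reads 3-bit modes from three planar header streams, using the
    $80 sentinel in Header0 to know when to load three fresh bytes."""
    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.h0 = 0x80                    # sentinel: "buffer empty"
        self.h1 = self.h2 = 0

    def next_mode(self):
        carry = (self.h0 >> 7) & 1        # asl Header0
        self.h0 = (self.h0 << 1) & 0xFF
        if self.h0 == 0:                  # sentinel shifted out: refill
            self.h0, self.h1, self.h2 = self.data[self.pos:self.pos + 3]
            self.pos += 3
            carry = (self.h0 >> 7) & 1    # sec / rol Header0
            self.h0 = ((self.h0 << 1) & 0xFF) | 1
        b1 = (self.h1 >> 7) & 1           # asl Header1 / rol
        self.h1 = (self.h1 << 1) & 0xFF
        b2 = (self.h2 >> 7) & 1           # asl Header2 / rol
        self.h2 = (self.h2 << 1) & 0xFF
        return (carry << 2) | (b1 << 1) | b2
```

With planar header bytes 0x0F, 0x33, 0x55 this yields the modes 0 through 7 in order, and the refill fires exactly once per group of eight.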
Doh. Finally I get it.
I kept thinking there would be an extra unwanted cycle before the load triggered again, and that the load was meant to happen AS the last bit was shifted out. Now it makes sense. I will add this tomorrow, the loops should get cleaner. Thanks!
Some updated stats.. with mode 7, ACK is looking promising.
ZIP: 42.4% of originals in the NesDev Corpus
Tokumaru/CM: 45.9%
ACK: 46.6%
ACK+Huffman: 43.2%
Included in the numbers is a tweak to Mode 7. I managed an extra 1.1% ratio by allowing the referenced pattern to be flipped in Y. That unused header bit is used for flagging it.
I tried, for fun, a few other uses of the bit. E.g. that the referenced tile should be bit inverted, or allowing references in half-pattern steps (4 Bytes). Both did give very minor improvements but I chalk that up mainly to chance. The Y flipped sub-mode seems to hit on some kind of existing redundancy. Maybe it's because bg tiles can't be flipped by hw. If horizontal flipping was cheaper in the decoder, I'd see if that helped too.
cadahl wrote:
You know what, I think I've been confusing you (and myself) by insisting on calling it "EOR mode" though the EOR result isn't stored. Even in the encoder, byte comparisons can be used instead. The mode could be called "row differences" or something.
Yeah, the "EOR" was throwing me off! =) I would probably get it from looking at the code (sometimes this works better than reading textual descriptions for me), I just didn't have the time to look at all of it.
I find these compression ratios you were able to achieve really good, even without huffman. It's the exact relation between complexity and size I was looking for.
Do you mind if I experiment with a similar compression scheme (using compression modes for each pattern, but experimenting with different modes a bit)? I want to use something more customized for my project, instead of using a generic solution.
cadahl wrote:
I tried, for fun, a few other uses of the bit. E.g. that the referenced tile should be bit inverted, or allowing references in half-pattern steps (4 Bytes).
Did you try using the bit for row/column selection like in mode 1? I'd be interested in the results of that.
Quote:
The Y flipped sub-mode seems to hit on some kind of existing redundancy. Maybe it's because bg tiles can't be flipped by hw. If horizontal flipping was cheaper in the decoder, I'd see if that helped too.
I think it's much more common for background tiles to be flipped horizontally than vertically. I think you should try it, even if it takes a while to decode (it couldn't be worse than column-oriented mode 1).
Quote:
Do you mind if I experiment with a similar compression scheme (using compression modes for each pattern, but experimenting with different modes a bit)? I want to use something more customized for my project, instead of using a generic solution.
No problem at all, this area could use some research. It'd be nice if you come up with something good, and hoping you won't mind if I borrow it for ACK.
Quote:
Did you try using the bit for row/column selection like in mode 1? I'd be interested in the results of that.
Yep I did, but there wasn't much improvement, certainly below what Y-flipped referencing gave. Maybe the bit differences are too randomly distributed.
Quote:
I think it's much more common for background tiles to be flipped horizontally than vertically. I think you should try it, even if it takes a while to decode (it couldn't be worse than column-oriented mode 1).
Awesome. I have to try that.
As for decoding, it should be eight shift/rotate pairs per row (LSR out of the source, ROL into the result), I guess.
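In Python the per-row flip would look like this (my sketch; on the 6502 each loop iteration would be one shift out of the source byte and one rotate into the destination):

```python
def reverse_bits(b):
    """Mirror one row byte: shift bits out of b LSB-first and rotate
    them into the result (the LSR/ROL pairing on a 6502)."""
    out = 0
    for _ in range(8):
        out = (out << 1) | (b & 1)
        b >>= 1
    return out

def flip_pattern_x(rows):
    """Horizontally flip a single-bitplane 8x8 pattern."""
    return bytes(reverse_bits(b) for b in rows)
```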
I implemented the x-flip as you suggested and it helped a lot! Had to find a place to flag it, and I measured a 6-bit offset to be too short. But it turned out that old mode 5 ("same as two back") is rarely useful. It's often used in fonts etc with one bitplane empty or at least identical over a run of many chars. Modes 2/3 (common A/B) fit most of the time and referencing is a good fallback.
New pattern modes:
0: "raw" bytes
1: row/column repeats
2: common A
3: common B
4: same as one back
5: same as one back (bitwise NOT)
6: reference pattern + diff bytes
7: y-flipped reference pattern + diff bytes
New numbers:
ACK+Huffman: 41.6% (!)
ZIP: 42.4%
ACK: 44.8%
Tokumaru/CM: 45.9%
Very impressive. So you used two modes (6 and 7) to select whether to Y-flip and used bit 7 of the reference byte to select whether to X-flip? It's an interesting way to support all possibilities.
Yup. I wish I didn't have to (ab)use the mode numbers like that, though.
I'm going to compile some stats and see how often the various modes are being used. I think it can now be seen as a dictionary encoder (with Modes 6 and 7 as the main part) with different code-shortening hacks. Modes 2/3 form a tiny fixed dictionary determined at encode time, and Mode 4 is a cheap way to express a common case of Mode 6. RCR (Mode 1) is just a compact way to express new additions to the dictionary. The mode that stands out is Mode 5, which does a NOT of the previous pattern. Modes 6/7 don't support that, but in my testing it didn't seem all that useful. Maybe the patterns that benefit are always a direct neighbor (i.e. the companion bitplane).
Seems likely that I'll rearrange the modes once I figure out how it all pieces together.
Hey, a question for you guys (being NES developers). Is it "trivial" to rearrange the palette for the whole background (and/or all the sprites)? I know color 0 is special, so leave it at 0, but is it ok to allow any permutation of colors 1, 2 and 3? (E.g. the decompressor says "0,3,2,1" so you should set the palette in that order for the gfx to look good.) Letting the encoder select a certain combination in fact helps the compression ratio a little bit.
My question, in more general terms, is this: in YOUR game/demo, could you reorder the palette (except entry 0) arbitrarily without ill effects if you wanted?
On another note, some summary stats for the Corpus + SMB.chr:
CommonA: 25.81%
CommonB: 3.82%
SameAsLast: 2.65%
SameAsLastInverted: 0.89%
RD: 13.30%
RD(-x): 7.93%
RD(-y): 3.74%
RD(-x,-y): 3.42%
RCR: 28.65%
Raw: 9.79%
At least one mode stands out sorely, being used in < 1% of all patterns.
Very interesting stats. Common plane A is used a lot, I wouldn't expect that much. I expected RCR to be the heart of it though.
About the permutation of the palette, I think that might be a problem. Sometimes palettes are shared between different sets of graphics. Also, some games use a logical order for the colors, such as sorting them by brightness, which means they are not interchangeable. So I don't think it's a good idea to have the graphics bossing the palettes around. There are probably more reasons to avoid this, I just can't think of them right now.
If you really think it will help compression, you might add it as something optional, so that those who don't care about the order of the colors can make use of it.
Good points. I added an option "-c" to allow the color reordering.
Here are the stats per file (more interesting really):
http://www.pastey.net/134730
Common A clearly dominates some files, and like you say RCR is the main mode, but the RD variants contend in many cases. The Raw usage is satisfyingly low in many files but also indicates the problem areas. I've looked at a dump of raw tiles in yy-chr but can't really spot any more obvious redundancies to exploit.
Edit: I came up with an interesting tweak to the Common modes btw. Instead of storing the patterns themselves in the header, I store their indices. When they are first decoded in the stream (using Raw or RCR) they're added to the two-entry dictionary. Any subsequent occurrences use the Common A/B modes.
Now... this exposes the trick: before the first occurrence of A/B, the two-entry dictionary is unused. I figured, why not initialize the entries to (all zeroes) and (all ones)? Then any occurrences of 0-tiles or 1-tiles can borrow the Common A/B modes until the actual A/B values are found in the stream. Additionally, if the actual A/B values are (all zeroes) or (all ones), I can just leave them implicitly defined and never really encode them.
This is cheap, almost automatic and actually shaves a little off the result.
(and I'll try updating the A/B patterns many times in the file.. like you suggested)
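A sketch of how the implicit dictionary could work (Python, my own naming and structure, not the actual ACK implementation): the two slots start as all-zeroes and all-ones, and each is replaced the first time its declared pattern index is decoded.

```python
class CommonDict:
    """Two-entry Common A/B dictionary with implicit defaults."""
    def __init__(self, index_a, index_b):
        # Defaults usable before the real patterns appear in the stream.
        self.slots = [bytes(8), bytes([0xFF] * 8)]   # all 0s, all 1s
        self.pending = {index_a: 0, index_b: 1}      # pattern index -> slot

    def on_decoded(self, index, pattern):
        """Call for every fully decoded pattern (Raw/RCR etc.); installs
        the real A/B pattern when its declared index comes up."""
        slot = self.pending.pop(index, None)
        if slot is not None:
            self.slots[slot] = pattern

    def common(self, which):
        """0 = Common A, 1 = Common B."""
        return self.slots[which]
```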
cadahl wrote:
I came up with an interesting tweak to the Common modes btw. Instead of storing the patterns themselves in the header, I store their indices. When they are first decoded in the stream (using Raw or RCR) they're added to the two-entry dictionary. Any subsequent occurrences use the Common A/B modes.
Ah, good idea.
Quote:
(and I'll try updating the A/B patterns many times in the file.. like you suggested)
Maybe you could do it like this: at the start, define the index of the 1st common pattern, and when this pattern is reached, the index of the pattern that will take its place is defined, and so on. Instead of indices you could even use counters (how many patterns until a new one should be loaded), because it's easy to detect when numbers reach 0.