I wrote a compression library, with a technique insipred by DEFLATE, but the goal was to have reasonable generic compression performance with very low RAM usage.
Only needs 9 bytes of RAM. Decompression times vary depending on the data, but seems to easily handle more than 100 bytes per frame.
It's free and open source. There's a command line tool and a demo ROM that compresses
The Most Dangerous Game to fit in NROM.
https://github.com/bbbradsmith/huffmunch
Would you say that compressing text is the best use of Huffman?
BTW, the idea of compressing an entire (NROM) game and decompressing it as needed... is an interesting one.
dougeff wrote:
Would you say that compressing text is the best use of Huffman?
Huffman coding is just a component of the algorithm. Though the example ROM I published compresses text, the purpose of this algorithm is generic. It works on any kind of data. (DEFLATE also uses huffman coding in its algorithm.)
For text, Huffmunch should perform better than huffman coding alone, because it encodes longer strings and not just single bytes.
dougeff wrote:
BTW, the idea of compressing an entire (NROM) game and decompressing it as needed... is an interesting one.
The name I used for the demo ROM may have been misleading? The demo just has the text of a story compressed. The decompression library will let you retrieve each byte from the data in order (serial access). You can break the data up into pieces, though, and start from the beginning of any of those pieces. In the demo, it's broken into pages of the story.
(All of the pieces in a bank share a single compression tree, so despite being broken up into pieces they are still compressed more globally with shared redundancies.)
I made this image to illustrate how the compression works. This is an output of the compression tree created for the demo ROM, in order bit length it takes to encode each symbol. (A symbol is just a string of bytes to be output.)
Again, this isn't a text-specific algorithm, but it works well on text and it makes the example easy to follow what it's doing...
You can see that [e] and [ ] are the most common, and got encoded with only 4 bits each.
Before it even gets to the rest of the alphabet, some common digraphs appear like [re] [on] [it] [ar] [in] with 6 or 7 bit encodings (>50% smaller).
Eventually longer strings appear lower in the tree. The name of the main character [Rainsford] got an 8-bit encoding since it gets used so frequently (88% smaller).
Very infrequent symbols at the end of the tree are actually expanded, like captial [J] takes 14 bits to encode (75% larger), but that's how huffman coding works in general: deprioritize rare symbols to compress common ones better.
Another minor compression is applied as suffixes within the tree. If you see a symbol followed by - and another symbol, it means instead of storing the whole string it replaced that suffix with a pointer to another tree node that already includes it. This lets words like [night] and [light] share an [ight] suffix. There are quite a few [ing] suffixes. In some cases a space prefixed variant of a symbol appears, since words are most often preceded by a space.
In some cases multiple words end up in a single symbol, like [did not] or [I am]. The algorithm is not looking specifically for words, just any common substrings, but since text is naturally grouped into words you will see some appear in the tree like this. There is more compression gained from the smaller fragments than whole words, I think.
The tree symbol storage could be made even better by allowing both suffixes and prefixes, but that would require some recursion / extra storage... maybe it could go on the stack but I'd have to put explicit limits on it, and it seemed out of scope for the problem. With suffix-only there's no extra RAM requirement.
Anyway, the same principles apply to any kind of data you want to compress, but the output here would be hard to read as hex dumps.
The compressor itself could probably be improved. The huffman tree component is optimal, but the tree structure and choice of symbols is very much heuristic (and the hill climbing feels a little bit naive and computationally expensive), but the results seem pretty good to me regardless. If anyone else has ideas on how to improve it I'd love to hear them.
I also did some experiments adding some simple pre-processing (e.g. RLE) before encoding to see if that helped, but I couldn't find any that were generically applicable. These seemed to add entropy to the starting data that made their eventual huffmunch encoding worse. Huffmunch's tree actually encodes long runs fairly well anyway, without any additional help. I could make pathological examples where RLE is better than it, but for any of my test cases of more practical looking data it performed worse.
So it uses an explicit dictionary of byte strings, instead of the usual LZ77/LZ78 on-the-fly dictionary? Interesting. Pike's method does something similar but with a simple nybble-aligned encoding instead of Huffman.
I'd like to see how LZSS-style optimal parsing works with this scheme. With a fixed dictionary, each symbol has a fixed length, so it can find a truly optimal parse for that dictionary. If I find the time to hack around someday (and remember this project, and nobody beats me to it) I might give it a shot.
Generating an optimal dictonary is probably beyond me, though. Maybe an iterative algorithm, tweaking symbol lengths after each parse?
How are you handling the position and length of each string? When things are variable length, you need to encode the address and length of every string in some way.
The format is outlined here:
https://github.com/bbbradsmith/huffmunch/blob/master/format.txtThe dictionary strings aren't themselves compressed, except for shared suffixes. The strings are stored in the dictionary tree. Each node in the tree is either a branch, a single character leaf, a string, or a string with suffix.
A branch is 1 (short) or 3 bytes (long), containing an offset to its right child (the left child immediately follows). Most are short.
A single character leaf is 2 bytes. A string is 2 + length of string (up to 255). A string with suffix adds another 2 bytes for the offset to the suffix.
The alternative "canonical" version of the tree basically removes all the branch nodes and replaces them with a leaf count for each "level" of the tree, which saves a little bit of space but takes a lot longer to decode. (Could be accelerated with more RAM use, but that kinda goes against one of my primary goals here.)
The position of each string in the data just comes from a bitstream. Read one bit, walk the tree, repeat until you you hit a leaf, then emit its string. Once that string finishes, start walking the tree again. (The bitstream itself is more or less a standard huffman coding affair.)
This isn't huffman-related at all, but here's a completely different idea for how to map a number to a variable-length string. Basically like DTE, but with possible longer lengths.
Define these:
Number of 1 byte symbols
Number of 2 byte symbols
Number of 3 byte symbols
...
Number of 8 byte symbols (just picking a maximum size)
And SORT the symbols by their length, so 1 byte symbols come first, 2 byte symbols come next, etc. Then you can define all your characters without using any terminators, tables, or pointers.
Hmm, that's sort of like how canonical huffman trees work, except for a table of strings rather than a binary tree.
I can't really think of a way to apply it here, though. If the strings were moved to a separate table, I'd need a way to know what string each leaf is associated with. So... right now it's taking 1 byte to specify the length of a string, and it gets stored directly in the tree. Within the tree itself the strings can't be sorted by length, they need to be represented in order of frequency-weight, so if it was stored in a separate table I'd need an index to look up the string? That index would probably have to be 2 bytes (the tree will usually have more than 256 entries).
So, interesting idea for a string table, but probably not applicable here. I'll keep it in mind if I have some other string table application in the future though.
Though, even if it did work out, I'm not sure it would be worth doing, for the same reason the canonical tree probably isn't either: the compression gain will be very small, and the performance loss will be very large (unless RAM usage is increased).
The dictionary tree is generally only a small percentage of the data. Making it smaller does save some space, but the real "meat" of the compression here is from frequency distribution and repetition in the bitstream portion of the data.
I can think of a few ways to shrink the tree by a marginal amount, but all of them seem to have a huge performance penalty (and/or require more RAM use). The canonical tree option is kind of a poster child for this. I implemented it just to see how well it worked, but ultimately it felts like a bad speed:size tradeoff. I left it in as an option because it was already done, but I don't think I'd want to use it.
rainwarrior wrote:
Hmm, that's sort of like how canonical huffman trees work, except for a table of strings rather than a binary tree.
I did something similar to this as a class project around 1990. One difficulty with approaches like this is that there are trade-offs between the amount of text each compressed symbol should represent, and the amount of space necessary to represent each symbol. An approach that may be helpful if the material to compressed will be largely English-language text is to identify a reversible transform that will make the text easier to compress, apply it before compression, and reverse it afterward. For example, many words will usually appear preceded by blanks except when following certain other characters (e.g. hyphen, quote mark, open parenthesis, etc.), and are not usually separated by blanks from those other characters. If one inserts a space after every such special character before compressing the text, and removes that space after uncompressing it, the result may be better compression than would have resulted without the spaces.
Well, if you want to go down the road of maximal compression of very specific data, there's some interesting stuff in the Hutter Prize competetion:
http://prize.hutter1.net/Huffmunch, though, is intended for generic compression that's practical for NES. Trying to keep the compression good, speed OK, code size small, RAM use very small.
Trying to optimize a special case for English text spaces is just not part of this, and honestly I believe spaces are already encoded rather efficiently by this algorithm, based on the text samples I've tried it on. The spaces just aren't eating up that much space, proportionally, so it's not something I really want to investigate (even if it wasn't text-specific). I've tried a lot of dead-end experiments on my way to the current version, and I have a few more ideas I'm going to try eventually, but spaces are already a "solved" problem here, IMO.