I kind of love bit twiddling hacks. They are great little puzzles to figure out, and are a lot like trigonomotric identities. Both are kind of esoteric, but they can save you a lot of work if you know when to apply them to simplify something bigger.
I recently had some fun rewritting the png <--> chr utilities I was using to be simpler, and figured I would do a writeup on what I ended up with. They didn't need to be rewritten, but why not?
So first up, converting a 8bpp indexed color to NES CHR bitplane format.
So let's start with the tile_to_chr() function first since it's pretty straightforward. It has two for loops. One loop for each bitplane. One iteration for each row of input pixels. The row_to_chr() function is much more interesting. It takes a pointer to a row of 8 pixels in 8bpp indexed format, and a shift parameter to select which bit to grab from each byte, and outputs a byte for that bitplane.
Let's step through that row_to_chr() function. The first thing we want to do is load the entire row of 8 pixels into a 64 bit unsigned integer. To do that we need to cast the pointer and dereference it. (Make sure it's aligned!) Next we shift up by either 6 (or 7) bits, and then mask the top bit in each byte. This selects the first (or second) bit in each of the original bytes. The reason to shift it up into the top bit ties in with the modulus step. So what does that modulus do?
Modulus by 2**n - 1 has some neat properties. Starting with a simpler example, say you are taking the modulus of 15 (2**4 - 1) The first 4 bits (1, 2, 4, 8) are all less than 15, so they end up directly in the modulus. What about the next four bits 16, 32, 64, 128?
Interesting! Because of how modulus works, the groups of bits end up being added together, and as long as they don't overlap, they don't overflow. When they don't overflow, addition ends up being the same as logically ORing them together. So for example 0b1000_0001 % 15 == 0b1001. This is exactly what we need to collapse a set of 8 spread out bits into a single byte.
So back to the CHR conversion example, we have all of the 8 bits we want to collapse into a byte in the top bit of each byte in a 64 bit unsigned integer. 511 == 2**9 - 1 so we are collecting groups of 9 bits. The bit from the lowest byte will end up in the 8th bit, the bit from the second byte will end up in the 7th bit, and so on. Done!
"BUT WAIT!" I hear some saying. What about endianness of that 64 bit load we did? It is important, because if the endianness was flipped, the tile would by flipped on the x axis. I don't actually handle that in this example because all the CPUs today seem to be doing things the Intel way, but it could be pretty easily reversed. Instead of placing the bits we want to select in the 8th bit, shift and mask them so they end up in the first. Mod 511 will do the rest to flip it.
"BUT WAIT!" I hear someone else saying. Isn't integer modulo really expensive? Well... yes, but certainly less expensive than treating all of the bytes separately. Additionally there are tricks for computing modulo of 2**n - 1 similar to how you can divide by 9 quickly. Basically you can just do a bunch of shifting and adding. Clang seems to be aware of this, I would be surprised if other compilers weren't. Alternatively, in this specific case because there is no overflow, you can just shift and OR a few times (3 in this case), until you've collapsed all the groups.
I also have a utility for converting NES CHR back to png.
At the top there is a for loop that pases the two bitplane bytes for a CHR row to the bytes_to_row() function, and assigns the 64 bit return value to 8 pixels in the output.
bytes_to_row() is conceptually much simpler than row_to_chr(). The multiplication coefficient has every 9th bit in it set. That makes 8 copies of the original byte with an extra bit of padding between them. The top bits in the top byte get cut off, but we don't actually need them. The 8th bit from the input byte ends up in the top bit of the first output byte. The 7th bit from the input byte ends up in the top bit of the second byte. (and so on) Next we shift the bit we want into the first (or second) bit, and then mask. Finally we OR the two bitplanes together and have the final output. Tada!
"BUT WA..." Yes, yes, endianness. The endianness can be flipped easily enough by changing the shifting. I'll leave that as an excercise for the reader though.
Anyway, hopefully somebody found that to be enlightening. If some of this seems kind of SIMD-like, it is! SWAR (SIMD Within A Register) is a set of techniques that let you do SIMD like things in a cross platform way on CPUs that don't even have SIMD support. Even more interestingly, it lets you work with weird bit width sizes like how I used 9 bit widths here.
I recently had some fun rewritting the png <--> chr utilities I was using to be simpler, and figured I would do a writeup on what I ended up with. They didn't need to be rewritten, but why not?
So first up, converting a 8bpp indexed color to NES CHR bitplane format.
Code:
static inline u8 row_to_chr(const void *pixels, uint shift){
// Load 64 bits / 8 bytes for the whole row of the tile.
// Shift so the bit we want is in the top bit of each byte and mask.
// Mod 511 (2**9 - 1) selects each bit per 9 bit sequence to collect them into a byte.
// In combination with the earlier shift, this reverses the byte -> bit mapping.
return ((*(u64 *)pixels << shift) & 0x8080808080808080) % 511;
}
static void tile_to_chr(const u8 *pixels, uint stride, u8 tile[]){
for(uint i = 0; i < 8; i++) tile[i + 0] = row_to_chr(pixels + i*stride, 7);
for(uint i = 0; i < 8; i++) tile[i + 8] = row_to_chr(pixels + i*stride, 6);
}
// Load 64 bits / 8 bytes for the whole row of the tile.
// Shift so the bit we want is in the top bit of each byte and mask.
// Mod 511 (2**9 - 1) selects each bit per 9 bit sequence to collect them into a byte.
// In combination with the earlier shift, this reverses the byte -> bit mapping.
return ((*(u64 *)pixels << shift) & 0x8080808080808080) % 511;
}
static void tile_to_chr(const u8 *pixels, uint stride, u8 tile[]){
for(uint i = 0; i < 8; i++) tile[i + 0] = row_to_chr(pixels + i*stride, 7);
for(uint i = 0; i < 8; i++) tile[i + 8] = row_to_chr(pixels + i*stride, 6);
}
So let's start with the tile_to_chr() function first since it's pretty straightforward. It has two for loops. One loop for each bitplane. One iteration for each row of input pixels. The row_to_chr() function is much more interesting. It takes a pointer to a row of 8 pixels in 8bpp indexed format, and a shift parameter to select which bit to grab from each byte, and outputs a byte for that bitplane.
Let's step through that row_to_chr() function. The first thing we want to do is load the entire row of 8 pixels into a 64 bit unsigned integer. To do that we need to cast the pointer and dereference it. (Make sure it's aligned!) Next we shift up by either 6 (or 7) bits, and then mask the top bit in each byte. This selects the first (or second) bit in each of the original bytes. The reason to shift it up into the top bit ties in with the modulus step. So what does that modulus do?
Modulus by 2**n - 1 has some neat properties. Starting with a simpler example, say you are taking the modulus of 15 (2**4 - 1) The first 4 bits (1, 2, 4, 8) are all less than 15, so they end up directly in the modulus. What about the next four bits 16, 32, 64, 128?
- 16 % 15 == 1
- 32 % 15 == 2
- 64 % 15 == 4
- 128 % 15 == 8
Interesting! Because of how modulus works, the groups of bits end up being added together, and as long as they don't overlap, they don't overflow. When they don't overflow, addition ends up being the same as logically ORing them together. So for example 0b1000_0001 % 15 == 0b1001. This is exactly what we need to collapse a set of 8 spread out bits into a single byte.
So back to the CHR conversion example, we have all of the 8 bits we want to collapse into a byte in the top bit of each byte in a 64 bit unsigned integer. 511 == 2**9 - 1 so we are collecting groups of 9 bits. The bit from the lowest byte will end up in the 8th bit, the bit from the second byte will end up in the 7th bit, and so on. Done!
"BUT WAIT!" I hear some saying. What about endianness of that 64 bit load we did? It is important, because if the endianness was flipped, the tile would by flipped on the x axis. I don't actually handle that in this example because all the CPUs today seem to be doing things the Intel way, but it could be pretty easily reversed. Instead of placing the bits we want to select in the 8th bit, shift and mask them so they end up in the first. Mod 511 will do the rest to flip it.
"BUT WAIT!" I hear someone else saying. Isn't integer modulo really expensive? Well... yes, but certainly less expensive than treating all of the bytes separately. Additionally there are tricks for computing modulo of 2**n - 1 similar to how you can divide by 9 quickly. Basically you can just do a bunch of shifting and adding. Clang seems to be aware of this, I would be surprised if other compilers weren't. Alternatively, in this specific case because there is no overflow, you can just shift and OR a few times (3 in this case), until you've collapsed all the groups.
I also have a utility for converting NES CHR back to png.
Code:
for(uint y = 0; y < 8; y++){
*(u64 *)(pixels + col + w*(row + y)) = bytes_to_row(tile_data[y], tile_data[y + 8]);
}
static u64 bytes_to_row(u8 byte0, u8 byte1){
// For each byte, copy in 9 bit intervals, shift into place and mask.
return (0
| (((byte0 * 0x8040201008040201) >> 7) & 0x0101010101010101)
| (((byte1 * 0x8040201008040201) >> 6) & 0x0202020202020202)
);
}
*(u64 *)(pixels + col + w*(row + y)) = bytes_to_row(tile_data[y], tile_data[y + 8]);
}
static u64 bytes_to_row(u8 byte0, u8 byte1){
// For each byte, copy in 9 bit intervals, shift into place and mask.
return (0
| (((byte0 * 0x8040201008040201) >> 7) & 0x0101010101010101)
| (((byte1 * 0x8040201008040201) >> 6) & 0x0202020202020202)
);
}
At the top there is a for loop that pases the two bitplane bytes for a CHR row to the bytes_to_row() function, and assigns the 64 bit return value to 8 pixels in the output.
bytes_to_row() is conceptually much simpler than row_to_chr(). The multiplication coefficient has every 9th bit in it set. That makes 8 copies of the original byte with an extra bit of padding between them. The top bits in the top byte get cut off, but we don't actually need them. The 8th bit from the input byte ends up in the top bit of the first output byte. The 7th bit from the input byte ends up in the top bit of the second byte. (and so on) Next we shift the bit we want into the first (or second) bit, and then mask. Finally we OR the two bitplanes together and have the final output. Tada!
"BUT WA..." Yes, yes, endianness. The endianness can be flipped easily enough by changing the shifting. I'll leave that as an excercise for the reader though.
Anyway, hopefully somebody found that to be enlightening. If some of this seems kind of SIMD-like, it is! SWAR (SIMD Within A Register) is a set of techniques that let you do SIMD like things in a cross platform way on CPUs that don't even have SIMD support. Even more interestingly, it lets you work with weird bit width sizes like how I used 9 bit widths here.