It may sound funny, but it's possible to do floating point in 8 bits.
1 - bit sign
3 - bit exponent
4 - bit mantissa
Smallest possible number : 1 * (1/16) = 0.0625 (mantissa bits all zero, with the implicit '1')
Precision of 1/256 (0.00390625) between 1/16 (= 0.0625) and 1/8 (= 0.125)
Precision of 1/128 between 1/8 and 1/4
Precision of 1/64 between 1/4 and 1/2
Precision of 1/32 between 1/2 and 1
Precision of 1/16 between 1 and 2
Precision of 1/8 between 2 and 4
Precision of 1/4 between 4 and 8
Precision of 1/2 between 8 and 15.5
Largest possible number : 15.5
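To make the layout concrete, here is a small Python decoder for this 1-3-4 format. The exponent bias of 4 and the implicit leading '1' are my assumptions (they are what make the table above come out right: largest value 15.5, smallest 0.0625); subnormals, infinities and NaNs don't exist in this toy format.

```python
# Decoder for the toy 1-3-4 format: 1 sign bit, 3 exponent bits, 4 mantissa bits.
# Assumed: exponent bias of 4, implicit leading '1' on the mantissa,
# no subnormals, infinities or NaNs.

def decode(byte):
    sign = -1 if (byte >> 7) & 1 else 1
    exponent = ((byte >> 4) & 0b111) - 4      # stored 0..7 maps to -4..3
    mantissa = 1 + (byte & 0b1111) / 16       # implicit leading 1
    return sign * mantissa * 2 ** exponent

print(decode(0b00000000))   # smallest positive number
print(decode(0b01111111))   # largest number
```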
Not very useful, but I realized I had always assumed floating point was an extremely complex thing when it isn't. It's just that so many bits are thrown at it that you can't tell what it's doing anymore. This helped me see how it works internally.
Operations are actually very simple :
MULTIPLICATION :
Multiply the mantissas (after prepending the extra '1'), add the exponents, XOR the signs
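A sketch of that recipe on raw bytes, under the same assumptions as the table above (bias of 4, implicit leading '1'); overflow and rounding are not handled:

```python
def decode(byte):
    # Helper to check results: 1 sign, 3 exponent (bias 4), 4 mantissa bits.
    sign = -1 if (byte >> 7) & 1 else 1
    return sign * (1 + (byte & 15) / 16) * 2 ** (((byte >> 4) & 7) - 4)

def fmul(a, b):
    sign = ((a ^ b) >> 7) & 1                   # XOR the signs
    exp = ((a >> 4) & 7) + ((b >> 4) & 7) - 4   # add exponents (keep the bias of 4 applied once)
    man = (16 + (a & 15)) * (16 + (b & 15))     # multiply mantissas with the extra '1' prepended
    if man >= 512:                              # product of two 1.xxxx values lies in [1, 4):
        man >>= 1                               # renormalize when it lands in [2, 4)
        exp += 1
    assert 0 <= exp <= 7, "overflow/underflow not handled in this sketch"
    return (sign << 7) | (exp << 4) | ((man >> 4) & 15)  # keep 4 mantissa bits (truncating)
```

For example, `fmul(0b01001000, 0b01010000)` multiplies 1.5 by 2.0 and decodes back to 3.0.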
DIVISION :
Divide the mantissas (after prepending the extra '1'), subtract the exponents, XOR the signs
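The same idea for division (same assumed layout: bias 4, implicit leading '1'; out-of-range results are not handled). The quotient of two 1.xxxx values lies in (0.5, 2), so at most one renormalization shift is needed:

```python
def decode(byte):
    # Helper to check results: 1 sign, 3 exponent (bias 4), 4 mantissa bits.
    sign = -1 if (byte >> 7) & 1 else 1
    return sign * (1 + (byte & 15) / 16) * 2 ** (((byte >> 4) & 7) - 4)

def fdiv(a, b):
    sign = ((a ^ b) >> 7) & 1                        # XOR the signs
    exp = ((a >> 4) & 7) - ((b >> 4) & 7) + 4        # subtract exponents (re-apply the bias of 4)
    man = ((16 + (a & 15)) << 4) // (16 + (b & 15))  # quotient of 1.xxxx values, 4 fraction bits
    if man < 16:                                     # quotient in (0.5, 1): renormalize
        man <<= 1
        exp -= 1
    assert 0 <= exp <= 7, "overflow/underflow not handled in this sketch"
    return (sign << 7) | (exp << 4) | (man & 15)
```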
ADDITION :
It's a bit more tricky.
Take the number with the smaller exponent and shift its mantissa right so it matches the larger exponent. Then add the mantissas.
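Those two steps look like this in Python (same assumed layout as above; this sketch only handles operands of the same sign — mixed signs need a mantissa subtraction instead):

```python
def decode(byte):
    # Helper to check results: 1 sign, 3 exponent (bias 4), 4 mantissa bits.
    sign = -1 if (byte >> 7) & 1 else 1
    return sign * (1 + (byte & 15) / 16) * 2 ** (((byte >> 4) & 7) - 4)

def fadd(a, b):
    assert (a >> 7) == (b >> 7), "mixed signs not handled in this sketch"
    sign = (a >> 7) & 1
    ea, eb = (a >> 4) & 7, (b >> 4) & 7
    ma, mb = 16 + (a & 15), 16 + (b & 15)    # mantissas with the extra '1' prepended
    if ea < eb:                              # make 'a' the one with the larger exponent
        ea, eb, ma, mb = eb, ea, mb, ma
    mb >>= ea - eb                           # shift the smaller number to match the larger exponent
    man = ma + mb                            # then add the mantissas
    if man >= 32:                            # sum reached [2, 4): renormalize
        man >>= 1
        ea += 1
    assert ea <= 7, "overflow not handled in this sketch"
    return (sign << 7) | (ea << 4) | (man & 15)
```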
SUBTRACTION :
Just flip the sign of the second operand, then perform an addition.
CONVERT TO INT :
Prepend the extra '1' to the mantissa, then shift left "exponent" times (right if the exponent is negative, throwing bits away). Negate if the number was negative.
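In code (same assumed 1-3-4 layout with bias 4; this truncates toward zero, like C-style float-to-int conversion):

```python
def to_int(x):
    exp = ((x >> 4) & 7) - 4                 # undo the bias of 4
    man = 16 + (x & 15)                      # mantissa with the extra '1': value is man/16 * 2^exp
    shift = exp - 4                          # man/16 * 2^exp  ==  man * 2^(exp - 4)
    value = man << shift if shift >= 0 else man >> -shift  # right shift throws bits away
    return -value if (x >> 7) & 1 else value
```

For example 15.5 (`0b01111111`) converts to 15, the fractional half being thrown away by the right shift.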
CONVERT FROM INT :
Take the absolute value. The exponent is equal to the position of the first '1' bit in the int, and the mantissa is what follows it. Match the sign with the original int.
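A sketch of that direction too (same assumed layout; zero has no normalized encoding in this toy format, and ints above 15 would need rounding logic, so both are left out):

```python
def from_int(n):
    sign = 1 if n < 0 else 0
    n = abs(n)                               # take the absolute value
    assert 1 <= n <= 15, "0 and larger ints need special handling, not done in this sketch"
    exp = n.bit_length() - 1                 # position of the first '1' bit
    man = (n << (4 - exp)) & 15              # the bits that follow it, left-aligned into 4 bits
    return (sign << 7) | ((exp + 4) << 4) | man  # re-apply the bias of 4, match the sign
```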
Now all this sounds really trivial to me; I don't know why I thought it was so complex for all those years. Of course it gets more complicated once you have to check for special cases, overflow, underflow, etc., and multiplication and division are a bit more involved with rounding, but the idea is simple.