It may sound funny, but it's possible to do floating point in 8 bits.
1 - bit sign
3 - bit exponent
4 - bit mantissa
Smallest possible number : 1 * (1/16) = 0.0625 (mantissa bits all zero, with the implicit '1')
Precision of 1/256 (0.00390625) between 1/16 (= 0.0625) and 1/8 (= 0.125)
Precision of 1/128 between 1/8 and 1/4
Precision of 1/64 between 1/4 and 1/2
Precision of 1/32 between 1/2 and 1
Precision of 1/16 between 1 and 2
Precision of 1/8 between 2 and 4
Precision of 1/4 between 4 and 8
Precision of 1/2 between 8 and 15.5
Largest possible number : 15.5
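To make the layout concrete, here is a small Python decoder for this 1-3-4 format. The exponent bias of 4 and the implicit leading '1' are my assumptions (they are what make the table above come out right: largest value 15.5, smallest 0.0625); subnormals, infinities and NaNs don't exist in this toy format.

```python
# Decoder for the toy 1-3-4 format: 1 sign bit, 3 exponent bits, 4 mantissa bits.
# Assumed: exponent bias of 4, implicit leading '1' on the mantissa,
# no subnormals, infinities or NaNs.

def decode(byte):
    sign = -1 if (byte >> 7) & 1 else 1
    exponent = ((byte >> 4) & 0b111) - 4      # stored 0..7 maps to -4..3
    mantissa = 1 + (byte & 0b1111) / 16       # implicit leading 1
    return sign * mantissa * 2 ** exponent

print(decode(0b00000000))   # smallest positive number
print(decode(0b01111111))   # largest number
```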
Not very useful, but I realized I had always assumed floating point was an extremely complex thing when it isn't. It's just that so many bits are thrown at it that you can't tell what it's doing anymore. This helped me see how it works internally.
Operations are actually very simple :
MULTIPLICATION :
Multiply the mantissas (after prepending the extra '1'), add the exponents, XOR the signs
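A sketch of that recipe on raw bytes, under the same assumptions as the table above (bias of 4, implicit leading '1'); overflow and rounding are not handled:

```python
def decode(byte):
    # Helper to check results: 1 sign, 3 exponent (bias 4), 4 mantissa bits.
    sign = -1 if (byte >> 7) & 1 else 1
    return sign * (1 + (byte & 15) / 16) * 2 ** (((byte >> 4) & 7) - 4)

def fmul(a, b):
    sign = ((a ^ b) >> 7) & 1                   # XOR the signs
    exp = ((a >> 4) & 7) + ((b >> 4) & 7) - 4   # add exponents (keep the bias of 4 applied once)
    man = (16 + (a & 15)) * (16 + (b & 15))     # multiply mantissas with the extra '1' prepended
    if man >= 512:                              # product of two 1.xxxx values lies in [1, 4):
        man >>= 1                               # renormalize when it lands in [2, 4)
        exp += 1
    assert 0 <= exp <= 7, "overflow/underflow not handled in this sketch"
    return (sign << 7) | (exp << 4) | ((man >> 4) & 15)  # keep 4 mantissa bits (truncating)
```

For example, `fmul(0b01001000, 0b01010000)` multiplies 1.5 by 2.0 and decodes back to 3.0.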
DIVISION :
Divide the mantissas (after prepending the extra '1'), subtract the exponents, XOR the signs
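The same idea for division (same assumed layout: bias 4, implicit leading '1'; out-of-range results are not handled). The quotient of two 1.xxxx values lies in (0.5, 2), so at most one renormalization shift is needed:

```python
def decode(byte):
    # Helper to check results: 1 sign, 3 exponent (bias 4), 4 mantissa bits.
    sign = -1 if (byte >> 7) & 1 else 1
    return sign * (1 + (byte & 15) / 16) * 2 ** (((byte >> 4) & 7) - 4)

def fdiv(a, b):
    sign = ((a ^ b) >> 7) & 1                        # XOR the signs
    exp = ((a >> 4) & 7) - ((b >> 4) & 7) + 4        # subtract exponents (re-apply the bias of 4)
    man = ((16 + (a & 15)) << 4) // (16 + (b & 15))  # quotient of 1.xxxx values, 4 fraction bits
    if man < 16:                                     # quotient in (0.5, 1): renormalize
        man <<= 1
        exp -= 1
    assert 0 <= exp <= 7, "overflow/underflow not handled in this sketch"
    return (sign << 7) | (exp << 4) | (man & 15)
```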
ADDITION :
It's a bit more tricky.
Take the number with the smaller exponent and shift its mantissa right so it matches the larger exponent. Then add the mantissas.
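Those two steps look like this in Python (same assumed layout as above; this sketch only handles operands of the same sign — mixed signs need a mantissa subtraction instead):

```python
def decode(byte):
    # Helper to check results: 1 sign, 3 exponent (bias 4), 4 mantissa bits.
    sign = -1 if (byte >> 7) & 1 else 1
    return sign * (1 + (byte & 15) / 16) * 2 ** (((byte >> 4) & 7) - 4)

def fadd(a, b):
    assert (a >> 7) == (b >> 7), "mixed signs not handled in this sketch"
    sign = (a >> 7) & 1
    ea, eb = (a >> 4) & 7, (b >> 4) & 7
    ma, mb = 16 + (a & 15), 16 + (b & 15)    # mantissas with the extra '1' prepended
    if ea < eb:                              # make 'a' the one with the larger exponent
        ea, eb, ma, mb = eb, ea, mb, ma
    mb >>= ea - eb                           # shift the smaller number to match the larger exponent
    man = ma + mb                            # then add the mantissas
    if man >= 32:                            # sum reached [2, 4): renormalize
        man >>= 1
        ea += 1
    assert ea <= 7, "overflow not handled in this sketch"
    return (sign << 7) | (ea << 4) | (man & 15)
```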
SUBTRACTION :
Just flip the sign of the second operand, then perform an addition.
CONVERT TO INT :
Prepend the extra '1' to the mantissa, then shift left "exponent" times (right if the exponent is negative, throwing bits away). Negate if the number was negative.
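In code (same assumed 1-3-4 layout with bias 4; this truncates toward zero, like C-style float-to-int conversion):

```python
def to_int(x):
    exp = ((x >> 4) & 7) - 4                 # undo the bias of 4
    man = 16 + (x & 15)                      # mantissa with the extra '1': value is man/16 * 2^exp
    shift = exp - 4                          # man/16 * 2^exp  ==  man * 2^(exp - 4)
    value = man << shift if shift >= 0 else man >> -shift  # right shift throws bits away
    return -value if (x >> 7) & 1 else value
```

For example 15.5 (`0b01111111`) converts to 15, the fractional half being thrown away by the right shift.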
CONVERT FROM INT :
Take the absolute value. The exponent is equal to the position of the first '1' bit in the int, and the mantissa is what follows it. Match the sign with the original int.
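A sketch of that direction too (same assumed layout; zero has no normalized encoding in this toy format, and ints above 15 would need rounding logic, so both are left out):

```python
def from_int(n):
    sign = 1 if n < 0 else 0
    n = abs(n)                               # take the absolute value
    assert 1 <= n <= 15, "0 and larger ints need special handling, not done in this sketch"
    exp = n.bit_length() - 1                 # position of the first '1' bit
    man = (n << (4 - exp)) & 15              # the bits that follow it, left-aligned into 4 bits
    return (sign << 7) | ((exp + 4) << 4) | man  # re-apply the bias of 4, match the sign
```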
Now all this sounds really trivial to me; I don't know why I thought it was so complex for all those years. Of course it gets more complicated once you have to check for special cases, overflow, underflow, etc., and multiplication and division are a bit more involved with rounding, but the idea is simple.