Toby Opferman
http://www.opferman.net
programming@opferman.net
IEEE Floating Point
In this simple tutorial we will learn the IEEE floating point formats for
single, double and extended precision, and how to convert to and from
these formats. Before you read this I assume you can convert whole binary
numbers to decimal. This tutorial will teach you how to convert real numbers
to floating point, but the only new part is what lies beyond the binary
point. The whole number is still the same conversion, so you should read the
number base tutorial first if you do not already know how.
Single Precision is 32 bits (4 Bytes)
[ 1 Sign Bit | 8 Bit Exponent | 23 Bit Mantissa ]
Double Precision is 64 bits (8 Bytes)
[ 1 Sign Bit | 11 Bit Exponent | 52 Bit Mantissa ]
Extended Precision is 80 bits (10 Bytes)
[ 1 Sign Bit | 15 Bit Exponent | 64 Bit Mantissa ]
Sign Bit is 1 = Negative, 0 = Positive
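As a rough sketch of these layouts, the Python below cuts a raw big-endian
hex string into its three fields. The name split_fields and its parameters
are made up for this illustration. The widths passed in are the stored field
sizes (for double that is 52 mantissa bits, since 1 + 11 + 52 = 64).

```python
def split_fields(hex_bytes, exp_bits, man_bits):
    """Split big-endian hex bytes into (sign, stored exponent, mantissa)."""
    raw = int(hex_bytes.replace(" ", ""), 16)
    mantissa = raw & ((1 << man_bits) - 1)                 # lowest bits
    exponent = (raw >> man_bits) & ((1 << exp_bits) - 1)   # middle bits
    sign = raw >> (man_bits + exp_bits)                    # top bit
    return sign, exponent, mantissa

print(split_fields("3F 80 00 00", 8, 23))                     # single
print(split_fields("40 00 00 00 00 00 00 00", 11, 52))        # double
print(split_fields("3F FF 80 00 00 00 00 00 00 00", 15, 64))  # extended
```

The exponent comes back still in its stored (excess) form, and the mantissa
comes back as a plain integer.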
The next table shows 5 different numbers in the 3 different IEEE formats:

Number   Single        Double                    Extended
1.0      3F 80 00 00   3F F0 00 00 00 00 00 00   3F FF 80 00 00 00 00 00 00 00
2.0      40 00 00 00   40 00 00 00 00 00 00 00   40 00 80 00 00 00 00 00 00 00
0.0      00 00 00 00   00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00 00 00
1.08     3F 8A 3D 71   3F F1 47 AE 14 7A E1 48   3F FF 8A 3D 70 A3 D7 0A 3D 71
10.333   41 25 53 F8   40 24 AA 7E F9 DB 22 D1   40 02 A5 53 F7 CE D9 16 87 2B
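If you want to check the single and double precision bytes yourself, a small
sketch with Python's standard struct module can print them. The helper name
hex_bytes is made up here. Note that struct has no 80 bit format, so this
only covers single (">f") and double (">d").

```python
import struct

def hex_bytes(value, fmt):
    """Return the big-endian IEEE bytes of value as spaced hex pairs."""
    return " ".join(f"{b:02X}" for b in struct.pack(fmt, value))

for number in (1.0, 2.0, 0.0, 1.08, 10.333):
    print(number, "|", hex_bytes(number, ">f"), "|", hex_bytes(number, ">d"))
```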
Single Precision
The exponent is stored in excess 127 and the mantissa is 1.xxxx (the leading 1 is implied, not stored)
3F 80 00 00
Sign Bit Exp 1.Mantissa
0 01111111 00000000000000000000000
127 - 127 = 0, so 1.0 is bitshifted 0 places
Exponent = 0, so the number is 1.0
Double Precision
The exponent is stored in excess 1023 and the mantissa is 1.xxxx
3F F0 00 00 00 00 00 00
Sign Bit Exp 1.Mantissa
0 01111111111 0000000000000000000000000000000000000000000000000000
Exponent stored Excess 1023
1023 - 1023 = 0, so 1.0 is bitshifted 0 places
1.0 is the answer.
Extended Precision
3F FF 80 00 00 00 00 00 00 00
Sign Bit Exp Mantissa
0 011111111111111 1000000000000000000000000000000000000000000000000000000000000000
Exponent stored Excess 16383
16383 - 16383 = 0, so 1.0 is bitshifted 0 places
1.0 is the answer.
Single Precision:
40 00 00 00
Sign Bit Exp 1.Mantissa
0 10000000 00000000000000000000000
128 - 127 = 1
1.0 bitshifted 1 place is 10.0 in binary, so the answer is 2.0
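The single precision steps so far can be sketched in Python like this. The
function decode_single is a made-up helper, and it only handles the cases
this tutorial covers: ordinary numbers with the implied 1, plus the all-zero
pattern. The last line compares against Python's own struct module.

```python
import struct

def decode_single(hex_bytes):
    """Decode 4 big-endian hex bytes by hand (normal numbers and zero only)."""
    raw = int(hex_bytes.replace(" ", ""), 16)
    sign = raw >> 31
    stored_exp = (raw >> 23) & 0xFF
    mantissa = raw & 0x7FFFFF
    if stored_exp == 0 and mantissa == 0:
        return 0.0                        # all bits zero is 0.0
    shift = stored_exp - 127              # take the excess off
    value = (1 + mantissa / 2**23) * 2.0**shift   # 1.mantissa, then shift
    return -value if sign else value

print(decode_single("3F 80 00 00"))
print(decode_single("40 00 00 00"))
print(decode_single("41 25 53 F8") == struct.unpack(">f", bytes.fromhex("412553F8"))[0])
```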
Now, you can see the double and extended versions of 2.0 work the same way,
and the next number, with every bit 0, is obviously 0.
But now it's time to take the Mantissa out and find out what it is.
3F 8A 3D 71
Sign Bit Exp 1.Mantissa
0 01111111 00010100011110101110001
Well, we know the exponent is 0, since the exponent bits are the same as in the
first example (127 - 127 = 0).
Now, to get the number it's almost the same as when you convert regular
binary to decimal, with a small difference.
Instead of each bit representing a positive power of 2, the mantissa bits
represent negative powers of 2 (starting left to right)
   0   0   0   1   0   1   0   0   0   1   1   1   1   0   1   0   1   1   1   0   0   0   1
  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23
So, you add up the powers of 2 that aren't 0. (You multiply it with the bit,
if the bit is 0, you will get 0 so only add up the ones with a set bit)
   1   1   1   1   1   1   1   1   1   1   1
  -4  -6 -10 -11 -12 -13 -15 -17 -18 -19 -23
2^-4 + 2^-6 + 2^-10 + 2^-11 + 2^-12 + 2^-13 + 2^-15 + 2^-17 + 2^-18 + 2^-19 + 2^-23
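= .080000042915
1.080000042915 * 2^0 = 1.080000042915
You are going to have trailing digits, because 1.08 cannot be written exactly
in a finite number of binary places. The stored value is the closest one that
fits in 23 mantissa bits.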
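Adding up the negative powers of 2 by hand is tedious, so here is a minimal
Python sketch of the same sum. The function name fraction_from_bits is made
up for the example; the bit string is the 23 bit mantissa of 3F 8A 3D 71.

```python
def fraction_from_bits(bits):
    """Sum 2^-n for every set bit, with n counted from 1 at the left."""
    return sum(2.0 ** -n for n, bit in enumerate(bits, start=1) if bit == "1")

frac = fraction_from_bits("00010100011110101110001")
print(frac)          # the part after the point
print(1 + frac)      # with the implied 1 put back
```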
To convert TO IEEE you do the following:
You divide the fractional part by 2^-1 (which is the same as multiplying it
by 2). The whole number part of each result, 0 or 1, is the next bit. Then
you take off the whole number and divide the remaining fraction again.
.08/2^-1 = 0.16 1
.16/2^-1 = 0.32 2
.32/2^-1 = 0.64 3
.64/2^-1 = 1.28 4
.28/2^-1 = 0.56 5
.56/2^-1 = 1.12 6
.12/2^-1 = 0.24 7
.24/2^-1 = 0.48 8
.48/2^-1 = 0.96 9
.96/2^-1 = 1.92 10
.92/2^-1 = 1.84 11
.84/2^-1 = 1.68 12
.68/2^-1 = 1.36 13
.36/2^-1 = 0.72 14
.72/2^-1 = 1.44 15
.44/2^-1 = 0.88 16
.88/2^-1 = 1.76 17
.76/2^-1 = 1.52 18
.52/2^-1 = 1.04 19
.04/2^-1 = 0.08 20
.08/2^-1 = 0.16 21
.16/2^-1 = 0.32 22
.32/2^-1 = 0.64 23
.64/2^-1 = 1.28 24
Result Bit#
0.16 1
0.32 2
0.64 3
1.28 4
0.56 5
1.12 6
0.24 7
0.48 8
0.96 9
1.92 10
1.84 11
1.68 12
1.36 13
0.72 14
1.44 15
0.88 16
1.76 17
1.52 18
1.04 19
0.08 20
0.16 21
0.32 22
0.64 23
1.28 24
Notice that the whole number parts (0 or 1) spell out the binary mantissa,
bit by bit, with 1 exception. We have a 0 in the 23rd bit place where the
binary above has a 1. This is because the value was taken out to 24 places,
like we did above, and rounded. Since bit 24 is a 1, bit 23 rounds up to a 1.
Therefore, we have gotten the same mantissa.
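The doubling steps and the rounding can be sketched in Python as follows.
Both function names are made up for this example. Fraction is used so that
.08 stays exact while we work. The rounding here simply adds the next bit,
as described above; that is a simplification of the real IEEE rounding rule,
and it does not handle a carry out of the top bit.

```python
from fractions import Fraction

def fraction_to_bits(fraction, count):
    """Generate `count` bits of a fraction by repeated doubling."""
    bits = ""
    for _ in range(count):
        fraction *= 2              # same as dividing by 2^-1
        whole = int(fraction)      # 0 or 1: this is the next bit
        bits += str(whole)
        fraction -= whole          # take off the whole number
    return bits

def round_to(bits, keep):
    """Keep `keep` bits, adding the following bit as the rounding bit."""
    rounded = int(bits[:keep], 2) + int(bits[keep])
    return format(rounded, f"0{keep}b")

bits24 = fraction_to_bits(Fraction("0.08"), 24)
print(bits24)                # 24 places, before rounding
print(round_to(bits24, 23))  # rounded to the 23 mantissa bits
```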
Now, we convert the whole number part as well (1 decimal is 1 binary) and we have:
1.00010100011110101110001
Now, we know we need to get it into 1.xxxx * 2^n form. But it is already
there, with n = 0. So, we knock off the 1 and keep the 00010100011110101110001
as the mantissa. For the exponent we store 0 + 127 = 127, so that decoding
gives 127 - 127 = 0 shifts. The sign bit is 0 as well.
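Putting the three pieces back together into bytes can be sketched like this
in Python. The helper pack_single is a made-up name. It takes the sign bit,
the number of shifts (before the excess is added) and the 23 mantissa bits
as a string.

```python
def pack_single(sign, shift, mantissa_bits):
    """Assemble sign, shift and 23 mantissa bits into 4 hex bytes."""
    stored_exp = shift + 127                        # add the excess
    raw = (sign << 31) | (stored_exp << 23) | int(mantissa_bits, 2)
    return " ".join(f"{b:02X}" for b in raw.to_bytes(4, "big"))

print(pack_single(0, 0, "00010100011110101110001"))   # 1.08
print(pack_single(0, 1, "0" * 23))                    # 2.0
```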
10.333
Next we will decode 10.333 in both the double precision and the extended
precision formats.
----------------------------------------------
Double Precision
40 24 AA 7E F9 DB 22 D1
01000000 00100100 10101010 01111110 11111001 11011011 00100010 11010001
0 10000000010 0100101010100111111011111001110110110010001011010001
10000000010 = 1026
1026 - 1023 = 3
Remember, all exponents are stored in EXCESS, so you subtract the excess
FROM the stored exponent to get the shift. Remember also, a negative shift
means shift the binary point to the left and a positive shift means shift it
to the right. Only after the shift do you start counting mantissa positions.
Insert implied 1.
1.0100101010100111111011111001110110110010001011010001
Shift 3 places
1010.0101010100111111011111001110110110010001011010001
The whole number is 10. (1010b = Ah = 10)
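Moving the point is easy to get wrong by hand, so here is a small Python
sketch that does it on a bit string. The name shift_point is made up for the
example. A positive shift moves the point right, a negative shift moves it
left, and zeros are padded in where needed.

```python
def shift_point(bits, shift):
    """Move the binary point in a string like '1.01' by `shift` places."""
    whole, _, frac = bits.partition(".")
    digits = whole + frac
    pos = len(whole) + shift           # new index of the point
    if pos <= 0:                       # shifted left past the first digit
        digits = "0" * (1 - pos) + digits
        pos = 1
    if pos >= len(digits):             # shifted right past the last digit
        digits += "0" * (pos + 1 - len(digits))
    return digits[:pos] + "." + digits[pos:]

print(shift_point("1.0", 1))     # the 2.0 example
print(shift_point("1.1", -2))    # a negative shift
print(shift_point("1.0100101010100111111011111001110110110010001011010001", 3))
```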
The fractional part (the bits after the point):
0101010100111111011111001110110110010001011010001
Find the bit positions with 1
2, 4, 6, 8, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 25, 26, 27, 29, 30, 32, 33, 36, 40, 42, 43, 45, 49
2^-2 + 2^-4 + 2^-6 + 2^-8 + 2^-11 + 2^-12 + 2^-13 +
2^-14 + 2^-15 + 2^-16 + 2^-18 + 2^-19 + 2^-20 +
2^-21 + 2^-22 + 2^-25 + 2^-26 + 2^-27 + 2^-29 +
2^-30 + 2^-32 + 2^-33 + 2^-36 + 2^-40 + 2^-42 +
2^-43 + 2^-45 + 2^-49 = .333 (approximately)
Answer is 10.333
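The whole double precision decode can be sketched in Python as below. The
function split_double is a made-up helper and it is only a sketch: it assumes
a positive, ordinary number whose shift is between 0 and 52. Fraction keeps
the sum of the powers of 2 exact.

```python
from fractions import Fraction

def split_double(hex_bytes):
    """Return (whole number, fraction bit string) of a positive double."""
    raw = int(hex_bytes.replace(" ", ""), 16)
    stored_exp = (raw >> 52) & 0x7FF
    shift = stored_exp - 1023                           # take the excess off
    significand = (1 << 52) | (raw & ((1 << 52) - 1))   # insert implied 1
    frac_len = 52 - shift                 # bits left after the point
    whole = significand >> frac_len
    frac_bits = format(significand & ((1 << frac_len) - 1), f"0{frac_len}b")
    return whole, frac_bits

whole, frac_bits = split_double("40 24 AA 7E F9 DB 22 D1")
positions = [n for n, bit in enumerate(frac_bits, start=1) if bit == "1"]
fraction = sum(Fraction(1, 2**n) for n in positions)
print(whole, positions)
print(float(whole + fraction))
```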
-------------------------------------------------------
Extended Precision
40 02 A5 53 F7 CE D9 16 87 2B
0100 0000 0000 0010 1010 0101 0101 0011 1111 0111 1100 1110 1101 1001 0001 0110 1000 0111 0010 1011
0 100000000000010 1010010101010011111101111100111011011001000101101000011100101011
100000000000010 = 16386
16386 - 16383 = 3
So, you have 1.010010101010011111101111100111011011001000101101000011100101011
Move the decimal 3 places
1010.010101010011111101111100111011011001000101101000011100101011
Now, you will notice from this example, and from the earlier extended
precision example, that the first bit in the Mantissa is actually the whole
number: 1.xxxxx So, the mantissa is 63 fraction bits plus 1 bit for the whole
number, 64 bits in all. In the other forms, single and double, the 1
isn't written into the mantissa; it's implied to be there.
Now, if we look at the part above the binary point, we see it's 1010 binary = 10.
10.xxxx Now, we need to multiply out the powers of 2^-n and add.
mbit(n) = mantissa bit #n, counted from left to right after the point.
You can say the fraction is the Summation, for n = 1 to 60, of mbit(n) * 2^-n
Mantissa:
010101010011111101111100111011011001000101101000011100101011
The 1 is in bit positions:
2, 4, 6, 8, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 25, 26, 27, 29, 30, 32, 33, 36, 40, 42, 43, 45, 50, 51, 52, 55, 57, 59, 60
So,
2^-2 + 2^-4 + 2^-6 + 2^-8 + 2^-11 + 2^-12 + 2^-13 +
2^-14 + 2^-15 + 2^-16 + 2^-18 + 2^-19 + 2^-20 +
2^-21 + 2^-22 + 2^-25 + 2^-26 + 2^-27 + 2^-29 +
2^-30 + 2^-32 + 2^-33 + 2^-36 + 2^-40 + 2^-42 +
2^-43 + 2^-45 + 2^-50 + 2^-51 + 2^-52 + 2^-55 + 2^-57 + 2^-59 + 2^-60 = .333 (approximately)
Answer is 10.333
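Python has no built-in 80 bit float, so to check an extended precision value
you have to do it with whole numbers. The sketch below does that, using
Fraction to keep every bit. The name decode_extended is made up, and it only
handles ordinary numbers and zero. Notice that the 1 is not inserted here,
because it is already stored as the top mantissa bit.

```python
from fractions import Fraction

def decode_extended(hex_bytes):
    """Decode 10 big-endian hex bytes of extended precision exactly."""
    raw = int(hex_bytes.replace(" ", ""), 16)
    sign = raw >> 79
    stored_exp = (raw >> 64) & 0x7FFF
    significand = raw & ((1 << 64) - 1)   # top bit is the stored whole 1
    if significand == 0:
        return Fraction(0)
    shift = stored_exp - 16383            # take the excess off
    # The point sits after the first of the 64 bits, so divide by 2^63.
    value = Fraction(significand, 2**63) * Fraction(2) ** shift
    return -value if sign else value

print(decode_extended("3F FF 80 00 00 00 00 00 00 00"))
print(float(decode_extended("40 02 A5 53 F7 CE D9 16 87 2B")))
```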
Now, you see how the IEEE floating point format works in single precision,
double precision and extended precision. Besides the size of the exponent
and mantissa, the only difference between single/double and extended
is that single and double precision have an implied 1 bit (1.Mantissa) that
is not in the format itself, whereas in the extended format the 1 bit is
actually IN the mantissa as the first bit and the binary point is implied to
follow it.
And you notice again that the double precision value was rounded: bits 50 and
beyond did not fit, so they were rounded up into bit 49.
Single precision and double precision math done on the FPU should be decently
accurate, since the FPU of the PC (the x87) works internally with 80 bit
values. A single or double result is rounded from those extra bits.
Extended precision math does NOT have extra bits like the other two. It goes
to bit 80 and there is nothing beyond that to round from. So, extended
floating point numbers aren't always extremely accurate in their last places;
they may only be as accurate as the double precision result. Then again, you
do have more places, and it may help to have even an approximation of the
end. But just remember, the FPU carries single and double precision out to
80 bits, so they have good rounding approximations.
That is the end of the tutorial. You have seen the formats, we have decoded
them, and on one occasion we even converted a number into the format. So, you
should understand how to convert numbers to and from the IEEE single, double
and extended floating point formats.