Floating-Point Types
Introduction
Digital computers are great, but when it comes to computing with arbitrary precision there is a problem: there is no way to represent every possible number, since a single number might take up an infinite amount of space. Thankfully, clever engineers invented floating-point values, which offer a reasonable trade-off between range and precision.
History of floating-point
According to Wikipedia, the invention of floating-point values dates back to as early as 1914, when Leonardo Torres y Quevedo proposed an early form of floating-point values for a calculator design.
The Z1, built by Konrad Zuse in 1938, was the first freely programmable computer, and it already had full support for 24-bit floating-point values!
The Z3, completed in 1941, even had representations for ±∞.
Because floating-point numbers involve many subtleties with no single obvious way of doing things, by the 1980s there were many incompatible floating-point implementations. This was the source of many problems and led to the standardization of floating point in IEEE 754.
(Binary) floating-point numbers - a refresher
Computers store (binary) floating-point numbers by breaking them up into a sign, n digits of the fraction, and an exponent stored in m bits. A floating-point number thus needs 1 sign bit + n fraction bits + m exponent bits of space. The special values that floating-point numbers can represent are ±zero, ±∞, and NaN (not a number).
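As a concrete illustration, here is a minimal sketch in Rust (assuming Rust's f32, whose IEEE 754 layout is 1 sign bit, 8 exponent bits, and 23 fraction bits) that extracts the three fields from a value's bit pattern:

```rust
fn main() {
    let x: f32 = -6.25;
    let bits = x.to_bits(); // the raw IEEE 754 bit pattern as a u32

    let sign = bits >> 31;              // 1 sign bit
    let exponent = (bits >> 23) & 0xff; // 8 exponent bits (biased by 127)
    let fraction = bits & 0x7f_ffff;    // 23 fraction bits

    println!("sign = {sign}, exponent = {exponent}, fraction = {fraction:023b}");
}
```

For -6.25 this prints a sign of 1, an exponent of 129 (i.e. 2 after subtracting the bias of 127), and the fraction bits of the significand 1.5625.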
In IEEE 754, every operation (+, -, *, /, ...) is always defined for floating-point values, even division by zero: dividing a nonzero number by zero yields ±∞, while 0/0 yields NaN. A NaN is always unequal to any other number, including itself. If a result gets too big to be represented by the floating-point type, it overflows to ∞.
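The following sketch demonstrates these rules (again assuming Rust, here with f64):

```rust
fn main() {
    // Division by zero is defined: a nonzero dividend gives ±∞,
    // while 0.0 / 0.0 gives NaN.
    println!("{}", 1.0_f64 / 0.0);  // inf
    println!("{}", -1.0_f64 / 0.0); // -inf
    println!("{}", 0.0_f64 / 0.0);  // NaN

    // NaN is unequal to every number, including itself.
    let nan = f64::NAN;
    println!("{}", nan == nan);   // false
    println!("{}", nan.is_nan()); // true: the reliable way to test for NaN

    // Results too large for the type overflow to infinity.
    println!("{}", f64::MAX * 2.0); // inf
}
```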
Because of how floating-point numbers are stored, their density decreases the further you are from zero: the gap between adjacent representable values grows with magnitude. The result is that, within the normal range of a floating-point type, the relative rounding error is always bounded by a fixed machine epsilon, even though the absolute error grows.
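A short sketch of both effects, assuming Rust's f64 (machine epsilon roughly 2.2e-16):

```rust
fn main() {
    // Near 1.0 the gap to the next f64 is tiny (f64::EPSILON, ~2.2e-16)...
    let one = 1.0_f64;
    println!("{}", one + f64::EPSILON > one); // true

    // ...but near 1e16 the gap between adjacent f64 values is 2.0,
    // so adding 1.0 is lost entirely.
    let big = 1.0e16_f64;
    println!("{}", big + 1.0 == big); // true

    // The relative size of the gap stays bounded by epsilon, however:
    let next = big + 2.0; // the next representable value after 1e16
    println!("{}", (next - big) / big); // about 2e-16
}
```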
32-bit floats
If the type of a float is not inferable, it is assumed to be 64-bit. That is why we have to explicitly mark 4.0 as an f32 in the example below.
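A minimal sketch of the idea, assuming Rust-style syntax, where a type annotation or a literal suffix selects the 32-bit type:

```rust
fn main() {
    let a = 4.0;      // no other constraints, so this defaults to f64
    let b: f32 = 4.0; // an explicit annotation makes it an f32
    let c = 4.0_f32;  // equivalently, a literal suffix selects f32

    println!("{a} {b} {c}");
}
```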
64-bit floats
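A 64-bit float (IEEE 754 binary64) offers about 15-16 significant decimal digits and, as noted above, is the default for float literals. A minimal sketch, assuming Rust's f64:

```rust
fn main() {
    // f64 is the default type for float literals.
    let x = 0.1 + 0.2;
    println!("{x}"); // 0.30000000000000004: neither 0.1 nor 0.2
                     // is exactly representable in binary64
}
```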
16-bit floats
Type f16 is currently not supported.
128-bit floats
Type f128 is currently not supported.