Floating Point Arithmetic


  • The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number.


Representation of Floating Point Numbers–

  •  The following description explains the terminology and primary details of IEEE 754 binary floating point representation. The discussion is confined to the single and double precision formats.
  •  Usually, a real number in binary is represented in the following format:

                ImIm−1 … I2I1I0 . F1F2 … Fn−1Fn
  •  Where each Im and Fn is either 0 or 1, for the integer and fraction parts respectively. A finite number can also be represented by four integer components: a sign (s), a base (b), a significand (m), and an exponent (e). The numerical value of the number is then evaluated as

                (−1)^s × m × b^e, where m < |b|
  • Depending on the base and the number of bits used to encode the various components, the IEEE 754 standard defines five basic formats. Among the five formats, the binary32 and the binary64 formats are the single precision and double precision formats respectively, in which the base is 2.
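As an illustrative sketch of the binary32 layout (Python assumed; binary32_fields is a hypothetical helper built on the standard struct module):

import struct

def binary32_fields(x):
    """Unpack a float into the IEEE 754 binary32 sign/exponent/fraction fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]  # raw 32-bit pattern
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent
    fraction = bits & 0x7FFFFF         # 23-bit fraction (hidden leading 1 not stored)
    return sign, exponent, fraction

# -6.25 = (-1)^1 x 1.5625 x 2^2, so sign = 1 and biased exponent = 2 + 127 = 129
print(binary32_fields(-6.25))          # (1, 129, 4718592), i.e. fraction 0x480000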

One More Simple Way to Understand –

An n-digit floating point number in base β has the form 

          x = ±(0.d1d2 · · · dn)β × β^e

where
0.d1d2 · · · dn is a β-fraction called the mantissa and e is an integer called the exponent. Such a floating point number is called normalized if
d1 ≠ 0; otherwise d1 = d2 = · · · = dn = 0.
The exponent e is limited to a range m < e < M. Usually, m = −M.
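For instance (a minimal Python illustration), the standard math.frexp function returns exactly this normalized decomposition for base β = 2:

import math

# math.frexp returns (mantissa, exponent) with x = mantissa * 2**exponent
# and 0.5 <= |mantissa| < 1 -- the normalized form above with base β = 2.
m, e = math.frexp(6.25)
print(m, e)          # 0.78125 3, since 6.25 = 0.78125 * 2**3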

Operations–

  • A floating-point operation is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers (as opposed to binary integer operations).
  • Floating-point numbers have decimal points in them. The number 2.0 is a floating-point number because it has a decimal point in it. The number 2 (without a decimal point) is a binary integer.
  • Floating-point operations involve floating-point numbers and typically take longer to execute than simple binary integer operations. For this reason, most embedded applications avoid widespread use of floating-point math in favor of faster, smaller integer operations, as the sketch below illustrates.
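A small Python sketch of the distinction, plus the common fixed-point workaround (the cents-based prices here are a hypothetical example):

# Python distinguishes floats from integers by the decimal point:
a = 2        # int: exact binary integer
b = 2.0      # float: IEEE 754 binary64 value
print(type(a), type(b))    # <class 'int'> <class 'float'>
print(7 // 2, 7 / 2)       # integer division gives 3; float division gives 3.5

# A common embedded-style workaround: scale to fixed-point integers.
price_cents = 1999               # $19.99 stored as an integer number of cents
total_cents = price_cents * 3    # exact integer arithmetic, no rounding error
print(total_cents / 100)         # convert only for display: 59.97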

Normalization–

  • Many floating point representations have an implicit hidden bit in the mantissa. This is a bit which is present virtually in the mantissa, but not stored in memory because its value is always 1 in a normalized number.
  • We say that a floating point number is normalized if the fraction is at least 1/b, where b is the base. In other words, the mantissa would be too large to fit if it were multiplied by the base. Non-normalized numbers are sometimes called denormal; they contain less precision than the representation can normally hold.
  • If the number is not normalized, then you can subtract 1 from the exponent while multiplying the mantissa by the base, and get another floating point number with the same value. Normalization consists of doing this repeatedly until the number is normalized, as in the sketch below. Two distinct normalized floating point numbers cannot be equal in value.
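A toy Python sketch of this normalization rule (the normalize helper is hypothetical; it works on (mantissa, exponent) pairs rather than the bit patterns real hardware uses):

def normalize(mantissa, exponent, base=2):
    """Toy sketch: scale the mantissa into [1/base, 1), adjusting the exponent."""
    if mantissa == 0:
        return 0.0, 0
    while abs(mantissa) >= 1:           # mantissa too large: shift right
        mantissa /= base
        exponent += 1
    while abs(mantissa) < 1 / base:     # mantissa too small: shift left
        mantissa *= base
        exponent -= 1
    return mantissa, exponent

print(normalize(0.125, 0))   # (0.5, -2), since 0.125 = 0.5 * 2**-2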

Pitfalls of floating point representation–

  • Current critical systems often use a great deal of floating-point computation, so the testing and static analysis of programs containing floating-point operators has become a priority. However, correctly defining the semantics of common implementations of floating-point is tricky, because the semantics may change according to many factors beyond the source-code level, such as choices made by compilers. A concrete example of the kind of problem that can appear is given below.
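One well-known concrete pitfall (a minimal Python illustration):

import math

# 0.1 has no exact binary representation, so an "obvious" equality fails:
print(0.1 + 0.2 == 0.3)              # False
print(0.1 + 0.2)                     # 0.30000000000000004

# The usual remedy is comparison within a tolerance:
print(math.isclose(0.1 + 0.2, 0.3))  # True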

Error in numerical computation–

  • Often in numerical analysis we have to make approximations in order to compute things. In calculus we can take limits and arrive at exact results, but when we use a computer to calculate something as simple as a derivative, we cannot take an infinite limit, so we have to approximate the answer, and the approximation therefore has error. Computers are limited in their capacity to store and calculate the precision and magnitude of numbers.
  • We can characterize error in measurements and computations with respect to their accuracy and their precision. Accuracy refers to how closely a calculated value agrees with the true value. Precision refers to how closely calculated values agree with each other. Inaccuracy (also called bias) is a systematic deviation from the true values. Imprecision (also called uncertainty) refers to how close together calculated results are to each other. We use the term error to represent both inaccuracy and imprecision in our results.
  • The relationship between the exact result and the approximation can be formulated as

                true value = approximation + error
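As a minimal Python sketch of both points above, a forward-difference approximation to a derivative (the forward_diff helper is a hypothetical name) makes the error visible:

import math

def forward_diff(f, x, h):
    """Approximate f'(x) by the forward difference (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

true_value = math.cos(1.0)           # exact derivative of sin at x = 1
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    approx = forward_diff(math.sin, 1.0, h)
    print(h, approx, true_value - approx)   # error = true value - approximation
# Shrinking h reduces truncation error at first, but for very small h
# floating-point round-off dominates and the error grows again.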

Iterative methods

  • In computational mathematics, an iterative method is a mathematical procedure that uses an initial guess to generate a sequence of improving approximate solutions for a class of problems, in which the nth approximation is derived from the previous ones.
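A minimal example of an iterative method, assuming Python (heron_sqrt is a hypothetical name): Heron's rule for square roots, where each approximation is derived from the previous one:

def heron_sqrt(a, x0=1.0, tol=1e-12, max_iter=50):
    """Heron's rule: x_{n+1} = (x_n + a/x_n) / 2 converges to sqrt(a)."""
    x = x0                              # initial guess
    for _ in range(max_iter):
        x_next = 0.5 * (x + a / x)      # improved approximation
        if abs(x_next - x) < tol:       # stop once the sequence settles
            return x_next
        x = x_next
    return x

print(heron_sqrt(2.0))   # 1.4142135623730951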

Bisection method–

  • The bisection method in mathematics is a root-finding method that repeatedly bisects an interval and then selects the subinterval in which a root must lie for further processing. The method is also called the interval halving method, the binary search method, or the dichotomy method.

How To Solve

  • The bisection method is an approximation method for finding the roots of an equation by repeatedly dividing an interval. It halves the interval again and again until the resulting interval, in which a root must lie, is extremely small.
There is no specific closed-form formula for the root of a function in the bisection method; instead, the interval is narrowed step by step:


For the ith iteration, the interval width is:

Δxi = (1/2) Δxi−1, i.e., Δxi = (0.5)^i (n − m), where n > m

So the new midpoint is xi = mi−1 + Δxi

for i = 1, 2, 3, ..., n

  • Below are the steps to obtain the solution for a continuous function g(x); a code sketch follows the steps.
Step 1: Find two points, say m and n, such that m < n and g(m) * g(n) < 0.

Step 2: Find the midpoint of m and n, say t.

Step 3: t is a root of the function if g(t) = 0; otherwise follow the next step.

Step 4: Divide the interval [m, n]: if g(t) * g(n) < 0, let m = t; else if g(t) * g(m) < 0, let n = t.

Step 5: Repeat steps 2 to 4 until g(t) = 0 (or the interval is sufficiently small).
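A sketch of these steps in Python (the bisection function name and its tolerance parameters are illustrative assumptions, not part of the original steps):

def bisection(g, m, n, tol=1e-6, max_iter=100):
    """Steps 1-5 above; assumes g is continuous and g(m) * g(n) < 0."""
    if g(m) * g(n) >= 0:
        raise ValueError("g(m) and g(n) must have opposite signs")  # Step 1
    for _ in range(max_iter):
        t = (m + n) / 2                    # Step 2: midpoint
        if g(t) == 0 or (n - m) / 2 < tol:
            return t                       # Step 3: t is (close enough to) a root
        if g(t) * g(n) < 0:                # Step 4: root lies in [t, n]
            m = t
        else:                              # otherwise root lies in [m, t]
            n = t
    return t

# The worked example below: g(x) = x^3 - 5 + 3x on [1, 2]
print(bisection(lambda x: x**3 - 5 + 3*x, 1, 2))   # ~1.1542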
  •  Example: Find the root of the polynomial g(x) = x^3 - 5 + 3x using the bisection method, where m = 1 and n = 2.
    Solution:
    First find the value of g(x) at m = 1 and n = 2:

    g(1) = 1^3 - 5 + 3*1 = -1 < 0

    g(2) = 2^3 - 5 + 3*2 = 9 > 0

    Since the function is continuous, a root lies in the interval [1, 2].

    Let t be the midpoint of the interval, i.e., t = (1 + 2)/2 = 1.5

    The value of the function at t is

    g(1.5) = (1.5)^3 - 5 + 3*(1.5) = 2.875

    As g(t) is positive, n = 2 is replaced with t = 1.5 for the next iteration, so that g(m) and g(n) keep opposite signs.

After nine iterations of this procedure the interval width has shrunk to (0.5)^9 = 1/512 ≈ 0.00195, and we choose m = 1.15234375 as our approximate solution.

Regula Falsi method–


The Regula Falsi method is one of the oldest methods for computing the real roots of an algebraic equation. The worked example below shows how to compute a root of an algebraic equation using the Regula Falsi method; trying similar practice problems on your own will improve your problem-solving skills.

Example:
Find the root between (2, 3) of x^3 - 2x - 5 = 0 by using the Regula Falsi method. Given
f(x) = x^3 - 2x - 5
f(2) = 2^3 - 2(2) - 5 = -1 (negative)
f(3) = 3^3 - 2(3) - 5 = 16 (positive)

Let us take a = 2 and b = 3.
The first approximation to the root is x1, given by
x1 = (a f(b) - b f(a))/(f(b) - f(a))
= (2 × f(3) - 3 × f(2))/(f(3) - f(2))
= (2 × 16 - 3 × (-1))/(16 - (-1))
= (32 + 3)/(16 + 1) = 35/17
= 2.058

Now f(2.058) = 2.058^3 - 2 × 2.058 - 5
= 8.716 - 4.116 - 5
= -0.4
The root lies between 2.058 and 3

Taking a = 2.058 and b = 3, we have the second approximation to the root given by
x2 = (a f(b) - b f(a))/(f(b) - f(a))
= (2.058 × f(3) - 3 × f(2.058))/(f(3) - f(2.058))
= (2.058 × 16 - 3 × (-0.4))/(16 - (-0.4))
= 2.081

Now f(2.081) = 2.081^3 - 2 × 2.081 - 5
= -0.15
The root lies between 2.081 and 3
Take a = 2.081 and b = 3
The third approximation to the root is given by
x3 = (a f(b) - b f(a))/(f(b) - f(a))
= (2.081 × 16 - 3 × (-0.15))/(16 - (-0.15))
= 2.090
The root is approximately 2.09.
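A Python sketch of the same iteration (the regula_falsi name and its stopping tolerance are illustrative choices, not from the worksheet):

def regula_falsi(f, a, b, tol=1e-6, max_iter=100):
    """False-position iteration x = (a*f(b) - b*f(a)) / (f(b) - f(a))."""
    fa, fb = f(a), f(b)
    if fa * fb >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    x = a
    for _ in range(max_iter):
        x = (a * fb - b * fa) / (fb - fa)   # secant through (a, fa), (b, fb)
        fx = f(x)
        if abs(fx) < tol:
            break
        if fa * fx < 0:        # root lies in [a, x]
            b, fb = x, fx
        else:                  # root lies in [x, b]
            a, fa = x, fx
    return x

print(regula_falsi(lambda x: x**3 - 2*x - 5, 2, 3))   # ~2.0946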

Newton–Raphson method

  • The Newton-Raphson method, or Newton Method, is a powerful technique for solving equations numerically. Like so much of the differential calculus, it is based on the simple idea of linear approximation. The Newton Method, properly used, usually homes in on a root with devastating efficiency.
Let r be the true root of f(x) = 0, let x0 be a good estimate of r, and write r = x0 + h. Since h = r − x0, the number h measures how far the estimate x0 is from the truth.
    
        Since h is ‘small,’ we can use the linear (tangent line) approximation to conclude that

             0 = f(r) = f(x0 + h) ≈ f(x0) + h f’(x0),

and therefore, unless f’(x0) is close to 0,

                                   h ≈ − f(x0)/ f’(x0) .

It follows that

r = x0 + h ≈ x0 − f(x0) / f’(x0) .

Our new improved (?) estimate x1 of r is therefore given by

x1 = x0 − f(x0) / f’(x0) .

The next estimate x2 is obtained from x1 in exactly the same way as x1 was obtained from x0:
x2 = x1 − f(x1) / f’(x1) .

Continue in this way. If xn is the current estimate, then the next estimate xn+1 is given by

xn+1 = xn − f(xn) / f’(xn)
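A minimal Python sketch of this iteration (the newton helper and its stopping rule are illustrative assumptions):

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) from the estimate x0."""
    x = x0
    for _ in range(max_iter):
        dfx = fprime(x)
        if dfx == 0:                 # tangent is flat: the method breaks down
            raise ZeroDivisionError("f'(x) vanished at x = %r" % x)
        x_next = x - f(x) / dfx
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Same equation as the Regula Falsi example: f(x) = x^3 - 2x - 5
print(newton(lambda x: x**3 - 2*x - 5, lambda x: 3*x**2 - 2, 2.0))  # ~2.0945515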
