How do you know that the least squares regression line is unique and corresponds to a minimum?

We already know that using the criterion of either

  1. minimizing sum of residuals OR
  2. minimizing sum of the absolute value of residuals

is a bad idea, because neither criterion gives a unique line. Visit these notes for an example where these criteria are shown to be inadequate.

So we use minimizing the sum of the squares of the residuals as the criterion. How can we show that this criterion gives a unique line?

The proof is given below as image files because it is equation intensive. I have also made a higher-resolution PDF file.
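A minimal sketch of the standard argument, assuming the straight-line model y = a_0 + a_1 x fitted to n data points (the symbols a_0 and a_1 are notation chosen here and may differ from the notation used in the proof images):

```latex
S_r = \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right)^2

% Setting the first derivatives to zero gives two linear equations in a_0, a_1
\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right) = 0, \qquad
\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - a_0 - a_1 x_i \right) = 0

% Second derivatives are constants
\frac{\partial^2 S_r}{\partial a_0^2} = 2n, \qquad
\frac{\partial^2 S_r}{\partial a_1^2} = 2 \sum_{i=1}^{n} x_i^2, \qquad
\frac{\partial^2 S_r}{\partial a_0 \partial a_1} = 2 \sum_{i=1}^{n} x_i

% Second derivative test
\frac{\partial^2 S_r}{\partial a_0^2} \frac{\partial^2 S_r}{\partial a_1^2}
 - \left( \frac{\partial^2 S_r}{\partial a_0 \partial a_1} \right)^2
 = 4 \left[ n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right]
 = 4 n \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 > 0
```

The last expression is positive as long as the x values are not all equal, and 2n is positive, so the only critical point is a minimum. Because the two first-derivative equations are linear with a nonzero determinant, that critical point is also unique.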


_____________________________________________________

This post is brought to you by Holistic Numerical Methods: Numerical Methods for the STEM undergraduate at http://nm.mathforcollege.com

Subscribe to the blog via a reader or email to stay updated with this blog. Let the information follow you.

Finding the optimum polynomial order to use for regression

Often, you may not have the privilege or knowledge of the physics of the problem to dictate the type of regression model. You may want to fit the data to a polynomial. But then how do you choose what order of polynomial to use?

Do you choose based on the polynomial order for which the sum of the squares of the residuals, Sr, is a minimum? If that were the case, we could always get Sr=0 by choosing the polynomial order to be one less than the number of data points. In fact, the polynomial would then pass exactly through every data point.

So what do we do? We choose the degree of polynomial for which the variance as computed by

Sr(m)/(n-m-1)

is a minimum or when there is no significant decrease in its value as the degree of polynomial is increased. In the above formula,

Sr(m) = sum of the square of the residuals for the mth order polynomial

n= number of data points

m=order of polynomial (so m+1 is the number of constants of the model)

Let’s look at an example where the coefficient of thermal expansion is given for a typical steel as a function of temperature. We want to relate the two using polynomial regression.

| Temperature (°F) | Instantaneous thermal expansion (1E-06 in/(in °F)) |
|---|---|
| 80 | 6.47 |
| 40 | 6.24 |
| 0 | 6.00 |
| -40 | 5.72 |
| -80 | 5.43 |
| -120 | 5.09 |
| -160 | 4.72 |
| -200 | 4.30 |
| -240 | 3.83 |
| -280 | 3.33 |
| -320 | 2.76 |

If a first order polynomial is chosen, we get

$latex \alpha=0.009147T+5.999$, with Sr=0.3138.

If a second order polynomial is chosen, we get

$latex \alpha=-0.00001189T^2+0.006292T+6.015$, with Sr=0.003047.

Below is the table for the order of polynomial, the Sr value and the variance value, Sr(m)/(n-m-1)

| Order of polynomial, m | Sr(m) | Sr(m)/(n-m-1) |
|---|---|---|
| 1 | 0.3138 | 0.03486 |
| 2 | 0.003047 | 0.0003808 |
| 3 | 0.0001916 | 0.000027371 |
| 4 | 0.0001566 | 0.0000261 |
| 5 | 0.0001541 | 0.00003082 |
| 6 | 0.0001300 | 0.0000325 |

So what order of polynomial would you choose?

From the above table, and the figure below, it looks like the second or third order polynomial would be a good choice as very little change is taking place in the value of the variance after m=2.

[Figure: Optimum order of polynomial for regression]
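The variance criterion can be tried out directly on the steel data given earlier. The following is a minimal MATLAB sketch, not necessarily how the table values were produced: it assumes polyfit with its centering-and-scaling output (mu) to keep the higher order fits well conditioned, and the variable names are chosen here for illustration.

```matlab
% Steel data from the post: temperature (deg F) and
% instantaneous thermal expansion (1E-06 in/(in deg F))
T     = [80 40 0 -40 -80 -120 -160 -200 -240 -280 -320];
alpha = [6.47 6.24 6.00 5.72 5.43 5.09 4.72 4.30 3.83 3.33 2.76];
n     = numel(T);

maxOrder = 6;                       % highest polynomial order tried (assumption)
Sr       = zeros(1, maxOrder);
variance = zeros(1, maxOrder);
for m = 1:maxOrder
    % mu centers and scales T so that the fit stays well conditioned
    [p, ~, mu]  = polyfit(T, alpha, m);
    residuals   = alpha - polyval(p, T, [], mu);
    Sr(m)       = sum(residuals.^2);        % sum of the squares of the residuals
    variance(m) = Sr(m)/(n - m - 1);        % Sr(m)/(n-m-1)
    fprintf('m = %d   Sr = %.4g   Sr/(n-m-1) = %.4g\n', m, Sr(m), variance(m));
end
```

Plotting variance against m (for example with semilogy(1:maxOrder, variance, 'o-')) shows where the curve flattens out.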


Data for aluminum cylinder in iced water experiment

A colleague asked me what if he did not have time or resources to do the experiments that have been developed at University of South Florida (USF) for numerical methods. He asked if I could share the data taken at USF.

Why not – here is the data for the experiment where an aluminum cylinder is placed in iced water. This link also has the exercises that the students were asked to do.

The temperature vs. time data is as follows: (0,23.3), (5,16.3), (10,13), (15,11.8), (20,11), (25,10.7), (30,9.6), (35,8.9), (40,8.4). Time is in seconds and temperature is in degrees Celsius. The other data needed is

Ambient temperature of iced water = 1.1 °C

Diameter of cylinder = 44.57 mm

Length of cylinder = 105.47 mm

Density of aluminum = 2700 kg/m³

Specific heat of aluminum = 901 J/(kg·°C)

Thermal conductivity of aluminum = 240 W/(m·K)

Table 1. Coefficient of thermal expansion vs. temperature for aluminum (Data taken from http://www.llnl.gov/tid/lof/documents/pdf/322526.pdf by using mid values of temperatures at which CTE is reported)

| Temperature (°C) | Coefficient of thermal expansion (μm/m/°C) |
|---|---|
| -10 | 58 |
| 12.5 | 59 |
| 37.5 | 60 |
| 62.5 | 62 |
| 87.5 | 66 |
| 112.5 | 71 |


In regression, when is the coefficient of determination zero?

The coefficient of determination is a measure of how much of the original uncertainty in the data is explained by the regression model.

The coefficient of determination, $latex r^2$ is defined as

$latex r^2=\frac{S_t-S_r}{S_t}$

where

$latex S_t$ = sum of the square of the differences between the y values and the average value of y

$latex S_r$ = sum of the square of the residuals, the residual being the difference between the observed and predicted values from the regression curve.

The coefficient of determination varies between 0 and 1. The value of the coefficient of determination of zero means that no benefit is gained by doing regression. When can that be?

One case comes to mind right away: what if you have only one data point? For example, if I have only one student in my class and the class average is 80, I know just from the average of the class that the student's score is 80. Regressing the student's score against the number of hours studied, his GPA, or his gender would not be of any benefit. In this case, the value of the coefficient of determination is zero.

What if we have more than one data point? Is it possible to get the coefficient of determination to be zero?

The answer is yes. Look at the following data pairs (1,3), (3,-2), (5,4), (7,-5), (9,4.2), (11,3), (2,4). If one regresses this data to a general straight line

y=a+bx,

one gets the regression line to be

y=1.6

[Figure: When is r-squared zero?]

In fact, 1.6 is the average value of the given y values. Is this a coincidence? Because the regression line is the average of the y values, $latex S_t=S_r$, implying $latex r^2=0$.
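The claim can be checked numerically. Here is a minimal MATLAB sketch using the seven data pairs above and the usual closed-form expressions for the straight-line constants; the variable names are chosen here for illustration.

```matlab
% Data pairs from the post
x = [1 3 5 7 9 11 2];
y = [3 -2 4 -5 4.2 3 4];
n = numel(x);

% Least squares straight line y = a + b*x
b = (n*sum(x.*y) - sum(x)*sum(y)) / (n*sum(x.^2) - sum(x)^2);
a = mean(y) - b*mean(x);

% Coefficient of determination r^2 = (St - Sr)/St
St = sum((y - mean(y)).^2);     % spread of y about its average
Sr = sum((y - a - b*x).^2);     % spread of y about the regression line
r2 = (St - Sr)/St;

fprintf('a = %g, b = %g, r^2 = %g\n', a, b, r2);
```

Up to round-off, the slope b and r^2 come out as zero and the intercept a equals the average of the y values, 1.6.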

QUESTIONS

  1. Given (1,3), (3,-2), (5,4), (7,a), (9,4.2), find the value of a that gives the coefficient of determination, $latex r^2=0$. Hint: Write the expression for $latex S_r$ for the regression line $latex y=mx+c$. We now have three unknowns, m, c and a. The three equations then are $latex \frac{\partial S_r} {\partial m} =0$, $latex \frac{\partial S_r} {\partial c} =0$ and $latex S_t=S_r$.
  2. Show that if n data pairs $latex (x_1,y_1), \ldots, (x_n,y_n)$ are regressed to a straight line, and the regression straight line turns out to be a constant line, then the equation of the constant line is always y=average value of the y-values.


Length of a curve experiment

In a previous post, I mentioned that I have incorporated experiments in my Numerical Methods course. Here I will discuss the second experiment.

[Figure: Length of the curve experiment]

In this experiment, we find the length of two curves generated from the same points: one curve is a polynomial interpolant and the other is a spline interpolant.

Motivation behind the experiment: In 1901, Runge conducted a numerical experiment to show that higher order interpolation is a bad idea. It was shown that as you use higher order interpolants to approximate $latex f(x)=1/(1+25x^2)$ in [-1,1], the differences between the original function and the interpolants become worse. This concept is also the basis for why we use splines rather than polynomial interpolation to find smooth paths through several discrete points.
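Runge's observation is easy to reproduce. The following minimal MATLAB sketch assumes equidistant interpolation points and a particular set of polynomial orders (both are choices made here, not taken from the experiment handout) and reports the largest difference between the function and each interpolant.

```matlab
% Runge's function on [-1,1]
f  = @(x) 1./(1 + 25*x.^2);
xx = linspace(-1, 1, 2001);          % fine grid for measuring the error

orders = [4 8 12 16];                % polynomial orders tried (assumption)
maxErr = zeros(size(orders));
for k = 1:numel(orders)
    m  = orders(k);
    xn = linspace(-1, 1, m + 1);     % m+1 equidistant points give an mth order interpolant
    p  = polyfit(xn, f(xn), m);      % interpolating polynomial through all points
    maxErr(k) = max(abs(f(xx) - polyval(p, xx)));
    fprintf('order %2d: max |f - interpolant| = %.4g\n', m, maxErr(k));
end
```

For these orders the maximum difference grows with the order, with the largest deviations occurring near the ends of the interval.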

What do students do in the lab: A flexible curve (see Figure) of length 12″ made of lead-core construction with graduations in both millimeters and inches is provided. The student needs to draw a curve similar in shape to the Runge’s curve on the provided graphing paper as shown. It just needs to be similar in shape – the student can make the x-domain shorter and the maximum y-value larger or vice-versa. The student just needs to make sure that there is a one-to-one correspondence of values.

Assigned Exercises: Use MATLAB to solve problems (3 thru 6). Use comments, display commands and fprintf statements, sensible variable names and units to explain your work. Staple all the work in the following sequence.

  1. Signed typed affidavit sheet.
  2. Attach the plot you drew in the class. Choose several points (at least nine – do not need to be symmetric) along the curve, including the end points. Write out the co-ordinates on the graphing paper curve as shown in the figure.
  3. Find the polynomial interpolant that curve fits the data. Output the coefficients of the polynomial.
  4. Find the cubic spline interpolant that curve fits the data. Just show the work in the mfile.
  5. Illustrate and show the individual points, polynomial and cubic spline interpolants on a single plot.
  6. Find the length of the two interpolants – the polynomial and the spline interpolant. Calculate the relative difference between the length of each interpolant and the actual length of the flexible curve.
  7. In 100-200 words, type out your conclusions using a word processor. Any formulas should be shown using an equation editor. Any sketches need to be drawn using a drawing software such as Word Drawing. Any plots can be imported from MATLAB.

Where to buy the items for the experiment:

  1. Flexible curves – I bought these via internet at Art City. The brand name is Alvin Tru-Flex Graduated Flexible Curves. Prices range from $5 to $12. Shipping and handling is extra – approximately $6 plus 6% of the price. You may want to buy several 12″ and 16″ flexible curves. I had to send a query to the vendor when I did not receive them within a couple of weeks. Alternatively, call your local Art Store and see if they have them.
  2. Engineering Graph Paper – Staples or Office Depot. Costs about $12 for a 100-sheet pad.
  3. Pencil – Anywhere – My favorite store is the 24-hour Wal-Mart Superstore. $1 for a dozen.
  4. Scale – Anywhere – My favorite store is the 24-hour Wal-Mart Superstore. $1 per unit.


A legend used in the movie “The Happening”

Well M. Night Shyamalan may have made another disappointing movie – The Happening, but I somewhat liked it. I would give it a grade of B.

In the movie, John Leguizamo's character, a math teacher, is distracting his fellow panicking passenger in the Jeep with a mathematical question. The question he asks her is: if he gave her a penny on Day 1 of the month, two pennies on Day 2 of the month, four pennies on Day 3 of the month, and so on, how much money would she have after a month? She shouts $300 or some odd number like that. But do you know that the amount is actually more than 10 million dollars? (Thanks to a student who mentioned that it was a penny that John offered on the first day, not a dollar. Sometimes I do feel generous.)
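The arithmetic behind that figure can be checked with a short loop. This is a minimal MATLAB sketch that assumes a 30-day month; the variable names are chosen here for illustration.

```matlab
days    = 30;        % assumed length of the month
pennies = 1;         % pennies handed over on Day 1
total   = 0;         % running total in pennies
for day = 1:days
    total   = total + pennies;
    pennies = 2*pennies;         % the gift doubles every day
end
dollars = total/100;
fprintf('Total after %d days = $%.2f\n', days, dollars);
```

The running total is 2^30 - 1 pennies, which is $10,737,418.23. With a 31-day month the total roughly doubles again.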

This question is based on a story from India and it goes as follows.

King Shriham of India wanted to reward his grand minister Ben for inventing the game of chess. When asked what reward he wanted, Ben asked for 1 grain of rice on the first square of the board, 2 on the second square of the board, 4 on the third square of the board, 8 on the fourth square of the board, and so on till all the 64 squares were covered. That is, he was doubling the number of grains on each successive square of the board. Although Ben's request looked modest, King Shriham quickly found that the amount of rice that Ben was asking for was humongous.

QUESTIONS:

Write a MATLAB (you can use any other programming language) program for the following using the for or while loop.

  1. Find out how many grains of rice Ben was asking for.
  2. If the mass of a grain of rice is 2 mg, and the world production of rice in recent years has been approximately 600,000,000 tons (1 ton=1000 kg), how many times the modern world production was Ben’s request?
  3. Do the inverse problem: find out how many squares are covered if the number of grains on the chess board is given to you. For example, how many squares will be covered if the number of grains on the chess board is 16?


Shortest path for a robot

Imagine a robot that has to go from point to point consecutively (based on x values) on a two dimensional x-y plane. The shortest path in this case would simply be obtained by drawing linear splines thru consecutive data points. What if the path is required to be smooth? Then what?

Well one may use polynomial or quadratic/cubic spline interpolation to develop the path. Which path would be shorter? To find out thru an anecdotal example, click here.
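A comparison of this kind can be sketched in a few lines of MATLAB. The way-points below are hypothetical (they are not the data of the linked example), the cubic spline is MATLAB's default spline interpolant, and each path length is approximated by summing short straight segments on a fine grid.

```matlab
% Hypothetical way-points for the robot (not from the linked example)
x = [0 1 2 3 4 5 6];
y = [0 2 1 3 0.5 2.5 1];

% Fine grid on which the smooth paths are sampled
xx = linspace(x(1), x(end), 6001);
pathLen = @(yy) sum(sqrt(diff(xx).^2 + diff(yy).^2));

% Polynomial interpolant thru all the points
p       = polyfit(x, y, numel(x) - 1);
lenPoly = pathLen(polyval(p, xx));

% Cubic spline interpolant thru the same points
lenSpline = pathLen(spline(x, y, xx));

% Linear splines: straight lines between consecutive points
lenLinear = sum(sqrt(diff(x).^2 + diff(y).^2));

fprintf('Linear splines : %.4f\n', lenLinear);
fprintf('Cubic spline   : %.4f\n', lenSpline);
fprintf('Polynomial     : %.4f\n', lenPoly);
```

The linear spline length is a lower bound for any path thru the points. How the two smooth paths compare depends on the data, so try other way-points as well.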


Do quadratic splines really use all the data points?

There are two reasons that linear splines are seldom used:

  1. Each spline uses information only from two consecutive data points
  2. The slope of the splines is discontinuous at the interior data points

The answer to resolving the above concerns is to use higher order splines such as quadratic splines. Read the quadratic spline textbook notes before you go any further. You do want what I have to say to make sense to you.

In quadratic splines, a quadratic polynomial is assumed to go thru consecutive data points. So you cannot just find the three constants of each quadratic polynomial spline by using the information that the spline goes thru two consecutive points (that sets up two equations and three unknowns). Hence, we incorporate that the splines have a continuous slope at the interior points.

So does all this incorporation make the splines depend on the values of all the given data points? It does not seem so.

For example, in quadratic splines you have to assume that the first or last spline is linear. For argument's sake, let that be the first spline. If the first spline is linear, then we can find the constants of the linear spline just from the values of the first two data points. So now we know that we can set up three equations for the three unknown constants of the second spline as follows

  1. the slope of the first spline at the 2nd data point and the slope of the second spline at the 2nd point are the same,
  2. the second spline goes thru the 2nd data point
  3. the second spline goes thru the 3rd data point

That itself is enough information to find the three constants of the second spline. We can keep using the same argument for all the other splines.

So the mth spline constants are dependent on data from the data points 1, 2, .., m, m+1 but not beyond that.
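The one-sided dependence can be seen by building the splines one after another and then changing only the last data point. The following is a minimal MATLAB sketch under stated assumptions: the data are hypothetical, the first spline is taken as linear, and each spline is written in the local form a + b(x - x_k) + c(x - x_k)^2, which is a convenience chosen here rather than the form used in the textbook notes.

```matlab
% Hypothetical data; set B differs from set A only in the last point
x  = [0 1 2.5 4 5];
yA = [1 3 2 4.5 4];
yB = yA;
yB(end) = 10;
Y  = [yA; yB];

nSplines = numel(x) - 1;
coef = zeros(nSplines, 3, 2);        % rows hold [a b c] for each spline, per data set
for s = 1:2
    y = Y(s, :);
    slope = (y(2) - y(1))/(x(2) - x(1));    % first spline assumed linear
    for k = 1:nSplines
        h = x(k+1) - x(k);
        a = y(k);                            % spline goes thru the kth point
        b = slope;                           % slope matches the previous spline
        c = (y(k+1) - a - b*h)/h^2;          % spline goes thru the (k+1)th point
        coef(k, :, s) = [a b c];
        slope = b + 2*c*h;                   % slope handed on to the next spline
    end
end

disp('Constants [a b c] of each spline, data set A:');
disp(coef(:, :, 1));
disp('Constants [a b c] of each spline, data set B:');
disp(coef(:, :, 2));
```

Only the constants of the last spline differ between the two data sets. Changing an early data point instead would change the constants of every spline after it.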

Can you now make the call on the kind of dependence or independence you have in the constants of the quadratic splines?


Extrapolation is inexact and may be dangerous

The NASDAQ was booming – people were dreaming of riches – early retirement and what not. The year was 1999 and NASDAQ was at an all time high of 4069 on the last day of 1999.


Yes, Prince was right, not just about the purple rain, but – “‘Cuz they say two thousand zero zero party over, Oops out of time, So tonight I’m gonna party like it’s 1999 party like 1999.”

But as we know the party did not last too long. The dot com bubble burst and as we know it today (June 2008), the NASDAQ is hovering around 2400.

| Year | NASDAQ on December 31st |
|---|---|
| 1994 | 751 |
| 1995 | 1052 |
| 1996 | 1291 |
| 1997 | 1570 |
| 1998 | 2192 |
| 1999 | 4069 |

• End of Year NASDAQ Composite Data taken from www.bigcharts.com

So how about extrapolating the value of NASDAQ to not too far ahead, just to the end of 2000 and 2001? This is what you obtain from using a 5th order interpolant for approximation from the above six values.

| End of Year | Actual Value | 5th Order Polynomial Extrapolation | Absolute Relative True Error |
|---|---|---|---|
| 2000 | 2471 | 9128 | 269% |
| 2001 | 1950 | 20720 | 962% |
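The extrapolation can be reproduced approximately in MATLAB. This minimal sketch uses the rounded year-end values from the table and shifts the years so that 1994 corresponds to t = 0 (a choice made here to keep the fit well conditioned); the variable names are for illustration only.

```matlab
% Year-end NASDAQ values from the table (rounded)
yr     = [1994 1995 1996 1997 1998 1999];
nasdaq = [751 1052 1291 1570 2192 4069];

% 5th order polynomial thru all six points, with years shifted to start at 0
t = yr - yr(1);
p = polyfit(t, nasdaq, 5);

% Extrapolate to the end of 2000 and 2001 and compare with the actual values
yrNew  = [2000 2001];
actual = [2471 1950];
extrap = polyval(p, yrNew - yr(1));
relErr = abs((actual - extrap)./actual)*100;

for k = 1:numel(yrNew)
    fprintf('%d: actual = %d, extrapolated = %.0f, abs rel true error = %.0f%%\n', ...
            yrNew(k), actual(k), extrap(k), relErr(k));
end
```

With the rounded inputs, the extrapolated values come out near 9130 and 20729, slightly different from the 9128 and 20720 quoted above, presumably because unrounded index values were used there.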

Do you know what the extrapolated value of NASDAQ would be on June 30, 2008? A whopping 861391! On June 30, 2008, compare it with the actual value.


Finding the length of curve using MATLAB

As per integral calculus, the length of a continuous and differentiable curve y=f(x) from x=a to x=b is given by

$latex S=\int_a^b \sqrt{1+\left(\frac{dy}{dx}\right)^2}\, dx$

Now how do we find the length of a curve in MATLAB?

Let us do this via an example. Assume one asked you to find the length of $latex x^2 \sin(x)$ from $latex \pi$ to $latex 2\pi$. In the book, How People Learn, the authors mention that learning a concept in multiple contexts prolongs retention. Although it may not be the context that the authors of the book are talking about, let us find the length of the curve multiple ways within MATLAB. Try the program for functions and limits of your own choice to evaluate the difference.

METHOD 1: Use the formula $latex S=\int_a^b \sqrt{1+\left(\frac{dy}{dx}\right)^2}\, dx$ by using the diff and int functions of MATLAB

METHOD 2: Generate several points between a and b, and join straight lines between consecutive data points. Add the length of these straight lines to find the length of the curve.

METHOD 3: Find the derivative dy/dx numerically using the forward divided difference scheme, and then use the trapezoidal rule (trapz command in MATLAB) for discrete data with unequal segments to find the length of the curve.

QUESTIONS TO BE ANSWERED:

  1. Why does METHOD 3 give inaccurate results? Can you improve them by using a better approximation of the derivative, such as the central divided difference scheme?
  2. Redo the problem with $latex f(x)=x^{\frac{3}{2}}$ with a=1 and b=4, as the exact length can be found for such a function.

% Simulation : Find length of a given Curve
% Language : Matlab 2007a
% Authors : Autar Kaw
% Last Revised : June 14 2008
% Abstract: We are finding the length of the curve by three different ways
% 1. Using the formula from calculus
% 2. Breaking the curve into bunch of small straight lines
% 3. Finding dy/dx of the formula numerically to use discrete function
% integration
clc
clear all

disp('We are finding the length of the curve by three different ways')
disp('1. Using the formula from calculus')
disp('2. Breaking the curve into bunch of small straight lines')
disp('3. Finding dy/dx of the formula numerically to use discrete function integration')

% INPUTS - this is where you will change input data if you are doing
% a different problem
syms x;
% Define the function
curve=x^2*sin(x)
% a = lower limit
a=pi
% b = upper limit
b=2*pi
% n = number of straight lines used to approximate f(x) for METHOD 2
n=100
% p = number of segments over which dy/dx is calculated for METHOD 3
p=100

% OUTPUTS
% METHOD 1. Using the calculus formula
% S=int(sqrt(1+(dy/dx)^2),a,b)
% finding dy/dx
poly_dif=diff(curve,x,1);
% applying the formula
integrand=sqrt(1+poly_dif^2);
leng_exact=double(int(integrand,x,a,b));
fprintf('\nExact length =%g',leng_exact)
%***********************************************************************

% METHOD 2. Breaking the curve as if it is made of small length
% straight lines
% Generating n+1 x-points from a to b
xi=linspace(a,b,n+1);
% generating the y-values of the function (converted to numeric values)
yi=double(subs(curve,x,xi));
% assuming that between consecutive data points, the
% curve can be approximated by linear splines.
leng_straight=0;
m=length(xi);
% there are m-1 splines for m points
for i=1:1:m-1
    dx=xi(i+1)-xi(i);
    dy=yi(i+1)-yi(i);
    leneach=sqrt(dx^2+dy^2);
    leng_straight=leng_straight+leneach;
end
fprintf('\n\nBreaking the line into short lengths =%g',leng_straight)

% METHOD 3. Same as METHOD 1, but calculating dy/dx
% numerically and integrating using trapz
xi=linspace(a,b,p+1);
% regenerating the y-values on this grid (needed when p differs from n)
yi=double(subs(curve,x,xi));
% generating the dy/dx-values
m=length(xi);
dydxv=zeros(1,m);
for i=1:1:m-1
    numer=yi(i+1)-yi(i);
    den=xi(i+1)-xi(i);
    dydxv(i)=numer/den;
end
% derivative at the last point uses the backward divided difference formula,
% which equals the forward divided difference at the previous point
dydxv(m)=dydxv(m-1);
integrandi=sqrt(1+dydxv.*dydxv);
length_fdd=trapz(xi,integrandi);
disp(' ')
disp(' ')
disp('Using numerical value of dy/dx coupled')
disp('with discrete integration')
fprintf(' =%g\n',length_fdd)
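For the second question, METHODS 2 and 3 can be compared against a closed-form length without the symbolic toolbox. The following is a minimal sketch, assuming the closed form that follows from dy/dx = (3/2)x^(1/2), namely S = (8/27)[(1+9b/4)^(3/2) - (1+9a/4)^(3/2)], and 100 segments; the variable names are chosen here for illustration.

```matlab
% Curve y = x^(3/2) from a = 1 to b = 4
f = @(x) x.^1.5;
a = 1;
b = 4;
n = 100;                         % number of segments (assumption)

% Closed-form length from integrating sqrt(1 + 9x/4)
exact = (8/27)*((1 + 9*b/4)^1.5 - (1 + 9*a/4)^1.5);

% METHOD 2: sum of straight-line lengths between consecutive points
xi = linspace(a, b, n + 1);
yi = f(xi);
lenChord = sum(sqrt(diff(xi).^2 + diff(yi).^2));

% METHOD 3: forward divided difference for dy/dx, then trapz
dydx = diff(yi)./diff(xi);
dydx(end + 1) = dydx(end);       % last point reuses the previous slope
lenFdd = trapz(xi, sqrt(1 + dydx.^2));

fprintf('Exact length              = %.6f\n', exact);
fprintf('Straight lines (METHOD 2) = %.6f\n', lenChord);
fprintf('FDD + trapz (METHOD 3)    = %.6f\n', lenFdd);
```

The exact length is about 7.6337. Comparing the two numerical results against it gives a starting point for answering the first question.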
