We begin this section by using the method of least squares to find the best straight line fit to a set of data. Later in the section we will discuss best fits to other curves.

An Example of Best Linear Fit to Data

Suppose that we are given $n$ data points $(x_i, y_i)$ for $i = 1, \ldots, n$. For example, consider the ten points

Least Squares Fit to a Quadratic Polynomial

Suppose that we want to fit the data $(x_i, y_i)$ to a quadratic polynomial $y = b_1 + b_2 x + b_3 x^2$ by least squares methods. We want to find constants $b_1, b_2, b_3$ so that the error made in using the quadratic polynomial is minimal among all possible choices of quadratic polynomials. The least squares error is $E^2 = \|Y - (b_1 F_1 + b_2 X + b_3 X^{(2)})\|^2$, where $X^{(2)} = (x_1^2, \ldots, x_n^2)^t$ and, as before, $F_1$ is the $n$-vector with all components equal to $1$.

We solve the minimization problem as before. In this case, the space $W$ of possible approximations to the data is three dimensional; indeed, $W = \mathrm{span}\{F_1, X, X^{(2)}\}$. As in the case of fits to lines, we try to find a point in $W$ that is nearest to the vector $Y \in \mathbb{R}^n$. By (??), the answer is $b = (A^t A)^{-1} A^t Y$, where $A = (F_1 | X | X^{(2)})$ is an $n \times 3$ matrix.
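The least squares solution can be motivated by a short calculation. This is only a sketch, and it assumes that the columns of $A$ (the vector of ones, the vector of the $x_i$, and the vector of the $x_i^2$) are linearly independent, so that $A^t A$ is invertible. The point $Ab$ of $W$ nearest to $Y$ is the one whose residual $Y - Ab$ is orthogonal to every column of $A$:

```latex
\begin{align*}
A^t (Y - Ab) &= 0
  && \text{(residual is orthogonal to the columns of } A\text{)} \\
A^t A \, b &= A^t Y
  && \text{(the normal equations)} \\
b &= (A^t A)^{-1} A^t Y
  && \text{(since } A^t A \text{ is invertible)}
\end{align*}
```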

Suppose that we try to fit the data in (??) with a quadratic polynomial rather than a linear one. Use MATLAB as follows:

e10_3_1
A = [F1 X X.*X];
b = inv(A'*A)*A'*Y;

to obtain
b(1) =  -3.8197
b(2) =   1.7054
b(3) =   0.0443

So the best parabolic fit to this data is $y = 0.0443x^2 + 1.7054x - 3.8197$. Note that the coefficient of $x^2$ is small, suggesting that the data was already well fit by a straight line. Note also that the error of the quadratic fit is only marginally smaller than the error for the best linear fit. For comparison, in Figure ?? we superimpose the equation for the quadratic fit onto Figure ??.
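The comparison between a linear and a quadratic fit can be sketched in MATLAB as follows. The six data points below are hypothetical and chosen only for illustration; they are not the ten points used in the text, and the names A1, A2, E1 and E2 are our own. Since every line is also a quadratic polynomial with zero $x^2$ coefficient, the quadratic error can never exceed the linear error.

```matlab
% Hypothetical data for illustration only (not the data set of the text).
X = [1.0 2.0 3.0 4.0 5.0 6.0]';
Y = [1.1 1.9 3.2 3.8 5.1 6.2]';
F1 = ones(size(X));            % vector of ones for the constant term

A1 = [F1 X];                   % basis 1, x
A2 = [F1 X X.*X];              % basis 1, x, x^2

b1 = inv(A1'*A1)*A1'*Y;        % coefficients of the best linear fit
b2 = inv(A2'*A2)*A2'*Y;        % coefficients of the best quadratic fit

E1 = norm(Y - A1*b1);          % error of the best linear fit
E2 = norm(Y - A2*b2);          % error of the best quadratic fit

fprintf('linear error      E1 = %.4f\n', E1);
fprintf('quadratic error   E2 = %.4f\n', E2);
fprintf('x^2 coefficient      = %.4f\n', b2(3));
```

A small $x^2$ coefficient together with a small drop from E1 to E2 suggests that the extra basis function adds little.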



Figure 1: Scatter plot of data in (??) with best linear and quadratic approximations. The best linear fit is plotted with a dashed line.

General Least Squares Fit

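The approximation to a quadratic polynomial shows that least squares fits can be made to any finite dimensional function space. More precisely, let $\mathcal{C}$ be a finite dimensional space of functions and let $f_1(x), \ldots, f_m(x)$ be a basis for $\mathcal{C}$. We have just considered two such spaces: $\mathcal{C} = \mathrm{span}\{f_1(x) = 1, f_2(x) = x\}$ for linear regression and $\mathcal{C} = \mathrm{span}\{f_1(x) = 1, f_2(x) = x, f_3(x) = x^2\}$ for least squares fit to a quadratic polynomial.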

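The general least squares fit of a data set $(x_1, y_1), \ldots, (x_n, y_n)$ is the function $g_0(x) \in \mathcal{C}$ that is nearest to the data set in the following sense. Let $X = (x_1, \ldots, x_n)^t$ and $Y = (y_1, \ldots, y_n)^t$ be column vectors in $\mathbb{R}^n$. For any function $g(x)$ define the column vector $G = (g(x_1), \ldots, g(x_n))^t \in \mathbb{R}^n$. So $G$ is the evaluation of $g(x)$ on the data set. Then the error $E = \|G - Y\|$ is minimal for $g = g_0$.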

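More precisely, we think of the data $Y$ as representing the (approximate) evaluation of a function on the $x_i$. Then we try to find a function $g_0 \in \mathcal{C}$ whose values on the $x_i$ are as near as possible to the vector $Y$. This is just a least squares problem. Let $W \subset \mathbb{R}^n$ be the vector subspace spanned by the evaluations of functions $g \in \mathcal{C}$ on the data points $x_i$, that is, the vectors $G = (g(x_1), \ldots, g(x_n))^t$. The minimization problem is to find a vector in $W$ that is nearest to $Y$. This can be solved in general using (??). That is, let $A$ be the $n \times m$ matrix $A = (F_1 | \cdots | F_m)$, where $F_j \in \mathbb{R}^n$ is the column vector associated to the $j$th basis element of $\mathcal{C}$, that is, $F_j = (f_j(x_1), \ldots, f_j(x_n))^t$. The minimizing function $g_0(x) \in \mathcal{C}$ is a linear combination of the basis functions $f_1(x), \ldots, f_m(x)$, that is, $g_0(x) = b_1 f_1(x) + \cdots + b_m f_m(x)$ for scalars $b_i$. If we set $b = (b_1, \ldots, b_m)^t \in \mathbb{R}^m$, then least squares minimization states that $b = (A^t A)^{-1} A^t Y$.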

This equation can be solved easily in MATLAB. Enter the data as column $n$-vectors X and Y. Compute the column vectors Fj = $f_j$(X) and then form the matrix A = [F1 F2 ... Fm]. Finally compute

b = inv(A'*A)*A'*Y
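The general recipe can be sketched for an arbitrary basis by storing the basis functions as function handles. Everything specific here is an assumption made for illustration: the cell array name basis, the choice of basis functions $1$, $x$, $e^{-x}$, and the synthetic data, which is constructed to lie exactly in the span of the basis so that the fit should recover the coefficients used to build it.

```matlab
% Hypothetical basis f_1(x) = 1, f_2(x) = x, f_3(x) = exp(-x).
basis = {@(x) ones(size(x)), @(x) x, @(x) exp(-x)};

X = (0:0.5:4)';                     % sample points as a column n-vector
Y = 2 - 0.5*X + 3*exp(-X);          % synthetic data lying in the span

m = numel(basis);
A = zeros(numel(X), m);
for j = 1:m
    A(:, j) = basis{j}(X);          % column Fj = f_j(X)
end

b = inv(A'*A)*A'*Y;                 % b = (A^t A)^{-1} A^t Y
disp(b')                            % coefficients b_1, ..., b_m
```

Changing the fit to a different function space only requires changing the entries of basis; the construction of A and the formula for b stay the same.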

Least Squares Fit to a Sinusoidal Function

We discuss a specific example of the general least squares formulation by considering the weather. It is reasonable to expect monthly data on the weather to vary periodically in time with a period of one year. In Table ?? we give average daily high and low temperatures for each month of the year for Paris and Rio de Janeiro. We attempt to fit this data with curves of the form $g(T) = b_1 + b_2 \cos\left(\frac{2\pi}{12} T\right) + b_3 \sin\left(\frac{2\pi}{12} T\right)$, where $T$ is time measured in months and $b_1, b_2, b_3$ are scalars. These functions are periodic with a period of twelve months, which seems appropriate for weather data, and form a three dimensional function space $\mathcal{C}$. Recall the trigonometric identity $a\cos(\omega t) + b\sin(\omega t) = d\sin(\omega(t - \varphi))$, where $d = \sqrt{a^2 + b^2}$ and $\varphi$ is a suitable phase shift. Based on this identity we call $\mathcal{C}$ the space of sinusoidal functions. The number $d$ is called the amplitude of the sinusoidal function $g(T)$.
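The amplitude formula $d = \sqrt{a^2 + b^2}$ in the identity $a\cos(\omega t) + b\sin(\omega t) = d\sin(\omega(t - \varphi))$ can be checked by expanding the right hand side with the angle subtraction formula and matching coefficients. The following is a brief sketch; the phase $\varphi$ is whatever angle satisfies the two matching conditions in the second line.

```latex
\begin{align*}
d\sin(\omega(t - \varphi))
  &= d\cos(\omega\varphi)\,\sin(\omega t) - d\sin(\omega\varphi)\,\cos(\omega t), \\
a &= -d\sin(\omega\varphi), \qquad b = d\cos(\omega\varphi), \\
a^2 + b^2 &= d^2\left(\sin^2(\omega\varphi) + \cos^2(\omega\varphi)\right) = d^2 .
\end{align*}
```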












          Paris          Rio de Janeiro
Month   High    Low      High    Low
  1      55     39        84     73
  2      55     41        85     73
  3      59     45        83     72
  4      64     46        80     69
  5      68     55        77     66
  6      75     61        76     64
  7      81     64        75     63
  8      81     64        76     64
  9      77     61        75     65
 10      70     54        77     66
 11      63     46        79     68
 12      55     41        82     71

Table 1: Monthly Average of Daily High and Low Temperatures (in degrees Fahrenheit) in Paris and Rio de Janeiro.

Note that each data set consists of twelve entries, one for each month. Let $T = (1, 2, \ldots, 12)^t$ be the vector $X \in \mathbb{R}^{12}$ in the general presentation. Next let $Y$ be the data in one of the data sets, say the high temperatures in Paris.

Now we turn to the vectors representing basis functions in $\mathcal{C}$. Let

F1 = [1 1 1 1 1 1 1 1 1 1 1 1]'

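be the vector associated with the basis function $f_1(T) = 1$. Let F2 and F3 be the column vectors associated to the basis functions $f_2(T) = \cos\left(\frac{2\pi}{12} T\right)$ and $f_3(T) = \sin\left(\frac{2\pi}{12} T\right)$. These vectors are computed by typing
F2 = cos(2*pi/12*T);
F3 = sin(2*pi/12*T);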

By typing temper, we enter the temperatures and the vectors T, F1, F2 and F3 into MATLAB.

To find the best fit to the data by a sinusoidal function $g(T)$, we use (??). Let $A$ be the $12 \times 3$ matrix

A = [F1 F2 F3];

The table data is entered in column vectors ParisH and ParisL for the high and low Paris temperatures and RioH and RioL for the high and low Rio de Janeiro temperatures. We can find the best least squares fit of the Paris high temperatures by a sinusoidal function by typing

b = inv(A'*A)*A'*ParisH

obtaining

b(1) =  66.9167
b(2) =  -9.4745
b(3) =  -9.3688
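For readers who do not have the temper script at hand, the Paris computation can be reproduced with a short self-contained sketch that types in the high temperatures from Table 1 directly. The last line is our own addition: it compares b(1) with the mean of the data, which agree because the cosine and sine columns each sum to zero over the twelve months.

```matlab
T = (1:12)';                                         % months 1, ..., 12
ParisH = [55 55 59 64 68 75 81 81 77 70 63 55]';     % Table 1, Paris highs

F1 = ones(12, 1);                 % basis function 1
F2 = cos(2*pi/12*T);              % basis function cos(2*pi*T/12)
F3 = sin(2*pi/12*T);              % basis function sin(2*pi*T/12)
A  = [F1 F2 F3];

b = inv(A'*A)*A'*ParisH;          % least squares coefficients

fprintf('b(1) = %8.4f\nb(2) = %8.4f\nb(3) = %8.4f\n', b);
fprintf('mean of ParisH = %8.4f\n', mean(ParisH));
```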

The result is plotted in Figure ?? by typing
plot(T,ParisH,'o')
axis([0,13,0,100])
xlabel('time (months)')
ylabel('temperature (Fahrenheit)')
hold on
xx = linspace(0,13);
yy = b(1) + b(2)*cos(2*pi*xx/12) + ...
     b(3)*sin(2*pi*xx/12);
plot(xx,yy)



Figure 2: Monthly averages of daily high temperatures in Paris (left) and Rio de Janeiro (right) with best sinusoidal approximation.

A similar exercise allows us to compute the best approximation to the Rio de Janeiro high temperatures obtaining

b(1) =  79.0833
b(2) =   3.0877
b(3) =   3.6487

The value of $b_1$ is just the mean high temperature and, not surprisingly, that value is much higher in Rio than in Paris. There is yet more information contained in these approximations. For the high temperatures in Paris and Rio the amplitudes $d = \sqrt{b_2^2 + b_3^2}$ are $d_P = \sqrt{(-9.4745)^2 + (-9.3688)^2} \approx 13.32$ and $d_R = \sqrt{3.0877^2 + 3.6487^2} \approx 4.78$, respectively. The amplitude measures the variation of the high temperature about its mean. It is much greater in Paris than in Rio, indicating that the difference in temperature between winter and summer is much greater in Paris than in Rio.
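The amplitudes can be computed directly from the fitted coefficients, since the amplitude is the length of the vector formed by the cosine and sine coefficients. In this sketch the coefficient vectors are typed in from the results reported above, and the variable names bParis, bRio, dParis and dRio are our own.

```matlab
bParis = [66.9167; -9.4745; -9.3688];   % fitted coefficients, Paris highs
bRio   = [79.0833;  3.0877;  3.6487];   % fitted coefficients, Rio highs

dParis = norm(bParis(2:3));             % sqrt(b(2)^2 + b(3)^2)
dRio   = norm(bRio(2:3));

fprintf('amplitude, Paris: %.4f\n', dParis);
fprintf('amplitude, Rio:   %.4f\n', dRio);
```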

Least Squares Fit in MATLAB

The general formula for a least squares fit of data (??) has been preprogrammed in MATLAB. After setting up the matrix $A$ whose columns are the vectors $F_j$, just type

b = A\Y

This MATLAB command can be checked on the sinusoidal fit to the high temperature Rio de Janeiro data by typing
b = A\RioH

and obtaining
b =
 
   79.0833  
    3.0877  
    3.6487
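As a check, the two ways of computing the fit can be compared side by side on the Rio de Janeiro high temperatures. The sketch below is self-contained and enters the data from Table 1 directly; the names bNormal and bSlash are our own. As we understand it, for a rectangular matrix the backslash command solves the least squares problem through a matrix factorization rather than by forming inv(A'*A), which is generally the numerically safer route, although for a small well conditioned problem like this one the two answers should agree to many digits.

```matlab
T = (1:12)';                                        % months 1, ..., 12
RioH = [84 85 83 80 77 76 75 76 75 77 79 82]';      % Table 1, Rio highs

A = [ones(12,1) cos(2*pi/12*T) sin(2*pi/12*T)];     % columns F1, F2, F3

bNormal = inv(A'*A)*A'*RioH;      % explicit formula (A^t A)^{-1} A^t Y
bSlash  = A\RioH;                 % built-in least squares solve

disp([bNormal bSlash])            % the two columns should agree
fprintf('largest difference = %g\n', max(abs(bNormal - bSlash)));
```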

Exercises

World population data for each decade of the twentieth century (except for 1910) is given in Table ??. Assume that population growth is linear, $P = mt + b$, where time $t$ is measured in decades since the year 1900 and $P$ is measured in billions of people. This data can be recovered by typing e10_3_po.
  • Find $m$ and $b$ to give the best linear fit to this data.
  • Use this linear approximation to the data to make predictions of the world populations in the years 1910 and 2000.
  • Do you expect the prediction for the year 2000 to be high or low or on target? Explain why by graphing the data with the best linear fit superimposed and by using the differential equation population model discussed in Section ??.






Year    Population (in millions)
1900    1625
1910    n.a.
1920    1813
1930    1987
1940    2213
1950    2516
1960    3020
1970    3698
1980    4448
1990    5292

Table 2: Twentieth Century World Population Data by Decades.

Find the best sinusoidal approximation to the monthly average low temperatures in Paris and Rio de Janeiro. How does the variation of these temperatures about the mean compare to the high temperature calculations? Was this the result you expected?
In Table ?? we present weather data from ten U.S. cities. For each city the data gives the average number of days per year with precipitation and the percentage of sunny hours out of the hours when it could be sunny. Find the best linear fit to this data.








City              Rainy Days    Sunny (%)
Charleston            92           72
Chicago              121           54
Dallas                82           65
Denver                82           67
Duluth               136           52
Kansas City           98           59
Miami                114           85
New Orleans          103           61
Phoenix               28           88
Salt Lake City        99           59

Table 3: Precipitation Days Versus Sunny Time for Selected U.S. Cities.