## Sunday, June 17, 2012

### An overview of using linear least squares to create predictive models

You undoubtedly remember my groundbreaking work with the Mega Millions probability and expected value.  To review, I showed that due to multiple winners splitting the jackpot the Mega Millions has never had a positive expected value.  However, there could come a time when this was no longer true.  In that post, I provided all the math one would need to calculate if there was indeed a positive expected value.  There is one problem though.  To do the calculations one would need the ticket sales.  While I found a good site with all the past sales posted, what matters is current sales.  What is needed then, given historical records, is a way to predict what current sales will be.  I will attempt to use linear least squares to find a model that fits the historical data well enough.

Overview of Linear Least Squares
If you're already familiar with least squares or don't care, then feel free to skip ahead to the next section.

You may remember the general rule for solving systems of equations: you need as many equations as you have unknowns (variables), no more, no less.  If you have too few equations you end up with an infinite number of solutions.  If you have too many you'll end up with an inconsistent result, eg, one where 0=1 (this is ignoring the rare case where everything works out perfectly even with too many equations, a situation that pretty much has to be engineered).

Consider the following pairs of x and y coordinates: (-1,3), (0,2), (1,9).

We know that with three points we can fit a quadratic curve to this data.  Starting with the familiar quadratic equation: Ax^2 + Bx + C = y, we have three sets of values for x and y, and can thus generate a system of 3 equations:
A(-1)^2 + B(-1) + C = 3
A(0)^2 + B(0) + C =2
A(1)^2 + B(1) + C =9

This is our system of equations with three unknowns and three equations.  Note here that the unknowns are A, B, and C not x or y.  Solving this system yields values of: A=4, B=3, C=2, and thus 4x^2 + 3x + 2 = y.

It isn't hard to see why adding a new point could be a problem.  If the new point doesn't fall exactly on that line the equation is no loner true, and indeed there would be no quadratic equation that would satisfy the new system with 4 points.  This is all well and good from a mathematical perspective, but what if this data represented the real world?  In effect, we would be adding additional information about the world, but that would result in a losing information in our model (ie, going from having a perfect model to having none at all).  This doesn't seem to make sense.  There has to be a way to find the line that would fit the available data as well as possible.

Using Least Squares
What happens when we have a lot of data that still clearly is represented by a quadratic equation, but doesn't fit in perfectly?  Let's start with this data:
 x y -5 86 -4 57 -3 24 -2 11 -1 2 0 3 1 8 2 23 3 49 4 81 The plot here shows that these points are still clearly following a quadratic curve.  But they don't match up perfectly with our previously found curve.  Indeed, they won't match any quadratic curve we could find.  So we must use least squares to find the closest match.

To begin, we list the equations those values give:
$$A(-5)^2 + B(-5) + C = 86 \\ A(-4)^2 + B(-4) + C =57 \\ A(-3)^2 + B(-3) + C =24 \\ A(-2)^2 + B(-2) + C =11 \\ A(-1)^2 + B(-1) + C =2 \\ A(0)^2 + B(0) + C =3 \\ A(1)^2 + B(1) + C =8 \\ A(2)^2 + B(2) + C =23 \\ A(3)^2 + B(3) + C =49 \\ A(4)^2 + B(4) + C =81 \\$$ We now must extract the coefficients of A, B, and C and use them to build a matrix A.  Note that the coefficient of C is 1 in every equation.  Also note that we must square the appropriate values first.  This means that A(-5)^2 + B(-5) + C = 86 becomes A(25) + B(-5) + C(1) = 86.   Even if you've never worked with a matrix before it should be pretty simple to see what we are doing mechanically:
$$A= \left[\begin{array}{rrr} 25 & -5 & 1 \\ 16 & -4 & 1 \\ 9 & -3 & 1 \\ 4 & -2 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \\ 4 & 2 & 1 \\ 9 & 3 & 1 \\ 16 & 4 & 1 \end{array}\right]$$
We now take the values of y and use them to define a vector b:
bb{b}=[,,,,,,,,,]

We now use what is known as the 'normal equations': bb{hat{x}} = (A^T A)^(-1) A^T bb{b}. The resulting value of the vector hat{x} will be our values for A, B, and C.  Maple tells me that the values are: A=4.129, B=3.438, C=1.024.  Thus, our best fit curve is 4.129 x^2 + 3.438 x + 1.024 = y.  This plot looks pretty similar to the other one.  However, if you compare the two it's clear that the curve passes closer to the points in this one.

I'd like to mention a caveat about the normal equations we use to find our best fit equation.  The normal equations come from linear algebra, which as the name implies, deals with linear systems.  Therefore, we must be taking the coefficients for our matrix A from a system of linear equations.   "But wait!" you probably don't say, "we took the coefficients above from a group of quadratic equations.  They're not linear."  However if you look at what we actually did above, we first put our values for x and y in, and once we did that we had linear equations.  Take A(-5)^2 + B(-5) + C = 86 which evaluates to A(25) + B(-5) + C(1) = 86, that is indeed a linear equation.  This is a key point, and why I stressed above that x and y were not unknowns, A, B, and C were.  This reveals the power of the normal equations.  It is possible to make a great number of equations into linear equations if one can plug in data for certain values.  Arbitrary degree functions, exponentials, logarithms, they can all be converted to linear equations under the right circumstances.

Solving the normal equations is all pretty easy arithmetic.  Unfortunately, it is an excruciatingly long process.  However, if you live in a world where computers exist, you are in luck because computers love solving matrix problems.  I've made a Maple worksheet for the example problem I used here, with a few comments.  It should be pretty easy to figure out how to modify it to solve for different systems.  You only need to modify two lines (A and b).
http://daleswanson.org/blog/leastsquares.mw

The next post where I attempt to fit a model to Mega Millions jackpots and ticket sales is here:
http://daleswanson.blogspot.com/2012/07/attempting-to-predict-mega-millions.html