Sunday, July 1, 2012

Attempting to predict Mega Millions sales

In the last post, I gave an overview of the linear least squares method.  Here, we will use that method to try to fit a model to the historical data of Mega Millions jackpots and ticket sales.

As I pointed out before, in order to know whether buying a Mega Millions ticket has a positive expected return, one needs to know the number of tickets sold.  That data is only available after the fact.  However, one could attempt to predict the ticket sales from historical data.  We will do that now.

To begin, I've plotted ticket sales vs jackpot values.  The growth looks faster than linear; however, should our independent variable go in the base (`x^2`) or in the exponent (`e^x`)?  Also, are there any other factors we can find data for that might affect sales?  Our process allows for any number of independent variables (well, any number less than the 250 or so data points I'll be using), so we might as well put anything we can think of in there.  The least squares process will assign small coefficients to factors that have little correlation with sales.

A quick word about the data.  I got it from the same site I've been using.  In February 2010, states became allowed to sell both Mega Millions and Powerball tickets (cross-selling).  Before that there were 12 states selling Mega Millions; after, it rose to 35.  Because of that dramatic change in sales, I am only using data from then until now.  Also, a handful of states have been added since then.  That same site has sales figures for every state, which allowed me to correct for this.  Although I could have made exact corrections, I just adjusted each drawing by the average percentage the added states contributed.  I did this because I was too lazy to do it the more thorough way, the percentage of sales was pretty consistent, and the total change ranges from 0% to 3.8% at most.  There are 249 drawings, from Feb 2, 2010 to June 19, 2012, inclusive.

Identifying Variables
The obvious factor is the jackpot amount.  We'll represent it as `x_1`.  Another factor that I thought of is simply time.  By this I mean that as time progresses the game becomes more well known and more people play, leading to increased sales.  I'm not too sure how significant this factor will be, but I'll throw it in as `x_2`.  For now we'll go with these two.  There are a few more I can think of to throw in depending on how these two work out.

The next thing to consider is how important these variables are.  More important variables can be used multiple times with different degrees.  Since `x_1` is likely to be very important, we can use it as both `x_1` and `x_1^2`, with a different coefficient on each.  If we felt two factors were interacting with each other, we could use a term like `(x_1 * x_2)`.

Model 1
I'll begin with a simple model and see how it goes:
`y = beta_0 + beta_1 x_2 + beta_2 x_1 + beta_3 x_1^2`
Where y is ticket sales, `x_1` is the jackpot, and `x_2` is the drawing number.  Jackpot and ticket sales will be in millions.  Drawing number is an arbitrary count I made up, assigning 1 to the first drawing I have data for and working up to 249 for the latest.  Hopefully it's clear that, since we are building the model here, it doesn't matter that we are using values in millions or an arbitrary counter, as long as we're consistent.

The betas (`beta`) are the variable coefficients we are trying to find with least squares.  Our goal here is to find actual numerical values for them, and then have a formula where we can plug in jackpot and count values and get out a predicted ticket sales.

If you want the actual details of what in the world we're doing here, see my previous post.  We use a spreadsheet to compute the values of `x_1` and `x_2`, and that gives us a matrix 4 wide by 249 long.  We copy and paste that from the spreadsheet into Maple (which does a nice job of understanding it as a matrix).  We get the vector `bb{b}` from the values of `y`, which are the ticket sales.  Pasting these two into Maple and running the worksheet gives our coefficients:
`beta_0 = 21.46800262,  beta_1 = 0.000282810868,  beta_2 = -0.1019284315,  beta_3 = 0.001671278620`

Therefore, model 1 is:
`y = 21.46800262 + ( 0.000282810868) x_2 - (0.1019284315) x_1 + (0.001671278620) x_1^2`
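
For those playing along without Maple, the same solve is easy to reproduce in Python with NumPy.  This is only a minimal sketch of the process, not my actual worksheet: the `jackpot`, `draw_number`, and `sales` arrays are hypothetical placeholders standing in for the 249 rows of spreadsheet data.

```python
import numpy as np

# Hypothetical stand-ins for the spreadsheet data (jackpot and sales in
# millions; draw_number runs 1..249 in the real data set).
jackpot     = np.array([12.0, 25.0, 41.0, 64.0, 94.0, 133.0, 176.0, 242.0])
draw_number = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
sales       = np.array([21.0, 22.0, 24.0, 28.0, 34.0, 45.0, 62.0, 122.0])

# Design matrix for model 1: y = b0 + b1*x_2 + b2*x_1 + b3*x_1^2
A = np.column_stack([np.ones_like(jackpot), draw_number, jackpot, jackpot**2])

# Least-squares fit.
betas, *_ = np.linalg.lstsq(A, sales, rcond=None)
predicted = A @ betas

print("coefficients:   ", betas)
print("normal distance:", np.linalg.norm(predicted - sales))
print("average error:  ", np.mean(np.abs(predicted - sales) / sales))
```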

Evaluating the Model
In addition, Maple tells us that the normal distance between our model and the actual data is 71.46.  If you live in 249-dimensional space it should be easy for you to visualize what this represents.  For those of you still in 3-dimensional space, you'll have to visualize an analogy.  Imagine we have a 2-dimensional plane in which all our answers must lie.  Now imagine we draw a line starting on the plane but pointing out into space in some random direction (a vector).  We can then imagine projecting that line onto the plane in order to get the best possible estimate of it within our allowed plane.  If you are still visualizing this, there is now a right triangle drawn between the real line and the projected line.  The distance between the two endpoints (one out in space, and one on our plane) is the third side of the triangle, and it is also the normal distance between the two lines.  This can be thought of as the error between the actual line and our projected 'line model'.  It should be clear that, of all the infinite lines we could draw in the plane, and the resulting distances given by them, the one with the smallest distance is the best.

This picture, which I stole from here, arguably does a better job of explaining this than my drunken ramblings.  Here the real line is `bb{b}` (blue), the projection is `tt{proj}_w bb{b}` (yellow), and the normal distance is `bb{b}-tt{proj}_w bb{b}`.  And hopefully, to help tie this to the concrete real-world example, note that `bb{b}` is our vector containing all the previous ticket sales.  It has 249 dimensions; however, we need to fit it into only 4 dimensions, as that is the number of unknowns in our model.
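
If the geometry sounds hand-wavy, here's a small NumPy sketch (with random data, nothing to do with the lottery) showing that the least-squares residual `b - proj` really is perpendicular to every column of the matrix, and that its length is the normal distance Maple reports.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(249, 4))   # columns span a 4-dimensional subspace of 249-dimensional space
b = rng.normal(size=249)        # the "real" vector (our ticket sales live here)

coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
proj = A @ coeffs               # projection of b onto the column space of A

# The residual is orthogonal (normal) to every column of A...
print(np.allclose(A.T @ (b - proj), 0.0))
# ...and its length is the normal distance between b and its projection.
print(np.linalg.norm(b - proj))
```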

Anyway, the point of all this is that our error is about 71.5.  We can use this to quantitatively compare this model to others.  We can also get a general feel for how this model holds up by calculating its predicted ticket sales in our spreadsheet and finding the relative errors from the actual sales.  Here I would say we do pretty well.  Most of the errors are less than 20%, with the largest being 40%.  That largest error happened on a drawing with a jackpot of $266M and sales of 79M tickets.  This is something of an outlier and can be seen on the plot above.  Another outlier happened a few months later, when a jackpot of $242M drew sales of 122M tickets.  Note that this is a smaller jackpot with higher sales, so it's not hard to see how our model failed to predict these.  It is worth looking at these outliers and attempting to come up with factors that may have caused them.  There are several I can think of, but for now let's stick to what we have and try a different model.

Model 2
I'd like to try an exponential model, i.e., one where the independent variable is in the exponent.  Something like this:
`y = beta_0 * e^{beta_1 * x_1}`

There are some important notes about this one.  First, notice that simply plugging in values for `x_1` will not make this linear in the remaining unknowns.  Since we must have a linear equation for least squares to work, we must find a way to fix this.  This leads to the second point: we only have a single independent variable here.  Since we are about to use logs to make this a linear equation, we are limited in how we can set it up.  We cannot add several variables together, as we did before, because we would end up with the log of that sum, and there isn't any property of logs that will let us do anything with that.

We make this linear by taking the natural log of both sides: `ln(y) = ln(beta_0 * e^{beta_1 * x_1}) to ln(y) = ln(beta_0) + (beta_1 * x_1)`
Is this linear?  Well, `ln(y)` is OK, because once we plug in our values for y it will give a number.  It should be clear `(beta_1 * x_1)` is fine, since it is similar to the terms in the other models we've used.  It may seem like `ln(beta_0)` could give us problems; however, since `beta_0` is a constant and there is no variable in the argument of the log, we can effectively substitute in a new variable such that `beta_2 = ln(beta_0)`.  I'll leave this step in my head for simplicity's sake.

We are thus ready to extract the matrix and vector and use Maple to solve.  Remember to use `ln(y)` and not just `y` values in the vector.

Maple gives us the vector [2.787582661, 0.006493826665].  We must be mindful of what this represents.  `beta_1 = 0.006493826665`; however, `beta_0 != 2.787582661`.  Instead, that is the natural log of `beta_0`.  We find that `beta_0 = e^2.787582661 = 16.2417105887`.  Our Model 2 is then:
`y = 16.2417105887 * e^{0.006493826665 * x_1}`
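
The log-transform trick is just as easy to check in NumPy.  A minimal sketch, again with hypothetical stand-in values in millions: fit a line to `ln(y)`, then exponentiate the intercept to recover `beta_0`.

```python
import numpy as np

# Hypothetical jackpot (x_1) and sales (y) values, both in millions.
jackpot = np.array([12.0, 50.0, 105.0, 200.0, 355.0, 640.0])
sales   = np.array([20.0, 26.0, 35.0, 58.0, 130.0, 652.0])

# Fit ln(y) = b2 + b1*x_1, where b2 stands in for ln(b0).
A = np.column_stack([np.ones_like(jackpot), jackpot])
(b2, b1), *_ = np.linalg.lstsq(A, np.log(sales), rcond=None)

b0 = np.exp(b2)   # undo the substitution b2 = ln(b0)
print(b0, b1)

# Predicted sales (in millions) for a given jackpot (in millions):
print(b0 * np.exp(b1 * 400.0))
```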

Maple also tells us the normal distance is only 1.6.  Compare this to 71.5 for model 1, and it would seem this is a much better model (keep in mind, though, that this fit was done on `ln(y)` values, so the two distances aren't measured on quite the same scale).  Actually looking at the errors gives some interesting insights.  One thing I notice is that with this model single-digit error percentages are more common.  However, the largest error is 59%, quite a bit more than the largest of 40% in model 1.  That error is also on the recent $640M jackpot.  This is rather important, as large jackpots are exactly where the model is most needed.  Model 1 gave a prediction of 641M sales vs the actual figure of 652M, for an error of about 2%.  Taking the average of all the individual errors with both models gives 9.3% for model 1 and 7.6% for model 2.  The numbers seem to be in favor of model 2, but perhaps that would change if we only considered high jackpots, which is where the model is needed.  Perhaps we should recalculate the models with only the higher jackpots.

For now though, we'll try adding some new variables.

Model 3
I'll revert to the basic format of model 1, as it's easier to add variables to.  So what are some other possible factors?  Well, to begin with, I figure time of year likely has some effect.  The problem is how to represent time of year in a way that provides meaningful data.  At first just using the month number seems OK; however, this creates a huge jump between December and January.  I doubt that the actual month makes much difference (although, since birthdays aren't evenly distributed, maybe it would), rather it's the general time of year that matters.  I thought the best way to do this was to create a variable for each time of year.  At first, this meant just having a binary 0 or 1 for each of 4 variables (seasons).  Then I realized that I might as well vary the intensity of the variable depending on how strongly it was that season.  Then I decided to just go with two seasons (summer/winter) centered around the hottest and coldest parts of the year (July 20, Jan 20).

I was then debating whether this factor should be linear or not.  I noted how average temperatures change quickly in fall and spring, and then generally hover in summer and winter.  Then I realized average temperature itself is a great factor for this.  I always have trouble finding historical weather data on the internet, but luckily averages are a bit easier.  I went with this source for averages in Iowa, since the exact location really didn't matter, as long as it was somewhat representative of the US as a whole.

A second factor that I thought could prove useful was the previous drawing's sales.  I figured that this would help capture factors I didn't think of.  On my first try at this I missed something obvious: when someone wins the jackpot and it resets, the previous sales don't matter (or at least not in the same way).  To fix this I combined the previous sales with a binary variable that is 1 normally and 0 when the jackpot has just reset.

So, I'm defining three new variables.  `x_3` will be the previous drawing's ticket sales.  `x_4` will be a binary switch: 1 normally and 0 when the jackpot resets.  `x_5` will be the average temperature.  And our model is:
`y = beta_0 + beta_1 x_2 + beta_2 x_1 + beta_3 x_1^2 + beta_4(x_3*x_4) + beta_5 x_5`

Note this is the same as model 1, but with the two new terms added on the end.
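
Building the design matrix is the same as before; the only new wrinkle is the `(x_3 * x_4)` column, which zeroes out the previous-sales term whenever the jackpot has just reset.  A minimal sketch with hypothetical placeholder data:

```python
import numpy as np

# Hypothetical inputs; in the real data set each array has 249 entries.
jackpot     = np.array([12.0, 25.0, 41.0, 64.0, 94.0, 133.0, 12.0, 20.0])  # x_1
draw_number = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])           # x_2
prev_sales  = np.array([20.0, 21.0, 22.0, 24.0, 28.0, 34.0, 45.0, 19.0])   # x_3
not_reset   = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0])           # x_4 (0 right after a win)
avg_temp    = np.array([28.0, 30.0, 33.0, 37.0, 42.0, 48.0, 55.0, 61.0])   # x_5
sales       = np.array([21.0, 22.0, 24.0, 28.0, 34.0, 45.0, 19.0, 20.0])   # y

# Design matrix for model 3, including the (x_3 * x_4) interaction term.
A = np.column_stack([
    np.ones_like(jackpot),   # constant
    draw_number,             # x_2
    jackpot,                 # x_1
    jackpot**2,              # x_1^2
    prev_sales * not_reset,  # x_3 * x_4
    avg_temp,                # x_5
])
betas, *_ = np.linalg.lstsq(A, sales, rcond=None)
print(betas)
```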

Fast forward some spreadsheet and Maple work and we have our coefficients:
`beta_0 = 24.58,  beta_1 = -0.000739,  beta_2 = -0.10927,  beta_3 = 0.001678,  beta_4 = -0.0463,  beta_5 = 0.0075598`

And our model:
`y = 24.58 - (0.000739) x_2 - (0.10927) x_1 + (0.001678) x_1^2 - (0.0463)(x_3*x_4) + (0.0075598) x_5`

How well does this model perform?  Well, the good news is that it's better than model 1; the bad news is that it's basically the same.  Recall that model 1 had a distance of 71.5 and an average error of 9.3%.  Model 3 has a distance of 70.2 and an average error of 9.3%.  I really felt like the temperature should be a good indicator of the general season, and that season should have some effect on sales.  I also considered that perhaps forcing the previous-sales figure to be 0 when the jackpot reset was doing more harm than good.  So, I redid this model with that term removed and called it model 4.

I won't bother posting the details of model 4 because it's much the same: a distance of 70.2 and an average error of 9.3%.  I should mention that if you go out to more decimal places, each additional piece of information does decrease both the average error and the distance.  However, I'm not going to pretend that is an improvement in the model.  Given enough variables, no matter how arbitrary they are, the least squares method will approach a perfect representation of the given data.  The problem is that this would only be perfect for the given past data.  If the additional variables have no real effect on the sales, there would be no corresponding increase in predictive accuracy.  Many a person has lost money because they over-tweaked a model with past data and were so satisfied with how well the model described past results that they were sure it would predict future results equally well.

Another thing worth noting is that our `beta_1` term, which was positive in model 1, is negative in models 3 and 4.  It is also very small in all models.  This indicates that there isn't any strong correlation here.  So, as long as we are removing superfluous variables, why not that one too?  For those of you keeping score at home, this means we are left with just one variable, the jackpot amount.  This leads us to:

Model 5
We are left with a simple quadratic:
`y = beta_0 + beta_1 x_1 + beta_2 x_1^2`

Solving for its coefficients gives:
`y = 21.50354999 - (0.1019449781) x_1 + (0.001671373144) x_1^2`

It has a distance of 71.5 and average error of 9.3%, ie, the same as models 1, 3, and 4.

Model 2 vs Model 5
So it would seem that model 2 is the winner.  However, I still don't like the huge error on the largest jackpot.  I've plotted these two models and compared them to the actual sales.  I had plotted all the models at first, but since 1, 3, 4, and 5 are so similar, they can be treated as the same on the plot.  Anyway, the plot shows that model 5 is very close to the actual sales at all points.  Model 2 is too, except for the largest jackpot, where it is way off.  As I said above, since the big jackpots are what matter most, I'm hesitant to recommend model 2 over 5, even though all the numbers back it up.  Maybe only looking at drawings with a large jackpot will improve the models.

Large Jackpots
I decided to use $100M as a cutoff for the jackpot.  This gives 48 drawings to use for our models.  I then found the coefficients for models 2 and 5 with this smaller data set.  I'm calling these models 2a and 5a.
Model 2a:
`y = 18.51083081 * e^{0.006057907333 * x_1}`

Model 5a:
`y = 43.12456321 - (0.2824251538) x_1 + (0.001932541134) x_1^2`

Of course, the real question is how accurate they are.  The answer is that model 2a has a distance of 0.74 and an average error of 7.0%.  Model 5a has a distance of 53 and an average error of 5.8%.  So, both are more accurate than the original models based on the larger data set.  This isn't surprising, for the same reason it wasn't surprising when adding variables improved accuracy without actually improving the models' predictive power.  Here, however, I think this may be an actual improvement.  It's really hard to say, though, until we have future high-value jackpots to test the models against.
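
Refitting on the large jackpots is just a one-line filter in front of the same solves as before.  Here's a minimal sketch (hypothetical placeholder data again) that refits both the quadratic and the exponential forms on drawings with jackpots of $100M or more:

```python
import numpy as np

# Hypothetical placeholder data; jackpot (x_1) and sales (y) in millions.
jackpot = np.array([12.0, 50.0, 105.0, 171.0, 200.0, 266.0, 355.0, 640.0])
sales   = np.array([20.0, 26.0, 35.0, 47.0, 58.0, 79.0, 130.0, 652.0])

# Keep only the large-jackpot drawings.
mask = jackpot >= 100.0
jp, s = jackpot[mask], sales[mask]

# Model 5a: quadratic in the jackpot.
A5 = np.column_stack([np.ones_like(jp), jp, jp**2])
betas_5a, *_ = np.linalg.lstsq(A5, s, rcond=None)

# Model 2a: exponential, fit on ln(sales) as in model 2.
A2 = np.column_stack([np.ones_like(jp), jp])
(ln_b0, b1), *_ = np.linalg.lstsq(A2, np.log(s), rcond=None)

print("model 5a coefficients:", betas_5a)
print("model 2a coefficients:", np.exp(ln_b0), b1)
```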

Will it ever be profitable to play the Mega Millions?
Now that we have a way of estimating ticket sales based on the jackpot amount, we have the tools needed to answer this question.  In my Mega Millions post I provided the formula needed to determine the expected value of a Mega Millions ticket.  It requires both the jackpot and the ticket sales, but we now have ticket sales as a function of jackpot.  So, we can plot the expected value of a ticket for various jackpots.  The formula for the expected jackpot is:
`V = \frac{J}{e^r}\sum_{n=1}^{\infty} \frac{r^n}{n(n!)}`

Where r is the rate of winners. This is found by this formula:
`r = 5.699 \times 10^{-9} * y`
Where y is the number of tickets sold, as given by our shiny new model.  Remember that we used millions in the model above, whereas these formulas all want full figures.  You'll have to convert this manually if you're going to do these calculations yourself (and I know you are).  Also keep in mind that we determined that the non-jackpot prizes were worth about $0.15 per ticket.  Thus, our breakeven point is $0.85.
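
If you want to check the arithmetic without Maple, here's a minimal Python sketch of those two formulas, with the millions-to-dollars (and millions-to-tickets) conversion made explicit.  It truncates the infinite sum once the terms stop mattering, and it leaves out the tax and per-ticket adjustments covered in the previous post, so it's the raw formula only.

```python
import math

WIN_RATE_PER_TICKET = 5.699e-9      # the rate constant from the formula above

def expected_jackpot(jackpot_millions, sales_millions, terms=200):
    """V = (J / e^r) * sum_{n>=1} r^n / (n * n!), with r = 5.699e-9 * y."""
    J = jackpot_millions * 1e6      # millions of dollars -> dollars
    y = sales_millions * 1e6        # millions of tickets -> tickets
    r = WIN_RATE_PER_TICKET * y
    total, term = 0.0, 1.0
    for n in range(1, terms + 1):
        term *= r / n               # term is now r^n / n!
        total += term / n           # add r^n / (n * n!)
    return J * math.exp(-r) * total

# The record drawing mentioned above: a $640M jackpot with 652M tickets sold.
print(expected_jackpot(640.0, 652.0))
```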

In theory, we should be able to substitute these equations into each other and end up with one equation with only one independent variable, J.  Then we could plot that equation, and even take the derivative to find where (and if) it peaks at some maximum expected value.  Unfortunately, if you examine those equations you'll see that we end up with a quadratic for r.  That quadratic then has to be substituted in as an exponent of e; messy but doable.  However, the real problem comes from within the summation.  We have to raise r to an incrementing power, and there's no easy way to do this symbolically.  Even letting Maple do the work, we end up with a giant expanded summation hundreds of terms long.  I feel there may yet be a way to get a simplified equation here, but I am unable to figure it out.

It's just as well, as it gave me an excuse to figure out some Maple programming.  I simply iterated through all the integer jackpots, from $12 million to $3000 million, and found the expected values (with taxes accounted for; see the previous post for details).  The results were interesting.
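
Maple isn't required for the brute-force search either.  Below is a self-contained Python sketch of the same idea: feed each jackpot through a sales model (I've used model 2a here; my worksheet may have used a different one), evaluate the expected-jackpot formula, and report the jackpot that maximizes it.  It skips the tax adjustment and stops at $1000 million, since the naive truncated sum misbehaves once model 2a's predicted sales get astronomically large, so don't expect its peak to match the figure below exactly.

```python
import math

WIN_RATE_PER_TICKET = 5.699e-9              # expected winners per ticket sold

def model_2a_sales(jackpot_millions):
    """Predicted ticket sales in millions (model 2a, the large-jackpot fit)."""
    return 18.51083081 * math.exp(0.006057907333 * jackpot_millions)

def expected_jackpot(jackpot_millions, terms=1000):
    """V = (J / e^r) * sum_{n>=1} r^n / (n * n!), untaxed."""
    J = jackpot_millions * 1e6                      # dollars
    y = model_2a_sales(jackpot_millions) * 1e6      # tickets
    r = WIN_RATE_PER_TICKET * y
    total, term = 0.0, 1.0
    for n in range(1, terms + 1):
        term *= r / n                               # r^n / n!
        total += term / n                           # r^n / (n * n!)
    return J * math.exp(-r) * total

# Brute-force scan over integer jackpots, as in the Maple worksheet.
best = max(range(12, 1001), key=expected_jackpot)
print("expected value peaks around a $%d million jackpot" % best)
```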

Since the recent record jackpot had the best expected value in the other post, I had expected that the expected value would keep increasing with the jackpot.  I knew that the number of multiple winners would increase quickly as the jackpot increased, particularly as the model shows exponential growth in sales.  This turned out to be the case, but to my surprise the peak was lower than the record jackpot.

The best expected value turns out to be at a jackpot of $530 million.  The value is $0.564751, which falls $0.285249 short of the adjusted ticket price.  I plotted the expected values at various jackpots and found it rather interesting.  The plot is immediately recognizable as a blackbody radiation curve.  I'm not sure what to read into this, other than the obvious fact that the sun controls the Mega Millions jackpot.


Final Thoughts
So, this has all been a good time, but I suppose I should be wrapping this post up at some point.  I'll do that similarly to the first Mega Millions post, by negating everything I've done here with some casual observations.  Despite the impression someone may get after these posts, I don't actually follow the Mega Millions at all, and really had very little idea about it prior to the first post.  Something I'm still not certain of is how the jackpot is calculated and announced.  It seems the jackpot increases continuously during the period between drawings.  Worse, it seems this increase is based on sales.  In other words, the independent variable we have been using (jackpot amount) actually seems to depend on the dependent variable (sales), which itself depends on the jackpot.  Both are dependent on each other, and the system is a positive feedback loop.  This means the final jackpot amount isn't known in advance and can't simply be used to model ticket sales.  This could be overcome if the jackpot advertised at the last moment to buy tickets is close enough to the final jackpot.

Either way, the Mega Millions certainly has a lot of mathematical life in it.  I could probably keep writing posts about it forever, but won't.

I've updated the lottery spreadsheet with quite a bit of data:
http://daleswanson.org/blog/lottery.ods

I've also uploaded a zip file of all my Maple worksheets for those of you who wish to play along at home:
http://daleswanson.org/blog/megamillions.mapleworksheets.7z
