Dale Swanson: July 2012

Saturday, July 28, 2012

How to compile C++ code in Notepad++ with gcc/g++

I use Notepad++ for all my coding. It really is a great text editor. These days I mostly use Perl so I don't have to worry about anything else, but I occasionally use C++ for speed in numerical tasks. The system I've been using for years is to edit in Notepad++ then hop over to Dev-C++ to compile. Dev C++ is an IDE that uses g++ to compile, so it's pretty silly to use it as a middle man. Still, it has worked for years, and continues to. The problem is that it hasn't been updated in a long time, and it still uses g++ 3.4.2 which is close to 10 years old.

I knew Notepad++ could run programs and that I could just compile directly via command line from it, I had just been too lazy to set it up. I decided to make it work, and in researching how I found a lot of old info. I decided to explain the process I used here for future reference.

The first step is to acquire nppexec, which used to be bundled with Notepad++ but isn't any more. It comes as a zip of one dll and two folders, just put them all in the plugins folder of Notepad++.

Now you need gcc/g++. MinGW is the standard way of doing this on Windows. I went with this version because it came with a lot of useful libraries like boost. Just extract that to some path with no spaces (spaces may work, but are generally a bad idea), and run the batch file. You may need to add that path to the windows path system variable; I already had it in there from a past install of MinGW. At this point test the install by running: g++ --version in a command prompt. I have something like 5 versions of g++ floating around my computer, and it kept running an old version. I finally just deleted that one.

Now open Notepad++ and hit F6 to bring up the execute window. Here you can enter this code:

npp_save

cd  "$(CURRENT_DIRECTORY)"

g++ "$(FILE_NAME)" -o $(NAME_PART) -march=native -O3

NPP_RUN $(NAME_PART)

I'll explain what each line does, so you can edit it as you see fit. The first saves the current document. Next changes the directory to the one the file is in. Before this I kept compiling and couldn't find the executable. It was in the Notepad++ folder. Skipping the third for now, the last line runs the executable.

The third line does the actual compiling. g++.exe "$(FILE_NAME)" should be clear, and -o $(NAME_PART) is the output filename. -O3 uses the highest optimization level in g++. There is some debate online as to whether the 3rd level is more trouble than it's worth. A lot of people suggested that the 3rd was too aggressive, which could cause compile errors and is even sometimes slower than the 2nd level. This wasn't unanimous though, and there weren't any references to actual examples. Frankly, I felt like this was probably an example of cargo cult programming where it may have been true 15 years ago but people have just been repeating the same thing without ever stopping to reevaluate if it is still true. In my exhaustive test of two programs it produced a program that was the same as level 2 as far as I could tell. -march=native detects which instruction sets (eg SSE) are supported by the system doing the compiling and enables any that are. This is a good option when the code will run on the same or similar system to the one compiling.

Anyway, once you have this code save it as something like 'Compile C++'. You could create a few versions here, like one that doesn't run the code after compile, or that uses different flags. I made one for interpreted languages that don't need compiling that is just:

npp_save

NPP_RUN $(FILE_NAME)

At this point, you are almost done. Hitting F6 brings up that execute box where you can select which script to run. Hitting Ctrl+F6 runs the last ran script. However, I found hitting Ctrl+F6 to be annoying. So I decided to remap it to just F5 (leaving F6 to bring up the box). Under settings go to shortcut mapper. Under main menu scroll down and find the command currently mapped to F5 (the simpler run box), and change it to something else or nothing. Then under plugin commands find direct execute last, mapped to Ctrl+F6 and change it to F5.

That's it. The first time you hit F5 in a document it brings up the box to let you pick which script to run. Then every time after that it remembers and just runs it.

Thursday, July 26, 2012

Risk Management

This post will be about risk management. Well not really, I just wanted to post that clip. Anyway, a few days ago I posted about a site that does some good data analyses. One of them was on the game Risk. This post will be a hodgepodge of random items related to Risk.

The Game

Risk is one of the staples of my gaming diet. The computer version of Risk, that I've been playing for at least 10 years, is called TurboRisk. Risk is notorious for taking a long time to play. But that program, while maintaining almost identical rules, reduces games to 5 minutes. The reason is simply automation. I'd estimate a large battle (100 vs 100 armies) would take at least 15 minutes (possibly closer to an hour) in real life, but in TurboRisk would happen in about a second.

The speed increase is not the only benefit of TurboRisk though. It allows anyone to write AI programs that you can play against. There are about a dozen AI scripts included and they all have some interesting varieties of strategy.

The Program
My strategy is pretty formulistic, and I've long wanted to program an AI script based on it. Unfortunately, the scripts must be written in pascal. I had never used pascal, and figured it couldn't be that bad. However, pascal is an old language and is closer to BASIC than C. While I've defended BASIC a lot in my life, and still think it's a valid first language, after using C like syntax for so long the thought of using 'begin' and 'end' to deliminate code blocks wasn't appealing. Still, I gave it a shot, but the combinations of the whole new API, having to code in the custom IDE instead of Notepad++, and length needed even for simple AI caused me to give up after a few hours.

Basic Rules
Before I get on to the fun* stuff I feel the need to explain the basic rules of Risk. Feel free to skip over them if you are already a Risk master, however no other reason for skipping is acceptable.

Risk is a turn based strategy game. There are about 40 territories that players can control. You can attack any territory bordering one of yours (a handful of black lines make some territories border across an ocean). You can attack as many times as you want on one turn. To attack you must have at least 2 armies in the attack territory. The mechanics of attacking are a bit odd. The attacker can use from 1 to 3 dice, the defender only 1 or 2. Both players must have at least 1 army per dice being used. So, if I'm attacking with 2 armies a territory with 1 army I can only use 1 or 2 dice and the defender can only use 1. It's best to use as many dice as you can.

Both players roll, and then the dice are arranged in order and paired up. Each pair is compared and who ever has a higher number wins (defender wins ties). The loser loses 1 army for each loss. Example: Attacker rolls 5, 3, 1; defender rolls 4, 3. 5 beats 4 (defender loses 1 army), but 3 ties 3 (attacker loses one army). If the defender had only rolled 1 dice only one army in total could have been lost, but the attacker would have had 3 chances to roll higher than him.

This odd rolling method means that the attacker has a clear advantage (being able to roll more dice), but ties going to the defender is also a clear advantage. The net result is that for 12 vs 12 and below the defender has the advantage; above 12, the attacker has the advantage.

Moving on, if you defeat all the defending armies you win the territory, and may move any number of armies from the attacking territory into it (you must maintain at least 1 army in all your territories at all times). The attacker cannot lose his territory in an attack (but if reduced to 1 army it could be an easy target come someone else's turn). After attacking however many territories you want on one turn, you get one free move. You can move armies from one of your territories to another bordering it (the same as if you had just conquered it). Then your turn ends.

Going back to the start of your turn, you are given new armies to place on any of your territories. The amount you get is the sum of three sources. First is the number of territories you control (at the start of that turn) divided by 3 (integer only). The second is a bonus for controlling an entire continent. The bonus is hardcoded, but ranges from 2-7, and depends on the size. The last source of armies is for trading in cards. You get one card at the end of any turn where you captured at least one territory. There are three kinds of cards; you need either one of each, or 3 of one kind to trade in. The number of armies you get starts at 4 and grows to 15 for the 6th set, and adding 5 for each set after that. An important note, in the real game this count is for all card sets traded in by anyone. In TurboRisk each player has his own count.

To summarize: each turn consists of 3 parts. Part 1, getting armies based on territories, continents and cards. Part 2, attacking. Part 3 moving.

My Strategy

My strategy is a somewhat common one. Capture a continent and hold it while getting the bonus armies, and slowly build up forces. Many of the bots do this with Australia as it only has one choke point to control. I, however, prefer South America. It has 2 choke points, and the same bonus as Australia , but it has much better options for growth. From Australia you can only go into Asia, which is wide open and hotly contested. From SA you can expand up through NA and only have 3 choke points total. Then you can further expand to take both Europe and Africa while still only having 3 choke points (6, 7 in map). If you can control that for a few turns you shouldn't have any problem taking the rest of the map.

A word about choke points. Take 2 and 5, between SA and Africa, for example. If you are holding SA you will have to build up forces in that choke point. However, rather than putting them on your continent (2) put them in Africa (5). This not only gives you 1 more territory, but also denies the Africa continent bonus to anyone else. The same can be done for all the choke points.

My Analysis
I pointed out this great analysis above. I wanted to write a program to check these odds, and it was a good excuse to do something in C++. I spent a while working on speed improvements, with mild success. My results match his pretty closely, except for one key fact. As I mentioned above, you must leave 1 army behind when you conquer a new territory. You must also have at least 2 armies before you can attack at all. This, in effect, means if you are attacking with 15 armies, only 14 are useful. As far as all the math is concerned this is the way to go. This is what he did, but he never added that army back into his analysis. This was intentional on his part: "All my calculations are based on the number of characters doing the actual attacking, not the number of armies on the starting square."

Still, I feel like it gives the wrong impression about the odds. He says that 5 vs 5 is the cross over point, when it is actually 13 vs 13 when you count all armies. Again, it's really just a matter of semantics, but I think it's important to remember which version you are dealing with.

Anyway, I felt like this would be a good thing to code up in javascript, so I largely replicated my C++ code in javascript. This gave me the excuse to compile some common javascript functions I've used several times into a common shared file.
http://daleswanson.org/things/risk.htm

After finishing it I noted it was fast enough (50 vs 50 and 10,000 rounds takes about 2 seconds for me), but, still I would have liked it to be faster. A search for risk odds turns up several much better versions. I decided to blame it on the default javascript PRNG. Several hours of hunting and benchmarking later, and I can say the default is the fastest. The first site seems to do the calculations on the server side, which seems odd, but does result in about 2 orders of magnitude of speed increase. The second site is all javascript though. Looking at the script shows the key difference. It has hardcoded the odds for each of the six possible dice combinations. This is not only faster and more accurate, but is exactly how the guy in the post that started this all did it.

Making a program to calculate this all with recursive probability seems fun, so maybe I'll do that at some point. History says that I won't though.

* May not actually be fun.

Tuesday, July 24, 2012

Why programmers work at night

http://swizec.com/blog/why-programmers-work-at-night/swizec/3198

But even programmers should be sleeping at night. We are not some race of super humans. Even programmers feel more alert during the day.

Why then do we perform our most mentally complex work work when the brain wants to sleep and we do simpler tasks when our brain is at its sharpest and brightest?

Because being tired makes us better coders.

Similar to the ballmer peak, being tired can make us focus better simply because when your brain is tired it has to focus! There isn’t enough left-over brainpower to afford losing concentration.

Monday, July 23, 2012

Finding interesting things in data

http://www.datagenetics.com/blog.html

This is the best site I've found in a while.

Wednesday, July 18, 2012

Cryptographic Hash Generators

So, I made a cryptographic hash generator:

https://wetzel.dev/tools/hash.htm

Unnecessary Background:

There are plenty of hash generators on the internet, like this one that I stole all the code from. My problem with this stuff is I never like the UI. First, they force you to hit a button, instead of just making it generate as you type. Second, despite the fact that he has all the code for several hashes, the page only displays either sha-1 or md5, and forces you to pick one at a time.

Extra Unnecessary Background:

If you don't know, there are a bunch of extensions that take a strong general password + simple per site password and use a hash to generate a very strong password that is actually inputted into the site. As with above, while I like the concept I don't like the implementations.

How I would like this to work is this: Take the general pass + domain name + per site password (any of which could be blank or not used), and generate the sha-512 hash of it. Then have the user tell it what characters are allowed and how long it may be. Then convert to a string that meets those constraints while maximizing entropy.

This would mean you could create a nice long pass phrase using all characters as the general password, and enter it once per session, or even store it locally. Then for extra secure sites you could enter an additional password that would only add to security. Using the domain name would mean that your passwords would be different for every site even without the extra per site password. This would also mean you could enter your passwords with no length or character restrictions and convert them to conform with whatever horrible rules various sites had.

The extension I linked to above is pretty good, but only allows for up to 14 chars, and you can't pick only lower or upper case (only both or none). While 14 chars is very strong for the type of result you're going to get from a hash, there are problems. What if a site doesn't allow the full array of special characters the extension uses, and only allows lower case? You'd be stuck with only digits, which would dramatically reduce entropy. Instead, why not allow any length you want? If someone is using something like this why not let them use the full allowed password size? Also, let the user pick exactly what characters are allowed; some sites have some really bizarre password restrictions.

Anyway, since any extension based password generator needs a web based version for when you are away from your home computer, I decided to make the web version, while knowing full well that I will never make the extension. I then decided to make it more general, and thus it's now just a generic hash generator.

Sunday, July 15, 2012

Orion Nebula: The Hubble View

http://apod.nasa.gov/apod/ap120715.html

There have been a lot of really good astronomy picture of the days, but I don't think I've ever posted one before. This one is really good though.

Saturday, July 14, 2012

America’s economy: Points of light

http://www.economist.com/node/21558591

A decade ago Air Tractor sold almost all its crop-dusting and fire-fighting aircraft in the United States, leaving it vulnerable both to America’s business cycle and its weather. Now, helped by federal financing, it has increased foreign sales to about half its total. Employment has more than doubled, to 270. From its home in Olney, Texas (population 3,285), Air Tractor this year will sell 40 aircraft, a fifth of its annual total, to Brazil, which needs bigger crop-dusters to expand grain sales worldwide. “If we can do it from a town that has three stop-lights and one Dairy Queen, it can be done by anyone,” says David Ickert, the chief financial officer.

This article does a good job of pointing out the benefit of globalization and a strengthening Chinese economy. As the Chinese become richer, they go from an unlimited source of cheap labor to a new huge consumer pool.

Thursday, July 12, 2012

zxcvbn: realistic password strength estimation

I've written a password strength estimator that attempts to deal with basic dictionary attacks, but it only attempts a very small dictionary. I've found this checker that works very well, and provides a lot of info about the weaknesses in your password.

http://dl.dropbox.com/u/209/zxcvbn/test/index.html

http://tech.dropbox.com/?p=165

Sunday, July 1, 2012

Attempting to predict Mega Millions sales

In the last post, I overviewed the linear least squares method. Here, we will be using that method to attempt to fit a model to the historical data of Mega Millions jackpots and ticket sales.

As I pointed out before, in order to know if buying a Mega Millions ticket has a positive expected return one needs to know the tickets sold. That data is only available after the fact. However, one could attempt to predict the ticket sales based on historical data. We will do that now.

To begin, I've plotted ticket sales vs jackpot values. It seems exponential; however, should our independent variable be in the base (`x^2`) or exponent (`e^x`)? Also, are there any other factors we can find data for that might affect sales? Our process allows for any number of independent variables (well any number less than the 250 or so data points I'll be using), so we might as well put anything we can think of in there. The least squares process will minimize factors if they don't have any correlation.

A quick word about the data. I got the data from the same site I've been using. In Feb 2010 states became allowed to sell both Mega Millions and Powerball tickets (cross-selling). Before that there were 12 states selling Megamillions, and after it rose to 35. Because of that dramatic change in sales, I am only using data from then until now. Also, a handful of states have been added since then. That same site also has sales figures for every state, which allowed me to correct for this. Although I could have done the exact corrections, I just modified each drawing by the average percentage the states added. I did this because I was too lazy to do it the more thorough way, the percentage of sales was pretty consistent, and the total change ranges from 0% - 3.8% at most. There are 249 drawings, from Feb 2, 2010 to June 19, 2012 inclusive.

Identifying Variables
The obvious factor is jackpot amount. We'll represent it as `x_1`. Another factor that I thought of is simply time. By this I mean, as time progresses the game becomes more well known and more people play, leading to increased sales. I'm not too sure how significant this factor would be, but I'll throw it in as `x_2`. For now we'll go with these two. There's a few more I can think of to throw in depending on how these two work out.

The next thing to consider is how important these variables are. More important variables can be used multiple times with different degrees. Since `x_1` is likely to be very important, we can use it as both `x_1` and `x_1^2`, using a different coefficient for both. If we felt two factors were interacting with each other we could use something like `(x_1 * x_2)`.

Model 1
I'll begin with a simple model and see how it goes:
`y = beta_0 + beta_1 x_2 + beta_2 x_1 + beta_3 x_1^2`
Where y is ticket sales, `x_1` is jackpot, and `x_2` is drawing number. Jackpot and ticket sales will be in millions. Drawing number will be an arbitrary count I made up. Assigning 1 to the first drawing I have data for, and working up to 249 for the latest. Hopefully it's clear that since we are going to build the model here it doesn't matter that we are using values in millions or an arbitrary counter, as long as we're consistent.

The betas (`beta`) are the variable coefficients we are trying to find with least squares. Our goal here is to find actual numerical values for them, and then have a formula where we can plug in jackpot and count values and get out a predicted ticket sales.

If you want actual details of what in the world we're doing here, see my previous post. We use a spreadsheet to find the values of `x_1` and `x_2` and that gives us a matrix 4 wide by 249 long. We copy and paste that from the spreadsheet to Maple (which does a nice job of understanding it as a matrix). We get the value of the vector `bb{b}` from the values for `y`, which are the ticket sales. Pasteing these two into Maple and running the worksheet gives our coefficients:
`beta_0 = 21.46800262, beta_1 = 0.000282810868, beta_2 = -0.1019284315, beta_3 = 0.001671278620`

Therefore, model 1 is:
`y = 21.46800262 + ( 0.000282810868) x_2 - (0.1019284315) x_1 + (0.001671278620) x_1^2`

Evaluating the Model

In addition, Maple tells us that the normal distance between our model and the actual data is 71.46. If you live in 249 dimensional space it should be easy for you to visualize what this represents. For those of you still in 3 dimensional space you'll have to visualize an analogy. Imagine we have a 2 dimensional plane in which all our answers must lie. Now imagine we draw a line starting on the plane but pointing out into space in some random direction (a vector). We can then imagine projecting the line onto the plane in order to get the best possible estimate of that line in our allowed plane. If you are still visualizing this, there is now a right triangle drawn between the real line, and the projected line. The distance between the two endpoints (one out in space, and one on our plane) is the third side to the triangle, and it is also the normal distance between the two lines. This can be thought of as the error between the actual line and our projected 'line model'. It should be clear that of all the infinite lines we could draw in the plane, and the resulting distance given by them, the one with the smallest distance is the best.

This picture, which I stole from here, arguably does a better job of explaining this than my drunken ramblings. Here the real line is `bb{b}` (blue), the projection is `tt{proj}_w bb{b}` (yellow) and the normal distance is `bb{b}-tt{proj}_w bb{b}`. And hopefully to help tie this to the concrete real world example, note that `bb{b}` is our vector containing all the previous ticket sales. It is 249 dimensions; However, we need it to fit into only 4 dimensions as that is all the unknowns in our model.

Anyway, the point of all this is that our error is about 71.5. We can use this to quantitatively compare this model to others. We can also get a general feel for how this model holds up by calculating its predicted ticket sales in our spreadsheet and finding the relative errors from the actual sales. Here I would say we do pretty well. Most of the errors are less than 20%, with the largest being 40%. This error happened on a drawing with a jackpot of $266M and sales of 79M. This is something of an outlier and can be seen on the plot above. Another outlier happened a few months later when there was a jackpot of $242M and sales of 122M. Note this is less of a jackpot and more sales. It's not hard to see then how our model failed to predict these. It is worth looking at these outliers and attempting to come up with factors that may have caused them. There are several I can think of, but for now let's stick to what we have and try a different model.

Model 2
I'd like to try an exponential model, ie, one where the independent variable is in the exponent. Something like this:
`y = beta_0 * e^{beta_1 * x_1}`

There are some important notes about this one. First notice that simply plugging in values for `x_1` will not make this linear for the remaining variables. Since we must have a linear equation for least squares to work, we must find a way to fix this. This leads to the second point, that we only have a single independent variable here. Since we are about to use logs to make this a linear equation we are limited by how we can set this up. We cannot add several variables together, as we did before, because we will end up with a log of that addition, and there isn't any property of logs that will let us do anything with that.

We make this linear by taking the natural log of both sides: `ln(y) = ln(beta_0 * e^{beta_1 * x_1}) to ln(y) = ln(beta_0) + (beta_1 * x_1)`
Is this linear? Well `ln(y)` is ok, because once we plug in our values for y it will give a number. It should be clear `(beta_1 * x_1)` is fine, since it is similar to the other models we've used. It may seem like `ln(beta_0)` could give us problems, however since `beta_0` is a constant and there is not a variable in the argument of the log we can effectively substitute in a new variable such that `beta_2 = ln(beta_0)`. I'll leave this step in my head for simplicity's sake.

We are thus ready to extract the matrix and vector and use Maple to solve. Remember to use `ln(y)` and not just `y` values in the vector.

Maple gives us the vector [2.787582661, 0.006493826665]. We must be mindful of what this represents. `beta_1 = 0.006493826665`, however `beta_0 != 2.776342491`. Instead, that is the natural log of `beta_0`. We find that `beta_0 = 16.2417105887`. Our Model 2 is then:
`y = 16.2417105887 * e^{0.006493826665 * x_1}`

Maple also tells us the normal distance is only 1.6. Compare this to 71.5 for model 1, and it would seem this is a much better model. Actually looking at the errors gives some interesting insights. One thing I notice is that with this model single digit error percentages are more common. However, the largest error is 59%, quite a bit more than largest of 40% in model 1. This error is also in the recent $640 M jackpot. This is rather important as this, ie large jackpots, is exactly where the model is most needed. Model 1 gave a prediction of 641 M sales, vs the actual figure of 652 M, for an error of about 2%. Taking the average of all the individual errors with both models gives 9.3% for model 1, and 7.6% for model 2. The numbers seem to be in favor of model 2, but perhaps it would change if we only considered high jackpots, which is where the model is needed. Perhaps we should recalculate the models with only the higher jackpots.

For now though, we'll try adding some new variables.

Model 3

I'll revert back to the basic format of model 1, as it's easier to add variables to. So what are some other possible factors? Well to begin with, I figure time of year likely has some effect. The problem is how to represent time of year in a way that would provide meaningful data. At first just using the month number seems ok; however, this creates a huge difference between December and January. I doubt that the actual month makes much difference (although since birthdays aren't evenly distributed, maybe that would matter), rather than just the general time of year. I think the best way to do this is to just create a variable for each time of year. At first, this meant just having a binary 0 or 1 for 4 variables (seasons). Then I realized that I might as well vary the intensity of the variable depending on how strongly it was that season. Then I decided to just go with two seasons (summer/winter) centered around the hottest and coldest parts of the year (July 20, Jan 20).

I was then debating if this factor should be linear or not. I noted how the average temperatures change quickly in fall and spring, and then generally hover in summer and winter. Then I realized average temperature is a great factor for this. I know I always have trouble finding historical weather data on the internet, but luckily averages are a bit easier. I went with this source for averages in Iowa since the location really didn't matter, as long as it was somewhat representative of the US as a whole.

A second factor that I thought could prove useful was the previous drawing's sales. I figured that this would help capture factors I didn't think of. On my first try at this I missed something obvious. When someone wins the jackpot and it resets, the previous sales don't matter (or at least not in the same way). To fix this I combined the previous sales with a binary variable that was 1 normally and 0 when the jackpot had just reset.

So, I'm defining three new variables. `x_3` will be the previous ticket sales. `x_4` will be a binary switch 1 normally and 0 when the jackpot resets. `x_5` will be the average temperature. And our model is:
`y = beta_0 + beta_1 x_2 + beta_2 x_1 + beta_3 x_1^2 + beta_4(x_3*x_4) + beta_5 x_5`

Note this is the same as model 1, but with the two new terms added on the end.

Fast forward some spreadsheet and Maple work and we have our coefficients:
`beta_0 = 24.58, beta_1 = -0.000739, beta_2 = -0.10927, beta_3 = 0.001678, beta_4 = -0.0463, beta_5 = 0.0075598`

And our model:
`y = 24.58 - (0.000739) x_2 - (0.10927) x_1 + (0.001678) x_1^2 - (0.0463)(x_3*x_4) + (0.0075598) x_5`

How well does this model perform? Well the good news is that it's better than model 1; the bad news is that it's basically the same. Recall that model 1 had a distance of 71.5 and an average error of 9.3%. Model 3 has a distance of 70.2 and an average error of 9.3%. I really felt like the temperature should be a good indicator of general season, and that season should have some effect on sales. I also considered that perhaps forcing the previous sales figure to be 0 when the jackpot reset was doing more harm than good. So, I redid this model with that term removed and called it model 4.

I won't both posting the details of model 4 because it's much the same. Distance of 70.2 and average error of 9.3%. I should mention that if you go out to more decimal places each additional piece of information does decrease both average error, and distance. However, I'm not going to pretend like that is an improvement in the model. Given enough variables, no matter how arbitrary they are, the least squares method will approach a perfect representation of the given data. The problem is that this would only be perfect for the given past data. If the additional variables have no real effect on the sales there would be no corresponding increase in predictive accuracy. Many a person has lost money because they over tweaked a model with past data and were so satisfied with how well the model described past results that they were sure it would predict future results equally well.

Another thing worth noting is that our `beta_1` term, which was positive in model 1 is negative in model 3 (and 4). It is also very small in all models. This indicates that there isn't any strong correlation here. So as long as we are removing superfluous variables why not that one too. For those of you keeping score at home this means we are left with just one variable, the jackpot amount. This leads us to:

Model 5
We are left with a simple quadratic:
`y = beta_0 + beta_1 x_1 + beta_2 x_1^2`

Its coefficients are:
`y = 21.50354999 - (0.1019449781) x_1 + (0.001671373144) x_1^2`

It has a distance of 71.5 and average error of 9.3%, ie, the same as models 1, 3, and 4.

Model 2 vs Model 5

So it would seem that Model 2 is the winner. However, I still don't like the huge error on the largest jackpot. I've plotted these two models and compared them to the actual sales. I had plotted all the models at first, but since 1, 3, 4, and 5 are so similar they can be treated as the same on the plot. Anyway, the plot shows that model 5 is very close to actual sales at all points. Model 2 is too, except for the largest jackpot, where it is way off. As I said above, since the big jackpots are what matters most, I'm hesitant to recommend model 2 over 5, even though all the number back it up. Maybe only looking at drawings with a large jackpot will improve the models.

Large Jackpots
I decided to use $100 M as a cut off for the jackpot. This gives 48 drawings to use for our model. I then found the coefficients for models 2 and 5 with this smaller data set. I'm calling these models 2a and 5a.
Model 2a:
`y = 18.51083081 * e^{0.006057907333 * x_1}`

Model 5a:
`y = 43.12456321 - (0.2824251538) x_1 + (0.001932541134) x_1^2`

Of course, the real question is how accurate are they. The answer is that model 2a has a distance of 0.74 and average error of 7.0%. Model 5a has a distance of 53, and an average error of 5.8%. So, both are more accurate than the original models based on the larger data set. This isn't surprising, for the same reason it wasn't surprising when adding variables improved accuracy without actually improving the models' predictive power. Here, however, I think this may be an actual improvement. It's hard to say though really until we have future high value jackpots to test the models with.

Will it ever be profitable to play the Mega Millions?
Now that we have a way of estimating the ticket sales based on jackpot amount we have the tools needed to answer this question. In my Mega Millions post I provided the formula needed to determine the expected value of a Mega Millions ticket. It requires both jackpot and ticket sales, but we now have ticket sales as a function of jackpot. So, we can plot the expected value of tickets for various jackpots. The formula for expected jackpot is:
`V = \frac{J}{e^r}\sum_{n=1}^{} \frac{r^n}{n(n!)}`

Where r is the rate of winners. This is found by this formula:
`r = 5.699 \times 10^-9 * y`
Where y is the tickets sold, as given by our shiny new model. Remember that we used millions in the above model, whereas these formulas all want full figures. You'll have to convert this manually if you're going to do these calculations yourself (and I know you are). Also keep in mind that we determined that the non-jackpot prizes were worth about $0.15 per ticket. Thus, our breakeven point is $0.85.

In theory, we should be able to substitute these equations into each other and end up with one with only one independent variable, J. Then we could plot that equation, and even take the derivative to find where (and if) it peaks at some maximum expected value. Unfortunately, if you examine those equations you'll see that we end up with a quadratic for r. That quadratic then has to be substituted in as an exponent of e; messy but doable. However, the real problem comes from within the summation. We have to raise r to a incrementing power. There's no easy way to do this. Even letting Maple do the work, we end up with a giant expanded summation hundreds of terms long. I feel there may yet be a way to get a simplified equation here, but I am unable to figure it out.

It's just as well as it gave me an excuse to figure out some Maple programming. I simply iterated through all the integer jackpots, from $12 million to $3000 million, and found the expected values (with taxes accounted for, see previous post for details). The results were interesting.

Since the recent record jackpot had the best expected value in the other post I had expected that the expected value would increase with jackpot. I knew that the number of multiple winners would increase fast as jackpot increased, particularly as the model shows exponential growth in sales. This turned out to be the case, but to my surprise the peak was lower than the record jackpot.

The best expected value turns out to be at a jackpot of $530 million. The value is $0.564751, which falls $0.285249 short of the adjusted ticket price. I plotted the expected values at various jackpots and found it rather interesting. The plot is immediately recognizable as a blackbody radiation curve. I'm not sure what to read into this, other than the obvious fact that the sun controls the Mega Millions jackpot.

Final Thoughts
So, this has all been a good time, but I suppose I should be wrapping this post up at some point. I'll do that similarly to the first Mega Millions post, by negating everything I've done here with some causal observations. Despite the impression someone may get after these posts, I don't actually follow the Mega Millions at all, and really had very little idea about it prior to the first post. Something I'm still not certain of is how the jackpot is calculated and announced. It seems the jackpot increases continuously during the period between drawings. Worse, it seems this increase is based on sales. In other words, the independent variable (jackpot amount) we have been using actually seems to be dependent on the dependent variable (sales), which is itself still dependent on the other. So both are dependent on each other, and the system is a positive feedback loop. This means jackpot amount isn't known and can't be used to model ticket sales. This could be overcome if the jackpot advertised at the last moment to buy tickets is close enough to the final jackpot.

Either way, the Mega Millions certainly has a lot of mathematical life in it. I could probably keep writing posts about it forever, but won't.

I've updated the lottery spreadsheet with quite a bit of data:
http://daleswanson.org/blog/lottery.ods

I've also uploaded a zip file of all my Maple worksheets for those of you who wish to play along at home:
http://daleswanson.org/blog/megamillions.mapleworksheets.7z

Dale Swanson