Wednesday, December 26, 2007

Paradoxes

Well it's still only 9am, so I'm trying to think what else I can discuss. There was an interesting game that was brought up.  If you remember a while back when I introduced you to the prisoner's dilemma, and other game theory games, this one is similar, but not quite.  It's pretty simple, you start with a dollar in a basket, the host flips a coin, if it comes up heads the amount in the basket doubles, if it comes up tails you take whatever is in the basket and the game is over.  The question is how much would you be willing to pay to play this game?  50% of the time you'd win $1, 25% you'd win $2, and so on.  Each time the odds would half, but they'd never reach 0, thus there is no limit to what you could win.  Thus, unlike casino games where in the long run you'll always lose, for this game in the long run you'll always win.  No matter how much you pay to play each time eventually you'll win it all back.

I suppose a practice problem would arise similar to the old progressive betting system that you should be familiar with by now (double your bet each time you lose until you win), in that you couldn't afford to keep playing forever.  The question remains, how much would you pay to play?  Also looked at from the other side, how much would a casino have to charge so that it made a slight profit in the long run.  That's how you over come the problem of limited money supply.  If a casino offered this game, and it took $100 to play eventually someone would win back more money than anyone else had ever paid to play.  Still though according to the book most people would pay about $10-$20 to play it, and that seems about right.

So here is another more phlisopee question.  Now you are part of an experiment, it's pretty simple.  A doctor will give you a sleeping pill, and while you are asleep he will flip a coin, if the coin is heads he we wake you up and you leave.  But, if it's tails then he will wake you up and give you another sleeping pill.  The pill has a side effect though of making you forget what happened.  So every time you are woke up you can't remember any previous wake ups (if any).  Now, when the doctor wakes you he asks you what are the odds the coin landed tails up.  At first it seems obvious that it's still 50%, there no reason for the odds of coin toss to change.  But, think about it this way.  There are 3 possible times you would be woke up and asked that question, out of those 3 times 1 would happen when the coin was heads, and 2 would happen when it was tails.  Thus 66% the coin would have be tails up.

Here's a more extreme example.  Instead of a coin toss it's a wheel, with a 1 in 100 chance of coming up a particular way (A note, I'll be using the term 'the wheel comes up' to mean it comes up with a particular number, and it had a 1 out of a 100 chance of doing that).  If it does come up that way then you will be put back to sleep, over and over, 10,000 times (the wheel is only spun once), if it dosen't you will be woken up and walk away.  So there is a 99% chance you will go to sleep once, and then wake up and leave, but 1% of the time you will go to sleep and wake up 10,000 times a row and not know it.  Now think of it from the doctors point of view.  If he were to ask 'do you think the wheel came up or not?', it would seem that by answering no 99% of his patients would be right.  However, most of the times he asks that question the correct answer is yes (since he does it 10,000 times each time it happens).  Going farther still, say he only wakes you up at all when it comes up, if it dosen't he kills you, or puts you in a coma (or wakes you with out asking you the question).  Now it should be obvious that you must answer yes it did come up, since that is the only way you could have arrived in that situation.

I just thought of another way to look at it, with out all the crazy sleeping pill nonsense.  Say you take a poll, you flip a coin and then ask people if they thought it was heads or tails.  The odds would be 50% they'd be right. However, if you told them that whenever the coin came up tails you only asked one person, but when it came up heads you asked 100 people (for that single toss, and none of them know about the other people getting asked). Then that means for most of the people who you are asking, the answer is heads.  Going back to our wheel, you tell them that if it comes up (1 in 100 chance) then you ask 10,000 people, but when it doesn't (99%) you only ask 1.  After spinning the wheel 100 times, it should have come up once, and didn't 99 times.  Thus, you should have asked 99 people the question when the right answer was no, but 10,000 people when the right answer was yes. Again most people you ask would be right to guess that the wheel had come up.  Now put your self in the position of a person being asked, and also put some money on the line, say $10,000 if you correctly guess if the wheel had come up or not (if you guess right you win the money, even if you guess the wheel didn't come up [99 out of 100] and it didn't).  Would you guess it had come up or not?

I would guess it had, every time.  At first this is pretty crazy, but it seems fairly obvious with that last example with different people, I just hope I didn't miss some point of logic that fundamentally changes the game.

http://en.wikipedia.org/wiki/Sleeping_Beauty_problem

Next up on my tour of Probability theory paradoxes we will visit the common Monty Hall Problem.  This is pretty often told, so there is a high probability you've heard it, but either way here it is again (briefly).  You are on a game show, there are 3 doors, one of which has a prize behind it. You pick a door and if you pick the door with the prize you win.  Here's the twist, after you pick, the host will open one of the other prizeless doors, and then give you the option to change doors.  The question is is it better for you to switch, or does it make no difference?  At first it would seem to no matter, they are all 33%, and even after one is opened at most it's now 50/50, but both doors still have the same chances.  However, here is the way that I understood it.  When the host opens a door he must open one without a prize, and one that you didn't pick.  There are two scenarios, either you picked the right door at first (33%), and thus he has two doors without prizes that you didn't pick to open one of.  Thus, if you change your door you will change to one with out a prize, but if you keep with your orginal you will win.  Thus for the 33% of the time you picked the correct door at the start you will always win to stick with your pick, and always lose to switch.

Now what is the other possibility?  That you picked the wrong door at the start (66%), now the host dosen't have a choice, he can only open the other prizeless door.  He can't open the one you picked, and he can't open the one with the prize.  Now the only doors left are yours (prizeless) and the other (with prize).  Thus, 66% when you start with the wrong door you will always win by switching, and always lose by staying.  So it should be clear that 66% you will win if you switch, but you will only win 33% by always sticking.

http://en.wikipedia.org/wiki/Monty_Hall_problem

http://en.wikipedia.org/wiki/Category:Probability_theory_paradoxes

Now for my next act.  This one is actually pretty interesting.  It's less visual, but whatever.  When you have a body of statistics broken up in some way, by combining them you can change leading statistic.  It's much easier to give an example:
                    1995                    1996            Combined
Derek Jeter            12/48 .250                183/582 .314    195/630 .310
David Justice        104/411 .253            45/140 .321        149/551 .270
Now, in case the spacing on that terrible tabbed chart dosen't work out, the bottom guy has the better batting average in 95 and 96, but the combined total the top guy is better.  This seems like a paradox, how could he be better each year, but the average still be worst?  And in that case who is better?  If you look at the numbers you can figure it out.  In 95 they both didn't do well, but davey did slightly better, however derek barely played in that year.  I don't know what the second number is, but I assume it's attempts to hit the ball, and the first is sucesses.  So in 95 derek only tried 48 times, where davey tried 411 times.  As you should know the larger a sample you have the more accurate your statitics are, so the 95 statistic is much more accurate for davey than for derek.  Now we move to 96.  In this year the opposite is true, they both did good, but derek has a lot more samples, and thus his statistic of doing good is more meaningful than davey's.

To take this to the extream, say that in 95 derek only tried once, and failed, but davey tried 100 times, and only hit it 10 times.  For 95 derek's average would be 0.000, but davey's would be .100, at the same time davey's would be much more accurate picture of his actual skill, where as derek's would be a fluke.  Now in 96 derek tried 100 times and hit the ball 90 times, where as davey only tired once, and hit it.  Now for 96 derek's average would be .900, and davey's would be 1.000.  So again in both years davey would out perform derek, but taken as a whole derek would have 90/101 = .891, and davey would only have 11/101 = .109.  Thus with more data we see a much more accurate picture.

You can see how if I were trying to convice some one of davey's baseball skill I could use this to my advantage.  You can almost always make a statistic to say just about anything you want to convice people of.  This is where that famous Mark Twain quote comes from (paraphrased), there are 3 types of lies: lies, damned lies, and statistics.

http://en.wikipedia.org/wiki/Simpson%27s_paradox

Rock Paper Scissor

I guess if you insist I could go into a method to solve one of the greatest problems facing mankind today, playing Rock Paper Scissors by email.  This is something I thought about a while ago, when I was reading lots of crypto books.  It's deff possible to send your choice in a way that locks you into what it is, but at the same time doesn't reveal what it is until you give additional information.  The challenge was for me to come up with an actual method of doing it.

I'm not sure how much you know about public key encryption, or NP complete problems, but modern public key encryption works by doing something that is very easy to do in one direction, but very hard in the other.  The most common method used today is to generate a pair of large prime numbers, then multiply them together.  The resulting number is huge almost prime number (almost prime because it is made up of two primes).  Figuring out what two numbers make it up (factoring) though is very hard.  However, if you reveal one of the numbers then you can easily figure out what the other is.

So back to RPS.  I know I could encrypt my choice using public key encryption, and send it to you, but like I said, I wanted a method that I could do myself.  So my method would work as follows.  I generate a list of large primes, each of which takes about a minute or so to come up with.  I then pick two and multiply them together, then send the result number to you.  My choice is stored in one of the numbers.  The format for storing it could be this.  I'm pretty sure you are familiar with the concept of adding a number's digits, then adding that number's digits again and again until you have a single digit number (ex 94184, 9 + 4 + 1 + 8 + 4 = 26, 2 + 6 = 8).  You do that with the larger number, and then you take the result and divide it mod 3, as in you divide by 3 and then take the remainder, if it's 0 though it's invalid (0 mod 3 is 0, but for our game it's an invalid choice).  So if the digits add to 7, then that mod 3 would be 1.  They would all be either 0, 1, or 2, 3 each for each choice, makes 9 digits (and you ignore 0).  0=R, 1=P, 2=S.

Here's an example using not so large primes.  I send you my choice as 9030233.  You then send your choice back as plain text, say 'rock'.  Now I reveal that my primes that make up that choice are: 3391 and 2663.  You can verify that they are in fact prime, and that they do in fact multiply to give the larger number.  Both of those will be rather quick to check.  Using the above format, 3391 is the larger, and thus you add it's digits 3 + 3 + 9 + 1 = 16, 1 + 6 = 7.  7 mod 3 is 1, 1 = Paper, so I win.  Actually upon doing this I realize it would probably be better to add all the digits of both primes together.  I would need someone that was better at math to verify that giving you the results of the sum of the digits of a number (or both numbers) doesn't give you any information about the product of those numbers.

Now let's analyze how this worked.  The key is that I have to send you some sort of info that locks me into a choice, but doesn't reveal that choice to you.  It should be clear that sending you the product of the two primes does in fact tell you what numbers I used.  The question is whether you could have figured out my choice before I revealed the key (the composite primes). To do that you'd need to factor a large almost prime number.  Factoring the above number wouldn't be that hard, but it would be much harder than coming up with the 4 digit primes that make it up.  For that scale imagine we had to do this all by hand, hundreds of years ago, (playing by paper mail). Coming up with a 4 digit prime would be hard, but not that hard.  However, factoring a 7 digit number would take much longer.  With computers we can generate much larger primes, and as the size of the primes increase the length of time to factor a number that is the product of two of them grows even faster.  Thus in a real life example I'd send a number hundreds of digits long, which would take years to factor.  It would only take a few seconds or minutes though, to confirm that the composite primes are in fact prime, or to generate them in the first place.

I was going to end with an actual number, but upon trying this I realized that any prime big enough to make factoring difficult would be absurdly large (think of that binder), and the product of two of them would be even larger.  Thus I'd have to send several mbytes of text to you.  I suppose I could get by with using a prime that was only a few hundred digits long, but that would probably be factorable by some specialty program.  Plus it'd still be very difficult to add all the digits.  I could write a program to do all this stuff, but it's probably not worth it just for a closing line in this email.

Comparing Text

One of the books I read mentioned an interesting way of comparing the author of texts.  As you probably know, common compression programs make a dictionary of the most common strings of text, and then replaces them.  So the way this method works is this, you take two fairly large (10kbytes) samples of text from known authors.  You zip (or rar or whatever) them and note the file size.  Then you take a smaller sample of text by an unknown author (although it should be one of the two from above), and add it to the two text bodies, and rezip them.  Now you note the size increase.  Whichever file combined adds less size is likely the author.

If you think about it, how it works is that the more similar a file is the easier it is to compress, thus when both pieces of text are by the same author (with the same writing style), it compresses smaller.  So, I tried this out, I got my article on accelerating returns, and a few paragraphs from elsewhere on my site.  Then I found some other site and grabbed another chunk of text.  I rar'd it all and my file added less size than the other file.  This was all well and good, but I knew the style of writing was pretty different, I couldn't find anything that would be written in the same semi casual style as mine (mind you I didn't have the internet, so I was searching through the limited selection of site I had downloaded).  While I was mentally explaining all this to you, I was using an example of two bodies of text, one written by me, one you, and then a shorter one that was either me or you, but unknown.  Then I realized that I could use the trip reports as the two larger bodies, and then one of your emails I had saved as the test text.  So I did all this, and rar'd it up, and it turns out my text added less, than your (thus indicating that the test text was more similar to mine, and thus was written by me).

So I adjusted some things like the file size of the two unzipped sample texts, to make the match.  When I redid it, the results were less, but still said I was the author.  I was going to do a joke that claimed this meant you had plagiarized it from me.  I didn't bring the files in though, and I too tired to be funny, so you are going to have to just imagine how funny it would have been.