Wednesday, December 26, 2007

Comparing Text

One of the books I read mentioned an interesting way of identifying the author of a text.  As you probably know, common compression programs build a dictionary of the most common strings of text and then replace them with shorter codes.  The method works like this: you take two fairly large (about 10 kbytes) samples of text from known authors.  You zip (or rar or whatever) each one and note the file size.  Then you take a smaller sample of text by an unknown author (although it should be one of the two from above), add it to each of the two text bodies, and rezip them.  Now you note the size increase.  Whichever known sample grows less is likely by the same author as the unknown text.
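If you want to try this without zipping files by hand, here is a rough sketch of the procedure in Python.  It is not what I actually ran (I used rar); it swaps in the standard library's zlib module instead, and the function names (compressed_size, added_size, guess_author) and the toy sample texts are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


def added_size(known: bytes, unknown: bytes) -> int:
    """Compressed bytes added when `unknown` is appended to `known`."""
    return compressed_size(known + unknown) - compressed_size(known)


def guess_author(known_samples: dict[str, bytes], unknown: bytes) -> str:
    """Label of the known sample that grows least when `unknown` is appended."""
    return min(
        known_samples,
        key=lambda name: added_size(known_samples[name], unknown),
    )


if __name__ == "__main__":
    # Toy stand-ins for the real samples (assumption: real use would read
    # roughly 10 kbyte text files written by each known author).
    samples = {
        "me": b"so I tried this out and it was all well and good. " * 40,
        "you": b"The committee hereby resolves to adjourn until noon. " * 40,
    }
    unknown = b"so I tried this out and it was all well and good. " * 2
    print(guess_author(samples, unknown))
```

One caveat with this stand-in: zlib only looks back about 32 kbytes when searching for repeated strings, so known samples much larger than that stop helping.  The 10 kbyte samples described above fit comfortably.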

If you think about it, it works because the more similar the new text is to what is already there, the easier it is to compress; so when both pieces of text are by the same author (with the same writing style), the combined file compresses smaller.  So I tried this out.  I took my article on accelerating returns and a few paragraphs from elsewhere on my site.  Then I found some other site and grabbed another chunk of text.  I rar'd it all, and my file added less size than the other file.  This was all well and good, but I knew the styles of writing were pretty different; I couldn't find anything written in the same semi-casual style as mine (mind you, I didn't have the internet, so I was searching through the limited selection of sites I had downloaded).  While I was mentally explaining all this to you, I was using an example of two bodies of text, one written by me and one by you, and then a shorter one that was by either me or you, but unknown.  Then I realized that I could use the trip reports as the two larger bodies, and one of your emails I had saved as the test text.  So I did all this and rar'd it up, and it turns out my text added less than yours (thus indicating that the test text was more similar to mine, and thus was written by me).

So I adjusted some things, like the file size of the two unzipped sample texts, to make them match.  When I redid it, the difference was smaller, but the results still said I was the author.  I was going to do a joke that claimed this meant you had plagiarized it from me.  I didn't bring the files in, though, and I'm too tired to be funny, so you are going to have to just imagine how funny it would have been.
