Friday, August 30, 2013

Anatomy of a hack: How crackers ransack passwords like “qeadzcwrsfxv1331”
One of the things Gosney and other crackers have found is that passwords for a particular site are remarkably similar, despite being generated by users who have never met each other. After cracking such a large percentage of hashes from this unknown site, the next step was to analyze the plains and mimic the patterns when attempting to guess the remaining passwords. The result is a series of statistically generated brute-force attacks based on a mathematical system known as Markov chains. Hashcat makes it simple to implement this method. By looking at the list of passwords that already have been cracked, it performs probabilistically ordered, per-position brute-force attacks. Gosney thinks of it as an "intelligent brute-force" that uses statistics to drastically limit the keyspace.

Where a classic brute-force tries "aaa," "aab," "aac," and so on, a Markov attack makes highly educated guesses. It analyzes plains to determine where certain types of characters are likely to appear in a password. A Markov attack with a length of seven and a threshold of 65 tries all possible seven-character passwords with the 65 most likely characters for each position. It drops the keyspace of a classic brute-force from 957 to 657, a benefit that saves an attacker about four hours. And since passwords show surprising uniformity when it comes to the types of characters used in each position—in general, capital letters come at the beginning, lower-case letters come in the middle, and symbols and numbers come at the end—Markov attacks are able crack almost as many passwords as a straight brute-force.

Thursday, August 29, 2013

An overview of open source licenses

I know what you are saying, 'wait a minute, you already did a post on open source licenses in May of 2011'.  Well yes, but that was more of an intro to the concepts of open source licenses, and the virus like nature of the GPL.  Here I will do more of a review of the major licenses, as well as some good sites I've found to explain them.

During my recent WMGK parser post, I published the code to github, and that required picking a license.  This lead to much more research than would strictly be necessary for something no one will ever see or use ever.  I want to now spread that knowledge to all of you.

In uncharacteristic fashion, I will give a quick summary of the licenses here, as opposed to forcing you to read through all my drunken ramblings (although we are already three paragraphs in):

  • If you provide no explict license on your code it will default to the most restrictive one possible.  That is, no one will be able to redistribute your code in any way, unless they get your expressed written consent first.

  • If you want people to be able to do anything they want with your code, and not even have to attribute you, then use Unlicense, which is effectively the public domain, with a disclaimer that you can't be sued if the code burns down their house.  This is good for short code snippets that aren't worth a complex license.

  • If you want people to be able to do anything they want with your code, as long as they attribute you, then you can use the MIT or BSD licenses.  The MIT license seems to be somewhat more popular, and is probably simpler to read.  The BSD license has a 2 clause version that is basically the same as MIT, and a 3 clause version that also prohibits the use of your organizations names to endorse the derivative works.  There is also the Apache license which prevents people from using software patents to sue, it is probably better than pure MIT for longer programs.

  • If you want people to be able to release programs based on your program, as long as they also release the source code to those derivative works, then use the GPL.  There are two versions in common use, v2 and v3.  The main update was an attempt to deal with software being released on physical devices, as well as some software patent stuff.

If you want to review software licenses in greater depth, and I know you do, here are two good sites with simple color coded overviews.

Monday, August 26, 2013

An analysis of radio song play frequency for 102.9 WMGK

Recently I had to drive a car with only a radio for music.  After just a few days I was annoyed at the variety, or lack thereof.  I decided to scrape the station's web site for recently played data and see what the long term trends were.

The local classic rock station is 102.9 WMGK.  They have a recently played page which is easily scrapable.  This is contrasted to other classic rock stations that I wanted to scrape for a comparison that only listed the last few songs, or embedded it in flash.

I began scraping on June 26th and August 24th makes 60 days worth of data.

Thanks to my recent post, we know classic rock songs average 4:48 in length.  There are 86,400 minutes in 60 days.  That would be enough time for 18,000 songs.  If we assume only 75% of the time is actually music that's 13,500 songs.  During these 60 days WMGK played 14,286 songs.

I won't speculate about what this means about the actual percentage of music on the air.  Slight changes in average song length have a big effect on numbers.

So, now it is time for the interesting questions.  How many of those were unique songs?  How many songs would be needed to represent 25% or 75% coverage?

In case it isn't clear what I mean by coverage, I mean how many songs represent 10% (or whatever) of the total songs played.  For example, if a station played 1 song 10 times, and then 10 other songs 1 time each, for a total of 20 plays, that 1 top song would give 50% coverage.

So, without further ado here are the stats:


924 unique plays out of 14,286 means about 6.5% of songs were being played for the first time in 60 days.  Honestly, that's not bad.  However, the 50% and 75% coverages are awful.  I have 30k unique songs in my playlist, and that's not even particularly impressive.  Admittedly, they aren't all one genre, but Led Zeppelin has 86 studio songs, all of which would be suitable for a classic rock station.

The key take away is that there are about 300 to 350 songs that get played at least every other day.  Then, they occasionally toss in a deep cut.  There is no middle ground; either a song is played every other day, or once every few months.  I made some (read: over 100) graphs that illustrate this, but for now let's stick to tables.

Want to guess what the most popular bands or songs are?

Top songs:
Plays per 30 daysBandSong
27.5Warren ZevonWerewolves Of London
27CarsJust What I Needed
27Blue Oyster CultBurnin' For You
27Steve Miller BandRock 'n Me
26.5SupertrampThe Logical Song
26.5David BowieChanges
26.5Pink FloydAnother Brick In The Wall
26.5Electric Light OrchestraDo Ya
26J. Geils BandCenterfold
26WarLow Rider

Top Bands:
Plays per 30 daysBand
356.5Rolling Stones
334.5Led Zeppelin
183Pink Floyd
169.5Van Halen
138.5Billy Joel
135.5Tom Petty And The Heartbreaker
135.5Steve Miller Band
135Electric Light Orchestra
107.5Bad Company
105.5Creedence Clearwater Revival
105Elton John

I would not have guessed most of those top songs.  Note how much higher the top three bands are than the rest; there is a second smaller drop off after Foreigner.  Also interesting is that none of the top three bands have a top 10 song.  I would have also guessed Beatles to be the top band.

Let's start looking at the graphs.

Here we have two graphs of bands.  The first is limited to the top 50 and has labels.  The second is all bands, and without labels, just showing the general trend.

These next two graphs really show the tendency to play the same songs.  The first shows how many songs were played various number of times in 60 days.  There are three clusters.  First, songs that were played once or twice.  Then there is a wasteland from 11 to 26 plays in 60 days.  After that, there is the main group of songs that are played every other day.  That tapers off, and leads to the last group of songs which are played almost every day.  Keep in mind that the number of plays compounds the number of songs in that group.  20 songs each played 35 times is a lot more plays than 200 songs played once.
The second graph that illustrates this point is this one of every single song.  It is a bit confusing as there are way too many songs to see individual bars, but you can look at the slopes to see the same three groups as before.  The top songs as the peak on the left.  Then the plateau of the normal roster, followed by a steep drop off to the songs played a few times.  The steep drop off illustrates the lack of a middle ground.

The last general WMGK graph is this one that shows the average daily plays in a given hour of the day.  It shows there is a drop off in the morning hours from 5am - 8am.  The night hours of 12am - 4am are the highest.  It's interesting that there is a clear rise at midnight.  I don't think WMGK has a morning show, so I'm not sure why there is the drop off.  At first I thought they increased ads during peak commuting hours, but there is no drop off in the evening.  My guess is they must provide short news and traffic conditions during those hours.

I made that graph so that I could compare individual bands to see if different bands were being played more at certain times (eg longer songs being relegated to the middle of the night).  Unfortunately, I don't have enough data to really answer this.  A typical band might have a few plays in a given hour for the entire 60 day period.  I suppose I could have increased the buckets to 6 hour windows, but was too lazy to code this.

The rest of the graphs compare WMGK plays to plays for a given band.  I'll post a few here.  All of the graphs are in this gallery on my site.

I had a hard time with the comparisons.  First, there was the fact that some of the titles had subtle differences.  This meant I had to hard code the differences in a hash.  There is also the problem of songs with multiple versions on, this will tend to change the order slightly.  Also, the api only gives play data from the last 6 months.  For most classic rock this doesn't matter, but, eg, David Bowie had a recent album, and thus his graphs are hugely skewed towards it.

Then I couldn't decide which songs to show.  I ended up making two graphs for each band.  One shows all the WMGK plays and the other shows the top 20 songs.  Each has the other's plays on them, but are sorted differently.  I think the graph is more interesting, as it shows which popular songs WMGK doesn't play.  In their defense, some songs simply aren't classic rock.  On the other hand, the WMGK sorted graph shows the sharp drop off when you go from the frequently played songs down to the deep cuts (that aren't that deep).

For example, here are the two Billy Joel graphs:

I think they both show the vast difference of opinion of what good Billy Joel songs are.


A very different Genesis:

Led Zeppelin, showing the drop off:

A not-that-surprising Pink Floyd.  Except, why does WMGK love Young Lust?

Not big Rush fans:

WMGK doesn't seem to know where the Styx CD gets good:

A crime against humanity:

Finally, I used this as an excuse to learn some git and github use.  All the code, text files, and graphs are available there. Here are some files of interest:

Text file of all plays

Text file of all songs, with play counts

Text file of all bands, with play counts

Perl script to scrape WMGK site

Perl script which provides general stats and graphs for all WMGK

Perl script which provides specific stats from one band passed to it on command line

Wednesday, August 21, 2013

Genre Average Song Lengths

I was working on a different post and had to write a script to get the average song length of a genre from I wrote a quick perl script and figured I'd get some other genres to compare. I grabbed the 250 top tags, and then grabbed the 50 top tracks for each tag.

I didn't really expect anything revolutionary.  Prog has longer songs than punk.  Some of the tags don't represent genres (eg Live), but I didn't remove them, both because they are interesting too, and I'm lazy.  Here are the results.

There's a slight positive correlation (0.23) between song length and ranking.  That is, longer genres are generally less popular.  Mean and median length across all genres are quite close at 4:20, and 4:14 respectively.

A graph and table of the top 50 most popular:

RankGenreAverage Length
48progressive metal6:40
26progressive rock6:02
24black metal5:38
29heavy metal5:14
40thrash metal5:14
19hard rock4:51
9classic rock4:48
27death metal4:34
50hip hop4:31
10alternative rock4:25
8female vocalists3:58
2seen live3:41
13indie rock3:39
31punk rock3:26

Sunday, August 18, 2013

A Comparison of PNG Compression Methods

You may remember my past post about which image format you should be using.  The summary was use PNG for computer generated graphics.  I briefly touched on the fact that PNG allows you to use 24 bit or 8 bit colors.  It is, however, much more complicated than that.


There are three color modes available in PNG.  The mode with the fullest color space is RGB 24 bit.  The second mode is indexed.  It allows you to use 8 bits or less for up to 256 colors.  Additionally, it allows you to choose the palette.  The last mode is grayscale.  As the name implies it's black and white, but allows a gradient of grays.  By comparison, a 1 bit indexed mode would only allow full black or full white, no grays.

Generally, switching to indexed offers a pretty big savings in file size.  However, you can often get even greater savings by going to less than 8 bit color.  Doing so will generally require generating a custom palette to take full advantage of the reduced color space.  If the file is black and white, it can sometimes be worth it to switch to grayscale or to 1 bit indexed.

There really is no simple way to know which will be best, besides trial and error.  Doing this for one file is a bit annoying, but doing it for many files is pretty unreasonable.  As such, many PNG compressors have popped up to help automate this process.

It turns out that the compression process is not so clean cut, and thus many of the various programs perform differently on various files.  I decided to look into if my current program was a good choice.  I found a few old comparisons, but not any from the last few years.  I decided to just download a bunch of programs and try them out.

I didn't download too many, preferring those that either had recent development, or were spoken highly of somewhere.

Programs Tested

So without further ado, here are the candidates:

TinyPNG - Has been my go to.  It's the only web based option here.  It's quite fast and has a great interface.

pngquant v1.8.3 - Some research showed that pngquant was the library being used at TinyPNG.  I figured if I could have the same compression as TinyPNG as a command line tool, that would be ideal.

pngcrush v1.7.9 - This is the standard PNG compression utility.

optipng v0.7.4 - Another standard utility, based on pngcrush.

pngout 20130221 - Based largely on the recommendation of this 6 year old Jeff Atwood post.  It is still being developed though.

Before I get into the results I should note the command line switches used.  In all cases its none.  Some of the utilities are geared more towards automatic use than others.  There is no doubt you could achieve much better results if you were willing to experiment with the options.  However, that wasn't the use case I was testing here.  I was concerned with processing dozens or more files quickly and easily.

If you want to see the images used I uploaded them to imgur here.  Unfortunately, imgur is no dummy, they compressed the files themselves.  They also resized the largest ones and converted some to jpg when that made sense.  If you want all the files, both the original and the results, I uploaded them as a zip to mediafire.


Filesizes in KB:

Ratio to the best compression method:

In the first table sum for all columns is just the sum of the column.  In the second table sum is the ratio of sum to the smallest sum.  In other words, a sum ratio of 100% should indicate a perfect score, but it's skewed by the large file sizes.  The geometric mean solves this for the normalized ratios.  Or does itYeah, probably.


I think the take away here is that TinyPNG is the winner.  That being said, you'll note there were three cases where it did significantly worse than pngout.  Images b and o were both black and white, and pngout turned b into grayscale, but both left o as indexed.  What's more, e was grayscaled by pngout for worse size than the indexed tinyPNG version.  Pretty clear evidence that grayscale vs indexed  isn't clear cut.

quantpng is consistently a few percent higher than TinyPNG, lending credence to the idea that TinyPNG is using quantpng, but then also doing something else to squeeze a bit of filesize out.  On the other hand, quantpng actually made a few files worse, so maybe there is some overhead on its end.

The alert reader will have noticed there is a seventh algorithm listed which wasn't defined above.  t2o is a response to the fact that either TinyPNG or pngout was the optimal algorithm.  It is the TinyPNG files ran through pngout.

While it almost always does better than TinyPNG or pngout alone, you'll note the one case where it failed to return to the level pngout had achieved on the raw file.

I suppose my strategy will be to continue to use TinyPNG, and if I really need to minimize a file, run it through pngout independently, and compare the results.