## Monday, May 20, 2013

### IMDB Ratings

They are remastering Star Trek TNG and releasing it on Blu-ray.  They just released season 3 and I was looking through the episodes.  One of the first things that I noticed was how many great episodes were in that season.  Seasons 1 and 2 had some good episodes, but it felt like most of the season 3 ones were great.

This wasn't that surprising; I've often stated that in each Star Trek series the first two seasons tend to be the worst.  As I constantly feel the need to evangelize Star Trek I wanted to see what the top episodes were and list them here.  I found that the IMDB had individual user ratings for each episode.  I almost didn't notice they had a simple page with all the ratings on there and was about to write a script to scrape them from the season pages.

I copied that data and began to work on ways to present it.  I wrote some gnuplot scripts for a few graphs.  Ultimately I wanted to make graphs for each of the five series.  The problem was that preparing the data for each plot was rather time-consuming, so I decided to write a Perl script to do that.

The script went well, and I decided I was on a roll so I might as well make it download the data itself.  I ended up with what is likely my most robust script ever.  Every show has an IMDB id like: tt0092455.  You can either put that in the file, or pass it to the script as an argument.  It generates a directory for each show, and puts all the raw data files in there, along with 3 graphs.  I suppose I could have made it take multiple ids and run each, but it's so easy to just paste them into a file, type 'perl imdb.pl ' down the column, and save that as a shell script.

I'm pretty happy with the script.  It handled the real-world data of a variety of shows quite well, which is frankly amazing considering this regex is in it:
`$fileline =~ m/\s+(\d+)\.(\d+)\s+(.+?)\s+(\d+\.\d+)\s+(\d+,?\d*)\n/`
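
For anyone who doesn't read regex for fun: that pattern pulls five fields out of each line, namely season, episode number, title, rating, and vote count.  Here is a minimal Python sketch of the same idea.  It is not the actual script (that one is Perl), and the sample line is made up, since the exact layout of the ratings page is an assumption here.

```python
import re

# Same pattern as the Perl regex, compiled for Python.
LINE_RE = re.compile(r"\s+(\d+)\.(\d+)\s+(.+?)\s+(\d+\.\d+)\s+(\d+,?\d*)\n")


def parse_line(line):
    """Return (season, episode, title, rating, votes), or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    season, episode, title, rating, votes = m.groups()
    # The vote count may contain a thousands comma, e.g. "4,512".
    return int(season), int(episode), title, float(rating), int(votes.replace(",", ""))


# Hypothetical sample line, not real IMDB data.
sample = "  3.15  Example Episode Title  9.3  4,512\n"
print(parse_line(sample))  # (3, 15, 'Example Episode Title', 9.3, 4512)
```

The lazy `(.+?)` is what lets titles contain spaces: it keeps growing until the rest of the line looks like a rating followed by a vote count.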

Here's the script, and the source for the 0 of you that are interested.

I ended up compiling a list of 32 shows, both my own favorites and popular ones from the internet.  Here is a gallery of all the graphs.

http://imgur.com/a/gO68p

This one is a straightforward scatter plot of every episode.  There is a linear regression line plotted showing the general trend.  Note that in all the graphs the seasons fall entirely to the right of the grid line they are labeled at.  In other words, the first episode of season 5 is directly on that dotted line, and the rest of season 5 is to the right.
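
A trend line like that is just an ordinary least-squares fit of rating against episode position.  Here is a minimal Python sketch of the math, assuming episodes are numbered 0, 1, 2, ... in air order; the function name `fit_line` is made up for illustration, and the real graphs were produced with gnuplot rather than this code.

```python
def fit_line(ratings):
    """Least-squares fit of rating against episode index (0, 1, 2, ...).

    Returns (slope, intercept) of the line rating = slope * index + intercept.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two episodes to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ratings) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ratings))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A positive slope means the show's ratings trend upward over its run, and a negative slope means they trend downward.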

This one shows the average rating of each season, not weighted by number of reviews.  Also note that none of these graphs start the y axis at 0, which exaggerates the differences between points.
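
"Not weighted" means every episode counts the same no matter how many people voted on it.  Here is a rough Python sketch of that calculation, assuming the data has already been parsed into (season, rating) pairs; `season_averages` is a made-up name and this is not the script's own code.

```python
from collections import defaultdict


def season_averages(episodes):
    """episodes: iterable of (season, rating) pairs.

    Returns {season: mean rating}, each episode counted once (unweighted).
    """
    totals = defaultdict(lambda: [0.0, 0])  # season -> [sum of ratings, count]
    for season, rating in episodes:
        totals[season][0] += rating
        totals[season][1] += 1
    return {s: total / count for s, (total, count) in sorted(totals.items())}
```

A vote-weighted version would multiply each rating by its vote count and divide by the season's total votes instead.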

Here I took the top quarter of episodes by rating (the best) and the bottom quarter (the worst) and counted how many of each were in each season.
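
The counting can be sketched in a few lines of Python.  This is only an illustration under some assumptions: the quarter size rounds down, ties are broken by original order, and `quartile_counts` is a hypothetical name, so the real script may handle edge cases differently.

```python
from collections import Counter


def quartile_counts(episodes):
    """episodes: list of (season, rating) pairs.

    Returns (best, worst): Counters of how many top-quarter and
    bottom-quarter episodes fall in each season.
    """
    ranked = sorted(episodes, key=lambda e: e[1], reverse=True)
    k = len(ranked) // 4  # assumption: quarter size rounds down
    best = Counter(season for season, _ in ranked[:k])
    # Guard k == 0, because ranked[-0:] would be the whole list.
    worst = Counter(season for season, _ in ranked[-k:]) if k else Counter()
    return best, worst
```

A season with many entries in `best` and few in `worst` is one of the strong ones.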

I still will have to do some comparison between the Star Trek series and post that.