Getting the baseball data.

The first major challenge of my thesis project has been acquiring historical data, parsing it into a usable format (database), and then going through it and trying to make some sense of it. I’ll try to go over some of the steps I took to get where I’m at (as usual both for anyone trying to do the same thing and for my own sanity the next time I need to do this).

First of all, thanks to the extremely hard work of a few super dedicated individuals, historical data for every baseball game  is available online at the site http://www.retrosheet.com

These files have an insane amount of data for almost every game played for at least the last 50 years (there is more but before then the complete info is a bit hit or miss). Unfortunately, due to the way that they are recording the data (I’m not really sure how, but I assume it involves some sort of spreadsheet), the files are not really in an easily interpreted format. Luckily, again due to the work of a few dedicated individuals, the open source community has a solution for this as well. The tool “Chadwick” (named after the man who is historically credited with inventing the first widely accepted form of baseball scorekeeping) is a command-line unix tool which can take these files and parse them to Comma Separated Value (CSV) files. This does entail learning a proprietary command language for the compiling (it looks sorta like regular expressions but I don’t think its related), but the end result is worth it.

From this point (once you have your CSV files) you can create a database in the SQL of your choice (MySQL, sqLite, etc) and import the database.

Fortunately there are a few people out there who have already done a great bulk of this work, and have written some scripts to take care of a lot of this process for you. Unfortunately, they are using Python with the MySQLDB extension, which under snow leopard is near impossible to compile at this time.

The main site I have been working from to get this data to work can be found here:

http://blog.wellsoliver.com/2009/06/retrosheet/

This guy is the perfect combination for my thesis: a data geek and a huge baseball fan. He has made several amazing tools (a few of which I look forward to getting into in the future) that allow people to get baseball data in a bit more friendly manner, and shares his experiences obtaining the data.

I spent the better half of a weekend attempting to get his code to work on my computer, due in no way to any error on his fault but solely based on the fact that snow leopard refuses to compile and install a workable version of MySQLDB to python.

I finally gave up, and decided that I would go to plan B and run the script on my server (Dreamhost). This too proved to hit a few roadblocks (the version of Python installed on my server wasn’t new enough, had to install new Python then compile and install new MySQLDB a couple times :( )

Then I had to figure out how to compile the Chadwick source code under Unix, which seemed very unclear until I actually opened the readme file and realized all I had to do was type ./configure in the Chadwick directory and it would compile no sweat.

I finally got the code to run and (seemingly) work! Then I realized I had another error…my server was killing my requests and I was only getting part of the data. After emailing back and forth with Dreamhost for an hour or two, I finally realized what the issue was. The python code that I was working with ran in 20 seperate threads (which would have been fine had I been able to run it on my own computer) however my Dreamhost account had a limit on concurrent threads that were allowed to run and i was greatly exceeding this. After changing the amount to 5, I finally was able to execute the script and build my database. It took about 15 minutes (!) to run the script through, and I was left with a complete, 4+GB database of baseball information.

This has been a huge headache (to say the damn least!) but now that I have gone through this process once it should be a lot easier from her on out if I need to grab new data (hopefully…).

I’m now working on understanding and parsing this enormous amount of data (the one table of events alone has over 8 million rows…) and make some sense of it. I’m close to having a workable XML format and I’ll post more about this in the future.

One more thing….

I found this link: http://www.wantlinux.net/2009/04/retrosheet-baseball-mysql-database-download/

which is basically the entire code already compiled into a database and downloadable :( On one hand this totally bums me out, but on the other hand it was a great experience having terminal/unix try to beat me down and coming out victorious! I am considering doing what this person has done and hosting the 2009 database on my site once it has become available, I’ll make a post if I decide to do so.



This entry was posted on Saturday, November 21st, 2009 at 1:47 pm and is filed under Code, MySQL, Prototypes, Python, Thesis, php. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Leave a Reply

Your comment