So I have been doing a lot of research about the data I’m going to need for thesis, and where I should get this. The two main sources of data I am looking at are Retrosheet, which I’ve previously mentioned extensively, and the unofficial MLB xml sources which are used by the official MLB gameday app.
Both of these resources have their advantages/disadvantages, and neither completely suits my needs. I am starting to think that in order to get what I want, I will have to create a hybrid of the two, but the big question is how exactly will I do this? Here are my thoughts on the good and bad about each source:
Retrosheet:
Great resource, a pain to get into a database but once its there its very compete
Data going back to 1951
Self contained, doesn’t rely on resources that may be unavailable in the future
Doesn’t contain any current season data
No great way to get data, have to write custom XML to get data in a usable form
MLB Data:
Completely thorough
Provided in XML format that is really well executed
Updated daily for current games
Not official, no promises that it will be there in the future
Questionable as far as a data source/retrieving data repeatedly.
I have been working on creating an API for Retrosheet games that returns complete data, and I’ve made some headway but its largely a work in progress. After finding the XML based MLB data I can see that they have already tackled this and have the data in a beautiful format that is pretty much exactly what I wanted to make. They do not however have this data stored in a database or any easy way to request a game in a traditional way.
I think that my strategy for this problem will be to combine both resources into a best of both world’s situation that I can make freely available to anyone. I will adapt the MLB XML format and parse all Retrosheet games to this, taking advantage of the format already established by MLB.com as a standard and saving myself some work. I will then write code that does the opposite with the MLB.com data, checking every day of a season and breaking down their XML into data that can be put into the retrosheet database. This will likely be the hardest part of the process but I think I can figure it out. This will provide me with a complete retrosheet database with current game data and an XML format that will provide a full summary of a game in an easy to read way.
Hopefully this works out for me, if I can do this and provide access to it through my web server I think it will be exciting and a great contribution to the baseball data community.

