Digging into the Data
I’ve begun to dig into the Retrosheet database that I recently acquired (see previous post) and have started to plan out a basic format for parsing the immense amount of data into a readable, usable XML format that will let me retrieve game data in a simple, direct way.
In order to do this, I had to first think about how a baseball game is structured. What are the constants and what are the variables? How can I structure my XML around this?
Speaking in terms of a baseball game and not the way that Retrosheet is formatted, the constants in a game are as follows:
- 9 Innings
- 2 halves of an inning
- 1 set of at bats and 1 set of fielding (defense) for each team
- Away team always bats first in an inning
- Each half of an inning consists of 3 outs
- The third out always ends a half inning
After writing this down, I was starting to see a structure.
I then started to investigate how the game data was stored in the Retrosheet database. The database is structured in a somewhat simplistic way, with only two levels of hierarchy: games and at bats. Each at bat carries all the other information about where it took place in the game: which inning, which team, how many outs, etc.
This format didn’t quite mesh with the way that I was thinking a game should be represented, but for a database structure it made perfect sense. I then began to think about how I could merge these two formats, and what information I could gain from a game.
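To make the mismatch concrete, here is a minimal Python sketch of regrouping flat at-bat rows into the inning and half-inning structure listed above. The field names (game_id, inning, batting_team, outs, event) and the sample rows are assumptions made up for illustration; they are not Retrosheet's actual column names or data.

```python
from collections import defaultdict

# Hypothetical flat rows, one per at bat. The field names here are
# illustrative assumptions, not Retrosheet's actual column names.
at_bats = [
    {"game_id": "G1", "inning": 1, "batting_team": "away", "outs": 0, "event": "single"},
    {"game_id": "G1", "inning": 1, "batting_team": "away", "outs": 0, "event": "strikeout"},
    {"game_id": "G1", "inning": 1, "batting_team": "home", "outs": 0, "event": "walk"},
    {"game_id": "G1", "inning": 2, "batting_team": "away", "outs": 1, "event": "double"},
]


def nest_at_bats(rows):
    """Regroup flat rows as game -> inning -> half -> list of at bats."""
    games = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for row in rows:
        # The away team always bats first, so it owns the top half.
        half = "top" if row["batting_team"] == "away" else "bottom"
        games[row["game_id"]][row["inning"]][half].append(row)
    return games


nested = nest_at_bats(at_bats)
for inning, halves in sorted(nested["G1"].items()):
    for half, rows in halves.items():
        print(inning, half, [r["event"] for r in rows])
```

The flat rows already contain everything needed; the nesting only moves the inning and team fields from each row up into the structure around it.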
After several attempts, I have come up with an XML format that I think will work quite well for organizing the information that I need.
This format organizes the game around the structure I mentioned previously, then adds in the game events from Retrosheet as Plate Appearances within outs. This approach creates an XML document that is very easy to read, which I can use to reconstruct a game very quickly and pull connected data out based on related fields. At this point I am not pulling all the data, just a smaller set that will be easy for me to get started with.
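As a rough sketch of what that nesting could look like in practice, the Python below builds a small document with the standard library's xml.etree.ElementTree. The element and attribute names (game, inning, half, out, plate_appearance, number, team, count, batter, event) are placeholders I am assuming for the example, as is the choice to key each out element by the number of outs already recorded; the sample plate appearances are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical plate appearances for one game. Field names, player names
# and events are invented for illustration only.
plate_appearances = [
    {"inning": 1, "team": "away", "outs": 0, "batter": "Smith", "event": "single"},
    {"inning": 1, "team": "away", "outs": 0, "batter": "Jones", "event": "strikeout"},
    {"inning": 1, "team": "away", "outs": 1, "batter": "Brown", "event": "flyout"},
    {"inning": 1, "team": "home", "outs": 0, "batter": "Davis", "event": "walk"},
]


def child(parent, tag, **attrs):
    """Return the child with this tag and attributes, creating it if missing."""
    for node in parent.findall(tag):
        if all(node.get(key) == value for key, value in attrs.items()):
            return node
    return ET.SubElement(parent, tag, attrs)


def build_game(game_id, rows):
    """Nest rows as game -> inning -> half -> out -> plate_appearance."""
    game = ET.Element("game", {"id": game_id})
    for pa in rows:
        inning = child(game, "inning", number=str(pa["inning"]))
        half = child(inning, "half", team=pa["team"])
        # Assumption: "count" is the number of outs before this plate appearance.
        out = child(half, "out", count=str(pa["outs"]))
        ET.SubElement(out, "plate_appearance",
                      {"batter": pa["batter"], "event": pa["event"]})
    return game


game = build_game("G1", plate_appearances)

# Pull every away-team plate appearance from the first inning.
away_first = game.findall(
    "./inning[@number='1']/half[@team='away']/out/plate_appearance")
print([pa.get("batter") for pa in away_first])
print(ET.tostring(game, encoding="unicode"))
```

Because the inning, team and out count live on the enclosing elements, a single path expression is enough to retrieve a related set of plate appearances.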
Here is a diagram of the overall structure.
