Archive for February, 2010

Workin on some Vis!

I’ve been working all weekend on putting down a solid visualization of baseball games that gives an informative overview of exactly what happened in a game in the least amount of space. I want to offer a variety of ways to view a game in order to get the most out of it, but first and foremost I want to provide a simplified overview of all game events that shows the flow of the game and lets people see exactly what happened.

I started by trying to classify events and think of how I could rank them in a way that lets me show them linearly, unifying them per team while also showing the magnitude of each event. With this in mind I thought about the major possible events and ranked them according to the impact they have on the game, from the perspective of the batting team.

Tier 5: Home runs
Tier 4: Triples
Tier 3: Doubles
Tier 2: Singles
Tier 1: Walks/Stolen Bases
Tier 0: Outs

I had to revise this several times after experimenting with it, mostly due to issues with classifying walks/stolen bases and singles. At first I thought that I could get away with putting them in the same tier, since they all result in the player moving up one base. I then realized that there is a clear difference between the two groups. While walks and stolen bases both advance the runner towards scoring position on the diamond, singles put the ball in play, which can result in RBIs. A walk can sometimes result in an RBI if the bases are loaded, but even this is restricted to one run, while a single can drive in up to three runs.
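Here is a minimal sketch of how this ranking could be encoded in Python. The tier numbers come straight from the list above; the event names used as keys and the helper functions are just names I made up for illustration, not anything from a real data set.

```python
# Tier values copied from the list above.
# The event-name keys are hypothetical labels, not Retrosheet or MLB codes.
EVENT_TIERS = {
    "home_run": 5,
    "triple": 4,
    "double": 3,
    "single": 2,
    "walk": 1,
    "stolen_base": 1,
    "out": 0,
}


def tier_for(event):
    """Return the impact tier for one event, from the batting team's view."""
    return EVENT_TIERS[event]


def half_inning_tiers(events):
    """Map a sequence of events to their tiers, keeping the original order."""
    return [tier_for(event) for event in events]


# A made-up half inning, only to show the shape of the output.
half_inning = ["single", "walk", "out", "home_run", "out", "out"]
print(half_inning_tiers(half_inning))
```

Keeping walks and stolen bases on the same number but below singles is the whole point of the revision: the ordering lives in one table, so it is easy to change again after more experimenting.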

After coming up with this classification, I began to work on a circular visualization of events that classified them according to a number of criteria. While in concept this made sense to me, in reality it looked somewhat interesting but showed very little at a glance. I also attempted to create an organic shape from this visualization just to see what it looked like, and while this too proved interesting I don’t think it was very successful in showing anything.

PastedGraphic-1

PastedGraphic-2

I then moved on to the next logical step, visualizing things linearly. I continued with my key of dashes and colors, and while things made a bit more sense linearly, it was still not working very well IMO. I experimented with representing outs as negative events at this point as well, but I don’t think it read the way I intended.

PastedGraphic-3

PastedGraphic-4

At this point I also created a more organic flow diagram from these charts. I think this actually worked out really well and is a great overall representation of the flow of the game. While it doesn’t really show exactly what I wanted from the first overall view of the game a user is presented with, I think it is a great sub-view and something that I’m going to continue to investigate.

PastedGraphic-6

This is where I realized I was sorta getting caught up in doing the same thing over and over, so I decided to give it a rest and move on the next day.

Today I started working again and had some major breakthroughs. I spent a lot of time working on various visualizations, trying to cut down on non-data ink (after reading Edward Tufte), and began to really simplify everything. I did a lot of different experimenting, and came up with a visualization I think finally works pretty well. I did some quick testing with a few casual baseball fans I know (obviously I need to do more user testing, but something is better than nothing and I just finished it!) and I was pleased that they could easily recreate the narrative of a game themselves after a single, very brief explanation.

PastedGraphic-8

PastedGraphic-9

PastedGraphic-10

PastedGraphic-11

All of this needs more work, I know, but I put in a crapload of time this weekend and I think that I am definitely making progress. I also worked with a real data set vs just making things up, which was a bit of a pain in the ass at first but really helped me a lot in the long run. I should have been working like this before and I’m glad I am doing it now.

That’s it for now. I also did a good amount of work on the other sections of the application, but I don’t feel like they are ready to show yet, so that will be another post.

If you read this please check everything out and give me some feedback!

Thanks,

-Steve


Getting MySQLdb up and running on Snow Leopard with MAMP

OK this is another post that is more for my own future sanity than anything else. This has been a huge pain in the ass for me many times and I want to document the steps I took to get this working.

First, this is a good general overview:

http://cd34.com/blog/programming/python/mysql-python-and-snow-leopard/

So I downloaded/installed MySQL, then I had to add it to my PATH var to get the MySQLdb install to work.

# Open with TextMate
mate ~/.profile

export PATH="/usr/local/bin:/usr/local/sbin:/usr/local/mysql/bin:$PATH"

Then make sure you reload your profile, or close and open a new terminal window:
source ~/.profile

That will let the MySQLdb install work. Then when you run Python and try to use MySQLdb you’ll get errors. What you need to do is the following:

Change your PATH so that it points at MAMP’s MySQL:

export PATH=/Applications/MAMP/Library/bin:$PATH

Then run this command:

sudo ln -s /Applications/MAMP/tmp/mysql/mysql.sock /tmp/mysql.sock

Then you should be good to go. You might be able to skip the first step, but I don’t know if the MySQL that comes with MAMP is 64-bit, and you need 64-bit for Snow Leopard’s install of Python.
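To check that the setup took, something like the following rough sketch might help. It looks for the socket at the two paths used above and then hands it to MySQLdb.connect through its unix_socket argument. The helper names are made up for this example, and the user, password and database are placeholders you would swap for your own MAMP settings.

```python
import os

# The /tmp symlink and the MAMP socket path from the steps above.
SOCKET_CANDIDATES = [
    "/tmp/mysql.sock",
    "/Applications/MAMP/tmp/mysql/mysql.sock",
]


def find_socket(candidates=SOCKET_CANDIDATES):
    """Return the first socket path that exists, or None if none do.

    A dangling symlink (MAMP not running) counts as missing.
    """
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


def connect(user, passwd, db):
    """Open a MySQLdb connection over whichever socket was found."""
    import MySQLdb  # imported here so find_socket works without the driver

    sock = find_socket()
    if sock is None:
        raise RuntimeError("no MySQL socket found; is the MAMP server running?")
    return MySQLdb.connect(unix_socket=sock, user=user, passwd=passwd, db=db)


if __name__ == "__main__":
    print("socket: %s" % find_socket())
```

If find_socket() comes back with None, the symlink is probably pointing at a socket that doesn’t exist yet, which usually just means the MAMP MySQL server isn’t started.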

Why is python such a pain in the ass?


Considering baseball data API

So I have been doing a lot of research about the data I’m going to need for my thesis, and where I should get it. The two main sources of data I am looking at are Retrosheet, which I’ve previously mentioned extensively, and the unofficial MLB XML sources which are used by the official MLB Gameday app.

Both of these resources have their advantages/disadvantages, and neither completely suits my needs. I am starting to think that in order to get what I want, I will have to create a hybrid of the two, but the big question is how exactly will I do this? Here are my thoughts on the good and bad about each source:

Retrosheet:
Great resource, a pain to get into a database but once it’s there it’s very complete
Data going back to 1951
Self contained, doesn’t rely on resources that may be unavailable in the future
Doesn’t contain any current season data
No great way to get data, have to write custom XML to get data in a usable form

MLB Data:
Completely thorough
Provided in XML format that is really well executed
Updated daily for current games
Not official, no promises that it will be there in the future
Questionable as far as a data source/retrieving data repeatedly.

I have been working on creating an API for Retrosheet games that returns complete data, and I’ve made some headway, but it’s largely a work in progress. After finding the XML-based MLB data I can see that they have already tackled this and have the data in a beautiful format that is pretty much exactly what I wanted to make. They do not, however, have this data stored in a database, or offer any easy way to request a game in a traditional way.

I think that my strategy for this problem will be to combine both resources into a best-of-both-worlds situation that I can make freely available to anyone. I will adapt the MLB XML format and parse all Retrosheet games to this, taking advantage of the format already established by MLB.com as a standard and saving myself some work. I will then write code that does the opposite with the MLB.com data, checking every day of a season and breaking down their XML into data that can be put into the Retrosheet database. This will likely be the hardest part of the process, but I think I can figure it out. This will provide me with a complete Retrosheet database with current game data and an XML format that will provide a full summary of a game in an easy-to-read way.
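As a very rough sketch of the Retrosheet-to-XML direction, the snippet below turns a single Retrosheet play record (play, inning, visitor/home flag, player id, count, pitches, event) into an XML element. The element and attribute names are placeholders I made up, not the actual MLB schema, and the sample line is invented just to show the shape of a record.

```python
import csv
import xml.etree.ElementTree as ET


def play_to_xml(line):
    """Convert one Retrosheet 'play' record into an XML element.

    The tag and attribute names are hypothetical, not MLB's real format.
    """
    fields = next(csv.reader([line]))
    if len(fields) != 7 or fields[0] != "play":
        raise ValueError("not a play record: %r" % line)
    _, inning, side, batter, count, pitches, event = fields
    return ET.Element("atbat", {
        "inning": inning,
        # Retrosheet uses 0 for the visiting team and 1 for the home team,
        # and the home team bats in the bottom half of the inning.
        "half": "bottom" if side == "1" else "top",
        "batter": batter,
        "count": count,
        "pitches": pitches,
        "event": event,
    })


# Invented sample record, for illustration only.
sample = "play,5,1,doejo001,00,,S8.3-H;1-2"
print(ET.tostring(play_to_xml(sample), encoding="unicode"))
```

The event string is passed through untouched here. Decoding it into hits, outs and runner advances is the real work, and going the other way (MLB XML back into Retrosheet-style rows) would need the reverse mapping.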

Hopefully this works out for me. If I can do this and provide access to it through my web server, I think it will be exciting and a great contribution to the baseball data community.
