visualizing MLB hit locations on a Google Map

Here’s a map of yesterday’s perfect game pitched by Mark Buehrle — only the 18th in MLB history. Highlighted is defensive replacement Dewayne Wise’s perfect game-saving catch over the wall in the 9th inning.

Here is the same game as displayed by MLB’s Gameday, the source from which the initial pixel coordinates were collected.

The conversion from the pixel-based coordinates used in the Gameday view above to a geographic latitude-longitude space took a fair amount of work. More on that in a bit. So why even attempt it? What value is added by presenting these data on a satellite photo?

At first, it was just to see if I could do it. From Wal-Marts to crimes to tweets, the slippy map has become the de facto platform for geovisualization. Hits — balls in play — are inherently spatial, though at a micro level that gets rare consideration from neocartographers. The conversion to geographic points would also let me take advantage of the various symbologies written for mapping APIs, from heatmaps to clusterers.

As I continued work, though, I became more interested in the issue of scale: the images produced reveal the scale of baseball, easily compared to the scale of urban life. For example, most MLB ballparks have fences as far as 400 feet from home plate. In New York City, the average N-S block is 264 feet long. Such a block may contain dozens of businesses or residences, all of which we’d expect Google Maps and others to accurately geolocate. So why not hits?

Here is a Jake Fox homer at the intersection of Kenmore and Waveland outside Wrigley Field.

And here is a recent Washington Nationals game, with hits placed as though home plate were located at the center of Dupont Circle.

To get an idea of how field dimensions affect the game, we can display hits from a particular stadium as if they occurred at any other. Here is a recent Pirates-Phillies game that took place at the relatively small Citizens Bank Park. In the second image, I’ve placed and oriented hits as though they took place at the Pirates’ PNC Park. As you can see, two, maybe three of the home runs would have been swallowed up by the visiting team’s more cavernous outfield.

What follows is a description of how I got the data, transformed them to lat-long coordinates, and displayed them on a slippy map. For the proof-of-concept demo application, go here.

Getting the data

MLB’s PITCHf/x system provides highly accurate real-time data on pitch speed, break, and location, all measured using high-speed video cameras. The data, freely available from MLB Gameday, are used in real time by a number of media outlets and after the fact by many number-crunching sabermetricians. A similar system for batted balls does not exist, though one is apparently in the works. Fairly accurate hit location data are available from the for-profit firms Baseball Info Solutions and STATS Inc, and data on home run distance/location are disseminated freely by HitTracker. But the only free source of locational information on all balls in play is the observational data collected by MLB Gameday for non-analytic purposes.

These data are purely observational, meaning hits are plotted by an MLB employee watching each game. For more on the accuracy of such data, see Is Seeing Believing? at The Hardball Times. For each ball in play, then, we have an X-Y coordinate. This coordinate is in an arbitrary pixel space and must be transformed quite a bit; more on that later.

Gameday’s XML data

MLB Gameday provides both the PITCHf/x and observational hit location data in online directories, starting here, that are organized chronologically and then by game. No API is provided, though, for querying this massive database of millions of pitches and hits — just directories and directories of XML files. Each game is described by dozens of XML files; the root game URL is accessed thusly:

http://gd2.mlb.com/components/game/mlb/ + year_{year}/month_{month}/day_{day}/ + gid_{year}_{month}_{day}_{away_team}mlb_{home_team}mlb_{game_number}/

Ex. gid_2009_07_21_balmlb_nyamlb_1, gid_2009_04_05_atlmlb_phimlb_1
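
The URL pattern above can be assembled mechanically. What follows is a minimal sketch rather than code from the original app: the function names buildGid and buildGameUrl are hypothetical, and it assumes month and day are always zero-padded to two digits, as they are in the example gids.

```javascript
// Root of the Gameday directory tree, taken from the pattern above.
var GAMEDAY_ROOT = "http://gd2.mlb.com/components/game/mlb/";

// Zero-pad a month or day to two digits (7 -> "07").
function pad2(n) {
  return String(n).padStart(2, "0");
}

// Hypothetical helper: build a gid such as gid_2009_07_21_balmlb_nyamlb_1.
// awayTeam and homeTeam are the short team codes used by Gameday ("bal", "nya").
function buildGid(year, month, day, awayTeam, homeTeam, gameNumber) {
  return [
    "gid", year, pad2(month), pad2(day),
    awayTeam + "mlb", homeTeam + "mlb", gameNumber
  ].join("_");
}

// Hypothetical helper: build the root directory URL for a single game.
function buildGameUrl(year, month, day, awayTeam, homeTeam, gameNumber) {
  return GAMEDAY_ROOT +
    "year_" + year + "/" +
    "month_" + pad2(month) + "/" +
    "day_" + pad2(day) + "/" +
    buildGid(year, month, day, awayTeam, homeTeam, gameNumber) + "/";
}
```

Calling buildGid(2009, 7, 21, "bal", "nya", 1) reproduces the first example gid above.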

In Baseball Hacks, Joseph Adler provides Perl scripts for traversing these directories, parsing the various XML files found in them, and storing the parsed data in a MySQL database. Mike Fast has updated the scripts, which can be used pretty quickly to create a pitch and hit database with this structure. I’ve done so, and the advantages of such a database are many, but the speed of queries is perhaps the greatest. For a small webapp, though, I didn’t want to go to the hassle of talking to a database, nor did I want the server-side dependencies. The Yahoo Query Language (YQL) provides a great solution in such cases, and allowed me to create an API where none existed before and use SQL-like syntax to query it.

My YQL tables

YQL lets you use a universal SQL-like syntax on any API or web data source. Since you can use subselects, YQL also lets you join disparate data sources. Typically it is used as a wrapper for APIs, so that developers may use the same syntax across many different data sources. They do so by writing an XML file that describes a YQL table. In a basic table, data are returned from the API unmodified. Since I’ve no API to call, I rely on YQL’s execute element to run server-side JavaScript for each query. With just a dozen or so lines of JavaScript in the execute element of each table, I can quite easily return usable data from MLB’s directories of XML files.

I can also use the execute element to insert records into my own database (which can be MySQL or otherwise), allowing me to keep my database up-to-date more easily than with the Adler/Fast Perl update script. I should note, though, that the YQL queries are nowhere near as fast as querying a local database, and likely aren’t quick enough for actual production (though they’re great for the kind of experimentation and rapid prototyping described here).

I initially set out to recreate Mike Fast’s database environment with YQL tables: seven tables, one each for atbats, games, game types, pitches, pitch types, players, and umpires. For this exercise, I needed only the games and atbats tables. The latter includes the hit location information if a ball is sent into play. For both, I’ve added fields, grabbing more data from the Gameday XML files as needed. I also needed a stadiums table, though this draws from data I’ve personally collected, rather than the Gameday XML files. The three tables are hosted on github and available for use by anyone. If you’re logged into Yahoo, you can try em out right now by loading up the YQL console with the mlb.gd2.env environment file. You can then use SQL syntax on these tables. A few example queries (links are to REST queries that can be called from any web app):

  • SELECT * FROM atbats WHERE gid="gid_2009_04_11_nyamlb_kcamlb_1" AND hit_x<>"NaN"
  • SELECT stadium_id FROM games WHERE year="2009" AND month="07" AND day="23"
  • SELECT * FROM stadiums WHERE id="3"
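
For illustration, a query like those above can be turned into a REST URL by URL-encoding it into a query string. This is a sketch under stated assumptions, not the author's code: the endpoint shown is assumed to be YQL's historical public endpoint, the function name buildYqlUrl is hypothetical, and the environment-file URL is left as a parameter because the post does not give it.

```javascript
// Assumption: YQL's historical public REST endpoint.
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: build a REST URL for a YQL statement.
// envUrl (optional) is the URL of an environment file that loads custom
// tables, such as the mlb.gd2.env file mentioned above; its actual
// location is not stated here, so the caller must supply it.
function buildYqlUrl(query, envUrl) {
  var params = ["q=" + encodeURIComponent(query), "format=json"];
  if (envUrl) {
    params.push("env=" + encodeURIComponent(envUrl));
  }
  return YQL_ENDPOINT + "?" + params.join("&");
}
```

The resulting URL can then be requested from any web app, for example through a JSONP call in jQuery.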

Transforming the data

Pixels to feet

The hit location data recorded by MLB are reported in pixels. The employees watching the games plot locations on a 250×250 image of the ballpark; the (0,0) origin is not home plate but simply the upper-left corner of the image. The first step in making these data usable is to convert from this arbitrary pixel coordinate system to one in feet or meters. We need to know 1) the scale of the 250×250 image (feet per pixel) and 2) a known point on the field (preferably home plate) in the original pixel coordinates.

Initially, I assumed that these numbers would be the same or similar for all MLB ballparks. An article by Peter Jensen disabused me of this notion. Jensen undertook to determine the home plate X-Y location and distance multiplier for each MLB ballpark by assuming a uniform hit ball distribution and imposing physical constraints on hit balls. Unfortunately, when I plotted hits using Jensen’s numbers, the ball distribution diverged quite obviously from the Gameday-plotted version: outfield balls were plotted too far from home plate, infield balls too close. Jensen’s study used 2007-2008 hit locations, and MLB changes its images from season to season, so this may explain the divergence. Either way, I needed a new method.

I devised this simpler but less systematic approach. To determine the distance multiplier, I took a look at the stadium images provided by MLB Gameday. The stadium diagrams shown on Gameday’s real-time app are all 250×250 in expanded mode. Though I’m not positive that these are the exact diagrams upon which hits are initially plotted, they likely are, as the resultant distance multipliers are quite accurate. Given a known distance (typically the straightaway center field fence distance), I can calculate the scale of the image in feet per pixel.
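
The scale calculation amounts to dividing a known real-world distance by the matching pixel distance on the diagram. A minimal sketch follows; the function name feetPerPixel is hypothetical, and the pixel positions and the 400-foot distance in the usage note are made-up illustrative numbers, not measurements from any real stadium image.

```javascript
// Hypothetical helper: image scale in feet per pixel, from a known
// distance between two points that can be located on the diagram
// (for example, home plate and the straightaway center field fence).
function feetPerPixel(homePlatePx, fencePx, knownFeet) {
  var dx = fencePx.x - homePlatePx.x;
  var dy = fencePx.y - homePlatePx.y;
  var pixels = Math.sqrt(dx * dx + dy * dy);
  if (pixels === 0) {
    throw new Error("The two pixel locations must differ");
  }
  return knownFeet / pixels;
}
```

With illustrative values, feetPerPixel({x: 125, y: 210}, {x: 125, y: 50}, 400) spans 160 pixels and gives 2.5 feet per pixel.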

Determining home plate locations in Gameday’s coordinate system was less straightforward. In Baseball Hacks, Joseph Adler creates spray charts of the field using Gameday data. Unlike Jensen, Adler simply uses (125, 210) as the home plate location in pixels. I figured I could use this as a starting point, but to my surprise I found that I only needed to modify this origin for a half dozen stadiums. Of course, the only accuracy check I have is eyeballing back-and-forth between my rendering and the Gameday version, but given the observational nature of the data, I believe this is good enough.
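
Putting the two numbers together, the pixel-to-feet step can be sketched as below. This is an illustration, not the app's actual code: the function name pixelToFeet is hypothetical, and it assumes that "up" in the image points from home plate toward center field, which is why the y axis is flipped (image y grows downward from the upper-left origin).

```javascript
// Hypothetical helper: convert a Gameday pixel location to feet,
// with home plate as the origin.
//   hitPx        hit location in image pixels
//   homePlatePx  home plate location in image pixels, e.g. {x: 125, y: 210}
//   scale        image scale in feet per pixel
// Returns x (positive toward the right-field side) and
// y (positive toward center field), both in feet.
function pixelToFeet(hitPx, homePlatePx, scale) {
  return {
    x: (hitPx.x - homePlatePx.x) * scale,
    // Image y increases downward, so subtract in the opposite order.
    y: (homePlatePx.y - hitPx.y) * scale
  };
}
```

For example, with home plate at (125, 210) and an illustrative scale of 2.5 feet per pixel, a hit plotted at (125, 50) lands at x = 0, y = 400 feet.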

Stadium locations and orientations

Using Google Earth, I recorded home plate latitude-longitude coordinates and eyeballed stadium orientations for all open MLB ballparks. Two stadiums are domes, so they’re out.

And three of the league’s six retractable-roofed stadiums had their roofs closed during their satellite shots.

LandShark Stadium was being used as a football field, and Citi Field hadn’t been built yet in Google’s satellite imagery, so home plate locations and stadium orientations weren’t collected for these.

For the other 23 stadiums, home plate latitude-longitude and X-Y coordinates, distance multipliers, and orientations are available in my YQL stadiums table.

Feet to latitude/longitude

Given the above, I can convert all pixel locations to feet with home plate as the origin. For stadiums oriented away from due north, these feet-based locations can be rotated about their home plate origins. The final step — converting from feet to geographic coordinates — is fairly simple, though I initially tried a couple of more complicated methods. First, I converted hit locations to meters, converted the home plate lat-long to UTM coordinates, added the hit coordinates to the latter, and converted back to lat-long. This worked fine, as did the second method I tried: a formula that calculates the hit’s geographic coordinate given the distance and bearing from home plate.
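
The rotation about home plate is a standard 2D rotation. The sketch below assumes a particular convention that the post does not spell out: the orientation is taken to be the compass bearing (degrees clockwise from due north) of the line from home plate to center field. The function name rotateToEastNorth is hypothetical.

```javascript
// Hypothetical helper: rotate a point given in field coordinates
// (x toward the right-field side, y toward center field, in feet)
// into east/north offsets from home plate, also in feet.
// bearingDeg is an assumed convention: the compass bearing, in degrees
// clockwise from north, of the line from home plate to center field.
function rotateToEastNorth(pointFeet, bearingDeg) {
  var theta = bearingDeg * Math.PI / 180;
  var cos = Math.cos(theta);
  var sin = Math.sin(theta);
  return {
    east: pointFeet.x * cos + pointFeet.y * sin,
    north: -pointFeet.x * sin + pointFeet.y * cos
  };
}
```

Under this convention a park facing due north (bearing 0) needs no rotation, while in a park facing due east (bearing 90) a ball hit 400 feet to straightaway center ends up 400 feet east of home plate. Substituting another park's bearing and home plate is also how hits can be redrawn as if they occurred at a different stadium.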

Given the huge scale and low accuracy required, I can utilize a much simpler method. Most mapping APIs include a distance method of some kind, taking as input two lat-long coordinates and returning the great circle (spherical, so perhaps off by as much as .3%) distance in meters or km. Here’s Yahoo’s, Google’s and Mapstraction’s. With this method, I can determine an approximate conversion factor between meters and degrees latitude-longitude:

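// Meters per degree of latitude and longitude at home plate. Assumes distance() returns kilometers: a 0.001 degree offset in km, times 1,000,000, gives meters per degree.
var latConv = homePlate_latLong.distance( new LatLonPoint( homePlate_latLong.lat + .001, homePlate_latLong.lng ) ) * 1000000;
var lngConv = homePlate_latLong.distance( new LatLonPoint( homePlate_latLong.lat, homePlate_latLong.lng + .001 ) ) * 1000000;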
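
To show how such conversion factors might be applied, here is a self-contained sketch. It is not the app's code: a haversine function stands in for the mapping API's distance method, the Earth radius of 6371 km is an assumed mean value, and the names haversineKm, metersPerDegree, and feetToLatLng are hypothetical.

```javascript
var EARTH_RADIUS_KM = 6371;   // assumed mean Earth radius
var FEET_TO_METERS = 0.3048;

// Stand-in for a mapping API's distance method: great-circle
// distance in kilometers between two {lat, lng} points in degrees.
function haversineKm(a, b) {
  var toRad = Math.PI / 180;
  var sinLat = Math.sin((b.lat - a.lat) * toRad / 2);
  var sinLng = Math.sin((b.lng - a.lng) * toRad / 2);
  var h = sinLat * sinLat +
    Math.cos(a.lat * toRad) * Math.cos(b.lat * toRad) * sinLng * sinLng;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Approximate meters per degree of latitude and longitude near a point,
// measured across a small 0.001 degree step.
function metersPerDegree(point) {
  var step = 0.001;
  return {
    lat: haversineKm(point, {lat: point.lat + step, lng: point.lng}) * 1000 / step,
    lng: haversineKm(point, {lat: point.lat, lng: point.lng + step}) * 1000 / step
  };
}

// Offset home plate by a hit's east/north distances (in feet).
function feetToLatLng(homePlate, eastFeet, northFeet) {
  var conv = metersPerDegree(homePlate);
  return {
    lat: homePlate.lat + (northFeet * FEET_TO_METERS) / conv.lat,
    lng: homePlate.lng + (eastFeet * FEET_TO_METERS) / conv.lng
  };
}
```

Because the factors are computed at home plate itself, the linear approximation holds well over the few hundred feet a batted ball travels.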

Accuracy

A game was taking place when the satellite image of AT&T Park was taken; you can see the fans, players, and — importantly — the bases. Using the formulas above, I’ve placed white squares on the map where the bases should be, and this shows the typical accuracy achieved.

I’ve also plotted known outfield fence coordinates with similar accuracy.

Putting it all together

I created a demo application using Mapstraction, jQuery, and YUI. It’s very much a proof-of-concept, and the YQL queries can take quite a while to populate the map and table. Nonetheless, this is the only non-trivial web development I’ve done outside of Flash/Flex, and I’m quite impressed by these tools.

The app lets you visualize the hits in any individual game that took place in an open stadium in 2009. It’ll show games from earlier years, but I can’t vouch for the accuracy of the plotted points due to changes in Gameday’s observational data collection. By default, the app shows this year’s all-star game at Busch Stadium. To see another game, add a gid URL parameter. Here, for example, a few games from last night.

The display is pretty basic: just pink dots for outs and blue ones for hits. And if you want to do any of the crazy stuff with plotting hits at different ballparks or arbitrary locations, you’ll have to dig around a bit. I can imagine many aggregation, filtering, and visualization options for these data. None are explored here. But all the code and data’s up on github if you want to have a go at it.

update Aug 08 2009: My proof-of-concept app was failing for a few days. Should be back now. My MLB Gameday YQL tables weren’t working as loaded from github, so I copied em over to indiemaps. Project’s still hosted on github. Any query links below utilizing the tables have been changed.

Comments

  1. Thanks for posting this awesome POC. I’m eager to see what other data are available for the games. Would be really neat if you could drill down for a particular batter/pitcher, for example. Night games vs day games at Wrigley is an interesting idea that immediately comes to mind. Can definitely see the benefit of exploring data this way. Go Sox (White).

    Joel
    Posted July 25, 2009 at 4:15 pm
  2. Excellent mashup idea. I particularly like the ability to compare the same hit set across multiple ballparks. This should make it very easy to compute a “HR index” for a ballpark, allowing fantasy statisticians to compute better potential HR counts for players based on where they’ll be playing.

    Posted July 28, 2009 at 4:08 pm
  3. A great start on interpreting gameday XML. Keep it up!

    Alexander
    Posted July 28, 2009 at 9:42 pm
  4. Great stuff - one of the best examples I’ve seen on integrating baseball data with Google Maps. I’m looking at similar applications of game level data for historical results (the 1960 Pirates season, for example), but they won’t have the same level of detail due to the inavailability of hit location data.

    Posted August 11, 2010 at 8:57 am
