classed cartograms

Classification is commonplace in thematic cartography. In choropleth mapping, classification is the norm. This was not always so. The first choropleth map (created by Baron Charles Dupin in 1826) was unclassed.

According to Arthur Robinson, in his Early Thematic Mapping in the History of Cartography:

So far as we now know, the first choropleth map to provide a legend and give class limits to the tones was the 1828 Prussian manuscript map of population density. After the early 1830s most choropleth maps employed classes of various sorts, some based on rankings of the districts, some based on percentage departures from a mean, and some based on categorizing the data itself.

It seems that classification in choropleth mapping was adopted for technical rather than theoretical reasons. Robinson notes that “engravers were unsuccessful in making such [unclassed] maps, since they did not have full control over the darkness of very many flat tones.” Nonetheless, empirical research, begun in the late 1970s and continuing into the 1990s, backed up the decision, at least as regards the acquisition of specific information. Data on the recall of specific information and the acquisition of general information were less conclusive; a good summary of this research is provided in Slocum et al.’s cartography textbook.

(classed histogram legend from my first thematic map, produced for Mark Harrower’s Geography 370: Introduction to Cartography)

Range-graded proportional symbols

In proportional symbol mapping, the unclassed form is the norm. This, too, seems to be a convention with technical rather than theoretical origins. As noted in the Slocum text:

Classed, or range-graded, maps can be created by classing the data and letting a single symbol size represent a range of data values, but unclassed proportional symbol maps are more common. This might seem surprising given the frequency with which classed choropleth maps are used. The difference stems, in part, from the ease with which unclassed proportional symbol maps could be created in manual cartography (either an ink compass or a circle template could be used to draw circles of numerous sizes).

The logical extension of classed choropleth mapping to proportional symbol mapping was first suggested by Hans Joachim Meihoefer in the late 1960s. Here, too, classification is purported to make the thematic map easier to comprehend; also from Slocum:

Range grading is considered advantageous because readers can easily discriminate symbol sizes and thus readily match map and legend symbols; another advantage is that the contrast in circle sizes might enhance the map pattern, in a fashion similar to the use of an arbitrary exponent…

This comes at the expense of precision, as map readers can never get exact values for enumeration units. Borden Dent notes, “In this scaling method, symbol-size discrimination is the design goal, rather than magnitude estimation.”

Rarely, though, is exact value retrieval (magnitude estimation) the goal of thematic cartography, at least of the static variety. Further, map readers are notoriously bad size estimators, as an array of psychophysical studies in and outside of academic cartography has established. The graph below (from John Krygier’s excellent blog post on perceptual scaling) sums up the average response to 1D, 2D, and 3D proportional symbols.

The above suggests that subjects uniformly underestimate the areas of proportional symbols. Perceptual scaling was developed in response: symbol sizes are scaled according to a power function rather than in direct mathematical proportion to the data.
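
To make the two approaches concrete, here is a minimal sketch in JavaScript; the function names are mine, and the 0.57 exponent is the commonly cited Flannery compensation value.

// mathematical (absolute) scaling: circle area is proportional to the data value,
// so the radius grows with the square root of the value
function mathematicalRadius(value, maxValue, maxRadius) {
    return maxRadius * Math.sqrt(value / maxValue);
}

// perceptual scaling: a larger exponent exaggerates the differences between symbols
// so that perceived sizes better match the data ratios
function perceptualRadius(value, maxValue, maxRadius) {
    return maxRadius * Math.pow(value / maxValue, 0.57);
}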

Perhaps more important, though, are the deviations among subjects hidden in the graph above. T.L.C. Griffin, in a 1985 study, found an average underestimation similar to previous studies, but stressed that “perceptual rescaling was shown to be inadequate to correct the estimates of poor judges, while seriously impairing the results of those who were more consistent.”

Both the Slocum et al. and Dent cartography textbooks introduce range-grading of proportional symbols as a potential solution to 1) the poor average estimation of symbol size and 2) the high deviation in this underestimation.

Cartograms then

This makes a case for classification in proportional symbol mapping. It also makes the case for classification on cartograms. Cartograms can be considered a variant of the proportional symbol map in which the symbol shape used is that of the original enumeration unit in some projection. But in my cartogram research for the M.S. Cartography degree at the University of Wisconsin, I ran across no references to classifying the area of cartogram units. A more recent search also revealed no references. Classification seems particularly appropriate to cartograms because the limited research done on their perception suggests that users estimate cartogram feature area much less accurately than the simplified shapes of standard proportional symbol maps.

The above is a normal, unclassed cartogram showing U.S. population. Notice the continuous range of state areas, shown below in order of population (but smaller).

Despite the two legend chips, I think it’s quite difficult to estimate the population of states with any degree of precision. The following classed cartogram abandons precision in favor of a more easily and quickly interpreted map.

Exact values can never be retrieved, but each state can quickly be matched with one of four classes (grouped according to Jenks natural breaks classification).

The above maps were created with a beta version of indiemapper. No other mapping software allows the creation of classed cartograms. You can get around this when using tools such as ScapeToad or Frank Hardisty’s Cartogram Generator (both of which utilize the contiguous Gastner-Newman cartogram algorithm) by pre-classing your data, so that only 3-7 unique values are found in the dataset for a given attribute.
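
The pre-classing step itself is simple; here is a hedged sketch (a helper of my own, not part of either tool) that replaces each value with its class mean, given a set of ascending class break values.

// replace each unit's value with a single representative value for its class,
// so a standard cartogram tool sees only a handful of unique sizes
function preClass(values, breaks) {
    var classOf = function (v) {
        for (var i = 0; i < breaks.length; i++) {
            if (v <= breaks[i]) return i;
        }
        return breaks.length - 1; // anything above the last break falls in the top class
    };
    var sums = [], counts = [];
    values.forEach(function (v) {
        var c = classOf(v);
        sums[c] = (sums[c] || 0) + v;
        counts[c] = (counts[c] || 0) + 1;
    });
    return values.map(function (v) {
        var c = classOf(v);
        return sums[c] / counts[c]; // the class mean
    });
}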

In addition to 1) increased discriminability of symbol sizes and 2) a potentially enhanced spatial pattern, classed cartograms may hold a third advantage over the unclassed variety, though only in bivariate mapping. Fortunately, cartograms are most typically and appropriately employed in bivariate mapping. When constructing a bivariate cartogram, the cartographer sizes enumeration units according to one variable (typically population) and colors the units, choropleth-style, according to some other variable (often election results). In unclassed cartograms, some features may shrink down so as to be nearly invisible, making the reading of the second (coloring) attribute impossible. In the classed variant, a small but still legible minimum size can be established for the first class of data, ensuring that the coloring attribute can be interpreted on all units. A minimum size can also be established on an unclassed cartogram, but if mathematical/proportional scaling is employed it may result in absurdly large features at the higher end of the mapped variable.
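
One way to parameterize this (my own illustration, not indiemapper’s implementation) is to assign each class a fixed symbol area, anchored at a legible minimum:

// assign each of k classes a fixed, discriminable area, with the smallest class
// pinned at a minimum large enough for its choropleth-style coloring to stay readable
function classAreas(k, minArea, maxArea) {
    if (k === 1) return [minArea];
    var areas = [];
    for (var i = 0; i < k; i++) {
        // a geometric progression keeps adjacent class sizes visually distinct
        areas.push(minArea * Math.pow(maxArea / minArea, i / (k - 1)));
    }
    return areas;
}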

Of course, classed cartograms won’t always have a larger minimum size; this must be a conscious design decision (in the U.S. cartograms above, the classed cartogram minimum size is smaller than the minimum size found on the unclassed version).

Inappropriate

Classification in cartogram scaling is not always appropriate. Indeed, the typical unclassed form is preferable in many cases. Borden Dent writes the following of classification and choropleth mapping; I believe it’s equally applicable to the proportional and cartogram symbologies:

The purpose of the choropleth map dictates its form. If the map’s main purpose is to simplify a complex geographical distribution for the purpose of imparting a particular message, theme, or concept, the conventional [classed] choropleth technique should be followed. On the other hand, if the goal is to provide an inventory for the reader in a form that the reader must simplify and generalize, then the unclassed form should be chosen.

This sounds a lot like the modern distinction between cartography and geovisualization, or like the older one between communication and representation. These distinctions have been overblown; experts and amateurs are often looking at the same map. But there are certainly cases where the purpose and audience are more clear-cut. In such cases, classed cartograms may be an option.

While classification provides the cartographer with the “ability to direct the message of the communication” (Dent), this power comes with great responsibility: specifically, the responsible choice of 1) the number of classes, 2) the classification method, and 3) appropriate and easily discriminable symbol sizes for each class.

Research on classification in choropleth and proportional symbol mapping has settled on three and seven as the minimum and maximum number of classes that are effective. Classification methods have been much studied in academic cartography, statistics, and other fields. Jenks natural breaks classification is often employed; equal intervals and quantiles are also typical. Cartographers have less guidance on the third choice, that of an appropriate symbol size for each class.
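
For reference, here are minimal sketches of two of the simpler classification methods mentioned above (Jenks natural breaks requires a more involved optimization and is omitted); both return the upper bound of each class.

// equal intervals: divide the data range into k equally wide classes
function equalIntervalBreaks(values, k) {
    var min = Math.min.apply(null, values),
        max = Math.max.apply(null, values),
        step = (max - min) / k,
        breaks = [];
    for (var i = 1; i <= k; i++) breaks.push(min + step * i);
    return breaks;
}

// quantiles: place roughly the same number of observations in each class
function quantileBreaks(values, k) {
    var sorted = values.slice().sort(function (a, b) { return a - b; }),
        breaks = [];
    for (var i = 1; i <= k; i++) {
        var idx = Math.min(sorted.length - 1, Math.ceil(i * sorted.length / k) - 1);
        breaks.push(sorted[idx]);
    }
    return breaks;
}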

The figure above shows a set of ten class sizes for range grading developed by Hans Joachim Meihoefer as the result of a visual experiment; the symbol sizes were selected for maximum discriminability while maintaining a realistic size range (none too small, none too large). Meihoefer didn’t recommend using all ten classes at once, but reportedly any five contiguous sizes could be chosen. Indiemapper uses a slight variation of Meihoefer’s sizing scheme for classed proportional symbol and cartogram sizing.

Despite the additional responsibilities placed on the cartographer, I believe a carefully-crafted classed cartogram can be a very effective representation of some datasets for certain audiences.

visualizing MLB hit locations on a Google Map

Here’s a map of yesterday’s perfect game pitched by Mark Buehrle — only the 18th in MLB history. Highlighted is defensive replacement Dewayne Wise’s perfect game-saving catch over the wall in the 9th inning.

Here’s the same play as displayed by MLB’s Gameday, for which the initial pixel coordinates were collected.

The conversion from the pixel-based coordinates used by Gameday to a geographic latitude-longitude space took a fair amount of work. More on that in a bit. So why even attempt it? What is the value added by presenting these data on a satellite photo?

At first, it was just to see if I could do it. From Wal-Marts to crimes to tweets, the slippy map has become the de facto platform for geovisualization. Hits — balls in play — are inherently spatial, though at a micro level that gets rare consideration from neocartographers. The conversion to geographic points would also let me take advantage of the various symbologies written for mapping APIs, from heatmaps to clusterers.

As I continued work, though, I became more interested in the issue of scale: the images produced reveal the scale of baseball, easily compared to the scale of urban life. For example, most MLB ballparks have fences as far as 400 feet from home plate. In New York City, the average N-S block is 264 feet long. Such a block may contain dozens of businesses or residences, all of which we’d expect Google Maps and others to accurately geolocate. So why not hits?

Here a Jake Fox homer at the intersection of Kenmore and Waveland outside Wrigley Field.

And here, a recent Washington Nationals game, with hits placed as though home plate was located at the center of Dupont Circle.

To get an idea for how field dimensions affect the game, we can display hits from a particular stadium as if they occurred at any other. Here a recent Pirates-Phillies game that took place at the relatively small Citizens Bank Park. In the second image, I’ve placed and oriented hits as though they took place at the Pirates’ PNC Park. As you can see, two, maybe three of the home runs would have been swallowed up by the visiting team’s more cavernous outfield.

What follows is a description of how I got the data, transformed it to lat-long coordinates, and displayed it on a slippy map. For the demo proof of concept application, go here.

Getting the data

MLB’s PITCHf/x system provides highly accurate real-time data on pitch speed, break, and location, all measured using high-speed video cameras. The data — freely available from MLB Gameday — are used in real-time by a number of media outlets and after the fact by many number-crunching sabermetricians. A similar system for batted balls does not exist, though one is apparently in the works. Fairly accurate hit location data are available from the for-profit firms Baseball Info Solutions and STATS Inc, and data on home run distance/location are disseminated freely by HitTracker. But the only free source of locational information on all balls in play is the observational data collected by MLB Gameday for non-analytic purposes.

These data are purely observational, meaning hits are plotted by an MLB employee watching each game. For more on the accuracy of such data, see Is Seeing Believing? at The Hardball Times. For each ball in play, then, we have an X-Y coordinate. This coordinate is in an arbitrary pixel space and must be transformed quite a bit; more on that later.

Gameday’s XML data

MLB Gameday provides both the PITCHf/x and observational hit location data in online directories, starting here, that are organized chronologically and then by game. No API is provided, though, for querying this massive database of millions of pitches and hits — just directories and directories of XML files. Each game is described by dozens of XML files; the root game URL is constructed as follows:

http://gd2.mlb.com/components/game/mlb/ + year_{year}/month_{month}/day_{day}/ + gid_{year}_{month}_{day}_{away_team}mlb_{home_team}mlb_{game_number}/

Ex. gid_2009_07_21_balmlb_nyamlb_1, gid_2009_04_05_atlmlb_phimlb_1
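
Assembling that root URL in JavaScript might look like the following; the helper is my own, and the month and day are the zero-padded strings used in the directory names.

// build a game's root Gameday URL from its components
function gamedayUrl(year, month, day, awayTeam, homeTeam, gameNumber) {
    var base = 'http://gd2.mlb.com/components/game/mlb/';
    var datePath = 'year_' + year + '/month_' + month + '/day_' + day + '/';
    var gid = 'gid_' + year + '_' + month + '_' + day + '_' +
              awayTeam + 'mlb_' + homeTeam + 'mlb_' + gameNumber + '/';
    return base + datePath + gid;
}

// e.g. gamedayUrl('2009', '07', '21', 'bal', 'nya', '1')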

In Baseball Hacks, Joseph Adler provides Perl scripts for traversing these directories, parsing the various XML files found in them, and storing the parsed data in a MySQL database. Mike Fast has updated the scripts, which can be used pretty quickly to create a pitch and hit database with this structure. I’ve done so, and the advantages of such a database are many, but the speed of queries is perhaps the greatest. For a small webapp, though, I didn’t want to go to the hassle of talking to a database, nor did I want the server-side dependencies. The Yahoo Query Language (YQL) provides a great solution in such cases, and allowed me to create an API where none existed before and use SQL-like syntax to query it.

My YQL tables

YQL lets you use a universal SQL-like syntax on any API or web data source. Since you can use subselects, YQL also lets you join disparate data sources. Typically it is used as a wrapper for APIs, so that developers may use the same syntax across many different data sources. They do so by writing an XML file that describes a YQL table. In a basic table, data are returned from the API unmodified. Since I’ve no API to call, I rely on YQL’s execute element to run server-side Javascript for each query. With just a dozen or so lines of Javascript in the execute element of each table, I can quite easily return usable data from MLB’s directories of XML files.
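
The body of an execute element is just JavaScript run on Yahoo’s servers. Here’s a stripped-down sketch of the idea, not my actual table code: y.rest and response.object are standard parts of YQL’s execute environment, and the file name is only a stand-in for whichever Gameday XML file a given table needs.

// gameUrl would be built from the table's input keys (year, month, day, gid)
var gameUrl = 'http://gd2.mlb.com/components/game/mlb/year_2009/month_07/day_21/' +
              'gid_2009_07_21_balmlb_nyamlb_1/';
var xml = y.rest(gameUrl + 'game.xml').get().response; // fetch the file; YQL parses the XML
response.object = xml; // whatever is assigned here is returned as the query's result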

I can also use the execute element to insert records into my own database (which can be MySQL or otherwise), allowing me to keep my database up-to-date more easily than with the Adler/Fast Perl update script. I should note, though, that the YQL queries are nowhere near as fast as querying a local database, and likely aren’t quick enough for actual production (though they’re great for the kind of experimentation and rapid prototyping described here).

I initially set out to recreate Mike Fast’s database environment with YQL tables: seven tables, one each for games, game types, at bats, pitches, pitch types, players, and umpires. For this exercise, I needed only the games and atbats tables. The latter includes the hit location information if a ball is sent into play. For both, I’ve added fields, grabbing more data from the Gameday XML files as needed. I also needed a stadiums table, though this draws from data I’ve personally collected, rather than the Gameday XML files. The three tables are hosted on github and available for use by anyone. If you’re logged into Yahoo, you can try em out right now by loading up the YQL console with the mlb.gd2.env environment file. You can then use SQL syntax on these tables. A few example queries (links are to REST queries that can be called from any web app):

  • SELECT * FROM atbats WHERE gid="gid_2009_04_11_nyamlb_kcamlb_1" AND hit_x<>"NaN"
  • SELECT stadium_id FROM games WHERE year="2009" AND month="07" AND day="23"
  • SELECT * FROM stadiums WHERE id="3"
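
Calling one of these from a web app is just a JSONP request against YQL’s public REST endpoint. Here’s a hedged jQuery sketch; envUrl stands for the mlb.gd2.env environment file mentioned above, and data.query.results is YQL’s standard response wrapper.

// query the atbats table for every ball in play from one game
var query = 'SELECT * FROM atbats WHERE gid="gid_2009_04_11_nyamlb_kcamlb_1" AND hit_x<>"NaN"';
var url = 'http://query.yahooapis.com/v1/public/yql' +
          '?q=' + encodeURIComponent(query) +
          '&format=json' +
          '&env=' + encodeURIComponent(envUrl) + // URL of the mlb.gd2.env environment file
          '&callback=?';                         // let jQuery handle the JSONP callback
$.getJSON(url, function (data) {
    var atbats = data.query.results; // the rows returned by the table
    // plot each at bat's hit_x / hit_y from here
});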

Transforming the data

Pixels to feet

The hit location data recorded by MLB are reported in pixels. The employees watching the games plot locations on a 250×250 image of the ballpark; the (0,0) origin is not home plate but simply the upper left corner of the image. The first step in making these data usable is to convert from this arbitrary pixel coordinate system to one in feet or meters. We need to know 1) the scale of the 250×250 image (feet per pixel) and 2) a known point on the field (preferably home plate) in the original pixel coordinates.

Initially, I assumed that these numbers would be the same or similar for all MLB ballparks. An article by Peter Jensen disabused me of this notion. Jensen undertook to determine the home plate X-Y location and distance multiplier for each MLB ballpark by assuming a uniform hit ball distribution and imposing physical constraints on hit balls. Unfortunately, when I plotted hits using Jensen’s numbers, the ball distribution diverged quite obviously from the Gameday-plotted version: outfield balls were plotted too far from home plate, infield balls too close. Jensen’s study used 2007-2008 hit locations, and MLB changes their images from season to season, so this may explain the divergence. Either way, I needed a new method.

I devised this simpler but less systematic approach. To determine the distance multiplier, I took a look at the stadium images provided by MLB Gameday. The stadium diagrams shown on Gameday’s real-time app are all 250×250 in expanded mode. Though I’m not positive that these are the exact diagrams upon which hits are initially plotted, they likely are, as the resultant distance multipliers are quite accurate. Given a known distance (typically the straightaway center field fence distance), I can calculate the scale of the image in feet per pixel.

Determining home plate locations in Gameday’s coordinate system was less straightforward. In Baseball Hacks, Joseph Adler creates spray charts of the field using Gameday data. Unlike Jensen, Adler simply uses (125, 210) as the home plate location in pixels. I figured I could use this as a starting point, but to my surprise I found that I only needed to modify this origin for a half dozen stadiums. Of course, the only accuracy check I have is eyeballing back-and-forth between my rendering and the Gameday version, but given the observational nature of the data, I believe this is good enough.
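
With a distance multiplier and a home plate pixel location in hand, the pixel-to-feet step is only a couple of lines; this is my own sketch of the step described above.

// convert a Gameday pixel coordinate to feet, with home plate as the origin;
// Gameday's y axis increases downward, so it is flipped to point toward center field
function pixelsToFeet(hitX, hitY, plateX, plateY, feetPerPixel) {
    return {
        x: (hitX - plateX) * feetPerPixel,
        y: (plateY - hitY) * feetPerPixel
    };
}

// e.g. pixelsToFeet(hit.x, hit.y, 125, 210, feetPerPixel)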

Stadium locations and orientations

Using Google Earth, I recorded home plate latitude-longitude coordinates and eyeballed stadium orientations for all open MLB ballparks. Two stadiums are domes, so they’re out.

And three of the league’s six retractable-roofed stadiums were closed during their satellite shots.

LandShark Stadium was being used as a football field, and Citi Field hadn’t been built yet in Google’s satellite imagery, so home plate locations and stadium orientations weren’t collected for these.

For the other 23 stadiums, home plate latitude-longitude and X-Y coordinates, distance multipliers, and orientations are available in my YQL stadiums table.

Feet to latitude/longitude

Given the above, I can convert all pixel locations to feet with home plate as the origin. For stadiums oriented away from due north, these feet-based locations can be rotated about their home plate origins. The final step — converting from feet to geographic coordinates — is fairly simple, though I initially tried a couple of more complicated methods. First, I converted hit locations to meters, converted the home plate lat-long to UTM coordinates, added the hit coordinates to the latter, and converted back to lat-long. This worked fine, as did the second method I tried: a formula that calculates the hit’s geographic coordinate given the distance and bearing from home plate.

Given the huge scale and low accuracy required, I can utilize a much simpler method. Most mapping APIs include a distance method of some kind, taking as input two lat-long coordinates and returning the great circle (spherical, so perhaps off by as much as .3%) distance in meters or km. Here’s Yahoo’s, Google’s and Mapstraction’s. With this method, I can determine an approximate conversion factor between meters and degrees latitude-longitude:

// a 0.001-degree offset is passed to the API's distance method (assumed here to
// return kilometers); multiplying by 1,000,000 turns that into approximate
// meters per degree of latitude and longitude at the ballpark
var latConv = homePlate_latLong.distance( new LatLonPoint( homePlate_latLong.lat + .001, homePlate_latLong.lng ) ) * 1000000;
var lngConv = homePlate_latLong.distance( new LatLonPoint( homePlate_latLong.lat, homePlate_latLong.lng + .001 ) ) * 1000000;
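
Putting the last pieces together, a sketch of the final step might look like this; the offsets are in meters, the orientation is assumed to be stored as degrees of rotation from due north, and the sign of the rotation depends on how that orientation was recorded.

// rotate the hit's offset about home plate to account for stadium orientation
var theta = orientationDegrees * Math.PI / 180;
var xRot = xMeters * Math.cos(theta) - yMeters * Math.sin(theta);
var yRot = xMeters * Math.sin(theta) + yMeters * Math.cos(theta);

// latConv and lngConv approximate meters per degree at the ballpark, so dividing
// the rotated offsets by them yields offsets in degrees from home plate
var hitLat = homePlate_latLong.lat + yRot / latConv;
var hitLng = homePlate_latLong.lng + xRot / lngConv;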

Accuracy

A game was taking place when the satellite image of AT&T Park was taken; you can see the fans, players, and — importantly — the bases. Using the formulas above, I’ve placed white squares on the map where the bases should be, and this shows the typical accuracy achieved.

I’ve also plotted known outfield fence coordinates with similar accuracy.

Putting it all together

I created a demo application using Mapstraction, jQuery, and YUI. It’s very much a proof-of-concept, and the YQL queries can take quite a while to populate the map and table. Nonetheless, this is the only non-trivial web development I’ve done outside of Flash/Flex, and I’m quite impressed by these tools.

The app lets you visualize the hits in any individual game that took place in an open stadium in 2009. It’ll show games from earlier years, but I can’t vouch for the accuracy of the plotted points due to changes in Gameday’s observational data collection. By default, the app shows this year’s all-star game at Busch Stadium. To see another game, add a gid URL parameter. Here, for example, a few games from last night.

The display is pretty basic: just pink dots for outs and blue ones for hits. And if you want to do any of the crazy stuff with plotting hits at different ballparks or arbitrary locations, you’ll have to dig around a bit. I can imagine many aggregation, filtering, and visualization options for these data. None are explored here. But all the code and data’s up on github if you want to have a go at it.

update Aug 08 2009: My proof-of-concept app was failing for a few days. Should be back now. My MLB Gameday YQL tables weren’t working as loaded from github, so I copied em over to indiemaps. Project’s still hosted on github. Any query links below utilizing the tables have been changed.

Lens tools and fisheye map browsing

L.A.’s Cartifact recently released Cartifact Maps, a Flash-based tile map viewer with custom cartography and advanced map browsing tools. The historic overlays and beautiful cartographic design are perhaps of most interest, but I’m equally impressed by their implementation of a novel map browsing UI featuring a magnifying glass or “lens tool”.

I first saw this map browsing technique in a minimal browser Matt Bloch created for an older static project.

I implemented a lens tool in the final project version of the World Freedom Atlas. I also experimented in some of the early prototypes with a continuous fisheye effect for map browsing. The latter never really took off because of the distortion and pixelation inherent in the raster method.

And the same idea in a Google Maps mashup.

The originator of the fisheye/magnification method for multi-scale mapping is probably Edgar Kant, in a 1957 map he produced for a migration study of Asby, Sweden, by Torsten Hägerstrand. Here the “distance from the centre shrinks proportionally to the logarithm of the real distance.”

Much subsequent work addressed multi-scale map projections, the touchstone article being Snyder’s 1987 “‘Magnifying-glass’ azimuthal map projections”. Good coverage of such projections (including parallels to cartogram distortion) can be found in Canters’s Small-scale Map Projection Design.

In non-mapping UIs, the magnification/fisheye effect is fairly common; the Mac dock does it and even Cover Flow can be considered something of a variant. For browsing and selection from a “large linear list”, Ben Bederson at HCIL came up with fisheye menus.

So nothing new in general UI terms, but still pretty novel and perhaps especially applicable to online map browsing.

Map browsing

Axis Maps cartographer Andy Woodruff did a great post on a variety of map panning and zooming methods. The lens or magnifying glass tool performs both panning and zooming functions, and should be considered an alternative to the nine methods outlined there.

In interactive applications, the approach’s major strengths are threefold:

  • low mouse mileage for panning and making selections
  • less disorientation or “getting lost” as general cues are always available
  • the ability to see generals and specifics simultaneously

The last is particularly important in cartographic applications, where a thematic map’s success is often judged by its ability to present overall trends and specific values in the same map.

Semantic zoom lens tool

The Cartifact example above is particularly interesting because of the semantic zoom inherent in its lens tool. In normal, geometric zooming, a map (or other image) is simply blown up; more detail is shown by definition, because more pixels are dedicated to the image. In semantic zooming, different (typically more detailed) larger-scale renderings are shown at higher zoom levels; not only are features larger, but more details (and labels) are shown. Such semantic zooming is standard in slippy maps, which are produced and tiled at predefined zoom levels. Nonetheless, the application to a lens tool is noteworthy, especially in thematic cartography; generals and specifics can be presented simultaneously, and both can be tailored semantically to different zoom levels.
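
Under the hood, a semantic-zoom lens simply requests tiles from a deeper zoom level for the area under the cursor. Here’s a small sketch using standard slippy-map (XYZ) tile math; the two-level offset and the variable names are my own choices.

// which tile contains a given lon/lat at a given zoom level (standard XYZ tile math)
function lonLatToTile(lon, lat, zoom) {
    var n = Math.pow(2, zoom);
    var latRad = lat * Math.PI / 180;
    var x = Math.floor((lon + 180) / 360 * n);
    var y = Math.floor((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2 * n);
    return { x: x, y: y, z: zoom };
}

// the base map draws tiles at its current zoom; the lens draws the more detailed
// (and differently labeled) rendering from two zoom levels deeper under the cursor
var baseTile = lonLatToTile(cursorLon, cursorLat, baseZoom);
var lensTile = lonLatToTile(cursorLon, cursorLat, baseZoom + 2);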

Here’s a quick example I threw together in Flex using Modest Maps (right-click to view source).

I like inverting the above, or perhaps more interestingly, showing Microsoft Aerial as the base and Microsoft Hybrid as the lens; the spotlight (zoomed in or otherwise) then serves to provide political/cultural details for the moused-over region.

Application to thematic cartography

In online thematic cartography, the practice of showing smaller enumeration units at higher zoom levels is somewhat common. The NY Times has done it a few times, including their Election 2008 results maps (the county-level choropleth is revealed by zooming in).

The same idea can be applied to a lens tool. Here the lens reveals the county vote results, and can be zoomed (again, a quick Flex job) to further investigate the local-level results. Click to launch the map (it’ll take a few secs to load, project, and draw the data).

The above is based on some projections and choropleth code I released last year. I think there’s more room for experimentation here: the lens could be user-resizable and semi-transparent (so you can still see where you’re mousing over the main map); I’d also like to create a less-pixelated fisheye lens and try out multiple lenses/fisheyes (for detailed comparisons of multiple areas while still providing context).