18 Optimize the data–ink ratio
We can broadly subdivide the graphical components in any visualization into components that represent data and components that do not. The former are elements such as the points in a scatter plot, the bars in a histogram or bar plot, and the shaded areas in a heatmap. The latter are elements such as plot axes, axis ticks and labels, axis titles, legends, and plot annotations. As a general rule, most of the ink (Chapter 13) in a plot should be devoted to displaying data. Remove unnecessary frames, lines, or other adornments. For non-data elements that can’t be removed (such as a legend), make sure they aren’t overly prominent and that they stand back relative to the data.
18.1 Finding the appropriate data–ink ratio
The idea that one should remove non-data visual elements from graphs was popularized by Edward Tufte in his book “The Visual Display of Quantitative Information” (Tufte 2001). Tufte introduces the concept of the “data–ink ratio”, which he defines as the “proportion of a graphic’s ink devoted to the non-redundant display of data-information.” He then writes:
Maximize the data–ink ratio, within reason.
(emphasis mine) I have emphasized the phrase “within reason” because it is critical and frequently forgotten. Removing non-data ink is valuable up to a point. Beyond that point, making a graph more minimal will make it less compelling.
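Read literally, Tufte’s definition amounts to a ratio of two quantities of ink. One compact way to write it is sketched below; the labels inside the formula are shorthand for amounts of ink, not quantities that one would normally measure precisely.

```latex
\[
\text{data--ink ratio}
  = \frac{\text{data ink}}{\text{total ink used to print the graphic}}
  = 1 - \left(\text{proportion of the graphic that can be erased without loss of data information}\right)
\]
```

Written this way, it is easy to see why maximizing the ratio without limit is a problem: the ratio approaches 1 only when axes, labels, and every other piece of context have been erased.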
First, let’s consider a figure that clearly has too much non-data ink (Figure 18.1). In this figure, the colored points in the plot panel (the framed center area containing data points) are data ink. Everything else is non-data ink. The non-data ink includes a frame around the entire figure, a frame around the plot panel, and a frame around the legend. None of these frames are needed. We also see a prominent and dense background grid that draws attention away from the actual data points. By removing the frames and minor grid lines and by drawing the major grid lines in a light gray, we arrive at Figure 18.2. In this version of the figure, the actual data points stand out much more clearly, and they are perceived as the most important component of the figure.
However, we can take the removal of non-data ink too far. Figure 18.3 is a minimalist version of Figure 18.2, and a clear regression. Most importantly, the axis tick labels and titles have been made so faint that they are hard to see. If we just glance at the figure we will not immediately perceive what data is actually shown. We only see points floating in space. Second, the legend annotations are so faint that the points in the legend could be mistaken for data points. This effect is amplified because there is no clear visual separation between the plot area and the legend. Notice how the background grid in Figure 18.2 both anchors the points in space and sets off the data area from the legend area. Both of these effects have been lost in Figure 18.3.
Figures with too little non-data ink commonly suffer from figure elements that appear to float in space, without clear connection or reference to anything. This problem tends to be particularly severe in small-multiples plots. Figure 18.4 shows a small-multiples plot comparing six different bar plots, but it looks more like a piece of modern art than a useful data visualization. The bars are not anchored to a clear baseline and the individual plot facets are not clearly delineated. We can resolve these issues by adding a light gray background and thin horizontal grid lines to each facet (Figure 18.5).
Since removing too much non-data ink can be just as bad as adding too much of it, I want to propose an alternative to Tufte’s maxim. We need to optimize, rather than maximize, the data–ink ratio. Make sure you don’t overload your plot with non-data ink, such that the data remains in the foreground, but don’t take it to the point where the data loses context.
Optimize the data–ink ratio.
18.2 Background grids
Not all forms of non-data ink are equally superfluous. In particular, by Tufte’s definition, background grids, axis lines, axis ticks, and labels do not count towards the ink used to represent data, and thus they decrease the data–ink ratio. However, they do carry critical information. We can’t just eliminate them all and hope to still have a meaningful visualization. In fact, one of the first rules of data visualization is to label your axes. And if labels are required, then making them sufficiently dark and prominent that they are actually visible and decipherable is also a requirement.
Whether a background grid is required is less clear-cut, and reasonable people can disagree about what the best options are for a given visualization. The R package ggplot2 has popularized a style using a fairly prominent background grid of white lines on a gray background. Figure 18.6 shows an example in this style. The figure displays the change in stock price of four major tech companies over a five-year window, from 2012 to 2017. With apologies to the ggplot2 author Hadley Wickham, for whom I have the utmost respect, I don’t find the white-on-gray background grid particularly attractive. To my eye, the gray background can detract from the actual data, and a grid with major and minor lines can be too dense. I also find the gray squares in the legend confusing.
Arguments in favor of the gray background include that it (i) helps the plot to be perceived as a single visual entity and (ii) prevents the plot from appearing as a white box amid surrounding dark text (Wickham 2016). I completely agree with the first point, and it was the reason I used gray backgrounds in Figure 18.5. For the second point, I’d like to caution that the perceived darkness of text will depend on the font size, font face, and line spacing, and the perceived darkness of a figure will depend on the absolute amount and color of ink used, including all data ink. A scientific paper typeset in dense, 10-point Times New Roman will look much darker than a coffee-table book typeset in 14-point Palatino with one-and-a-half line spacing. Likewise, a scatter plot of five points in yellow will look much lighter than a scatter plot of 10,000 points in black. If you want to use a gray figure background, consider the color intensity of your figure foreground, as well as the expected layout and typography of the text around your figures, and adjust the choice of your background gray accordingly. Otherwise, it could happen that your figures end up standing out as dark boxes among the surrounding lighter text. Also, keep in mind that the colors you use to plot your data need to work with the gray background. We tend to perceive colors differently against different backgrounds, and a gray background requires darker and more saturated foreground colors than a white background.
We can go all the way in the opposite direction and remove both the background and the grid lines (Figure 18.7). In this case, we need visible axis lines to frame the plot and keep it as a single visual unit. For this particular figure, I think this choice is a worse option, and I have labeled it as “bad”. In the absence of any background grid whatsoever, the curves seem to float in space, and it’s difficult to reference the final values on the right to the axis ticks on the left.
At the absolute minimum, we need to add one horizontal reference line. Since the stock prices in Figure 18.7 are indexed to 100 in June 2012, marking this value with a thin horizontal line at y = 100 helps a lot (Figure 18.8). Alternatively, we can use a minimal “grid” of horizontal lines. For a plot where we are primarily interested in the change in y values, vertical grid lines are not needed. Moreover, grid lines positioned at only the major axis ticks will often be sufficient. In addition, the axis line can be omitted or made very thin, since the horizontal lines clearly mark the extent of the plot (Figure 18.9).
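The reference line at y = 100 works because of how indexing rescales the data: every price is divided by the price on the reference date and multiplied by 100, so each series passes through 100 on that date. The following is a minimal sketch of that arithmetic. The function name `index_to_base` and the example prices are hypothetical and invented for illustration; they are not the data behind the figures.

```python
def index_to_base(prices, base_position=0, base_value=100.0):
    """Rescale a price series so the value at base_position equals base_value."""
    base_price = prices[base_position]
    if base_price == 0:
        raise ValueError("cannot index to a base price of zero")
    return [base_value * price / base_price for price in prices]


# Hypothetical prices, invented purely for illustration.
example_prices = [50.0, 55.0, 40.0, 75.0]
indexed = index_to_base(example_prices)
print(indexed)  # [100.0, 110.0, 80.0, 150.0]
```

After indexing, a value of 150 reads directly as a 50% increase over the reference date, which is what makes a single horizontal line at 100 a meaningful anchor for several companies at once.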
For such a minimal grid, we generally draw the lines orthogonal to the direction along which the numbers of interest vary. Therefore, if instead of plotting the stock price over time we plot the five-year increase, as horizontal bars, then we will want to use vertical lines instead (Figure 18.10).
Grid lines that run perpendicular to the key variable of interest tend to be the most useful.
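The figures in this chapter were not necessarily made this way, but the rule above can be sketched in Python with matplotlib. The helper name `add_minimal_grid`, the gray level, the line width, and the plotted numbers are all assumptions chosen for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt


def add_minimal_grid(ax, value_axis):
    """Draw light major grid lines perpendicular to the direction in which values vary.

    value_axis is "y" when the numbers of interest vary vertically (line plots)
    and "x" when they vary horizontally (horizontal bars).
    """
    if value_axis not in ("x", "y"):
        raise ValueError("value_axis must be 'x' or 'y'")
    # Grid lines attached to y ticks are horizontal; those attached to x ticks are vertical.
    ax.grid(True, which="major", axis=value_axis, color="0.85", linewidth=0.6)
    ax.set_axisbelow(True)  # keep the grid underneath the data
    # Remove the frame; the grid lines already mark the extent of the plot.
    for side in ("top", "right", "bottom", "left"):
        ax.spines[side].set_visible(False)
    ax.tick_params(length=0)  # drop tick marks but keep tick labels
    return ax


fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# Made-up numbers, for illustration only.
ax_line.plot([0, 1, 2, 3], [100, 120, 90, 150])
add_minimal_grid(ax_line, value_axis="y")  # horizontal lines for a time series

ax_bar.barh(["A", "B", "C"], [30, 55, 20])
add_minimal_grid(ax_bar, value_axis="x")  # vertical lines for horizontal bars

fig.canvas.draw()  # render once; use fig.savefig(...) instead to write a file
```

Passing the value axis explicitly keeps the decision visible in the plotting code, so switching from a line plot to horizontal bars is a reminder to switch the grid direction as well.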
For bar graphs such as Figure 18.10, Tufte (2001) recommends drawing white grid lines on top of the bars instead of dark grid lines underneath. In my opinion, this style is another example of trying too hard to remove non-data ink from a figure, with detrimental consequences (Figure 18.11). Most importantly, the white grid lines make it look like the bar is broken into separate pieces. In fact, I used this style purposefully in Figure 6.10 to visually separate stacked bars representing male and female passengers. Second, because the grid lines are not visible outside the bars, they are difficult to connect to axis ticks, and they obscure how close or distant the end of a bar is to the next grid line. Finally, because the grid lines are on top of the bars, I had to move the percentages outside the bars. This choice makes the bars appear longer than they are.
Background grids along both axis directions are most appropriate for scatter plots where there is no primary axis of interest. Figure 18.2 at the beginning of this chapter provides an example. When a figure has a full background grid, axis lines are generally not needed (Figure 18.2).
18.3 Paired data
For figures where the relevant comparison is the x = y line, such as in scatter plots of paired data, I prefer to draw a diagonal line rather than a grid. For example, consider Figure 18.12, which compares gene expression levels in a mutant virus to the non-mutated (wild-type) variant. By drawing the diagonal line, we can see immediately which genes are expressed higher or lower in the mutant relative to the wild type. The same observation is much harder to make when the figure has a background grid and no diagonal line (Figure 18.13). Thus, even though Figure 18.13 looks pleasing, I label it as bad. In particular, gene 10A, which clearly has a reduced expression level in the mutant relative to the wild type (Figure 18.12), does not visually stand out in Figure 18.13.
Of course we could take the diagonal line from Figure 18.12 and add it on top of the background grid of Figure 18.13, to ensure that the relevant visual reference is present. However, the resulting figure is getting quite busy (Figure 18.14). I had to make the diagonal line darker so it would stand out against the background grid, but now the data points almost seem to fade into the background. We could ameliorate this issue by making the data points larger or darker, but all things considered I’d rather choose Figure 18.12.
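The diagonal is an effective reference because a point’s position relative to the x = y line encodes the sign of the paired difference. The short sketch below makes that explicit. It assumes the reference condition is plotted on the x axis and the comparison condition on the y axis; the function name `side_of_diagonal` and the gene labels and values are hypothetical, invented for illustration.

```python
def side_of_diagonal(x_value, y_value):
    """Report where a paired observation falls relative to the x = y line."""
    if y_value > x_value:
        return "above"  # higher in the condition plotted on y
    if y_value < x_value:
        return "below"  # lower in the condition plotted on y
    return "on"  # no difference between the two conditions


# Hypothetical (x, y) pairs, invented purely for illustration.
pairs = {"gene_a": (10.0, 12.0), "gene_b": (8.0, 3.0), "gene_c": (5.0, 5.0)}
positions = {name: side_of_diagonal(x, y) for name, (x, y) in pairs.items()}
print(positions)  # {'gene_a': 'above', 'gene_b': 'below', 'gene_c': 'on'}
```

A rectangular grid offers no such shortcut: to make the same judgment, the reader has to read off both coordinates of a point and compare them mentally.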
Overloading a figure with non-data ink is bad, but excessively erasing non-data ink is not necessarily better. We need to find a healthy medium, where the data points are the main emphasis of the figure while sufficient context is provided about what data is shown, where the points lie relative to each other, and what they mean.
With respect to backgrounds and background grids, there is no one choice that is preferable in all contexts. I recommend being judicious about grid lines. Think carefully about which specific grid or guide lines are most informative for the plot you are making, and then only show those. I prefer minimal, light grids on a white background, since white is the default neutral color on paper and supports nearly any foreground color. However, a shaded background can help the plot appear as a single visual entity, and this may be particularly useful in small-multiples plots. Finally, we have to consider how all these choices relate to visual branding and identity. Many magazines and websites like to have an immediately recognizable in-house style, and a shaded background and specific choice of background grid can help create a unique visual identity.
Tufte, E. 2001. The Visual Display of Quantitative Information. 2nd ed. Cheshire, Connecticut: Graphics Press.
Telford, R. D., and R. B. Cunningham. 1991. “Sex, Sport, and Body-Size Dependency of Hematology in Highly Trained Athletes.” Medicine and Science in Sports and Exercise 23: 788–94.
Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. New York: Springer.
Paff, M. L., B. R. Jack, B. L. Smith, J. J. Bull, and C. O. Wilke. 2018. “Combinatorial Approaches to Viral Attenuation.” bioRxiv, 299180. http://dx.doi.org/10.1101/299180.