20 Balance the data-to-ink ratio
We can broadly subdivide the graphical elements in any visualization into elements that represent data and elements that do not. The former are elements such as the points in a scatter plot, the bars in a histogram or barplot, or the shaded areas in a heatmap. The latter are elements such as plot axes, axis ticks and labels, axis titles, legends, and plot annotations. When designing a plot, it can be helpful to think about the amount of ink (Chapter 14) used to represent data and non-data elements. A common recommendation is to reduce the amount of non-data ink, and following this advice can often yield less cluttered and more elegant visualizations. At the same time, non-data ink provides context and visual structure, and overly minimizing such plot elements can result in figures that are difficult to read, confusing, or simply not that compelling.
20.1 Finding the appropriate data-to-ink ratio
The idea that distinguishing between data and non-data ink may be useful was popularized by Edward Tufte in his book “The Visual Display of Quantitative Information” (Tufte 2001). Tufte introduces the concept of the “data–ink ratio”, which he defines as the “proportion of a graphic’s ink devoted to the non-redundant display of data-information.” He then writes:
Maximize the data–ink ratio, within reason.
(emphasis mine) I have emphasized the phrase “within reason” because it is critical and frequently forgotten. In fact, I think that Tufte himself forgets it in the remainder of his book, where he advocates overly minimalistic designs that, in my opinion, are neither elegant nor easy to decipher. If we interpret the phrase “maximize the data–ink ratio” to mean “remove clutter and strive for clean and elegant designs,” then I think it is reasonable advice. But if we interpret it as “do everything you can to remove non-data ink” then it will result in poor design choices. If we go too far in either direction along the scale of possible data-to-ink ratios we will end up with ugly figures. However, away from the extremes there is wide range of designs with different data-to-ink ratios that are all acceptable and may be appropriate in different contexts.
To explore the extremes, let’s consider a figure that clearly has too much non-data ink (Figure 20.1). In this figure, the colored points in the plot panel (the framed center area containing data points) are data ink. Everything else is non-data ink. The non-data ink includes a frame around the entire figure, a frame around the plot panel, and a frame around the legend. None of these frames are needed. We also see a prominent and dense background grid that draws attention away from the actual data points. By removing the frames and minor grid lines and by drawing the major grid lines in a light gray, we arrive at Figure 20.2. In this version of the figure, the actual data points stand out much more clearly, and they are perceived as the most important component of the figure.
At the other extreme of the data-to-ink-ratio scale, we might end up with a figure such as Figure 20.3, which is a minimalist version of Figure 20.2. In this figure, the axis tick labels and titles have been made so faint that they are hard to see. If we just glance at the figure we will not immediately perceive what data is actually shown. We only see points floating in space. Moreover, the legend annotations are so faint that the points in the legend could be mistaken for data points. This effect is amplified because there is no clear visual separation between the plot area and the legend. Notice how the background grid in Figure 20.2 both anchors the points in space and sets off the data area from the legend area. Both of these effects have been lost in Figure 20.3.
In Figure 20.2, I am using an open background grid and no aixs lines or frame around the plot panel. I like this design because it conveys to the reader that range of possible data values extends beyond the axis limits. For example, even though Figure 20.2 shows no athlete taller than 210 cm, such an athlete could conceivably exist. However, some authors prefer to delineate the extent of the plot panel, by drawing a frame around it (Figure 20.4). Both options are reasonable, and which is preferable is primarily a matter of personal opinion. One advantage of the framed version is that it clearly separates the legend from the plot panel.
Figures with too little non-data ink commonly suffer from the effect that figure elements appear to float in space, without clear connection or reference to anything. This problem tends to be particularly severe in small multiples plots. Figure 20.5 shows a small-multiples plot comparing six different bar plots, but it looks more like a piece of modern art than a useful data visualization. The bars are not anchored to a clear baseline and the individual plot facets are not clearly delineated. We can resolve these issues by adding a light gray background and thin horizontal grid lines to each facet (Figure 20.6).
20.2 Background grids
Gridlines in the background of a plot can help the reader discern specific data values and compare values in one part of a plot to values in another part. At the same time, gridlines can add visual noise, in particular when they are prominent or densely spaced. Reasonable people can disagree about whether to use a grid or not, and if so how to format it and how densely to space it. Throughout this book I am using a variety of different grid styles, to highlight that there isn’t necessarily one best choice.
The R software ggplot2 has popularized a style using a fairly prominent background grid of white lines on a gray background. Figure 20.7 shows an example in this style. The figure displays the change in stock price of four major tech companies over a five-year window, from 2012 to 2017. With apologies to the ggplot2 author Hadley Wickham, for whom I have the utmost respect, I don’t find the white-on-gray background grid particularly attractive. To my eye, the gray background can detract from the actual data, and a grid with major and minor lines can be too dense. I also find the gray squares in the legend confusing.
Arguments in favor of the gray background include that it (i) helps the plot to be perceived as a single visual entity and (ii) prevents the plot to appear as a white box in surrounding dark text (Wickham 2016). I completely agree with the first point, and it was the reason I used gray backgrounds in Figure 20.6. For the second point, I’d like to caution that the perceived darkness of text will depend on the font size, fontface, and line spacing, and the perceived darkness of a figure will depend on the absolute amount and color of ink used, including all data ink. A scientific paper typeset in dense, 10-point Times New Roman will look much darker than a coffee-table book typeset in 14-point Palatino with one-and-a-half line spacing. Likewise, a scatter plot of five points in yellow will look much lighter than a scatter plot of 10,000 points in black. If you want to use a gray figure background, consider the color intensity of your figure foreground, as well as the expected layout and typography of the text around your figures, and adjust the choice of your background gray accordingly. Otherwise, it could happen that your figures end up standing out as dark boxes among the surrounding lighter text. Also, keep in mind that the colors you use to plot your data need to work with the gray background. We tend to perceive colors differently against different backgrounds, and a gray background requires darker and more saturated foreground colors than a white background.
We can go all the way in the opposite direction and remove both the background and the grid lines (Figure 20.8). In this case, we need visible axis lines to frame the plot and keep it as a single visual unit. For this particular figure, I think this choice is a worse option, and I have labeled it as “bad”. In the absence of any background grid whatsoever, the curves seem to float in space, and it’s difficult to reference the final values on the right to the axis ticks on the left.
At the absolute minimum, we need to add one horizontal reference line. Since the stock prices in Figure 20.8 indexed to 100 in June 2012, marking this value with a thin horizontal line at y = 100 helps a lot (Figure 20.9). Alternatively, we can use a minimal “grid” of horizontal lines. For a plot where we are primarily interested in the change in y values, vertical grid lines are not needed. Moreover, grid lines positioned at only the major axis ticks will often be sufficient. And, the axis line can be omitted or made very thin, since the horzontal lines clearly mark the extent of the plot (Figure 20.10).
For such a minimal grid, we generally draw the lines orthogonally to direction along which the numbers of interest vary. Therefore, if instead of plotting the stock price over time we plot the five-year increase, as horizontal bars, then we will want to use vertical lines instead (Figure 20.11).
Grid lines that run perpendicular to the key variable of interest tend to be the most useful.
For bar graphs such as Figure 20.11, Tufte (2001) recommends to draw white grid lines on top of the bars instead of dark grid lines underneath. In my opinion, this style is another example of overly trying to remove non-data-ink from a figure, with detrimental consequences (Figure 20.12). Most importantly, the white grid lines make it look like the bar is broken into separate pieces. In fact, I used this style purposefully in Figure 6.10 to visually separate stacked bars representing male and female passengers. Second, because the grid lines are not visible outside the bars, they are difficult to connect to axis ticks, and they obscure how close or distant the end of a bar is to the next grid line. Finally, because the grid lines are on top of the bars, I had to move the percentages outside the bars. This choice inappropriately visually elongates the bars.
Background grids along both axis directions are most appropriate for scatter plots where there is no primary axis of interest. Figure 20.2 at the beginning of this chapter provides an example. When a figure has a full background grid, axis lines are generally not needed (Figure 20.2).
20.3 Paired data
For figures where the relevant comparison is the x = y line, such as in scatter plots of paired data, I prefer to draw a diagonal line rather than a grid. For example, consider Figure 20.13, which compares gene expression levels in a mutant virus to the non-mutated (wild-type) variant. By drawing the diagonal line, we can see immediately which genes are expressed higher or lower in the mutant relative to the wild type. The same observation is much harder to make when the figure has a background grid and no diagonal line (Figure 20.14). Thus, even though Figure 20.14 looks pleasing, I label it as bad. In particular, gene 10A, which clearly has a reduced expression level in the mutant relative to the wild-type (Figure 20.13), does not visually stand out in Figure 20.14.
Of course we could take the diagonal line from Figure 20.13 and add it on top of the background grid of Figure 20.14, to ensure that the relevant visual reference is present. However, the resulting figure is getting quite busy (Figure 20.15). I had to make the diagonal line darker so it would stand out against the background grid, but now the data points almost seem to fade into the background. We could ameliorate this issue by making the data points larger or darker, but all considered I’d rather choose Figure 20.13.
Both overloading a figure with non-data ink and excessively erasing non-data ink can result in poor figure design. We need to find a healthy medium, where the data points are the main emphasis of the figure while sufficient context is provided about what data is shown, where the points lie relative to each other, and what they mean.
With respect to backgrounds and background grids, there is no one choice that is preferable in all contexts. I recommend to be judicious about grid lines. Think carefully about which specific grid or guide lines are most informative for the plot you are making, and then only show those. I prefer minimal, light grids on a white background, since white is the default neutral color on paper and supports nearly any foreground color. However, a shaded background can help the plot appear as a single visual entity, and this may be particularly useful in small multiples plots. Finally, we have to consider how all these choices relate to visual branding and identity. Many magazines and websites like to have an immediately recognizable in-house style, and a shaded background and specific choice of background grid can help create a unique visual identity.
Tufte, E. 2001. The Visual Display of Quantitative Information. 2nd ed. Cheshire, Connecticut: Graphics Press.
Telford, R. D., and R. B. Cunningham. 1991. “Sex, Sport, and Body-Size Dependency of Hematology in Highly Trained Athletes.” Medicine and Science in Sports and Exercise 23: 788–94.
Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. New York: Springer.
Paff, M. L., B. R. Jack, B. L. Smith, J. J. Bull, and C. O. Wilke. 2018. “Combinatorial Approaches to Viral Attenuation.” bioRxiv, 29918. doi:10.1101/299180.