18 Optimize the data–ink ratio

We can broadly subdivide the graphical components in any visualization into components that represent data and components that do not. The former are elements such as the points in a scatter plot, the bars in a histogram or barplot, and the shaded areas in a heatmap. The latter are elements such as plot axes, axis ticks and labels, axis titles, legends, and plot annotations. As a general rule, most of the ink (Chapter 13) in a plot should be devoted to displaying data. Remove unnecessary frames, lines, or other adornments. For non-data elements that can’t be removed (such as a legend), make sure they aren’t overly prominent and that they stand back relative to the data.

18.1 Finding the appropriate data–ink ratio

The idea that one should remove non-data visual elements from graphs was popularized by Edward Tufte in his book “The Visual Display of Quantitative Information” (Tufte 2001). Tufte introduces the concept of the “data–ink ratio”, which he defines as the “proportion of a graphic’s ink devoted to the non-redundant display of data-information.” He then writes:

Maximize the data–ink ratio, within reason.

I have emphasized the phrase “within reason” because it is critical and frequently forgotten. Removing non-data ink is valuable up to a point. Beyond that point, making a graph more minimal will make it less compelling.

First, let’s consider a figure that clearly has too much non-data ink (Figure 18.1). In this figure, the colored points in the plot panel (the framed center area containing data points) are data ink. Everything else is non-data ink. The non-data ink includes a frame around the entire figure, a frame around the plot panel, and a frame around the legend. None of these frames are needed. We also see a prominent and dense background grid that draws attention away from the actual data points. By removing the frames and minor grid lines and by drawing the major grid lines in a light gray, we arrive at Figure 18.2. In this version of the figure, the actual data points stand out much more clearly, and they are perceived as the most important component of the figure.

Figure 18.1: Percent body fat versus height in professional male Australian athletes. Each point represents one athlete. This figure devotes way too much ink to non-data. There are unnecessary frames around the entire figure, around the plot panel, and around the legend. The coordinate grid is very prominent, and its presence draws attention away from the data points. Data source: Telford and Cunningham (1991)

Figure 18.2: Percent body fat versus height in professional male Australian athletes. This figure is a cleaned-up version of Figure 18.1. Unnecessary frames have been removed, minor grid lines have been removed, and major grid lines have been drawn in light gray to stand back relative to the data points. Data source: Telford and Cunningham (1991)
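The cleanup that leads from Figure 18.1 to Figure 18.2 can be sketched in code. The following is a minimal sketch, assuming Python with matplotlib, which is not necessarily the tool used to draw the figures in this chapter. The data values, colors, and line widths are made up for illustration and are not the Telford and Cunningham measurements.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical stand-in values, NOT the Telford and Cunningham data.
height_cm = [158, 165, 171, 176, 182, 188, 193, 199]
body_fat_pct = [14.2, 9.8, 11.5, 7.9, 8.4, 6.7, 9.1, 7.2]

fig, ax = plt.subplots(figsize=(5, 3.5))
ax.scatter(height_cm, body_fat_pct, color="#0072B2", zorder=3)

# Remove the frame around the plot panel.
for spine in ax.spines.values():
    spine.set_visible(False)

# Keep only the major grid lines, drawn thin and in light gray,
# and place them behind the data points.
ax.minorticks_off()
ax.grid(True, which="major", color="0.85", linewidth=0.6)
ax.set_axisbelow(True)

# Hide the tick marks, but keep tick labels and axis titles dark
# enough to read (compare Figure 18.3, where they are too faint).
ax.tick_params(length=0, labelcolor="0.2")
ax.set_xlabel("height (cm)", color="0.2")
ax.set_ylabel("% body fat", color="0.2")

fig.canvas.draw()  # force a render; use fig.savefig(...) to write a file
```

The gray levels used here are a judgment call. The point is only that the grid stays lighter than the data while the labels stay clearly legible.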

However, we can take the removal of non-data ink too far. Figure 18.3 is a minimalist version of Figure 18.2, and a clear regression. Most importantly, the axis tick labels and titles have been made so faint that they are hard to see. If we just glance at the figure we will not immediately perceive what data is actually shown. We only see points floating in space. Second, the legend annotations are so faint that the points in the legend could be mistaken for data points. This effect is amplified because there is no clear visual separation between the plot area and the legend. Notice how the background grid in Figure 18.2 both anchors the points in space and sets off the data area from the legend area. Both of these effects have been lost in Figure 18.3.

Figure 18.3: Percent body fat versus height in professional male Australian athletes. In this example, the concept of maximizing the data–ink ratio has been taken too far. The axis tick labels and title are too faint and are barely visible. The data points seem to float in space. The points in the legend are not sufficiently set off from the data points, and the casual observer might think they are part of the data. Data source: Telford and Cunningham (1991)

Figures with too little non-data ink commonly suffer from figure elements that appear to float in space, without clear connection or reference to anything. This problem tends to be particularly severe in small-multiples plots. Figure 18.4 shows a small-multiples plot comparing six different bar plots, but it looks more like a piece of modern art than a useful data visualization. The bars are not anchored to a clear baseline and the individual plot facets are not clearly delineated. We can resolve these issues by adding a light gray background and thin horizontal grid lines to each facet (Figure 18.5).

Figure 18.4: Survival of passengers on the Titanic, broken down by gender and class. This small-multiples plot is too minimalistic. The individual facets are not framed, so it’s difficult to see which part of the figure belongs to which facet. Further, the individual bars are not anchored to a clear baseline, and they seem to float.

Figure 18.5: Survival of passengers on the Titanic, broken down by gender and class. This is an improved version of Figure 18.4. The gray background in each facet clearly delineates the six groupings (survived or died in first, second, or third class) that make up this plot. Thin horizontal lines in the background provide a reference for the bar heights and facilitate comparison of bar heights among facets.
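The facet treatment of Figure 18.5 can be approximated as follows. This is a rough sketch, assuming Python with matplotlib. The counts are invented placeholders rather than the actual Titanic numbers, and drawing the grid lines in white on the gray panel is one possible choice, not necessarily the one used in the figure.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical counts for illustration only, NOT the real Titanic numbers.
counts = {
    ("died", "1st"): {"female": 5, "male": 110},
    ("died", "2nd"): {"female": 12, "male": 140},
    ("died", "3rd"): {"female": 130, "male": 420},
    ("survived", "1st"): {"female": 135, "male": 60},
    ("survived", "2nd"): {"female": 90, "male": 25},
    ("survived", "3rd"): {"female": 80, "male": 75},
}
outcomes = ["died", "survived"]
classes = ["1st", "2nd", "3rd"]

# Shared axes make bar heights comparable across all six facets.
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(6, 4))
for i, outcome in enumerate(outcomes):
    for j, cls in enumerate(classes):
        ax = axes[i, j]
        data = counts[(outcome, cls)]
        # A light gray panel delineates each facet.
        ax.set_facecolor("0.92")
        # Thin horizontal lines give the bars a visible reference.
        ax.yaxis.grid(True, color="white", linewidth=0.8)
        ax.set_axisbelow(True)
        ax.bar(list(data.keys()), list(data.values()), color="#0072B2")
        for spine in ax.spines.values():
            spine.set_visible(False)
        ax.tick_params(length=0)
        ax.set_title(f"{outcome}, {cls} class", fontsize=9)

fig.canvas.draw()  # force a render; use fig.savefig(...) to write a file
```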

Since removing too much non-data ink can be just as bad as adding too much of it, I want to propose an alternative to Tufte’s maxim. We need to optimize, rather than maximize, the data–ink ratio. Make sure you don’t overload your plot with non-data ink, such that the data remains in the foreground, but don’t take it to the point where the data loses context.

Optimize the data–ink ratio.

18.2 Background grids

Not all forms of non-data ink are equally superfluous. In particular, by Tufte’s definition, background grids, axis lines, axis ticks, and labels do not count towards the ink used to represent data, and thus they decrease the data–ink ratio. However, they do carry critical information. We can’t just eliminate them all and hope to still have a meaningful visualization. In fact, one of the first rules of data visualization is to label your axes. And if labels are required, then making them sufficiently dark and prominent that they are actually visible and decipherable is also a requirement.

Whether a background grid is required is less clear-cut, and reasonable people can disagree about what the best options are for a given visualization. The R package ggplot2 has popularized a style using a fairly prominent background grid of white lines on a gray background. Figure 18.6 shows an example in this style. The figure displays the change in stock price of four major tech companies over a five-year window, from 2012 to 2017. With apologies to the ggplot2 author Hadley Wickham, for whom I have the utmost respect, I don’t find the white-on-gray background grid particularly attractive. To my eye, the gray background can detract from the actual data, and a grid with major and minor lines can be too dense. I also find the gray squares in the legend confusing.

Figure 18.6: Stock price over time for four major tech companies. The stock price for each company has been normalized to equal 100 in June 2012. This figure mimics the ggplot2 default look, with white major and minor grid lines on a gray background. In this particular example, I think the grid lines overpower the data lines, and the result is a figure that is not well balanced and that doesn’t place sufficient emphasis on the data. Data source: Yahoo Finance

Arguments in favor of the gray background include that it (i) helps the plot to be perceived as a single visual entity and (ii) prevents the plot from appearing as a white box amid surrounding dark text (Wickham 2016). I completely agree with the first point, and it was the reason I used gray backgrounds in Figure 18.5. For the second point, I’d like to caution that the perceived darkness of text will depend on the font size, font face, and line spacing, and the perceived darkness of a figure will depend on the absolute amount and color of ink used, including all data ink. A scientific paper typeset in dense, 10-point Times New Roman will look much darker than a coffee-table book typeset in 14-point Palatino with one-and-a-half line spacing. Likewise, a scatter plot of five points in yellow will look much lighter than a scatter plot of 10,000 points in black. If you want to use a gray figure background, consider the color intensity of your figure foreground, as well as the expected layout and typography of the text around your figures, and adjust the choice of your background gray accordingly. Otherwise, it could happen that your figures end up standing out as dark boxes among the surrounding lighter text. Also, keep in mind that the colors you use to plot your data need to work with the gray background. We tend to perceive colors differently against different backgrounds, and a gray background requires darker and more saturated foreground colors than a white background.

We can go all the way in the opposite direction and remove both the background and the grid lines (Figure 18.7). In this case, we need visible axis lines to frame the plot and keep it as a single visual unit. For this particular figure, I think this choice is a worse option, and I have labeled it as “bad”. In the absence of any background grid whatsoever, the curves seem to float in space, and it’s difficult to reference the final values on the right to the axis ticks on the left.

Figure 18.7: Indexed stock price over time for four major tech companies. In this variant of Figure 18.6, the data lines are not sufficiently anchored. This makes it difficult to ascertain to what extent they have deviated from the index value of 100 at the end of the covered time interval. Data source: Yahoo Finance

At the absolute minimum, we need to add one horizontal reference line. Since the stock prices in Figure 18.7 are indexed to 100 in June 2012, marking this value with a thin horizontal line at y = 100 helps a lot (Figure 18.8). Alternatively, we can use a minimal “grid” of horizontal lines. For a plot where we are primarily interested in the change in y values, vertical grid lines are not needed. Moreover, grid lines positioned at only the major axis ticks will often be sufficient. Finally, the axis line can be omitted or made very thin, since the horizontal lines clearly mark the extent of the plot (Figure 18.9).

Figure 18.8: Indexed stock price over time for four major tech companies. Adding a thin horizontal line at the index value of 100 to Figure 18.7 helps provide an important reference throughout the entire time period the plot spans. Data source: Yahoo Finance

Figure 18.9: Indexed stock price over time for four major tech companies. Adding thin horizontal lines at all major y axis ticks provides a better set of reference points than just the one horizontal line of Figure 18.8. This design also removes the need for prominent x and y axis lines, since the evenly spaced horizontal lines create a visual frame for the plot panel. Data source: Yahoo Finance
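Both the indexing to 100 and the horizontal-only grid are easy to express in code. The sketch below assumes Python with matplotlib. The helper `index_to_100`, the company names, and the prices are hypothetical and are not taken from Yahoo Finance. It combines the single reference line of Figure 18.8 with the minimal grid of Figure 18.9 so that both techniques appear in one place.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt


def index_to_100(prices):
    """Rescale a price series so that its first value equals 100."""
    base = prices[0]
    return [100 * p / base for p in prices]


# Hypothetical yearly prices, NOT Yahoo Finance data.
years = [2012, 2013, 2014, 2015, 2016, 2017]
raw = {
    "Company A": [20.0, 25.0, 31.0, 40.0, 52.0, 70.0],
    "Company B": [80.0, 76.0, 92.0, 110.0, 104.0, 128.0],
}
indexed = {name: index_to_100(p) for name, p in raw.items()}

fig, ax = plt.subplots(figsize=(5, 3.5))
for name, series in indexed.items():
    ax.plot(years, series, label=name, zorder=3)

# Horizontal grid lines only, at the major y ticks; no vertical lines.
ax.yaxis.grid(True, which="major", color="0.85", linewidth=0.6)
ax.xaxis.grid(False)
ax.set_axisbelow(True)

# Mark the index value of 100 with a slightly darker reference line.
ax.axhline(100, color="0.4", linewidth=0.8, zorder=2)

# The horizontal lines already frame the panel, so drop the axis lines.
for spine in ax.spines.values():
    spine.set_visible(False)
ax.tick_params(length=0)
ax.set_ylabel("stock price, indexed")
ax.legend(frameon=False)

fig.canvas.draw()  # force a render; use fig.savefig(...) to write a file
```

Because every series starts at exactly 100, the reference line passes through the common starting point of all curves, which is what makes later deviations easy to read.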

For such a minimal grid, we generally draw the lines orthogonal to the direction along which the numbers of interest vary. Therefore, if instead of plotting the stock price over time we plot the five-year increase as horizontal bars, then we will want to use vertical lines instead (Figure 18.10).

Figure 18.10: Percent increase in stock price from June 2012 to June 2017, for four major tech companies. Because the bars run horizontally, vertical grid lines are appropriate here. Data source: Yahoo Finance

Grid lines that run perpendicular to the key variable of interest tend to be the most useful.
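This rule of thumb can be captured in a small helper. The function `add_minimal_grid` below is a hypothetical convenience written for this illustration, assuming Python with matplotlib. It is not part of any library, and the percent increases are invented rather than read off Figure 18.10.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt


def add_minimal_grid(ax, value_axis):
    """Draw light grid lines perpendicular to the direction in which values vary.

    value_axis is "x" when the values run horizontally (horizontal bars)
    and "y" when they run vertically (for example, a time series).
    """
    if value_axis not in ("x", "y"):
        raise ValueError("value_axis must be 'x' or 'y'")
    if value_axis == "x":
        value, other = ax.xaxis, ax.yaxis
    else:
        value, other = ax.yaxis, ax.xaxis
    # Grid lines at the ticks of the value axis cross it at right angles.
    value.grid(True, which="major", color="0.85", linewidth=0.6)
    other.grid(False)
    ax.set_axisbelow(True)
    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.tick_params(length=0)


# Hypothetical five-year percent increases, NOT the values in Figure 18.10.
companies = ["Company A", "Company B", "Company C", "Company D"]
pct_increase = [250, 60, 120, 35]

fig, ax = plt.subplots(figsize=(5, 2.5))
ax.barh(companies, pct_increase, color="#0072B2", zorder=3)
add_minimal_grid(ax, value_axis="x")  # bars run horizontally: vertical lines
ax.set_xlabel("stock price increase (%)")

fig.canvas.draw()  # force a render; use fig.savefig(...) to write a file
```

For a line plot of stock price over time, the same helper would be called with `value_axis="y"`, which yields horizontal lines instead.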

For bar graphs such as Figure 18.10, Tufte (2001) recommends drawing white grid lines on top of the bars instead of dark grid lines underneath. In my opinion, this style is another example of trying too hard to remove non-data ink from a figure, with detrimental consequences (Figure 18.11). Most importantly, the white grid lines make it look like the bar is broken into separate pieces. In fact, I used this style purposefully in Figure 6.10 to visually separate stacked bars representing male and female passengers. Second, because the grid lines are not visible outside the bars, they are difficult to connect to axis ticks, and they obscure how close or distant the end of a bar is to the next grid line. Finally, because the grid lines are on top of the bars, I had to move the percentages outside the bars. This choice visually elongates the bars, which is inappropriate.

Figure 18.11: Percent increase in stock price from June 2012 to June 2017, for four major tech companies. White grid lines on top of bars are a suboptimal choice. They make it look like the bars are falling apart, and, because they are not visible against the white background, they also obscure how close any one bar is to the next higher grid line. Data source: Yahoo Finance

Background grids along both axis directions are most appropriate for scatter plots where there is no primary axis of interest. Figure 18.2 at the beginning of this chapter provides an example. When a figure has a full background grid, axis lines are generally not needed (Figure 18.2).

18.3 Paired data

For figures where the relevant comparison is the x = y line, such as in scatter plots of paired data, I prefer to draw a diagonal line rather than a grid. For example, consider Figure 18.12, which compares gene expression levels in a mutant virus to the non-mutated (wild-type) variant. By drawing the diagonal line, we can see immediately which genes are expressed higher or lower in the mutant relative to the wild type. The same observation is much harder to make when the figure has a background grid and no diagonal line (Figure 18.13). Thus, even though Figure 18.13 looks pleasing, I label it as bad. In particular, gene 10A, which clearly has a reduced expression level in the mutant relative to the wild-type (Figure 18.12), does not visually stand out in Figure 18.13.

Figure 18.12: Gene expression levels in a mutant bacteriophage T7 relative to wild-type. Gene expression levels are measured by mRNA abundances, in transcripts per million (TPM). Each dot corresponds to one gene. In the mutant bacteriophage T7, the promoter in front of gene 9 was deleted, and this resulted in reduced mRNA abundances of gene 9 as well as the neighboring genes 8 and 10A (highlighted). Data source: Paff et al. (2018)

Figure 18.13: Gene expression levels in a mutant bacteriophage T7 relative to wild-type. By plotting this dataset against a background grid, instead of a diagonal line, we are obscuring which genes are higher or lower in the mutant than in the wild-type bacteriophage. Data source: Paff et al. (2018)
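A diagonal reference line for paired data takes only a few lines of code. The following is a minimal sketch, assuming Python with matplotlib. The expression values are invented and are not the Paff et al. measurements, and the use of logarithmic axes is an assumption made here because expression levels typically span several orders of magnitude.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical expression levels in TPM, NOT the Paff et al. measurements.
wild_type = [120, 900, 4500, 15000, 60000, 200000]
mutant = [150, 850, 5200, 3000, 58000, 210000]

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(wild_type, mutant, color="#0072B2", zorder=3)
ax.set_xscale("log")
ax.set_yscale("log")

# Use identical limits on both axes so that the x = y line runs
# from corner to corner of the square panel.
lo = min(wild_type + mutant) / 2
hi = max(wild_type + mutant) * 2
ax.set_xlim(lo, hi)
ax.set_ylim(lo, hi)

# The diagonal replaces the background grid as the visual reference.
ax.plot([lo, hi], [lo, hi], color="0.6", linewidth=0.8, zorder=2)
ax.grid(False)
ax.set_xlabel("wild-type expression (TPM)")
ax.set_ylabel("mutant expression (TPM)")

# Points below the diagonal are lower in the mutant than in the wild type.
below = [(x, y) for x, y in zip(wild_type, mutant) if y < x]

fig.canvas.draw()  # force a render; use fig.savefig(...) to write a file
```

In the rendered plot, the point at (15000, 3000) sits visibly far below the diagonal, which is the kind of observation the diagonal line is meant to make immediate.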

Of course, we could take the diagonal line from Figure 18.12 and add it on top of the background grid of Figure 18.13, to ensure that the relevant visual reference is present. However, the resulting figure is getting quite busy (Figure 18.14). I had to make the diagonal line darker so it would stand out against the background grid, but now the data points almost seem to fade into the background. We could ameliorate this issue by making the data points larger or darker, but all things considered I’d rather choose Figure 18.12.

Figure 18.14: Gene expression levels in a mutant bacteriophage T7 relative to wild-type. This figure combines the background grid from Figure 18.13 with the diagonal line from Figure 18.12. In my opinion, this figure is visually too busy compared to Figure 18.12, and I would prefer Figure 18.12. Data source: Paff et al. (2018)

18.4 Summary

Overloading a figure with non-data ink is bad, but excessively erasing non-data ink is not necessarily better. We need to find a healthy medium, where the data points are the main emphasis of the figure while sufficient context is provided about what data is shown, where the points lie relative to each other, and what they mean.

With respect to backgrounds and background grids, there is no one choice that is preferable in all contexts. I recommend being judicious about grid lines. Think carefully about which specific grid or guide lines are most informative for the plot you are making, and then only show those. I prefer minimal, light grids on a white background, since white is the default neutral color on paper and supports nearly any foreground color. However, a shaded background can help the plot appear as a single visual entity, and this may be particularly useful in small-multiples plots. Finally, we have to consider how all these choices relate to visual branding and identity. Many magazines and websites like to have an immediately recognizable in-house style, and a shaded background and specific choice of background grid can help create a unique visual identity.

References

Tufte, E. 2001. The Visual Display of Quantitative Information. 2nd ed. Cheshire, Connecticut: Graphics Press.

Telford, R. D., and R. B. Cunningham. 1991. “Sex, Sport, and Body-Size Dependency of Hematology in Highly Trained Athletes.” Medicine and Science in Sports and Exercise 23: 788–94.

Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. New York: Springer.

Paff, M. L., B. R. Jack, B. L. Smith, J. J. Bull, and C. O. Wilke. 2018. “Combinatorial Approaches to Viral Attenuation.” bioRxiv, 299180. http://dx.doi.org/10.1101/299180.