Effective Data Visualization Part 2: Formatting numbers

Marton Trencseni - Sun 23 August 2020 - Data

Introduction

Format numbers for human consumption. What is more readable, 1.539e+5 or 153,859? Showing numbers effectively on spreadsheets, charts, dashboards and reports is a basic ingredient for readability, like formatting code.

For this article, I will use the d3 Javascript visualization library. I use Superset for charts and dashboards on a daily basis, and Superset uses d3.

Where should numbers be formatted? By the database? No. Numbers and dates should be formatted at the last possible moment, before human consumption, e.g. by the Javascript in the browser. Why? Because formatting (i) may lose information, for example the time part of a datetime is dropped or numbers are rounded, and (ii) loses the type of the data (date, datetime, int, float), as everything is converted to a string. This can lead to incorrect ordering, because '9' > '10' as strings even though 9 < 10 as numbers.
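To make the ordering problem concrete, here is a minimal sketch in plain Javascript (no d3 needed). The `counts` array is a made-up example, and `toLocaleString('en-US')` stands in for whatever formatting a database or backend might apply too early.

```javascript
// Hypothetical values, as they might come back from a query.
const counts = [9, 10, 153859];

// Formatting early turns every value into a string...
const formattedEarly = counts.map(n => n.toLocaleString('en-US'));
// ...and strings sort character by character, so '9' ends up last.
const sortedStrings = [...formattedEarly].sort();

// Sorting the raw numbers first and formatting last keeps the order right.
const sortedThenFormatted = [...counts]
  .sort((a, b) => a - b)
  .map(n => n.toLocaleString('en-US'));

console.log(sortedStrings);       // [ '10', '153,859', '9' ]
console.log(sortedThenFormatted); // [ '9', '10', '153,859' ]
```

The same applies to dates: keep them as dates until the chart or table is rendered.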

On my Macbook, to play around with d3, I use node.js:

$ npm install d3 && npm install -g d3
$ node
> var d3 = require("d3");
> d3.format(',')(153859)
'153,859'
> d3.timeFormat('%Y-%m-%d')(new Date())
'2020-08-23'

Localization

Number formatting is locale-specific. For example, 153,859.12 is written as 153.859,12 in Turkey. There is no hard rule here. Personally, I try to stick with US standards. For more, see:
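To see the locale difference without installing anything, here is a minimal sketch using Javascript's built-in `Intl.NumberFormat` instead of d3 (a swapped-in technique, not what the rest of this article uses). It assumes the Node build ships locale data for `tr-TR`, which recent default builds do.

```javascript
const value = 153859.12;

// Always show exactly two decimals, so both outputs are comparable.
const options = { minimumFractionDigits: 2, maximumFractionDigits: 2 };

const us = new Intl.NumberFormat('en-US', options).format(value);
const tr = new Intl.NumberFormat('tr-TR', options).format(value);

console.log(us); // 153,859.12
console.log(tr); // should match the Turkish form above if tr-TR locale data is available
```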

Dates

Format dates per the ISO 8601 standard, like YYYY-MM-DD. In d3, the format string for this is %Y-%m-%d.

What if we're showing monthly data, like monthly revenue? What to show for the day part? I standardize on always using YYYY-MM-DD, so I show -01 for the day part. An alternative is to use textual three letter months, like:

> d3.timeFormat('%Y-%b')(new Date())
'2020-Aug'
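The "always show -01" convention for monthly data can be sketched as a small helper. `monthLabel` is a hypothetical name, and reading the UTC fields is my assumption, so that the label does not depend on the machine's timezone.

```javascript
// Label a date by the first day of its month, as YYYY-MM-01.
function monthLabel(date) {
  const year = date.getUTCFullYear();
  // getUTCMonth() is zero-based, so add 1 and pad to two digits.
  const month = String(date.getUTCMonth() + 1).padStart(2, '0');
  return `${year}-${month}-01`;
}

console.log(monthLabel(new Date(Date.UTC(2020, 7, 23)))); // 2020-08-01
```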

What about timezones? It's best to standardize on one timezone, preferably UTC, and show all dates in UTC time, without specifying the timezone. But this can still lead to weird artefacts. For example, for countries that are ahead of UTC, the Monday morning usage peak will show up on Sunday in UTC time. So the weekly seasonality gets shifted around, per country/timezone. However, this is still preferable to showing local times on dashboards, as that is too error-prone. Imagine debugging a drop in traffic, and having to shift the times for each country's local timezone in your head.
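The Monday-turns-into-Sunday effect is easy to reproduce. A minimal sketch in plain Javascript, where the event time and the UTC+12 offset are made-up example values:

```javascript
// Hypothetical event: Monday 2020-08-24, 07:00 local time in a UTC+12 timezone.
const event = new Date('2020-08-24T07:00:00+12:00');

const weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
                  'Thursday', 'Friday', 'Saturday'];

// Read the UTC fields only, never the local ones.
const utcDate = event.toISOString().slice(0, 10);
const utcWeekday = weekdays[event.getUTCDay()];

console.log(utcDate, utcWeekday); // 2020-08-23 Sunday
```

The local Monday morning peak lands on Sunday evening in UTC, which is exactly the shift in weekly seasonality described above.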

Currencies

Monetary values should be converted to a common currency, like USD or AED. Showing different currencies on the same chart is an error, whether it's a line chart or a stacked area chart. Everybody knows what $1,000 means, but many other currencies don't have such a one-character symbol. I always use the ISO 4217 currency code, after the number, with a space, like 1,000 USD, which is 3,672.94 AED right now. The ISO standard does not specify the ordering or spacing.
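A minimal sketch of converting to a common currency and then formatting with the ISO 4217 code after the number. The names `RATES_PER_USD`, `toUSD` and `formatMoney` are hypothetical, and the AED rate is simply back-computed from the example above, not a live rate.

```javascript
// Hypothetical rate table: units of each currency per 1 USD.
// The AED rate is derived from the article's example (1,000 USD = 3,672.94 AED).
const RATES_PER_USD = { USD: 1, AED: 3.67294 };

// Convert an amount in the given currency to USD.
function toUSD(amount, currency) {
  const rate = RATES_PER_USD[currency];
  if (rate === undefined) throw new Error(`No rate for ${currency}`);
  return amount / rate;
}

// Number with thousands separators, a space, then the currency code.
function formatMoney(amount, currency, decimals = 2) {
  const digits = amount.toLocaleString('en-US', {
    minimumFractionDigits: decimals,
    maximumFractionDigits: decimals,
  });
  return `${digits} ${currency}`;
}

console.log(formatMoney(3672.94, 'AED'));                  // 3,672.94 AED
console.log(formatMoney(toUSD(3672.94, 'AED'), 'USD', 0)); // 1,000 USD
```

Note that the conversion happens on raw numbers, and the formatting is the very last step.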

Numbers

Don't round numbers. If your users don't care about the details, let them ignore the trailing digits. Rounded numbers can be confusing, because it seems like we don't know the exact count.

// don't do this, it seems like we don't know the exact number
> d3.format('.2r')(12345)
'12000'

Don't use Ms for millions and ks for thousands. Same argument as for rounding: the user's eyes and brain have to do extra processing to tell what's going on, because the formatting changes in the middle.

// don't do this, it looks weird
> for (let i = 800*1000; i <= 1200*1000; i += 100*1000) { console.log(d3.format('.2s')(i)) }
800k
900k
1.0M
1.1M
1.2M

Just show the number, aligned:

// do this: the 9 in the format string pads it on the left with spaces so it's 9 characters wide
> for (let i = 800*1000; i <= 1200*1000; i += 100*1000) { console.log(d3.format('9,')(i)) }
  800,000
  900,000
1,000,000
1,100,000
1,200,000
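If the widest number isn't known in advance, the padding width can be computed from the data instead of hard-coding 9. A minimal sketch in plain Javascript, using `toLocaleString` and `padStart` in place of d3 (my substitution, so it runs without dependencies):

```javascript
const values = [800000, 900000, 1000000, 1100000, 1200000];

// Format first, then pad every label to the width of the longest one.
const labels = values.map(v => v.toLocaleString('en-US'));
const width = Math.max(...labels.map(s => s.length));
const aligned = labels.map(s => s.padStart(width));

aligned.forEach(s => console.log(s));
```

This prints the same right-aligned column as the d3 example above.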

Use separators for thousands, like 1,234,456. In d3, this is accomplished with the , format string:

> d3.format(',')(1234456)
'1,234,456'

Use the appropriate number of decimals. For integer numbers, like number of users, don't show decimals.

For numbers that have a fractional part, but it's insignificant, don't show it [in Data Science work]. If you're showing the monthly revenues, and it's in the millions of dollars, don't show the cents.

> d3.format(',.0f')(123456.123)
'123,456'

For ratios and probabilities that are usually shown as percentages, show a percentage between 0 and 100, and not a ratio between 0 and 1. For example, funnel conversions are usually shown as percentages. Show the appropriate number of digits after the decimal, which depends on the type of measurement. I usually show 0 or 1 digits after the decimal. Sometimes it's good to show at least one digit just to signify that the numbers are not estimates (which helps if the numbers happen to be close to round, like A had 40.2% conversion and B had 50.4%, which would otherwise show up as 40% and 50%).

> d3.format('.1%')(0.402)
'40.2%'

For p values in statistical significance testing, I like to show 3 decimals, like 0.001 (d3 format string is .3f). This is a bit weird if the value is 0.00001, because it will show as 0.000, but that's fine.

> d3.format('.3f')(0.001)
'0.001'
> d3.format('.3f')(0.00001)
'0.000'
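The rules in this section can be collected into a few small helpers. This is a minimal sketch using the built-in `Intl.NumberFormat` and `toFixed` rather than d3, so that it runs without dependencies; the helper names are hypothetical, and each one mirrors one of the d3 format strings used above.

```javascript
// Counts and large amounts: thousands separators, no decimals (like ',.0f').
function formatCount(n) {
  return new Intl.NumberFormat('en-US', { maximumFractionDigits: 0 }).format(n);
}

// Ratios between 0 and 1, shown as percentages (like '.1%').
function formatPercent(ratio, digits = 1) {
  return new Intl.NumberFormat('en-US', {
    style: 'percent',
    minimumFractionDigits: digits,
    maximumFractionDigits: digits,
  }).format(ratio);
}

// p values: always three decimals (like '.3f').
function formatPValue(p) {
  return p.toFixed(3);
}

console.log(formatCount(123456.123)); // 123,456
console.log(formatPercent(0.402));    // 40.2%
console.log(formatPValue(0.001));     // 0.001
console.log(formatPValue(0.00001));   // 0.000
```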

Conclusion

Remember: format numbers for human consumption. Make it easy on the eyes, so the user doesn't have to think. Well-formatted numbers are beautiful.