SQL best practices for Data Scientists and Analysts
Marton Trencseni - Sun 26 January 2020 - Data
Introduction
The previous post was about SQL code style: uppercasing, indentation, and overall structure like WITH. This one is about how to write readable, concise and efficient SELECT statements that minimize bugs.
For completeness, I will quickly list out the coding style suggestions from the last post, without explanations:
- Use all upper-case for keywords like SELECT
- Keep SELECT .. FROM .. WHERE .. GROUP BY unindented
- Line up WHERE conditions
- Write GROUP BY 1, 2
- Indent WHEN relative to CASE
- Use WITH instead of nested sub-SELECTs
- Long lines are okay
- Break the rules for readability and flow of code
- Follow the same rules when writing interactive SQL code
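To see all of these rules in action at once, here's an illustrative query (the orders table and its columns are made up):

WITH
weekly_orders AS
(
SELECT
    country,
    day,
    COUNT(DISTINCT order_id) AS num_orders
FROM
    orders
WHERE
    day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12')
    AND status = 'completed'
GROUP BY
    1, 2
)
SELECT
    country,
    AVG(num_orders) AS avg_daily_orders
FROM
    weekly_orders
GROUP BY
    1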
Without further ado, my SQL best practices for Data Scientists and Analysts follow below.
Use COUNT(DISTINCT id) instead of COUNT(*)
COUNT(*) is dangerous, because you could be counting things you don't want to. For example, a table called dispatches might include multiple dispatch events for a delivery order. If you're counting dispatches for a week, do you want to double-count dispatches for the same order, if the order was dispatched but didn't get delivered the first time (customer was unreachable) and was re-dispatched? Also, tables can sometimes get polluted with bad data, eg. if there was a software issue in production and the warehouse staff triggered multiple dispatch events. Also, although we expect data engineers to pre-clean data, sometimes new modes of dirtiness appear. And sometimes a bug is introduced in the ETL pipeline, and eg. all rows in a table are duplicated. Although it's not the analyst's fault, it's better to be defensive. Example:
SELECT
    COUNT(*),                   -- BAD: what are we counting here?
    COUNT(DISTINCT tracking_id) -- GOOD: much clearer, we're counting unique orders
FROM
    dispatches
WHERE
    ...
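Here's a toy version you can run as-is (the tracking IDs are made up), showing how duplicate rows inflate COUNT(*):

SELECT
    COUNT(*) AS raw_rows,                     -- returns 3
    COUNT(DISTINCT tracking_id) AS num_orders -- returns 2
FROM (
    VALUES ('T1'), ('T1'), ('T2') -- order T1 was dispatched twice
) AS t (tracking_id)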
Use ROW() when computing metrics over multiple days
In the above example, suppose we want to count dispatches for 7 days, and we want to count re-dispatches of the same package. In this case COUNT(DISTINCT tracking_id) won't work, because it collapses the re-dispatches of an order into one. And we said COUNT(*) is evil. The way out is to DISTINCT on both days and orders, and the way to do that is with ROW():
SELECT
    COUNT(*),                             -- BAD: what are we counting here?
    COUNT(DISTINCT tracking_id),          -- GOOD: unique orders dispatched
    COUNT(DISTINCT ROW(day, tracking_id)) -- GOOD: dispatches, but we only count an order once a day
FROM
    dispatches
WHERE
    day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12')
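Again, a toy version with made-up data, to see how the three counts differ:

SELECT
    COUNT(*) AS raw_rows,                               -- returns 4
    COUNT(DISTINCT tracking_id) AS unique_orders,       -- returns 2
    COUNT(DISTINCT ROW(day, tracking_id)) AS dispatches -- returns 3
FROM (
    VALUES
        (DATE('2020-01-06'), 'T1'),
        (DATE('2020-01-07'), 'T1'), -- T1 re-dispatched on a later day: counted
        (DATE('2020-01-07'), 'T1'), -- duplicate event on the same day: not counted
        (DATE('2020-01-07'), 'T2')
) AS t (day, tracking_id)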
Always name columns with AS
Continuing with the previous example, we should always name our columns:
SELECT
    COUNT(*) AS clown_town,
    COUNT(DISTINCT tracking_id) AS num_unique_orders_dispatched,
    COUNT(DISTINCT ROW(day, tracking_id)) AS num_dispatches
FROM
    dispatches
WHERE
    day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12')
Use CASE WHEN instead of writing multiple queries
Suppose we want to count both dispatches and deliveries, and there's a table core_events which has both. We can accomplish this with one SELECT; we don't have to write two:
SELECT
    COUNT(DISTINCT CASE WHEN event = 'dispatched' THEN ROW(day, tracking_id) ELSE NULL END) AS num_dispatches,
    COUNT(DISTINCT CASE WHEN event = 'delivered'  THEN ROW(day, tracking_id) ELSE NULL END) AS num_delivered,
    COUNT(DISTINCT CASE WHEN event = 'dispatched' THEN tracking_id ELSE NULL END) AS num_unique_orders_dispatched,
    COUNT(DISTINCT CASE WHEN event = 'delivered'  THEN tracking_id ELSE NULL END) AS num_unique_orders_delivered
FROM
    core_events
WHERE
    day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12')
Remember, COUNT() doesn't count NULLs.
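A toy query to convince yourself:

SELECT
    COUNT(*) AS all_rows, -- returns 3
    COUNT(x) AS non_nulls -- returns 2, the NULL is not counted
FROM (
    VALUES 1, NULL, 2
) AS t (x)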
In the above example, we expect:
- num_delivered = num_unique_orders_delivered, because an order can only be delivered once
- num_unique_orders_dispatched <= num_dispatches
- num_delivered <= num_dispatches
Use IN (...) instead of ORs
In the above example, for clarity and efficiency, we should only include dispatched and delivered rows. Instead of writing event = 'dispatched' OR event = 'delivered', use IN():
SELECT
    ...
FROM
    core_events
WHERE
    event IN ('dispatched', 'delivered')
Use BETWEEN for dates instead of >= and <=
Don't write day >= DATE('2020-01-06') AND day <= DATE('2020-01-12').
Write day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12'); it's much more readable.
Remember that BETWEEN is inclusive, so x BETWEEN 1 AND 3 is the same as x IN (1, 2, 3) for an INT.
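A quick toy check:

SELECT
    x
FROM (
    VALUES 0, 1, 2, 3, 4
) AS t (x)
WHERE
    x BETWEEN 1 AND 3 -- returns 1, 2 and 3: both endpoints are included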
Use “advanced” aggregation functions such as MAX_BY()
If you're doing analytics work, use a database meant for it, like Presto. A good analytics database will have lots of useful aggregation functions like:
- MAX_BY()
- MIN_BY()
- ARRAY_AGG()
- ARBITRARY()
- COUNT_IF()
- APPROX_PERCENTILE()
- KURTOSIS()
- SKEWNESS()
To understand what these are, check the Presto docs.
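Here's a toy example of a few of them in action (driver IDs and delivery counts are made up):

SELECT
    MAX_BY(driver_id, num_delivered) AS busiest_driver,       -- returns 'D2'
    COUNT_IF(num_delivered >= 10) AS num_busy_drivers,        -- returns 2
    APPROX_PERCENTILE(num_delivered, 0.5) AS median_delivered -- returns the median, 12
FROM (
    VALUES ('D1', 8), ('D2', 31), ('D3', 12)
) AS t (driver_id, num_delivered)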
When taking AVG(), also check min/max/median/p10/p90 values
We often write aggregations and compute the mean summary statistic with AVG(). In the logistics/delivery world, the most common example is to compute the metric average Deliveries per Driver for a fleet. So we write the query, and the fleet average comes out to DPD=30.4, which means on average a driver makes 30.4 deliveries per day. Sounds reasonable. But there could be a lot of junk in there:
- drivers who were just hired and are in training, making 2-3 deliveries/day; probably shouldn't be counted
- internal drivers making internal deliveries; probably shouldn't be counted
- suppose the company just introduced self-pickups, but technically in the production system these show up as a SELF_PICKUP driver, ie. all self-pickups are under one virtual driver's account, who has 1000+ deliveries; probably shouldn't be counted
See my earlier post Beat the averages for more on this.
There are lots of ways to catch problems like this, but one cheap way is to check the edges of the distribution:
WITH
daily_dpd AS
(
SELECT
    day,
    driver_id,
    COUNT(DISTINCT tracking_id) AS num_delivered
FROM
    deliveries
GROUP BY
    1, 2
)
SELECT
    day,
    AVG(num_delivered) AS dpd,
    -- let's get some additional statistics to make sure we're
    -- not fooling ourselves by reporting the average
    MIN(num_delivered) AS min_delivered,
    MIN_BY(driver_id, num_delivered) AS min_delivered_driver_id,
    MAX(num_delivered) AS max_delivered,
    MAX_BY(driver_id, num_delivered) AS max_delivered_driver_id,
    APPROX_PERCENTILE(num_delivered, 0.1) AS dpd_p10,
    APPROX_PERCENTILE(num_delivered, 0.5) AS dpd_p50, -- aka median
    APPROX_PERCENTILE(num_delivered, 0.9) AS dpd_p90
FROM
    daily_dpd
GROUP BY
    1
Don’t forget to COALESCE() when doing LEFT/RIGHT/OUTER JOIN
Similar to the above example, suppose you want to list out drivers' names and daily DPDs for 2020-01-06:
WITH
daily_dpd AS
(
SELECT
    driver_id,
    COUNT(DISTINCT tracking_id) AS num_delivered
FROM
    deliveries
WHERE
    day = DATE('2020-01-06')
GROUP BY
    1
)
SELECT
    COALESCE( -- COALESCE() returns the first non-NULL argument passed to it
        drivers.name,
        CAST(daily_dpd.driver_id AS VARCHAR)
    ) AS driver_name,
    daily_dpd.num_delivered
FROM
    daily_dpd
LEFT JOIN
    drivers
ON
    daily_dpd.driver_id = drivers.id
Also, always write out the table aliases (daily_dpd and drivers in this example) for clarity. It may be clear to you now where each column is coming from, but will you know in 3 months? Will the next guy know?
Double-check your parentheses in WHERE when using ORs
A nasty source of bugs in SQL code is a list of ANDs with an OR hiding in there, with no parentheses, like:
SELECT
    ...
FROM
    ...
WHERE
    country = 'UAE' AND day = DATE('2020-01-16') AND fleet = 'B2C' OR client = 'Acme Bank'
This is almost certainly not what the writer intended: since AND binds tighter than OR, this is parsed as (country = 'UAE' AND day = DATE('2020-01-16') AND fleet = 'B2C') OR client = 'Acme Bank', so it matches every Acme Bank row regardless of country or day. Notice I didn't follow my indentation rules here, which would help catch this. What we really want here is:
SELECT
    ...
FROM
    ...
WHERE
    country = 'UAE'
    AND day = DATE('2020-01-16')
    AND (fleet = 'B2C' OR client = 'Acme Bank')
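If you want to convince yourself of the precedence, here's a toy query:

SELECT
    false AND false OR true AS without_parens, -- returns true: parsed as (false AND false) OR true
    false AND (false OR true) AS with_parens   -- returns false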
Write 1000*1000 instead of 100000
Notice how I left out a 0 above?
Unfortunately, we cannot put commas into numbers to improve readability in SQL (or most other programming languages), so we can't write 1,000,000. The problem with 1000000 is that it's hard to see whether we got the number of zeros right, and this can lead to nasty bugs. It's better to fake the commas by multiplying, like 1000*1000 or 500*1000.
Use DECIMAL not DOUBLE when dealing with money
When dealing with money, never use DOUBLE. There is a type for it, called DECIMAL. In SQL, like in many programming languages, doubles are IEEE 754 floating point numbers, and their weird precision behaviour may introduce nasty bugs. There are entire sites dedicated to explaining this. Tl;dr: DOUBLE is meant to be used for math like sine and cosine, DECIMAL for money. Note that by default, if you write a literal like 0.2, it will be a DOUBLE.
Here's a simple example that may surprise you:
SELECT
    0.1 + 0.2 = 0.3 AS does_math_work,
    0.1 + 0.2 AS result_double_type,
    CAST(0.1 AS DECIMAL(16,2)) + CAST(0.2 AS DECIMAL(16,2)) = 0.3 AS does_this_other_math_work,
    CAST(0.1 AS DECIMAL(16,2)) + CAST(0.2 AS DECIMAL(16,2)) AS result_decimal_type
This returns on my Presto DB:
+-----------------+---------------------+---------------------------+---------------------+
| does_math_work | result_double_type | does_this_other_math_work | result_decimal_type |
+-----------------+---------------------+---------------------------+---------------------+
| false | 0.30000000000000004 | true | 0.3 |
+-----------------+---------------------+---------------------------+---------------------+