SQL best practices for Data Scientists and Analysts

Marton Trencseni - Sun 26 January 2020 - Data

Introduction

The previous post was about SQL code style, like uppercasing, indentation and overall structure like WITH. This is about how to write readable, concise and efficient SELECT statements that minimize bugs.

For completeness, I will quickly list out the coding style suggestions from the last post, without explanations:

  1. Use all upper-case for keywords like SELECT
  2. Keep SELECT .. FROM .. WHERE .. GROUP BY unindented
  3. Line up WHERE conditions
  4. Write GROUP BY 1, 2
  5. Indent WHEN relative to CASE
  6. Use WITH instead of nested sub-SELECTs
  7. Long lines are okay
  8. Break the rules for readability and flow of code
  9. Follow the same rules when writing interactive SQL code

Without further ado, my SQL best practices for Data Scientists and Analysts follow below.

Use COUNT(DISTINCT id) instead of COUNT(*)

COUNT(*) is dangerous, because you could be counting things do don't want to. For example, a table called dispatches might include multiple dispatch events for a delivery order. If you're counting dispatches for a week, do you want to double-count dispatches for the same order, if the order was dispatched but didn't get delivered the first time (customer was unreachable), and was re-dispatched again? Also, tables can sometimes get polluted with bad data, eg. if there was a software issue issue in production and the warehouse staff triggered multiple dispatch events. Also, although we expect data engineers to pre-clean data, sometimes new modes of dirtyness appear. And sometimes it happens that a bug is introduced in the ETL pipeline, and eg. all rows are duplicated in a table. Although it's not the analyst's fault, it's better to be defensive. Example:

SELECT
    COUNT(*),                    -- BAD:  what are we counting here?
    COUNT(DISTINCT tracking_id)  -- GOOD: much clearer, we're counting unique orders
FROM
    dispatches
WHERE
    ...

Use ROW() when computing metrics over multiple days

In the above example, suppose we want to count dispatches for 7 days, and we want to count re-dispatches of the same package. In this case COUNT(DISTINCT tracking_id) won't work, because it won't double count double dispatches. And we said COUNT(*) is evil. The way out is to DISTINCT on both days and orders, and the way to do that is with ROW():

SELECT
    COUNT(*),                              -- BAD:  what are we counting here?
    COUNT(DISTINCT tracking_id)            -- GOOD: unique orders dispatched
    COUNT(DISTINCT ROW(day, tracking_id))  -- GOOD: dispatches, but we only count an order once a day
FROM
    dispatches
WHERE
    day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12')

Always name columns with AS

Continuing with the previous example, we should always name our columns:

SELECT
    COUNT(*) AS clown_town,
    COUNT(DISTINCT tracking_id) AS num_unique_orders_dispatched,
    COUNT(DISTINCT ROW(day, tracking_id)) AS num_dispatches
FROM
    dispatches
WHERE
    day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12')

Use CASE WHEN instead of writing multiple queries

Suppose we want to count both dispatches and deliveries, and there's a table core_events which has both. We can accomplish this with one SELECT, we don't have to write two:

SELECT
    COUNT(DISTINCT CASE WHEN event = 'dispatched' THEN ROW(day, tracking_id) ELSE NULL END) AS num_dispatches,
    COUNT(DISTINCT CASE WHEN event = 'delivered'  THEN ROW(day, tracking_id) ELSE NULL END) AS num_delivered,
    COUNT(DISTINCT CASE WHEN event = 'dispatched' THEN tracking_id           ELSE NULL END) AS num_unique_orders_dispatched,
    COUNT(DISTINCT CASE WHEN event = 'delivered'  THEN tracking_id           ELSE NULL END) AS num_unique_orders_delivered
FROM
    core_events
WHERE
    day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12')

Remember, COUNT() doesn't count NULLs.

In the above example, we expect:

  • num_delivered = num_unique_orders_delivered because an order can only be delivered once
  • num_unique_orders_dispatched <= num_dispatches
  • num_delivered <= num_dispatches

Use IN (...) instead or ORs

In the above example, for clarity and efficiency, we should only include dispatched and delivered rows. Instead of writing event = 'dispatched' OR event = 'delivered', use IN():

SELECT
    ...
FROM
    core_events
WHERE
    event IN ('dispatched', 'delivered')

Use BETWEEN for dates instead of >= and =<

Don't write day >= DATE('2020-01-06') AND day <= DATE('2020-01-12').

Write day BETWEEN DATE('2020-01-06') AND DATE('2020-01-12'), it's much more readable.

Remember that BETWEEN is inclusive, so x BETWEEN 1 AND 3 is the same as x IN (1, 2, 3) for an INT.

Use “advanced” aggregation functions such as MAX_BY()

If you're doing analytics work, use a database meant for it, like Presto. A good database meant for analytics work will have lots of useful aggregation functions like:

  • MAX_BY()
  • MIN_BY()
  • ARRAY_AGG()
  • ARBITRARY()
  • COUNT_IF()
  • APPROX_PERCENTILE()
  • KURTOSIS()
  • SKEWNESS()

To understand what these are, check the Presto docs.

When taking AVG(), also check min/max/median/p10/p90 values

We often write aggregations and compute the mean summary statistic with AVG(). In the logistics/delivery world, the most common example is to compute the metric average Deliveries per Driver for a fleet. So we write the query, the fleet average comes out to DPD=30.4, which means on average a driver makes 30.4 deliveries per day. Sounds reasonable. But there could be a lot of junk in there:

  • drivers who are just hired and in training, and making 2-3 deliveries/day; probably shouldn't be counted
  • internal drivers making internal deliveries; probably shouldn't be counted
  • suppose the company just introduced self-pickups, but technically in the production system these show up as a SELF_PICKUP driver, ie. all self-pickups are under one virtual driver's accounts, who has 1000+ deliveries; probably shouldn't be counted

See my earlier post Beat the averages for more on this.

There's lots of ways to catch problems like this, but one cheap way is to check the edges of the distribution:

WITH
daily_dpd AS
(
    SELECT
        day,
        driver_id,
        COUNT(DISTINCT tracking_id) AS num_delivered
    FROM
        deliveries
    GROUP BY
        1, 2
),

SELECT
    day,
    AVG(num_delivered) AS dpd,
    -- let's get some additional statistics to make sure we're
    -- not fooling ourselves by reporting the average
    MIN(num_delivered) AS min_delivered,
    MIN_BY(driver_id, num_delivered) AS min_delivered_driver_id,
    MAX(num_delivered) AS max_delivered,
    MAX_BY(driver_id, num_delivered) AS max_delivered_driver_id,
    APPROX_PERCENTILE(num_delivered, 0.1) dpd_p10,
    APPROX_PERCENTILE(num_delivered, 0.5) dpd_p50, -- aka median
    APPROX_PERCENTILE(num_delivered, 0.9) dpd_p90
FROM
    daily_dpd
GROUP BY
    1

Don’t forget to COALESCE() when doing LEFT/RIGHT/OUTER JOIN

Similar to the above example, suppose you want to list out driver's names and daily DPDs for 2020-01-06:

WITH
daily_dpd AS
(
    SELECT
        driver_id,
        COUNT(DISTINCT tracking_id) AS num_delivered
    FROM
        deliveries
    WHERE
        day = DATE('2020-01-06')
    GROUP BY
        1
)

SELECT
    COALESCE( -- COALESCE() returns the first non-NULL argument passed to it
        drivers.name,
        CAST(daily_dpd.driver_id AS VARCHAR)
    ) AS driver_name,
    daily_dpd.num_delivered
FROM
    daily_dpd
LEFT JOIN
    drivers
ON
    daily_dpd.driver_id = drivers.id

Also, always write out the table aliases (daily_dpd and drivers in this example) for clarity. It may be clear to you now where each column is coming from, but will you know in 3 months? Will the next guy know?

Double-check your parentheses in WHERE when using ORs

A nasty source of bugs in SQL code is a list of AND, with an OR hiding in there, with no parentheses, like:

SELECT
    ...
FROM
    ...
WHERE
    country = 'UAE' AND day = DATE('2020-01-16') AND fleet = 'B2C' OR client = 'Acme Bank'

This is almost certainly not what the writer intended. Notice I didn't follow my indentation rules here, which would help catch this. What we really want here is:

SELECT
    ...
FROM
    ...
WHERE
        country = 'UAE'
    AND day = DATE('2020-01-16')
    AND (fleet = 'B2C' OR client = 'Acme Bank')

Write 1000*1000 instead of 100000

Notice how I left out a 0 above?

Unfortunately, we cannot put commas into the numbers to improve readability in SQL (or other programming languages), so we can't write 1,000,000. The problem with 1000000 is that it's hard to see whether we got the number of zeros right. This can lead to nasty bugs. It's better to pretend commas by multiplying like 1000*1000 or 500*1000.

Use DECIMAL not DOUBLE when dealing with money

When dealing with money, never use DOUBLE. There is a type for it, called DECIMAL. In SQL, like in many programming languages, doubles are IEEE 754 floating points, and there's weird precision behaviour that may introduce nasty bugs. There's entire sites dedicated to explaining this. Tldr = DOUBLE is meant to be used for math like sine and cosine, DECIMAL for money. Note that by default, if you write a literal like 0.2, it will be DOUBLE.

Here's a simple example that may surprise you:

SELECT
    0.1 + 0.2 = 0.3 AS does_math_work,
    0.1 + 0.2 AS result_double_type,
    CAST(0.1 AS DECIMAL(16,2)) + CAST(0.2 AS DECIMAL(16,2)) = 0.3 AS does_this_other_math_work,
    CAST(0.1 AS DECIMAL(16,2)) + CAST(0.2 AS DECIMAL(16,2)) AS result_decimal_type

This returns on my Presto DB:

+-----------------+---------------------+---------------------------+---------------------+
| does_math_work  | result_double_type  | does_this_other_math_work | result_decimal_type |
+-----------------+---------------------+---------------------------+---------------------+
| false           | 0.30000000000000004 | true                      | 0.3                 |
+-----------------+---------------------+---------------------------+---------------------+