How I write SQL code

Marton Trencseni - Fri 24 January 2020 - Data

Introduction

This is a simple post about SQL code formatting. Most of this comes from my time as a Data Engineer at Facebook.

I’ve always cared a lot about writing readable code. Readability has several components:

  • directory layout
  • breaking code into files (eg. what to put in headers, how many)
  • code layed out in files (eg. ordering of includes, templates, classes, functions)
  • naming of files, classes, functions and variables
  • indentation, line width
  • comments in code files
  • the modular structure of the code itself (eg. design patterns)
  • high-level documentation that explains design choices

Readable code should be relatively easy to explore and read. I say “relatively”, because most programs have a lot of innate complexity that you have to wrap your head around, even if it’s well-written.

A famous example of high quality and readable (C++) code is the Quake3 and Doom3 code by John Carmack. When I was working on ScalienDB many years ago, I tried to write C++ code in a similar style. A good C++ book on the above points is Large Scale Software Design by John Lakos.

Writing good SQL code is much simpler than writing good C++ code. SQL code has no templates, classes, and in my experience analytical “data mining” SQL code also has no functions (=stored procedures). It’s really just a lot of standalone SELECTs, sometimes chained with WITH, or by dropping results into tables and SELECTing out of those tables in the next step. Having said that, I still see a lot of data people writing hard to read SQL code.

Without further ado, my rules for writing SQL code follow below.

Use all upper-case for keywords like SELECT

Also use upper-case for built-in types and functions like BIGINT and COUNT(). Use lower-case for column names and table names.

Keep SELECT .. FROM .. WHERE .. GROUP BY unindented

But indent the rest. An example so far:

SELECT
    customer_name,
    customer_street
FROM
    company_metrics
WHERE
    customer_name = 'Marton Trencseni'

Line up WHERE conditions

God is merciful because AND⎵ is 4 characters, a good tab width, so WHERE conditions are to be lined up like (same for JOIN conditions):

SELECT
    ...
FROM
    deliveries
WHERE
        country = 'UAE'
    AND day >= DATE('2019-07-01')
    AND DAY_OF_WEEK(day) != 5
    AND scheduled_accuracy_meters <= 10*1000

Write GROUP BY 1, 2

Always put the grouped columns first in the column list, and write GROUP BY 1, 2 .. N:

SELECT
    region_fleet,
    CASE WHEN status = 'Delivered' THEN 'Delivered' ELSE 'Not Delivered' END AS status,
    DATE_TRUNC('week', day) AS week,
    COUNT(DISTINCT ROW(day, so_number)) AS num_orders,
    COUNT(DISTINCT CASE WHEN scheduled_accuracy_meters <= 500 THEN ROW(day, so_number) ELSE NULL END) AS num_accurate,
    AVG(scheduled_accuracy_meters) AS scheduled_accuracy_meters
FROM
    deliveries
WHERE
    ...
GROUP BY
    1, 2, 3

Indent WHEN relative to CASE

SELECT
    ...,
    CASE
        WHEN schedule_channel IN ('Ticker') THEN 'Ticker'
        WHEN schedule_channel IN ('CallCenter', 'CSA') THEN 'Callcenter'
        WHEN schedule_channel IN ('CallCenterBlind', 'CSA-BD') THEN 'Blind'
        WHEN schedule_channel IN ('AutoReschedule', 'AI-AutoReschedul', 'Auto Schedule', 'DriverReschedule') THEN 'Rest'
        WHEN schedule_channel IN ('BulkSchedule') THEN 'Bulk'
        WHEN schedule_channel IN ('CallCenterSelfPickupPoint', 'CallCenterSelfPickupWarehouse') THEN 'Self-pickup'
        WHEN schedule_channel IN ('AI-AutoSchedulin', 'AI_Rango', 'AI-Rango') THEN 'AI'
        WHEN schedule_channel IN ('MWeb', 'MWEB', 'mobile-ios', 'mobile-android') THEN 'Mweb/apps (various)'
        WHEN schedule_channel IN ('Whatsapp', 'Callcenter', 'Blind', 'Driver (various)', 'Bulk', 'AI') THEN schedule_channel
        ELSE 'Rest'
    END AS schedule_channel
FROM
    deliveries

Use WITH instead of nested sub-SELECTs

Sub-SELECTs with indenting are hard to read. Instead, create aliases with WITH, and chain them. Put the WITH on a separate line, and then write the aliases. If I have no better idea, I call the aliases step1, step2 ...

For example, suppose table deliveries has columns scheduled_coordinates and actual_coordinates as lat, lon string, and you want to compute the meter distance with the Haversine formula:

WITH
step1 AS
(
    SELECT
        *,
        TRIM(SPLIT_PART(actual_coordinates,    ' ', 1)) AS act_lat_str,
        TRIM(SPLIT_PART(actual_coordinates,    ' ', 2)) AS act_lon_str,
        TRIM(SPLIT_PART(scheduled_coordinates, ' ', 1)) AS sch_lat_str,
        TRIM(SPLIT_PART(scheduled_coordinates, ' ', 2)) AS sch_lon_str
    FROM
        deliveries
),
step2 AS
(
    SELECT
        *,
        CASE
            WHEN REGEXP_LIKE(act_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lat_str AS DOUBLE)
            ELSE NULL
        END AS actual_lat,
        CASE
            WHEN REGEXP_LIKE(act_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lon_str AS DOUBLE)
            ELSE NULL
        END AS actual_lon,
        CASE
            WHEN REGEXP_LIKE(sch_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lat_str AS DOUBLE)
            ELSE NULL
        END AS scheduled_lat,
        CASE
            WHEN REGEXP_LIKE(sch_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lon_str AS DOUBLE)
            ELSE NULL
        END AS scheduled_lon
    FROM
        step1
),
step3 AS
(
    SELECT
        *,
        6371 * 1000 * 2 * ASIN(SQRT(
            POW(
                SIN(   RADIANS(scheduled_lat - actual_lat) / 2   ),
                2
            )
            +
              COS(RADIANS(actual_lat))
            * COS(RADIANS(scheduled_lat))
            * POW(
                SIN(   RADIANS(scheduled_lon - actual_lon) / 2   ),
                2
            )
        )) AS scheduled_accuracy_meters
    FROM
        step2
)

Long lines are okay

As you can see, long lines are okay in my opinion. We all have widescreen aspect ratio screens (Macbooks), let's use that space. Shorter lines are more readable, but breaking everything into shorter lines, re-breaking when making changes, it's too much work in my opinion, especially when dealing with lenghty "business logic" in SQL.

Break the rules for readability and flow of code

Rules are made to be broken. If doing it in another way leads to better readability, break the rules.

For example, sometimes we SELECT out a horizontal/vertical part of a table in a quick sub-SELECT to help the query optimizer. In cases like this I don't use WITH and keep it in one line, like:

SELECT
    ...

FROM
    deliveries dls
LEFT JOIN
    (SELECT id, name FROM drivers WHERE country = 'UAE') drv
ON
    dls.driver_id = drv.id

Follow the same rules when writing interactive SQL code

I follow these same rules when writing one-off queries in an interactive console. That way it’s easier to stick to them when writing long-lived code that goes into ETL and the repo. Also, one-off code oftens ends up being ETL’d.

Alternative ways to write SQL

Finally, some other, more comprehensive guides: