How I write SQL code
Marton Trencseni - Fri 24 January 2020 - Data
Introduction
This is a simple post about SQL code formatting. Most of this comes from my time as a Data Engineer at Facebook.
I’ve always cared a lot about writing readable code. Readability has several components:
- directory layout
- breaking code into files (eg. what to put in headers, how many)
- code layed out in files (eg. ordering of includes, templates, classes, functions)
- naming of files, classes, functions and variables
- indentation, line width
- comments in code files
- the modular structure of the code itself (eg. design patterns)
- high-level documentation that explains design choices
Readable code should be relatively easy to explore and read. I say “relatively”, because most programs have a lot of innate complexity that you have to wrap your head around, even if it’s well-written.
A famous example of high quality and readable (C++) code is the Quake3 and Doom3 code by John Carmack. When I was working on ScalienDB many years ago, I tried to write C++ code in a similar style. A good C++ book on the above points is Large Scale Software Design by John Lakos.
Writing good SQL code is much simpler than writing good C++ code. SQL code has no templates, classes, and in my experience analytical “data mining” SQL code also has no functions (=stored procedures). It’s really just a lot of standalone SELECT
s, sometimes chained with WITH
, or by dropping results into tables and SELECT
ing out of those tables in the next step. Having said that, I still see a lot of data people writing hard to read SQL code.
Without further ado, my rules for writing SQL code follow below.
Use all upper-case for keywords like SELECT
Also use upper-case for built-in types and functions like BIGINT
and COUNT()
.
Use lower-case for column names and table names.
Keep SELECT .. FROM .. WHERE .. GROUP BY
unindented
But indent the rest. An example so far:
SELECT
customer_name,
customer_street
FROM
company_metrics
WHERE
customer_name = 'Marton Trencseni'
Line up WHERE
conditions
God is merciful because AND⎵
is 4 characters, a good tab width, so WHERE
conditions are to be lined up like (same for JOIN
conditions):
SELECT
...
FROM
deliveries
WHERE
country = 'UAE'
AND day >= DATE('2019-07-01')
AND DAY_OF_WEEK(day) != 5
AND scheduled_accuracy_meters <= 10*1000
Write GROUP BY 1, 2
Always put the grouped columns first in the column list, and write GROUP BY 1, 2 .. N
:
SELECT
region_fleet,
CASE WHEN status = 'Delivered' THEN 'Delivered' ELSE 'Not Delivered' END AS status,
DATE_TRUNC('week', day) AS week,
COUNT(DISTINCT ROW(day, so_number)) AS num_orders,
COUNT(DISTINCT CASE WHEN scheduled_accuracy_meters <= 500 THEN ROW(day, so_number) ELSE NULL END) AS num_accurate,
AVG(scheduled_accuracy_meters) AS scheduled_accuracy_meters
FROM
deliveries
WHERE
...
GROUP BY
1, 2, 3
Indent WHEN
relative to CASE
SELECT
...,
CASE
WHEN schedule_channel IN ('Ticker') THEN 'Ticker'
WHEN schedule_channel IN ('CallCenter', 'CSA') THEN 'Callcenter'
WHEN schedule_channel IN ('CallCenterBlind', 'CSA-BD') THEN 'Blind'
WHEN schedule_channel IN ('AutoReschedule', 'AI-AutoReschedul', 'Auto Schedule', 'DriverReschedule') THEN 'Rest'
WHEN schedule_channel IN ('BulkSchedule') THEN 'Bulk'
WHEN schedule_channel IN ('CallCenterSelfPickupPoint', 'CallCenterSelfPickupWarehouse') THEN 'Self-pickup'
WHEN schedule_channel IN ('AI-AutoSchedulin', 'AI_Rango', 'AI-Rango') THEN 'AI'
WHEN schedule_channel IN ('MWeb', 'MWEB', 'mobile-ios', 'mobile-android') THEN 'Mweb/apps (various)'
WHEN schedule_channel IN ('Whatsapp', 'Callcenter', 'Blind', 'Driver (various)', 'Bulk', 'AI') THEN schedule_channel
ELSE 'Rest'
END AS schedule_channel
FROM
deliveries
Use WITH
instead of nested sub-SELECT
s
Sub-SELECT
s with indenting are hard to read. Instead, create aliases with WITH
, and chain them. Put the WITH
on a separate line, and then write the aliases. If I have no better idea, I call the aliases step1, step2 ...
For example, suppose table deliveries
has columns scheduled_coordinates
and actual_coordinates
as lat, lon
string, and you want to compute the meter distance with the Haversine formula:
WITH
step1 AS
(
SELECT
*,
TRIM(SPLIT_PART(actual_coordinates, ' ', 1)) AS act_lat_str,
TRIM(SPLIT_PART(actual_coordinates, ' ', 2)) AS act_lon_str,
TRIM(SPLIT_PART(scheduled_coordinates, ' ', 1)) AS sch_lat_str,
TRIM(SPLIT_PART(scheduled_coordinates, ' ', 2)) AS sch_lon_str
FROM
deliveries
),
step2 AS
(
SELECT
*,
CASE
WHEN REGEXP_LIKE(act_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lat_str AS DOUBLE)
ELSE NULL
END AS actual_lat,
CASE
WHEN REGEXP_LIKE(act_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lon_str AS DOUBLE)
ELSE NULL
END AS actual_lon,
CASE
WHEN REGEXP_LIKE(sch_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lat_str AS DOUBLE)
ELSE NULL
END AS scheduled_lat,
CASE
WHEN REGEXP_LIKE(sch_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lon_str AS DOUBLE)
ELSE NULL
END AS scheduled_lon
FROM
step1
),
step3 AS
(
SELECT
*,
6371 * 1000 * 2 * ASIN(SQRT(
POW(
SIN( RADIANS(scheduled_lat - actual_lat) / 2 ),
2
)
+
COS(RADIANS(actual_lat))
* COS(RADIANS(scheduled_lat))
* POW(
SIN( RADIANS(scheduled_lon - actual_lon) / 2 ),
2
)
)) AS scheduled_accuracy_meters
FROM
step2
)
Long lines are okay
As you can see, long lines are okay in my opinion. We all have widescreen aspect ratio screens (Macbooks), let's use that space. Shorter lines are more readable, but breaking everything into shorter lines, re-breaking when making changes, it's too much work in my opinion, especially when dealing with lenghty "business logic" in SQL.
Break the rules for readability and flow of code
Rules are made to be broken. If doing it in another way leads to better readability, break the rules.
For example, sometimes we SELECT
out a horizontal/vertical part of a table in a quick sub-SELECT
to help the query optimizer. In cases like this I don't use WITH
and keep it in one line, like:
SELECT
...
FROM
deliveries dls
LEFT JOIN
(SELECT id, name FROM drivers WHERE country = 'UAE') drv
ON
dls.driver_id = drv.id
Follow the same rules when writing interactive SQL code
I follow these same rules when writing one-off queries in an interactive console. That way it’s easier to stick to them when writing long-lived code that goes into ETL and the repo. Also, one-off code oftens ends up being ETL’d.
Alternative ways to write SQL
Finally, some other, more comprehensive guides: