# How I write SQL code

Marton Trencseni - Fri 24 January 2020 - Data

## Introduction

This is a simple post about SQL code formatting. Most of this comes from my time as a Data Engineer at Facebook.

• directory layout
• breaking code into files (eg. what to put in headers, how many)
• code layed out in files (eg. ordering of includes, templates, classes, functions)
• naming of files, classes, functions and variables
• indentation, line width
• the modular structure of the code itself (eg. design patterns)
• high-level documentation that explains design choices

Readable code should be relatively easy to explore and read. I say “relatively”, because most programs have a lot of innate complexity that you have to wrap your head around, even if it’s well-written.

A famous example of high quality and readable (C++) code is the Quake3 and Doom3 code by John Carmack. When I was working on ScalienDB many years ago, I tried to write C++ code in a similar style. A good C++ book on the above points is Large Scale Software Design by John Lakos.

Writing good SQL code is much simpler than writing good C++ code. SQL code has no templates, classes, and in my experience analytical “data mining” SQL code also has no functions (=stored procedures). It’s really just a lot of standalone SELECTs, sometimes chained with WITH, or by dropping results into tables and SELECTing out of those tables in the next step. Having said that, I still see a lot of data people writing hard to read SQL code.

## Use all upper-case for keywords like SELECT

Also use upper-case for built-in types and functions like BIGINT and COUNT(). Use lower-case for column names and table names.

## Keep SELECT .. FROM .. WHERE .. GROUP BY unindented

But indent the rest. An example so far:

SELECT
customer_name,
customer_street
FROM
company_metrics
WHERE
customer_name = 'Marton Trencseni'


## Line up WHERE conditions

God is merciful because AND⎵ is 4 characters, a good tab width, so WHERE conditions are to be lined up like (same for JOIN conditions):

SELECT
...
FROM
deliveries
WHERE
country = 'UAE'
AND day >= DATE('2019-07-01')
AND DAY_OF_WEEK(day) != 5
AND scheduled_accuracy_meters <= 10*1000


## Write GROUP BY 1, 2

Always put the grouped columns first in the column list, and write GROUP BY 1, 2 .. N:

SELECT
region_fleet,
CASE WHEN status = 'Delivered' THEN 'Delivered' ELSE 'Not Delivered' END AS status,
DATE_TRUNC('week', day) AS week,
COUNT(DISTINCT ROW(day, so_number)) AS num_orders,
COUNT(DISTINCT CASE WHEN scheduled_accuracy_meters <= 500 THEN ROW(day, so_number) ELSE NULL END) AS num_accurate,
AVG(scheduled_accuracy_meters) AS scheduled_accuracy_meters
FROM
deliveries
WHERE
...
GROUP BY
1, 2, 3


## Indent WHEN relative to CASE

SELECT
...,
CASE
WHEN schedule_channel IN ('Ticker') THEN 'Ticker'
WHEN schedule_channel IN ('CallCenter', 'CSA') THEN 'Callcenter'
WHEN schedule_channel IN ('CallCenterBlind', 'CSA-BD') THEN 'Blind'
WHEN schedule_channel IN ('AutoReschedule', 'AI-AutoReschedul', 'Auto Schedule', 'DriverReschedule') THEN 'Rest'
WHEN schedule_channel IN ('BulkSchedule') THEN 'Bulk'
WHEN schedule_channel IN ('CallCenterSelfPickupPoint', 'CallCenterSelfPickupWarehouse') THEN 'Self-pickup'
WHEN schedule_channel IN ('AI-AutoSchedulin', 'AI_Rango', 'AI-Rango') THEN 'AI'
WHEN schedule_channel IN ('MWeb', 'MWEB', 'mobile-ios', 'mobile-android') THEN 'Mweb/apps (various)'
WHEN schedule_channel IN ('Whatsapp', 'Callcenter', 'Blind', 'Driver (various)', 'Bulk', 'AI') THEN schedule_channel
ELSE 'Rest'
END AS schedule_channel
FROM
deliveries


## Use WITH instead of nested sub-SELECTs

Sub-SELECTs with indenting are hard to read. Instead, create aliases with WITH, and chain them. Put the WITH on a separate line, and then write the aliases. If I have no better idea, I call the aliases step1, step2 ...

For example, suppose table deliveries has columns scheduled_coordinates and actual_coordinates as lat, lon string, and you want to compute the meter distance with the Haversine formula:

WITH
step1 AS
(
SELECT
*,
TRIM(SPLIT_PART(actual_coordinates,    ' ', 1)) AS act_lat_str,
TRIM(SPLIT_PART(actual_coordinates,    ' ', 2)) AS act_lon_str,
TRIM(SPLIT_PART(scheduled_coordinates, ' ', 1)) AS sch_lat_str,
TRIM(SPLIT_PART(scheduled_coordinates, ' ', 2)) AS sch_lon_str
FROM
deliveries
),
step2 AS
(
SELECT
*,
CASE
WHEN REGEXP_LIKE(act_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lat_str AS DOUBLE) ELSE NULL END AS actual_lat, CASE WHEN REGEXP_LIKE(act_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lon_str AS DOUBLE)
ELSE NULL
END AS actual_lon,
CASE
WHEN REGEXP_LIKE(sch_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lat_str AS DOUBLE) ELSE NULL END AS scheduled_lat, CASE WHEN REGEXP_LIKE(sch_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lon_str AS DOUBLE)
ELSE NULL
END AS scheduled_lon
FROM
step1
),
step3 AS
(
SELECT
*,
6371 * 1000 * 2 * ASIN(SQRT(
POW(
SIN(   RADIANS(scheduled_lat - actual_lat) / 2   ),
2
)
+
* POW(
SIN(   RADIANS(scheduled_lon - actual_lon) / 2   ),
2
)
)) AS scheduled_accuracy_meters
FROM
step2
)


## Long lines are okay

As you can see, long lines are okay in my opinion. We all have widescreen aspect ratio screens (Macbooks), let's use that space. Shorter lines are more readable, but breaking everything into shorter lines, re-breaking when making changes, it's too much work in my opinion, especially when dealing with lenghty "business logic" in SQL.

## Break the rules for readability and flow of code

Rules are made to be broken. If doing it in another way leads to better readability, break the rules.

For example, sometimes we SELECT out a horizontal/vertical part of a table in a quick sub-SELECT to help the query optimizer. In cases like this I don't use WITH and keep it in one line, like:

SELECT
...

FROM
deliveries dls
LEFT JOIN
(SELECT id, name FROM drivers WHERE country = 'UAE') drv
ON
dls.driver_id = drv.id


## Follow the same rules when writing interactive SQL code

I follow these same rules when writing one-off queries in an interactive console. That way it’s easier to stick to them when writing long-lived code that goes into ETL and the repo. Also, one-off code oftens ends up being ETL’d.

## Alternative ways to write SQL

Finally, some other, more comprehensive guides: