This is a simple post about SQL code formatting. Most of this comes from my time as a Data Engineer at Facebook.
I’ve always cared a lot about writing readable code. Readability has several components:
- directory layout
- breaking code into files (eg. what to put in headers, how many)
- code layed out in files (eg. ordering of includes, templates, classes, functions)
- naming of files, classes, functions and variables
- indentation, line width
- comments in code files
- the modular structure of the code itself (eg. design patterns)
- high-level documentation that explains design choices
Readable code should be relatively easy to explore and read. I say “relatively”, because most programs have a lot of innate complexity that you have to wrap your head around, even if it’s well-written.
A famous example of high quality and readable (C++) code is the Quake3 and Doom3 code by John Carmack. When I was working on ScalienDB many years ago, I tried to write C++ code in a similar style. A good C++ book on the above points is Large Scale Software Design by John Lakos.
Writing good SQL code is much simpler than writing good C++ code. SQL code has no templates, classes, and in my experience analytical “data mining” SQL code also has no functions (=stored procedures). It’s really just a lot of standalone
SELECTs, sometimes chained with
WITH, or by dropping results into tables and
SELECTing out of those tables in the next step. Having said that, I still see a lot of data people writing hard to read SQL code.
Without further ado, my rules for writing SQL code follow below.
Use all upper-case for keywords like
Also use upper-case for built-in types and functions like
Use lower-case for column names and table names.
SELECT .. FROM .. WHERE .. GROUP BY unindented
But indent the rest. An example so far:
SELECT customer_name, customer_street FROM company_metrics WHERE customer_name = 'Marton Trencseni'
God is merciful because
AND⎵ is 4 characters, a good tab width, so
WHERE conditions are to be lined up like (same for
SELECT ... FROM deliveries WHERE country = 'UAE' AND day >= DATE('2019-07-01') AND DAY_OF_WEEK(day) != 5 AND scheduled_accuracy_meters <= 10*1000
GROUP BY 1, 2
Always put the grouped columns first in the column list, and write
GROUP BY 1, 2 .. N:
SELECT region_fleet, CASE WHEN status = 'Delivered' THEN 'Delivered' ELSE 'Not Delivered' END AS status, DATE_TRUNC('week', day) AS week, COUNT(DISTINCT ROW(day, so_number)) AS num_orders, COUNT(DISTINCT CASE WHEN scheduled_accuracy_meters <= 500 THEN ROW(day, so_number) ELSE NULL END) AS num_accurate, AVG(scheduled_accuracy_meters) AS scheduled_accuracy_meters FROM deliveries WHERE ... GROUP BY 1, 2, 3
WHEN relative to
SELECT ..., CASE WHEN schedule_channel IN ('Ticker') THEN 'Ticker' WHEN schedule_channel IN ('CallCenter', 'CSA') THEN 'Callcenter' WHEN schedule_channel IN ('CallCenterBlind', 'CSA-BD') THEN 'Blind' WHEN schedule_channel IN ('AutoReschedule', 'AI-AutoReschedul', 'Auto Schedule', 'DriverReschedule') THEN 'Rest' WHEN schedule_channel IN ('BulkSchedule') THEN 'Bulk' WHEN schedule_channel IN ('CallCenterSelfPickupPoint', 'CallCenterSelfPickupWarehouse') THEN 'Self-pickup' WHEN schedule_channel IN ('AI-AutoSchedulin', 'AI_Rango', 'AI-Rango') THEN 'AI' WHEN schedule_channel IN ('MWeb', 'MWEB', 'mobile-ios', 'mobile-android') THEN 'Mweb/apps (various)' WHEN schedule_channel IN ('Whatsapp', 'Callcenter', 'Blind', 'Driver (various)', 'Bulk', 'AI') THEN schedule_channel ELSE 'Rest' END AS schedule_channel FROM deliveries
WITH instead of nested sub-
SELECTs with indenting are hard to read. Instead, create aliases with
WITH, and chain them. Put the
WITH on a separate line, and then write the aliases. If I have no better idea, I call the aliases
step1, step2 ...
For example, suppose table
deliveries has columns
lat, lon string, and you want to compute the meter distance with the Haversine formula:
WITH step1 AS ( SELECT *, TRIM(SPLIT_PART(actual_coordinates, ' ', 1)) AS act_lat_str, TRIM(SPLIT_PART(actual_coordinates, ' ', 2)) AS act_lon_str, TRIM(SPLIT_PART(scheduled_coordinates, ' ', 1)) AS sch_lat_str, TRIM(SPLIT_PART(scheduled_coordinates, ' ', 2)) AS sch_lon_str FROM deliveries ), step2 AS ( SELECT *, CASE WHEN REGEXP_LIKE(act_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lat_str AS DOUBLE) ELSE NULL END AS actual_lat, CASE WHEN REGEXP_LIKE(act_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(act_lon_str AS DOUBLE) ELSE NULL END AS actual_lon, CASE WHEN REGEXP_LIKE(sch_lat_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lat_str AS DOUBLE) ELSE NULL END AS scheduled_lat, CASE WHEN REGEXP_LIKE(sch_lon_str, '^[0-9]+\.[0-9]+$') THEN CAST(sch_lon_str AS DOUBLE) ELSE NULL END AS scheduled_lon FROM step1 ), step3 AS ( SELECT *, 6371 * 1000 * 2 * ASIN(SQRT( POW( SIN( RADIANS(scheduled_lat - actual_lat) / 2 ), 2 ) + COS(RADIANS(actual_lat)) * COS(RADIANS(scheduled_lat)) * POW( SIN( RADIANS(scheduled_lon - actual_lon) / 2 ), 2 ) )) AS scheduled_accuracy_meters FROM step2 )
Long lines are okay
As you can see, long lines are okay in my opinion. We all have widescreen aspect ratio screens (Macbooks), let's use that space. Shorter lines are more readable, but breaking everything into shorter lines, re-breaking when making changes, it's too much work in my opinion, especially when dealing with lenghty "business logic" in SQL.
Break the rules for readability and flow of code
Rules are made to be broken. If doing it in another way leads to better readability, break the rules.
For example, sometimes we
SELECT out a horizontal/vertical part of a table in a quick sub-
SELECT to help the query optimizer. In cases like this I don't use
WITH and keep it in one line, like:
SELECT ... FROM deliveries dls LEFT JOIN (SELECT id, name FROM drivers WHERE country = 'UAE') drv ON dls.driver_id = drv.id
Follow the same rules when writing interactive SQL code
I follow these same rules when writing one-off queries in an interactive console. That way it’s easier to stick to them when writing long-lived code that goes into ETL and the repo. Also, one-off code oftens ends up being ETL’d.
Alternative ways to write SQL
Finally, some other, more comprehensive guides: