More Data Scientists should learn SQL

Marton Trencseni - Sun 29 May 2022 - Data

Introduction

Our Data Scientist interview is a multi-stage process. As the hiring manager, my interview is somewhere in the middle. The candidates who make it this far have had their CVs screened to make sure SQL is on it, and had to answer a few SQL screening questions, like whether they know how many rows are produced when using UNION ALL. Still, most candidates I talk to struggle to write a relatively simple SQL query.

It's worth pointing out that, although we always ask SQL questions, this is about 10-20% of the interview process. We also ask candidates to write 2-3 line Python functions, ask them about metrics, stats, modeling/ML, and maybe even some Linux.

A relatively simple SQL interview question

For this post, I wrote a slightly modified version of our interview question, but it's the same structure and complexity. Let's say there is table transactions, with columns:

transactions:
- id BIGINT
- customer_id BIGINT
- ts TIMESTAMP
- amount_usd DECIMAL

Question: how many customers spent at least $1000 in all months of the year 2021?

The solution we're looking for:

WITH
monthly_spends AS
(
    SELECT
        customer_id,
        MONTH(ts) AS month,
        SUM(amount_usd)
    FROM
        transactions
    WHERE
        YEAR(ts) = 2021
    GROUP BY
        1, 2
    HAVING
        SUM(amount_usd) >= 1000
),
qualifying_users AS
(
    SELECT
        customer_id,
        COUNT(*)
    FROM
        monthly_spends
    GROUP BY
        1
    HAVING
        COUNT(*) = 12
)
SELECT COUNT(*) FORM qualifying_users

Extracting metrics from a Vertica DWH is a daily task for Data Scientists on our current team (same for the previous 3 jobs), it's not uncommon for us to write SQL queries that are longer than 100 lines, some are longer than 1000 lines. So expecting a candidate to be able to use GROUP BY, HAVING, COUNT(), SUM() is definitely not unfair, it's what we do every day.

First, I make sure they understand the question. I spend extra time stressing the "all months of the year" part, here english-as-a-second-language sometimes is a factor. So I usually repeat this requirement phrased in 2-3 different ways (in english), to make sure the candidate got it.

What to look for

Interviewing is nerve-wrecking, so I expect candidates to perform worse than they would do under normal, real-life conditions. It's the same for me. For this reason, I tend to help them, and ignore minor errors/omissions. For example, people tend to not remember MONTH() and YEAR() and maybe try to use DATE_PART(), but then struggle with the syntax. Or they forget to actually write out the GROUP BY, even though clearly that's what they're doing. In cases like this I help them, and don't count this against them. Or, if they write the correct query, but forget the WHERE YEAR(ts) = 2021 part, I would remind them and also not count it against them. Also, most problems can be solved in several different ways, I'll take anything that is roughly right: eg. in the above query, you can write HAVING ... like I did, or move that to the subsequent SELECT as a WHERE condition.

The thing we are looking for is, can candidates think in SQL, formulate the solution as an SQL query? Can they break down the question into a series of SELECTs, which progressively get closer to the answer we are looking for. This is where most people fail — to my surprise. This tells me that these candidates haven't yet figured out that knowing how to think in SQL and write queries more complicated than a single SELECT <columns> FROM <table> WHERE <condition> is worth it.

In our interview process, depending on the position and the person, a candidate can still get hired if they can't solve the above SQL, but they have to be really good at something else. Also, if the candidate is not a junior, I would be skeptical how they made it this far without learning to write SQL. This is simply because having to write SQL has been a daily part of my last 4 jobs in this field. In my current job I'm a hands-on manager running a 20-person team, but I still have about 10,000 lines of SQL in my draft, not counting code that I've commited to Github.

Indentation

Another thing I notice is that very few candidates (i) indent their SQL code (they're writing in on their own laptop in the interview) and (ii) don't use WITH, they write subqueries instead. For my brain, indenting is a major factor to keep track of what's going on in the SELECT and be able to scan it to make sure it's good, while using WITH allows me to think sequentially, and also keep the code sequential. Using sub-queries requires more indentation and reverses the order of the SELECTs, to me it's a mental overhead that hinders clarity.

See this past article on How I write SQL code.

Why SQL gets overlooked

There is no one answer to this, because Data Scientists come from many different educational backgrounds and companies.

People coming from Software Engineerings background may miss out on SQL because we (I also have a Comp.Sci. degree) spend a lot of time learning and coding in imperative languages like C, C++, Java, Python. In the last 10 years there was a lot of buzz around Functional Programming (FP) and languages like Haskell and F#, and a lot of good functional patterns made it into mainstream programming languages and practice. But there's not a lot of buzz around Declarative Programming and languages like Prolog and SQL, and I don't think it's taught widely at Universities ( I was lucky enough to learn Prolog and SQL at University). In a declarative language like SQL we declare what results we want, and we let the runtime (query optimizer and execution engine) figure out the best way to get it.

People coming from math and science background probably invest their time learning Python and all the interesting libraries and frameworks that we have today. After all, it's hard to get yourself to invest time in SQL, if you can spend the same time playing around with Reinforcement Learning in Pytorch.

Lastly, there are companies, where Data Science teams haven't figured this out yet, so the candidates haven't learned this through osmosis. In these teams the Data Scientists block waiting for other people to write SQL for them to extract data from a DWH and receive it in a flat file.

Why SQL is worth knowing

1. "SQL just another language, like Python or Java!" ... and you can't expect everybody to know all programming languages. I strongly disagree with this framing. SQL has been around since the 1970s, it's the de facto standard for getting data out of a (relational) database system. Even database systems that start out without SQL support usually end up with some sort of SQL support, because it's just so damn useful. In an analytics/DS/ML setting, the Data Warehouse (DWH) will almost always be queried with SQL (or something resembling SQL, like HiveQL). This even tends to happen if the underlying storage is not strictly relational, like when running a Presto engine on flat files stored on S3. Yes, there are ML projects that don't deal with relational data, like image, video and audio processing, but in my experience (i) in real life, there's always projects that do, and (ii) even something like image recognition will usually have additional input (eg. when recognizing faces on an image, it's worth knowing who was tagged on the image, what the location and time was, etc, and these additional fields will come from a DWH).

SQL is not just like a language like Python or Java. It's been around since the 1970s, and there is no sign that it will go away anytime soon, including humans directly writing SQL.

2. "I usually just get the data out and do it in Python!" This breaks if there is too much data "to get out", which can happen even at medium sized companies. Also, it's very inefficient and slow, since all the data has to be copied out from the database disks, over a network, to a laptop or devserver. Usually, the combined program to get the data out and do the processing is harder to maintain and more error-prone then a flat SQL query like above. And the biggest reason: if you do it in Python, things like JOINs and GROUP BYs will be executed by the query optimizer taking into account indexes, storage patterns, statistics, etc --- it will be a lot faster, since (i) only the relevant data needs to be read from disk (ii) the execution engine is highly optimized C/C++/Java/Assembly code (iii) instead of all data getting transfered out of the database, only the results are.

3. "Data Analysts / Data Engineers write SQL queries for me!" At many companies, there are separate roles that do most of the SQL. But, in my experience, even at companies like this (like Facebook), Data Scientists need to and should write SQL:

it's unreasonable to block and wait for somebody to write a 20-50 line SQL which takes 5-10 minutes to write
the Data Engineering person or team may be busy or have long turnaround times
at this or a future job, there may be no other team/person to do this
the other team/person may make a mistake, which could invalidate the Data Scientist's results

If my current team of Data Scientists would block on the Data Engineering team to write SQL queries, our productivity/impact would at least be 10x less that it is today.

4. It's just SELECTs! Data Scientists mostly just have to write SELECTs (versus designing tables, choosing indexes, writing upserts, etc.), which is a relatively small part of SQL overall!

5. Writing SQL absurdly increases the value of Data Scientists! In my experience, if a Data Scientist or a team of Data Scientists writes their own SQL, they will become experts at the data and metrics. Add to this Data Scientist's objectivity, understanding of statistics, and ability to put models on top. Since they write their own SQL, they don't block on other teams, so this means they can answer incoming questions very quickly. Eg. "the Finance team came up with this forecast for Q4, but it may be biased by company OKRs, what do you think?" Get the relevant metrics out with SQL, play around with it, put a Prophet model on top with some external regressors, and make a better forecast than the Finance team with a 1-day turnaround. Altough this is not the core job of a Data Science team, this makes us absurdly valuable to the company, which means good salaries, good raises, no layoffs, plus headcount, etc.

Conclusion

If you don't know SQL, learn it! 😀