How good a Data Scientist is GPT-3?

Marton Trencseni - Sun 31 July 2022 - gpt

Recently I have been playing around with OpenAI's GPT-3, and I am very impressed by it. It reminds me of the famous Arthur C. Clarke quote, “Any sufficiently advanced technology is indistinguishable from magic.” If you don't know what GPT-3 is, this is a good introduction with a lot of links.

Recently, a Google engineer made the claim that LaMDA, one of Google's proprietary internal AI systems, is sentient. A more useful perspective on such advanced statistical AI systems is that they are an extremely useful, wonderful illusion.

GPT-3

You can also talk to GPT-3: just create an account on OpenAI and go to the sandbox!

Here I will show a "conversation" with GPT-3 to gauge how good a Data Scientist — or an illusion of a Data Scientist — it is. Mostly these are questions I might ask a candidate in a Data Science interview. The questions are independent of one another, so they work with a context-free system such as GPT-3. By context-free I mean that GPT-3 does not maintain conversation state: each question-and-answer pair is independent from the AI's perspective.

GPT-3 is trained on large amounts of text downloaded from the Internet, so I expect it to do well on questions where high-quality answers already exist on the Internet, and it "just" has to return them in well-formed English. For questions that require more reasoning and combining facts (from the question and the training data), I expect less impressive answers.

Culture

Q: What does a good Data Scientist look like?

A: There is no one-size-fits-all answer to this question, as the ideal data scientist depends on the specific needs of the organization. However, some key qualities that a good data scientist should possess include strong analytical and problem-solving skills, the ability to think creatively, and experience with data-driven decision making. Additionally, a good data scientist should be able to effectively communicate their findings to both technical and non-technical audiences.

Note: This was an easy question, because there are lots of sentences like this on the Internet, and the same applies to the following questions. The answer is still impressive and creates a strong illusion, because it is high quality and fits the structure of the question.

Q: What's the difference between a Data Scientist and a Data Engineer?

A: A data engineer is responsible for designing, building, and maintaining the data infrastructure that a company uses to collect, store, and process data. A data scientist is responsible for analyzing and extracting insights from data.

Q: What's the difference between a Data Scientist and a Product Analyst?

A: A data scientist is someone who is responsible for extracting meaning from data. A product analyst is someone who is responsible for analyzing a product and determining its feasibility.

Q: What's the difference between a Data Scientist and an ML Engineer?

A: A Data Scientist is someone who is responsible for extracting insights from data. An ML Engineer is someone who is responsible for building and deploying machine learning models.

Q: In a tech company working on a SaaS product, what's a good ratio of Data Scientists and Product Managers?

A: There is no definitive answer to this question since it can vary depending on the specific needs of the company and product. However, a good rule of thumb is to have at least one Data Scientist for every two Product Managers.

Q: In a tech company working on a SaaS product, what's a good ratio of Data Scientists, Data Engineers, Software Engineers and Product Managers?

A: There is no definitive answer to this question, as it will vary depending on the specific needs of the company and product. However, a good general ratio to aim for is 1 Data Scientist, 1 Data Engineer, 2 Software Engineers, and 1 Product Manager. This will ensure that there is enough expertise and manpower dedicated to each area, while still allowing for some overlap and collaboration between teams.

Note: the AI is contradicting itself: in the previous question it recommended a 1:2 DS:PM ratio, here it's recommending 1:1. I'm also not sure where the 1:1 DS:DE ratio is coming from; in my experience at tech companies that would be too many DEs.

Coding

Q: Given a list li, write a function top_sum(li, n) that returns the sum of the top n elements. Eg. top_sum(li=[5, 1, 5, 2, 0], n=3) returns 12.

A:

def top_sum(li, n):
  sorted_list = sorted(li, reverse=True)
  return sum(sorted_list[:n])
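
As a quick sanity check (the assert is mine, not GPT-3's), the function returns the expected result on the example from the question:

assert top_sum(li=[5, 1, 5, 2, 0], n=3) == 12  # top 3 elements: 5 + 5 + 2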

Q: Given two lists keys and values of identical length containing key-value pairs, write a generator function yield_sorted_kvs(keys, values) which yields key-value pairs from the two lists, in ascending key order. Note that the input list keys may not be ordered. Eg. yield_sorted_kvs(keys=[3, 1, 2], values=['a', None, 5]) yields (1, None) .. (2, 5) .. (3, 'a').

A:

def yield_sorted_kvs(keys, values):
    d = {k: v for k, v in zip(keys, values)}
    for k in sorted(d.keys()):
        yield k, d[k]
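
This one also checks out against the example in the question (again, the assert is mine). One caveat: the dict comprehension silently drops duplicate keys, but the question implies the keys are unique.

assert list(yield_sorted_kvs(keys=[3, 1, 2], values=['a', None, 5])) == [(1, None), (2, 5), (3, 'a')]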

Note: GPT-3 nailed both, even though these exact top_sum() and yield_sorted_kvs() functions are not on the Internet. Also note that GPT-3 is not specifically trained to write code; being able to implement small functions like this is just an "accidental side-effect". OpenAI is working on a model called OpenAI Codex specifically to write code.

Q: Can you show me some Python code that you consider beautiful?

A:

def main():
    print("Hello, world!")

if __name__ == "__main__":
    main()

Note: This is a very weak answer, it breaks the illusion. Let's try again.

Q: Can you show me some Python code that you consider beautiful, that implements a famous algorithm of your choosing?

A:

def fft(x):
    N = len(x)
    if N <= 1: return x
    even = fft(x[0::2])
    odd =  fft(x[1::2])
    T = [exp(-2j*pi*k/N)*odd[k] for k in range(N//2)]
    return [even[k] + T[k] for k in range(N//2)] + \
           [even[k] - T[k] for k in range(N//2)]

Note: This is a great answer, but it's copy/paste from this Stackoverflow answer. It would be interesting to know why the AI picked FFT, and this particular implementation!
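
One nitpick: as given, the snippet does not run, because it's missing the import line from that Stackoverflow answer, and it assumes len(x) is a power of two. With the import added, it checks out against a naive DFT (the check below is my addition):

from cmath import exp, pi

def dft(x):
    # naive O(N^2) discrete Fourier transform, used only to verify fft()
    N = len(x)
    return [sum(x[n] * exp(-2j*pi*k*n/N) for n in range(N)) for k in range(N)]

x = [1, 2, 3, 4]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))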

Q: Can you show me some SQL code that you consider beautiful?

A: This is a difficult question since beauty is subjective. However, here is an example of SQL code that is well written and easy to read:

SELECT
    first_name,
    last_name,
    email
FROM
    users
WHERE
    active = 1
ORDER BY
    last_name,
    first_name

Note: this is actually a great answer. I agree with GPT-3 that the only way to interpret "beautiful" in the context of SQL is formatting, and this is nicely formatted SQL code; I follow the same formatting myself. But the illusion would work better if the sample query were a bit longer:

Q: Can you show me some SQL code that you consider beautiful, that involves use of the WITH clause and also a JOIN?

A:

WITH user_sessions AS (
  SELECT user_id, session_id, start_time, end_time
  FROM sessions
  WHERE user_id = 1
)
SELECT u.username, s.session_id, s.start_time, s.end_time
FROM users u
JOIN user_sessions s ON u.user_id = s.user_id

Note: here the illusion breaks, because now it's using a different style of formatting!

A/B testing

Q: Let's say we're running an A/B test with two variants. How should I split traffic to maximize the probability of getting a statistically significant result?

A: The traffic should be split evenly between the variants.

Note: a lot of people don't know this in interviews.
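
To see why, here is a minimal sketch (my code, not GPT-3's) that computes the approximate power of a two-sided two-proportion z-test at a fixed total sample size, varying the traffic split. Power peaks at (very nearly) a 50/50 split:

from scipy.stats import norm

def power(p_a, p_b, n_total, split, alpha=0.05):
    # approximate power of a two-sided two-proportion z-test
    n_a, n_b = n_total * split, n_total * (1 - split)
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = abs(p_b - p_a) / se
    return norm.cdf(z - norm.ppf(1 - alpha / 2))

for split in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(split, round(power(p_a=0.10, p_b=0.12, n_total=10_000, split=split), 3))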

Q: Let's say we're running an A/B test with two variants. If we split traffic evenly between A and B, we maximize the probability of getting a statistically significant result. However, in real life usually traffic is not split evenly. Why?

A: There are many reasons why traffic might not be split evenly in a real-life A/B test. For example, if the variants are assigned randomly, there is a chance that one variant will be assigned to more users than the other. In addition, some users may be more likely to use one variant over the other, or one variant may be more popular than the other.

Note: this one is surprising. The answer I was expecting was about managing the risk of trying a new version. Let's see if we can lead it there.

Q: Let's say we're running an A/B test with two variants. If we split traffic evenly between A and B, we maximize the probability of getting a statistically significant result. However, in real life A/B tests are usually not configured to split traffic evenly. In most A/B tests, the new variant gets 10 or 20% of traffic. Why?

A: The reason is that if the new variant is worse than the old variant, we want to minimize the number of users who experience a worse experience. If the new variant is better than the old variant, we want to quickly find out and roll it out to all users. So in most A/B tests, the new variant gets a small amount of traffic at first. If the results look promising, the new variant gets more traffic. If the results are not promising, the new variant gets less traffic.

Note: by priming GPT-3 in the question, we were able to direct it to the expected answer.

Q: Let's say we're running an A/B test with two variants of an email. The evaluation metric is CTR. One branch has a CTR of 4%, the other branch has a CTR of 91%. What do you think?

A: There is a significant difference in CTR between the two variants of the email. The branch with a CTR of 91% is likely to be the more successful email.

Note: while technically correct, this is not a strong answer. A strong(er) answer would be to say that this is an unlikely result, almost certainly due to a bug or measurement error. See Twyman's law.

Q: Can you explain to me what a p-value is in A/B testing?

A: A p-value is the probability that the results of a hypothesis test are due to chance. A low p-value indicates that the results are not due to chance, and are therefore significant.

Q: Can you explain to me what the difference between frequentist and bayesian A/B testing is?

A: Frequentist A/B testing is based on the idea of hypothesis testing, where we test a null hypothesis (usually that there is no difference between the two groups) against an alternative hypothesis (usually that there is a difference between the two groups). We then calculate a p-value, which is the probability of seeing our data if the null hypothesis were true. If the p-value is less than a certain threshold (usually 0.05), we reject the null hypothesis and conclude that there is a difference between the two groups.

Bayesian A/B testing is based on the idea of posterior inference, where we use the data to calculate the posterior probability of the null hypothesis being true. We then compare this posterior probability to a prior probability, which represents our beliefs about the null hypothesis before we saw the data. If the posterior probability is greater than the prior probability, we conclude that the null hypothesis is more likely to be true.

Note: the last 2 questions were simple, but the illusion is still impressive!
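
To make the contrast concrete, here is a minimal sketch (my code, with made-up numbers and flat Beta(1, 1) priors) that runs both flavors on the same data: a frequentist two-proportion z-test, and a Bayesian Monte Carlo estimate of the probability that B beats A:

import numpy as np
from scipy.stats import norm

conversions_a, n_a = 200, 2000  # hypothetical data
conversions_b, n_b = 240, 2000

# frequentist: two-sided two-proportion z-test with pooled variance
p_a, p_b = conversions_a / n_a, conversions_b / n_b
p_pool = (conversions_a + conversions_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
p_value = 2 * (1 - norm.cdf(abs(p_b - p_a) / se))

# bayesian: Beta(1, 1) priors, Monte Carlo estimate of P(p_b > p_a)
rng = np.random.default_rng(42)
samples_a = rng.beta(1 + conversions_a, 1 + n_a - conversions_a, size=100_000)
samples_b = rng.beta(1 + conversions_b, 1 + n_b - conversions_b, size=100_000)

print(f'p-value = {p_value:.3f}, P(B > A) = {(samples_b > samples_a).mean():.3f}')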

Q: Can you recommend a good webpage to me that explains the difference between frequentist and bayesian A/B testing?

A: There is a lot of information out there on this topic. However, this webpage provides a clear explanation of the differences between frequentist and bayesian A/B testing: https://www.evanmiller.org/bayesian-ab-testing.html

Note: here the illusion is strong again; I would also recommend Evan's blog on this topic!

Q: In the context of A/B testing, can you explain to me how CUPED uses historical data?

A: CUPED uses historical data to help identify which variation of a given test is performing better. By looking at past performance, CUPED can more accurately predict which variation is more likely to lead to a successful outcome for the current test. This helps to ensure that the results of the test are more reliable and accurate.

Note: this is a very generic answer; it misses the key point, that CUPED uses each user's pre-experiment data to reduce the variance of the experiment metric. It's interesting to note that originally I had a typo in my question: I wrote "used" instead of "uses". This totally tripped up GPT-3, which answered as if CUPED were a company:

Q: In the context of A/B testing, can you explain to me how CUPED used historical data?

A: CUPED used historical data to analyze the conversion rates for each variation of their A/B test. They then used this data to determine which variation was more effective in converting users.
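
Since neither answer explains the mechanism, here is what CUPED (Controlled-experiment Using Pre-Experiment Data) actually does: it uses each user's pre-experiment metric as a covariate to remove explainable variance from the experiment metric, so the same test reaches significance with fewer samples. A minimal sketch with simulated data (my code, not GPT-3's):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
pre = rng.normal(100, 20, size=n)       # pre-experiment metric per user
post = pre + rng.normal(0, 10, size=n)  # experiment metric, correlated with pre

# CUPED adjustment: subtract the part of post explained by pre
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(np.var(post), np.var(post_cuped))  # variance drops ~5x, mean is unchanged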

Conclusion

Overall I find GPT-3 to be very impressive. It performs better as a question-answering bot than many human Data Scientists I interview. Although this AI is not yet good enough to take away any technical contributor's job, it's not hard to imagine systems that are trained more specifically for a domain (such as Data Science and programming) and are potentially 10-100x bigger.

I will continue this post with a Part II, where I will ask GPT-3 modeling questions.