Solving 5 algorithmic interview questions

Marton Trencseni - Sat 26 March 2022 - Math

Introduction

Recently I was considering whether to introduce some CS style algorithmic interview questions into our Data Science hiring loop, since having an understanding of algorithms and data structures can be useful for Data Scientists. Not having done this soft of interview for a few years I picked up my copy of Daily Coding Problem and starting solving a few problems to refresh my feeling for what it feels like as a candidate, and whether it would give us any useful signals.

Daily coding problem

Arrays #1

Given an input array li of numbers, return an array of numbers where the ith element is the multiplication of all other elements except the ith from li.

# easy, with division
def simple(li):
    product = reduce((lambda x, y: x * y), li)
    return [int(product/i) for i in li]

# tricky, without divisions
def tricky(li):
    befs = deepcopy(li)
    for i in range(1, len(li)):
        befs[i] = befs[i-1] * li[i]
    afts = deepcopy(li)
    for i in range(len(li)-2, -1, -1):
        afts[i] = afts[i+1] * li[i]
    result = [befs[i-1]*afts[i+1] for i in range(1, len(li)-1)]
    result.insert(0, afts[1])
    result.append(befs[len(li)-2])
    return result

N = randint(5, 10)
li = [randint(1, 10) for _ in range(N)]
print(f'Input: {li}')
result = simple(li)
print(f'Result: {result}')
result = tricky(li)
print(f'Result: {result}')

Yields:

Input: [2, 5, 1, 5, 6, 5, 8, 7, 9, 7]
Result: [2646000, 1058400, 5292000, 1058400, 882000, 1058400, 661500, 756000, 588000, 756000]
Result: [2646000, 1058400, 5292000, 1058400, 882000, 1058400, 661500, 756000, 588000, 756000]

Quick and dirty large scale test:

num_tests = 1000*1000
for _ in range(num_tests):
    N = randint(2, 10)
    li = [randint(1, 10) for _ in range(N)]
    assert(simple(li) == tricky(li))
print(f'{num_tests} random tests passed')

Arrays #2

Given an input array li of ints, what is the minimum range that need to be sorted? Ie. elements outside of this range are already in their correct position in li.

# easy, sort and compare
def simple(li):
    sd = sorted(li)
    eq = [i for i in range(N) if li[i] != sd[i]]
    return (eq[0], eq[-1]) if len(eq) > 0 else (0, N-1)

# tricky
def tricky(li):
    mins = deepcopy(li)
    for i in range(len(li)-2, -1, -1):
        if mins[i+1] < mins[i]:
            mins[i] = mins[i+1]
    maxs = deepcopy(li)
    for i in range(1, len(li)):
        if maxs[i-1] > maxs[i]:
            maxs[i] = maxs[i-1]
    mins = [i for i in range(len(li)) if
            mins[i] != li[i]]
    maxs = [i for i in range(len(li)) if maxs[i] != li[i]]
    mn = mins[0] if len(mins) > 0 else 0
    mx = maxs[-1] if len(mins) > 0 else len(li)-1
    return mn, mx

Yields:

Input: [2, 4, 1, 10, 3, 8, 7]
Range to be sorted: [0, 6]
Range to be sorted: [0, 6]

Quick and dirty large scale test:

num_tests = 1000*1000
for _ in range(num_tests):
    N = randint(1, 10)
    li = [randint(1, 10) for _ in range(N)]
    assert(simple(li) == tricky(li))
print(f'{num_tests} random tests passed')

Arrays #3

Given an array of integers li, what is largest contiguous sub-array sum?

def simple(li):
    max_sum, max_range = 0, []
    for mn in range(len(li)):
        for mx in range(mn, len(li)):
            s = sum(li[mn:mx+1])
            if s > max_sum:
                max_sum = s
                max_range = li[mn:mx+1]
    return sum(max_range)

def tricky(li):
    best, mx = 0, 0
    for i in range(len(li)):
        mx = max(li[i], mx+li[i])
        best = max(best, mx)
    return best

N = randint(1, 10)
li = [randint(-10, 10) for _ in range(N)]
print(f'Input: {li}')
max_range = simple(li)
print(f'Max sum: {max_range}')
max_range = tricky(li)
print(f'Max sum: {max_range}')

Yields:

Input: [7, 7, -10, 9, -1, -2, 10, 10, -9]
Max sum: 30
Max sum: 30

Quick and dirty large scale test:

num_tests = 1000*1000
for _ in range(num_tests):
    N = randint(1, 10)
    li = [randint(-10, 10) for _ in range(N)]
    assert(simple(li) == tricky(li))
print(f'{num_tests} random tests passed')

Strings #1

Given a word w and a string s, find starting locations of s that are anagrams of w (anagram = same length, same letters, different order).

# simple, continuously re-compute the per-character sums
def simple(w, s):
    return [i for i in range(len(s)) if Counter(w) == Counter(s[i:i+len(w)])]

# tricky, maintain the running sum
def tricky(w, s):
    ref = defaultdict(int)
    for c in w: ref[c] += 1
    inds, run = [], deepcopy(ref)
    for i in range(0, len(s)):
        if i >= len(w):
            run[s[i-len(w)]] += 1
        run[s[i]] -= 1
        if sum([abs(x) for x in run.values()]) == 0:
            inds.append(i-len(w)+1)
    return inds

w, s = 'bb', 'abbxbbbaabb'
print(f'w: {w}, s: {s}')
result = simple(w, s)
print(f'Result: {result}')
result = tricky(w, s)
print(f'Result: {result}')

Yields:

w: bb, s: abbxbbbaabb
Result: [1, 4, 5, 9]
Result: [1, 4, 5, 9]

Quick and dirty large scale test:

num_tests = 1000*1000
for _ in range(num_tests):
    w = str(randint(1, 999))
    s = str(randint(1000, 9999999999))
    assert(simple(w, s) == tricky(w, s))
print(f'{num_tests} random tests passed')

Strings #2

Given a list of words wl, find all pairs of words so that the concatenation of the words is a palindrome (eg. 'ab'+'a' is a palindrome).

def reverse(s):
    return s[::-1]

def is_palindrome(s):
    return s == reverse(s)

def simple(li):
    return sorted([(i, j) for i, j in permutations(range(len(li)), 2) if is_palindrome(li[i] + li[j])])

def tricky(wl):
    d = {w:i for i, w in enumerate(wl)}
    inds = []
    for i, s in enumerate(wl):
        for j in range(len(s)+1):
            pre = s[:j]
            suf = s[j:]
            if reverse(pre) in d.keys() and is_palindrome(suf) and i != d[reverse(pre)]:
                inds.append((i, d[reverse(pre)]))
            if reverse(suf) in d.keys() and is_palindrome(pre) and i != d[reverse(suf)]:
                inds.append((d[reverse(suf)], i))
    return sorted(list(set(inds)))

wl = ['ab', 'bba', 'xx', 'x', 'bab', 'a']
print(f'Input: {wl}')
result = simple(wl)
print(f'Result sum: {result}')
result = tricky(wl)
print(f'Result sum: {result}')

Yields:

Input: ['ab', 'bba', 'xx', 'x', 'bab', 'a']
Result sum: [(0, 1), (0, 5), (2, 3), (3, 2), (4, 0), (5, 1)]
Result sum: [(0, 1), (0, 5), (2, 3), (3, 2), (4, 0), (5, 1)]

Quick and dirty large scale test:

num_tests = 1000*1000
for _ in range(num_tests):
    l = randint(5, 10)
    wl = [str(randint(1, 9999)) for _ in range(l)]
    assert(simple(wl) == tricky(wl))
print(f'{num_tests} random tests passed')

Conclusion

My opinion, from having done many Software/Data Engineering loops (both as a candidate and an interviewer), is that I don't like these types of questions, because:

Candidates who tend to stress in such situations (like me, I get high blood pressure) will do much worse than in real life
The specific questions/problems are not relevant to the job (whether SWE or DS)
These interview situations are a competition against time, but being able to solve a 5-10 line programming puzzle in 10 minutes is like saying: "we want to hire people who can run fast, so they will spend less time walking to and from lunch, so they'll have more time left to work"
These sort of problems usually come from a question pool (like a book), and the questions tend to be alike, so they can be prepared for. Candidates can prepare for such interviews by solving a lot of similar problems, and if they're lucky, they'll have seen the problem on the interview, or it will follow a usual pattern (eg. use a hashmap). These sort of questions favor more junior candidates, who studied algorithms in school more recently, and who have more time to prepare (no family, more hungry).

Realistically, in a live interview situation, I would probably fail on 3-5 of these 5 questions [in a 10-15 minute limit].

I think the flaw is to expect candidates to come up with the more tricky efficient solution. I think for Data Scientists, more signal can be gained from telling them to come up with the simple, less efficient solution (and instead asking them to be as pythonic as possible in their solution). Data Scientists tend to use libraies (and SQL query engines) to do the heavy lifting, almost all the code they write is essentially "driver code"; being able to understand the structure of the problem and writing a clean Python solution is good enough.