Python types for Data Scientists - Part III

Marton Trencseni - Fri 22 April 2022 - Python

Introduction

In the first post I showed how to get started using Python static type checking in ipython notebooks. The second post looked at slightly more advanced uses of typing to further increase the safety and readability of code. Here I continue and look at a few more aspects of type hinting:

  • type check errors and runtime errors
  • where type hints don't work
  • Abstract Base Classes vs. Protocols
  • types for class variables vs instance variables

The ipython notebook is up on GitHub. The best reference is the official Python documentation of the typing module.


Type check errors and runtime errors

It's important to remember that in Python, type hints are optional and are ignored (not enforced) by the Python runtime. Type hints are interpreted by external programs. In these examples, I use nb_mypy, which runs mypy to do the type checking. Then, irrespective of the result of type checking, the regular Python runtime runs the code (and ignores all type hints). In other contexts, such as when using an IDE, the IDE runs type checks in the background and shows errors, or uses the type hint information for code completion.

Given the roots of Python, I find this to be a good trade-off to introduce typing and get 80% of the benefits.

But it leads to some weird behaviour, which cannot be changed with nb_mypy: even if there is a type check error in the current cell, the code in the cell will still be run after the type check has completed. This leads to some confusing outputs. For example:

def foo(i: int) -> None:
    print(i)

foo("hello")

Output:

error: Argument 1 to "foo" has incompatible type "str"; expected "int"
hello

First the nb_mypy type checker runs, finds the type error and prints it, but then the code is executed anyway. And since the Python runtime ignores all type hints, the code runs just fine, because to Python the function foo() is equivalent to:

def foo(i):
    print(i)

Where type hints don't work

There are some cases where writing type hints does not work as we'd expect. The most common one is the loop variable of a for loop:

for i: int in range(3):
    print(i)

Output:

SyntaxError: invalid syntax

This is not a type check error, it's a syntax error. We cannot put the type hint for i in the for loop itself. It has to go before:

i: int
for i in range(3):
    print(i)

However, at least in such simple cases, I would just skip the type hint, since it's quite ugly. It's also not required: the type checker can infer the int type from range(), so the following still produces a type check error:

def f(s: str) -> None:
    pass

for i in range(3): # okay
    f(i)           # not okay

Output:

error: Argument 1 to "f" has incompatible type "int"; expected "str"
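The "declare first, loop after" pattern also covers tuple unpacking in a for loop. The sketch below uses compile() to show the syntax error without stopping the script, then pre-declares two loop variables. The names name, count and totals are hypothetical and only serve this illustration.

```python
from typing import Dict

# Annotating the loop variable inline is rejected by the parser,
# so compile() raises SyntaxError before any code runs.
try:
    compile("for i: int in range(3):\n    pass\n", "<example>", "exec")
    inline_hint_ok = True
except SyntaxError:
    inline_hint_ok = False
print(inline_hint_ok)  # False

# Declaring the variables beforehand is valid syntax; a bare
# annotation like this does not assign any value.
name: str
count: int
totals: Dict[str, int] = {}
for name, count in [("a", 1), ("b", 2), ("a", 3)]:
    totals[name] = totals.get(name, 0) + count
print(totals)  # {'a': 4, 'b': 2}
```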

Abstract Base Classes vs. Protocols

In the previous post, there was the example of declaring a Protocol for addability:

from typing import Protocol, TypeVar

class Addable(Protocol):
    def __add__(self, other): # anything that declares __add__() can stand in for an Addable
        raise NotImplementedError

T = TypeVar('T', bound=Addable)

def add(a: T, b: T):
    print(type(a), type(b))
    print(a+b) # checks for __add__(), uses __str__()

add(int(1), int(2)) # okay

In this example, whatever we pass to add() needs to declare an __add__() method. So if we define our own class MyInt without one, the type check fails:

class MyInt():
    num: int
    def __init__(self, num: int):
        self.num = num

add(MyInt(1), MyInt(2)) # not okay

Output:

error: Value of type variable "T" of "add" cannot be "MyInt"

We can fix this by implementing __add__() in MyInt:

class MyInt(): # note that MyInt does not inherit Addable
    num: int
    def __init__(self, num: int):
        self.num = num
    def __add__(self, other):
        return MyInt(self.num + other.num)
    def __str__(self): # so print() works
        return str(self.num)

T = TypeVar('T', bound=Addable)

def add(a: T, b: T):
    print(type(a), type(b))
    print(a+b) # checks for __add__(), uses __str__()

add(MyInt(1), MyInt(2)) # okay
add(int(1), int(2))     # okay

Output:

<class '__main__.MyInt'> <class '__main__.MyInt'>
3
<class 'int'> <class 'int'>
3

This works, even though MyInt does not mention Addable in the class declaration at all!

What happens if we go back to the Addable declaration and change Protocol to ABC, like:

from abc import ABC

class Addable(ABC):
    def __add__(self, other):
        raise NotImplementedError

...

add(MyInt(1), MyInt(2)) # not okay
add(int(1), int(2))     # not okay

We get a type error from both lines:

error: Value of type variable "T" of "add" cannot be "MyInt"
error: Value of type variable "T" of "add" cannot be "int"

Neither MyInt nor int can stand in for an Addable if it's an ABC. Only classes that derive from an abstract base class can stand in for it.

Let's change MyInt to derive from Addable:

class MyInt(Addable):
    ....

add(MyInt(1), MyInt(2)) # okay, MyInt derives from Addable
add(int(1), int(2))     # not okay

Output:

error: Value of type variable "T" of "add" cannot be "int"

MyInt is now fine, but int still cannot stand in for an Addable. This shows the difference between Protocol and ABC. With Protocol, anything that implements the declared functions can stand in for that type, irrespective of inheritance. With ABC, only types that inherit from the base class (in the example above, MyInt inherits from Addable) can stand in for that type.
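The same structural versus nominal distinction can also be observed at runtime with isinstance(), as a rough analogue of what the type checker does statically. This is a sketch under assumptions: the class names SupportsAdd, AddableBase, Standalone and Derived are invented here, and typing.runtime_checkable only checks that the method exists, not its signature.

```python
from abc import ABC, abstractmethod
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsAdd(Protocol):      # structural: matched by shape
    def __add__(self, other): ...

class AddableBase(ABC):           # nominal: matched by inheritance
    @abstractmethod
    def __add__(self, other): ...

class Standalone:                 # has __add__(), inherits from neither
    def __add__(self, other):
        return self

class Derived(AddableBase):       # explicitly inherits from the ABC
    def __add__(self, other):
        return self

print(isinstance(1, SupportsAdd))             # True
print(isinstance(Standalone(), SupportsAdd))  # True
print(isinstance(1, AddableBase))             # False
print(isinstance(Standalone(), AddableBase))  # False
print(isinstance(Derived(), AddableBase))     # True
```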

Types for class variables vs instance variables

Let's try this code:

class Foo:
    num: int # num is NOT a class variable
    def __init__(self, num: int):
        self.num = num

f = Foo(1)
print(f.num)        # prints 1
g = Foo(2)
print(f.num, g.num) # prints 1 2
print(Foo.num)      # AttributeError: type object 'Foo' has no attribute 'num'

Output:

1
1 2
AttributeError: type object 'Foo' has no attribute 'num'

Here, we declare the num instance variable of class Foo to be of type int. We create two instances of Foo, and we see that each of them carries a separate num. Then we try to access Foo.num as a class variable, and we get an AttributeError at runtime, because the bare annotation never created it.
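To see what a bare annotation in a class body actually does, we can inspect the class directly. This is a small sketch; the class name Plain is hypothetical and mirrors the shape of Foo above.

```python
class Plain:
    num: int  # annotation only, no value is assigned

    def __init__(self, num: int):
        self.num = num

# The annotation is recorded as metadata on the class...
print(Plain.__annotations__)   # {'num': <class 'int'>}
# ...but no class attribute is created.
print(hasattr(Plain, "num"))   # False
# The value lives only in each instance's own namespace.
print(vars(Plain(1)))          # {'num': 1}
```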

Let's make one minor modification of the code, and assign some initial value to num:

class Foo:
    num: int = 0 # num is now a class variable
    def __init__(self, num: int):
        self.num = num

f = Foo(1)
print(f.num)          # prints 1
g = Foo(2)
print(f.num, g.num)   # prints 1 2
print(Foo.num)        # prints 0

Assigning a value in the class body creates num as a class variable, so Foo.num now exists, while the num set in __init__() shadows it on each instance. Note that both the class variable and the instance variable carry the int type:

f = Foo(1)
print(Foo.num, f.num) # prints 0 1
f.num = "hello"       # not okay
Foo.num = "world"     # not okay

Output:

error: Incompatible types in assignment (expression has type "str", variable has type "int")
error: Incompatible types in assignment (expression has type "str", variable has type "int")
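When a variable is meant to live only on the class, the typing module offers ClassVar to say so explicitly. The sketch below is an illustration and not from the notebook: the Tracker class and its created counter are made-up names, and the note about the rejected assignment describes how a type checker such as mypy is expected to behave, not something Python enforces at runtime.

```python
from typing import ClassVar

class Tracker:
    created: ClassVar[int] = 0  # shared by all instances
    num: int                    # one value per instance

    def __init__(self, num: int) -> None:
        self.num = num
        Tracker.created += 1    # update the shared counter on the class

a = Tracker(10)
b = Tracker(20)
print(Tracker.created)  # 2
print(a.num, b.num)     # 10 20

# a.created = 5  # a type checker is expected to reject assigning
#                # to a ClassVar through an instance
```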

Conclusion

This post concludes this short series on Python typing for Data Scientists. I think the jury is still out on whether type hints are worth it in Data Science code (which tends to be short, linear and less structured than application software), but it's good to know that type hints exist and how they work.