Python types for Data Scientists - Part II

Marton Trencseni - Sun 17 April 2022 - Python

Introduction

In the previous post I showed how to get started using Python static type checking in IPython notebooks. Here I will look at slightly more advanced uses of typing to further increase the safety and readability of code. The IPython notebook is up on GitHub. The best reference is the official Python documentation of the typing module.


Optional types and Union

Sometimes we want to declare that something can be of a certain type, or None. Imagine we don't know about numpy.random.random_sample and we're writing a function randoms() to return a random list[float] of length num:

from random import random

def randoms(num: int) -> list[float]:
    if num >= 0:
        return [random() for _ in range(num)]
    else:
        return None # not okay, NoneType is not list[float]

We want to be good programmers and return None if a negative value for num is passed in, but None is not a list[float], so this won't work:

error: Incompatible return value type (got "None", expected "List[float]")

Optional[T] exists for this case. It declares that the type is either T or None:

from typing import Optional

def randoms(num: int) -> Optional[list[float]]:
    if num >= 0:
        return [random() for _ in range(num)]
    else:
        return None # okay, return type is Optional[...]

The same can also be achieved by using Union[]:

from typing import Union

def randoms(num: int) -> Union[list[float], None]:
    if num >= 0:
        return [random() for _ in range(num)]
    else:
        return None # okay, return type is Union[..., None]

Note that None as a type hint is a special case and is replaced by type(None) by Python.
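To make the equivalence concrete, here is a small sketch that inspects the hints at runtime. It only uses typing.get_args from the standard library; the variable name args is just for illustration.

```python
from typing import Optional, Union, get_args

# Optional[T] is shorthand for Union[T, None], so the two hints compare equal.
assert Optional[list[float]] == Union[list[float], None]

# Inside the hint, None has been replaced by its type, NoneType.
args = get_args(Optional[list[float]])
assert args == (list[float], type(None))
```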

What if we're a different kind of programmer, and we want to raise an exception instead of returning None, like:

def randoms(num: int) -> list[float]:
    if num >= 0:
        return [random() for _ in range(num)]
    else:
        raise ValueError # okay, there is no typed way to communicate raised exceptions

This passes the type check: Python's type hints have no way to declare which exceptions a function raises.
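The two styles also look different on the calling side. The sketch below puts them next to each other; the names randoms_or_none() and randoms_or_raise() are hypothetical, chosen only so both variants can live in one file. With the Optional version, a checker such as mypy would flag a call like len(values) unless None is ruled out first.

```python
from random import random
from typing import Optional

def randoms_or_none(num: int) -> Optional[list[float]]:
    """Return num random floats, or None if num is negative."""
    return [random() for _ in range(num)] if num >= 0 else None

def randoms_or_raise(num: int) -> list[float]:
    """Return num random floats, or raise ValueError if num is negative."""
    if num < 0:
        raise ValueError(f"num must be non-negative, got {num}")
    return [random() for _ in range(num)]

# Optional return: rule out None before using the value as a list.
values = randoms_or_none(3)
count = len(values) if values is not None else 0

# Exception: the return type stays list[float]; failure is handled with try/except.
try:
    randoms_or_raise(-1)
    raised = False
except ValueError:
    raised = True
```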

Finally, what if we want to return a bare float when the user asks for a single random number, and None on negative input? Union is the solution:

def randoms(num: int) -> Union[float, list[float], None]:
    if num == 1:
        return random()                       # float
    elif num >= 0:
        return [random() for _ in range(num)] # list[float]
    else:
        return None                           # None

Note that as of Python 3.10, Union[X, Y] can be written as X | Y, but this does not work yet on Python 3.9:

def randoms(num: int) -> float | list[float] | None:
    if num == 1:
        return random()                       # float
    elif num >= 0:
        return [random() for _ in range(num)] # list[float]
    else:
        return None                           # None
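A Union return type pushes some work onto the caller, who has to find out which member came back before using it. The sketch below shows one way to do that with an is None test and isinstance(), both of which type checkers use to narrow the type. The total() helper is hypothetical, and the Union[...] spelling is used so the example also runs on Python 3.9.

```python
from random import random
from typing import Union

def randoms(num: int) -> Union[float, list[float], None]:
    if num == 1:
        return random()
    elif num >= 0:
        return [random() for _ in range(num)]
    else:
        return None

def total(num: int) -> float:
    """Hypothetical caller that sums whatever randoms() returns."""
    result = randoms(num)
    if result is None:
        return 0.0
    if isinstance(result, list):
        return sum(result)  # narrowed to list[float]
    return result           # narrowed to float
```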

Type aliases and NewType

Suppose we are building a library for machine learning and we are using list[float] for feature vectors. One way we can communicate this to the user of our library is by giving function arguments names like feature_vector. We can also accomplish this in our typing, by declaring an alias for list[float]:

FeatureVector = list[float] # type alias

We can now write FeatureVector interchangeably with list[float]:

def predict(fv: FeatureVector) -> float:
    return random()

raw: list[float] = [0.1, 0.2, 0.3]
predict(raw) # okay
fv: FeatureVector = [0.1, 0.2, 0.3]
predict(fv) # okay

Suppose we want to name our types as with type aliases, but more strictly: we only want to accept list[float] values that were explicitly declared as FeatureVector. We can achieve this with NewType:

from typing import NewType

FeatureVector = NewType('FeatureVector', list[float])
# all FeatureVectors are list[float], but not all list[float] are FeatureVectors

def predict(fv: FeatureVector) -> float:
    return random()

raw: list[float] = [0.1, 0.2, 0.3]
predict(raw) # not okay
fv: FeatureVector = FeatureVector([0.1, 0.2, 0.3]) # explicit cast
predict(fv) # okay

Output:

error: Argument 1 to "predict" has incompatible type "List[float]"; expected "FeatureVector"

In the above example, all FeatureVectors are list[float], but not all list[float] are FeatureVectors. So any function that accepts a list[float] will accept a FeatureVector, but not the other way around.
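It may help to see that NewType only exists for the type checker. In the sketch below, calling FeatureVector(...) returns its argument unchanged at runtime, so the value is still a plain list and can be passed to any function that takes list[float]. The mean() helper is hypothetical.

```python
from typing import NewType

FeatureVector = NewType('FeatureVector', list[float])

def mean(values: list[float]) -> float:
    """Hypothetical helper that accepts any list[float]."""
    return sum(values) / len(values)

fv = FeatureVector([0.1, 0.2, 0.3])

# At runtime no wrapper object is created: fv is still a plain list.
assert type(fv) is list

# A FeatureVector is accepted wherever list[float] is expected.
avg = mean(fv)
```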

Generics with TypeVar

Suppose we want to write a function first() which returns the first element of a list, and we want to declare that the list contains things of type T, and the return type will be the same type T. We can accomplish this with a TypeVar:

from typing import TypeVar

T = TypeVar('T') # declare type variable T to be used
def first(li: list[T]) -> T:
    return li[0]

We can also mix TypeVars with Optional to make first() more useful:

T = TypeVar('T') # declare type variable T to be used
def first(li: list[T]) -> Optional[T]:
    return li[0] if len(li) > 0 else None # okay

Note that a TypeVar cannot be bound by what the function body returns. In the example below, T is not inferred to be str (there is no "type solver"), so the type checker reports an error:

T = TypeVar('T') # declare type variable T to be used
def first(li: list[T]) -> T:
    return "hello" # not okay

The error is:

error: Incompatible return value type (got "str", expected "T")
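What does bind T is the call site: each call fixes T from the argument that is passed in. Here is a short sketch using the Optional version of first(); the comments describe what a checker such as mypy is expected to infer, and the variable names are only for illustration.

```python
from typing import Optional, TypeVar

T = TypeVar('T')

def first(li: list[T]) -> Optional[T]:
    return li[0] if len(li) > 0 else None

n = first([1, 2, 3])    # T is bound to int, so n is Optional[int]
s = first(["a", "b"])   # T is bound to str, so s is Optional[str]

# For an empty list there is nothing to infer T from, so we annotate the result.
empty: Optional[float] = first([])
```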

Protocols

Let's look at another example, where we want to add two things:

T = TypeVar('T') # declare type variable T to be used
def add(a: T, b: T) -> T:
    return a+b # checks for __add__()

This will result in a type error, because the type checker cannot know whether T implements __add__():

error: Unsupported left operand type for + ("T")

To achieve the desired typing, we have to use Protocols:

from typing import Protocol

class Addable(Protocol):
    def __add__(self, other): # anything that declares __add__() can stand in for an Addable
        raise NotImplementedError

Here we are declaring a class Addable using typing.Protocol, which declares __add__(). Anything that declares __add__() can stand in for an Addable, even if it's not descended from Addable. For example, an int is an Addable. Examples:

def add(a: Addable, b: Addable) -> Addable:
    print(type(a), type(b))
    return a+b # checks for __add__()

add(str(1), str(2))   # okay
add(int(1), int(2))   # okay
add(int(1), float(2)) # okay
add(int(1), str(2))   # not a typecheck error, but a runtime error

Output:

<class 'str'> <class 'str'>
<class 'int'> <class 'int'>
<class 'int'> <class 'float'>
<class 'int'> <class 'str'>
TypeError: unsupported operand type(s) for +: 'int' and 'str' # coming from the last add()

All four calls pass the type check, because str, int and float are all Addable: each has __add__(). The last one raises a runtime exception, since + is not defined between int and str. Note that this exception comes from running the code, not from the type checker, which reported no errors.
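The same structural matching applies to our own classes. The sketch below defines a hypothetical Money class that never mentions Addable, yet can be passed to add(). It also adds typing.runtime_checkable, which is not used elsewhere in this post, so that isinstance() can be tried out; that check only tests whether the method exists, not whether the addition will succeed.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Addable(Protocol):
    def __add__(self, other): ...

class Money:
    """Hypothetical class; note that it does not inherit from Addable."""
    def __init__(self, cents: int) -> None:
        self.cents = cents

    def __add__(self, other: "Money") -> "Money":
        return Money(self.cents + other.cents)

def add(a: Addable, b: Addable) -> Addable:
    return a + b

total = add(Money(150), Money(250))  # accepted: Money has __add__()

# isinstance() against a runtime_checkable Protocol only looks for the method.
assert isinstance(total, Addable)
assert isinstance(1, Addable)
assert not isinstance(object(), Addable)
```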

There are 2 ways we can think about this mini-problem:

  1. We want to allow adding two different types (e.g. int and float), but only if it makes sense. We want the type checker to raise an error in cases where a runtime exception would be raised (e.g. adding int and str).
  2. We only want to allow adding exactly the same types, e.g. (int, int), (float, float), (str, str). We will see that this is not achievable in Python with generic types.

Let's look at another version of this, where we declare a TypeVar and bind it to be Addable:

T = TypeVar('T', bound=Addable)

def add(a: T, b: T) -> T:
    print(type(a), type(b))
    return a+b # checks for __add__()

add(str(1), str(2))   # okay
add(int(1), int(2))   # okay
add(int(1), float(2)) # okay
add(int(1), str(2))  # typecheck error and runtime error

Here, the last line raises a type check error and a runtime error:

error: Value of type variable "T" of "add" cannot be "object"  # typecheck error coming from the last add()
...
TypeError: unsupported operand type(s) for +: 'int' and 'str'  # runtime error coming from the last add()

Note that the third call, with int and float, still passes the type check and runs fine. So this version implements case 1 above, where different types can still be passed, as long as addition makes sense for them.

One last attempt is a TypeVar restricted to a fixed set of types that can stand in for T. But as before, the type checker does not force both arguments to be exactly the same type. An int is accepted wherever a float is expected, so the mixed call below still passes:

T = TypeVar('T', int, float, str)

def add(a: T, b: T) -> T:
    return a+b # checks for __add__()

add(int(1), float(2)) # okay

It turns out we cannot use binding to get case 2 above, i.e. to force the type checker to make sure that the a and b arguments are actually the same type in add(a, b).
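If case 2 really matters, one fallback is to check at call time instead. The sketch below is only that, a runtime guard and not a static guarantee: add_same() is a hypothetical name, and the type checker will still accept a mixed int and float call, which then fails when the code runs.

```python
from typing import TypeVar

T = TypeVar('T', int, float, str)

def add_same(a: T, b: T) -> T:
    """Hypothetical variant of add() that rejects mixed types when called."""
    if type(a) is not type(b):
        raise TypeError(f"mixed types: {type(a).__name__} and {type(b).__name__}")
    return a + b
```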

Conclusion

In the next article I will look at more uses of Protocol and abstract base classes.