Pinball review
Marton Trencseni - Sat 06 February 2016 - Data
Introduction
Pinball is Pinterest’s open sourced workflow manager / ETL system. It supports defining workflows (DAGs) consisting of jobs and the dependencies between them. Workflows are defined using a combination of declarative-style Python dictionary objects (like JSON) and Python code referenced from these objects. Pinball comes with a dashboard for checking currently running and past workflows.
This review will be shorter than the previous Luigi and Airflow reviews, because Pinball turned out to be not very interesting to me for the following reasons:
- Very little documentation
- Very few recent commits in the Github repo
- Very few meaningful answers to Github issues from the maintainers
- Complicated and undocumented architecture
Unfortunately `pip install pinball` doesn’t work and the maintainers don’t seem to care, so I didn't invest time in actually trying out Pinball; I just read the source code. Since this review is short and opinionated, I recommend also reading the Pinterest posts:
Architecture
Pinball has a modularized architecture. There are 5 modules:
- Master (sits on the DB)
- Scheduler (also accesses the DB)
- Worker (also accesses the DB)
- UI web server (also accesses the DB)
- Command-line
The master module sits on top of a MySQL database (no others are supported) and uses Django for the ORM. The master exposes a synchronization token API over Thrift to the other modules, and that’s all the master does. I think this is an unnecessary layer of abstraction; Airflow's design decision is better: every component sees the DB and uses it to communicate, getting ACID for free; there's no need to define and maintain an API, and no need for Thrift. In the blog post, they say “component-wise design allows for easy alterations”, eg. you could write a different scheduler implementation. But:
- Who’d ever want to write a different scheduler implementation? I'm using an open source project precisely to avoid writing my own ETL system.
- You can change the code in other architectures as well as long as it’s modularized.
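To illustrate the alternative I'm arguing for, here is a minimal sketch (not Pinball or Airflow code; table and function names are mine) of modules coordinating through a single shared database, where an atomic UPDATE doubles as a lock, so no separate token-API layer is needed:

```python
# Sketch of the "everybody sees the DB" design: an atomic UPDATE lets a
# worker claim a job, with the database providing the synchronization.
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for the shared MySQL instance
db.execute("CREATE TABLE jobs (name TEXT PRIMARY KEY, state TEXT, owner TEXT)")
db.execute("INSERT INTO jobs VALUES ('daily_load', 'pending', NULL)")
db.commit()

def claim_job(conn, worker_id):
    """Atomically claim one pending job; return its name, or None."""
    cur = conn.execute(
        "UPDATE jobs SET state='running', owner=? "
        "WHERE name=(SELECT name FROM jobs WHERE state='pending' LIMIT 1)",
        (worker_id,))
    conn.commit()
    if cur.rowcount == 0:
        return None  # another worker got there first
    row = conn.execute(
        "SELECT name FROM jobs WHERE owner=?", (worker_id,)).fetchone()
    return row[0]

print(claim_job(db, "worker-1"))  # 'daily_load'
print(claim_job(db, "worker-2"))  # None -- already claimed
```

The point is that the transactional guarantees come for free from the database; there is nothing for a master process to mediate.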
Moving on, the other daemon modules are the scheduler, the worker and the UI web server. The scheduler performs calendar scheduling of workflows. The workers actually execute individual jobs.
An important piece of the Pinball architecture is tokens. Tokens are basically records, and the collection of all tokens is the system state. Unfortunately the different sorts of tokens are not documented, and since Python is dynamic, there’s also no usable documentation in the code (eg. like a header file in C++). Tokens have a `data` member, and Python objects are pickled and stored there on the fly as the state.
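As I read it from the source, the token idea boils down to something like the following sketch (the field names here are my guesses, not Pinball's actual schema):

```python
# A token as a record whose `data` field holds an arbitrary pickled
# Python object -- opaque to the master, meaningful only to the module
# that wrote it.
import pickle
from dataclasses import dataclass

@dataclass
class Token:
    name: str       # hierarchical path identifying the token
    version: int    # bumped on every write
    data: bytes     # pickled payload

state = {"workflow": "daily_load", "attempt": 2}
token = Token(name="/workflow/daily_load/job/extract",
              version=1,
              data=pickle.dumps(state))

# Any module holding the token can unpickle the payload back into an object.
restored = pickle.loads(token.data)
print(restored["attempt"])  # 2
```

Because the payload is an opaque pickle, nothing short of reading the writing module's code tells you what a given token contains, which is exactly the documentation problem described above.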
When I first read the blog posts and the code, I saw this diagram and then this one, and I thought that only the master accesses the database and the scheduler and workers don’t: everything goes through the master using tokens. But that’s not actually true; I think the architecture is that everybody accesses the database for reads (as an optimization), but only the master writes to it. This seems like a leaky abstraction, and again it’s not clear why the modules can’t just use the DB to communicate state, and why Thrift is needed. Relevant parts from the blog post:
> Every state (token) change goes through the master and gets committed to the persistent store before the worker request returns… workers can read archived tokens directly from the persistent storage, bypassing the master, greatly improving system scalability.
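My reading of that quoted design, as a sketch (illustrative class names, not actual Pinball code): all writes funnel through a single master that commits before acknowledging, while readers hit the store directly.

```python
# Write-through-master, read-direct pattern described in the blog post.
class Store:                              # stands in for MySQL
    def __init__(self):
        self.tokens = {}

class Master:
    """Sole writer: a change is committed before the call returns."""
    def __init__(self, store):
        self.store = store
    def update_token(self, name, value):
        self.store.tokens[name] = value   # "commit" to persistent storage
        return True                       # only now does the request return

class Worker:
    """Writes via the master; reads tokens straight from the store."""
    def __init__(self, master, store):
        self.master = master
        self.store = store
    def finish_job(self, name):
        return self.master.update_token(name, "done")
    def read_token(self, name):
        return self.store.tokens.get(name)  # bypasses the master

store = Store()
worker = Worker(Master(store), store)
worker.finish_job("/job/extract")
print(worker.read_token("/job/extract"))  # 'done'
```

Note how the worker ends up needing a handle to the store anyway, which is why this reads to me like a leaky abstraction: the master's API hides writes but not reads.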
An interesting design decision is the separation of the workflow description, which is given in Python dictionaries, from the actual job code. See the example here. It’s a bit weird that the workflow references the actual job using a string; I think this is because many modules load the workflow (eg. the scheduler), but only the workers actually load the jobs.
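A sketch of why the string indirection can work (the dict layout and names here are mine, not Pinball's actual config format): the workflow description is pure data, so the scheduler can load it without importing any job code, and a worker resolves the string to a class only when it actually runs the job.

```python
# Workflow-as-data: jobs are referenced by strings, not imported classes.
import importlib

WORKFLOW = {
    "name": "daily_load",
    "jobs": {
        # job name -> (module path, class name, upstream dependencies)
        "extract": ("jobs.extract", "ExtractJob", []),
        "load":    ("jobs.load", "LoadJob", ["extract"]),
    },
}

def resolve_job(module_path, class_name):
    """Worker-side: import the job class named by the config strings."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Scheduler-side: the DAG can be computed from the data alone, without
# resolve_job() ever being called ('jobs.extract' need not be importable).
deps = {job: set(spec[2]) for job, spec in WORKFLOW["jobs"].items()}
print(deps["load"])  # {'extract'}
```

The cost is that typos in the string references only surface at execution time on a worker, not when the workflow is loaded.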
Contrib stuff
Pinball has contrib stuff for the following job types:
- Bash
- Python
- S3 (also EMR)
- Hadoop, Hive
- Qubole (a data processing platform-as-a-service Pinterest uses)
There are no connectors for Postgres, MySQL, Redshift, Presto or any other SQL database.
Source code and tests
The main codebase is ~18,000 lines of Python, plus about 7,000 lines of unit test code. Other Python libraries used on the server side:
I think it’s cool that Pinball doesn’t have many library dependencies; for a Python project, it barely has any.
Conclusions
If I had to build an ETL system from scratch today, I would not use Pinball. It’s not documented, there are few recent commits, I can't find other users, and I'm suspicious of the architecture. I would use Airflow.