Luigi vs Airflow vs Pinball
Marton Trencseni - Sat 06 February 2016 - Data
After reviewing these three ETL workflow frameworks, I compiled a table comparing them. Here's the original Gdoc spreadsheet. If I had to build a new ETL system today from scratch, I would use Airflow. If you find any mistakes, please let me know at mtrencseni@gmail.com.
| | Luigi | Airflow | Pinball |
|---|---|---|---|
| repo | https://github.com/spotify/luigi | https://github.com/airbnb/airflow | https://github.com/pinterest/pinball |
| docs | http://luigi.readthedocs.org | https://airflow.readthedocs.org | none |
| my review | http://bytepawn.com/luigi.html | http://bytepawn.com/airflow.html | http://bytepawn.com/pinball.html |
| github forks | 750 | 345 | 58 |
| github stars | 4,029 | 1,798 | 506 |
| github watchers | 319 | 166 | 47 |
| commits in last 30 days | lots of commits | lots of commits | 3 commits |
| **architecture** | | | |
| web dashboard | not really, minimal | very nice | yes |
| code/DSL | code | code | Python dict + Python code |
| files/datasets | yes, targets | not really, as special tasks | ? |
| calendar scheduling | no, use cron | yes, LocalScheduler | yes |
| datadoc'able [1] | maybe, doesn't really fit | probably, by convention | yes, dicts would be easy to parse |
| backfill jobs | yes | yes | ? |
| persists state | kind of | yes, to db | yes, to db |
| tracks history | yes | yes, in db | yes, in db |
| code shipping | no | yes, pickle | workflow is shipped using pickle; jobs are not? |
| priorities | yes | yes | ? |
| parallelism | yes, workers with threads per worker | yes, workers | ? |
| control parallelism | yes, resources | yes, pools | ? |
| cross-DAG deps | yes, using targets | yes, using sensors | yes |
| finds newly deployed tasks | no | yes | ? |
| executes DAG | no, have to create a special sink task | yes | yes |
| multiple DAGs | no, just one | yes, also several DAG instances (DagRuns) | yes |
| **scheduler/workers** | | | |
| starting workers | users start worker processes | scheduler spawns worker processes | users start worker processes |
| comms | scheduler's HTTP API | minimal, in state db | through the master module, using Swift |
| workers execute | worker executes tasks it has locally | worker reads pickled tasks from db | worker executes tasks it has locally? |
| **contrib** | | | |
| hadoop | yes | yes | yes |
| pig | yes | docs mention a PigOperator, but it's not in the source | no |
| hive | yes | yes | yes |
| pgsql | yes | yes | no |
| mysql | yes | yes | no |
| redshift | yes | no | no |
| s3 | yes | yes | yes |
| **source** | | | |
| written in | Python | Python | Python |
| lines of code | 18,000 | 21,000 | 18,000 |
| tests | lots | minimal | lots |
| maturity | fair | low | low |
| other serious users | yes | not really | no |
| pip install | yes | yes | broken |
| niceties | - | SLAs, XComs, variables, trigger rules, Celery, charts | pass data between jobs |
| **does it for you** | | | |
| sync tasks to workers | no | yes | no |
| scheduling | no | yes | yes |
| monitoring | no | no | no |
| alerting | no | SLAs, but probably not enough | sends emails |
| dashboards | no | yes | yes |
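The "files/datasets" row above is one of the deepest architectural differences: in Luigi, a task is considered complete when its output target exists, which is what makes reruns idempotent and cross-DAG dependencies possible via shared targets. Here is a rough sketch of that pattern in plain Python, without the library — the class names mirror Luigi's `Task`/`Target` idea, but this is illustrative, not Luigi's actual API:

```python
import os
import tempfile

class LocalTarget:
    """A task's output: the task counts as done when this file exists."""
    def __init__(self, path):
        self.path = path
    def exists(self):
        return os.path.exists(self.path)

class Task:
    def output(self):
        raise NotImplementedError
    def complete(self):
        # Luigi-style completion check: the output target exists.
        return self.output().exists()
    def run(self):
        raise NotImplementedError

class WriteGreeting(Task):
    def __init__(self, path):
        self.path = path
    def output(self):
        return LocalTarget(self.path)
    def run(self):
        with open(self.path, "w") as f:
            f.write("hello")

def run_if_needed(task):
    """Skip tasks whose targets already exist; this is what makes
    backfills and reruns cheap and safe."""
    if not task.complete():
        task.run()
    return task.complete()

path = os.path.join(tempfile.mkdtemp(), "greeting.txt")
task = WriteGreeting(path)
run_if_needed(task)  # runs, creates the file
run_if_needed(task)  # no-op: the target already exists
```

Airflow turns this inside out: state lives in the database rather than in the existence of output files, and datasets only appear indirectly, as sensor tasks that wait for them.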
[1] By datadoc'able I mean: could you write a script which reads and parses the ETL jobs, and generates a nice documentation about your datasets and which ETL jobs read/write them. At Prezi we did this, we called it datadoc.
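To make the datadoc idea concrete, here is a minimal sketch of such a script. The job-definition format (dicts with `reads`/`writes` keys) is entirely hypothetical — it just stands in for whatever convention your ETL jobs declare their inputs and outputs in; Pinball's dict-based configs are what make this kind of static inspection easy there:

```python
from collections import defaultdict

# Hypothetical job definitions: each job declares which datasets it
# reads and writes. The names and URI scheme are made up for illustration.
JOBS = {
    "import_events": {"reads": ["s3://raw/events"], "writes": ["hive://events"]},
    "daily_rollup":  {"reads": ["hive://events"], "writes": ["hive://events_daily"]},
    "user_report":   {"reads": ["hive://events_daily"], "writes": ["pgsql://reports.users"]},
}

def datadoc(jobs):
    """Index every dataset by the jobs that read and write it."""
    datasets = defaultdict(lambda: {"read_by": [], "written_by": []})
    for name, spec in jobs.items():
        for ds in spec.get("reads", []):
            datasets[ds]["read_by"].append(name)
        for ds in spec.get("writes", []):
            datasets[ds]["written_by"].append(name)
    return dict(datasets)

doc = datadoc(JOBS)
print(doc["hive://events"])
# {'read_by': ['daily_rollup'], 'written_by': ['import_events']}
```

From an index like this it's a short step to rendering an HTML page per dataset, with links to the jobs that produce and consume it.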