Luigi vs Airflow vs Pinball
Marton Trencseni - Sat 06 February 2016 - Data
After reviewing these three ETL workflow frameworks, I compiled a table comparing them. Here's the original Gdoc spreadsheet. If I had to build a new ETL system today from scratch, I would use Airflow. After the table I've included a couple of short code sketches to make the DSL differences concrete. If you find any mistakes, please let me know at mtrencseni@gmail.com.
| | Luigi | Airflow | Pinball |
|---|---|---|---|
| repo | https://github.com/spotify/luigi | https://github.com/airbnb/airflow | https://github.com/pinterest/pinball |
| docs | http://luigi.readthedocs.org | https://airflow.readthedocs.org | none |
| my review | http://bytepawn.com/luigi.html | http://bytepawn.com/airflow.html | http://bytepawn.com/pinball.html |
| github forks | 750 | 345 | 58 |
| github stars | 4029 | 1798 | 506 |
| github watchers | 319 | 166 | 47 |
| commits in last 30 days | lots of commits | lots of commits | 3 commits |
| **architecture** | | | |
| web dashboard | not really, minimal | very nice | yes |
| code/dsl | code | code | python dict + python code |
| files/datasets | yes, targets | not really, as special tasks | ? |
| calendar scheduling | no, use cron | yes, LocalScheduler | yes |
| datadoc'able [1] | maybe, doesn't really fit | probably, by convention | yes, dicts would be easy to parse |
| backfill jobs | yes | yes | ? |
| persists state | kind of | yes, to db | yes, to db |
| tracks history | yes | yes, in db | yes, in db |
| code shipping | no | yes, pickle | workflow is shipped using pickle, jobs are not? |
| priorities | yes | yes | ? |
| parallelism | yes, workers, threads per worker | yes, workers | ? |
| control parallelism | yes, resources | yes, pools | ? |
| cross-dag deps | yes, using targets | yes, using sensors | yes |
| finds newly deployed tasks | no | yes | ? |
| executes dag | no, have to create a special sink task | yes | yes |
| multiple dags | no, just one | yes, also several dag instances (dagruns) | yes |
| **scheduler/workers** | | | |
| starting workers | users start worker processes | scheduler spawns worker processes | users start worker processes |
| comms | scheduler's HTTP API | minimal, in state db | through master module, using Thrift |
| workers execute | worker can execute tasks it has locally | worker reads pickled tasks from db | worker can execute tasks it has locally? |
| **contrib** | | | |
| hadoop | yes | yes | yes |
| pig | yes | docs mention a PigOperator, but it's not in the source | no |
| hive | yes | yes | yes |
| pgsql | yes | yes | no |
| mysql | yes | yes | no |
| redshift | yes | no | no |
| s3 | yes | yes | yes |
| **source** | | | |
| written in | python | python | python |
| loc | 18,000 | 21,000 | 18,000 |
| tests | lots | minimal | lots |
| maturity | fair | low | low |
| other serious users | yes | not really | no |
| pip install | yes | yes | broken |
| niceties | - | SLAs, XComs, variables, trigger rules, Celery, charts | pass data between jobs |
| **does it for you** | | | |
| sync tasks to workers | no | yes | no |
| scheduling | no | yes | yes |
| monitoring | no | no | no |
| alerting | no | SLAs, but probably not enough | sends emails |
| dashboards | no | yes | yes |
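To make the code/dsl and files/datasets rows concrete, here is a toy two-step pipeline sketched in both frameworks. The Luigi version below is hypothetical (the task names and /tmp paths are made up), but `luigi.Task`, `luigi.LocalTarget`, `requires()`, `output()` and `run()` are Luigi's real interface; note how the dataset itself is modeled as a Target:

```python
import luigi

# Hypothetical two-step pipeline; names and paths are illustrative only.
class Extract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # In Luigi, datasets are first-class: each task declares a Target.
        return luigi.LocalTarget('/tmp/raw_%s.csv' % self.date)

    def run(self):
        with self.output().open('w') as f:
            f.write('hello\n')

class Transform(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Luigi decides whether Extract needs to run by checking
        # whether its output Target already exists.
        return Extract(date=self.date)

    def output(self):
        return luigi.LocalTarget('/tmp/clean_%s.csv' % self.date)

    def run(self):
        with self.input().open('r') as fin, self.output().open('w') as fout:
            for line in fin:
                fout.write(line.upper())
```

The same pipeline in Airflow, sketched against the API as it looked in early 2016 (the bash commands are placeholders). There are no Targets; the graph is held together by explicit task-to-task dependencies. The `schedule_interval` argument is what the calendar scheduling row refers to: Airflow triggers runs itself, whereas the Luigi version above would be kicked off from cron:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical DAG; task ids and commands are placeholders.
dag = DAG('etl_example',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')  # Airflow does the calendar scheduling

extract = BashOperator(task_id='extract',
                       bash_command='echo extracting',
                       dag=dag)
transform = BashOperator(task_id='transform',
                         bash_command='echo transforming',
                         dag=dag)
transform.set_upstream(extract)  # explicit task dependency, no Target
```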
[1] By datadoc'able I mean: could you write a script which reads and parses the ETL jobs, and generates nice documentation about your datasets and which ETL jobs read/write them? At Prezi we did this; we called it datadoc. A rough sketch of the idea follows below.
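For illustration, here is roughly what such a script could look like for Luigi, where explicit Targets make the read/write sets easy to recover. This is a hypothetical sketch: the `datadoc()` helper is made up, and it assumes file-based targets that have a `.path` attribute; `luigi.task.flatten()` is Luigi's own utility for flattening nested task/target structures.

```python
import luigi
import luigi.task

def datadoc(tasks):
    """Hypothetical datadoc generator: maps each dataset (target path)
    to the tasks that write it and the tasks that read it."""
    writes, reads = {}, {}
    for task in tasks:
        name = task.__class__.__name__
        # Targets this task writes.
        for target in luigi.task.flatten(task.output()):
            writes.setdefault(target.path, []).append(name)
        # Targets this task reads, recovered from its upstream tasks.
        for dep in luigi.task.flatten(task.requires()):
            for target in luigi.task.flatten(dep.output()):
                reads.setdefault(target.path, []).append(name)
    for path in sorted(set(writes) | set(reads)):
        print(path)
        print('  written by: %s' % ', '.join(writes.get(path, ['-'])))
        print('  read by:    %s' % ', '.join(reads.get(path, ['-'])))
```

Run against the Luigi sketch above, `datadoc([Extract(date=d), Transform(date=d)])` would print each /tmp path together with its writer and readers. With Pinball, the same information could be pulled straight out of the workflow dicts, which is why it scores best on the datadoc'able row.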