Luigi vs Airflow vs Pinball

Posted on Sat 06 February 2016 in Data

After reviewing these three ETL workflow frameworks, I compiled a table comparing them. Here's the original Gdoc spreadsheet. If I had to build a new ETL system today from scratch, I would use Airflow. If you find any mistakes, please let me know at mtrencseni@gmail.com.

|  | Luigi | Airflow | Pinball |
|---|---|---|---|
| repo | https://github.com/spotify/luigi | https://github.com/airbnb/airflow | https://github.com/pinterest/pinball |
| docs | http://luigi.readthedocs.org | https://airflow.readthedocs.org | none |
| my review | http://bytepawn.com/luigi.html | http://bytepawn.com/airflow.html | http://bytepawn.com/pinball.html |
| github forks | 750 | 345 | 58 |
| github stars | 4029 | 1798 | 506 |
| github watchers | 319 | 166 | 47 |
| commits in last 30 days | lots of commits | lots of commits | 3 commits |
| **architecture** | | | |
| web dashboard | not really, minimal | very nice | yes |
| code/dsl | code | code | python dict + python code |
| files/datasets | yes, targets | not really, as special tasks | ? |
| calendar scheduling | no, use cron | yes, LocalScheduler | yes |
| datadoc'able [1] | maybe, doesn't really fit | probably, by convention | yes, dicts would be easy to parse |
| backfill jobs | yes | yes | ? |
| persists state | kind of | yes, to db | yes, to db |
| tracks history | yes | yes, in db | yes, in db |
| code shipping | no | yes, pickle | workflow is shipped using pickle, jobs are not? |
| priorities | yes | yes | ? |
| parallelism | yes, workers, threads per worker | yes, workers | ? |
| control parallelism | yes, resources | yes, pools | ? |
| cross-dag deps | yes, using targets | yes, using sensors | yes |
| finds newly deployed tasks | no | yes | ? |
| executes dag | no, have to create special sink task | yes | yes |
| multiple dags | no, just one | yes, also several dag instances (dagruns) | yes |
| **scheduler/workers** | | | |
| starting workers | users start worker processes | scheduler spawns worker processes | users start worker processes |
| comms | scheduler's HTTP API | minimal, in state db | through master module using Swift |
| workers execute | worker can execute tasks that it has locally | worker reads pickled tasks from db | worker can execute tasks that it has locally? |
| **contrib** | | | |
| hadoop | yes | yes | yes |
| pig | yes | doc mentions PigOperator, it's not in the source | no |
| hive | yes | yes | yes |
| pgsql | yes | yes | no |
| mysql | yes | yes | no |
| redshift | yes | no | no |
| s3 | yes | yes | yes |
| **source** | | | |
| written in | python | python | python |
| loc | 18,000 | 21,000 | 18,000 |
| tests | lots | minimal | lots |
| maturity | fair | low | low |
| other serious users | yes | not really | no |
| pip install | yes | yes | broken |
| niceties | - | sla, xcom, variables, trigger rules, celery, charts | pass data between jobs |
| **does it for you** | | | |
| sync tasks to workers | no | yes | no |
| scheduling | no | yes | yes |
| monitoring | no | no | no |
| alerting | no | slas, but probably not enough | sends emails |
| dashboards | no | yes | yes |
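On the files/datasets point: Luigi's distinguishing idea is that dependencies are targets, usually files, and a task counts as complete when its output target exists. To show why this makes cross-dag deps and backfill-style re-runs natural, here is a toy sketch of the pattern in plain Python; the class and method names are illustrative only, not Luigi's actual API:

```python
import os
import tempfile

# Toy sketch of the target pattern: a task is complete iff its output
# file exists, so dependency resolution reduces to checking for files.
# Re-running the build skips anything whose target is already on disk.

class Task:
    def requires(self):          # upstream tasks
        return []
    def output(self):            # path this task produces
        raise NotImplementedError
    def complete(self):          # done iff the target exists
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

def build(task):
    """Depth-first: satisfy upstream tasks first, skip completed ones."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

workdir = tempfile.mkdtemp()

class Extract(Task):
    def output(self):
        return os.path.join(workdir, "raw.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("1\n2\n3\n")

class Aggregate(Task):
    def requires(self):
        return [Extract()]
    def output(self):
        return os.path.join(workdir, "sum.txt")
    def run(self):
        with open(Extract().output()) as f:
            total = sum(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write(str(total))

build(Aggregate())
with open(Aggregate().output()) as f:
    print(f.read())  # -> 6
```

Because completeness is just "does the file exist", another dag (or another team's pipeline) can depend on this output simply by pointing at the same target — which is also why there is no need for a central scheduler to execute the dag.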

[1] By datadoc'able I mean: could you write a script which reads and parses the ETL jobs, and generates nice documentation about your datasets and which ETL jobs read/write them? At Prezi we did this; we called it datadoc.
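Pinball scores well here precisely because its workflows are plain Python dicts: a script can import them and invert job definitions into a per-dataset index. A minimal sketch of the idea — the dict layout below is made up for illustration and is not Pinball's actual config format:

```python
from collections import defaultdict

# Hypothetical dict-style workflow definitions, keyed by workflow name,
# then job name. Each job declares which datasets it reads and writes.
WORKFLOWS = {
    "daily_reports": {
        "clean_events": {"reads": ["raw_events"],   "writes": ["clean_events"]},
        "user_stats":   {"reads": ["clean_events"], "writes": ["user_stats"]},
    },
    "billing": {
        "invoices":     {"reads": ["user_stats"],   "writes": ["invoices"]},
    },
}

def datadoc(workflows):
    """Invert job definitions into a per-dataset index of readers/writers."""
    datasets = defaultdict(lambda: {"read_by": [], "written_by": []})
    for wf_name, jobs in workflows.items():
        for job_name, job in jobs.items():
            full = f"{wf_name}.{job_name}"
            for ds in job["reads"]:
                datasets[ds]["read_by"].append(full)
            for ds in job["writes"]:
                datasets[ds]["written_by"].append(full)
    return dict(datasets)

docs = datadoc(WORKFLOWS)
for ds in sorted(docs):
    print(ds, docs[ds])
```

Doing the same for Luigi or Airflow means statically analyzing arbitrary Python task code, which is why the table rates them "maybe" and "probably, by convention".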