Luigi vs Airflow vs Pinball
Marton Trencseni - Sat 06 February 2016 - Data
After reviewing these three ETL workflow frameworks, I compiled a table comparing them. Here's the original Gdoc spreadsheet. If I had to build a new ETL system today from scratch, I would use Airflow. After the table I've included a couple of short code sketches to make the DSL differences concrete. If you find any mistakes, please let me know at mtrencseni@gmail.com.
| | Luigi | Airflow | Pinball |
|---|---|---|---|
| repo | https://github.com/spotify/luigi | https://github.com/airbnb/airflow | https://github.com/pinterest/pinball |
| docs | http://luigi.readthedocs.org | https://airflow.readthedocs.org | none |
| my review | http://bytepawn.com/luigi.html | http://bytepawn.com/airflow.html | http://bytepawn.com/pinball.html |
| github forks | 750 | 345 | 58 |
| github stars | 4029 | 1798 | 506 |
| github watchers | 319 | 166 | 47 |
| commits in last 30 days | lots of commits | lots of commits | 3 commits |
| **architecture** | | | |
| web dashboard | not really, minimal | very nice | yes |
| code/dsl | code | code | python dict + python code |
| files/datasets | yes, targets | not really, as special tasks | ? |
| calendar scheduling | no, use cron | yes, LocalScheduler | yes |
| datadoc'able [1] | maybe, doesn't really fit | probably, by convention | yes, dicts would be easy to parse |
| backfill jobs | yes | yes | ? |
| persists state | kind of | yes, to db | yes, to db |
| tracks history | yes | yes, in db | yes, in db |
| code shipping | no | yes, pickle | workflow is shipped using pickle, jobs are not? |
| priorities | yes | yes | ? |
| parallelism | yes, workers, threads per worker | yes, workers | ? |
| control parallelism | yes, resources | yes, pools | ? |
| cross-dag deps | yes, using targets | yes, using sensors | yes |
| finds newly deployed tasks | no | yes | ? |
| executes dag | no, have to create a special sink task | yes | yes |
| multiple dags | no, just one | yes, also several dag instances (dagruns) | yes |
| **scheduler/workers** | | | |
| starting workers | users start worker processes | scheduler spawns worker processes | users start worker processes |
| comms | scheduler's HTTP API | minimal, in state db | through master module, using Thrift |
| workers execute | worker can execute tasks it has locally | worker reads pickled tasks from db | worker can execute tasks it has locally? |
| **contrib** | | | |
| hadoop | yes | yes | yes |
| pig | yes | docs mention a PigOperator, but it's not in the source | no |
| hive | yes | yes | yes |
| pgsql | yes | yes | no |
| mysql | yes | yes | no |
| redshift | yes | no | no |
| s3 | yes | yes | yes |
| **source** | | | |
| written in | python | python | python |
| loc | 18,000 | 21,000 | 18,000 |
| tests | lots | minimal | lots |
| maturity | fair | low | low |
| other serious users | yes | not really | no |
| pip install | yes | yes | broken |
| niceties | - | SLAs, XComs, variables, trigger rules, Celery, charts | pass data between jobs |
| **does it for you** | | | |
| sync tasks to workers | no | yes | no |
| scheduling | no | yes | yes |
| monitoring | no | no | no |
| alerting | no | SLAs, but probably not enough | sends emails |
| dashboards | no | yes | yes |
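To make the code/dsl and files/datasets rows concrete, here is a toy two-step pipeline sketched in both frameworks. The Luigi version below is hypothetical (the task names and /tmp paths are made up), but `luigi.Task`, `luigi.LocalTarget`, `requires()`, `output()` and `run()` are Luigi's real interface; note how the dataset itself is modeled as a Target:

```python
import luigi

# Hypothetical two-step pipeline; names and paths are illustrative only.
class Extract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # In Luigi, datasets are first-class: each task declares a Target.
        return luigi.LocalTarget('/tmp/raw_%s.csv' % self.date)

    def run(self):
        with self.output().open('w') as f:
            f.write('hello\n')

class Transform(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Luigi decides whether Extract needs to run by checking
        # whether its output Target already exists.
        return Extract(date=self.date)

    def output(self):
        return luigi.LocalTarget('/tmp/clean_%s.csv' % self.date)

    def run(self):
        with self.input().open('r') as fin, self.output().open('w') as fout:
            for line in fin:
                fout.write(line.upper())
```

The same pipeline in Airflow, sketched against the API as it looked in early 2016 (the bash commands are placeholders). There are no Targets; the graph is held together by explicit task-to-task dependencies. The `schedule_interval` argument is what the calendar scheduling row refers to: Airflow triggers runs itself, whereas the Luigi version above would be kicked off from cron:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical DAG; task ids and commands are placeholders.
dag = DAG('etl_example',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')  # Airflow does the calendar scheduling

extract = BashOperator(task_id='extract',
                       bash_command='echo extracting',
                       dag=dag)
transform = BashOperator(task_id='transform',
                         bash_command='echo transforming',
                         dag=dag)
transform.set_upstream(extract)  # explicit task dependency, no Target
```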
[1] By datadoc'able I mean: could you write a script which reads and parses the ETL jobs, and generates nice documentation about your datasets and which ETL jobs read/write them? At Prezi we did this; we called it datadoc. A rough sketch of the idea follows below.
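For illustration, here is roughly what such a script could look like for Luigi, where explicit Targets make the read/write sets easy to recover. This is a hypothetical sketch: the `datadoc()` helper is made up, and it assumes file-based targets that have a `.path` attribute; `luigi.task.flatten()` is Luigi's own utility for flattening nested task/target structures.

```python
import luigi
import luigi.task

def datadoc(tasks):
    """Hypothetical datadoc generator: maps each dataset (target path)
    to the tasks that write it and the tasks that read it."""
    writes, reads = {}, {}
    for task in tasks:
        name = task.__class__.__name__
        # Targets this task writes.
        for target in luigi.task.flatten(task.output()):
            writes.setdefault(target.path, []).append(name)
        # Targets this task reads, recovered from its upstream tasks.
        for dep in luigi.task.flatten(task.requires()):
            for target in luigi.task.flatten(dep.output()):
                reads.setdefault(target.path, []).append(name)
    for path in sorted(set(writes) | set(reads)):
        print(path)
        print('  written by: %s' % ', '.join(writes.get(path, ['-'])))
        print('  read by:    %s' % ', '.join(reads.get(path, ['-'])))
```

Run against the Luigi sketch above, `datadoc([Extract(date=d), Transform(date=d)])` would print each /tmp path together with its writer and readers. With Pinball, the same information could be pulled straight out of the workflow dicts, which is why it scores best on the datadoc'able row.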