Fetchr Data Science Infra at 1 year

Marton Trencseni - Tue 14 August 2018 • Tagged with data, etl, workflow, airflow, fetchr, model, ml

A description of our Analytics+ML cluster running on AWS, using Presto, Airflow and Superset.

Fetchr Data Science Infra

Building the Fetchr Data Science Infra on AWS with Presto and Airflow

Marton Trencseni - Wed 14 March 2018 • Tagged with data, etl, workflow, airflow, fetchr

We used Hive/Presto on AWS together with Airflow to rapidly build out the Data Science Infrastructure at Fetchr in less than 6 months.

Warehouse DAG

Luigi vs Airflow vs Pinball

Marton Trencseni - Sat 06 February 2016 • Tagged with data, etl, workflow, luigi, airflow, pinball

A spreadsheet comparing three open-source workflow tools for ETL: Luigi, Airflow and Pinball.

Comparison

Pinball review

Marton Trencseni - Sat 06 February 2016 • Tagged with data, etl, workflow, pinball

Pinball is an ETL tool written by Pinterest. Like Airflow, it supports defining tasks and dependencies as Python code, executing and scheduling them, and distributing tasks across worker nodes. It supports calendar scheduling (hourly/daily jobs, also visualized on the web dashboard). Unfortunately, I found that Pinball has very little documentation, very few recent commits in the GitHub repo, and few meaningful answers from maintainers to GitHub issues, while its architecture is complicated and undocumented.

Airflow review

Marton Trencseni - Wed 06 January 2016 • Tagged with data, etl, workflow, airflow

Airflow is a workflow scheduler written by Airbnb. It supports defining tasks and dependencies as Python code, executing and scheduling them, and distributing tasks across worker nodes. It supports calendar scheduling (hourly/daily jobs, also visualized on the web dashboard), so it can be used as a starting point for traditional ETL. It has a nice web dashboard for seeing current and past task state, querying the history and making changes to metadata such as connection strings.
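To make "tasks and dependencies as Python code" concrete, here is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). It does not use Airflow's API; the task names `extract`, `transform`, `load` and the helper `run_pipeline` are hypothetical, chosen only to illustrate the idea of running tasks in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives the results of the tasks that ran before it.
def extract(results):
    return [1, 2, 3]

def transform(results):
    return [row * 10 for row in results["extract"]]

def load(results):
    return sum(results["transform"])

TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks it depends on (its upstream tasks).
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def run_pipeline(tasks, dependencies):
    """Run every task once, upstream tasks first, and collect the results."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = tasks[name](results)
    return order, results

order, results = run_pipeline(TASKS, DEPENDENCIES)
print(order, results["load"])
```

A real scheduler adds what this sketch leaves out: calendar scheduling, retries, persisted task state and distribution across worker nodes.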

Airflow

Luigi review

Marton Trencseni - Sun 20 December 2015 • Tagged with data, etl, workflow, luigi

I review Luigi, an execution framework for writing data pipes in Python code. It supports task-task dependencies, and it has a simple central scheduler with an HTTP API and an extensive library of helpers for building data pipes for Hadoop, AWS, MySQL, etc. It was written by Spotify for internal use and open sourced in 2012. A number of companies use it, such as Foursquare, Stripe and Asana.
