5 things that happened in Data Science in 2018

Posted on Wed 09 January 2019 in Data • Tagged with data, openai, waymo, deepmind, tesla, reinforce

2018 was a hot year for Data Science and AI. Here we pick out 5 highlights that, in our opinion, shaped the field over the past year.

DeepMind playing CTF

Continue reading

Warehouse locations with k-means

Posted on Wed 26 September 2018 in Data • Tagged with data, data-science, metrics

Sometimes, the seven gods of data science, Pascal, Gauss, Bayes, Poisson, Markov, Shannon and Fisher, all wake up in a good mood, and things just work out. Recently we had such an occurrence at Fetchr, when the Operational Excellence team posed the following question: if we could pick our Saudi warehouse locations, where would we put them? What is the ideal number of warehouses, and what does ideal even mean? Also, what should our “delivery radius” be?
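
As a rough illustration of the question above, here is a minimal sketch of k-means (Lloyd's algorithm) on 2D points, using only the Python standard library. The coordinates, the function name `kmeans` and the choice of k are made up for illustration; this is not Fetchr's actual analysis, and real delivery locations would call for geographic distances rather than the plain Euclidean distance assumed here.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on 2D points; returns (centers, labels)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(centers[c])
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Hypothetical delivery coordinates forming two obvious groups.
deliveries = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
warehouses, assignment = kmeans(deliveries, k=2)

# One simple stand-in for a "delivery radius": the largest distance
# from any delivery point to its assigned warehouse.
radius = max(math.dist(p, warehouses[lab])
             for p, lab in zip(deliveries, assignment))
```

In this sketch the number of warehouses k is an input; comparing the resulting radius (or total distance) across several values of k is one possible way to reason about what "ideal" means.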

Continue reading

Growth Accounting and Backtraced Growth Accounting

Posted on Sun 16 September 2018 in Data • Tagged with data, data-science, metrics, growth-accounting

Previously I wrote two articles about data infra and data engineering at Fetchr. This time I want to move up the stack and talk about a simple piece of metrics engineering that proved to be very impactful: Growth Accounting and Backtraced Growth Accounting.

Backtraced Growth Accounting

Continue reading

Fetchr Data Science Infra at 1 year

Posted on Tue 14 August 2018 in Data • Tagged with data, etl, workflow, airflow, fetchr, model, ml

A description of our Analytics+ML cluster running on AWS, using Presto, Airflow and Superset.

Fetchr Data Science Infra

Continue reading

Beat the averages

Posted on Sat 07 July 2018 in Data • Tagged with data, statistics

When working with averages, we have to be careful. There are pitfalls that can distort our statistics and the results we report.
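
One such pitfall, picked here only as an illustration, is a single outlier dragging the mean away from the typical value. A minimal sketch with made-up delivery times:

```python
import statistics

# Made-up delivery times in hours; the last value is a single outlier.
delivery_hours = [2, 3, 3, 4, 48]

# The mean is pulled up by the outlier.
mean_hours = statistics.mean(delivery_hours)
# The median is the middle value and ignores how extreme the outlier is.
median_hours = statistics.median(delivery_hours)

print(f"mean={mean_hours}, median={median_hours}")
```

Here the mean is 12 hours while the median is 3 hours, so reporting only the mean would suggest that a typical delivery takes four times longer than it does.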

Probability distribution

Continue reading

Building the Fetchr Data Science Infra on AWS with Presto and Airflow

Posted on Wed 14 March 2018 in Data • Tagged with data, etl, workflow, airflow, fetchr

We used Hive/Presto on AWS together with Airflow to rapidly build out the Data Science Infrastructure at Fetchr in less than 6 months.

Warehouse DAG

Continue reading

Don’t build cockpits, become a coach

Posted on Wed 09 November 2016 in Data • Tagged with data, science, product, analytics

I used to think that a good analogy for using data is the instrumentation of a cockpit in an airliner. Lots of instruments, and if they fail, the pilot can’t fly the plane and bad things happen. There’s no autopilot for companies. The problem with this analogy is that planes aren’t built in mid-air. Product teams and companies constantly need to build and ship new products.

A big complicated cockpit

Continue reading

Luigi vs Airflow vs Pinball

Posted on Sat 06 February 2016 in Data • Tagged with data, etl, workflow, luigi, airflow, pinball

A spreadsheet comparing the three open-source workflow tools for ETL.

Comparison

Continue reading

Pinball review

Posted on Sat 06 February 2016 in Data • Tagged with data, etl, workflow, pinball

Pinball is an ETL tool written by Pinterest. Like Airflow, it supports defining tasks and dependencies as Python code, executing and scheduling them, and distributing tasks across worker nodes. It supports calendar scheduling (hourly/daily jobs, also visualized on the web dashboard). Unfortunately, I found that Pinball has very little documentation, very few recent commits in the GitHub repo and few meaningful answers to GitHub issues by maintainers, while its architecture is complicated and undocumented.

Continue reading

Airflow review

Posted on Wed 06 January 2016 in Data • Tagged with data, etl, workflow, airflow

Airflow is a workflow scheduler written by Airbnb. It supports defining tasks and dependencies as Python code, executing and scheduling them, and distributing tasks across worker nodes. It supports calendar scheduling (hourly/daily jobs, also visualized on the web dashboard), so it can be used as a starting point for traditional ETL. It has a nice web dashboard for seeing current and past task state, querying the history and making changes to metadata such as connection strings.
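
To make the idea of tasks and dependencies defined as Python code concrete, here is a sketch that uses only the standard library's `graphlib` module (Python 3.9+), not Airflow's own API. The task names and the `run_task` helper are hypothetical; a real scheduler additionally handles calendar schedules, retries and distribution across worker nodes.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "build_report": {"join_tables"},
}


def run_task(name, log):
    # Stand-in for real work; records the order in which tasks ran.
    log.append(name)


executed = []
# static_order() yields each task only after all of its dependencies.
for task in TopologicalSorter(dependencies).static_order():
    run_task(task, executed)

print(executed)
```

The two extract tasks have no dependencies on each other, so their relative order is not fixed; only the constraint that `join_tables` follows both and `build_report` comes last is guaranteed.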

Airflow

Continue reading

Luigi review

Posted on Sun 20 December 2015 in Data • Tagged with data, etl, workflow, luigi

I review Luigi, an execution framework for writing data pipes in Python code. It supports task-task dependencies, and it has a simple central scheduler with an HTTP API and an extensive library of helpers for building data pipes for Hadoop, AWS, MySQL etc. It was written by Spotify for internal use and open sourced in 2012. A number of companies use it, such as Foursquare, Stripe and Asana.

Continue reading

Cargo Cult Data

Posted on Mon 26 January 2015 in Data • Tagged with data

Cargo cult data is when you're collecting and looking at data when making decisions, but you're only following the forms and outside appearances of scientific investigation and missing the essentials, so it doesn't work.

Continue reading