Unevenness at the edges

Marton Trencseni - Fri 30 October 2020 • Tagged with stats, data

Sometimes we look at the top performers in a field and see obviously uneven representations of groups (gender, ethnicity, etc). There a multitude of factors that can lead to it, such as unfair bias in access to opportunities. Here I will show one unintuitive mathematical effect that can contribute to such unevenness in the case of normal distributions.

Continue reading

Effective Data Visualization Part 3: Line charts and stacked area charts

Marton Trencseni - Tue 01 September 2020 • Tagged with charts, dashboards, data, visualization

Most charts should be line charts or stacked area chart, because they communicate valuable trend information and are easy to parse for the human eyes and brain.

Continue reading

Effective Data Visualization Part 2: Formatting numbers

Marton Trencseni - Sun 23 August 2020 • Tagged with charts, dashboards, data, visualization

Format numbers for human consumption. What is more readable, 1.539e+5 or 153,859? Showing numbers effectively on spreadsheets, charts, dashboards, reports is a basic ingredient for readability, like formatting code in programming.

Continue reading

Effective Data Visualization Part 1: Categorical data

Marton Trencseni - Sat 22 August 2020 • Tagged with charts, dashboards, data, visualization

Making clear, readable charts is part of the craftmanship minimum for any data related role. In part one, I look at how to present categorical data.

A pie chart

Continue reading

Beyond the Central Limit Theorem

Marton Trencseni - Thu 06 February 2020 • Tagged with data, ab testing, statistics

In the previous post, I talked about the importance of the Central Limit Theorem (CLT) to A/B testing. Here we will explore cases when we cannot rely on the CLT to hold.

Running mean for Cauchy distribution

Continue reading

A/B testing and the Central Limit Theorem

Marton Trencseni - Wed 05 February 2020 • Tagged with data, ab testing, statistics

When working with hypothesis testing, the desciptions of the statistical method often has normality assumptions. For example, the Wikipedia page for the z-test starts like this: "A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution". What does this mean? How do I know it’s a valid assumption for my data?

Normal distribution from uniform

Continue reading

Optimizing waits in Airflow

Marton Trencseni - Sat 01 February 2020 • Tagged with data, airflow, python

Sometimes I get to put on my Data Engineering hat for a few days. I enjoy this because I like to move up and down the Data Science stack and I try to keep myself sharp technically. Recently I was able to spend a few days optimizing our Airflow ETL for speed.

Airflow DAG

Continue reading

SQL best practices for Data Scientists and Analysts

Marton Trencseni - Sun 26 January 2020 • Tagged with data, programming, sql

My list of SQL best practices for Data Scientists and Analysts, or, how I personally write SQL code. I picked this up at Facebook, and later improved it at Fetchr.

SQL code

Continue reading

How I write SQL code

Marton Trencseni - Fri 24 January 2020 • Tagged with data, programming, sql

This is a simple post about SQL code formatting. Most of this comes from my time as a Data Engineer at Facebook.

SQL code

Continue reading

Metrics Atlas

Marton Trencseni - Thu 29 August 2019 • Tagged with data, fetchr

The idea is simple: write a document which helps new and existing people—both managers and individual contributors—get an objective, metrics-based picture of the business. This is helpful when new people join, when people start working in new segments of the business, and to understand other parts of the company.

Metrics atlas

Continue reading