One of the most common Data Science tasks in a business setting is timeseries forecasting. Examples include:
- given N years of historic daily sales, build a forecast for next year's daily sales
- given N years of historic Daily Active Users (DAU), build a DAU forecast for the remainder of the year
- given N years of historic hourly transaction counts per retail store, build a forecast for next month's hourly transaction count per store
I was curious what methods and libraries other Data Scientists use, so I posted an "Ask HN" on Hacker News: Data Scientists, what libraries do you use for timeseries forecasting? I expected zero to moderate engagement; I would have been happy with 10 answers. In the end, the post generated 89 comments, most of them high-quality. This is my summary of the discussion.
Hacker News comments
Below is my enumeration of main points made in the comments.
There were a number of recommendations for Darts, which is a Python super-library for forecasting, most notably by user hrzn, the library's creator. By "super-library", I mean that it implements its own models, but also wraps existing libraries (such as Prophet). I was not aware of Darts; I definitely plan to invest time experimenting with it.
One commenter suggests skipping dedicated forecasting libraries entirely and framing the problem as plain supervised regression over lag features:
As an example, imagine you want to predict only a single sample into the future. Say furthermore that you have six input timeseries sampled hourly, and you don't expect meaningful correlation beyond 48-hour-old samples. You create 6x48 input features, take the single target value that you want to predict as output, and feed this into your run-of-the-mill gradient boosted tree. This is a less complex approach than reaching for bespoke time-series tooling; I've personally had success doing something like this. If your regressor does not support multiple outputs, you can always wrap it in sklearn's MultiOutputRegressor (or optionally RegressorChain; check it out). This is useful if, in the above example, you are not looking to predict only the next sample, but maybe the next 12 samples.
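To make the lag-feature approach concrete, here is a rough sketch using scikit-learn and a single synthetic series for simplicity (the six-series, 6x48-feature setup described above generalizes directly; the lag count, horizon, and model choice are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

def make_lag_matrix(series, n_lags, horizon):
    """Turn a 1-D series into (lag features, future targets) pairs.

    Row i holds the n_lags values preceding time i + n_lags; the
    targets are the `horizon` values that follow.
    """
    X, y = [], []
    for i in range(len(series) - n_lags - horizon + 1):
        X.append(series[i : i + n_lags])
        y.append(series[i + n_lags : i + n_lags + horizon])
    return np.array(X), np.array(y)

# Synthetic hourly series with a daily (24h) cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(2000)
series = 10 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size)

# 48 lag features per sample, predicting the next 12 samples at once.
X, y = make_lag_matrix(series, n_lags=48, horizon=12)

# GradientBoostingRegressor is single-output, so wrap it as suggested.
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=30))
model.fit(X[:-1], y[:-1])
pred = model.predict(X[-1:])  # shape (1, 12): the next 12 hours
```

Swapping in XGBoost's or LightGBM's regressor instead of `GradientBoostingRegressor` requires no other changes, since both expose the same fit/predict interface.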
In a response, user em500 points out:
[Prophet] is mostly just regression though, with features for trend, yearly and weekly periodicity (smoothed a bit using trigonometric regressors), and holiday features. The only non-standard linear regression part is that it includes a flexible piecewise linear trend, with regularization to select where the trend is allowed to change.
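To illustrate em500's point, here is a numpy-only sketch of that kind of regression: a linear trend plus trigonometric seasonal regressors, fit with ordinary least squares. The changepoints, regularization, and holiday features that Prophet adds on top are omitted, and the synthetic data and Fourier orders are my own assumptions:

```python
import numpy as np

def seasonal_design(t, period, order):
    """Trigonometric regressors (Fourier terms) for one seasonality."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Two years of synthetic daily data: trend + yearly cycle + noise.
rng = np.random.default_rng(1)
t = np.arange(730.0)
y = 0.05 * t + 5 * np.sin(2 * np.pi * t / 365.25) + rng.normal(0, 0.5, t.size)

# Design matrix: intercept, linear trend, weekly and yearly seasonality.
X = np.column_stack([
    np.ones_like(t),                       # intercept
    t,                                     # linear trend
    seasonal_design(t, 7, order=3),        # weekly seasonality
    seasonal_design(t, 365.25, order=10),  # yearly seasonality
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```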
To my surprise, there are quite a few comments critical of Prophet. In my experience at work (where we do forecasting of business timeseries, see the examples in the introduction) Prophet does a good job. Prophet supports holidays, external regressors, growth trends, changepoints, and sensible seasonalities out of the box. In the majority of cases I don't feel the need to use another library. Back to the criticism, user dxbydt writes:
Peter Cotton has at least a dozen very credible studies/results on Prophet vs other timeseries libraries. Before committing to Prophet, please check out a few of these (all over LinkedIn). His tone is acerbic because he believes Prophet is suboptimal & makes poor forecasts compared to the other contenders. That said, you can ignore the tone, just download the packages & test out the scenarios for yourself. I personally will not use Prophet. Like most stat tools in the python ecosystem, it is super easy to deploy & code up, but often inaccurate if you actually care about the results. Of course, if it's some sales prediction forecast where everything's pretty much made up & data is sparse/unverifiable, then Prophet ftw.
Note: I don't understand why he/she says sales forecasting is "made up".
User qsort writes:
[Prophet is] not as good as they'd have you believe, but it's fast and analysts can play with it to some extent.
Another interesting comment from user tfehring:
For cases that Prophet doesn't cover I recommend bsts, which is much more flexible and powerful. Anything too complicated for bsts, I'll typically implement in Stan.
Multiple users mention ensemble methods, i.e. building multiple forecasts with different models and combining them. The most interesting comment came from user d4rti:
I did some anomaly detection work, in business transactions, and found the best way was to create a sort of ensemble model, where we applied all the models, and kept any anomalies, then used simple rules to only alert on 'interesting' anomalies, like: 2-3 anomalies in a row, high deviation from expected, multiple models detected anomaly, to improve signal vs noise.
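A minimal sketch of that kind of rule-based ensemble, with two stand-in detectors (z-score and IQR) on synthetic data. The detectors, rules, and thresholds are illustrative assumptions on my part, not d4rti's actual implementation:

```python
import numpy as np

def zscore_flags(x, thresh=3.0):
    """Flag points more than `thresh` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > thresh

def iqr_flags(x, k=1.5):
    """Flag points outside the Tukey fences (k * IQR beyond the quartiles)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def interesting(flags_per_model, deviations, run_len=2, dev_thresh=4.0, min_models=2):
    """Keep all anomalies, but alert only on the 'interesting' ones:
    a run of consecutive anomalies, a large deviation, or agreement
    between multiple detectors."""
    votes = np.sum(flags_per_model, axis=0)  # how many models flagged each point
    any_flag = votes > 0
    alerts = np.zeros_like(any_flag)
    for i in range(len(any_flag)):
        consecutive = i >= run_len - 1 and all(any_flag[i - run_len + 1 : i + 1])
        alerts[i] = any_flag[i] and (
            consecutive or deviations[i] > dev_thresh or votes[i] >= min_models
        )
    return alerts

# Synthetic data with one injected spike at index 250.
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 500)
x[250] = 8.0

deviations = np.abs((x - x.mean()) / x.std())
flags = np.vstack([zscore_flags(x), iqr_flags(x)])
alerts = interesting(flags, deviations)
```

The point of the second stage is exactly the signal-vs-noise trade-off the comment describes: the individual detectors may flag many ordinary tail values, but the rules let only corroborated or severe anomalies through.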
User jll29 is one of many who mention neural network models:
Former Reuters Research Director here. When modeling time series, you will want a model that is sensitive both to short term and longer term movements. In other words, a Long Short-Term Memory (LSTM).
And finally, a mention for classical ARIMA from user crimsoneer:
For time series, classical methods (ARIMA etc) still continue to perform very well for most problems.
Methods and libraries mentioned
A list of methods/libraries mentioned in this thread (excluding R libraries):
- ARIMA (method)
- Exponential smoothing (method)
- Stan: a statistical library used internally by Prophet
- XGBoost: how to use XGBoost's regressor to build a timeseries forecast
- LightGBM: how to use LightGBM's regressor to build a timeseries forecast
- State Space Model and Kalman Filters
- Pybsts: stands for Python Bayesian Structural Time Series
- Tsfresh: timeseries feature extraction
- statsmodels for timeseries forecasting
The comments are very interesting and informative. I plan to look into Darts, Pytorch-forecasting and some of the other methods mentioned here.