Peter's stats stuff

Applying statistics and data science 'in the wild'

I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.

Recent posts

Extrapolation is tough for trees!

10 December 2016

Tree-based predictive analytics methods like random forests and extreme gradient boosting may perform poorly with data that is out of the range of the original training data.
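To make the point concrete, here is a minimal sketch of a single one-dimensional regression tree written from scratch. It is not the code from the post, and it is neither a random forest nor a boosted model; the function names `fit_tree` and `predict` and the toy data are made up for illustration. A tree predicts the mean of the training responses in whichever leaf a new point lands in, so any point beyond the training range falls into an outermost leaf and gets that leaf's value, however far away it is.

```python
from statistics import mean


def sse(ys):
    """Sum of squared errors of ys around their mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree by recursively taking the SSE-minimising split.

    Hypothetical helper for illustration only; real implementations add
    stopping rules, multiple features, and ensembling.
    """
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    # Try every split that leaves at least min_leaf points on each side.
    for i in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        # Too few points to split: this node is a leaf predicting the mean.
        return {"value": mean(ys)}
    i = best[1]
    return {
        "threshold": (xs[i - 1] + xs[i]) / 2,
        "left": fit_tree(xs[:i], ys[:i], min_leaf),
        "right": fit_tree(xs[i:], ys[i:], min_leaf),
    }


def predict(tree, x):
    """Walk down the tree until a leaf is reached and return its mean."""
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]


# Toy training data: a perfectly linear trend y = 2x on x = 0, ..., 9.
xs = list(range(10))
ys = [2 * x for x in xs]
tree = fit_tree(xs, ys)

# Inside the training range the tree tracks the trend in steps;
# outside it, the prediction is stuck at the value of the last leaf.
for x_new in (4, 9, 20, 100):
    print(x_new, predict(tree, x_new), "true value:", 2 * x_new)
```

The predictions at 20 and 100 are identical to the prediction at 9, even though the true trend keeps rising. A linear model fitted to the same data would extrapolate the trend without difficulty.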

Why time series forecasts prediction intervals aren't as good as we'd hope

07 December 2016

A quick demonstration of the impact of inevitably random estimates of the parameters and meta-parameters in ARIMA time series modelling.

Error, trend, seasonality - ets and its forecast model friends

27 November 2016

I check out exponential smoothing state space models for univariate time series as a general family of forecasting models, and in particular the `ets`, `stlm` and `thetaf` functions from Hyndman's forecast R package. For monthly and quarterly seasonal data, `thetaf` seems to be slightly outperformed by its more flexible and general cousins.

Declining sea ice in the Arctic

24 November 2016

Adding a (small amount of) polish to a well known chart of seasonal Arctic sea ice declining over the years.

Earthquake energy over time

19 November 2016

I look more into this business of energy from earthquakes.

Extreme pie chart polishing

15 November 2016

I polish up a dramatic pie chart on earthquake energy released in New Zealand over the last few years.

Timeseries forecasting using extreme gradient boosting

06 November 2016

I'm working on a new R package to make it easier to forecast timeseries with the xgboost machine learning algorithm. So far in tests against large competition data collections (thousands of timeseries), it performs comparably to the nnetar neural network method, but not as well as more traditional timeseries methods like auto.arima and theta.
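The summary above does not say how the package turns a timeseries into something a tree-based learner can use. One common approach, shown here purely as an assumption and not as a description of the package, is to build a table in which each row holds the previous few values of the series and the target is the value that follows. The function name `make_lagged` and the toy series are hypothetical.

```python
def make_lagged(series, n_lags):
    """Return (X, y) where each row of X holds the n_lags values preceding y.

    Hypothetical helper illustrating a lagged-feature embedding; it is an
    assumption about how a series could be fed to a learner such as xgboost,
    not the interface of any particular package.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    targets = range(n_lags, len(series))
    X = [list(series[t - n_lags:t]) for t in targets]
    y = [series[t] for t in targets]
    return X, y


# Toy series, made up for illustration.
series = [10, 12, 15, 11, 13, 16]
X, y = make_lagged(series, n_lags=3)

for row, target in zip(X, y):
    print(row, "->", target)
```

Each row of `X` can then be passed to any regression learner. Forecasting more than one step ahead would need either repeated one-step predictions fed back in as new lags, or a separate model per horizon.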

FiveThirtyEight's polling data for the US Presidential election

29 October 2016

I have a quick look at the polling data used by the FiveThirtyEight website in predicting the US presidential election results.

Tourism forecasting competition data in the Tcomp R package

19 October 2016

A new R package `Tcomp` makes data from the 2010 tourism forecasting competition available in a format designed to facilitate the fitting and testing of en masse automated forecasts, consistent with the M1 and M3 forecasting competition data in the `Mcomp` R package.

Statistics New Zealand experimental API initiative

15 October 2016

Statistics New Zealand recently launched experimental access to some of their data over the web via an application programming interface; it can be accessed easily via the equally experimental statsNZ R package by Jonathan Marshall.