Peter's stats stuff

Applying statistics and data science 'in the wild'

I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.

Recent posts


Books I like

14 January 2017

My ten recommended books for applied statistics and data science. Then 13 more!


Cross-validation of topic modelling

05 January 2017

Cross-validation of the "perplexity" from a topic model, to help determine a good number of topics.


Sparse matrices, k-means clustering, topic modelling with posts on the 2004 US Presidential election

31 December 2016

I explore different sparse matrix formats in R and moving data from R to H2O. Along the way I use k-means clustering and topic modelling to explore textual data from the Daily Kos blog on the 2004 US Presidential election.


Extracting data on shadow economy from PDF tables

26 December 2016

The shadow economy as a percentage of GDP in wealthier countries is in decline, but it spiked in 2009 with the economic crisis. More research is needed to adequately understand it. Along the way I experiment with extracting data frames from PDF tables, and show it's always worthwhile looking at the same data in different ways, which can be as simple as freeing up the vertical axes of graphics.


forecastHybrid 0.3.0 on CRAN

24 December 2016

forecastHybrid 0.3.0 for ensemble time series forecasting is now on CRAN. Two new features are prediction intervals for the nnetar (neural network) component of the combination, and theta method models.


Air quality in Indian cities

18 December 2016

Air pollution in Indian cities is unambiguously seasonal, and might also show a Diwali effect.


Extrapolation is tough for trees!

10 December 2016

Tree-based predictive analytics methods like random forests and extreme gradient boosting may perform poorly with data that is out of the range of the original training data.
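
To make the extrapolation point concrete, here is a minimal sketch in plain Python. It is not the code from the post and not any library's implementation: `fit_tree` and `predict` are hypothetical helpers for a toy one-dimensional regression tree, and the training data (y = 2x for x from 0 to 19) is invented for illustration. Because every leaf predicts the mean of the training targets that fell into it, the prediction for an input far outside the training range is the same as for the largest training input.

```python
# Toy 1-D regression tree (illustrative only; names and data are assumptions).

def fit_tree(xs, ys, depth=3):
    """Recursively split on the threshold that minimises squared error."""
    mean = sum(ys) / len(ys)
    if depth == 0 or len(set(xs)) < 2:
        return mean  # leaf: predict the mean of the training targets
    best_sse, best_t = None, None
    # Skip the smallest value so both sides of every split are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_t = sse, t
    left_pairs = [(x, y) for x, y in zip(xs, ys) if x < best_t]
    right_pairs = [(x, y) for x, y in zip(xs, ys) if x >= best_t]
    return (
        best_t,
        fit_tree([x for x, _ in left_pairs], [y for _, y in left_pairs], depth - 1),
        fit_tree([x for x, _ in right_pairs], [y for _, y in right_pairs], depth - 1),
    )

def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree

# Invented training data: a perfectly linear trend, y = 2x, for x in 0..19.
xs = [float(i) for i in range(20)]
ys = [2.0 * x for x in xs]
tree = fit_tree(xs, ys)

inside = predict(tree, 19.0)    # largest x seen in training
outside = predict(tree, 100.0)  # far beyond the training range; true value is 200
print(inside, outside)          # the two predictions are identical
```

The tree cannot continue the upward trend: any input above the last split threshold lands in the same rightmost leaf, so the prediction is capped by the training targets. Ensembles of such trees average leaf values and share this behaviour.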


Why time series forecasts prediction intervals aren't as good as we'd hope

07 December 2016

A quick demonstration of the impact of inevitably random estimates of the parameters and meta-parameters in ARIMA time series modelling.


Error, trend, seasonality - ets and its forecast model friends

27 November 2016

I check out exponential smoothing state space models for univariate time series as a general family of forecasting models, and in particular the `ets`, `stlm` and `thetaf` functions from Hyndman's forecast R package. For monthly and quarterly seasonal data, `thetaf` seems to be slightly outperformed by its more flexible and general cousins.


Declining sea ice in the Arctic

24 November 2016

Adding a (small amount of) polish to a well known chart of seasonal Arctic sea ice declining over the years.