Peter's stats stuff

Applying statistics and data science 'in the wild'

I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.


Recent posts

Dual axes time series plots may be ok sometimes after all

18 August 2016

Dual-axis time series charts are often deprecated, but the standard alternatives have weaknesses too. In some circumstances, if done carefully, dual-axis time series charts may be ok after all. In particular, you can choose the two vertical scales so that the drawing on the page is equivalent to drawing two indexed series, while retaining the meaningful mapping to the scale of the original variables.
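
A minimal sketch of that scale choice, not the code from the post. It assumes both series are indexed to their first observation, that the first observations are positive, and that the plotting library lets you set each axis's limits directly. The function names and the numbers are made up for illustration.

```python
def dual_axis_limits(left, right, ref=0):
    """Axis limits that make a dual-axis plot match a plot of two indexed series.

    Both series are indexed to 1.0 at position `ref` (assumed to hold positive
    values); the shared index range is then converted back to original units.
    """
    left_index = [v / left[ref] for v in left]
    right_index = [v / right[ref] for v in right]
    lo = min(left_index + right_index)
    hi = max(left_index + right_index)
    left_limits = (left[ref] * lo, left[ref] * hi)
    right_limits = (right[ref] * lo, right[ref] * hi)
    return left_limits, right_limits


def position(value, limits):
    """Vertical position of `value` on the page, as a fraction of axis height."""
    lo, hi = limits
    return (value - lo) / (hi - lo)


# Made-up numbers, purely for illustration.
exports = [100.0, 110.0, 120.0]   # left axis, e.g. $ millions
exchange_rate = [2.0, 2.1, 2.6]   # right axis, different units

left_limits, right_limits = dual_axis_limits(exports, exchange_rate)

# Both series start at the same height, as two indexed series would.
print(position(exports[0], left_limits), position(exchange_rate[0], right_limits))
```

Passing `left_limits` and `right_limits` to the two vertical axes puts every point where the indexed version would put it, while the tick labels stay in the original units.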

Elastic net regularization of a model of burned calories

13 August 2016

Elastic net regularization of estimates is a good way of dealing with collinearity and feature selection; this is illustrated with a simple dataset of 30 daily observations from a Fitbit tracker.

nzcensus on GitHub

04 August 2016

Demonstration analysis of area unit demographic data from the nzcensus R package on GitHub, which is approaching maturity and readiness for general use.

nzelect 0.2.0 on CRAN

14 July 2016

The nzelect R package is now available on CRAN; so far it has aggregate results by voting place for the New Zealand 2014 general election.

Animated world inequality map

02 July 2016

Choropleth map, animated over time, of intra-country inequality where data exist, using University of Texas Inequality Project data.

International Household Income Inequality data

30 June 2016

I explore the University of Texas Inequality Project's Estimated Household Income Inequality data, which provides modelled estimates of inequality for more than 150 countries from 1963 to 2008.

Monthly Regional Tourism Estimates

16 June 2016

My day job released new monthly data on estimated tourism spend by region in New Zealand.

Presentation slides on using graphics

14 June 2016

Excellent statistical graphics usually reveal multivariate interactions and comparisons, and combine high data density with a minimum of ink that doesn't directly represent data.

Bootstrap and cross-validation for evaluating modelling strategies

05 June 2016

I compare 'simple' bootstrap, 'enhanced' (optimism-correcting) bootstrap, and repeated k-fold cross-validation as methods for estimating fit of three example modelling strategies.

Actual coverage of confidence intervals for standard deviation

29 May 2016

The success rate (the proportion of times the interval covers the true value) of 95% bootstrap confidence intervals for a population standard deviation can be very poor for complex mixed distributions, such as real-world weekly income, when the sample size is modest (<20,000).
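
A rough sketch of how such a success rate can be estimated by simulation; it is not the code or the data from the post. The population here is a hypothetical stand-in for weekly income (20% zeros, the rest log-normal, with made-up parameters), the interval is a simple percentile bootstrap, and the sample size and repetition counts are kept tiny so that it runs in a second or two.

```python
import math
import random
import statistics


def sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


# Hypothetical population: 20% zeros, the rest log-normal (assumed parameters).
P_ZERO, MU, SIGMA = 0.2, 6.0, 1.0


def draw(rng):
    """One observation from the mixed distribution."""
    return 0.0 if rng.random() < P_ZERO else rng.lognormvariate(MU, SIGMA)


def true_sd():
    """Exact standard deviation of the mixture, from its first two moments."""
    m1 = (1 - P_ZERO) * math.exp(MU + SIGMA ** 2 / 2)
    m2 = (1 - P_ZERO) * math.exp(2 * MU + 2 * SIGMA ** 2)
    return math.sqrt(m2 - m1 ** 2)


def coverage(n=50, reps=100, n_boot=200, seed=1):
    """Share of 95% percentile-bootstrap intervals that contain the true sd."""
    rng = random.Random(seed)
    target = true_sd()
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        boot = [sd(rng.choices(sample, k=n)) for _ in range(n_boot)]
        cuts = statistics.quantiles(boot, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
        if cuts[0] <= target <= cuts[-1]:
            hits += 1
    return hits / reps


print(coverage())
```

A well-behaved interval would give a value near 0.95; how far short it falls depends on the distribution, the sample size and the bootstrap variant used.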