Applying statistics and data science 'in the wild'
I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.
As part of familiarising myself with the Stan probabilistic programming language, I replicate Simon Jackman's state space modelling with house effects of the 2007 Australian federal election.
Excited that New Zealand's Government Statistician is promoting reproducibility and open access to code, tabular outputs and research products from research with confidentialised microdata.
For choropleth maps showing the whole world, we don't need to stick to static maps with Mercator projections. I like rotating globes, and interactive slippery maps with tooltips.
Sankey charts based on individual level survey data are a good way of showing change from election to election. I demonstrate this, via some complications with survey-reweighting and missing data, with the New Zealand Election Study for the 2014 and 2011 elections.
Introducing a Shiny web tool for exploring individual characteristics and party vote in the 2014 New Zealand general election.
I work through a fairly complete modelling case study utilising methods for complex surveys, multiple imputation, multilevel models, non-linear relationships and the bootstrap. People who voted for New Zealand First in the 2014 election were more likely to be older, born in New Zealand, identify as working class and male.
Linked micromaps are an ok way of presenting data and are probably the right tool in some circumstances; but they're not as cool as I thought they might be.
Shapefiles for cartogram by New Zealand Territorial Authority (ie District or City), with area proportional to population in 2013, have been added to the nzcensus package on GitHub.
Choropleth maps are useful ways of using fill colour to show densities, proportions and growth rates by political or economic boundaries, but can be visually problematic when large geographic areas represent few people, or small areas (ie cities) represent many. One solution is a cartogram, and I have a go at using them to present New Zealand census data in this post and accompanying shiny app.
It's much more important to get a well-specified model than worry about propensity score matching versus weighting, or either versus single stage regression, or increasing sample size. A regression that includes all "true" 100 explanatory variables with only 500 observations performs better in estimating a treatment effect than any of those methods when only 90 of the 100 variables are observed, even with 100,000 observations.