Importing the New Zealand Income Survey SURF

The quest for income microdata

For a separate project, I've been looking for source data on income and wealth inequality. Not aggregate data like Gini coefficients or the percentage of income earned by the bottom 20% or top 1%, but the sources used to calculate those things. Because it's sensitve personal financial data either from surveys or tax data, it's not easy to get hold of, particularly if you want to do it in the comfort of your own home and don't have time / motivation to fill in application forms. So far, what I've got is a simulated unit record file (SURF) from the New Zealand Income Survey, published by Statistics New Zealand for educational and instructional purposes. It's a simplified, simulated version of the original survey and it lets me make graphics like this one:

This plot shows the density of income from all sources for four of the most prominent ethnicity types in New Zealand. New Zealand official statistics allow people to identify with more than one ethnicity, which means there is some double counting (more on that below). Three things leap out at me from this chart:

The density curves of the different ethnic groups are very similar visually
People of Maori and Pacific Peoples ethnicity have proportionately more $300-$400 per week earners than Europeans and Asians, leading to an overall noticeable lean to the lower numbers
Weekly income is bimodal, bunching up at $345 per week and $820 per week. Actually, this image is misleading in that respect; in reality it is trimodal, with a huge bunch of people with zero income (and some negative), who aren't shown on this plot because of the logarithmic scale. So you could say that, for New Zealanders getting any income at all, there is a bimodal distribution.

Where's that bimodal distribution coming from? The obvious candidate is part time workers, and this is seen when we plot income versus hours worked in the next image:

(That interesting diagonal line below which there are very few points is the minimum wage per hour)

Statistics New Zealand warn that while this simulated data is an accurate representation of the actual New Zealand population, it shouldn't be used for precise statistics, so for now I won't draw particularly strong conclusions on anything. A simulated unit record file is a great way of solving confidentiality purposes, but in the end it has been created by a statistical model. There's a risk that interesting inferences might be just picking up something implicit in the model that wasn't taken into account when it was first put together. That's not likely to be the case for the basic distribution of the income values, but we'll note the exploratory finding for now and move on.

A box plot is better for examining the full range of reported ethnicities but not so good for picking up the bi-modal subtlety. It should also be noted that both these graphs delete people who made losses (negative income) in a given week - the data show income from all sources:

Loading the NZIS SURF into a database

Here's how I drew the graphs, and estimated the two modes. I think I'll want to re-use this data quite often, so it's worth while putting in a database that's accessible from different projects without having to move a bunch of data files around in Windows Explorer. The way Statistics New Zealand have released the file, with codings rather than values for dimensions like ethnicity and region, also makes a database a good way to make the data analysis ready.

After importing the data, the first significant job is to deal with that pesky ethnicity variable. In the released version of the data respondents with two ethnicities have both code numbers joined together eg 12 means both European (1) and Maori (2). To get around this I split the data into two fact tables, one with a single row per respondent with most of the data; and a second just for ethnicity with either one or two rows for each respondent. Here's how I do that with a combination of {dplyr} and {tidyr}:

library(dplyr)
library(RODBC)
library(tidyr)
library(mbie) # for AskCreds, which is alternatively directly available 
              # https://github.com/nz-mbie/mbie-r-package/blob/master/pkg/R/Creds.R 

# imports, clean up, and save to database the data from
# http://www.stats.govt.nz/tools_and_services/microdata-access/nzis-2011-cart-surf.aspx

url <- "http://www.stats.govt.nz/~/media/Statistics/services/microdata-access/nzis11-cart-surf/nzis11-cart-surf.csv"
nzis <- read.csv(url)


#-----------------fact tables------------------
# Create a main table with a primary key

f_mainheader <- nzis %>%
   mutate(survey_id = 1:nrow(nzis))

# we need to normalise the multiple ethnicities, currently concatenated into a single variable
cat("max number of ethnicities is", max(nchar(nzis$ethnicity)), "\n")

f_ethnicity <- f_mainheader %>%
   select(ethnicity, survey_id) %>%
   mutate(First = substring(ethnicity, 1, 1),
          Second = substring(ethnicity, 2, 2)) %>%
   select(-ethnicity) %>%
   gather(ethnicity_type, ethnicity_id, -survey_id) %>%
   filter(ethnicity_id != "") 
   
# drop the original messy ethnicity variable and tidy up names on main header
f_mainheader <- f_mainheader %>%
   select(-ethnicity) %>%
   rename(region_id = lgr,
          sex_id = sex,
          agegrp_id = agegrp,
          qualification_id = qualification,
          occupation_id = occupation)

Second step is to re-create the dimension tables that turn the codes (eg 1 and 2) into meaningful values (European and Maori). Statistics New Zealand provide these, but unfortunately in an Excel workbook that's easier for humans than computers to link up to the data. There's not too many of so it's easy enough to code them by hand, which the next set of code does:

#-----------------dimension tables------------------
# all drawn from the data dictionary available at the first link given above
d_sex <- data_frame(sex_id = 1:2, sex = c("male", "female"))

d_agegrp <- data_frame(
   agegrp_id = seq(from = 15, to = 65)) %>%
   mutate(agegrp = ifelse(agegrp_id == 65, "65+", paste0(agegrp_id, "-", agegrp_id + 4)))

d_ethnicity <- data_frame(ethnicity_id = c(1,2,3,4,5,6,9),
                          ethnicity = c(
                             "European",
                             "Maori",
                             "Pacific Peoples",
                             "Asian",
                             "Middle Eastern/Latin American/African",
                             "Other Ethnicity",
                             "Residual Categories"))


d_occupation <- data_frame(occupation_id = 1:10,
                       occupation = c(
                          "Managers",
                          "Professionals",
                          "Technicians and Trades Workers",
                          "Community and Personal Service Workers",
                          "Clerical and Adminsitrative Workers",
                          "Sales Workers",
                          "Machinery Operators and Drivers",
                          "Labourers",
                          "Residual Categories",
                          "No occupation"                          
                       ))


d_qualification <- data_frame(qualification_id = 1:5,
                        qualification = c(
                           "None",
                           "School",
                           "Vocational/Trade",
                           "Bachelor or Higher",
                           "Other"
                        ))

d_region <- data_frame(region_id =1:12,
                       region = c("Northland", "Auckland", "Waikato", "Bay of Plenty", "Gisborne / Hawke's Bay",
                                  "Taranaki", "Manawatu-Wanganui", "Wellington", 
                                  "Nelson/Tasman/Marlborough/West Coast", "Canterbury", "Otago", "Southland"))

The final step in the data cleaning is to save all of our tables to a database, create some indexes so they work nice and fast, and join them up in an analysis-ready view. In the below I use an ODBC (open database connectivity) connection to a MySQL server called "PlayPen". R plays nicely with databases; set it up and forget about it.

#---------------save to database---------------
creds <- AskCreds("Credentials for someone who can create databases")

PlayPen <- odbcConnect("PlayPen_prod", uid = creds$uid, pwd = creds$pwd)
try(sqlQuery(PlayPen, "create database nzis11") )
sqlQuery(PlayPen, "use nzis11")

# fact tables.  These take a long time to load up with sqlSave (which adds one row at a time)
# but it's easier (quick and dirty) than creating a table and doing a bulk upload from a temp 
# file.  Any bigger than this you'd want to bulk upload though - took 20 minutes or more.
sqlSave(PlayPen, f_mainheader, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, f_ethnicity, addPK = TRUE, rownames = FALSE) 
                                            # add a primary key on the fly in this case.  All other tables
                                            # have their own already created by R.

# dimension tables
sqlSave(PlayPen, d_sex, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_agegrp, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_ethnicity, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_occupation, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_qualification, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_region, addPK = FALSE, rownames = FALSE)

#----------------indexing----------------------

sqlQuery(PlayPen, "ALTER TABLE f_mainheader ADD PRIMARY KEY(survey_id)")

sqlQuery(PlayPen, "ALTER TABLE d_sex ADD PRIMARY KEY(sex_id)")
sqlQuery(PlayPen, "ALTER TABLE d_agegrp ADD PRIMARY KEY(agegrp_id)")
sqlQuery(PlayPen, "ALTER TABLE d_ethnicity ADD PRIMARY KEY(ethnicity_id)")
sqlQuery(PlayPen, "ALTER TABLE d_occupation ADD PRIMARY KEY(occupation_id)")
sqlQuery(PlayPen, "ALTER TABLE d_qualification ADD PRIMARY KEY(qualification_id)")
sqlQuery(PlayPen, "ALTER TABLE d_region ADD PRIMARY KEY(region_id)")

#---------------create an analysis-ready view-------------------
# In Oracle we'd use a materialized view, which MySQL can't do.  But
# the below is fast enough anyway:

sql1 <-
   "CREATE VIEW vw_mainheader AS SELECT sex, agegrp, occupation, qualification, region, hours, income FROM
      f_mainheader a   JOIN
      d_sex b          on a.sex_id = b.sex_id JOIN
      d_agegrp c       on a.agegrp_id = c.agegrp_id JOIN
      d_occupation e   on a.occupation_id = e.occupation_id JOIN
      d_qualification f on a.qualification_id = f.qualification_id JOIN
      d_region g       on a.region_id = g.region_id"

sqlQuery(PlayPen, sql1)

Average weekly income in NZ 2011 by various slices and dices

Whew, that's out of the way. Next post that I use this data I can go straight to the database. We're now in a position to check our data matches the summary totals provided by Statistics New Zealand. Statistics New Zealand say this SURF can be treated as a simple random sample, which means each point can get an identical individual weight, which we can estimate from the summary tables in their data dictionary. Each person in the sample represents 117.4 in the population (in the below I have population figures in thousands, to match the Statistics New Zealand summaries.

Statistics New Zealand doesn't provide region and occupation summary statistics, and the qualification summaries they provide use a more detailed classification than is in the actual SURF. But for the other categories - sex, age group, and the trick ethnicity - my results match theirs, so I know I haven't munched the data.

sex	Mean	Sample	Population
female	611	15217	1787.10
male	779	14254	1674.00

tab1 <- sqlQuery(PlayPen, 
"SELECT 
                              sex,
                              ROUND(AVG(income))          as Mean, 
                              COUNT(1)                    as Sample,
                              ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY sex")

agegrp	Mean	Sample	Population
15-19	198	2632	309.10
20-24	567	2739	321.70
25-29	715	2564	301.10
30-34	796	2349	275.90
35-39	899	2442	286.80
40-44	883	2625	308.30
45-49	871	2745	322.40
50-54	911	2522	296.20
55-59	844	2140	251.30
60-64	816	1994	234.20
65+	421	4719	554.20

tab2 <- sqlQuery(PlayPen, 
"SELECT 
                              agegrp,
                              ROUND(AVG(income))          as Mean, 
                              COUNT(1)                    as Sample,
                              ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY agegrp")

qualification	Mean	Sample	Population
School	564	7064	829.60
None	565	6891	809.30
Other	725	1858	218.20
Vocational/Trade	734	8435	990.60
Bachelor or Higher	955	5223	613.40

# qualification summary in data dictionary uses a different classification to that in data
tab3 <- sqlQuery(PlayPen, 
"SELECT 
                              qualification,
                              ROUND(AVG(income))          as Mean, 
                              COUNT(1)                    as Sample,
                              ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY qualification
                           ORDER BY Mean")

occupation	Mean	Sample	Population
No occupation	257	10617	1246.90
Labourers	705	2154	253.00
Residual Categories	726	24	2.80
Community and Personal Service Workers	745	1734	203.60
Sales Workers	800	1688	198.20
Clerical and Adminsitrative Workers	811	2126	249.70
Technicians and Trades Workers	886	2377	279.20
Machinery Operators and Drivers	917	1049	123.20
Professionals	1105	4540	533.20
Managers	1164	3162	371.30

# occupation summary not given in data dictionary
tab4 <- sqlQuery(PlayPen, 
"SELECT 
                             occupation,
                             ROUND(AVG(income))          as Mean, 
                             COUNT(1)                    as Sample,
                             ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY occupation
                           ORDER BY Mean")

region	Mean	Sample	Population
Bay of Plenty	620	1701	199.80
Taranaki	634	728	85.50
Waikato	648	2619	307.60
Southland	648	637	74.80
Manawatu-Wanganui	656	1564	183.70
Northland	667	1095	128.60
Nelson/Tasman/Marlborough/West Coast	680	1253	147.20
Otago	686	1556	182.70
Gisborne / Hawke's Bay	693	1418	166.50
Canterbury	701	4373	513.60
Auckland	720	9063	1064.40
Wellington	729	3464	406.80

# region summary not given in data dictionary
tab5 <- sqlQuery(PlayPen, 
"SELECT 
                             region,
                             ROUND(AVG(income))          as Mean, 
                             COUNT(1)                    as Sample,
                             ROUND(COUNT(1) * .11744, 1) as Population
                           FROM vw_mainheader
                           GROUP BY region
                           ORDER BY Mean ")

ethnicity	Mean	Sample	Population
Residual Categories	555	85	10.00
Maori	590	3652	428.90
Pacific Peoples	606	1566	183.90
Middle Eastern/Latin American/African	658	343	40.30
Asian	678	3110	365.20
European	706	22011	2585.00
Other Ethnicity	756	611	71.80

tab6 <- sqlQuery(PlayPen,
"SELECT
                     ethnicity,
                     ROUND(AVG(income))          as Mean,
                     COUNT(1)                    as Sample,
                     ROUND(COUNT(1) * .11744, 1) as Population
                  FROM f_mainheader m
                  JOIN f_ethnicity e ON m.survey_id = e.survey_id
                  JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id
                  GROUP BY ethnicity
                  ORDER BY Mean")

Graphics showing density of income

Finally, here's the code that drew the charts we started with, showing the distribution of the weekly income New Zealanders of different ethnicity.

library(showtext)
library(RODBC)
library(ggplot2)
library(dplyr)
font.add.google("Poppins", "myfont")
showtext.auto()

# download individual level data from database including ethnicity join
dtf <- sqlQuery(PlayPen,
"SELECT
                   ethnicity,
                   income
                FROM f_mainheader m
                JOIN f_ethnicity e ON m.survey_id = e.survey_id
                JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id")

# density plot of income by density                
dtf %>%
   filter(ethnicity %in% c("Asian", "European", "Maori", "Pacific Peoples")) %>%
   ggplot(aes(x = income, colour = ethnicity)) +
   geom_density(size = 1.1) +
   scale_x_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 345, 825, 10000)) +
   theme_minimal(base_family = "myfont") +
   theme(legend.position = "bottom") +
   scale_colour_brewer("", palette = "Set1")

# boxplot of income by ethnicity
dtf %>%
   ggplot(aes(y = income, x = ethnicity, colour = ethnicity)) +
   geom_boxplot() +
   geom_rug() +
   scale_y_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 1000, 10000)) +
   theme_minimal() +
   labs(x = "") +
   coord_flip() +
   scale_colour_brewer(palette = "Set2") +
   theme(legend.position = "none")

# scatter plot of joint density of hours and income
dtf2 <- sqlQuery(PlayPen, "select hours, income from vw_mainheader")
ggplot(dtf2, aes(x = hours, y = income)) +
   geom_jitter(alpha = 0.05) +
   scale_x_log10("Hours worked", breaks = c(1, 10, 20, 40, 80)) +
   scale_y_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 345, 825, 10000)) +
   theme_minimal(base_family = "myfont")   
   
   
# how did I choose to mark $345 and $825 on the scale?
# ripped off (and improved) from http://stackoverflow.com/questions/27418461/calculate-the-modes-in-a-multimodal-distribution-in-r
find_modes<- function(data, ...) {
   dens <- density(data, ...)
   y <- dens$y
   modes <- NULL
   for ( i in 2:(length(y) - 1) ){
      if ( (y[i] > y[i - 1]) & (y[i] > y[i + 1]) ) {
         modes <- c(modes,i)
      }
   }
   if ( length(modes) == 0 ) {
      modes = 'This is a monotonic distribution'
   }
   return(dens$x[modes])
}

x <- dtf$income
x[x < 1] <- 1 # not interested in negative income just now

# where are those modes?
exp(find_modes(log(x)))

Edited 18 August 2015 to add the scatter plot of the joint density of hours and income.

Importing the New Zealand Income Survey SURF

At a glance:

The quest for income microdata

Loading the NZIS SURF into a database

Average weekly income in NZ 2011 by various slices and dices

Graphics showing density of income