Moving largish data from R to H2O - spam detection with Enron emails

Moving around sparse matrices of text data - the limitations of `as.h2o`

This post is the resolution of a challenge I first wrote about in late 2016, moving large sparse data from an R environment onto an H2O cluster for machine learning purposes. In that post, I experimented with functionality recently added by the H2O team to their supporting R package, the ability for as.h2o() to interpret a sparse Matrix object from R and convert it to an H2O frame. The Matrix and as.h2o method is ok for medium sized data but broke down on my hardware with a larger dataset - a bags of words from New York Times articles with 300,000 rows and 102,000 columns. Cell entries are the number of times a particular word is used in the document represented by a row and are mostly empty, so my 12GB laptop has no problem managing the data in a sparse format like Matrix from the Matrix package or a simple triplet matrix from the slam package. I’m not sure what as.h2o does under the hood in converting from Matrix to an H2O frame, but it’s too much for my laptop.

My motivation for this is that I want to use R for convenient pre-processing of textual data using the tidytext approach; but H2O for high powered machine learning. tidytext makes it easy to create a sparse matrix with cast_dtm or cast_sparse, but uploading this to H2O can be a challenge.

How to write from R into SVMLight format

After some to-and-fro on Stack Overflow, the best advice was to export the sparse matrix from R into a SVMLight/LIBSVM format text file, then read it into H2O with h2o.importFile(..., parse_type = "SVMLight"). This turned the problem from an difficult and possibly intractable memory managment challenge into a difficult and possibly intractable data formatting and file writing challenge - how to efficiently write files in SVMLight format.

SVMLight format combines a data matrix with some modelling information ie the response value of a model, or “label” as it is (slightly oddly, I think) often called in this world. Instead of a more conventional row-based sparse matrix format which might convey information in row-column-value triples, it uses label-column:value indicators. It looks like this:

1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:0.3

That example is equivalent to two rows of a sparse matrix with at least 34,567 columns. The first row has 1 as the response value, 3.4 in the 10th column of explanatory variables, and 0.231 in the 34,567th column; the second row has 0.2 as the response value, 1 in the 22nd column, and so on.

Writing data from R into this format is a known problem discussed in this Q&A on Stack Overflow. Unfortunately, the top rated answer to that question, e1071::write.svm is reported as being slow, and also it is integrated into a workflow that requires you to first fit a Support Vector Machine model to the data, a step I wanted to avoid. That Q&A led me to a GitHub repo by zygmuntz that had a (also slow) solution for writing dense matrices into SVMLight format, but that didn’t help me as my data were too large for R to hold in dense format. So I wrote my own version for taking simplet triplet matrices and writing SVMLight format. My first version depended on nested paste statements that were applyd to each row of the data and was still too slow at scale, but with the help of yet another Stack Overflow interaction and some data table wizardry by @Roland this was able to reduce the expected time writing my 300,000 by 83,000 New York Times matrix (having removed stop words) from several centuries to two minutes.

I haven’t turned this into a package - it would seem better to find an existing package to add it to than create a package just for this one function, any ideas appreciated. The functions are available on GitHub but in case I end up moving them, here they are in full. One function creates a big character vector; the second writes that to file. This means multiple copies of the data need to be held in R and hence creates memory limitations, but is much much faster than writing it one line at a time (seconds rather than years in my test cases).

library(data.table)
library(slam)

# Convert a simple triplet matrix to svm format
#' @author Peter Ellis
#' @return a character vector of length n = nrow(stm)
calc_stm_svm <- function(stm, y){
  # returns a character vector of length y ready for writing in svm format
  if(!"simple_triplet_matrix" %in% class(stm)){
    stop("stm must be a simple triple matrix")
  }
  if(!is.vector(y) | nrow(stm) != length(y)){
    stop("y should be a vector of length equal to number of rows of stm")
  }
  n <- length(y)
  
  # data.table solution thanks to @roland at http://stackoverflow.com/questions/41477700/optimising-sapply-or-for-paste-to-efficiently-transform-sparse-triplet-m/41478999#41478999
  stm2 <- data.table(i = stm$i, j = stm$j, v = stm$v)
  res <- stm2[, .(i, jv = paste(j, v, sep = ":"))
             ][order(i), .(res = paste(jv, collapse = " ")), by = i][["res"]]
  
  out <- paste(y, res)
  
  return(out)
}


#' @param stm a simple triplet matrix (class exported slam) of features (ie explanatory variables)
#' @param y a vector of labels.  If not provided, a dummy of 1s is provided
#' @param file file to write to.
#' @author Peter Ellis
write_stm_svm <- function(stm, y = rep(1, nrow(stm)), file){
  out <- calc_stm_svm(stm, y)  
  writeLines(out, con = file)
}

Example application - spam detection with the Enron emails

Although I’d used the New York Times bags of words from the UCI machine learning dataset repository for testing the scaling up of this approach, I actually didn’t have anything I wanted to analyse that data for in H2O. So casting around for an example use case I decided on using the Enron email collection for spam detection, first analysed in a 2006 conference paper by V. Metsis, I. Androutsopoulos and G. Paliouras. As well as providing one of the more sensational corporate scandals of recent times, the Enron case has blessed data scientists with one of the largest published sets of emails collected from their natural habitat.

The original authors classified the emails as spam or ham and saved these pre-processed data for future use and reproducibility. I’m not terribly knowledgeable (or interested) in spam detection, so please take the analysis below as a crude and naive example only.

Data

First the data need to be downloaded and unzipped. The files are stored as 6 Tape ARchive files

library(tidyverse)
library(tidytext)
library(tm)
library(testthat)
library(data.table)
library(h2o)
library(stringr)
library(forcats)

#===========download files=============
# Files described at http://csmining.org/index.php/enron-spam-datasets.html
baseurl <- "http://csmining.org/index.php/enron-spam-datasets.html?file=tl_files/Project_Datasets/Enron-Spam%20datasets/Preprocessed/enron"

filenames <- paste0(baseurl, 1:6, ".tar.tar")


for(i in 1:6){
   message("Downloading ", i)
   dfile <- paste0("enron", i, ".tar.tar")
   download.file(filenames[i], destfile = dfile, mode = "wb")
   message("Un-archiving ", i)
   untar(dfile)
}

This creates six folders with the names enron1, enron2 etc; each with a spam and a ham subfolder containing numerous text files. The files look like this example piece of ham (ie non-spam; a legitimate email), chosen at random:

Subject: re : creditmanager net meeting
aidan ,
yes , this will work for us .
vince
" aidan mc nulty " on 12 / 16 / 99 08 : 36 : 14 am
to : vince j kaminski / hou / ect @ ect
cc :
subject : creditmanager net meeting
vincent , i cannot rearrange my schedule for tomorrow so i would like to
confirm that we will have a net - meeting of creditmanager on friday 7 th of
january at 9 . 30 your time .
regards
aidan mc nulty
212 981 7422

The pre-processing has removed duplicates, emails sent to themselves, some of the headers, etc.

Importing the data into R and making tidy data frames of documents and word counts is made easy by Silge and Robinson’s tidytext package which I never tire of saying is a game changer for convenient analysis of text by statisticians:

#=============import to R================
# Adapting the example at http://tidytextmining.com/usenet.html
folders <- paste0("enron", 1:6)

spam <- data_frame(file = dir(paste0(folders, "/spam"), full.names = TRUE)) %>%
   mutate(text = map(file, readLines, encoding = "Latin-1")) %>%
   transmute(id = basename(file), text) %>%
   unnest(text) %>%
   mutate(SPAM = "spam")

ham <- data_frame(file = dir(paste0(folders, "/ham"), full.names = TRUE)) %>%
   mutate(text = map(file, readLines, encoding = "Latin-1")) %>%
   transmute(id = basename(file), text) %>%
   unnest(text) %>%
   mutate(SPAM = "ham")

enron_raw <- rbind(spam, ham)

#============error checks - failing!===================
# Check against the counts provided at http://csmining.org/index.php/data.html
# Turns out some 3013 spam messages have gone missing
# should be 20170 spam messages and 16545 ham messages:
# returns an error
expect_equal(length(unique(enron_raw$id)), 20170 + 16545) 

enron_raw %>%
   select(id, SPAM) %>%
   distinct() %>%
   summarise(spam_count = sum(SPAM == "spam"), ham_count = sum(SPAM == "ham"))
# For my purposes I decide not to worry about this.

#=================further processing==================
enron <- enron_raw %>%
   # will just remove the "Subject:" and "Subject :" and treat subject words like any other
   mutate(text = gsub("^Subject *: *", "", text),
          text = gsub("<U.....>", "", text, fixed = FALSE)) 

enron_words <- enron %>%
   unnest_tokens(word, text) %>%
   select(-SPAM)

As well as basic word counts, I wanted to experiment with other characteristics of emails such as number of words, number and proportion of of stopwords (frequently used words like “and” and “the”). I create a traditional data frame with a row for each email, identified by id, and columns indicating whether it is SPAM and those other characteristics of interest.

# First I'm creating a summary, dense data frame with some numeric info on each document
enron_sum1 <- enron %>%
   mutate(number_characters = nchar(text)) %>%
   group_by(id, SPAM) %>%
   summarise(number_characters = sum(number_characters))

enron_sum2 <- enron_words %>%
   group_by(id) %>%
   summarise(number_words = length(word))

enron_sum3 <- enron_words %>%
   anti_join(stop_words, by = "word") %>%
   group_by(id) %>%
   summarise(number_nonstop_words = length(word))

enron_sum_total <- enron_sum1 %>%
   left_join(enron_sum2, by = "id") %>%
   left_join(enron_sum3, by = "id") %>%
   mutate(number_stop_words = number_words - number_nonstop_words,
          proportion_stop_words = number_stop_words / number_words) %>%
   select(-number_nonstop_words)

enron_sum_total

Source: local data frame [33,702 x 6]
Groups: id [33,702]

                                   id  SPAM number_characters number_words number_stop_words
                                <chr> <chr>             <int>        <int>             <int>
    0001.1999-12-10.farmer.ham.txt   ham                28            4                 0
  0001.1999-12-10.kaminski.ham.txt   ham                24            4                 3
      0001.2000-01-17.beck.ham.txt   ham              3486          559               248
     0001.2000-06-06.lokay.ham.txt   ham              3603          536               207
   0001.2001-02-07.kitchen.ham.txt   ham               322           48                18
  0001.2001-04-02.williams.ham.txt   ham              1011          202               133
    0002.1999-12-13.farmer.ham.txt   ham              4194          432               118
   0002.2001-02-07.kitchen.ham.txt   ham               385           64                40
0002.2001-05-25.SA_and_HP.spam.txt  spam               990          170                80
      0002.2003-12-18.GP.spam.txt  spam              1064          175                63

I next make my sparse matrix as a document term matrix (which is a special case of a simplet triplet matrix from the slam package), with a column for each word (having first limited myself to interesting words)

used_words <- enron_words %>%
   # knock out stop words:
   anti_join(stop_words, by = "word") %>%
   # knock out numerals, and words with only 2 or 1 letters:
   mutate(word = gsub("[0-9]", "", word),
          wordlength = nchar(word)) %>%
   filter(wordlength > 2) %>%
   group_by(word) %>%
   summarise(count = length(word)) %>%
   ungroup() %>%
   # knock out words used less than 10 times:
   filter(count >= 10)

enron_dtm <- enron_words %>%
   right_join(used_words, by = "word") %>%
   cast_dtm(id, word, count) 

# we need a version of the dense data in the same order as the document-term-matrix, to do a sort of 
# manual join of the sparse matrix with the dense one later in H2O.
rows <- data_frame(id = rownames(enron_dtm))
enron_dense <- left_join(rows, enron_sum_total, by = "id")
expect_equal(nrow(enron_dtm), nrow(enron_dense))
expect_equal(rownames(enron_dtm), enron_dense$id)

Now we can load our two datasets onto an H2O cluster for analysis:

#================import to h2o and join up there============
h2o.init(nthreads = -1, max_mem_size = "8G")

# Load up the dense matrix with counts of stopwords etc:
enron_dense_h2o <- as.h2o(enron_dense)

# Load up the sparse matrix with columns for each word:
thefile <- tempfile()
write_stm_svm(enron_dtm, file = thefile)
enron_sparse_h2o <- h2o.uploadFile(thefile, parse_type = "SVMLight")
unlink(thefile)

# Number of rows should be equal:
expect_equal(nrow(enron_sparse_h2o), nrow(enron_dtm))
# Number of columns should be 1 extra in H2O, dummy variable of labels (1) added by write_stm_svm:
expect_equal(ncol(enron_sparse_h2o), ncol(enron_dtm) + 1)

# First column should be the dummy labels = all one
expect_equal(mean(enron_sparse_h2o[ , 1]), 1)

enron_fulldata <- h2o.cbind(enron_sparse_h2o, enron_dense_h2o)
head(colnames(enron_fulldata), 10)

# Convert the target variable to a factor so h2o.glm and other modelling functions
# know what to do with it:
enron_fulldata[ , "SPAM"] <- as.factor(enron_fulldata[ , "SPAM"])

I now have an H2O frame with 33602 rows and 26592 columns; most of the columns representing words and the cells being counts; but some columns representing other variables such as number of stopwords.

Analysis

To give H2O a workout, I decided to fit four different types of models trying to understand which emails were ham and which spam:

generalized linear model, with elastic net regularization to help cope with the large number of explanatory variables
random forest
gradient boosting machine
neural network

I split the data into training, validation and testing subsets; with the idea that the validation set would be used for choosing tuning parameters, and the testing set used as a final comparison of the predictive power of the final models. As things turned out, I didn’t have patience to do much in the way of tuning. This probably counted against the latter three of my four models, because I’m pretty confident better performance would be possible with more careful choice of some of the meta parameters. Here’s the eventual results from my not-particularly-tuned models:

The humble generalized linear model (GLM) performs pretty well; outperformed clearly only by the neural network. The GLM has a big advantage in interpretability too. Here are the most important variables for the GLM in predicting spam (NEG means a higher count of the word means less likely to be spam)

                   names coefficients sign       word
                C9729    1.1635213  NEG      enron
               C25535    0.6023054  NEG      vince
                C1996    0.5990230  NEG   attached
               C15413    0.4524011  NEG     louise
               C19891    0.3905246  NEG  questions
               C11478    0.2993239  NEG        gas
proportion_stop_words    0.2935112  NEG       <NA>
               C12866    0.2774074  NEG  hourahead
                C7268    0.2600452  NEG      daren
              C16257    0.2497282  NEG      meter
              C12878    0.2439315  NEG    houston
              C21441    0.2345008  NEG      sally
              C16106    0.2179897  NEG    meeting
              C12894    0.1965571  NEG        hpl
              C16618    0.1909332  NEG     monday
              C11270    0.1873195  NEG     friday
               C8553    0.1704185  NEG        doc
               C7386    0.1673093  NEG       deal
              C21617    0.1636310  NEG   schedule
               C4185    0.1510748  NEG california
              C12921    0.3695006  POS       http
              C16624    0.2132104  POS      money
              C22597    0.2034031  POS   software
              C15074    0.1957970  POS       life
               C5394    0.1922659  POS      click
              C17683    0.1915608  POS     online
              C25462    0.1703109  POS     viagra
              C16094    0.1605989  POS       meds
              C21547    0.1583438  POS       save
               C9483    0.1498732  POS      email

So, who knew, emails containing the words “money”, “software”, “life”, “click”, “online”, “viagra” and “meds” are (or at least were in the time of Enron - things may have changed) more likely to be spam.

Here’s the code for the analysis all together:

#=====================analysis in H2O=============

#-----------prep------------------------
# names of the explanatory variables - all the columns in the sparse matrix (which are individual words)
# except the first one which is the dummy "labels" created just for the SVMLight format.  And the
# four summary variables in the dense dataframe:
xnames <- c(colnames(enron_sparse_h2o)[-1], 
            "number_characters", "number_words", "number_stop_words", "proportion_stop_words")

# A lookup table of the column names that refer to words to the words they actually mean.
# Useful when we look at variable performance down the track.
wordcols <- data_frame(
   variable = colnames(enron_sparse_h2o),
   word = c("", colnames(enron_dtm))
)

# The slower models (ie apart from glm) take ages for cross-validation so 
# we'll settle for single-split validation
enron_split <- h2o.splitFrame(enron_fulldata, ratios = c(0.6, 0.2))
dim(enron_split[[1]])

#-------------------GLM---------------------
# Binomial GLM, with elastic net regularization.  This is the minimal baseline sort of model.
# 200 seconds
system.time({
mod.glm <- h2o.glm(x = xnames, y = "SPAM", training_frame = enron_split[[1]], 
                   validation_frame = enron_split[[2]],
                   family = "binomial",
                   alpha = 0.5, lambda_search = TRUE)
})

h2o.varimp(mod.glm) %>%
   slice(1:30) %>%
   left_join(wordcols, by = c("names" = "variable")) %>%
   arrange(sign, desc(coefficients)) 

h2o.performance(mod.glm, valid = TRUE) # 131 / 6640

#--------------------Random Forest-------------------
# with ntrees = 50, nfolds = 10, max_depth  = 20, took 25 minutes to get 
# 5% through so I stopped it and went back to having a single validation frame.
# Can view progress by going to http://127.0.0.1:54321/flow/index.html
# 1340 seconds
system.time({
mod.rf <- h2o.randomForest(x = xnames, y = "SPAM", training_frame = enron_split[[1]], 
                           validation_frame = enron_split[[2]],
                           ntrees = 500, max_depth = 20,
                           stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 25)
})
# note from watching progress in flow, scoring every score_tree_interval models takes quite a lot of time.
# so I pushed out score_tree_interval.  On current settings it will score
# the latest model against the validation frame every 25 trees, and stop if it hasn't improved since 3*25 trees ago.
# If score_tree_interval is a small number, it stops growing trees too quickly

h2o.performance(mod.rf, valid = TRUE) # 172/6640 error rate

#-------------------gradient boosting machine-----------------
# 1240 seconds
system.time({
   mod.gbm <- h2o.gbm(x = xnames, y = "SPAM", training_frame = enron_split[[1]], 
                              validation_frame = enron_split[[2]],
                              ntrees = 100, max_depth = 5,
                              stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 5)
})

h2o.performance(mod.gbm, valid = TRUE) # 240/6640 error rate

#-------------------artificial neural network------------------------

# 900 seconds; much faster when sparse = TRUE
system.time({
   mod.dl <- h2o.deeplearning(x = xnames, y = "SPAM", training_frame = enron_split[[1]], 
                      validation_frame = enron_split[[2]],
                      hidden = c(200, 200),
                      stopping_tolerance = 0.0001, stopping_rounds = 5, sparse = TRUE)
})
h2o.performance(mod.dl, valid = TRUE) # 82/6640


#-----------------Naive Bayes - doesn't work--------------------
# Conditional probabilities won't fit in the driver node's memory (20.99 GB > 6.06GB)
mod.nb <- h2o.naiveBayes(x = xnames, y = "SPAM", training_frame = enron_split[[1]], 
                              validation_frame = enron_split[[2]])


#==============presentation of results==========================

perf <- function(newdata){
   return(c(
      glm = as.character(h2o.confusionMatrix(mod.glm, newdata = newdata)$Rate[1]),
      rf = as.character(h2o.confusionMatrix(mod.rf, newdata = newdata)$Rate[1]),
      gbm = as.character(h2o.confusionMatrix(mod.gbm, newdata = newdata)$Rate[1]),
      dl = as.character(h2o.confusionMatrix(mod.dl, newdata = newdata)$Rate[1])
      ))
}

# this isn't the most efficient computing wise, because the performance on the training
# and validation sets could be extracted more directly but it is quick and easy to code:
perfs <- lapply(enron_split, perf)

perfsdf <- data_frame(value = unlist(perfs),
                      model = rep(c("Generalized Linear Model", "Random Forest", "Gradient Boosting Machine", "Deep Learning"), 3),
                      dataset = rep(c("Training", "Validation", "Testing"), each = 4),
                      abbrev = rep(names(perfs[[1]]), 3)) %>%
   mutate(errors = as.numeric(str_sub(str_extract(value, "=[0-9]+"), start = 2)),
          denom = as.numeric(str_sub(str_extract(value, "/[0-9]+"), start = 2)),
          errorrate = errors / denom,
          model = factor(model, levels = c("Deep Learning", "Generalized Linear Model", 
                                           "Random Forest", "Gradient Boosting Machine")),
          dataset = factor(dataset, levels = c("Training", "Validation", "Testing")))

perfsdf %>%
   ggplot(aes(x = errorrate, y = dataset, label = abbrev, colour = model)) +
   geom_path(aes(group = abbrev), alpha = 0.5, linetype = 2) +
   geom_text() +
   scale_colour_brewer(palette = "Set1") +
   scale_x_continuous("Error rate", label = percent) +
   labs(y = "", colour = "") +
   ggtitle("Error rates detecting ham from spam",
           "Enron email dataset, four different statistical learning methods")