Success rates of automated ARIMA fitting

30 Sep 2015

At a glance:

Using an automated process to select the order of an ARMA time series model returns the true data generating process less than half the time even with simple data generating processes; and with more complex models the chance of success comes down nearly to zero even with long sample sizes.

A function for testing auto.arima()

This is the third, and the last from me on this particular topic for a while, in a series of posts on fitting time series models and comparing time series to eachother, originally inspired by a question on the Cross-Validated Q&A site. In this week's post I'm exploring and testing an almost throw-away comment in my original answer - that looking at the parameterisation of an automatically chosen ARIMA (autoregressive integrated moving average) time series model is a bad idea because very similar data generating models may quite legitimately be fit by different-looking ARIMA models. I want to look at a corollory to that:

"How successful will an automated ARIMA selection be at returning the order of the original data generating process for a simulated time series?"

An ARMA process consists of an "autoregressive" AR part (basically, a linear combination of the lagged values of the series) plus a "moving average" MA (a linear combination of a white noise series). By "order", I mean the number of lags in the AR part plus the number of lagged white noise values in the MA part. An AR(1) model is just a coefficient times the lagged value, plus white noise; an MA(1) model is just a coefficient times the lagged value of the white noise, plus white noise; an ARMA(1,1) is a combination of the two.

In particular, I wanted to explore Rob Hyndman's excellent auto.arima() function from his {forecast} package, which uses small-sample corrected Akaike Information Criterion (AICc) for model selection. We use it a lot at my work and I wanted to see how it performs when fitting a dataset that is known to have been created (because we did it ourselves) by an ARMA process of a particular order. To explore this, I created a function that:

  1. takes as an argument a model specification (eg AR(1) for autoregression of order 1 ie a time series created by a coefficient times its lagged value plus white noise)
  2. simulates a large number (user controlled) of replications of that series, of a variety of different sample lengths (defaulting to 20, 40, 80, ..., 10240)
  3. Uses the auto.arima() model to identify the best ARIMA model for each replication of the series
  4. Stores the order of the final model and compares it against the correct order

Here's the code for that function:

library(dplyr)
library(tidyr)
library(ggplot2)
library(showtext) # for fonts
library(forecast)

font.add.google("Poppins", "myfont")
showtext.auto()
theme_set(theme_light(base_family = "myfont"))

arima_fit_sim <- function(model, 
                          correct_params, 
                          n = 2 ^ (1:10) * 10, 
                          reps = 10,
                          stepwise = FALSE,
                          verbose = TRUE){
   
   # function that creates @reps number of simulated ARIMA models of type @model and 
   # length @n.  @correct_params should be a vector of the correct parameters eg
   # c("ar1", "ma1", "ma2") for ARMA(1,2)
   # peter.ellis2013nz at gmail com 25 September 2015
   
   require(forecast)
   if(verbose){require(dplyr)}
   
   results <- data.frame(n = rep(n, each = reps),
                      correct = FALSE,
                      fitted = "",
                      stringsAsFactors = FALSE)
   counter <- 0

   for(j in 1:length(n)){
      if(verbose){message(n[j])}
      for(i in 1:reps){
         counter <- counter + 1
         true_series <- arima.sim(model, n = n[j])
         fit <- auto.arima(true_series, stepwise = stepwise)
         results[counter, "fitted"] <- paste(names(fit$coef), collapse = " ")
         if(length(coef(fit)) == length(correct_params) && sum(names(fit$coef) %in% correct_params) == length(correct_params) ){
            results[counter, "correct"] <- TRUE
            
         }   else {
            results[counter, "correct"] <- FALSE
         }
         
      }
   }
   
   if(verbose){
      print(
         results %>%
            group_by(n) %>%
            summarise(correct = sum(correct) / length(correct) * 100)
         )
   }

   return(results)
   }

There's flaws with this function I didn't have time to fix. Most notably, there's a big risk in that the call to the function needs to specify a vector of parameters eg c("ar1", "ma1") for ARMA(1,1) that has to match the model provided. A better function would deduce the vector of parameter names from the model structure; if someone knows how to do this simply, let me know.

The test

For my actual test, I chose to try out two simple simulations - AR(1) and MA(1) processes - and two more complex ones - ARMA(2,2) and ARMA(3,3). There's a sticking point which is the actual autoreggressive and moving average parameters. An AR(1) process autoregression coefficient of 0.05 for example will be almost indistinguishable from white noise, whereas as it gets closer to 1 it would be much easier for the model selection algorithm to notice. To do a comprehensive test, I should actually compare models across the whole parameter space. As this is just a blog post, I arbitrarily chose some values.

Here's the code performing the actual tests on those four processes, using my new arima_fit_sim() function with 500 repetitions of each combination of model and sample length. This took about eight hours to run on my (nothing special) laptop:

our_reps <- 500

results_ar1 <- arima_fit_sim(model = list(ar = c(0.5)),
                             correct_params = c("ar1"), 
                             reps = our_reps)

results_ma1 <- arima_fit_sim(model = list(ma = c(0.5)),
                             correct_params = c("ma1"), 
                             reps = our_reps)


results_arma22 <- arima_fit_sim(model = list(ar = c(0.5, -0.2), ma = c(-0.3, 0.2)),
                                correct_params = c("ar1", "ar2", "ma1", "ma2"), 
                                reps = our_reps)

results_arma33 <- arima_fit_sim(model = list(ar = c(0.5, -0.2, 0.3), ma = c(-0.3, 0.2, 0.4)),
                                correct_params = c("ar1", "ar2", "ar3", "ma1", "ma2", "ma3"), 
                                reps = our_reps)


save(results_ar1, results_ma1, results_arma22, results_arma33, file = "_output/0012_sim_results.rda")

Results

My function stores the order (ie how many autoregressive and how many moving average parameters)of the 5000 models it has fit for each simulation. Here are the ten most common models that were selected in the case of the AR(1) model:

fitted count
ar1 1752
ar1 ar2 ar3 ma1 ma2 316
ar1 intercept 278
ma1 238
ar1 ar2 224
ar1 ma1 210
ar1 ar2 ma1 185
ma1 ma2 140
ar1 ar2 ar3 124
ma1 ma2 ma3 116

As a percentage correct for different sample lengths, here's how they end up:

n correct
20 29.0
40 35.0
80 38.6
160 31.2
320 29.8
640 35.4
1280 37.2
2560 37.4
5120 36.0
10240 40.8

In this simple case, auto.arima() is moderately successful. It correctly picked an AR(1) model as the best 1752 times out of 5000, much more than the second most popular model which was a very complex ARMA(3,2). Even when the time series is only 20 observations long it gets it right 29 percent of the time, and this goes up to 41 percent by the time there are 10,000 observations. I suspect it gets better as the time series gets longer, but I'd (naively) expected better than this. It certainly means, for example, that when your automated model selection tells you you have an ARMA(3,2) model you should not dismiss the possibility that an AR(1) model might be the true generating process (other information might help you dismiss that of course, including the actual parameters). I was surprised that even for a longish time series only 41 percent of the time is the true model being recovered.

Here's a plot summarising the overall results

Here it is separated out by facets so we can see that pattern closer.

I was a bit astonished at the poor success rate of picking up the ARMA(3, 3) process so I tabulate here the counts of every final model that actually was selected.

fitted count
ar1 ma1 ma2 ma3 ma4 1446
ma1 ma2 ma3 ma4 ma5 500
ma1 ma2 ma3 ma4 444
ar1 ma1 ma2 ma3 339
ar1 ma1 ma2 ma3 ma4 intercept 276
ar1 ar2 ar3 ma1 ma2 248
ar1 ar2 drift 229
ar1 ar2 ar3 168
ar1 ar2 ma1 ma2 ma3 157
ma1 ma2 ma3 ma4 intercept 151
intercept 115
ar1 ma1 ma2 ma3 intercept 103
88
ma1 ma2 ma3 61
ma1 54
ar1 ar2 ma1 ma2 drift 50
ar1 ar2 ma1 ma2 ma3 intercept 43
ar1 ar2 ar3 ar4 ar5 42
ma1 ma2 ma3 intercept 41
ar1 ar2 ar3 ar4 32
ar1 ar2 ar3 ar4 ma1 31
ar1 ar2 ma1 ma2 29
ma1 ma2 ma3 ma4 drift 28
ma1 drift 27
ar1 ar2 ar3 drift 25
ar1 ar2 ar3 ma1 ma2 intercept 24
ar1 ar2 intercept 24
ar1 21
ar1 ar2 19
ma1 intercept 16
ar1 ar2 ar3 ar4 drift 14
drift 13
ar1 ar2 ar3 ma1 12
ar1 ar2 ma1 drift 12
ma1 ma2 ma3 ma4 ma5 drift 12
ma1 ma2 ma3 ma4 ma5 intercept 12
ar1 ar2 ar3 intercept 11
ar1 ma1 ma2 11
ar1 ar2 ar3 ar4 ar5 intercept 8
ma1 ma2 ma3 drift 8
ma1 ma2 drift 7
ar1 ar2 ar3 ar4 ar5 drift 5
ar1 ar2 ar3 ar4 ma1 intercept 5
ar1 ar2 ma1 ma2 ma3 drift 5
ar1 intercept 5
ma1 ma2 5
ma1 ma2 intercept 5
ar1 ar2 ar3 ma1 ma2 drift 4
ar1 ar2 ma1 intercept 4
ar1 ar2 ar3 ar4 intercept 3
ar1 drift 2
ar1 ma1 2
ar1 ar2 ar3 ar4 ma1 drift 1
ar1 ar2 ar3 ma1 intercept 1
ar1 ar2 ma1 1
ar1 ma1 intercept 1

Yes, that's right, many many models are picked but not a single instance matches the "ar1 ar2 ar3 ma1 ma2 ma3" that was the original data generating process. So the take home message is even stronger than I thought - don't pay much attention to the order of your fitted ARIMA models! This doesn't mean there's a problem with using them - they give good fits and work well in many forecasting situations. But we shouldn't confuse them with some essential reality of an underlying process. It's a striking example of the famous (well in some quarters) quote by George E. P. Box:

"All models are wrong, but some are useful" (Box)

I found all this a bit surprising. I'd expected a material amount of variation in what models were returned, but not this much. And I'd expected the model selection process to be more consistent (to use a term loosely) ie converge to the true model more noticeably as sample length got bigger. I presume there's a literature on all this but didn't have time to research it, so any pointers are welcomed in the comments or by Twitter.

Code that produced the summary tables

results_ar1 %>%
   group_by(fitted) %>%
   summarise(count = length(fitted)) %>%
   arrange(-count) %>%
   head(10) 

results_ar1 %>%
   group_by(n) %>%
   summarise(correct = sum(correct) / length(correct) * 100)

results_arma33 %>%
   group_by(fitted) %>%
   summarise(count = length(fitted)) %>%
   arrange(-count) 

Code that produced the summary plot

r1 <- results_ar1 %>%
   mutate(model = "AR(1)")

r2 <- results_ma1 %>%
   mutate(model = "MA(1)")

r3 <- results_arma22 %>%
   mutate(model = "ARMA(2, 2)")

r4 <- results_arma33 %>%
   mutate(model = "ARMA(3, 3)")

r_all <- rbind(r1, r2, r3, r4) %>%
   mutate(model = factor(model, 
                         levels = c("AR(1)", "MA(1)", "ARMA(2, 2)", "ARMA(3, 3)")))
						 
p1 <- r_all %>%
   group_by(model, n) %>%
   summarise(correct = sum(correct) / length(correct)) %>%
   ggplot(aes(x = n, y = correct, colour = model)) +
   geom_point() +
   geom_line() +
   scale_y_continuous("Percent of models correctly identified", label = percent) +
   scale_x_log10("Length of time series", 
                # label = comma, 
                breaks = 2 ^ (1:10) * 10) +
   ggtitle("Success rate of automated ARIMA model selection\nfitting to a known data generating process")

print(p1)


print(p1 + 
		facet_wrap(~model, scales = "free_y") +
		theme(legend.position = "none"))