Success rates of automated ARIMA fitting

A function for testing auto.arima()

This is the third, and the last from me on this particular topic for a while, in a series of posts on fitting time series models and comparing time series to eachother, originally inspired by a question on the Cross-Validated Q&A site. In this week's post I'm exploring and testing an almost throw-away comment in my original answer - that looking at the parameterisation of an automatically chosen ARIMA (autoregressive integrated moving average) time series model is a bad idea because very similar data generating models may quite legitimately be fit by different-looking ARIMA models. I want to look at a corollory to that:

"How successful will an automated ARIMA selection be at returning the order of the original data generating process for a simulated time series?"

An ARMA process consists of an "autoregressive" AR part (basically, a linear combination of the lagged values of the series) plus a "moving average" MA (a linear combination of a white noise series). By "order", I mean the number of lags in the AR part plus the number of lagged white noise values in the MA part. An AR(1) model is just a coefficient times the lagged value, plus white noise; an MA(1) model is just a coefficient times the lagged value of the white noise, plus white noise; an ARMA(1,1) is a combination of the two.

In particular, I wanted to explore Rob Hyndman's excellent auto.arima() function from his {forecast} package, which uses small-sample corrected Akaike Information Criterion (AICc) for model selection. We use it a lot at my work and I wanted to see how it performs when fitting a dataset that is known to have been created (because we did it ourselves) by an ARMA process of a particular order. To explore this, I created a function that:

takes as an argument a model specification (eg AR(1) for autoregression of order 1 ie a time series created by a coefficient times its lagged value plus white noise)
simulates a large number (user controlled) of replications of that series, of a variety of different sample lengths (defaulting to 20, 40, 80, ..., 10240)
Uses the auto.arima() model to identify the best ARIMA model for each replication of the series
Stores the order of the final model and compares it against the correct order

Here's the code for that function:

library(dplyr)
library(tidyr)
library(ggplot2)
library(showtext) # for fonts
library(forecast)

font.add.google("Poppins", "myfont")
showtext.auto()
theme_set(theme_light(base_family = "myfont"))

arima_fit_sim <- function(model, 
                          correct_params, 
                          n = 2 ^ (1:10) * 10, 
                          reps = 10,
                          stepwise = FALSE,
                          verbose = TRUE){
   
   # function that creates @reps number of simulated ARIMA models of type @model and 
   # length @n.  @correct_params should be a vector of the correct parameters eg
   # c("ar1", "ma1", "ma2") for ARMA(1,2)
   # peter.ellis2013nz at gmail com 25 September 2015
   
   require(forecast)
   if(verbose){require(dplyr)}
   
   results <- data.frame(n = rep(n, each = reps),
                      correct = FALSE,
                      fitted = "",
                      stringsAsFactors = FALSE)
   counter <- 0

   for(j in 1:length(n)){
      if(verbose){message(n[j])}
      for(i in 1:reps){
         counter <- counter + 1
         true_series <- arima.sim(model, n = n[j])
         fit <- auto.arima(true_series, stepwise = stepwise)
         results[counter, "fitted"] <- paste(names(fit$coef), collapse = " ")
         if(length(coef(fit)) == length(correct_params) && sum(names(fit$coef) %in% correct_params) == length(correct_params) ){
            results[counter, "correct"] <- TRUE
            
         }   else {
            results[counter, "correct"] <- FALSE
         }
         
      }
   }
   
   if(verbose){
      print(
         results %>%
            group_by(n) %>%
            summarise(correct = sum(correct) / length(correct) * 100)
         )
   }

   return(results)
   }

There's flaws with this function I didn't have time to fix. Most notably, there's a big risk in that the call to the function needs to specify a vector of parameters eg c("ar1", "ma1") for ARMA(1,1) that has to match the model provided. A better function would deduce the vector of parameter names from the model structure; if someone knows how to do this simply, let me know.

The test

For my actual test, I chose to try out two simple simulations - AR(1) and MA(1) processes - and two more complex ones - ARMA(2,2) and ARMA(3,3). There's a sticking point which is the actual autoreggressive and moving average parameters. An AR(1) process autoregression coefficient of 0.05 for example will be almost indistinguishable from white noise, whereas as it gets closer to 1 it would be much easier for the model selection algorithm to notice. To do a comprehensive test, I should actually compare models across the whole parameter space. As this is just a blog post, I arbitrarily chose some values.

Here's the code performing the actual tests on those four processes, using my new arima_fit_sim() function with 500 repetitions of each combination of model and sample length. This took about eight hours to run on my (nothing special) laptop:

our_reps <- 500

results_ar1 <- arima_fit_sim(model = list(ar = c(0.5)),
                             correct_params = c("ar1"), 
                             reps = our_reps)

results_ma1 <- arima_fit_sim(model = list(ma = c(0.5)),
                             correct_params = c("ma1"), 
                             reps = our_reps)


results_arma22 <- arima_fit_sim(model = list(ar = c(0.5, -0.2), ma = c(-0.3, 0.2)),
                                correct_params = c("ar1", "ar2", "ma1", "ma2"), 
                                reps = our_reps)

results_arma33 <- arima_fit_sim(model = list(ar = c(0.5, -0.2, 0.3), ma = c(-0.3, 0.2, 0.4)),
                                correct_params = c("ar1", "ar2", "ar3", "ma1", "ma2", "ma3"), 
                                reps = our_reps)


save(results_ar1, results_ma1, results_arma22, results_arma33, file = "_output/0012_sim_results.rda")

Results

My function stores the order (ie how many autoregressive and how many moving average parameters)of the 5000 models it has fit for each simulation. Here are the ten most common models that were selected in the case of the AR(1) model:

fitted	count
ar1	1752
ar1 ar2 ar3 ma1 ma2	316
ar1 intercept	278
ma1	238
ar1 ar2	224
ar1 ma1	210
ar1 ar2 ma1	185
ma1 ma2	140
ar1 ar2 ar3	124
ma1 ma2 ma3	116

As a percentage correct for different sample lengths, here's how they end up:

n	correct
20	29.0
40	35.0
80	38.6
160	31.2
320	29.8
640	35.4
1280	37.2
2560	37.4
5120	36.0
10240	40.8

In this simple case, auto.arima() is moderately successful. It correctly picked an AR(1) model as the best 1752 times out of 5000, much more than the second most popular model which was a very complex ARMA(3,2). Even when the time series is only 20 observations long it gets it right 29 percent of the time, and this goes up to 41 percent by the time there are 10,000 observations. I suspect it gets better as the time series gets longer, but I'd (naively) expected better than this. It certainly means, for example, that when your automated model selection tells you you have an ARMA(3,2) model you should not dismiss the possibility that an AR(1) model might be the true generating process (other information might help you dismiss that of course, including the actual parameters). I was surprised that even for a longish time series only 41 percent of the time is the true model being recovered.

Here's a plot summarising the overall results

Here it is separated out by facets so we can see that pattern closer.

I was a bit astonished at the poor success rate of picking up the ARMA(3, 3) process so I tabulate here the counts of every final model that actually was selected.

fitted	count
ar1 ma1 ma2 ma3 ma4	1446
ma1 ma2 ma3 ma4 ma5	500
ma1 ma2 ma3 ma4	444
ar1 ma1 ma2 ma3	339
ar1 ma1 ma2 ma3 ma4 intercept	276
ar1 ar2 ar3 ma1 ma2	248
ar1 ar2 drift	229
ar1 ar2 ar3	168
ar1 ar2 ma1 ma2 ma3	157
ma1 ma2 ma3 ma4 intercept	151
intercept	115
ar1 ma1 ma2 ma3 intercept	103
	88
ma1 ma2 ma3	61
ma1	54
ar1 ar2 ma1 ma2 drift	50
ar1 ar2 ma1 ma2 ma3 intercept	43
ar1 ar2 ar3 ar4 ar5	42
ma1 ma2 ma3 intercept	41
ar1 ar2 ar3 ar4	32
ar1 ar2 ar3 ar4 ma1	31
ar1 ar2 ma1 ma2	29
ma1 ma2 ma3 ma4 drift	28
ma1 drift	27
ar1 ar2 ar3 drift	25
ar1 ar2 ar3 ma1 ma2 intercept	24
ar1 ar2 intercept	24
ar1	21
ar1 ar2	19
ma1 intercept	16
ar1 ar2 ar3 ar4 drift	14
drift	13
ar1 ar2 ar3 ma1	12
ar1 ar2 ma1 drift	12
ma1 ma2 ma3 ma4 ma5 drift	12
ma1 ma2 ma3 ma4 ma5 intercept	12
ar1 ar2 ar3 intercept	11
ar1 ma1 ma2	11
ar1 ar2 ar3 ar4 ar5 intercept	8
ma1 ma2 ma3 drift	8
ma1 ma2 drift	7
ar1 ar2 ar3 ar4 ar5 drift	5
ar1 ar2 ar3 ar4 ma1 intercept	5
ar1 ar2 ma1 ma2 ma3 drift	5
ar1 intercept	5
ma1 ma2	5
ma1 ma2 intercept	5
ar1 ar2 ar3 ma1 ma2 drift	4
ar1 ar2 ma1 intercept	4
ar1 ar2 ar3 ar4 intercept	3
ar1 drift	2
ar1 ma1	2
ar1 ar2 ar3 ar4 ma1 drift	1
ar1 ar2 ar3 ma1 intercept	1
ar1 ar2 ma1	1
ar1 ma1 intercept	1

Yes, that's right, many many models are picked but not a single instance matches the "ar1 ar2 ar3 ma1 ma2 ma3" that was the original data generating process. So the take home message is even stronger than I thought - don't pay much attention to the order of your fitted ARIMA models! This doesn't mean there's a problem with using them - they give good fits and work well in many forecasting situations. But we shouldn't confuse them with some essential reality of an underlying process. It's a striking example of the famous (well in some quarters) quote by George E. P. Box:

"All models are wrong, but some are useful" (Box)

I found all this a bit surprising. I'd expected a material amount of variation in what models were returned, but not this much. And I'd expected the model selection process to be more consistent (to use a term loosely) ie converge to the true model more noticeably as sample length got bigger. I presume there's a literature on all this but didn't have time to research it, so any pointers are welcomed in the comments or by Twitter.

Code that produced the summary tables

results_ar1 %>%
   group_by(fitted) %>%
   summarise(count = length(fitted)) %>%
   arrange(-count) %>%
   head(10) 

results_ar1 %>%
   group_by(n) %>%
   summarise(correct = sum(correct) / length(correct) * 100)

results_arma33 %>%
   group_by(fitted) %>%
   summarise(count = length(fitted)) %>%
   arrange(-count)

Code that produced the summary plot

r1 <- results_ar1 %>%
   mutate(model = "AR(1)")

r2 <- results_ma1 %>%
   mutate(model = "MA(1)")

r3 <- results_arma22 %>%
   mutate(model = "ARMA(2, 2)")

r4 <- results_arma33 %>%
   mutate(model = "ARMA(3, 3)")

r_all <- rbind(r1, r2, r3, r4) %>%
   mutate(model = factor(model, 
                         levels = c("AR(1)", "MA(1)", "ARMA(2, 2)", "ARMA(3, 3)")))
						 
p1 <- r_all %>%
   group_by(model, n) %>%
   summarise(correct = sum(correct) / length(correct)) %>%
   ggplot(aes(x = n, y = correct, colour = model)) +
   geom_point() +
   geom_line() +
   scale_y_continuous("Percent of models correctly identified", label = percent) +
   scale_x_log10("Length of time series", 
                # label = comma, 
                breaks = 2 ^ (1:10) * 10) +
   ggtitle("Success rate of automated ARIMA model selection\nfitting to a known data generating process")

print(p1)


print(p1 + 
		facet_wrap(~model, scales = "free_y") +
		theme(legend.position = "none"))