# M4 Forecasting Competition: Explorative Data Analysis

## May 1, 2018 Data ScienceForecastingEDA

M4 is an academic forecasting competition which is running until May 31st 2018. I didn’t really intend to participate but I wanted to have a look at the data anyway. This was actually my notebook when I did EDA in the first place, so the plots are probably quite ugly, and some things are left undone. I would’ve never gotten the deadline with my thesis and all the work going on. So consider this a rough sketch, maybe you find something interesting anyway. I might find the time to polish this a bit in the future.

library(ggplot2)
library(ggthemes)
library(gridExtra)
library(reshape2)
library(plyr)
library(dplyr)
library(forecast)
library(TSdist)
# devtools::install_github("carlanetto/M4comp2018")
library(M4comp2018)

freq.cat <- function(a) { rows <- c(); uq <- unique(a); for (k in uq) { rows <- rbind(rows, c(k, sum(a==k))); }; colnames(rows) <- c("x", "freq"); return(rows); }

## Part I: Overview

The data is provided by the M4comp2018 package and can be installed via devtools::install_github("carlanetto/M4comp2018"). Warning: the data includes one million time series, the size needed in RAM is acceptable, but might cause strain on older computers.

M4 consists of a list of a million time series, each being represented by an object with the following attributes:

M4[[1]] %>% names
## [1] "st"     "x"      "n"      "type"   "h"      "period"
• st: series number (unique identifier)
• n: numbers of observations in the time series
• h: required forecasts
• period: sample rate of the time series, for example “yearly” or “hourly”.
• type: type / industry of the time series, for example “finance” or “demographic”
• x: the very time series

The package states that the starting date is always the start of the year.

Let’s take a closer look at the data in the following section.

all.ns <- unlist(Map(function(l) { c(l$n[[1]][1]) }, M4)) all.ns %>% summary ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 13 49 97 240 234 9919 data.frame(melt(all.ns), x=1:length(all.ns)) %>% ggplot((aes(x=x, y=value))) + geom_violin() + scale_x_discrete() + geom_hline(yintercept = quantile(all.ns)[4], color = "firebrick") We see that most of the time series are of length some where between 50 and 240 data points with an interquartile range of $$IQR = 185$$. Interquartile range simply describes the range of values for which fifty percent of the data are included. The extreme range out outliers spanning from 374.5 to 9919 renders the boxplot quite unusable in this form. But as we are only interested in the number of average data points per time series we have to deal with, summary suffices for now. (A possible fix would be to cap off the data at $$Q_1 + IQR * 1.5$$). (The red line indicates the maximum value where 75% of data are included.) The following counts the number of outliers we have, if we define outliers as being outside of the interquartile range $$IQR$$: qs.all.ns <- quantile(all.ns) iqr.all.ns <- IQR(all.ns) # we could've just subtracted Q1 from Q3, but we're being lazy here lower.bound <- qs.all.ns[2] - (1.5 * iqr.all.ns) upper.bound <- qs.all.ns[4] + (1.5 * iqr.all.ns) unlist(Map(function(x) { return ( x < lower.bound || x > upper.bound) }, all.ns)) %>% sum ## [1] 4897 So we have 4897 data points, or 0.49 % outliers in our data of the million time series. That’s a good rate, at least compared to “data observed in the wild”. Next we’ll investiage if the period and the number of data points show some pattern. ### The Forecasting Horizon & Sample Rate The forecasting horizon tells us if we need to forecast for a year, a month or actually a number of days. all.hs <- sort(unlist(Map(function(l) { l$h[[1]][1] }, M4)))
plyr::count(all.hs)
##    x  freq
## 1  6 23000
## 2  8 24000
## 3 13   359
## 4 14  4227
## 5 18 48000
## 6 48   414

Let’s analyze what horizons occur with what period of the data.

all.periods <- unlist(Map(function(l) { as.factor(l$period)[1] }, M4)) periods <- levels(M4[[1]]$period)
uq.period.per.horizon <- unique(cbind(all.periods, all.hs))
data.frame(
period=unlist(Map(function(x) { return(as.character(periods[x])) }, uq.period.per.horizon[,1])),
horizon=uq.period.per.horizon[,2])
##       period horizon
## 1      Daily       6
## 2     Hourly       6
## 3    Monthly       6
## 4    Monthly       8
## 5    Monthly      13
## 6    Monthly      14
## 7    Monthly      18
## 8  Quarterly      18
## 9     Weekly      18
## 10    Yearly      18
## 11    Yearly      48

This shows that the smaller the horizon, the smaller the sample unit. It also shows that we have to forecast up to 18 years into the future.

How are the sample periods distributed?

all.periods %>% freq.cat
##      x           freq
## [1,] "Daily"     "4227"
## [2,] "Hourly"    "414"
## [3,] "Monthly"   "48000"
## [4,] "Quarterly" "24000"
## [5,] "Weekly"    "359"
## [6,] "Yearly"    "23000"

### What Types of Time Series Are There?

levels(M4[[1]]$type) %>% print ## [1] "Demographic" "Finance" "Industry" "Macro" "Micro" ## [6] "Other" all.types <- unlist(Map(function(l) { as.factor(l$type)[1] }, M4))
all.types %>% freq.cat
##      x             freq
## [1,] "Macro"       "19402"
## [2,] "Micro"       "25121"
## [3,] "Demographic" "8708"
## [4,] "Industry"    "18798"
## [5,] "Finance"     "24534"
## [6,] "Other"       "3437"

For now this information isn’t terribly interesting, but we’ll see if the type of time series somehow asserts a certain gestalt of the data.

## Part II: Inspecting The Data

For now let’s have a look at raw data to get a feeling for what we’re working with.

M4[[30001]]$x %>% ggseasonplot(year.labels=TRUE, year.labels.left=TRUE) This is just a randomly picked, monthly sampled time series to display some things we can do with individual time series. The time series displayed here is a good example of a somewhat yearly increasing time series (note the increasing number of the year on the sides). The only background we have on it though, is the type “Demographic” of the time series. M4[[30001]]$x %>% ggmonthplot

This plot shows a so called seasonal subseries plot which allows one to examine seasonal patterns. The time series shown here has one data point per month. In the plot above, for each month the data points are extracted and then plotted chronologically. So the first January value is plotted, then the second and so on (we have 17 months of data for each month, as we have 17 years of data). This is done for each month. The blue line shows the mean of the month.

M4[[30001]]$x %>% decompose %>% autoplot This shows the components as an additive model. We can inspect if this is a good model or not. White noise should be distributed normally, and all the ACF coefficients should lie within the blue lines. rand <- decompose(M4[[30001]]$x)$random df.rand <- data.frame(melt(rand), x=1:length(rand)) grid.arrange( ggplot(df.rand) + geom_qq(aes(sample=value)), ggplot(df.rand) + geom_density(aes(x=value)), heights = c(0.35, 0.65), nrow=2 ) This plot can be interpreted as the random component of the model being somewhat normally distributed–or at least seems to be a good candidate for a Gaußian mixture model. ggAcf(decompose(M4[[30001]]$x)$random) This plot however shows that there are at least four somewhat significantly correlated lags. We expect that each autocorrelation is close to zero. 95% of the autocorrelations should lie within the blue lines. 0.83% are within these boundaries. Box.test(x = M4[[30001]]$x, type = "Ljung-Box", lag = 24)
##
##  Box-Ljung test
##
## data:  M4[[30001]]\$x
## X-squared = 3504.4, df = 24, p-value < 2.2e-16

Tje Box-Ljung test checks whether the first $$h=24$$ autocorrelations are significantly different from what we expect from white noise. Such a small value of $$p \approx 0$$ suggests does in fact not come from a white noise process–as does the plot.

## Are There Groups of Data?

To group the time series based solely on the data we need to somehow measure the distance between them. Another way is to find feature representations of the time series, such as SAX, PCA or decomposing the time series with fourier analysis.

NB: originally I wanted to cluster the data points with whatever distance measure from the TSDist package was the fastest. So first, I tried to benchmark the distances. I planned on saving the already computed distances into a postgres data base, but never got around doing that. Distances were provided by TSdist, benchmarking by microbenchmark.