1. Introduction
In Part 1 of this series, recent daily concentrations of airborne particulate pollution from a site in central Sydney were used to simulate future pollution levels using block bootstrap replicates. A seasonal pattern is clearly evident in the time series of PM10 (particulate matter with diameter less than 10 microns), with higher concentrations at the ends of the years than in the middle. A time series of concentrations was generated by grouping the data by month and, for each future date, sampling only from the data for that month. This produces a stable time series of PM10, which follows the same distribution of concentrations as the observations, month by month. However, within each month the sampled concentrations are independent, and the simulation is essentially white noise. The observations appear to be autocorrelated, as the true day-to-day variability is less than that simulated.
In this Part, several time-series models for the monthly-average PM10 are examined, which incorporate autocorrelation and seasonality. Although the daily-average PM10 is important from a regulatory point of view, the monthly-average is computationally more amenable to analysis that attempts to model the seasonality. An examination of monthly data provides useful guidance on how to simulate daily averages.
The analysis is carried out in R, using the tidyverse libraries dplyr and ggplot2. It also uses the time-series package forecast, developed by Rob Hyndman and colleagues, together with the online text Forecasting: Principles and Practice by Hyndman and George Athanasopoulos. The aim of this work is to produce a realistic long-term simulation of future PM10 concentrations; a forecast of the expected mid-range, including prediction intervals, is a milestone on the way towards this.
2. Monthly PM10 – Seasonality and Autocorrelation
Data were downloaded from the New South Wales Government’s Office of Environment & Heritage, Australia, obtaining hourly PM10 concentrations for the years 2011-2017 from an air quality monitoring site in central Sydney. Derivation of daily-average concentrations is outlined in Section 2 of Part 1.
The daily-average concentrations show a clear seasonal pattern. Whilst there are a number of time‑series model types which will account for seasonal behaviour, the computational resources required to carry out the model fitting make them prohibitive to work with if the seasonal period is too long. This is certainly the case for daily data (with a return period of 365 or 366 time steps), but not monthly (12 time steps), or quarterly (4 time steps per year). Consequently this part of the study focuses on monthly-averaged air quality data. Imputation of daily concentrations in the original data set from central Sydney produced a complete daily time series of PM10. The monthly-average PM10 was then calculated from the completed time series.
The resulting monthly-average PM10 time series is shown in Figure 1, along with a decomposition into trend, seasonal and remainder components, produced using the stl function in R. A decomposition with a periodic seasonal component was specified, leaving the rest of the variability in the other components. The monthly-average PM10 is composed of a slowly-changing trend averaging around 17 μg/m3, a seasonal oscillation of amplitude around 4 μg/m3, and a remainder that is often as large as the seasonal component.
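The decomposition step can be sketched in base R as follows. The observed central-Sydney series is not reproduced here, so a synthetic monthly series with a similar mean, seasonal amplitude and noise level stands in for it; this stand-in is an assumption for illustration only.

```r
# Synthetic stand-in for the monthly-average PM10 series (mean ~17 ug/m3,
# seasonal amplitude ~4 ug/m3, noisy remainder) -- the observed data are
# not reproduced here.
set.seed(1)
m    <- seq_len(84)                                     # 2011-2017, monthly
pm10 <- ts(17 + 4 * cos(2 * pi * m / 12) + rnorm(84, sd = 2),
           frequency = 12, start = c(2011, 1))

# Decomposition with a periodic seasonal component, as in Figure 1
fit <- stl(pm10, s.window = "periodic")
colnames(fit$time.series)                               # seasonal, trend, remainder
```

Specifying `s.window = "periodic"` forces an identical seasonal shape every year, pushing all other variability into the trend and remainder components.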
The autocorrelation functions for the monthly PM10 are shown in Figure 2. The upper panel (acf) contains significant non-seasonal correlation peaks at one- and two-month lags, and seasonal peaks at twelve‑month intervals. These tailing-off sets of correlations are a signature of autoregressive behaviour. Although there are many peaks, the underlying relationship between months may not be complex, as a correlation between neighbouring months will give rise to acf peaks at all lags (even if there is no explicit correlation between data points two months apart). The partial autocorrelation function pacf (lower panel of Figure 2) accounts for this, showing significant peaks at a lag of one month (not two), an anti-correlation at around six months, and no significant peaks thereafter. Thus Figure 2 indicates that a simple autoregressive model may fit the data, where the PM10 in each month is explicitly related to PM10 from the previous month, and from the previous year.
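The correlation structure in Figure 2 can be reproduced with base R's acf and pacf functions; the synthetic stand-in series is again used here in place of the observations.

```r
# Synthetic stand-in for the monthly-average PM10 series
set.seed(1)
m    <- seq_len(84)
pm10 <- ts(17 + 4 * cos(2 * pi * m / 12) + rnorm(84, sd = 2),
           frequency = 12, start = c(2011, 1))

# Autocorrelation and partial autocorrelation, out to three seasonal cycles
a <- acf(pm10,  lag.max = 36, plot = FALSE)
p <- pacf(pm10, lag.max = 36, plot = FALSE)
a$acf[2]    # correlation at a one-month lag
p$acf[1]    # partial correlation at a one-month lag
```

Calling these with `plot = TRUE` (the default) draws the correlograms with the significance bands shown in Figure 2.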
3. Time-Series Forecast Models for Monthly PM10
This section examines several time series models, assesses their fit to the PM10 time series and uses the model to make forecasts several years into the future. The models are used to forecast expected mid-range conditions, bounded by appropriate prediction intervals.
The seven-year PM10 time series was divided into training and test data sets; the model was trained on the first five years of data (2011-2015 inclusive), and the remaining years (2016-2017) used to test the model. The model was then used to produce a five-year forecast (nominally 2016-2020). Since the time series is expressed as monthly averages, the training data comprise only 60 data points and the testing data 24. This may affect the robustness of the results, as the seasonal pattern is not strongly established in 2011 nor in half of 2012, and the full data set comprises only seven complete cycles.
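The split can be done with base R's window function on the monthly series (the synthetic stand-in series is used here):

```r
# Synthetic stand-in for the monthly-average PM10 series
set.seed(1)
m    <- seq_len(84)
pm10 <- ts(17 + 4 * cos(2 * pi * m / 12) + rnorm(84, sd = 2),
           frequency = 12, start = c(2011, 1))

train <- window(pm10, end   = c(2015, 12))   # 2011-2015: 60 points
test  <- window(pm10, start = c(2016, 1))    # 2016-2017: 24 points
c(length(train), length(test))
```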
The simplest model is the seasonal ‘naive’ model, in which each month’s PM10 concentration is simply the concentration of the same month in the previous year. Hence the forecast for years 2016 to 2020 comprises repeats of the 2015 time series. This forecast is shown in the upper panel of Figure 3, including the 80 % and 95 % prediction intervals, which increase in magnitude with time.
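The forecast package provides this model directly as snaive(); the point forecast (without prediction intervals) is simple enough to write by hand, as below. The stand-in series is synthetic.

```r
# Synthetic stand-in for the monthly-average PM10 series
set.seed(1)
m    <- seq_len(84)
pm10 <- ts(17 + 4 * cos(2 * pi * m / 12) + rnorm(84, sd = 2),
           frequency = 12, start = c(2011, 1))
train <- window(pm10, end = c(2015, 12))

# Seasonal naive point forecast: repeat the last observed year (2015)
# five times, covering 2016-2020
fc_snaive <- ts(rep(window(train, start = c(2015, 1)), 5),
                frequency = 12, start = c(2016, 1))
length(fc_snaive)   # 60 months
```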
The Holt-Winters method is a type of exponential smoothing, in which predicted values are a weighted average of the most recent data (with weights exponentially decreasing, back in time). The method is able to model typical values, trends over time, and cyclical patterns. In this case, the model does not find a trend over time, but does produce a clearly repeating pattern. The middle panel of Figure 3 shows results using the Holt-Winters method (assuming additive seasonality). The forecast for 2016‑2020 is weighted more in favour of 2015 than previous years, but is not an exact copy like the seasonal naive forecast. The Holt-Winters prediction interval grows at a slower rate than the seasonal naive interval.
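Base R's HoltWinters function fits this model; a sketch on the synthetic stand-in series, producing the two-year forecast with prediction intervals, would be:

```r
# Synthetic stand-in for the monthly-average PM10 series
set.seed(1)
m    <- seq_len(84)
pm10 <- ts(17 + 4 * cos(2 * pi * m / 12) + rnorm(84, sd = 2),
           frequency = 12, start = c(2011, 1))
train <- window(pm10, end = c(2015, 12))

# Additive-seasonal Holt-Winters fit, then a 24-month forecast
hw_fit <- HoltWinters(train, seasonal = "additive")
hw_fc  <- predict(hw_fit, n.ahead = 24,
                  prediction.interval = TRUE, level = 0.95)
colnames(hw_fc)   # fit, upr, lwr
```

The forecast library's hw() routine is an alternative that returns both the 80 % and 95 % intervals plotted in Figure 3.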
State-space models for exponential smoothing (ETS models) permit additive and damped-additive trends over time, and additive or multiplicative seasonal components and errors. The ets routine in the forecast library determines the best model type given the input data, and calculates the relevant parameters. For the PM10 training data, the ets routine determines the model to be ETS(M,N,A), meaning errors are multiplicative, there is no trend, and the seasonal component is additive. The forecast is visually similar to the Holt-Winters forecast, and not displayed here.
Autoregressive integrated moving average (ARIMA) models also forecast a variable of interest using previous values of itself, plus random-noise components. ARIMA models are more general than linear ETS models, as the weights of the previous values do not have to decrease exponentially in ARIMA. However, the ETS framework permits multiplicative components that do not have ARIMA counterparts. The auto.arima routine in the forecast library has been used to fit an ARIMA model to the PM10 training data. The routine determines that the monthly PM10 data behave as an ARIMA(1,0,0)(0,1,1)12 model: the non-seasonal component is ARIMA(1,0,0) and the seasonal component is ARIMA(0,1,1). The subscript 12 denotes a twelve-interval period; this parameter is supplied to auto.arima by the user, not calculated by the routine. The forecast PM10 using the ARIMA model is shown in the lower panel of Figure 3. The forecast mid-range is similar to the Holt-Winters and ETS(M,N,A) forecasts, but the ARIMA prediction interval is narrower and grows much more slowly.
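The model selection is done by auto.arima; once the order is known, the equivalent fit can be sketched with base R's arima routine (on the synthetic stand-in series):

```r
# Synthetic stand-in for the monthly-average PM10 series
set.seed(1)
m    <- seq_len(84)
pm10 <- ts(17 + 4 * cos(2 * pi * m / 12) + rnorm(84, sd = 2),
           frequency = 12, start = c(2011, 1))
train <- window(pm10, end = c(2015, 12))

# Fit the selected ARIMA(1,0,0)(0,1,1)[12] model with base R's arima
fit_arima <- arima(train, order = c(1, 0, 0),
                   seasonal = list(order = c(0, 1, 1), period = 12))
fc <- predict(fit_arima, n.ahead = 24)   # point forecast and standard errors
length(fc$pred)
```

Because the seasonal component is differenced (D = 1), the fitted model has no mean term; the only estimated coefficients are ar1 and sma1.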
4. Forecast Accuracy
Accuracy statistics for a number of models are shown in Table 1, including some model variants not discussed above. The first three are non-seasonal, and thus have higher RMSE and MAE. The Holt‑Winters, ETS and ARIMA results are similar; they all match the two years of testing data equally well. In two cases (rows 8 and 10), the models included a Box-Cox transformation; both selected the parameter λ = -1. The final ARIMA model (row 11) was produced by specifying that a stationary model was required. The seasonal component of the original model, ARIMA(0,1,1), indicates that the data set needs to be differenced (with lag 12), implying the original data set may not be stationary. To simulate a long time series based on an ARIMA formulation, a stationary model should be specified.
Table 1: Model accuracy statistics. Root-mean-square error (RMSE) and mean absolute error (MAE) of forecast of monthly testing data. Values are in PM10 concentration units of μg/m3.
Model | RMSE | MAE |
---|---|---|
Naive | 4.518 | 3.676 |
Simple exponential smoothing (SES) | 4.377 | 3.612 |
SES with Holt’s trend method | 4.036 | 3.447 |
Seasonal naive | 3.466 | 2.884 |
Holt-Winters (additive seasonality) | 3.190 | 2.575 |
Holt-Winters (multiplicative seasonality) | 2.960 | 2.310 |
ETS(M,N,A) | 2.951 | 2.269 |
ETS(A,N,A) λ = -1 | 3.035 | 2.400 |
ARIMA(1,0,0) (0,1,1)12 | 2.976 | 2.246 |
ARIMA(0,0,0) (0,1,1)12 λ = -1 | 3.043 | 2.282 |
ARIMA(1,0,0) (1,0,1)12 (non-zero mean) | 2.961 | 2.366 |
The above measures of accuracy compare the forecast for 2016 and 2017 with the PM10 test data for that period. The forecast package also checks the residuals left when the model is fitted to the training data. These should be uncorrelated, and ideally should be normally distributed. Formally, a Ljung-Box test can be carried out on the residuals to determine whether they behave like white noise. The function checkresiduals in forecast does this; if the results are not significant (returning a large p-value), then the residuals are behaving like white noise. If the results are significant (low p-value), then the residuals do not behave like white noise, and further predictive information could be extracted from the data set. Table 2 shows results from the Ljung‑Box test for each model, including the test statistic Q* whose significance is shown by the p-value, and the number of degrees of freedom used in the test. The number of degrees of freedom is the number of lags considered in the test, minus the number of model parameters that have been fitted; the more complex models have a lower number of degrees of freedom.
Table 2: Results from the Ljung-Box test on residuals from fitting the models to the training data.
Model | Q* | df | p-value |
---|---|---|---|
Naive | 18.515 | 12 | 0.10 ns |
Simple exponential smoothing (SES) | 27.547 | 10 | 0.0021 ** |
SES with Holt’s trend method | 27.81 | 8 | 0.00051 *** |
Seasonal naive | 17.433 | 12 | 0.13 ns |
Holt-Winters (additive seasonality) | 15.211 | 3 | 0.0016 ** |
Holt-Winters (multiplicative seasonality) | 18.931 | 3 | 0.00028 *** |
ETS(M,N,A) | 15.343 | 3 | 0.0015 ** |
ETS(A,N,A) λ = -1 | 12.556 | 3 | 0.0057 ** |
ARIMA(1,0,0) (0,1,1)12 | 9.307 | 10 | 0.50 ns |
ARIMA(0,0,0) (0,1,1)12 λ = -1 | 10.585 | 11 | 0.48 ns |
ARIMA(1,0,0) (1,0,1)12 (non-zero mean) | 12.003 | 8 | 0.15 ns |
Table 2 shows that residuals from most of the exponential smoothing models with seasonality (rows 5-8) do not pass the Ljung-Box test. These models are relatively complex. The ARIMA models (rows 9‑11) are simpler, and they pass the Ljung-Box test. The naive models (rows 1 and 4) are the simplest models, which also pass the Ljung-Box test.
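The checkresiduals function in forecast wraps base R's Box.test; the underlying test can be sketched as follows on the synthetic stand-in series. The lag of 24 here is an illustrative choice, not necessarily the value used for Table 2.

```r
# Synthetic stand-in for the monthly-average PM10 series
set.seed(1)
m    <- seq_len(84)
pm10 <- ts(17 + 4 * cos(2 * pi * m / 12) + rnorm(84, sd = 2),
           frequency = 12, start = c(2011, 1))
train <- window(pm10, end = c(2015, 12))

fit_arima <- arima(train, order = c(1, 0, 0),
                   seasonal = list(order = c(0, 1, 1), period = 12))
res <- residuals(fit_arima)

# Ljung-Box: a large p-value means the residuals are consistent with
# white noise. fitdf = 2 fitted parameters (ar1, sma1), so the test
# uses lag - fitdf = 22 degrees of freedom.
lb <- Box.test(res, lag = 24, fitdf = 2, type = "Ljung-Box")
lb$p.value
```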
From here on, the analysis will focus on ARIMA models to describe monthly-average PM10 concentrations at the air quality monitoring site in central Sydney. They pass the Ljung-Box test for model residuals, and produce more accurate forecasts in comparison with the test data.
5. Simulation of a Monthly Time Series
As stated already, the aim of this work is to produce simulations of daily PM10. Here, simulations of monthly PM10 have been created, based on the parameters for the stationary ARIMA model. The stationary model is designated ARIMA(1,0,0)(1,0,1)12 with non-zero mean (row 11 of Table 2). It can be expressed in a compact mathematical form as follows:
(1 – Φ1B12)(1 – φ1B)(Xt – μ) = (1 + Θ1B12)εt
where
Xt is the monthly-averaged PM10 concentration at month t .
B is the backshift operator (for example, BXt = Xt-1 and B12Xt = Xt-12).
μ is the long-term mean of X; the calculated value here is 17.2 μg/m3.
ε represents random noise, with a value εt at month t; it is assumed normal, with mean zero and constant standard deviation calculated as 2.35 μg/m3.
The left-hand side of the equation is the auto-regressive component of the model, with coefficients Φ1 = 0.87 (for seasonal correlations) and φ1 = 0.41 (month by month). The PM10 concentration shows a strong positive correlation at the twelve-month lag time, and a moderate positive correlation at the one-month lag time. The moving-average component is on the right-hand side, with coefficient Θ1 = -0.57, so the right-hand side evaluates to (εt – 0.57εt-12). In other words, the model’s ARIMA behaviour includes a relationship between the current concentration and a white-noise component from the current and previous years.

A simulation of the ARIMA(1,0,0)(1,0,1)12 process for monthly PM10 produces the time series shown in Figure 4. The figure shows training data (blue, 2011-2015), test data (red, 2016-2017), and a ten-year ARIMA simulation (black curve, 2016-2025), along with the shaded 80 % and 95 % prediction intervals for the seasonal forecast. The simulation falls within the prediction intervals for the expected proportions of the time. However, the regular annual cycle that is evident in the observations, and present in the forecast, is not strongly evident in the simulation. As shown in Part 1, the monthly-average PM10 is higher at the beginning and end of the year, lower in the middle, with a localized peak in May. In the simulation, the periodic nature of the time series gradually erodes over the years, due to noise and smoothing associated with a less-than-perfect correlation (Φ1 < 1).
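A simulation of this kind can be sketched with base R's arima.sim, using the coefficients quoted above. arima.sim has no seasonal argument, so the seasonal and non-seasonal AR polynomials are expanded into a single polynomial by hand; the seed and simulation length below are arbitrary choices for illustration.

```r
# Coefficients of the stationary ARIMA(1,0,0)(1,0,1)[12] model (Section 5)
phi   <-  0.41   # non-seasonal AR(1)
Phi   <-  0.87   # seasonal AR(1), lag 12
Theta <- -0.57   # seasonal MA(1), lag 12
mu    <- 17.2    # long-term mean, ug/m3
sigma <- 2.35    # innovation standard deviation, ug/m3

# Expand (1 - phi*B)(1 - Phi*B^12) = 1 - phi*B - Phi*B^12 + phi*Phi*B^13
ar_coefs     <- numeric(13)
ar_coefs[1]  <- phi
ar_coefs[12] <- Phi
ar_coefs[13] <- -phi * Phi
ma_coefs     <- c(numeric(11), Theta)   # (1 + Theta*B^12)

set.seed(42)
sim <- mu + arima.sim(model = list(ar = ar_coefs, ma = ma_coefs),
                      n = 120, sd = sigma)   # ten years of monthly values
```

arima.sim checks that the supplied AR polynomial is stationary before simulating, which is why a stationary model specification was required in Section 4.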
6. Simulation versus Forecast
In the previous section, a time series of monthly PM10 concentration was simulated as an ARIMA process, whose properties were derived from a training data set containing five years of monthly observations. The simulation does not show the strong seasonal periodicity that is evident in the forecast. The reasons for this are (presumably) as follows:
- The forecast is of the expected mid-range conditions, and includes prediction intervals. It is the representation of an ensemble, not an individual realisation of the model.
- The forecast contains a cyclic seasonal pattern. Whilst the simulation allows for seasonal correlations, they are not imposed, and other processes tend to erode the seasonal patterns. The processes which drive the seasonal contrasts – emissions of pollutants and weather conditions – are not accounted for in any of the time-series models.
- The requirement for a stationary process means that the simulation does not use the optimal ARIMA coefficients.
- The training data includes only five seasons, so models generated using those may not be robust. Training the model on other subsets of the seven-year data set may produce different model coefficients.
Further analysis of the monthly-average data has been done, but is not presented here, as the aim is to produce a long-term simulation of the daily PM10 concentration. However, the examination of the monthly data set has provided useful guidance on how to handle the daily data set, as follows:
- The modelling should be done on the remainder, after decomposition of the time series. This will preserve the ARIMA behaviour of the remainder, whilst retaining the seasonal pattern and any long-term trend. This is accepted practice, as calculating ARIMA coefficients at lags of 365 days is computationally prohibitive, and as shown above, the seasonal pattern is likely to erode away in time.
- The model may be improved by transforming the data to reduce skewness and render the residuals more normally distributed. A Box-Cox transformation may do this, noting that half of the remainder data are negative values.
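The Box-Cox transformation mentioned in the second point, with the λ = -1 selected in Table 1, can be written out directly. The offset applied to the negative remainder values below is an illustrative choice, not the method used in the article.

```r
# Box-Cox transform: (x^lambda - 1)/lambda for lambda != 0, log(x) otherwise
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

lambda <- -1                       # value selected in Table 1, rows 8 and 10
box_cox(17, lambda)                # = 1 - 1/17

# The remainder component contains negative values, so a positive shift
# would be needed before transforming, e.g. an offset clearing the minimum
remainder <- c(-3.2, 0.5, 2.8)     # illustrative values only
offset    <- 1 - min(remainder)
box_cox(remainder + offset, lambda)
```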
In Part 3, the daily PM10 will be simulated, based on these guidelines.
Note that no causation has been implied in this work, nor physical mechanisms accounted for. This month’s concentration is not caused by last month’s or last year’s, but happens to exhibit ARIMA behaviour. The ambient PM10 concentration depends on discharges of pollution from various sources, which are then dispersed by the wind. Patterns and persistent features in the time series of PM10 – on monthly or yearly time scales – are due to the regional weather and climate. On shorter time scales – daily or hourly – these features of the PM10 time series are due to repeating hourly emissions patterns each day of the week, and persistence of local weather conditions from one day to the next.