Future daily pollution concentrations are simulated in central Sydney, using a periodic STL decomposition of recent time-series data and an ARIMA model for the seasonally-adjusted component.
1. Introduction
This is Part 3 of a series of posts aimed at simulating future daily pollution concentrations in central Sydney. The simulations are based on a seven-year hourly data set from an air quality monitoring site in the city. As far as possible, the simulated time series must have the same properties as the observations, including the following:
- The daily concentration is stable; it does not drift from a constant mean value.
- There is a strong seasonal component, with annual periodicity, which must persist into the future.
- The frequency distribution of concentrations must be preserved, including the seasonal component.
- Concentrations must be non-negative.
- Autocorrelations must be preserved, both daily and yearly.
- The simulation must include random-noise components, following the same behaviour found in the observations.
In Part 1, pollution levels were simulated as block bootstrap replicates. A time series was generated by grouping the data by month and sampling future concentrations only from the relevant month’s data. This produced a stable time series, which retained a seasonal component with high concentrations at the end of the year and lower concentrations mid-year. However, the modelled daily concentrations were independent but the observations appeared to be correlated. The block bootstrap time series satisfied properties 1 to 4, but not 5 and 6.
In Part 2, several time-series models were fitted to the monthly-average concentration, for the purposes of forecasting the expected mid-range concentrations and simulating individual realizations of the time series. This was done to gain insight into the potential pitfalls of simulating the daily time series, but based on a more tractable problem. Ultimately, a seasonal model is sought with a return period of 365 time steps, but this is not computationally feasible; hence the monthly-average time series was examined, with its return period of only 12 time steps. The best fit was found from a stationary seasonal ARIMA model, which retained the autocorrelation behaviour on the monthly and yearly time scales. Forecasts using this model retained the seasonal component, as intended. In the simulations, however, the seasonal component eroded away after a couple of years. Although one month’s concentration depends mathematically on the same month’s concentration in previous years, the noise components eventually dominated. The physical causes of the seasonal pattern – weather and pollution emissions – are not included in the statistical model. The simulated monthly time series satisfied properties 1, 4, 5 and 6, but not 2 and 3.
In this Part, the daily time series is decomposed to preserve the seasonal component, and a non-seasonal ARIMA model fitted to the seasonally-adjusted remainder. Several choices for the decomposition have been examined, including additive and multiplicative options, and different specifications of the seasonal smoothing. These lead to different ARIMA models for the remainder. Future concentrations have been simulated with each ARIMA model, and compared with the original time series.
The analysis is carried out in R, using tidyverse libraries dplyr and ggplot2. It also uses time-series package forecast, developed by Rob Hyndman and George Athanasopoulos, and their online text Forecasting: Principles and Practice. For some of the residual checking, routines from the astsa package, developed by Robert Shumway and David Stoffer have also been used.
2. Daily PM10 Data
Data were downloaded from the New South Wales Government’s Office of Environment & Heritage, Australia, obtaining hourly PM10 concentrations for the years 2011-2017 from an air quality monitoring site in central Sydney. PM10 refers to airborne particles with diameter less than 10 microns. Derivation of daily-average concentrations is outlined in Section 2 of Part 1, and the time series shown in Figure 1 has a clear seasonal pattern.
An initial examination of the time series data shows that the daily-average PM10 is positively skewed, and platykurtic; there are more extreme values, predominantly at higher concentrations, than normally-distributed data. This can be a limitation when applying theoretical models, as they often require the residuals of a fitted model to follow a normal distribution. The process of fitting a model may include a transformation of the data, such as a Box-Cox transformation, to make the data more normally distributed (with a constant variance). An Augmented Dickey-Fuller test was carried out on the daily PM10, which showed the data are stationary. Hence the Box-Cox transformation of the PM10 data has not been investigated in depth here. However, the analysis has also been carried out on the natural logarithm of the PM10 concentration. The logarithm of the PM10 is also stationary, its skewness is nearly zero and kurtosis is close to 3, which are the values of the normal distribution. Figure 2 shows normal quantile-quantile plots for the daily PM10 and its logarithm. Normally-distributed data should lie along the depicted straight lines. Skewness in the PM10 is evident with extremes of high concentration; the distribution of the logarithm of PM10 is nearly normal.
Note that whilst the data as shown in Figure 2 may be informative, the data will be seasonally-adjusted and the remainder modelled. It is more relevant to consider the stationarity, skewness, kurtosis and distribution of the model residuals. This is done in the following sections.
3. Seasonally-Adjusted Daily PM10
Seasonal Decomposition of Time Series by Loess (STL) was applied to the time series to produce four cases, as follows:
- Decomposition of daily PM10 assuming a periodic seasonal component (in stl, s.window = “periodic”).
- Decomposition of daily PM10 with s.window = 7, giving an aperiodic seasonal component.
- Decomposition of the natural logarithm of daily PM10, with s.window = “periodic”. This is effectively a multiplicative decomposition of the daily PM10.
- Decomposition of the natural logarithm of daily PM10, with s.window = 7.
For illustration, the multiplicative decomposition, with periodic seasonal component is shown in the Figure 3 (the other cases are visually similar). The logarithm transformation has been inverted, thus the trend, seasonal and remainder components are multiplied together to give the original time series. It is convenient to consider the trend component as having concentration units (microgrammes of PM10 per cubic metre of air (μg/m3)), with values around 16 μg/m3, and the other components being non-dimensional multiplicative factors, with values around 1.0. With s.window set at 7, the trend is a little smoother, and more of the total variation shows in the other components.
Some statistical measures of the remainder components are shown in Table 1. In all four cases the mean is zero. Similar to the full data series, the remainders of the daily PM10 (case 1 and 2) are positively skewed and platykurtic, whereas the remainders of the logarithm of daily PM10 cases (henceforth denoted “log-PM10“) are much less skewed, and have the same kurtosis values as a normal distribution. Indeed, the whole range of log-PM10 remainder fits the normal distribution well for each case. The results change little when switching between s.window options.
Table 1: Statistical moments of the remainder component of the four STL cases. The means are all zero. Parameters are non-dimensional, except where PM10 concentration units are shown.Case: Description | Std. deviation | Skewness | Kurtosis |
---|---|---|---|
1: PM10 : s.window = “periodic” | 5.85 (μg/m3) | 0.90 | 5.46 |
2: PM10 : s.window = 7 | 5.44 (μg/m3) | 0.79 | 5.40 |
3: Log-PM10 : s.window = “periodic” | 0.33 | -0.06 | 2.95 |
4: Log-PM10 : s.window = 7 | 0.30 | -0.03 | 2.95 |
4. ARIMA Modelling
The four remainder time series were divided into training and testing data sets, with the training data set being the first five years, and the testing data set the final two years. Non-seasonal ARIMA models were fitted to these time series. No Box-Cox transformations were applied (option lambda = NULL), and stationarity was not imposed (although the models turn out to be stationary). Options stepwise = FALSE and approximation = FALSE were selected.
ARIMA(2, 0, 2) models with zero mean were found in cases 1 and 2, with similar model coefficients in each case.
ARIMA(2, 0, 0) models with zero mean was found in cases 3 and 4, with similar model coefficients in each case.
Based on the calculated information criteria (such as AIC, AICc, and BIC), the models with option s.window = 7 are slightly better predictors of future PM10 than those using the periodic option.
The training data residuals were examined using the checkresiduals routine in the forecast package, and the sarima routine in the astsa package. The residuals are defined as the difference between observations and fitted values from a one-step forecast, based on observations up to the previous step. The statistical moments of the model residuals are shown in Table 2. Cases 1 and 2 are positively skewed and platykurtic; cases 3 and 4 have negative skewness (but lower magnitude), and kurtosis closer to a normal distribution.
Statistical moments of the model residuals, and results of the Ljung-Box test of residual correlations up to 365 lags.Case | Skewness | Kurtosis | Ljung-Box p-value (365 lags) |
---|---|---|---|
1 | 0.66 | 7.2 | 0.011 * |
2 | 0.56 | 7.35 | ~ 0 *** |
3 | -0.31 | 3.9 | 0.12 |
4 | -0.25 | 3.8 | ~ 0 *** |
Figure 4 shows normal quantile-quantile plots of the model residuals for each case. The positively-skewed cases 1 and 2 have extreme outliers at the high end of the distribution, and some at the low end. Cases 3 and 4 have good fits to the normal distribution at the high end of the distribution, but some extreme values at the lower end.
The closeness of the residuals to a normal distribution is an important factor in the time series simulation, as the model assumes that the random component is normally distributed. If the residuals were positively skewed, the simulation may under-predict higher concentrations of PM10, and conversely if the skewness were negative.
The autocorrelation of the residuals is shown in Figure 5 for each case. There is generally a low autocorrelation at most lags, indicating the residuals are independent of each other. However, each case has a significant negative correlation at a lag of 365 days, worse for cases 2 and 4. Not all of the periodicity in the original time series has been captured in the seasonal component the STL decomposition, and some is left in the remainder. Consequently, the Ljung-Box test on the residuals shows they are significantly correlated when evaluated up to a lag of 365, in all except case 3 (see Table 2 above). (The sarima routine shows Ljung-Box test results at lower lag values; none of the models have correlated residuals at lags up to 20 days).
As the remainders are non-seasonal, their forecasts settle quickly to a constant value of zero, with constant prediction intervals, shown in Table 3. The intervals are slightly narrower for the non-periodic seasonal option (cases 2 and 4, compared to cases 1 and 3).
Table 3: Forecast prediction intervals for ARIMA models. The point-forecast value is zero. Cases 1 and 2 have concentration units; cases 3 and 4 are non-dimensional.Case: Model | 80 % prediction interval | 95 % prediction interval |
---|---|---|
1: ARIMA(2, 0, 2) | (-7.39, 7.39) (μg/m3) | (-11.31, 11.31) (μg/m3) |
2: ARIMA(2, 0, 2) | (-6.82, 6.82) (μg/m3) | (-10.43, 10.43) (μg/m3) |
3: ARIMA(2, 0, 0) | (-0.412, 0.412) | (-0.630, 0.630) |
4: ARIMA(2, 0, 0) | (-0.376, 0.376) | (-0.575, 0.575) |
5. Simulations of Daily PM10
The choice of optimal model depends on its performance in terms of training data residual behaviour, accuracy with respect to test data, and information criteria. All indicators point to the multiplicative models (cases 3 and 4, based on a decomposition of log-PM10) being of higher quality than the additive ones (cases 1 and 2). Accuracy and information content favour case 4 over case 3, but residual checks favour case 3. Case 3 is also a more intuitive choice of long-term model, having a truly periodic, rather than near-periodic, seasonal component. Therefore simulations shown here are based on case 3, and case 1 is shown for comparison.
Case 1 is an ARIMA(2, 0, 2) model (equivalently ARMA(2, 2)). The relationship between the daily PM10 concentration and random-noise components is expressed as follows:
(1 – φ1,1B – φ1,2B2)X1,t = (1 + θ1,1B + θ1,2B2)ε1,t
where
- Xj,t is the seasonally-adjusted daily PM10 on day t.
- B is the backshift operator: BkXj,t = Xj,t-k.
- ε represents random noise, with a value εj,t on day t for case j, where ε1,t ~ N(0, 24).
- φj,k are the auto-regressive coefficients of the ARIMA model.
- θj,k are the moving-average coefficients of the ARIMA model.
- k is the lag number; j is the case number.
Substituting the numerical values of the coefficients gives the following relationship:
(1 + 0.50B – 0.36B2)X1,t = (1 + 1.06B + 0.16B2)ε1,t
The seasonal component is a periodic series of concentrations, here denoted Sj,t, where Sj,t‑365 = Sj,t. The long-term component is assumed constant, equal to the arithmetic mean of the STL trend, which is 17.3 μg/m3. Hence the total concentration at time t is denoted Cj,t, where
C1,t = 17.3 + S1,t + X1,t
Figure 6 shows time series of the PM10 observations from the period 2011 to 2017, and a ten-year simulation for 2016 to 2025 (inclusive) based on observations from 2011 to 2015, for case 1. The model matches the distribution of the observations for most of the PM10 range, but misses the high peaks. It also overstates the lower extremes of concentration, giving some negative values.
Case 3 is an ARIMA(2, 0, 0) model (equivalently AR(2)), as follows:
(1 – φ3,1B – φ3,2B2)X3,t = ε3,t
Numerically,
(1 – 0.56B + 0.05B2)X3,t = ε3,t
X3,t is now the remainder after STL decomposition of log-PM10. There are no moving-average components in this model, and ε3,t ~ N(0, 0.073). In this case, the model uses the geometric mean of the STL trend, which is 16.1 μg/m3, and the total concentration is the product of the trend, seasonal and remainder components, as follows:
C3,t = 16.1 exp(S3,t) exp(X3,t)
This is a simpler model than that of case 1 – fewer parameters are needed to describe its behaviour.
Figure 7 shows the time-series simulation of the case-3 model. This reproduces the higher observed values more closely, and gives no negative concentrations. The cumulative density function for the model and observations have been calculated, and the two distributions match each other well.
6. Discussion
The model simulation based on case 3 gives a good representation of the observations. It can be extended for an arbitrary period of time, to provide baseline information required to examine cumulative effects of potential new sources of pollution. The simulation extracts a multiplicative seasonal component from the observed time series, repeats it annually, and fits an auto-regressive model to the remainder. The model for the total PM10 concentration satisfies the properties listed in the Introduction.
The remainder, or seasonally-adjusted, component of the time series, exhibits a moderate autocorrelation from one day to the next. As the ambient PM10 concentration arises due to discharges from local sources and subsequent dispersion by the wind, the autocorrelation must be due to similarities in these factors from day to day. This is not surprising, as traffic patterns are largely the same each day, and the progression of weather systems through the city can be slow enough so that wind patterns are similar on consecutive days.
The main difference between the model for daily PM10 presented in this Part and the model for monthly PM10 presented in Part 2 is that here, the seasonal component is imposed on the time series for all time. The model for monthly PM10 included a seasonal ARIMA component, but the initially strong seasonal pattern was eroded by other components. As pointed out previously, the physical drivers of the seasonal pattern – annual cycles in emissions and seasonal changes in the weather – were not included in that model. Neither are they included in the model for daily PM10, but the seasonal pattern is incorporated in such a way that it persists into the future.
Sources of PM10 are reasonably well known and air-dispersion models are routinely used to simulate them. Dispersion models are deterministic numerical models, using known emissions of pollutants, and information on the wind and turbulence in the atmosphere. They do not generally include statistical models for the observed air pollutants, and may not account for the natural constituents of the air, such as PM10 from wind-blown dust or sea-spray. These are not easily modelled as localized sources.
In theory, the deterministic pollution dispersion model could be combined with a time-series model such as ARIMA, using dynamic regression. This technique allows correlated error terms, so it can perform a regression of the observed PM10 concentration on the deterministic modelled PM10 and fit an ARIMA model to the errors. This does not explain the errors in the dispersion model, but may give guidance on the magnitude and behaviour of PM10 components that are not included in the dispersion model.
Dynamic regression techniques will be applied to pollution-dispersion model results in a future post.