This summary outlines a model of air-pollution concentration in central Sydney. Technical details of the model development are contained in previous posts on this subject.
1. Introduction
Health impacts of air pollution are well-documented; local and national governments have regulations in place on permissible ground-level concentrations (glcs) for a number of pollutants. Applications for consent to operate industrial facilities or build new roads, for instance, need to account for the cumulative effects of air discharges from these sources into an urban atmosphere that may already breach the regulatory limits. Potential glcs of air pollution from a new source need to be considered in combination with levels of pollution already present – known as the baseline.
The baseline data set studied here is a time series of glcs from an urban monitoring site, at which the regulatory glc limit may be exceeded several times per year (and the number of exceedences varies between years). A question which may need to be addressed in a new industry’s application is: how often will the cumulative glcs – which include a contribution from the industry – produce a higher number of exceedences than are currently observed? The answer may turn out to be ‘less frequently than the length of the currently available data sets’. In that case, a longer-term time series of baseline glcs would need to be generated, and the answer would be provided probabilistically.
Parts 1, 2, and 3 of the series entitled Simulations of Future Air Quality focus on daily glcs of PM10 over recent years at an air quality monitoring site in central Sydney. PM10 refers to airborne particles with diameter less than 10 microns. The aim of the work is to produce a simulation of baseline air quality over a potentially indefinite period, which incorporates as much of the observed behaviour of the site’s glcs as possible.
2. Model Simulation of Daily PM10 in Central Sydney
The time-series simulation of PM10 for a ten-year period is shown in Figure 1. In principle, this can be extended indefinitely. Although the modelled glcs have dates associated with them, the time series may be thought of as a set of possible realisations of the daily PM10 that may occur during any year.
In Figure 1, the observations run from 2011 to 2017 inclusive. Data from 2011 to 2015 were used to train the model, holding back data from 2016 and 2017 for testing. The simulated time series runs from 2016, with the first two years evaluated with respect to the testing data set. The behaviour of the simulated time series matches the behaviour of the observations in the following important respects:
- The daily concentration is stable; it does not drift over multi-year periods.
- Concentrations are greater than zero.
- There is a persistent, seasonal component, with a period of 365 days.
- The frequency distribution of concentrations has been preserved. In particular, the positive skewness and occasional elevated peaks of concentration are captured by the model.
- The seasonally-adjusted concentration behaves as an autoregressive process.
- The irregular, random-noise components behave the same way in the simulation as observed – they follow a log-normal distribution.
The simulated time series was generated by first performing a decomposition of the data into trend, seasonal and remainder components, shown in Figure 2. The key element was that the decomposition was performed on the logarithm of the daily PM10, so the daily PM10 itself is the product of (the antilogarithms of) the components. The components are shown in the lower panels of Figure 2, and must be multiplied together to produce the time series in the top panel.
The multiplicative decomposition, carried out on the logarithm of the data, renders the skewed data more nearly normally distributed and more amenable to modelling. The model residuals are likewise closer to normal, so the simulation can be based on a standard Gaussian white-noise generator. Also, the simulated total concentration, as a product of exponential terms, must be positive.
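The idea of decomposing in log space and recombining by multiplication can be illustrated with a minimal sketch. This is not the decomposition used in the original analysis (which was carried out in R); it assumes a constant trend and a simple day-of-cycle average for the seasonal component, and it runs on synthetic data. The function name `decompose_log` and all numerical values are hypothetical.

```python
import math
import random

def decompose_log(pm10, period=365):
    """Split log(pm10) into a constant level, a seasonal cycle and a remainder.

    Simplifying assumption: the trend is taken as the overall mean of the
    log values, rather than a smoothed, time-varying curve.
    """
    logs = [math.log(x) for x in pm10]
    trend = sum(logs) / len(logs)
    # Seasonal term: average detrended log value at each position in the cycle.
    sums = [0.0] * period
    counts = [0] * period
    for i, v in enumerate(logs):
        sums[i % period] += v - trend
        counts[i % period] += 1
    seasonal = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    remainder = [v - trend - seasonal[i % period] for i, v in enumerate(logs)]
    return trend, seasonal, remainder

# Synthetic, positively skewed daily series (illustrative values only).
rng = random.Random(1)
pm10 = [
    math.exp(2.9 + 0.2 * math.sin(2 * math.pi * d / 365) + rng.gauss(0, 0.3))
    for d in range(3 * 365)
]

trend, seasonal, remainder = decompose_log(pm10)

# Antilogarithms of the components multiply back to the original series.
rebuilt = [
    math.exp(trend) * math.exp(seasonal[i % 365]) * math.exp(r)
    for i, r in enumerate(remainder)
]
```

Because every factor is an exponential, the rebuilt series cannot be negative, which is the property the additive decomposition lacks.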
An ARIMA(2, 0, 0) model, i.e. a second-order autoregressive model, was fitted to the logarithm of the remainder component, so that each day’s value depends on the values from the previous two days. This model was of a lower order than the ARIMA model fitted to the remainder from an additive decomposition of the PM10 data (namely, ARIMA(2, 0, 2)). Also, the reconstructed time series from an additive model missed the high peaks in concentration and produced negative values. The ‘additive’ ARIMA model is discussed in Part 3.
The simulation of total daily PM10, shown in Figure 1, was constructed as the product of three terms: the mean of the trend component; the seasonal component, repeated annually; and the simulated ARIMA series (taking antilogarithms of the components before multiplying them together).
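The construction described above can be sketched as follows. The autoregressive coefficients, noise standard deviation, mean level and seasonal curve are all placeholder values chosen for illustration, not the fitted values from the analysis, and the function names are hypothetical.

```python
import math
import random

def simulate_ar2(n, phi1, phi2, sigma, rng, burn_in=200):
    """Simulate n values of a zero-mean AR(2) process driven by Gaussian white noise."""
    x_prev2, x_prev1 = 0.0, 0.0
    out = []
    for t in range(n + burn_in):
        x = phi1 * x_prev1 + phi2 * x_prev2 + rng.gauss(0.0, sigma)
        if t >= burn_in:  # discard the start-up transient
            out.append(x)
        x_prev2, x_prev1 = x_prev1, x
    return out

def simulate_pm10(n_days, log_level, log_seasonal, phi1, phi2, sigma, seed=0):
    """Daily PM10 as exp(level) * exp(seasonal) * exp(AR(2) remainder)."""
    rng = random.Random(seed)
    log_remainder = simulate_ar2(n_days, phi1, phi2, sigma, rng)
    period = len(log_seasonal)
    return [
        math.exp(log_level) * math.exp(log_seasonal[d % period]) * math.exp(e)
        for d, e in enumerate(log_remainder)
    ]

# Placeholder inputs: a sinusoidal seasonal cycle in log space and a mean
# level of 18 (in the same units as the observations).
log_seasonal = [0.2 * math.sin(2 * math.pi * d / 365) for d in range(365)]
sim = simulate_pm10(
    n_days=10 * 365,
    log_level=math.log(18.0),
    log_seasonal=log_seasonal,
    phi1=0.5,   # assumed coefficient, satisfies the AR(2) stationarity conditions
    phi2=0.1,   # assumed coefficient
    sigma=0.3,  # assumed standard deviation of the white noise
)
```

Extending the simulation is a matter of increasing `n_days`; the seasonal cycle repeats through the modulo index, and the level stays fixed, so the series does not drift.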
3. Cumulative PM10 Concentrations
The model for daily PM10 was arrived at during the course of several stages of investigation. In Part 1, the time series was modelled as a block bootstrap replicate of observed PM10 glcs. Whilst the simulation matched the observed distribution, including its seasonality, it did not produce any correlation between successive days’ glcs. In Part 2, an ARIMA model was fitted to the data, but the seasonal pattern did not persist in the simulation (these are both expected properties of the selected models). In Part 3, the seasonal pattern was extracted from the time series and an ARIMA model fitted to the remaining glcs. Then the total PM10 was simulated by combining the seasonal pattern with the simulated remainder. Thus the seasonality and the autocorrelation were retained in the simulation, and – aided by a suitable transformation of the data – the distribution of glcs in the simulation matched the observed distribution.
So far, a model has been generated for the baseline PM10 only. The next stage in the process is the combination of the baseline with the (modelled) time series of PM10 arising from a proposed industrial facility or highway upgrade, for example. The cumulative PM10 would be used to examine changes in the number of exceedences of the regulations per year. Note (i) the number of exceedences may currently be low, and (ii) contributions from the industry will be small. This is because (i) the industry would not be permitted to operate if the city were already ‘too’ polluted, and (ii) the industry would be required to ensure its emissions to the atmosphere were minimised. In this scenario, ‘extra’ exceedences may be expected several years apart, at intervals longer than the available data set. This is why a longer-term simulation is needed. The onus is on the industry to demonstrate that any negative impact on air quality will be negligible.
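As a rough illustration of the comparison described above, the sketch below counts exceedences per year for a baseline series and for the same series with an industrial contribution added. The limit of 50, the toy concentrations and the constant contribution of 2 are placeholder values, not figures from the analysis, and the function name is hypothetical.

```python
def exceedences_per_year(daily, limit, days_per_year=365):
    """Count the days in each consecutive year-long block that exceed the limit."""
    counts = []
    for start in range(0, len(daily), days_per_year):
        year = daily[start:start + days_per_year]
        counts.append(sum(1 for c in year if c > limit))
    return counts

LIMIT = 50.0  # placeholder regulatory limit

# Toy two-year baseline: mostly well below the limit, with a few days near it.
baseline = [20.0] * (2 * 365)
baseline[100] = 49.0       # year 1: just under the limit
baseline[200] = 51.0       # year 1: already an exceedence
baseline[365 + 50] = 48.5  # year 2: just under the limit

# Toy industrial contribution: a small constant increment on every day.
contribution = [2.0] * len(baseline)
cumulative = [b + c for b, c in zip(baseline, contribution)]

base_counts = exceedences_per_year(baseline, LIMIT)   # [1, 0]
cum_counts = exceedences_per_year(cumulative, LIMIT)  # [2, 1]
extra = [c - b for b, c in zip(base_counts, cum_counts)]  # [1, 1]
```

With a simulated baseline of many years, the same counts could be summarised as the proportion of years with at least one extra exceedence, which gives the probabilistic answer the text refers to.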
A solution for the long-term cumulative concentration has not been investigated here. Ideally, the baseline and the industrial contributions would be known for the same several-year period, and this situation may be relatively straightforward to address. Models could be developed to simulate both the baseline and the cumulative effects, and the change in the number of exceedences examined. This case may be returned to in a future series of posts.
4. Acknowledgements
The analysis was carried out in R, using the tidyverse libraries dplyr and ggplot2. It also uses the time-series package forecast, developed by Rob Hyndman and George Athanasopoulos, and their online text Forecasting: Principles and Practice. For some of the residual checking, routines from the astsa package, developed by Robert Shumway and David Stoffer, have also been used.
Data were obtained from the New South Wales Government’s Office of Environment & Heritage, Australia, as hourly PM10 concentrations for the years 2011-2017 from an air quality monitoring site in central Sydney.