1. Background and Motivation
Health impacts of air pollution are well-documented; local and national governments have in place regulations on permissible concentration limits for a number of pollutants. Applications for consent to operate industrial facilities or build new roads, for instance, need to account for the cumulative effects of air discharges from these sources into a potentially already-polluted urban atmosphere. In other words, concentrations of pollutants from a new source need to be considered in combination with baseline levels of those pollutants.
The potential concentrations from proposed new sources are usually calculated using air-dispersion models, as the new facility has not been built yet. Ideally, they would be combined, hour-by-hour, with baseline concentrations from a nearby air quality monitoring site, and the total impact assessed. However, baseline data may be missing for the period modelled, or may be of inadequate quality, in which case they need to be simulated in some way. Also, the assessment may need to provide evidence of long-term compliance with the regulations, under any plausible baseline conditions.
The object of this study is to simulate baseline air-pollution concentrations at a monitoring site in central Sydney, projecting them into the future, based on data from recent years. The aim is to produce a simulation of future years’ air quality, which incorporates as much of the known behaviour of the data as possible.
In this part (Part 1), the simulation takes current data on airborne particulate matter (known as PM10: particles with diameter less than 10 microns), and resamples these data to simulate future years of PM10. In Part 2, several time-series models are evaluated for the monthly-average PM10, incorporating autocorrelation and seasonality. In Part 3, models for the future daily-average PM10 are developed.
The analysis is carried out in R, using tidyverse libraries dplyr and ggplot2. For Part 2, the time-series package forecast, developed by Rob Hyndman and George Athanasopoulos, is used. Reference is made to their online text Forecasting: Principles and Practice.
2. Hourly Data and Calculated Daily Averages
Data were downloaded from the New South Wales Government’s Office of Environment & Heritage, Australia, obtaining hourly PM10 concentrations for the years 2011-2017 from an air quality monitoring site in central Sydney. The data set obtained is nearly complete and in a usable format, thus only a small amount of cleaning is needed to generate 24-hour averages from the data. Figure 1 shows the hourly PM10 over the seven-year data period – concentrations are generally below 100 microgrammes per cubic metre (μg/m3), but a few exceed 200 μg/m3.
The seven-year period encompasses 61368 hours; data are missing from 1132 of these. Daily averages (midnight to midnight) were calculated for days with six or fewer missing hours. This left no more than six missing days in any month, and 57 missing days in total. For each calendar date (day of the year), the mean concentration over the seven years was calculated. The relevant date-mean was imputed into each of the 57 missing days to complete the time series, which is shown in Figure 2. There are a few concentrations over 50 μg/m3, which would be considered breaches of the Australian National Environment Protection Measure and the New Zealand National Environmental Standard for ambient daily PM10. There is also a clearly discernible seasonal component to the time series, with higher concentrations at the end of the year. However, the peaks do not always appear at these times.
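The daily averaging and imputation described above could be sketched in R along the following lines. This is an illustration only: the hourly data here are synthetic, and the names `hourly`, `daily`, `n_missing`, `month_day` and `date_mean` are assumptions, not taken from the original analysis.

```r
library(dplyr)

# Synthetic hourly series standing in for the downloaded data (assumption).
set.seed(1)
hourly <- data.frame(
  datetime = seq(as.POSIXct("2011-01-01 00:00", tz = "UTC"),
                 as.POSIXct("2017-12-31 23:00", tz = "UTC"), by = "hour")
)
hourly$date <- as.Date(hourly$datetime, tz = "UTC")
hourly$pm10 <- rgamma(nrow(hourly), shape = 4, scale = 5)
hourly$pm10[sample(nrow(hourly), 1132)] <- NA             # scattered missing hours
hourly$pm10[hourly$date == as.Date("2013-03-10")] <- NA   # one fully missing day

# Daily mean, kept only where six or fewer of the day's hours are missing.
daily <- hourly %>%
  group_by(date) %>%
  summarise(n_missing = sum(is.na(pm10)),
            pm10 = mean(pm10, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(pm10 = ifelse(n_missing > 6, NA_real_, pm10),
         month_day = format(date, "%m-%d"))

# Impute each missing day with the mean for that calendar date across years.
daily <- daily %>%
  group_by(month_day) %>%
  mutate(date_mean = mean(pm10, na.rm = TRUE),
         pm10 = ifelse(is.na(pm10), date_mean, pm10)) %>%
  ungroup()
```

One point to watch with this approach is 29 February, which has only two observations in a seven-year record; if both were missing, the date-mean would be undefined and another fill rule would be needed.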
3. Simulation of Future Daily Concentrations
A simple approach to simulating this time series into the future involves sampling from the current time series of 2557 data points, with replacement, so that the future is a bootstrap replicate of the present. This reproduces the distribution of concentrations, and does so more closely the longer the sampling period. However, it does not capture the seasonality. Seasonality can be captured by sampling data from only the relevant month, not from the whole year, which is known as a simple block bootstrap. In R, this has been done as follows:
- Create a data frame containing a column of sample dates. In the following example, a seven-year simulation is created, starting 1 January 2018.
- Group the column by month and count the number of rows to get the sample size for each month. For example, a seven-year sample will contain 7 x 31 = 217 rows for January.
- For each month:
  - Filter the sample dates, keeping only the rows for the month.
  - Filter the PM10 data set, keeping only the rows for the month.
  - Obtain a sample (with replacement) of the calculated size, from the filtered month’s PM10 data.
  - These two filtered columns (dates and sampled PM10) have the same length. Bind them together and append them to the previous months’ samples.
- At this point, rows are ordered with all January samples followed by all February samples, etc. Arrange by date.
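The steps above might look something like the following dplyr sketch. It is not the original code: the observed series is replaced by synthetic values, and the object names (`observed`, `future`, `month_sizes`, `simulated`) are assumptions made for illustration.

```r
library(dplyr)

set.seed(42)

# Placeholder for the observed daily PM10 series (assumption: synthetic values).
observed <- data.frame(
  date = seq(as.Date("2011-01-01"), as.Date("2017-12-31"), by = "day")
)
observed$month <- as.integer(format(observed$date, "%m"))
observed$pm10  <- rgamma(nrow(observed), shape = 4, scale = 5)

# Sample dates: a seven-year simulation starting 1 January 2018.
future <- data.frame(
  date = seq(as.Date("2018-01-01"), as.Date("2024-12-31"), by = "day")
)
future$month <- as.integer(format(future$date, "%m"))

# Sample size needed for each month (e.g. 7 x 31 = 217 for January).
month_sizes <- future %>% count(month)

# For each month: filter dates, filter PM10, sample with replacement, bind.
simulated <- bind_rows(lapply(1:12, function(m) {
  dates_m <- future %>% filter(month == m)
  pool_m  <- observed %>% filter(month == m)
  n_m     <- month_sizes$n[month_sizes$month == m]
  dates_m$pm10 <- sample(pool_m$pm10, size = n_m, replace = TRUE)
  dates_m
})) %>%
  arrange(date)   # restore chronological order
```

Because each month is drawn in a single call to `sample()`, the loop runs 12 times regardless of how many future years are requested.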
This approach needs only 12 bootstrap samples to be drawn (one per calendar month), no matter how long the future time period is. The simulated time series is shown in Figure 3. This has some similarities to the observations shown in Figure 2, including the seasonal behaviour and presence of occasional outlying concentrations in spring and autumn.
The distribution of simulated concentrations is similar to that of the observations (see their histograms in Figure 4). They would match more closely if the sample were much larger. However, in this case 2557 daily concentrations have been sampled from a population of 2557, so many will have been selected more than once, and many will not have been selected at all.
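As a rough guide to how many observations go unused, the chance that a particular value is never drawn in n draws with replacement from n values is (1 - 1/n)^n, which approaches exp(-1). The short sketch below evaluates this for a January-sized pool of 217 values and checks it against one random resample; the variable names are illustrative only.

```r
# Chance that a given observation is never drawn in n draws with replacement.
n <- 217                      # e.g. seven Januaries of daily values
p_never <- (1 - 1 / n)^n      # close to exp(-1), a little over a third

# Empirical check on one random resample of the indices 1..n.
set.seed(3)
draw <- sample(n, size = n, replace = TRUE)
frac_unused <- mean(!(seq_len(n) %in% draw))
```

So, under this sampling scheme, something like a third of each month's observed values would be expected to be absent from a same-length simulation.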
Monthly-mean concentrations are shown in Figure 5. Again, the sampled monthly means would be closer to the observed means if the sample were larger, but the seasonal pattern of high concentrations at the end of the year and lower concentrations in the middle has been reproduced. The pattern appears almost sinusoidal, with May (and possibly October) above that trend.
4. Time-Series Autocorrelation
The simulated time series captures some of the behaviour of the observations, but other aspects are missing. A key difference is that the simulated time series changes more rapidly from day to day than the observations. The contrast is more marked in the concentration differences at one-day lag; the bootstrap replicate encompasses a wider range of day-to-day differences than the original time series. The random sampling within each month’s data generates uncorrelated sets of concentrations. They follow that month’s frequency distribution, but are effectively monthly blocks of white noise. The observed time series data appear to be correlated, with smaller changes from one day to the next.
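The effect of resampling on day-to-day changes can be illustrated with a small base-R sketch. The "observed" series here is a synthetic AR(1) process, chosen purely as an assumption to stand in for a correlated record; the real PM10 data and its autocorrelation are not reproduced.

```r
set.seed(7)

# Synthetic autocorrelated series standing in for the observations (assumption).
observed <- as.numeric(arima.sim(model = list(ar = 0.6), n = 2557)) + 20

# Resampling with replacement keeps the values but destroys their ordering.
resampled <- sample(observed, replace = TRUE)

# Spread of one-day differences: wider for the resampled series.
sd_diff_obs <- sd(diff(observed))
sd_diff_sim <- sd(diff(resampled))

# Lag-1 autocorrelation: clearly positive for the original, near zero after resampling.
r1_obs <- acf(observed,  lag.max = 1, plot = FALSE)$acf[2]
r1_sim <- acf(resampled, lag.max = 1, plot = FALSE)$acf[2]
```

The same two diagnostics, applied to the real daily series and its bootstrap replicate, would quantify the contrast described above.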
The next parts of the analysis examine the properties of the observed PM10 concentration series in more depth, testing the fit of several time-series models, and using them to produce forecasts (of expected concentrations with prediction intervals) and simulations (individual realisations of time series) of PM10 in Sydney. Given the strong seasonal component, which has a period of 365 or 366 days, the analysis in Part 2 deals with the monthly average, which has a seasonal period of 12 and is therefore more amenable computationally. Part 3 returns to the daily average, decomposing the time series into long-term and seasonal components, and simulating the components.
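As a pointer towards the monthly analysis, daily values can be collapsed to monthly means and stored as a `ts` object with a seasonal period of 12. The sketch below uses base R and synthetic daily data; the names `daily`, `ym`, `monthly` and `pm10_monthly` are assumptions for illustration.

```r
set.seed(11)

# Synthetic daily series standing in for the observed record (assumption).
daily <- data.frame(
  date = seq(as.Date("2011-01-01"), as.Date("2017-12-31"), by = "day")
)
daily$pm10 <- rgamma(nrow(daily), shape = 4, scale = 5)

# Year-month label, then the mean of the daily values within each month.
daily$ym <- format(daily$date, "%Y-%m")
monthly  <- aggregate(pm10 ~ ym, data = daily, FUN = mean)

# Monthly time series: 84 values, starting January 2011, 12 per year.
pm10_monthly <- ts(monthly$pm10, start = c(2011, 1), frequency = 12)
```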