Cluster Analysis of Meteorological Features: The exploratory data analysis of Part 1 is continued, to include a cluster analysis and develop simple models which classify the fine-particle concentration according to the local meteorology.
1. Introduction
In Part 1, weather and air quality data from Christchurch, New Zealand, were explored, to quantify the relationships among the observations. This was with a view to using hourly weather features – wind, temperature, and humidity – as predictors of fine particle concentration category. The category was defined as positive for concentrations higher than 30 µg/m3, and negative for lower concentrations. Higher concentrations tended to be found in the following circumstances:
- During the winter season, due to emissions from solid-fuel home heating.
- During specific periods of the day; morning, evening, and overnight peak times.
- When the wind speed is low, temperature is low, and a temperature inversion is present.
- Under less-common wind directions, such as north-westerlies.
The exploratory data analysis also found the following:
- There is a strong correlation between the temperature at different levels close to the surface.
- There are some modest correlations between other parameters – low wind speeds, cooler temperatures, temperature inversions and high relative humidity tend to occur together.
These findings would be expected by meteorologists and air quality practitioners, particularly those from urban areas where solid-fuel burners are used to heat homes in winter. People living in Christchurch know there is a potential for smog during cold, calm winter nights.
The analysis in Part 1 confirms anecdotal evidence for conditions of poor air quality, shows that the predictive features are not independent of each other, and also that the conditions conducive to pollution events do not always produce them. All of this has consequences for fitting predictive models.
In this part of the series, a cluster analysis has been carried out on the meteorological predictors. This is a standard step in exploratory data analysis, to further clarify relationships between the predictors. It is an example of unsupervised learning, as it does not use information on the fine particle category. However, some preliminary models are suggested in this post, based on the distribution of positives and negatives among the meteorological clusters.
This work uses meteorological and air quality data from the St Albans air monitoring site from 2012, which is licensed under a Creative Commons Attribution 4.0 International licence by Environment Canterbury.
2. Cluster analysis
Data for 2012 from the St Albans monitoring site in Christchurch have been used. K-means clustering has been performed for the five numerical features. These are as follows:
- temperature from a mast 10 metres above ground level, T10 (in °C),
- temperature difference between the 1 metre and 10 metre levels, T1-10 (in °C),
- westerly wind component, U (in m/s, west to east),
- southerly wind component, V (in m/s, south to north),
- relative humidity, RH (in %).
All features were normalized with min-max feature scaling, before calling the R routine kmeans, where k is the number of clusters chosen. A range of k between 1 and 10 clusters was specified, and for each value, 100 initial centroid locations were tested. The total distance between each data point and its nearest centroid – the within-cluster distance – decreases as k increases, as shown in Figure 1.
The distance would decrease to zero if the number of clusters approached the number of observations. There is no ideal value of k apparent in Figure 1 separating regimes of steep and gradual decrease in the total distance. A value of k = 6 has been chosen for this exercise.
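The scaling and clustering steps above can be sketched in a few lines. The original analysis used the R routine kmeans; the following is only an illustrative stand-in in plain Python, assuming Lloyd's algorithm, a sum of squared distances as the within-cluster measure, and a small synthetic two-feature data set (all function names and the toy data are hypothetical).

```python
import random

def min_max_scale(rows):
    """Scale each column of `rows` to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # Guard against a constant column (zero range).
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, lows)]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen data points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    total = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return total, labels, centroids

def kmeans(points, k, n_starts=100, seed=0):
    """Keep the best of `n_starts` random initialisations."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])

# Synthetic (temperature, relative humidity) pairs, NOT the St Albans data.
toy_rng = random.Random(1)
toy = ([[toy_rng.gauss(5, 1), toy_rng.gauss(80, 5)] for _ in range(30)]
       + [[toy_rng.gauss(18, 1), toy_rng.gauss(40, 5)] for _ in range(30)])
scaled = min_max_scale(toy)
totals = {k: kmeans(scaled, k, n_starts=10)[0] for k in range(1, 5)}
```

Plotting `totals` against k gives the kind of curve shown in Figure 1. Repeating the run from many starting points matters because a single k-means run can settle in a poor local minimum.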
With the data divided into six clusters, their locations within each feature range can be seen in the following figures. The normalized variables have been converted back to their original values, and the wind components U and V have been converted back to wind speed WS (in m/s) and wind direction WD (in degrees, bearing clockwise from north).
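The conversion between wind components and speed and direction can be sketched as follows. This assumes the usual meteorological convention that WD is the direction the wind blows from, with U positive for a westerly and V positive for a southerly, as defined in the feature list; the function names are hypothetical.

```python
import math

def wind_from_components(u, v):
    """Return (speed in m/s, direction in degrees the wind blows FROM)."""
    speed = math.hypot(u, v)
    # The wind vector points downwind, so negate both components
    # to get the bearing of the direction it comes from.
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction

def components_from_wind(speed, direction):
    """Inverse conversion: return (U, V) in m/s."""
    rad = math.radians(direction)
    return -speed * math.sin(rad), -speed * math.cos(rad)

westerly = wind_from_components(1.0, 0.0)   # direction close to 270 degrees
southerly = wind_from_components(0.0, 1.0)  # direction close to 180 degrees
```

Clustering on U and V rather than on WD avoids the discontinuity between 359 and 0 degrees, which would otherwise place two almost identical northerlies at opposite ends of the scale.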
Figure 2 shows the density of data points plotted against wind speed, with a separate curve for each cluster. Although the number of data points varies between clusters, each curve has been normalized to enclose a unit area. Most clusters occupy a similar range of wind speed, except Cluster 3, whose wind speeds are relatively low (light winds of around 1 m/s). Cluster 3 is depicted as a solid line to distinguish it more easily from the others, as it contains the range of wind speeds under which high pollution levels are expected.
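The unit-area normalization can be illustrated with a simple histogram. The curves in the figures may well have been produced with a smoothed density estimate instead; the sketch below is only a stand-in, with hypothetical names and made-up wind speeds.

```python
def histogram_density(values, lo, hi, n_bins):
    """Histogram of `values` on [lo, hi], scaled to enclose unit area."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # Put v == hi in the last bin rather than past the end.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    n = sum(counts)
    if n == 0:
        return [0.0] * n_bins, width
    return [c / (n * width) for c in counts], width

# Two made-up clusters of very different size.
small_cluster = [0.4, 0.8, 1.1, 1.3]
large_cluster = [2.0, 2.5, 2.7, 3.0, 3.1, 3.3, 3.8, 4.2, 4.9, 5.5]
dens_small, width = histogram_density(small_cluster, 0.0, 6.0, 12)
dens_large, _ = histogram_density(large_cluster, 0.0, 6.0, 12)
```

Dividing by the number of points times the bin width is what makes curves for clusters of different size directly comparable.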
Figure 3 shows the data plotted against wind direction. Clusters 1 and 6 contain east-northeasterly winds, between 45 and 90 degrees; Clusters 2 and 5 contain southwesterly winds, around 225 degrees. These are the prevalent wind directions in Christchurch. Cluster 4, whose wind speeds are highest, contains significant west-northwesterly winds, between 270 and 315 degrees. Cluster 3’s wind directions are distributed more evenly through the range 0 to 360 degrees (solid line), as at low wind speeds the wind direction is more variable.
The clusters overlap in their ranges of wind speed and direction; Clusters 1 and 6 cover similar ranges, as do Clusters 2 and 5. Figure 4 shows the data plotted against temperature, T10. There is overlap between the same pairs of clusters just mentioned, but the figure shows Cluster 4 is warmer than the others, and Cluster 3 cooler.
Figure 5 shows the data plotted against temperature difference, T1-10. Generally, temperature decreases with height, and T1-10 is positive. Inversion conditions are defined by a temperature increase with height, so T10 is larger than T1 and the difference is negative. Cluster 3 has the highest proportion of inversions. Clusters 1 and 6 are now quite distinct, with the range of T1-10 for Cluster 6 closer to zero. The temperature differences are mainly positive in the other clusters.
Figure 6 shows the data plotted against relative humidity, RH. Each cluster covers a subset of the total range of RH, from dry conditions (20%) to nearly saturated (100%).
The number of data points in each cluster and the centroid value of each parameter are shown in Table 1. Cluster 3 stands out in most respects – low wind speed, low temperature, temperature inversion.
Table 1: Cluster analysis of Christchurch 2012 meteorological data. Number of points in each cluster (N), and centroid parameter values.
Cluster | N | WS (m/s) | WD (degrees) | T10 (°C) | T1-10 (°C) | RH (%) |
---|---|---|---|---|---|---|
1 | 1695 | 3.1 | 59 | 15 | 1.0 | 62 |
2 | 974 | 3.0 | 213 | 12 | 1.5 | 57 |
3 | 1377 | 0.3 | 311 | 6 | -0.9 | 87 |
4 | 560 | 2.5 | 309 | 20 | 1.3 | 35 |
5 | 1251 | 2.5 | 230 | 10 | 0.9 | 86 |
6 | 2079 | 2.7 | 68 | 13 | 0.3 | 82 |
The clusters can be described verbally as follows:
- Cluster 1: Unstable; warm, moderate east-northeasterly, 21% of hours.
- Cluster 2: Unstable; warm, moderate south-southwesterly, 12% of hours.
- Cluster 3: Inversion; cold, moist, light northwesterly, 17% of hours.
- Cluster 4: Unstable; very warm, dry, moderate northwesterly, 7% of hours.
- Cluster 5: Unstable; cool, moist, moderate southwesterly, 16% of hours.
- Cluster 6: Neutral; cool, moist, moderate east-northeasterly, 26% of hours.
These descriptions are subjective, and do not describe the range of parameters included in each cluster; some clusters comprise a wide range of wind directions, and all include at least a few inversion hours.
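The percentages of hours quoted for each cluster follow directly from the counts N in Table 1, as this short check shows (variable names are illustrative).

```python
# Number of hourly records in each cluster, from Table 1.
counts = {1: 1695, 2: 974, 3: 1377, 4: 560, 5: 1251, 6: 2079}

total_hours = sum(counts.values())
# Share of hours in each cluster, rounded to the nearest percent.
shares = {c: round(100 * n / total_hours) for c, n in counts.items()}
```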
The designations unstable, neutral, and inversion refer to the atmospheric stability, defined simply by the vertical temperature difference. If T1-10 is positive, warm air underlies cold air, and the atmosphere is unstable; vertical circulations can arise as warm air rises and cold air sinks. As mentioned already, an inversion is defined by T1-10 being negative. Finally, neutral conditions occur when T1-10 is around zero.
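As a rough sketch, the stability labels can be written as a rule on the temperature difference. The text does not say how wide "around zero" is, so the 0.5 °C band used here is purely an assumption, chosen because it reproduces the labels given to the Table 1 centroids.

```python
# Half-width of the band treated as "around zero", in °C.
# Hypothetical value: the source does not state one.
NEUTRAL_BAND = 0.5

def stability_class(t1_minus_t10):
    """Classify stability from the 1 m minus 10 m temperature difference."""
    if t1_minus_t10 > NEUTRAL_BAND:
        return "unstable"   # warm air under cold air
    if t1_minus_t10 < -NEUTRAL_BAND:
        return "inversion"  # temperature increases with height
    return "neutral"

# T1-10 centroid values for Clusters 1 to 6, from Table 1.
centroids = [1.0, 1.5, -0.9, 1.3, 0.9, 0.3]
labels = [stability_class(dt) for dt in centroids]
```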
The relative humidity is not a measure of the absolute moisture content, but depends strongly on temperature. Hence the lower ranges of temperature may be labelled as moist in the above list, because the cooler air is closer to saturation (RH of 100%). Conversely, warm air is less saturated, hence the designation dry for Cluster 4. Their moisture contents may be similar.
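The point about moisture content can be checked roughly with the Magnus approximation for saturation vapour pressure. That formula is brought in here for illustration and is not part of the original analysis; applying it to the cluster centroids, which pair a 10 metre temperature with relative humidity, is also only indicative.

```python
import math

def saturation_vapour_pressure_hpa(temp_c):
    """Magnus approximation over water; temperature in °C, result in hPa."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))

def vapour_pressure_hpa(temp_c, rh_percent):
    """Actual vapour pressure implied by temperature and relative humidity."""
    return rh_percent / 100.0 * saturation_vapour_pressure_hpa(temp_c)

# Centroid values from Table 1.
e_cluster3 = vapour_pressure_hpa(6.0, 87.0)   # cold, "moist"
e_cluster4 = vapour_pressure_hpa(20.0, 35.0)  # very warm, "dry"
```

Both centroids come out at a vapour pressure of about 8 hPa, so the "moist" and "dry" labels here reflect temperature far more than the amount of water vapour in the air.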
However, the six clusters do have distinct characters – shown by the differing qualitative description above.
3. Particulate pollution categories in each cluster
The fine-particle pollution at St Albans, Christchurch, has been represented by a categorical variable, defined as positive if the hourly concentration is greater than 30 µg/m3, and negative otherwise. The rest of the analysis in this and the next part of this series is devoted to modelling the pollution category using the weather features as predictors.
Table 2 shows the range of fine-particle concentrations and the number of positives in each meteorological cluster.
Table 2: Distribution of pollution concentration among the meteorological clusters. Concentrations are in µg/m3.
Cluster | Median concentration | Maximum concentration | Fraction of positives |
---|---|---|---|
1 | 10 | 51 | 1 % |
2 | 10 | 72 | 1 % |
3 | 21 | 209 | 36 % |
4 | 9 | 25 | 0 % |
5 | 10 | 56 | 1 % |
6 | 11 | 64 | 1 % |
All | 11 | 209 | 7 % |
The median concentration of the data set is 11 µg/m3, and most cluster medians are close to this value. The fraction of positives for the whole data set is 7%. That is, the fine-particle concentration is above 30 µg/m3 for just 7% of the time.
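The quantities in Table 2 can be computed along the following lines. This is a minimal sketch on made-up records, not the St Albans data; the function name and record layout are assumptions.

```python
from statistics import median

THRESHOLD = 30.0  # µg/m3; concentrations above this are "positive"

def summarise(records):
    """Summarise (cluster_id, concentration) pairs cluster by cluster."""
    by_cluster = {}
    for cluster, conc in records:
        by_cluster.setdefault(cluster, []).append(conc)
    summary = {}
    for cluster, concs in sorted(by_cluster.items()):
        positives = sum(c > THRESHOLD for c in concs)
        summary[cluster] = {
            "median": median(concs),
            "max": max(concs),
            "positive_fraction": positives / len(concs),
        }
    return summary

# Made-up hourly records: (cluster, concentration in µg/m3).
toy_records = [(3, 12.0), (3, 45.0), (3, 31.0), (3, 8.0), (4, 9.0), (4, 25.0)]
toy_summary = summarise(toy_records)
```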
The clustering separates out the positives: most of them fall in Cluster 3, where the fraction of positives (36%) is much higher than in any other cluster. As the fraction of positives in each of the other clusters is no more than 1%, a model that always predicts the negative pollution class would be correct about 99% of the time in those clusters. Even within Cluster 3, however, the positive category is not the majority case.
Some simple decision-based predictive models are possible, depending on which cluster the meteorological data are in. These are described as follows:
- Model 1: All predicted cases are negative. This reflects the majority case in each cluster, and would be 93% accurate, because 93% of the cases are negative. However, there is no predictive power for positive cases.
- Model 2: All predicted cases are positive. This has little accuracy and is included for completeness.
- Model 3: If the weather conditions are in Cluster 3, predict positive cases, otherwise predict negative.
Model 3 should perform better than the others, as it attempts to distinguish between positive and negative cases. Although positive cases are not in the majority in Cluster 3, their importance is increased in Model 3.
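Written as code, Models 1 to 3 are one-line decision rules. The sketch below uses hypothetical function names and a handful of made-up hours to show how their predictions differ.

```python
def model_1(cluster):
    """Always predict negative."""
    return False

def model_2(cluster):
    """Always predict positive."""
    return True

def model_3(cluster, polluted_cluster=3):
    """Predict positive only for hours assigned to Cluster 3."""
    return cluster == polluted_cluster

def accuracy(predictions, observed):
    """Fraction of hours where the prediction matches the observation."""
    hits = sum(p == o for p, o in zip(predictions, observed))
    return hits / len(observed)

# Made-up cluster labels and observed categories for six hours.
toy_clusters = [1, 3, 3, 6, 3, 2]
toy_observed = [False, True, False, False, True, False]

acc_1 = accuracy([model_1(c) for c in toy_clusters], toy_observed)
acc_3 = accuracy([model_3(c) for c in toy_clusters], toy_observed)
```

Accuracy alone is a poor guide when positives are rare, which is why the later comparison uses precision, recall and the F1 score.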
4. Incorporating the season as a predictor
Up to now, the cluster analysis has been applied to the numerical features, with no grouping by season category. In this work, the cold, cool and warm seasons have been defined by the range of fine-particle concentrations occurring in each month of 2012. The cold season contains higher concentrations than the seasons labelled cool and warm. Most of the high concentrations, the positive cases, occur in Cluster 3, and this can be broken down by season.
Cluster 3 is characterized by cold, stable and calm weather conditions. These can happen at any time of the year, but mostly in the cold season, which contains nearly all of the positive cases. Positive cases are in the majority in the cold-season component of Cluster 3. This suggests a further predictive model:
- Model 4: If the weather conditions are in Cluster 3 and the season is designated cold, predict positive cases, otherwise negative.
Model 4 reduces the chance of a false positive, compared to Model 3. Having discovered that improvements may be made by grouping the clustered data by season, further models can be developed by grouping the full data set by season first, then performing a cluster analysis on the cold season only, as follows:
- Model 5: If the data are from the cold season, predict positive, otherwise negative (included for completeness).
- Model 6: If the data are from the cold season, perform a cluster analysis and predict positives for suitable clusters. Everything else is predicted negative.
For Model 6, a k-means cluster analysis has been carried out on the cold-season data, again with k = 6. In this analysis, there are two clusters of calm and stable conditions, which contain most of the positive data. In one of these cases, 68% of the data are positives; in the other 49% are positives. Model 6 therefore is set to predict positive cases for these two clusters only.
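Models 4 to 6 can be sketched in the same way. The season labels are taken as given, since the month-to-season assignment is not listed here, and the cluster numbers passed to Model 6 are placeholders: the text does not say which two cold-season clusters were selected.

```python
def model_4(cluster, season, polluted_cluster=3):
    """Positive only for Cluster 3 hours in the cold season."""
    return season == "cold" and cluster == polluted_cluster

def model_5(season):
    """Positive for every cold-season hour."""
    return season == "cold"

def model_6(season, cold_cluster, positive_clusters):
    """Positive for cold-season hours in the chosen cold-season clusters.

    `cold_cluster` is the label from a separate k-means run on
    cold-season data only; it is not comparable to the year-round labels.
    """
    return season == "cold" and cold_cluster in positive_clusters

# Placeholder IDs for the two calm, stable cold-season clusters.
CALM_STABLE = {2, 5}

examples = [
    model_4(3, "cold"),              # True
    model_4(3, "warm"),              # False
    model_6("cold", 5, CALM_STABLE), # True
    model_6("cool", 5, CALM_STABLE), # False
]
```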
The models include a simple decision tool based on season (Model 5), and more discerning models based on the weather conditions (Models 3, 4 and 6). Their skill scores are evaluated in the next section.
5. Scoring and comparison between models
For each model, the confusion matrix and various scoring parameters have been calculated. The confusion matrix contains the numbers of true and false positive and negative predictions (labelled TP, FP, TN and FN) and the commonly-used scores are simple functions of these. The following scores focus on the prediction of positive cases:
- Precision, or Positive Predictive Value (PPV): the proportion of predicted positive cases that were observed. PPV = TP/(TP+FP)
- Recall, or True Positive Rate (TPR): the proportion of observed positive cases that were predicted. TPR = TP/(TP+FN)
- F1 score: the harmonic mean of the PPV and TPR.
The F1 score combines the precision and recall into a useful overall score for the prediction of positive cases. The scores for each model are shown in Figure 7.
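The scores defined above can be computed from paired observations and predictions as in the following sketch. Returning zero when a denominator is zero is a convention chosen here, not something specified in the text; the data are made up.

```python
def confusion_counts(observed, predicted):
    """Return (TP, FP, TN, FN) for boolean observations and predictions."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred and obs:
            tp += 1
        elif pred and not obs:
            fp += 1
        elif obs:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def scores(tp, fp, tn, fn):
    """Return (precision, recall, F1), using 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up example: three observed positives, three predicted positives.
observed = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
counts = confusion_counts(observed, predicted)
ppv, tpr, f1 = scores(*counts)
```

A model that never predicts a positive, such as Model 1, has TP = FP = 0 and scores zero on all three measures under this convention.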
Considering the F1 score, models without clustering (1, 2 and 5) perform badly, as they predict either a whole season, a whole year, or none of the time as positive. Models 2 and 5 have high recall, but this is expected from over-prediction of positives. These models have little skill in separating positive and negative cases.
The model performance is improved by clustering, and improved further by incorporating the season. Model 6 produces the best precision and F1 scores (61% and 67%, respectively), by performing a cluster analysis on the cold-season data only. By filtering the season first, the proportion of positive cases is increased in the meteorologically-relevant clusters, so that the model’s prediction of a positive pollution event in those clusters is less likely to be wrong.
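Because the F1 score is the harmonic mean of precision and recall, the recall implied by the quoted precision and F1 can be recovered by rearranging the definition. This is a consistency check on rounded figures, so the result is approximate; the function name is illustrative.

```python
def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for the recall R."""
    return f1 * precision / (2 * precision - f1)

# Precision and F1 quoted for Model 6.
recall_model_6 = implied_recall(0.61, 0.67)
```

This gives a recall of roughly 74% for Model 6, meaning that about a quarter of the observed positive hours would be missed.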
Following good data-science practice, models should be developed using training data, and scored using unseen test data. This is to avoid over-fitting of the model to the training data. Here, however, the 2012 air quality data were all used in both the training and testing stages of the analysis. Also, the model is based on historical meteorological data – performance would be diminished if air quality predictions were based on weather forecasts.
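A hold-out split of the kind described could be set up as below. It was not part of this analysis; the 25% test fraction and the function name are arbitrary choices for illustration.

```python
import random

def holdout_split(n_records, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) from a shuffled index list."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_records * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

# 7936 is the number of hourly records in Table 1.
train_idx, test_idx = holdout_split(7936)
```

Consecutive hours are strongly related, so a random hourly split can leak information between the two sets. Holding out whole days, or a different year, would give a stricter test.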
Further, if a high fine-particulate concentration were predicted using Model 6, the precision of 61% means the chance of it actually occurring is 61%. Air quality specialists, regulators, and people vulnerable to the effects of poor air quality may not feel that the information is useful to them and worth acting upon, despite the precision being much higher than the chance of a positive case on average (namely 7%).
There is clearly room for improvement in the predictive models; those presented here are based on a small number of clusters of data, and not on data from individual hours. Subsequent posts in this series apply more refined statistical models to the data, employing standard machine-learning techniques. Part 3 investigates logistic regression and Part 4 looks at decision trees and random forests. These lead to improved predictions of fine particulate levels based on observed meteorological features.