Logistic Regression Model: The simple models developed in Part 2 to classify airborne fine-particle concentration in terms of local meteorology are improved on using logistic regression techniques.
1. Introduction
This series of posts explores relationships between weather and air quality parameters in Christchurch, New Zealand, and develops models to predict the hourly pollution category using weather parameters such as wind, temperature and humidity. In Part 1, an exploratory data analysis (EDA) was carried out, to examine relationships among the parameters (weather and air quality). In Part 2, a cluster analysis was performed on the weather parameters, and the distribution of pollution observations among the clusters was examined. Simple models were developed which predict the category as positive or negative (high or low pollution) from the meteorology, according to meteorological data cluster. The best of these achieved a precision of 61%, a recall of 76% and an F1 score of 67% (the skill scores are defined and discussed below, and also here).
In this part, a logistic regression of air pollution category on meteorological data was done, incorporating both numerical and categorical features (for instance, wind direction, time of day, and month of year are cyclic, so cannot be treated as numerical data). The analysis was done in Microsoft’s Azure Machine Learning environment (Azure).
This work uses 2012 data from the St Albans air monitoring site, which is licensed under a Creative Commons Attribution 4.0 International licence by Environment Canterbury.
2. Definitions
The target variable is the hourly-averaged air-pollution category. This is defined as positive if the fine particle (or PM2.5) concentration is above 30 µg/m3, and negative otherwise. The data are imbalanced: 7% positive and 93% negative. When a classification model is used, the confusion matrix comprises the numbers of true and false, positive and negative predictions (labelled TP, FP, TN, FN). Derived from these, the skill scores used in this work are the following:
- Precision, or Positive Predictive Value (PPV): the proportion of predicted positive cases that were observed. PPV = TP/(TP+FP)
- Recall, or True Positive Rate (TPR): the proportion of observed positive cases that were predicted. TPR = TP/(TP+FN)
- F1 score: the harmonic mean of the precision and recall.
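As a quick check on these definitions, the scores can be computed directly from the confusion-matrix counts (the counts below are made up for illustration, not taken from the study):

```python
# Skill scores from confusion-matrix counts. The counts passed in at the
# bottom are illustrative only.

def skill_scores(tp, fp, fn):
    precision = tp / (tp + fp)          # PPV: predicted positives that were observed
    recall = tp / (tp + fn)             # TPR: observed positives that were predicted
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

ppv, tpr, f1 = skill_scores(tp=60, fp=15, fn=20)
print(f"precision={ppv:.2f} recall={tpr:.2f} F1={f1:.2f}")
```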
The precision is also the probability that a predicted positive case will actually occur, which makes it useful in forecasting. However, its value can be inflated by making positive predictions only for the cases the model is most certain of. If this is done, the precision has little practical value, as true positive cases – real pollution events – will be missed. Similarly, modelling all cases as positive makes the recall 100%, but decreases the precision, and the model has no predictive power. The F1 score, which combines the precision and the recall, is a useful overall model score and rewards a balanced prediction of positive pollution events.
3. An idealized example of logistic regression
Logistic regression is used to predict the probability of a binary variable being positive. It uses a logistic function for the probability, which ranges between 0 and 1. The argument to the logistic function is known as the log odds, where the odds are the ratio of probabilities of positive to negative outcomes. The log odds are a linear combination of features such as wind speed, temperature and humidity, and the logistic regression finds the coefficients.
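In symbols (with β0, …, βn denoting the coefficients found by the regression for features x1, …, xn):

\ln \, (odds) = \ln \left( \cfrac{p}{1-p} \right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n

where p is the probability of a positive case.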
A simple example is shown, where the air-pollution category is assumed to depend on the temperature only. This is denoted T10, the temperature in °C 10 metres above ground level. Suppose the logistic regression has been done, and has optimized the two coefficients in the linear formula as follows:
\text{ln} \, (odds) = 2 - 0.3 T_{10}
(ln denotes the natural logarithm). A decrease in temperature increases the odds of a positive case. The probability of a case being positive uses the logistic function and is given by the following:
Pr \, (positive) = \cfrac{1}{1 + e^{-(2 - 0.3 \, T_{10})}}
Figure 1 shows the prediction for temperatures between -5°C and 25°C. The log odds are linear in temperature, as defined. The odds themselves range from 33:1 on a positive outcome at -5°C to about 1:240 on (or 240:1 against) at 25°C. Then the probability of a positive case is nearly 1 at -5°C and nearly zero at 25°C.
The probability of a positive case is 0.5 when the odds are 1:1 (evens), or equivalently when the log odds are zero. This occurs when T10 is 6.7°C. The model would be used to predict a positive air-pollution category if the probability were above a threshold of 0.5 (that is, if the temperature were below 6.7°C).
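The idealized model can be evaluated directly (a minimal sketch; the coefficients are those given above):

```python
import math

# The idealized one-feature model: ln(odds) = 2 - 0.3*T10.
# The coefficients are those from the worked example in the text.

def p_positive(t10):
    log_odds = 2.0 - 0.3 * t10
    return 1.0 / (1.0 + math.exp(-log_odds))    # logistic function

for t in (-5, 6.67, 25):
    flag = "positive" if p_positive(t) > 0.5 else "negative"
    print(f"T10={t:6.2f} °C  Pr(positive)={p_positive(t):.3f}  ->  {flag}")
```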
The logistic regression technique should be a useful tool for modelling air pollution as a function of meteorological parameters. In reality, the pollution concentration tends to vary monotonically with weather features, increasing as temperature and wind speed decrease, and as inversion strength and humidity increase. The logistic regression model is also monotonic in each feature – the curves in Figure 1 all decrease from left to right.
The pollution concentration may change monotonically with the numerical features, but that change may not be linear. It may be worthwhile to perform feature engineering on parameters such as the wind speed or temperature, as mentioned in Part 1. This is investigated below.
The logistic regression technique deals with numerical features straightforwardly. It deals with categorical features by converting them to a set of indicator values for each category using one-hot encoding.
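One-hot encoding can be sketched in a couple of lines (a reduced four-point compass is used here for brevity, rather than the full 16-wind rose):

```python
# One-hot encoding of a categorical feature, in plain Python.
# A four-point compass stands in for the full 16-wind rose.

CATEGORIES = ["N", "E", "S", "W"]

def one_hot(value, categories=CATEGORIES):
    # One indicator column per category; exactly one is set to 1.
    return [1 if value == c else 0 for c in categories]

print(one_hot("E"))   # -> [0, 1, 0, 0]
```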
4. Logistic regression on meteorological data
In Part 2, a set of simple models made a decision on the air pollution category based on a cluster analysis of the following five numerical features:
- Temperature on a mast 10 metres above ground level, T10 (in °C).
- Temperature difference between the 1 metre and 10 metre levels, T1-10 (in °C).
- Westerly component of the wind vector, U (in m/s, west to east).
- Southerly component of the wind vector, V (in m/s, south to north).
- Relative humidity, RH (in %).
Here, the analysis has been extended to allow the use of categorical features. The features used in the modelling for this Part were as follows:
- Temperature from a mast 10 metres above ground level, T10 (in °C).
- Temperature difference between the 1 metre and 10 metre levels, T1-10 (in °C).
- Wind speed, WS (in m/s).
- Relative humidity, RH (in %).
- Wind direction, in 16 categories defining a 16-wind compass rose.
- Hour of day, 24 categories.
- Month of year, 12 categories.
In Azure, the above features were selected for analysis, and the label identified, which in this case is the pollution class. Categorical features were identified and converted into indicator values.
Numerical features were scaled using min-max normalization before training the model. A two-class logistic regression model was specified and trained on 60% of the data (4761 hours). The model was tested using the remaining 40% of the data (3175 hours); this is an example of a holdout method. The data split was stratified to ensure the same proportion of positives and negatives appeared in each subset. The skill scores were an improvement over the best model from Part 2, with a precision of 80%, recall of 75% and F1 score of 78%. The change in F1 score is driven by the improvement in precision. The logistic regression model is more discerning than the cluster-based decision model; it produces fewer false positives, or false alarms.
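The preprocessing steps can be sketched in plain Python (Azure performs these with its own modules; the function names here are mine, for illustration):

```python
import random

# Min-max normalization and a stratified 60/40 holdout split.

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]   # rescale to [0, 1]

def stratified_split(labels, train_frac=0.6, seed=42):
    rng = random.Random(seed)
    train, test = [], []
    for label in set(labels):                       # split each class separately
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return train, test

labels = [1] * 7 + [0] * 93                         # mimic the 7%/93% imbalance
train, test = stratified_split(labels)
print(len(train), len(test))                        # 60/40, preserving the ratio
```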
The logistic regression calculates the probability of a positive case, and a threshold probability must be chosen. Modelled probabilities above the threshold are then predicted as positives. Figure 2 shows how the precision and recall vary as the threshold is changed. Increasing the threshold from 0 to 1, we travel along the curve from bottom-right to top-left. If the threshold is close to 1, fewer cases are predicted positive, and the precision is high; if the threshold is close to 0, most cases are predicted positive, and the recall tends to 1. An optimum value of the threshold can be chosen which maximizes the F1 score, which occurs when the precision and recall are about halfway along the curve. The scores quoted above (and most in the tests below) use a threshold of 0.5; a slight improvement can be obtained by choosing a threshold of 0.46.
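The threshold sweep behind Figure 2 can be sketched in a few lines; the probabilities and labels below are synthetic placeholders, not model output:

```python
# Sweep the threshold probability and pick the value that maximizes F1.

def f1_at_threshold(probs, labels, thr):
    tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < thr and y == 1)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

probs  = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]   # modelled probabilities
labels = [0,   0,   0,    1,   0,    1,   1,   1]      # observed categories
best = max((t / 100 for t in range(1, 100)),
           key=lambda t: f1_at_threshold(probs, labels, t))
print(best, round(f1_at_threshold(probs, labels, best), 3))
```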
Results for pollution category are shown against some of the features. Figure 3 compares the observed and modelled categories against date and time.
Positive pollution categories occur largely during the winter months (mid-year in the southern hemisphere), with events of several hours commencing early evening (hour 18, or 6 pm) and lasting overnight, plus further high pollution up to around hour 10 (left panel of Figure 3). These are the peak emission times for solid-fuel home heating in Christchurch. Occasionally, the positive cases persist through the day (in early June, for example). Dates before 1 April and after 30 September are not shown, as the cases are all negative, and there is a full month of missing pollution data from mid-August to mid-September.
The model results are shown after re-combining the training and test data, and applying the model to times when the pollution data were missing (right panel of Figure 3). The modelled probability is a continuous variable in the range [0,1]. With a threshold probability of 0.5, hours shaded brown and blue would be modelled positives, with green and yellow modelled negatives. [A more cautious model could set the threshold at 0.25, and the greens would also be predicted positives]. The model matches the peak times of observed high pollution, and reproduces the episode in early June that lasted all day.
Figure 4 compares the observed and modelled categories against temperature (T10) and wind speed (as in Figure 3, the model results include predictions for hours of missing pollution data).
In Figure 4, positive cases are confined to the colder, calmer end of the ranges, in both the observations and the model results. Warmer temperatures and higher wind speeds are not shown, as the pollution cases there are all negative. The distributions of observed and modelled positive cases are similar, although a few negative cases at around -2 °C and 0.0 m/s were predicted as positives.
5. Improvements to the logistic regression model
A number of procedures are available in Azure, which can potentially improve the model performance. These have been applied to the air quality data, and their effects on the results are discussed in this section. The resulting skill scores for all tests, and the base case described already, are shown at the end of this section.
Feature importance: If a feature is important to the model, then a random permutation of its data will reduce the model’s skill. If the permutation does not lower the model’s skill, the feature is not important and may be discarded. The permutation feature importance (PFI) module in Azure permutes every feature column in turn and reports the decrease in score. Only results for numerical parameters, not indicator values, should be examined using this method; PFI results were therefore examined for T10, T1-10, WS and RH, but not month, hour, and wind direction.
The PFI module was run and the changes in precision and recall examined. The results indicated that RH was not important to the model precision. The logistic regression model was re-run without the RH data, but this led to a slight decrease in F1 score. Hence RH has been retained as a predictor of air-pollution category.
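The idea behind PFI can be sketched in plain Python. The function, the toy "model" `acc` and the data below are my own placeholders, not the Azure module:

```python
import random

# Permutation feature importance: shuffle one feature column of the input to
# an already trained model, re-score, and report the mean drop in skill.

def permutation_importance(score_fn, X, y, col, n_repeats=10, seed=0):
    rng = random.Random(seed)
    base = score_fn(X, y)                        # skill on unpermuted data
    drops = []
    for _ in range(n_repeats):
        Xp = [row[:] for row in X]               # copy the rows
        column = [row[col] for row in Xp]
        rng.shuffle(column)                      # permute one feature only
        for row, v in zip(Xp, column):
            row[col] = v
        drops.append(base - score_fn(Xp, y))
    return sum(drops) / n_repeats                # mean decrease in score

# Toy "model": predict positive when the first feature is below zero.
X = [[-2.0, 5.0], [-1.0, 6.0], [1.0, 5.5], [2.0, 6.5]]
y = [1, 1, 0, 0]
acc = lambda X, y: sum((x[0] < 0) == bool(t) for x, t in zip(X, y)) / len(y)

print(permutation_importance(acc, X, y, col=0))  # feature 0 is used: drop >= 0
print(permutation_importance(acc, X, y, col=1))  # feature 1 is ignored: drop is 0
```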
Feature engineering: Feature engineering refers to transformations of features that can increase the predictive power of the model. Examples are the conversion of numerical data to indicator values, or the removal of features that are not independent of other features. The exploratory data analysis in Part 1 indicated non-linear relationships between fine-particle concentration and weather features: higher correlations of concentration with wind speed raised to the power -1/3, and with temperature raised to the power -1/2. The model was run with WS and T10 raised to these respective powers in the log odds formula, and this led to a slight reduction in skill scores. This type of feature engineering may be more useful in linear rather than logistic regression, where actual concentration values are modelled. Hence the log odds were left as linear in WS and T10.
Over-sampling of positive cases: The pollution data are imbalanced, with only 7% of air-pollution concentrations in the positive category. This means that a model which makes mostly negative predictions can be highly accurate, but it does not have any useful predictive power for positive cases. The positive cases are pollution events and are arguably more important from a public-health point of view. The penalty for missing them should arguably be higher than for a false alarm. Increasing the penalty for a missed positive – and therefore making the model more likely to make positive predictions – is equivalent to over-sampling the positive cases in the data set. For example, if we want to double the penalty for a missed positive, we can simply duplicate all of the positive data (in the training set only). That is simple random oversampling; a refinement is the synthetic minority oversampling technique (SMOTE), implemented in Azure, which generates new synthetic positive cases by interpolating between existing ones rather than duplicating them.
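Duplicating the positive rows can be sketched as follows (simple random oversampling in plain Python; SMOTE's interpolation step is omitted, and the function name is mine):

```python
import random

# Random oversampling of the minority (positive) class in the training set.
# factor=13 mirrors an oversampling percentage of 1200% (12 extra copies).

def oversample_positives(rows, labels, factor, seed=1):
    rng = random.Random(seed)
    pos = [(r, y) for r, y in zip(rows, labels) if y == 1]
    extra = [rng.choice(pos) for _ in range(len(pos) * (factor - 1))]
    combined = list(zip(rows, labels)) + extra
    rng.shuffle(combined)
    return [r for r, _ in combined], [y for _, y in combined]

rows = list(range(100))
labels = [1] * 7 + [0] * 93             # 7%/93% imbalance, as in the data
X2, y2 = oversample_positives(rows, labels, factor=13)
print(sum(y2), len(y2) - sum(y2))       # 91 positives vs 93 negatives
```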
The logistic regression model has been run using SMOTE, with an oversampling percentage of 1200%. This means the total number of positive cases has been multiplied by 13, changing the negative:positive ratio from 93:7 to 93:91, so the data are nearly balanced. Predictions are then much more likely to be positive. However, this reduces the precision, and the overall F1 score falls. In this data set, the positive cases lie among the negative cases in feature space, not separate from them; this was seen in the cluster analysis of Part 2. A SMOTE percentage of 1200% increases the likelihood of a positive prediction; it removes false negatives and increases the recall, but also adds false positives and decreases the predictive power.
[Increasing the threshold probability from 0.5 to 0.92 maximizes the F1 score by reducing the number of modelled positives. However, this counteracts the effect of the oversampling and the results are no better than the base case.]
Regularization: This technique is used to avoid over-fitting of the model to the training data, which may contain noise. It constrains the feature coefficients used in the log odds formula, by incorporating their L1- or L2-norms into the model’s cost function. This reduces the sensitivity of the predicted pollution category to the meteorological features in the (noisy) training data, potentially giving a better fit to (new, unseen) test data. The logistic regression model has been run with L1-norm regularization (aka lasso regression), L2-norm regularization (aka ridge regression), and a combination of both (known as elastic net regularization). A good introduction to these concepts may be found here.
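In symbols, regularization adds penalty terms to the model’s cost function. Writing β for the vector of feature coefficients, pi for the modelled probability of case i, and λ1, λ2 for the penalty weights (generic notation, not Azure’s parameter names):

J(\beta) = -\sum_{i} \left[ \, y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \, \right] + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2

Setting λ2 = 0 gives the lasso, λ1 = 0 gives ridge regression, and both non-zero gives the elastic net.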
Employing regularization gave a slight improvement in F1 score compared to the base case. A reduction occurs in the number of positives predicted, as a handful of FPs changes to TNs. The score is increased further by reducing the threshold probability from 0.5 to 0.44; in this case the number of predicted positives increases, but there are more new TPs than FPs.
The model performance scores are summarized in Figure 5 for the models described above.
The logistic regression base case performs better than the decision-based model based on the cluster analysis in Part 2 (identified as Cluster Model 6 in Figure 5). A slight increase in F1 score over the base case was obtained by employing regularization, then perturbing the threshold probability, with the score reaching close to 0.8 (using L1 & L2 regularization and a threshold of 0.44). A small drop in skill occurred when removing RH or using inverse powers of temperature and wind speed (Feat. Eng’ng case in Figure 5). Also, using a SMOTE percentage of 1200 resulted in a far larger number of positive predictions, such that the recall was almost 100%. However, as mentioned above, this reduced the precision and the F1 score.
Most of the logistic regression model variations produced a marked improvement on the models presented in Part 2. The logistic regression is more discerning about which cases are predicted positives (as opposed to assigning predictions to whole clusters of data), and hence has higher predictive value.
6. Cross validation
Cross validation: The skill of the machine learning models discussed above has been assessed using the holdout method, in which the complete data set was divided into two subsets – the training set and the test set. The original model developed using the holdout method can be evaluated further using k-fold cross validation. In this, the data set is divided into k folds (here, ten), and ten new models are calculated, each using one fold as test data and the other nine as training data. This is a way of assessing both the performance of the modelling method and the robustness of the data set. These can be considered acceptable if the range of scores from the cross validation is small and the original model’s scores are within that range.
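A stratified fold assignment can be sketched in plain Python (the function name is mine; Azure’s cross-validation module handles this internally):

```python
# Stratified k-fold index generation (k = 10, as in the text). Each fold
# serves once as test data, with the other nine used for training.

def stratified_kfold(labels, k=10):
    folds = [[] for _ in range(k)]
    for label in set(labels):               # deal each class out round-robin
        idx = [i for i, y in enumerate(labels) if y == label]
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

labels = [1] * 7 + [0] * 93
folds = stratified_kfold(labels, k=10)
print([len(f) for f in folds])              # ten folds of roughly equal size
```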
In this cross validation case, the folds were stratified, and the cross validation was combined with regularization. The cross validation gave a range of F1 scores with mean 76% and standard deviation 3.8%. The F1 score of the logistic regression model that uses both L1- and L2-norm regularization is 78% (L1 & L2 in Figure 5), which is within one standard deviation of the mean from the k-fold cross validation. Hence this version of the logistic regression model performs acceptably.
7. Summary
The logistic regression has been reasonably successful as a model for predicting the hourly air pollution category from local meteorological data and time-of-day and time-of-year parameters. The base case model was improved by including regularization, but other available techniques did not appear to help.
Shortcomings in the model may be due to several factors. A whole month of data is missing, which may bias the results, and there are unavoidable physical relationships between the meteorological features. The modelling may be made more robust if it were based on a multi-year data set.
Meteorological features such as wind and temperature difference influence the dispersion of air pollution and therefore the airborne concentrations of fine particles. Temperature itself may not influence dispersion but it may influence emissions, as domestic heaters are used more when it is colder. This affects concentrations. The time of day and time of year may be indirectly related to concentration, as heaters are used in winter and at night-time. In other words, time of day, time of year and temperature may be thought of as proxies for air-pollution emissions. They are not completely independent of each other, but are probably all needed in the predictive model.
At the start of Part 1, a significant difficulty in predicting air quality was described, where weather conditions conducive to poor air quality would be observed, but air quality would actually be good. Although models have been shown to produce reasonable results in this set of posts, that difficulty still remains. The exploratory data analysis of Part 1 and the cluster analysis of Part 2 clearly showed that under the conditions where positive cases occur, there may be just as many, or more, negative cases. No data cluster contained only positive cases, and only one cluster was found where the positives were the majority case.
There are several extensions to this investigation which may enable further improvements to a predictive model for air quality in Christchurch, New Zealand. These are as follows:
- Incorporation of similar data from other years, and other monitoring sites. As the particulate pollution disperses across the city, its impact at the St Albans site may be influenced by conditions upwind at other sites.
- Use of more sophisticated machine-learning algorithms to uncover deeper relationships in the data, if they exist. Work on decision trees and random forests is in progress and will appear as Part 4 of this series.
- Simulation of the three-dimensional structure of air pollution plumes and layers using urban airshed dispersion models. I have done this for several cities in New Zealand, including Christchurch.
- Incorporation of PM2.5 emissions explicitly, instead of presuming temperature and time-of-day are proxies for this: this is done by urban airshed models.
- Accounting for randomness: there may be a component of the PM2.5 time series that is due to unaccounted-for effects, but which appears to be random. Time-series analysis can reveal structure such as trends, seasonality, and auto-correlations. This has been done for air quality data in another series of posts.
Finally, it is accepted among data scientists that an ensemble of models can do better than an individual predictive model, and a combination of several kinds of model can yield better results than a single one. A combination of the above types of model would account for the dependence of air-pollution concentration on meteorological observations at monitoring sites, its three-dimensional structure and dependence on emissions, and the statistical properties of any unexplained noise in the time series. Future series of posts will investigate these ideas in an air-quality context.