One of our hobbies in recent years has been trying to make something useful. Aleks Lyubenov is one of ROITI's Data Scientists and, together with the rest of the team, he has been working on an internal project: forecasting the air quality in Sofia, Bulgaria's capital. Below, Aleks reveals some of the processes and analyses behind it, as well as what the predictions are.
Disclaimer: Buckle up for some Maths. 🙂
In recent years, air pollution has become an incredibly important topic in the realm of sustainable growth and development. Aside from the obvious environmental impact, rising emissions of air pollutants inevitably lead to higher concentrations of particulate matter in the atmosphere. This can – in turn – have a substantial effect on both local and global economies through reduced labour productivity, poor agricultural yields and an increase in healthcare expenditure. Thus, it is essential that we are able to model and accurately forecast such pollution in order to mitigate the aforementioned negative consequences.
Weather and other atmospheric phenomena are often modelled using numerical weather prediction, a method which attempts to explain the temporal evolution of atmospheric processes via the assimilation of observed variables such as cloud formation, air pressure, temperature, precipitation, and dozens, if not hundreds, of other meteorological indicators. However, many of the dynamic equations that govern these processes cannot be explicitly calculated due to various numerical instabilities and are therefore approximated using parametrization schemes and computationally expensive numerical methods. Due to the extensive computational requirements of three-dimensional spatial interpolation in these frameworks, we believe it is essential to develop more lightweight models, which leverage the power of modern machine learning to model atmospheric phenomena.
Thus, we propose a data-driven approach to air quality forecasting. With over 2000 monitoring stations all over Europe, the European Air Quality Index provides a short-term indication of continental atmospheric quality using time series data containing five key pollutants:
- PM2.5 Atmospheric particulate matter (< 2.5 𝜇m)
- PM10 Atmospheric particulate matter (< 10 𝜇m)
- O3 Ground-level ozone
- NO2 Nitrogen dioxide
- SO2 Sulphur dioxide
Of these monitoring stations, 32 are located in Bulgaria and 5 are within Sofia:
- (42.708, 23.31) 165, Kozloduy, Banishora, zh.k. Banishora, Sofia
- (42.726, 23.342) Rusalka, kv. Orlandovtsi, Serdika
- (42.646, 23.381) 389, zh.k. Mladost 3, Sofia
- (42.727, 23.34) 54, Zhelezopatna, kv. Orlandovtsi, Sofia
- (42.622, 23.328) Monah Spiridon, Simeonovo – Dragalevtsi, Vitosha
In addition to particulate matter, we have considered meteorological data for the greater Sofia area. This data included indicators such as pressure, temperature, humidity, precipitation, wind speed, cloud cover and solar radiation levels, to name a few. We also decided to include temporal features such as, but not limited to, the day of the year and whether a particular day is a weekday or not. This was done to give our model an opportunity to capture any existing seasonal trends.
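Such temporal features are cheap to derive from the timestamp index. The sketch below shows one way to do it with pandas; the date range and column names are illustrative choices of our own, not the project's actual feature schema:

```python
import pandas as pd

# Hypothetical hourly index for illustration only.
idx = pd.date_range("2021-01-01", periods=48, freq="h")
features = pd.DataFrame(index=idx)
features["day_of_year"] = idx.dayofyear                   # 1..365, exposes annual seasonality
features["is_weekday"] = (idx.dayofweek < 5).astype(int)  # Mon-Fri -> 1, Sat/Sun -> 0
features["hour"] = idx.hour                               # diurnal traffic/heating cycle

print(features.head(3))
```

Encoding the weekday flag as 0/1 (rather than a day name) lets the model pick up weekday-vs-weekend traffic patterns directly.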
During the data collection and cleaning process, one of the main challenges we encountered was varying measurement frequencies in the gathered sensor data. We observed delays ranging from just 30 seconds to a few hours in extreme cases. Consistent timesteps are obviously an essential requirement in the predictive modelling of time series data. As such, we utilized a windowed resampling technique to fill in missing values. Our approach was to use a weighted average of data points located around the interpolation target. The weight, naturally, depended on the time difference between the sampled point and the target.
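A minimal sketch of such a windowed, time-distance-weighted resampler is below. The inverse-distance weighting and the one-hour window are illustrative assumptions on our part, not the exact scheme used in the project:

```python
import numpy as np

def resample_weighted(times, values, targets, window=3600.0):
    """Interpolate irregular samples (times in seconds) onto regular targets.

    Each target gets a weighted average of the samples within +/- `window`
    seconds of it, with closer samples weighing more.
    """
    out = np.full(len(targets), np.nan)
    for i, t in enumerate(targets):
        dt = np.abs(times - t)
        mask = dt <= window
        if mask.any():
            w = 1.0 / (dt[mask] + 1.0)   # weight decays with time difference
            out[i] = np.sum(w * values[mask]) / np.sum(w)
    return out

# Irregular measurements: gaps from 30 seconds to over an hour
times = np.array([0.0, 30.0, 4000.0, 7300.0])
values = np.array([10.0, 12.0, 20.0, 30.0])
targets = np.array([0.0, 3600.0, 7200.0])  # the regular hourly grid we want
print(resample_weighted(times, values, targets))
```

Targets with no samples inside the window are left as NaN, so genuinely long sensor outages remain visible rather than being silently invented.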
Thereafter, we constructed a 24-hour sliding window in the time series and defined the desired output to be the first consecutive hour outside the window.
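The windowing step can be sketched in a few lines of NumPy; the stand-in series below is a toy sequence, not our pollutant data:

```python
import numpy as np

def make_windows(series, window=24):
    """Slide a `window`-hour input over the series; the target for each
    window is the first hour immediately after it."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

hourly = np.arange(100.0)            # stand-in for an hourly pollutant series
X, y = make_windows(hourly, window=24)
print(X.shape, y.shape)              # (76, 24) (76,)
```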
Although traditional time series forecasting has utilized various approaches ranging from Autoregressive Integrated Moving Average (ARIMA) models to Support Vector and Random Forest Regressors, such models fail to take long-term dependencies into account and their performance ultimately suffers due to the volatile nature of air quality data.
In order to address this issue, we turn to Recurrent Neural Networks (RNNs). This is a class of neural networks in which nodes are connected via a temporal directed graph. Such networks make use of loops and internal states to process sequences of variable length, allowing for information persistence and the modelling of temporally dynamic behaviours, such as those we observe in air pollution datasets. At time step t, the output ht of component A depends on the input xt, as well as on the computation at time step t−1. Such an internal loop allows information to pass from one processing step to the next, taking previously learned information into account when identifying new patterns. Unfortunately, vanilla RNNs cannot model long-term dependencies in data due to vanishing and exploding gradients during backpropagation. This, fortunately, is a problem that is not shared by a very special kind of RNN called a Long Short-Term Memory network (LSTM).
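The recurrence described above (ht computed from xt and the previous step) fits in a few lines; the layer sizes and random weights below are toy values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5                      # toy sizes, not the project's
Wx = rng.normal(size=(n_hidden, n_in)) * 0.1
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    # h_t depends on the current input AND the previous hidden state:
    # this is the internal loop that lets information persist.
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(24, n_in)):    # a 24-step input sequence
    h = rnn_step(x_t, h)
print(h.shape)
```

Repeatedly multiplying by Wh inside this loop is exactly what makes gradients vanish or explode over long sequences, motivating the LSTM below.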
The arrow/vector at the top of the LSTM diagram represents the cell state and can be thought of as the main highway of information. The model can modify the cell state by adding or removing information as it trains. This process is regulated by gates, which are composed of a sigmoid activation and an elementwise vector product. In our case, these gates will allow our model to keep track of the pieces of information that have a true impact on the levels of particulate matter in the atmosphere while allowing it to ignore more inconsequential variables.
There are three gates which control the flow of information in the LSTM. The first is the forget gate layer, which regulates what information needs to be thrown away from the cell state. The next is called the input gate layer, which – as its name would suggest – is responsible for deciding what new information should be included in the cell state. After the model has decided what new information to include, a hyperbolic tangent layer is responsible for creating a vector of candidate values that should be added to the cell state.
Now, the old state Ct−1 will be updated to a new state Ct by multiplying the old state with the information to be forgotten (weighted by sigmoid activation values) and adding the new candidate values (scaled by the intensity of the update). Of course, after this update is complete, we need to output a final value. This output will be a filtered version of our current cell state. A sigmoid activation decides which part of the cell state to push forward, and thereafter, a hyperbolic tangent activation forces the cell state values between -1 and 1. This output is known as the hidden state ht and is fed back into the network at the next time step as the previous hidden state ht−1, allowing the LSTM to extract new information with the help of this clever recall mechanism.
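The standard gate-by-gate update described above can be written out as a single NumPy step (weights are random toy values, and we pack the four gates into one matrix for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev, x] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)             # forget gate: what to drop from the cell state
    i = sigmoid(i)             # input gate: how strongly to write new info
    g = np.tanh(g)             # candidate values to add
    o = sigmoid(o)             # output gate: what part of the state to expose
    c = f * c_prev + i * g     # updated cell state C_t
    h = o * np.tanh(c)         # hidden state h_t, fed back at the next step
    return h, c

n_in, n_hid = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * n_hid, n_hid + n_in)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)
```

Note how the cell state c only ever passes through elementwise products and sums, which is why it acts as the "main highway" that gradients can travel along.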
Based on a paper by Wang et al. from May 2022, we have modified the LSTM architecture into an ILSTM model, which consists solely of an input gate and a forget gate. The mainline forgetting mechanism remains the same; however, the new architecture incorporates the prior cell state Ct−1 into the input gate computation. In addition, the ILSTM introduces a Conversion Information Module (CIM) to prevent gradient saturation in the sigmoid during training: CIM = tanh(It). The information that is ultimately kept in the cell state is defined by the sum of the cell state (after the forget gate computation) and the CIM. These modifications reduce the weight matrices from 8 to 4 and the bias parameters from 4 to 2.
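A sketch of one ILSTM step, written directly from the description above; this is our own reading, and the published equations in Wang et al. may differ in detail:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ilstm_step(x, h_prev, c_prev, Wf, Wi, bf, bi):
    """One ILSTM step as described: two gates only, the prior cell state
    feeds the input gate, and a tanh conversion (CIM) replaces the
    output-gate machinery."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ hx + bf)                             # forget gate, unchanged
    i = sigmoid(Wi @ np.concatenate([hx, c_prev]) + bi)   # input gate sees C_{t-1}
    cim = np.tanh(i)                                      # CIM = tanh(I_t)
    c = f * c_prev + cim                                  # kept information
    return c, c   # hidden state taken as the cell state in this sketch

n_in, n_hid = 3, 4
rng = np.random.default_rng(2)
Wf = rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
Wi = rng.normal(size=(n_hid, 2 * n_hid + n_in)) * 0.1
h = c = np.zeros(n_hid)
h, c = ilstm_step(rng.normal(size=n_in), h, c, Wf, Wi,
                  np.zeros(n_hid), np.zeros(n_hid))
print(h.shape)
```

With only two gated transformations, the step indeed needs half the weight matrices and biases of the standard cell above.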
Recurrent Neural Networks are not the only networks that utilize many identical neurons to model complex relationships (in this case, temporal relationships) while keeping the number of parameters in the model minimal. This weight-linking principle is also used in Convolutional Neural Networks (CNNs) to achieve a similar result.
The power of CNNs is based on the notion of a mathematical convolution, which is a way of extracting features from a signal. Formally, a convolution is an operation on two functions that produces a third function, expressing how the shape of one is modified by the other.
The convolution can be thought of as the area under the curve f(𝜏) at each step t, weighted by the function g(t−𝜏). This formulation tells us that as the value of t changes, the weighting changes and is able to emphasize different “parts” of the input function f(𝜏). In machine learning terminology, the function g(t−𝜏) is called the kernel. The idea is to replace the default matrix multiplication of neural networks with the convolution operator defined above.
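A discrete toy example makes this concrete. The hand-picked smoothing kernel below is for illustration; in a CNN the kernel values are learned:

```python
import numpy as np

# Discrete analogue of (f * g)(t) = sum over tau of f(tau) * g(t - tau):
# the kernel g slides across the signal f, re-weighting it at each step t.
f = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # toy input signal
g = np.array([0.25, 0.5, 0.25])                     # smoothing kernel
print(np.convolve(f, g, mode="valid"))              # [1.  2.  2.5 2.  1. ]
```

Each output value is a local weighted average of f, i.e. the kernel has extracted a smoothed feature of the signal.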
Similar to the way in which a neuron in the visual cortex responds to a specific stimulus, a convolutional layer convolves the input and passes this output onto the next layer of the network. After passing through a convolutional layer, the input turns into an abstraction called a feature map. In our case, a one-dimensional convolution over time series data will produce a one-dimensional feature map, having only a single channel. A beneficial property of convolutional layers is that they are composable, meaning that each subsequent layer is able to extract more abstract features from the given data.
We interlace convolutional layers with what are known as pooling layers, which essentially aggregate the output of convolutional layers, telling us whether a particular feature was detected in that layer or not. Because these pooling layers create aggregations (commonly, this is the maximum of each local neuron cluster in the extracted feature map), they allow later convolutional layers to have a greater receptive field over the original input data. Moreover, pooling layers provide the added benefit of small input transformation invariance.
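The conv-and-pool stacking can be sketched in plain NumPy; the kernels and the 24-point input below are arbitrary toy choices, not our trained model:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Valid 1-D convolution: produces a single-channel feature map."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel[::-1])
                     for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    """Keep the maximum of each local cluster; halving the resolution
    doubles later layers' receptive field over the original input."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.arange(24.0)                                       # a 24-hour input window
fm1 = max_pool(conv1d_valid(x, np.array([1.0, -1.0])))    # difference-like kernel
fm2 = max_pool(conv1d_valid(fm1, np.array([0.5, 0.5])))   # smoother on top
print(len(x), len(fm1), len(fm2))                         # 24 11 5
```

Each stage shrinks the map, so a single value in fm2 summarizes a much wider slice of the original 24-hour window than any value in fm1.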
Therefore, CNNs are very good at exploiting locality in the data. This robustness has allowed CNNs to be very successful in the realm of computer vision and time series data analysis. As we are interested in the levels of particulate matter within a particular time window, our task obviously falls into the second category. It is important to note that – unlike our recurrent layer – the convolutional layer is more concerned with modelling local spatial correlation than long-term dependencies.
Below are two sets of predictions for a single week of particulate matter values, presented in hourly granularity. These preliminary results seem to indicate that combining these two deep learning models can be an effective way of capturing both long- and short-term relationships in air quality data, with minimal architectural complexity. Moreover, the adaptation of the LSTM recurrent layer into an ILSTM module seems to give our model a bit more of the flexibility required to predict some of the more volatile spikes in the dataset.
In the future, we aim to extend the predictive window of the model and experiment with attention mechanisms, which effectively accentuate certain parts of the input data while diminishing others, allowing the model to focus on smaller, but far more meaningful patterns. This, and further results, will be discussed in a subsequent article.
As a part of the product roadmap, and in line with our corporate social responsibility, we also aim to integrate our model with an existing air quality/weather forecasting platform, making our predictions available to the public via an open API.
Wang, J., et al. (2022). An air quality index prediction model based on CNN-ILSTM. Scientific Reports.
Olah, C. (2014, July 8). Conv Nets: A Modular Perspective. Retrieved from Colah’s Blog