Analysis of Delhi Pollution

Posted on Wed 08 June 2016 in data

The datasets for pollution data in many countries are readily available and there has been a bunch of research already done on those. I wanted to see if I could analyse the pollution patterns of Indian cities. Unfortunately, I could not find an openly accessible historical pollution dataset for Indian cities. The folks over at aqicn.org apparently provide access to institutions but not to individuals. In any case, I was able to locate a fantastic initiative by the Delhi Pollution Control Committee. They provide raw pollution data from six sensor clusters inside the city. While the availability could be better, and all sensor clusters do not cover all the metrics, this kind of data is incredibly useful. Kudos to them for having made this available! One problem is that they do not provide historical data, so I had to aggregate the realtime data over time. What follows is some analysis of that data. Hopefully, as the dataset grows, we'd be able to derive more insights about the pollution situation in delhi.

EDIT: TERI has done some interesting analysis on the Odd-Even scheme based on data from DPCC and CPCB. This study is available here.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import itertools

import re

%matplotlib inline

Munging

The data I am dumping out into the csv file below is pretty raw. It look like:

In [2]:
rawdata = pd.read_csv('./netfile.csv', names=['location', 'metric', 'ts', 'reading', 'guidance'])
rawdata.sort_values(['ts', 'location'])
rawdata.head()
Out[2]:
location metric ts reading guidance
0 Punjabi Bagh Ammonia 1471960200 19.1 µg/m3 400 µg/m3
1 Punjabi Bagh Benzene 1471960200 0.2 µg/m3 05 µg/m3 *
2 Punjabi Bagh Carbon Monoxide 1471960200 1.2 mg/m3 04 mg/m3
3 Punjabi Bagh Nitrogen Dioxide 1471960200 49.6 µg/m3 80 µg/m3
4 Punjabi Bagh Nitrogen Oxide 1471960200 4.3 µg/m3 -
In [3]:
rawdata.location.value_counts()
Out[3]:
IGI Airport     15860
Punjabi Bagh    15067
RK Puram        15067
Anand Vihar     15067
Civil Lines      8723
Mandir Mark       532
Name: location, dtype: int64
In [4]:
rawdata.metric.value_counts()
Out[4]:
Ammonia                                               3993
p-Xylene                                              3993
Ozone                                                 3993
Sulphur Dioxide                                       3993
Carbon Monoxide                                       3993
Benzene                                               3993
Nitrogen Oxide                                        3993
Toluene                                               3993
Nitrogen Dioxide                                      3993
Wind Direction                                        3200
Solar Radiation                                       2407
Relative Humidity                                     2407
Vertical Wind Speed                                   2407
Oxides of Nitrogen                                    2407
Particulate Matter < 10 µg                            2407
Ambient Temperature                                   2407
Barometric Pressure                                   2407
Horizontal Wind Speed                                 2407
Particulate Matter < 2.5 µg                           2407
Formaldehyde                                          1586
Mercury                                               1586
Ambient Air Pressure                                   793
Wind Speed                                             793
Mass Concentration PM 2.5.(Previous Day's Average)     793
Ambient Air Temperature                                793
Mass Concentration PM 10 (Previous Day's Average)      793
Particulates PM 2.5 (Real Time)                        793
Particulates PM 10 (Real Time)                         793
Ambient Humidity                                       793
Name: metric, dtype: int64

Some of these metrics sound inferable: for example Nitrogen Dioxide and Nitrogen Oxide should give a good estimate for Oxides of Nitrogen where it does not exist independently. However, we'll look at that later. For now, let's munge this into a more useful dataframe

In [5]:
rawdata['ts'] = pd.to_datetime(rawdata.ts, unit='s')

def mungeReading(x):
    return "".join([t[0] for t in re.findall("[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?", x)][:1]) if not (x.strip() == '-') else "coerce"

rawdata['reading'] = pd.to_numeric(rawdata.reading.apply(mungeReading), errors='coerce')
rawdata['guidance'] = pd.to_numeric(rawdata.guidance.apply(mungeReading), errors='coerce')

rawdata.head()
Out[5]:
location metric ts reading guidance
0 Punjabi Bagh Ammonia 2016-08-23 13:50:00 19.1 400.0
1 Punjabi Bagh Benzene 2016-08-23 13:50:00 0.2 5.0
2 Punjabi Bagh Carbon Monoxide 2016-08-23 13:50:00 1.2 4.0
3 Punjabi Bagh Nitrogen Dioxide 2016-08-23 13:50:00 49.6 80.0
4 Punjabi Bagh Nitrogen Oxide 2016-08-23 13:50:00 4.3 NaN
In [6]:
rawdata.reading.isnull().sum(), len(rawdata.reading), rawdata.guidance.isnull().sum(), len(rawdata.guidance)
Out[6]:
(4196, 70316, 39580, 70316)

So that looks reasonable and only a few readings are NaN. I expect that the guidance is a simple function of the metric and should not change that often. Let us check:

In [7]:
rawdata[['metric', 'guidance']].groupby('metric').guidance.value_counts()
Out[7]:
metric                                             guidance
Ammonia                                            400.0       3460
                                                   3.0          533
Benzene                                            5.0         3982
                                                   3.0           11
Carbon Monoxide                                    4.0         1764
                                                   3.0          643
Mass Concentration PM 10 (Previous Day's Average)  100.0        793
Nitrogen Dioxide                                   80.0        3453
                                                   3.0          540
Nitrogen Oxide                                     3.0          541
Oxides of Nitrogen                                 3.0          540
Ozone                                              180.0       3932
                                                   3.0           61
Particulate Matter < 10 µg                         100.0       2292
                                                   3.0          111
                                                   0.0            4
Particulate Matter < 2.5 µg                        60.0        2155
                                                   3.0          244
                                                   0.0            8
Solar Radiation                                    2.0           12
Sulphur Dioxide                                    80.0        3246
                                                   3.0          747
Toluene                                            3.0           11
Vertical Wind Speed                                0.0         1442
                                                   0.1          135
                                                   0.2           48
                                                   0.3           16
                                                   0.4            1
p-Xylene                                           3.0           11
Name: guidance, dtype: int64

Woah, loooks like there are multiple guidances per metric. For 9 records, the guidance for Ammonia, which ought to be 400 was listed as 3.0. Let us check those records..

In [8]:
rawdata[(rawdata['metric'] == 'Ammonia') & (rawdata['guidance'] == 3.0)]
Out[8]:
location metric ts reading guidance
6814 RK Puram Ammonia 2016-08-24 09:15:00 3.0 3.0
15174 RK Puram Ammonia 2016-08-25 09:05:00 3.0 3.0
15262 RK Puram Ammonia 2016-08-25 09:25:00 3.0 3.0
15350 RK Puram Ammonia 2016-08-25 09:30:00 3.0 3.0
15507 Anand Vihar Ammonia 2016-08-25 10:15:00 3.0 3.0
15595 Anand Vihar Ammonia 2016-08-25 10:15:00 3.0 3.0
15683 Anand Vihar Ammonia 2016-08-25 10:15:00 3.0 3.0
23622 RK Puram Ammonia 2016-08-26 09:10:00 3.0 3.0
23710 RK Puram Ammonia 2016-08-26 09:30:00 3.0 3.0
25256 Punjabi Bagh Ammonia 2016-08-26 13:45:00 3.0 3.0
25344 Punjabi Bagh Ammonia 2016-08-26 13:45:00 3.0 3.0
25432 Punjabi Bagh Ammonia 2016-08-26 13:45:00 3.0 3.0
32315 Anand Vihar Ammonia 2016-08-27 10:10:00 3.0 3.0
32334 RK Puram Ammonia 2016-08-27 10:10:00 3.0 3.0
32384 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
32403 Anand Vihar Ammonia 2016-08-27 10:10:00 3.0 3.0
32422 RK Puram Ammonia 2016-08-27 10:10:00 3.0 3.0
32472 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
32491 Anand Vihar Ammonia 2016-08-27 10:10:00 3.0 3.0
32560 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
32579 Anand Vihar Ammonia 2016-08-27 10:40:00 3.0 3.0
32648 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
32736 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
32824 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
32912 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
33000 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
33019 Anand Vihar Ammonia 2016-08-27 12:10:00 3.0 3.0
33088 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
33107 Anand Vihar Ammonia 2016-08-27 12:10:00 3.0 3.0
33176 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
... ... ... ... ... ...
67144 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67232 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67320 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67427 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67534 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67641 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67748 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67855 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
67962 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68069 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68176 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68283 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68390 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68497 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68604 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68711 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68818 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
68925 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69032 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69139 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69246 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69353 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69460 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69567 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69674 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69781 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69888 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
69995 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
70102 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0
70209 Punjabi Bagh Ammonia 2016-08-27 10:05:00 3.0 3.0

533 rows × 5 columns

In [9]:
rawdata[(rawdata['metric'] == 'Particulate Matter < 10 µg') & (rawdata['guidance'] == 3.0)]['reading'].value_counts()
Out[9]:
3.0    111
Name: reading, dtype: int64
In [10]:
rawdata[(rawdata['metric'] == 'Sulphur Dioxide') & (rawdata['guidance'] == 3.0)]['reading'].value_counts()
Out[10]:
3.0    747
Name: reading, dtype: int64

Looks like these are buggy readings, given that the guidance and reading both settle at 3.0, which clearly is an incorrect guidance for these particular metrics. I think that it is a good guess that all records with both guidance and readings at precisely 3.0 are spurious and should be dropped.

In [11]:
print(rawdata.shape)
rawdata = rawdata[(rawdata['guidance'] != 3.0) & (rawdata['reading'] != 3.0)]
rawdata.shape
(70316, 5)
Out[11]:
(65964, 5)
In [12]:
rawdata[(rawdata['metric'] == 'Particulate Matter < 2.5 µg') & (rawdata['reading'] == 0.0)]
Out[12]:
location metric ts reading guidance
4765 Punjabi Bagh Particulate Matter < 2.5 µg 2016-08-24 03:15:00 0.0 0.0
4853 Punjabi Bagh Particulate Matter < 2.5 µg 2016-08-24 03:25:00 0.0 0.0
4941 Punjabi Bagh Particulate Matter < 2.5 µg 2016-08-24 03:45:00 0.0 0.0
5029 Punjabi Bagh Particulate Matter < 2.5 µg 2016-08-24 03:55:00 0.0 0.0
13779 RK Puram Particulate Matter < 2.5 µg 2016-08-25 05:00:00 0.0 0.0
13867 RK Puram Particulate Matter < 2.5 µg 2016-08-25 05:10:00 0.0 0.0
13955 RK Puram Particulate Matter < 2.5 µg 2016-08-25 05:30:00 0.0 0.0
14043 RK Puram Particulate Matter < 2.5 µg 2016-08-25 05:50:00 0.0 0.0

These zero guidance/readings also look incorrect. Lets drop these as well.

In [13]:
print(rawdata.shape)
rawdata = rawdata[(rawdata['guidance'] != 0.0) & (rawdata['reading'] != 0.0)]
rawdata.shape
(65964, 5)
Out[13]:
(64045, 5)
In [14]:
rawdata[['metric', 'guidance']].groupby('metric').guidance.value_counts()
Out[14]:
metric                                             guidance
Ammonia                                            400.0       3460
Benzene                                            5.0         3948
Carbon Monoxide                                    4.0         1732
Mass Concentration PM 10 (Previous Day's Average)  100.0        793
Nitrogen Dioxide                                   80.0        3453
Ozone                                              180.0       3929
Particulate Matter < 10 µg                         100.0       2292
Particulate Matter < 2.5 µg                        60.0        2151
Solar Radiation                                    2.0           12
Sulphur Dioxide                                    80.0        3240
Vertical Wind Speed                                0.1          135
                                                   0.2           48
                                                   0.3           16
                                                   0.4            1
Name: guidance, dtype: int64

I think the data is cleaner now, and as we suspected, there is a single guidance per metric save for Vertical Wind Speed, which feels fairly reasonable. For other metrics, the guidance is NaN

In [15]:
rawdata = rawdata[['metric', 'location', 'reading', 'ts']]

Some simple plots

Lets take our first look at the data for a few metrics.

In [16]:
fig = plt.figure(figsize=(10,8))

ax1 = fig.add_subplot(311)
ax1.plot(rawdata[rawdata['metric'] == 'Particulate Matter < 2.5 µg']['reading'])
ax2 = fig.add_subplot(312)
ax2.plot(rawdata[rawdata['metric'] == 'Particulate Matter < 10 µg']['reading'])
ax2 = fig.add_subplot(313)
ax2.plot(rawdata[rawdata['metric'] == 'Ozone']['reading'])
Out[16]:
[]

I'd not have expected the peaks to add up to a constant value. There are large flat plateaus seem a bit strange to me, particularly given that this is aggregated data. Of course, the data for different stations is not synchronized from a timestamp perspective. Hence, it could be that the same readings occur at different stations but are staggered. in any case, this needs to be dealt with.

In [17]:
fig = plt.figure(figsize=(10,8))

ax1 = fig.add_subplot(421)
ax1.plot(rawdata[(rawdata['metric'] == 'Particulate Matter < 2.5 µg') & (rawdata['location'] == 'RK Puram')]['reading'])
ax2 = fig.add_subplot(423)
ax2.plot(rawdata[(rawdata['metric'] == 'Particulate Matter < 10 µg') & (rawdata['location'] == 'RK Puram')]['reading'])
ax2 = fig.add_subplot(425)
ax2.plot(rawdata[(rawdata['metric'] == 'Ozone') & (rawdata['location'] == 'RK Puram')]['reading'])
ax2 = fig.add_subplot(427)
ax2.plot(rawdata[(rawdata['metric'] == 'Nitrogen Dioxide') & (rawdata['location'] == 'RK Puram')]['reading'])

ax1 = fig.add_subplot(422)
ax1.plot(rawdata[(rawdata['metric'] == 'Particulate Matter < 2.5 µg') & (rawdata['location'] == 'Punjabi Bagh')]['reading'])
ax2 = fig.add_subplot(424)
ax2.plot(rawdata[(rawdata['metric'] == 'Particulate Matter < 10 µg') & (rawdata['location'] == 'Punjabi Bagh')]['reading'])
ax2 = fig.add_subplot(426)
ax2.plot(rawdata[(rawdata['metric'] == 'Ozone') & (rawdata['location'] == 'Punjabi Bagh')]['reading'])
ax2 = fig.add_subplot(428)
ax2.plot(rawdata[(rawdata['metric'] == 'Nitrogen Dioxide') & (rawdata['location'] == 'Punjabi Bagh')]['reading'])
Out[17]:
[]

There seems to be some periodicity in the PM 10mcg plot for RK Puram, but that does not hold for Punjabi Bagh. I am surprised by the flat plateaus that are seen in some areas. Could be due to lack of readings in the middle. It is however strange that the Nitrogen Dioxide metric for Punjabi Bagh seems to be constant even after a significant time, particularly given how noisy it has been otherwise. Also, very interestingly, for Punjabi Bagh, the plateaus in the data at the very end stack up at the same time for all the metrics. Looks like some issue with the website as stale data might have been returned for a bunch of time.

In [18]:
def getReadings(metrics='all', location='all'):
    allReadings = []
    if location == 'all':
        rd = rawdata
    else:
        rd = rawdata[rawdata['location'] == location]
    if metrics == 'all':
        if location == 'all':
            metrics = list(set(list(rawdata['metric'])))
        else:
            metrics = list(set(list(rawdata[rawdata['location'] == location]['metric'])))
    for m in metrics:
        allReadings.append(np.array(rd[rd['metric'] == m]['reading']))
    #return pd.DataFrame(list(zip(allReadings))).transpose()
    t = pd.DataFrame(allReadings).transpose().dropna()
    t.columns = metrics
    return t
In [19]:
sn.pairplot(getReadings(location='Punjabi Bagh'), size=2, kind="reg", diag_kind="kde")
/home/rohit/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[19]: