The dataset contains 6 files:
File 1 contains day-wise Air Quality Index (AQI) information for Ozone at the state, county and city level for the year 2021. It has 29 variables, including city name, state name, county name, date, AQI, name of gas, etc., and 256,776 observations.
File 2 contains day-wise AQI information for Sulphur Dioxide (SO2) at the state, county and city level for the year 2021. It has 29 variables, including city name, state name, county name, date, AQI, name of gas, etc., and 183,618 observations.
File 3 contains day-wise AQI information for Carbon Monoxide (CO) at the state, county and city level for the year 2021. It has 29 variables, including city name, state name, county name, date, AQI, name of gas, etc., and 106,552 observations.
File 4 contains day-wise AQI information for Nitrogen Dioxide (NO2) at the state, county and city level for the year 2021. It has 29 variables, including city name, state name, county name, date, AQI, name of gas, etc., and 94,452 observations.
File 5 contains day-wise temperature information at the state, county and city level for the year 2021. It has 10 variables, including city name, state name, county name, date, temperature, name of gas, etc., and 168,444 observations.
File 6 contains day-wise overall AQI information at the state and county level for the year 2021. It has 10 variables, including state name, county name, date, AQI, etc., and 218,196 observations.
Although the effect of air pollutants on global temperature is well known, I wish to study how the overall AQI and the concentrations of individual gases relate to temperature at the county level.
How are the concentrations of the individual gases, namely Ozone, Sulphur Dioxide (\(SO_2\)), Carbon Monoxide (CO) and Nitrogen Dioxide (\(NO_2\)), and the overall AQI related to the temperature at the county level?
Note: The relationship between each gas and temperature is studied without taking into account the combined effect of several gases. For example, SO2 and CO together may relate to temperature differently than either gas taken individually. For the purpose of this analysis, that combined effect is not taken into account.
This code downloads the required zip files directly from the EPA website, unzips them, and reads the CSV files they contain.
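# Packages used throughout this report
library(tidyverse)  # readr, dplyr, tidyr, ggplot2
library(lubridate)  # ymd()
library(ggthemes)   # theme_tufte()
# Importing the temperature file
temp <- tempfile()
download.file("https://aqs.epa.gov/aqsweb/airdata/daily_TEMP_2021.zip",temp)
temperature_file <- read_csv(unz(temp, "daily_TEMP_2021.csv"))
unlink(temp)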
# Importing the ozone gas file
temp <- tempfile()
download.file("https://aqs.epa.gov/aqsweb/airdata/daily_44201_2021.zip",temp)
ozone_file <- read_csv(unz(temp,"daily_44201_2021.csv"))
unlink(temp)
# Importing the sulphur dioxide gas file
temp <- tempfile()
download.file("https://aqs.epa.gov/aqsweb/airdata/daily_42401_2021.zip",temp)
sulphurdioxide_file <- read_csv(unz(temp,"daily_42401_2021.csv"))
unlink(temp)
# Importing the carbon monoxide gas file
temp <- tempfile()
download.file("https://aqs.epa.gov/aqsweb/airdata/daily_42101_2021.zip",temp)
carbonmonoxide_file <- read_csv(unz(temp,"daily_42101_2021.csv"))
unlink(temp)
# Importing the nitrogen dioxide gas file
temp <- tempfile()
download.file("https://aqs.epa.gov/aqsweb/airdata/daily_42602_2021.zip",temp)
nitrogendioxide_file <- read_csv(unz(temp,"daily_42602_2021.csv"))
unlink(temp)
# Importing the daily AQI file
temp <- tempfile()
download.file("https://aqs.epa.gov/aqsweb/airdata/daily_aqi_by_county_2021.zip",temp)
AQI_County <- read_csv(unz(temp,"daily_aqi_by_county_2021.csv"))
unlink(temp)
# Removing the temporary file as it is no longer needed
rm(temp)
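The six download blocks above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, not the code used for this report: the function names `epa_daily_name`, `epa_daily_url` and `read_epa_daily` are hypothetical, the file-name pattern is inferred only from the URLs above, and base R's `read.csv` is used instead of `read_csv` to keep the sketch dependency-free.

```r
# Hypothetical helpers; names and structure are illustrative assumptions.
epa_daily_name <- function(code, year) {
  # e.g. code "44201", year 2021 -> "daily_44201_2021"
  paste0("daily_", code, "_", year)
}

epa_daily_url <- function(code, year) {
  paste0("https://aqs.epa.gov/aqsweb/airdata/", epa_daily_name(code, year), ".zip")
}

# Download one archive, read the CSV inside it, and delete the archive on exit.
read_epa_daily <- function(code, year) {
  tmp <- tempfile(fileext = ".zip")
  on.exit(unlink(tmp))
  download.file(epa_daily_url(code, year), tmp, mode = "wb")
  csv_name <- paste0(epa_daily_name(code, year), ".csv")
  read.csv(unz(tmp, csv_name), check.names = FALSE)  # keep names like "State Code"
}

# Codes copied from the URLs used in this report.
codes <- c(temperature = "TEMP", ozone = "44201", so2 = "42401",
           co = "42101", no2 = "42602", aqi = "aqi_by_county")
urls <- vapply(codes, epa_daily_url, character(1), year = 2021)
print(urls)

# Needs network access, so it is left commented out here:
# raw_files <- lapply(codes, read_epa_daily, year = 2021)
```

Using `on.exit(unlink(tmp))` removes the temporary archive even if the download or the read fails part-way.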
This part of the code filters the imported files, merges them into a single data frame, and then generates two output files. The first contains temperature and AQI by county and date, and the second contains gas concentrations and temperature by county and date. These files are used for further analysis.
# Filtered Temperature File
temperature_file_1 <- temperature_file %>% select(`State Code`,`County Code`,`Site Num`,`Date Local`,`Arithmetic Mean`,`State Name`,`County Name`,`City Name`,`CBSA Name`)
temperature_file_1$`State Name` <- as.factor(temperature_file_1$`State Name`)
temperature_file_1$`County Name` <- as.factor(temperature_file_1$`County Name`)
temperature_file_1$`City Name` <- as.factor(temperature_file_1$`City Name`)
# Filtered Ozone File
ozone_file_1 <- ozone_file %>% select(`State Code`,`County Code`,`Site Num`,`Date Local`,`Arithmetic Mean`,AQI,`State Name`,`County Name`,`City Name`,`CBSA Name`)
ozone_file_1$`State Name` <- as.factor(ozone_file_1$`State Name`)
ozone_file_1$`County Name` <- as.factor(ozone_file_1$`County Name`)
ozone_file_1$`City Name` <- as.factor(ozone_file_1$`City Name`)
# Filtered Sulphur Dioxide File
sulphurdioxide_1 <- sulphurdioxide_file %>% select(`State Code`,`County Code`,`Site Num`,`Date Local`,`Arithmetic Mean`,AQI,`State Name`,`County Name`,`City Name`,`CBSA Name`)
sulphurdioxide_1$`State Name` <- as.factor(sulphurdioxide_1$`State Name`)
sulphurdioxide_1$`County Name` <- as.factor(sulphurdioxide_1$`County Name`)
sulphurdioxide_1$`City Name` <- as.factor(sulphurdioxide_1$`City Name`)
# Filtered Nitrogen Dioxide File
nitrogendioxide_1 <- nitrogendioxide_file %>% select(`State Code`,`County Code`,`Site Num`,`Date Local`,`Arithmetic Mean`,AQI,`State Name`,`County Name`,`City Name`,`CBSA Name`)
nitrogendioxide_1$`State Name` <- as.factor(nitrogendioxide_1$`State Name`)
nitrogendioxide_1$`County Name` <- as.factor(nitrogendioxide_1$`County Name`)
nitrogendioxide_1$`City Name` <- as.factor(nitrogendioxide_1$`City Name`)
# Filtered Carbon Monoxide File
carbonmonoxide_1 <- carbonmonoxide_file %>% select(`State Code`,`County Code`,`Site Num`,`Date Local`,`Arithmetic Mean`,AQI,`State Name`,`County Name`,`City Name`,`CBSA Name`)
carbonmonoxide_1$`State Name` <- as.factor(carbonmonoxide_1$`State Name`)
carbonmonoxide_1$`County Name` <- as.factor(carbonmonoxide_1$`County Name`)
carbonmonoxide_1$`City Name` <- as.factor(carbonmonoxide_1$`City Name`)
# Filtered AQI File (the source column is spelled `county Name`, with a lower-case c)
AQI_file_1 <- AQI_County %>% select(`State Code`,`County Code`,`Date`,`AQI`,`State Name`,`county Name`)
AQI_file_1$`State Name` <- as.factor(AQI_file_1$`State Name`)
AQI_file_1$`County Name` <- as.factor(AQI_file_1$`county Name`)
# Merging the data frames.
Mergedfile <- inner_join(temperature_file_1,ozone_file_1, by = c("State Code","County Code","Date Local","State Name", "County Name","City Name","CBSA Name","Site Num"),na_matches=c("na"))
Mergedfile <- Mergedfile %>% select(`State Code`,`County Code`,`Date Local`,`State Name`, `County Name`,`City Name`,`CBSA Name`,`Site Num`,`Arithmetic Mean.x`, AQI,`Arithmetic Mean.y`) %>% rename(Temperature=`Arithmetic Mean.x`) %>% rename(Ozone_AQI=AQI) %>% rename(Ozone_Mean=`Arithmetic Mean.y`)
Mergedfile1 <- merge(Mergedfile,carbonmonoxide_1, by.x = c("State Code","County Code","Date Local","State Name", "County Name","City Name","CBSA Name","Site Num") , by.y = c("State Code","County Code","Date Local","State Name", "County Name","City Name","CBSA Name","Site Num"), all.x = FALSE, all.y = FALSE )
Mergedfile1 <- Mergedfile1 %>% select(`State Code`,`County Code`,`Date Local`,`State Name`, `County Name`,`City Name`,`CBSA Name`,`Site Num`, Temperature, Ozone_AQI,Ozone_Mean,AQI,`Arithmetic Mean`) %>% rename(carbonmonoxide_AQI=AQI) %>% rename(carbonmonoxide_Mean=`Arithmetic Mean`)
Mergedfile2 <- merge(Mergedfile1,nitrogendioxide_1, by.x = c("State Code","County Code","Date Local","State Name", "County Name","City Name","CBSA Name","Site Num") , by.y = c("State Code","County Code","Date Local","State Name", "County Name","City Name","CBSA Name","Site Num"), all.x = FALSE, all.y = FALSE )
Mergedfile2 <- Mergedfile2 %>% select(`State Code`,`County Code`,`Date Local`,`State Name`, `County Name`,`City Name`,`CBSA Name`,`Site Num`, Temperature, Ozone_AQI,Ozone_Mean,carbonmonoxide_AQI,carbonmonoxide_Mean,`Arithmetic Mean`,AQI) %>% rename(nitrogendioxide_AQI=AQI) %>% rename(nitrogendioxide_Mean=`Arithmetic Mean`)
Mergedfile3 <- merge(Mergedfile2,sulphurdioxide_1, by.x = c("State Code","County Code","Date Local","State Name", "County Name","City Name","CBSA Name","Site Num") , by.y = c("State Code","County Code","Date Local","State Name", "County Name","City Name","CBSA Name","Site Num"), all.x = FALSE, all.y = FALSE )
Mergedfile3 <- Mergedfile3 %>% select(`State Code`,`County Code`,`Date Local`,`State Name`, `County Name`,`City Name`,`CBSA Name`,`Site Num`, Temperature, Ozone_AQI,Ozone_Mean,carbonmonoxide_AQI,carbonmonoxide_Mean,nitrogendioxide_AQI,nitrogendioxide_Mean,`Arithmetic Mean`,AQI) %>% rename(sulphurdioxide_AQI=AQI) %>% rename(sulphurdioxide_Mean=`Arithmetic Mean`)
# Align the AQI file's column names with the merged file before joining
colnames(AQI_file_1)[3] <- "Date Local"
AQI_file_1 <- AQI_file_1 %>% select(-`county Name`)
Mergedfile4 <- inner_join(Mergedfile3,AQI_file_1, by = c("State Code","County Code","Date Local","State Name", "County Name"),na_matches=c("na"))
attach(Mergedfile4)
# Drop the per-gas AQI columns; only the overall county AQI is kept
Mergedfile4 <- Mergedfile4 %>% select(-Ozone_AQI,-carbonmonoxide_AQI,-nitrogendioxide_AQI,-sulphurdioxide_AQI)
# Final Merged File
Mergedfile5 <- distinct(Mergedfile4) %>% group_by(`State Name`, `County Name`,`Date Local`) %>% summarise(Temperature=mean(Temperature,na.rm=TRUE), Ozone_Mean=mean(Ozone_Mean,na.rm=TRUE), carbonmonoxide_Mean=mean(carbonmonoxide_Mean,na.rm=TRUE), nitrogendioxide_Mean=mean(nitrogendioxide_Mean, na.rm=TRUE), sulphurdioxide_Mean=mean(sulphurdioxide_Mean,na.rm=TRUE), AQI=mean(AQI,na.rm=TRUE))
# This copy is used only to describe the variables; it serves no other purpose.
Mergedfile6 <- distinct(Mergedfile4) %>% group_by(`State Name`, `County Name`,`Date Local`) %>% summarise(Temperature=mean(Temperature,na.rm=TRUE), Ozone_Mean=mean(Ozone_Mean,na.rm=TRUE), carbonmonoxide_Mean=mean(carbonmonoxide_Mean,na.rm=TRUE), nitrogendioxide_Mean=mean(nitrogendioxide_Mean, na.rm=TRUE), sulphurdioxide_Mean=mean(sulphurdioxide_Mean,na.rm=TRUE), AQI=mean(AQI,na.rm=TRUE))
# This part filters out the outliers observed in the variable description.
Mergedfile5 <- Mergedfile5 %>% filter(AQI<100)
Mergedfile5 <- Mergedfile5 %>% filter(Temperature>5)
attach(Mergedfile5)
# Final file separated into two files for analysis.
Mergedfile5_AQI <- Mergedfile5 %>% select(`State Name`,`County Name`,`Date Local`,Temperature,AQI)
Mergedfile5_Gases <- Mergedfile5 %>% select(`State Name`,`County Name`,`Date Local`,Temperature,Ozone_Mean,carbonmonoxide_Mean,nitrogendioxide_Mean,sulphurdioxide_Mean)
colnames(Mergedfile5_Gases) <- c("State Name","County Name","Date Local","Temperature","Ozone","Carbonmonoxide","Nitrogendioxide","Sulphurdioxide")
Mergedfile5_Gases <- Mergedfile5_Gases %>% pivot_longer(cols = c(Ozone,Carbonmonoxide,Nitrogendioxide,Sulphurdioxide), names_to="Gas",values_to = "Concentration")
Mergedfile5_Gases$Gas <- as.factor(Mergedfile5_Gases$Gas)
# Concentrations cannot be negative or zero in a valid reading, so drop those rows
Mergedfile5_Gases <- Mergedfile5_Gases %>% filter(Concentration > 0)
# Remove the non useful files
rm(carbonmonoxide_file,temperature_file,nitrogendioxide_file,sulphurdioxide_file,ozone_file, AQI_County, carbonmonoxide_1,temperature_file_1,nitrogendioxide_1,sulphurdioxide_1,ozone_file_1, AQI_file_1)
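The merge-then-average pattern used above can be illustrated on a toy example. This is a minimal sketch in base R with made-up site-level values and shortened column names (`County`, `Site`, `Date` are assumptions, not the EPA column names): several frames are inner-joined on shared keys with `Reduce()`, and the site-level rows are then averaged to one row per county and date.

```r
# Toy site-level data; all values are invented for illustration.
d1 <- as.Date("2021-01-01")
temperature <- data.frame(County = c("A", "A", "B"), Site = c(1, 2, 1),
                          Date = d1, Temperature = c(40, 44, 60))
ozone <- data.frame(County = c("A", "A", "B", "C"), Site = c(1, 2, 1, 1),
                    Date = d1, Ozone = c(0.02, 0.04, 0.03, 0.05))
co <- data.frame(County = c("A", "A", "C"), Site = c(1, 2, 1),
                 Date = d1, CO = c(0.1, 0.3, 0.2))

keys <- c("County", "Site", "Date")

# merge() defaults to an inner join, so only sites present in every frame survive.
merged <- Reduce(function(x, y) merge(x, y, by = keys),
                 list(temperature, ozone, co))

# Average the surviving site-level rows to one row per county and date.
county_daily <- aggregate(cbind(Temperature, Ozone, CO) ~ County + Date,
                          data = merged, FUN = mean)
print(county_daily)
```

Only county A has readings in all three toy frames, so it is the only county left after the joins. The same effect explains why the merged file in this report covers far fewer counties than any single source file.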
This part of the code gives a brief look at the final data files used for the analysis.
head(Mergedfile5_AQI,n=6L)
## # A tibble: 6 × 5
## # Groups: State Name, County Name [1]
## `State Name` `County Name` `Date Local` Temperature AQI
## <fct> <fct> <date> <dbl> <dbl>
## 1 Alabama Jefferson 2021-01-01 66.0 36
## 2 Alabama Jefferson 2021-01-02 49.2 47
## 3 Alabama Jefferson 2021-01-03 43.6 53
## 4 Alabama Jefferson 2021-01-04 47.1 53
## 5 Alabama Jefferson 2021-01-05 43.7 50
## 6 Alabama Jefferson 2021-01-06 43.2 54
head(Mergedfile5_Gases,n=8L)
## # A tibble: 8 × 6
## # Groups: State Name, County Name [1]
## `State Name` `County Name` `Date Local` Temperature Gas Concentration
## <fct> <fct> <date> <dbl> <fct> <dbl>
## 1 Alabama Jefferson 2021-01-01 66.0 Ozone 0.0254
## 2 Alabama Jefferson 2021-01-01 66.0 Carbonmonox… 0.134
## 3 Alabama Jefferson 2021-01-01 66.0 Nitrogendio… 4.61
## 4 Alabama Jefferson 2021-01-01 66.0 Sulphurdiox… 0.348
## 5 Alabama Jefferson 2021-01-02 49.2 Ozone 0.0232
## 6 Alabama Jefferson 2021-01-02 49.2 Carbonmonox… 0.169
## 7 Alabama Jefferson 2021-01-02 49.2 Nitrogendio… 4.87
## 8 Alabama Jefferson 2021-01-02 49.2 Sulphurdiox… 0.467
This variable contains the names of states in the United States for which the data have been collected. The information about this variable is:
Data Type: Factor
Range: The data set contains 37 states.
Graphical Summary: The graph shows the number of observations for each state in the file obtained after merging all the data sets. California has more observations than any other state.
This variable contains the names of counties in the United States for which the data have been collected. The information about this variable is:
Data Type: Factor
Range: It has 56 unique entries which are the names of the counties.
Graphical Summary: The graph shows the number of observations for each county in the file obtained after merging all the data sets. It also shows that not all counties have data available for all the dates.
This variable provides information about the day-wise AQI at the county level. The information about this variable is:
Data Type: Double (Numeric)
Range: The variable ranges from 7 to 315 for the given merged dataset.
Graphical Summary: The boxplot shows the outliers in the AQI column of the data, which need to be considered and handled during the analysis.
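As an illustration of how a boxplot flags outliers, the sketch below applies base R's `boxplot.stats()`, which uses the same 1.5 × IQR rule as the boxplot whiskers, to a small made-up vector. The values are invented for illustration and are not taken from the merged data set. Note that the analysis in this report removes outliers with fixed cut-offs instead.

```r
# Invented AQI values: mostly moderate, with two extreme readings.
aqi <- c(36, 47, 53, 53, 50, 54, 41, 45, 160, 315)

stats <- boxplot.stats(aqi)   # default coef = 1.5

stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points beyond the whiskers, i.e. the boxplot's outliers

# Keep only the points that the boxplot does not flag.
aqi_clean <- aqi[!aqi %in% stats$out]
print(aqi_clean)
```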
This variable contains the date of observation for the temperature and AQI. The information about this variable is:
Data Type: Date (treated as numeric in R)
Range: The variable ranges from 01 January, 2021 to 31 October, 2021.
Graphical Summary: The graph shows that not all the counties contain information about all the dates. The number of observations for the month of January is more than for the month of October. This information was also provided by the graphical summary of county name variable.
This variable provides information about the temperature for the respective counties. The information about this variable is:
Data Type: Double (Numeric)
Range: The variable ranges from -18.1250 degrees Fahrenheit to 104.9583 degrees Fahrenheit for the given merged dataset.
Graphical Summary: The boxplot shows the outliers in the Temperature column of the data, which need to be considered and handled during the analysis.
This variable provides information about the gases that are considered in the analysis. The information about this variable is:
Data Type: Factor
Range: The variable contains 4 gases, namely Ozone, Carbon Monoxide, Nitrogen Dioxide, and Sulphur Dioxide.
Graphical Summary: The graph below shows the average concentration of each gas as available from the data. The values for Ozone are numerically much lower than those for Nitrogen Dioxide.
This variable provides information about the concentration of each gas in a county on a given date. The analysis treats all concentrations as parts per billion (ppb). Note, however, that the EPA daily files record Ozone and Carbon Monoxide in parts per million and Nitrogen Dioxide and Sulphur Dioxide in parts per billion, so the values are not directly comparable across gases. The information about this variable is:
Data Type: Numeric
Range: The variable ranges from -1.942857 to 51.34348 for the given merged data set. A concentration cannot be less than 0, so the incorrect observations need to be removed. After their removal, the variable ranges from 0.000118 to 51.34348.
Graphical Summary: The graph above, same as the one used for variable 6, shows the average concentration of each gas as available from data.
To analyse how AQI and the gas concentrations relate to temperature at the county level, each variable is averaged per county over the period January to May. Data after May are not taken into consideration because few counties report them.
# AF1 stands for Analysis File 1
AF1 <- Mergedfile5_AQI %>% filter(`Date Local` < ymd(20210601)) %>% group_by(`State Name`,`County Name`) %>% summarise(Temperature=mean(Temperature,na.rm=TRUE),AQI=mean(AQI,na.rm=TRUE))
ggplot(data=AF1) +
geom_jitter(aes(x=AQI,y=Temperature), colour='red', size=0.7, height=2, width=2)+
theme(panel.background = element_rect(fill="white"),axis.title.x =element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)),axis.title.y=element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)))
The graph given above indicates the presence of a relationship between AQI and temperature. Lower values of AQI tend to correspond to lower temperatures, and higher values of AQI to higher temperatures.
To investigate this further, a linear model was fitted to the data. The summary shows a low value (0.159) of \(R^{2}\), which means that the linear relation between AQI and Temperature is not strong. The correlation coefficient is around 0.40, which is also low. The p-value indicates that AQI is a significant predictor of Temperature, but the residuals may still be large. Further investigation is done to verify these observations visually. The code used for this purpose is given here, but the outputs are not displayed for aesthetic reasons.
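# Fit a simple linear regression of Temperature on AQI
reg <- lm(Temperature~AQI, data=AF1)
summary(reg)
# Collect fitted values and residuals for the residual plot
Predicted_Values <- predict(reg)
Residuals <- resid(reg)
predict_residual <- data.frame(Predicted_Values,Residuals)
# Pearson correlation between Temperature and AQI
cor(AF1$Temperature,AF1$AQI)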
ggplot(data=predict_residual)+ geom_point(mapping = aes(x=Predicted_Values, y= Residuals),size=0.7) + geom_hline(yintercept = 0,color="red") + theme(panel.background = element_rect(fill="white"),axis.title.x =element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)),axis.title.y=element_text(margin = margin(t = 0, r = 15, b = 0, l = 0))) + xlab("Predicted Values of Temperature")
The plot of residuals against predicted values shows that the difference between the predicted and the observed values of Temperature is large (below -10 in certain cases). This suggests that the linear model is not the best model to use here, in accordance with the observations made above.
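The reported \(R^{2}\) of 0.159 and the correlation of about 0.40 are two views of the same quantity: in a simple linear regression with one predictor, \(R^{2}\) equals the square of the Pearson correlation. The sketch below checks this on simulated data; the variable names and the simulated values are assumptions for illustration only, not the data behind the reported figures.

```r
set.seed(1)

# Simulated data with a weak linear relation plus noise (illustrative only).
aqi <- runif(50, min = 20, max = 80)
temperature <- 40 + 0.3 * aqi + rnorm(50, sd = 8)

fit <- lm(temperature ~ aqi)
r <- cor(temperature, aqi)
r_squared <- summary(fit)$r.squared

# For one predictor, R^2 is the squared Pearson correlation.
print(all.equal(r^2, r_squared))

# Applied to the value reported above: sqrt(0.159) is approximately 0.399.
print(sqrt(0.159))
```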
ggplot(data=AF1) +
geom_line(aes(x=`County Name`,y=Temperature, colour='red'),group=1, size=0.5) +
geom_line(aes(x=`County Name`,y=AQI, colour='blue'),group=1, size=0.5) + theme_tufte() +
theme(axis.title.x =element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)),axis.title.y=element_text(margin = margin(t = 0, r = 15, b = 0, l = 0))) + labs( y = "Temperature/AQI", x = "County") +
theme(legend.position = c(0.9,0.90)) +
scale_color_identity(guide="legend", name="Variable", breaks=c("red","blue"), labels=c("Temperature","AQI")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))
The results found previously could also be visually demonstrated using the line plot given above. It shows no significant relationship between the AQI plot and Temperature plot, when plotted for each county.
# Use %in% for row-wise matching; the scalar operator || does not work on whole columns
AF2 <- Mergedfile5_AQI %>% filter(`Date Local` <= ymd(20210228)) %>% filter(`County Name` %in% c("Fresno", "Dallas", "Suffolk"))
ggplot(data=AF2) + geom_line(mapping = aes(x=`Date Local`, y=Temperature,color="red3")) + geom_line(mapping = aes(x=`Date Local`, y=AQI, color="turquoise3")) + facet_wrap(~`County Name`, nrow = 3) + theme_tufte() +
theme(legend.position = c(0.915,1.03),axis.title.x =element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)),axis.title.y=element_text(margin = margin(t = 0, r = 20, b = 0, l = 0)))+
scale_color_identity(guide="legend", name="", breaks=c("red3","turquoise3"), labels=c("Temperature","AQI")) + ylab("Temperature/AQI") + xlab("Date")
This plot contains information for three counties, namely “Fresno”, “Dallas” and “Suffolk”, selected from different regions. Fresno lies on the West Coast, Dallas in the central part of the country, and Suffolk on the East Coast. The analysis in this part is done by date. It helps in looking for a relation between AQI and Temperature within the same county for the months of January and February (an arbitrary choice).
The graphs for all three counties show no significant relationship between AQI and Temperature for the given period of time.
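Selecting several counties by name requires a comparison that is evaluated for every row. The sketch below, on a made-up vector of county names, shows two equivalent vectorised forms: chained `|` and `%in%`. The scalar operator `||` is not suitable here because it is meant for single logical values.

```r
# Invented county names for illustration.
county <- c("Fresno", "Dallas", "Kings", "Suffolk")

# Element-wise OR: one TRUE/FALSE per row.
keep_or <- county == "Fresno" | county == "Dallas" | county == "Suffolk"

# Set membership: the same result, and shorter as the list of counties grows.
keep_in <- county %in% c("Fresno", "Dallas", "Suffolk")

print(identical(keep_or, keep_in))
print(county[keep_in])
```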
# AF3: county-level averages per gas for January to May (strictly before 1 June, matching AF1)
AF3 <- Mergedfile5_Gases %>% filter(`Date Local` < ymd(20210601)) %>% group_by(`State Name`,`County Name`,Gas) %>% summarise(Temperature=mean(Temperature,na.rm=TRUE), Concentration=mean(Concentration,na.rm=TRUE))
ggplot(data=AF3) + geom_point(mapping = aes(x=Concentration, y=Temperature), color="red3", size=0.5) + facet_wrap(~Gas, nrow = 4) + theme(panel.background = element_rect(fill="white"),axis.title.x =element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)),axis.title.y=element_text(margin = margin(t = 0, r = 20, b = 0, l = 0)))
The graph above looks for a relation between the concentration of each gas and the temperature of the county. As the points appear randomly distributed, no significant relationship is visible between the individual gas concentrations and Temperature.
The panel for Ozone is not clearly readable here, so it is plotted separately for better clarity.
AF4 <- AF3 %>% filter(Gas == "Ozone")
ggplot(AF4) + geom_point(mapping = aes(x=Concentration,y=Temperature), size=0.7, color="red3") + theme(panel.background = element_rect(fill="white"),axis.title.x =element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)),axis.title.y=element_text(margin = margin(t = 0, r = 20, b = 0, l = 0))) + xlab("Ozone Concentration")
The graph shows no visible relation between the concentration of Ozone and temperature for the counties.
It can be concluded from the above analysis that a relationship seems to exist between the temperature and the AQI of a county, but it is weak when examined with a linear model. This can be analysed further using other models.
It can also be concluded from the above analysis that no relationship is apparent between the concentration of the individual gases and Temperature.
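One simple way to look beyond a linear model, offered here only as a possible next step and not as part of the original analysis, is the Spearman rank correlation, which measures monotonic rather than strictly linear association. The sketch below uses made-up values in which the relation is perfectly monotonic but far from linear.

```r
# Invented data: y increases with x, but not along a straight line.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- exp(x)

pearson <- cor(x, y)                        # sensitive to linearity
spearman <- cor(x, y, method = "spearman")  # based on ranks only

print(c(pearson = pearson, spearman = spearman))

# With the analysis file from this report, the equivalent call would be:
# cor(AF1$Temperature, AF1$AQI, method = "spearman")
```

A Spearman coefficient clearly above the Pearson coefficient would hint at a monotonic but non-linear relation; similar values would suggest that non-linearity is not the main reason for the weak fit.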
Citation: United States Environmental Protection Agency. (2021, May 18). Daily Summary Data. United States Environmental Protection Agency. Retrieved October 10, 2021, from https://aqs.epa.gov/aqsweb/airdata/download_files.html