Welcome to the second installment of Thesis Scribbles by Lillian, the series where I document my casual exploratory data analysis in pursuit of a testable hypothesis for my MPH capstone! My overarching goal is to create a predictive model of respiratory disease incidence at the county level. In the last installment I compared flu-symptom Google Trends data in Grand Rapids to Influenza-Like Illness (ILI) incidence in Kent County. In this installment, I will take a look at historical Air Quality Index (AQI) measurements in the county.
In light of the Canadian wildfires that impacted the state of Michigan this summer, I find myself wondering what effect, if any, they will have on ILI incidence this winter. There's literature suggesting that increased air pollutants during wildfire season are associated with an increase in respiratory illness 8-10 weeks later. If this is indeed happening in Michigan, then AQI may be a useful parameter in a predictive model of respiratory diseases.
Let's instantiate our Jupyter environment and import the necessary Python modules to get started.
The EPA has daily, county-level AQI data available here. These data can be pulled through an API or downloaded as pregenerated CSV files. I chose to download the files this time.
Each CSV file has one year of daily AQI measurements for all US counties, totaling over 250,000 rows per file. I downloaded 23 years of AQI measurements, so I had a LOT of data to dig through. I wrote two functions to manage this task.
The preprocess() function takes a dataframe made from one CSV data file. It filters the dataframe down to Kent County, reformats the result, and returns it.
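Here is a minimal sketch of what a function like preprocess() could look like. The column names ("State Name", "county Name", "Date", "AQI") are assumptions based on the layout of the EPA's pregenerated daily county files, and the tiny sample frame is made up for illustration, so treat this as one possible shape rather than the exact code behind this post.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only Kent County, Michigan rows and index them by date."""
    # Assumed column names; check them against the header of the actual file.
    is_kent = (df["State Name"] == "Michigan") & (df["county Name"] == "Kent")
    kent = df.loc[is_kent, ["Date", "AQI"]].copy()
    # Parse the date strings so later steps can group by month.
    kent["Date"] = pd.to_datetime(kent["Date"])
    return kent.set_index("Date").sort_index()


# Made-up rows, only to show the filtering behaviour.
sample = pd.DataFrame(
    {
        "State Name": ["Michigan", "Michigan", "Delaware"],
        "county Name": ["Kent", "Wayne", "Kent"],
        "Date": ["2000-01-02", "2000-01-02", "2000-01-02"],
        "AQI": [41, 55, 38],
    }
)
kent_sample = preprocess(sample)
print(kent_sample)
```

Filtering on both state and county matters because several states have a Kent County.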
The process() function takes a list of dataframes (one per yearly CSV file), runs each through preprocess(), calculates monthly mean AQIs, and concatenates everything into a single output dataframe.
The result is one dataframe containing the average AQI for each month from 2000 to 2022 for Kent County, Michigan.
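The monthly averaging step can be sketched as follows. This is an illustration under assumptions: the stand-in preprocess() and its column names are guesses at the EPA file layout, the output column name mean_aqi is hypothetical, and the two small input frames are invented.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the Kent County filter; column names are assumed.
    kent = df[(df["State Name"] == "Michigan") & (df["county Name"] == "Kent")]
    return kent.assign(Date=pd.to_datetime(kent["Date"])).set_index("Date")[["AQI"]]


def process(frames: list) -> pd.DataFrame:
    """Filter each yearly frame, average AQI by month, and stack the years."""
    monthly = []
    for df in frames:
        kent = preprocess(df)
        # "MS" groups by calendar month and labels each group by its first day.
        monthly.append(kent["AQI"].resample("MS").mean())
    return pd.concat(monthly).sort_index().to_frame("mean_aqi")


# Two invented "years" with a couple of daily readings each.
year_2000 = pd.DataFrame(
    {
        "State Name": ["Michigan", "Michigan", "Michigan"],
        "county Name": ["Kent", "Kent", "Wayne"],
        "Date": ["2000-01-01", "2000-01-02", "2000-01-01"],
        "AQI": [40, 60, 100],
    }
)
year_2001 = pd.DataFrame(
    {
        "State Name": ["Michigan", "Michigan"],
        "county Name": ["Kent", "Kent"],
        "Date": ["2001-01-01", "2001-01-03"],
        "AQI": [30, 50],
    }
)
monthly_aqi = process([year_2000, year_2001])
print(monthly_aqi)
```

Averaging within each year before concatenating keeps the intermediate frames small, which helps when every input file has hundreds of thousands of rows.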
Let's take a peek at how these data look.
We can definitely see the rise and fall of air pollution alongside wildfire season.
Let's normalize these data, then plot them together with normalized ILI incidence data from the county.
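The post does not say which normalization was used, so the sketch below assumes min-max scaling, which maps each series onto the range 0 to 1 so that AQI and ILI can share one axis. The helper name min_max and all of the numbers are hypothetical placeholders, not real measurements.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a series so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())


# Placeholder values on a shared monthly index (not real data).
months = pd.date_range("2020-01-01", periods=3, freq="MS")
raw = pd.DataFrame(
    {"aqi": [30.0, 45.0, 60.0], "ili": [2.0, 1.0, 4.0]},
    index=months,
)

# Apply the same scaling to each column independently.
normalized = raw.apply(min_max)
print(normalized)
```

Once both columns are on the same scale, they can be drawn on a single set of axes. A z-score (subtract the mean, divide by the standard deviation) would be an equally reasonable choice if outliers are a concern.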
Interestingly, air pollution and ILI incidence seem to be in almost perfect antiphase with each other, with peaks in air pollution preceding flu peaks by about 4-5 months. Could increased air pollution in the summer contribute to higher ILI incidence in the winter? Weekly data on AQI and ILI incidence may show a clearer picture of this possible association.
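One way to put a number on the lag seen in the plot is to shift the AQI series forward month by month and check where its correlation with ILI is highest. The sketch below is not part of the original analysis: it uses synthetic sine waves with a built-in 5-month offset purely to show the mechanics, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a 12-month cycle (not real data).
months = np.arange(48)
index = pd.date_range("2000-01-01", periods=48, freq="MS")
aqi = pd.Series(np.sin(2 * np.pi * months / 12), index=index)
# ILI is constructed to trail AQI by exactly 5 months.
ili = pd.Series(np.sin(2 * np.pi * (months - 5) / 12), index=index)

# Correlate ILI with AQI delayed by 0 to 11 months.
# Series.corr ignores the NaN values that shift() introduces.
lag_corr = {lag: aqi.shift(lag).corr(ili) for lag in range(12)}
best_lag = max(lag_corr, key=lag_corr.get)
print(best_lag, round(lag_corr[best_lag], 3))
```

With real data, a strong correlation at some lag would still mostly reflect two seasonal cycles lining up, so it would support the timing observation without showing that one causes the other.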