Chapter 2 Data sources
For the sake of accuracy and reliability, most of the data we use come from websites with a “.gov” domain name. The data were collected mainly by the Centers for Disease Control and Prevention (CDC) and Johns Hopkins University (JHU).
There are several notable points about our datasets:
2.1 Belief 1 and 2
Many reliable datasets record COVID-related information such as infection rates and death rates, but we took a different approach and analyzed our first two problems from the perspective of all-cause death counts. We do not use COVID death counts directly because we believe they may be misleading: they give the number of people who died with COVID, but COVID itself may not have been the leading cause of death (studies have shown that most people with COVID actually die from COVID complications). We therefore compare monthly all-cause death counts directly. This resembles an A/B test in which everything else is held constant, or at least in a similar state, except for the presence of the pandemic. Data of this kind are therefore well suited to answering our questions.
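The month-by-month comparison can be illustrated with a minimal sketch. It is not our actual analysis code: the function name `monthly_excess` is hypothetical, the numbers are made-up placeholders, and using the average of earlier years as the baseline is an assumption chosen for illustration.

```python
from statistics import mean


def monthly_excess(baseline_years, pandemic_year):
    """Return, per month, the pandemic-year count minus the baseline mean.

    baseline_years: dict mapping year -> {month (1-12) -> death count}
    pandemic_year:  dict mapping month (1-12) -> death count
    """
    excess = {}
    for month, count in sorted(pandemic_year.items()):
        # Average the same calendar month over the baseline years that report it
        baseline = mean(
            counts[month] for counts in baseline_years.values() if month in counts
        )
        excess[month] = count - baseline
    return excess


# Placeholder numbers for illustration only (not real death counts)
baseline = {2018: {1: 100, 2: 90}, 2019: {1: 110, 2: 94}}
pandemic = {1: 108, 2: 120}
print(monthly_excess(baseline, pandemic))
```

A positive value means that month saw more deaths than the baseline average. Whether a gap is large enough to matter would still need a proper statistical comparison.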
For the analysis of death counts, we use two datasets from two websites: "Monthly_Counts_of_Deaths_by_Select_Causes__2014-2019.csv" (https://data.cdc.gov/NCHS/Monthly-Counts-of-Deaths-by-Select-Causes-2014-201/bxq8-mugm/data) and "Monthly_provisional_counts_of_deaths_by_age_group__sex__and_race_ethnicity_for_select_causes_of_death__2019-2020.csv" (https://catalog.data.gov/dataset/monthly-provisional-counts-of-deaths-by-age-group-sex-and-race-ethnicity-for-select-causes). The first dataset gives the death count per month directly, while the second reports monthly death counts broken down by age group, sex, and race/ethnicity. The second dataset is easily transformed into the form of the first by summing over the groups. After this transformation, the two datasets give almost the same monthly death counts for 2019, the only year in which they overlap. We regard this as strong evidence of their consistency and reliability.
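The summation and the 2019 consistency check can be sketched as follows. This is an illustration under stated assumptions, not our actual code: the column names ("Year", "Month", "Sex", "AllCause"), the helper names, and all values are hypothetical placeholders. Rows of this shape are what `csv.DictReader` from the standard library would yield when reading a CSV file.

```python
from collections import defaultdict


def total_by_month(rows, year_key, month_key, count_key):
    """Sum a count column over all demographic groups for each (year, month)."""
    totals = defaultdict(int)
    for row in rows:
        key = (int(row[year_key]), int(row[month_key]))
        totals[key] += int(row[count_key])
    return dict(totals)


def max_relative_gap(totals_a, totals_b):
    """Largest relative difference over the (year, month) keys both share."""
    shared = totals_a.keys() & totals_b.keys()
    return max(abs(totals_a[k] - totals_b[k]) / totals_a[k] for k in shared)


# Placeholder rows (not real data); column names are assumptions
grouped = [
    {"Year": "2019", "Month": "1", "Sex": "F", "AllCause": "60"},
    {"Year": "2019", "Month": "1", "Sex": "M", "AllCause": "41"},
    {"Year": "2019", "Month": "2", "Sex": "F", "AllCause": "50"},
    {"Year": "2019", "Month": "2", "Sex": "M", "AllCause": "50"},
]
# Monthly totals as the first dataset would report them (placeholders)
direct = {(2019, 1): 100, (2019, 2): 100}

summed = total_by_month(grouped, "Year", "Month", "AllCause")
print(summed, max_relative_gap(direct, summed))
```

A small maximum relative gap over the overlapping months is what we mean by the two datasets giving almost the same counts.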
2.2 Belief 3
- The raw data we used to analyze the COVID-19 situation by state were obtained from the repository of the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). For each state, they contain the state name, latitude, longitude, and the daily numbers of deaths and confirmed cases from 2020-04-12 to 2021-04-03.
- The raw data we used to analyze the COVID-19 situation by race were obtained from ‘The COVID Tracking Project’ (https://covidtracking.com/race/about#download-the-data) and contain the daily numbers of COVID-19 cases and deaths by race for each US state from 2020-04-12 to 2021-03-07. We also obtained each state’s overall population in 2020 from the official United States Census Bureau website (https://www.census.gov/programs-surveys/popest/technical-documentation/research/evaluation-estimates.html).
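One plausible use of the state population figures is to turn raw counts into per-capita rates so that states of different sizes can be compared. The sketch below assumes that use. The function name `per_100k`, the state labels, and all numbers are hypothetical placeholders, not values from the datasets above.

```python
def per_100k(counts_by_state, population_by_state):
    """Convert raw state counts into rates per 100,000 residents."""
    rates = {}
    for state, count in counts_by_state.items():
        pop = population_by_state.get(state)
        if not pop:
            # Skip states with no population figure instead of guessing one
            continue
        rates[state] = count * 100_000 / pop
    return rates


# Placeholder values (not real counts or populations)
deaths = {"A": 50, "B": 30, "C": 10}
population = {"A": 1_000_000, "B": 200_000}
print(per_100k(deaths, population))
```

States missing from the population table are dropped from the result, which makes gaps in the join visible instead of silently producing a wrong rate.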