Chapter 7 Conclusion

Taking all of our findings into account, we believe it is safe to conclude that all three beliefs are, at least in part, not true:

Based on our plots, the pandemic started in late March or early April 2020, which is consistent with official records. The first belief may stem from the idea that a pandemic could never spread so quickly without a build-up period. However, all-cause death counts did show an unexpected spike in April 2020, and we have to trust what the data tell us. To examine this belief further, more work should be done on the data collection stage, where bias might have been introduced.
The pandemic eased in summer 2020, and death counts surged again after October. We therefore cannot conclude that COVID became less deadly after summer 2020. On the other hand, we cannot completely reject that belief either, because we cannot determine what caused the second wave. As far as we know, winter 2020 was when several deadlier and more infectious variants of COVID were found in Europe. It may be that the virus in the US became less deadly after the summer and that new variants from elsewhere then caused the second wave. More work could also be done in this direction.
Race does affect death rates and infection rates. However, the difference is subtle. There might be other factors, such as poverty, that are more strongly associated with these rates than race. The majority of people in different racial groups may not share the same socio-economic background, which can introduce substantial bias. A better approach to investigating belief 3 would be to group people into subgroups with similar income level, age, place of residence, and so on, then calculate the difference by race within each subgroup and aggregate the results. (We did not do so because data at that level of detail are too hard to find.)
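The subgroup approach described above can be sketched in a few lines of Python. This is only an illustration of the idea, not something we ran in this project: the records, the stratum labels, the group names "A" and "B", and the function name stratified_rate_difference are all hypothetical, and weighting each subgroup by its combined population is just one of several reasonable aggregation choices.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only (not project data):
# (stratum, group, deaths, population)
records = [
    ("low_income_65plus", "A", 30, 1000),
    ("low_income_65plus", "B", 36, 1000),
    ("high_income_65plus", "A", 10, 1000),
    ("high_income_65plus", "B", 11, 1000),
]


def stratified_rate_difference(records, group_x, group_y):
    """Compare death rates of two groups within each stratum, then aggregate.

    Returns the per-stratum rate differences (group_x minus group_y) and
    their average weighted by the combined population of each stratum.
    """
    by_stratum = defaultdict(dict)
    for stratum, group, deaths, population in records:
        by_stratum[stratum][group] = (deaths, population)

    per_stratum = {}
    weighted_sum = 0.0
    total_weight = 0
    for stratum, groups in by_stratum.items():
        # Skip strata where one of the two groups is missing or empty.
        if group_x not in groups or group_y not in groups:
            continue
        deaths_x, pop_x = groups[group_x]
        deaths_y, pop_y = groups[group_y]
        if pop_x == 0 or pop_y == 0:
            continue
        diff = deaths_x / pop_x - deaths_y / pop_y
        weight = pop_x + pop_y
        per_stratum[stratum] = diff
        weighted_sum += diff * weight
        total_weight += weight

    if total_weight == 0:
        raise ValueError("no stratum contains both groups")
    return per_stratum, weighted_sum / total_weight


per_stratum, overall = stratified_rate_difference(records, "B", "A")
```

Because the comparison is made inside each subgroup first, a gap that is driven only by income or age differences between groups would shrink toward zero in the aggregated figure, while a gap that persists within subgroups would remain.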

From this project, we learned two main points. The first is to always trust what our data show us, no matter how counter-intuitive it is. The second is that exploratory analysis and visualization are heavily constrained by the raw data we are given, no matter how much effort we put into data cleaning and processing.