R code
library(tidyverse)
library(lubridate)The virus has been recently renamed based on phylogenetic analysis severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease caused by the virus is coronavirus disease (COVID-19). In this lab we will work with reporting data on COVID-19 cases, deaths and recoveries.
Researchers (Ensheng Dong, Hongru Du, Lauren Gardner) at John Hopkins University developed an interactive dashboard to visual data and track reported cases of coronavirus disease 2019 (SARS-CoV-2) in real time. The underlying data is collated from the following sources was updated several times a day until March 2023 (For more recent views of the data see the (CDC tracker and NY Times Tracker)
It is important to understand that this data is only as accurate as the reporting and many cases of the disease go unreported because of a lack of testing. This some countries may have have confirmed cases because of more comprehensive testing. Thus, the reporting data represent a minimum number of cases.
JHU researchers make data that goes into the dashboard available on Github repo for Novel Coronavirus (COVID-19) Cases. In this lab we will work with this data.
Let’s take a look at the files and the structure of data in the files.
Open up the file to look at the structure
The file contains the columns
Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude
It is important to note that for some countries there is only one row, while for others (e.g. China and US) there are multiple rows representing different provinces or states. Thus, we will need to sum these rows to get a total count for the US and China when we make graphs. From experience in making this tutorial I know the Column names with / will cause errors in ggplot ().
Let’s start by loading tidyverse and a package lubridate for working with dates.
We could load data directly into R each time we render, it would make sure we have the most current data, but it is a time consuming step. This code is for loading the data from the site just once to your data folder. After you do this add the line #| eval: false to you code chunk so that you do not download the file each time your render (this will take a long time).
Then we can load the data.
Check the table properties to make sure the data imported as we expected. Click on the file in the top right corner of R Studio under Environment.
Today we will go over Chapter 5 in R for Data Sciences - Data Tiyding and Pivot
As noted above this data is in wide format. To convert to long format we can usepivot_longer
Let’s also change the format of Date to something that is easier to work with in graphs. See Chapter 17 in R for Data Sciences - Data Tiyding and Pivot
Let’s look at the format of the data frame of time_series_confirmed_long by clicking on it in the top right corner of R Studio under Environment.
To make a times series graph of the confirmed cases we need to summarize the Country date to count up the individual state data for the US.
Now several countries on the same graph
time_series_confirmed_long |>
group_by(Country_Region, Date) |>
summarise(Confirmed = sum(Confirmed)) |>
filter (Country_Region %in% c("China","France","Italy",
"Korea, South", "US")) |>
ggplot(aes(x = Date, y = Confirmed, color = Country_Region)) +
geom_point() +
geom_line() +
ggtitle("COVID-19 Confirmed Cases")The above graphs using the cumulative counts. Let’s make a new table with the daily counts using the tidverse/dyplr lag function which subtracts a row from the previous row.
Now for a graph with the US data
A line graph version of the above
Now with a curve fit
By default, geom_smooth() adds a LOESS/LOWESS (Locally Weighted Scatterplot Smoothing) smoother to the data. That’s not what we’re after, though. Here is a fit using a generalized additive model (GAM)
Animated graphs when down right have a great visual impact. You can do this in R and have your animations embedded on your web page. Essentially gganimate creates a series of files that are encompassed in a gif file. In addition to having this gif as part of your report file, you can save the gif and use in a slide or other presentations. It just takes a few lines of code to covert and existing ggplot graph into an animation. See Tutorial for Getting Started with gganimate and gganimate: How to Create Plots with Beautiful Animation in R.
This are some important gganimate functions:
gifski is a package for creating a gif file from gganimate. Use Tools > Install Packages to install gifski and gganimate on Unity.
daily_counts <- time_series_confirmed_long_daily |>
filter (Country_Region == "US")
p <- ggplot(daily_counts, aes(x = Date, y = Daily, color = Country_Region)) +
geom_point() +
ggtitle("Confirmed COVID-19 Cases") +
# gganimate lines
geom_point(aes(group = seq_along(Date))) +
transition_reveal(Date)
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)You can change the output to a gif file that can be used in slide presentations or an instagram post. After you make the gif set eval=FALSE in your report so that it doesn’t recreate the gif (this takes a fair amount of time) each time you Knit.
# This download may take about 5 minutes. You only need to do this once so set `#| eval: false` in your qmd file
download.file(url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv",
destfile = "data/time_series_covid19_deaths_global.csv")time_series_deaths_confirmed <- read_csv("data/time_series_covid19_deaths_global.csv")|>
rename(Province_State = "Province/State", Country_Region = "Country/Region")
time_series_deaths_long <- time_series_deaths_confirmed |>
pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "Confirmed")
time_series_deaths_long$Date <- mdy(time_series_deaths_long$Date)p <- time_series_deaths_long |>
filter (Country_Region %in% c("US","Canada", "Mexico","Brazil","Egypt","Ecuador","India", "Netherlands", "Germany", "China" )) |>
ggplot(aes(x=Country_Region, y=Confirmed, color= Country_Region)) +
geom_point(aes(size=Confirmed)) +
transition_time(Date) +
labs(title = "Cumulative Deaths: {frame_time}") +
ylab("Deaths") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)Pay attention to how your graphs look in today’s final rendered lab report. You will be docked points if the graphs do not look nice (e.g. overlapping column names, truncated legends, ets.)
Go through Chapter 5 in R for Data Sciences - Data Tiyding and Pivot](https://r4ds.hadley.nz/data-tidy.html) putting the examples and exerices into your report as in Lab 2 and 3
Instead of making a graph of 5 countries on the same graph as in the above example, use facet_wrap with scales="free_y".
Using the daily count of confirmed cases, make a single graph with 5 countries of your choosing.
Plot the cumulative deaths in the US, Canada and Mexico (you will need to download time_series_covid19_deaths_global.csv)
Make a graph with the countries of your choice using the daily deaths data
Make an animation of your choosing (do not use a graph with geom_smooth)