R code
library(tidyverse)
library(lubridate)
I recognize, and fully understand, that this data maybe emotionally difficult to work. My intention is to make these lab relevant, allowing you to gather your own insights directly from new visualizations of the data. Please let me know if you would rather not work with the data.
The virus has been recently renamed based on phylogenetic analysis severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease caused by the virus is coronavirus disease (COVID-19). In this lab we will work with reporting data on COVID-19 cases, deaths and recoveries.
Researchers (Ensheng Dong, Hongru Du, Lauren Gardner) at John Hopkins University developed an interactive dashboard to visual data and track reported cases of coronavirus disease 2019 (SARS-CoV-2) in real time. The underlying data is collated from the following sources was updated several times a day until March 2023 (For more recent views of the data see the (CDC tracker and NY Times Tracker)
It is important to understand that this data is only as accurate as the reporting and many cases of the disease go unreported because of a lack of testing. This some countries may have have confirmed cases because of more comprehensive testing. Thus, the reporting data represent a minimum number of cases.
JHU researchers make data that goes into the dashboard available on Github repo for Novel Coronavirus (COVID-19) Cases. In this lab we will work with this data.
Let’s take a look at the files and the structure of data in the files.
Open up the file to look at the structure
The file contains the columns
Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude
It is important to note that for some countries there is only one row, while for others (e.g. China and US) there are multiple rows representing different provinces or states. Thus, we will need to sum these rows to get a total count for the US and China when we make graphs. From experience in making this tutorial I know the Column names with / will cause errors in ggplot ().
Let’s start by loading tidyverse
and a package lubridate
for working with dates.
library(tidyverse)
library(lubridate)
We could load data directly into R each time we knit
, it would make sure we have the most current data, but it is a time consuming step.
This code is for dynamically loading the data from the site each time you run the chunk or render.
<- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")) |>
time_series_confirmed rename(Province_State = "Province/State", Country_Region = "Country/Region")
This code is for loading the data from the site just once to your data folder.
download.file(url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv",
destfile = "data/time_series_covid19_confirmed_global.csv")
In this case it is best to have the data on your computer and then load the file into R. I recommend do this and then setting eval=FALSE so that you do not download the file each time your render (this will take a long time).
Then we can load the data.
<- read_csv("data/time_series_covid19_confirmed_global.csv")|>
time_series_confirmed rename(Province_State = "Province/State", Country_Region = "Country/Region")
Check the table properties to make sure the data imported as we expected. Click on the file in the top right corner of R Studio under Environment
.
Today we will go over Chapter 5 in R for Data Sciences - Data Tiyding and Pivot
As noted above this data is in wide format. To convert to long format we can usepivot_longer
<- time_series_confirmed |>
time_series_confirmed_long pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "Confirmed")
Let’s also change the format of Date to something that is easier to work with in graphs. See Chapter 17 in R for Data Sciences - Data Tiyding and Pivot
$Date <- mdy(time_series_confirmed_long$Date) time_series_confirmed_long
Let’s look at the format of the data frame of time_series_confirmed_long by clicking on it in the top right corner of R Studio under Environment.
To make a times series graph of the confirmed cases we need to summarize the Country date to count up the individual state data for the US.
|>
time_series_confirmed_longgroup_by(Country_Region, Date) |>
summarise(Confirmed = sum(Confirmed)) |>
filter (Country_Region == "US") |>
ggplot(aes(x = Date, y = Confirmed)) +
geom_point() +
geom_line() +
ggtitle("US COVID-19 Confirmed Cases")
Now several countries on the same graph
|>
time_series_confirmed_long group_by(Country_Region, Date) |>
summarise(Confirmed = sum(Confirmed)) |>
filter (Country_Region %in% c("China","France","Italy",
"Korea, South", "US")) |>
ggplot(aes(x = Date, y = Confirmed, color = Country_Region)) +
geom_point() +
geom_line() +
ggtitle("COVID-19 Confirmed Cases")
The above graphs using the cumulative counts. Let’s make a new table with the daily counts using the tidverse/dyplr lag function which subtracts a row from the previous row.
<-time_series_confirmed_long |>
time_series_confirmed_long_daily group_by(Country_Region, Date) |>
summarise(Confirmed = sum(Confirmed)) |>
mutate(Daily = Confirmed - lag(Confirmed, default = first(Confirmed )))
Now for a graph with the US data
|>
time_series_confirmed_long_daily filter (Country_Region == "US") |>
ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
geom_point() +
ggtitle("COVID-19 Confirmed Cases")
A line graph version of the above
|>
time_series_confirmed_long_daily filter (Country_Region == "US") |>
ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
geom_line() +
ggtitle("COVID-19 Confirmed Cases")
Now with a curve fit
|>
time_series_confirmed_long_daily filter (Country_Region == "US") |>
ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
geom_smooth() +
ggtitle("COVID-19 Confirmed Cases")
By default, geom_smooth()
adds a LOESS/LOWESS (Locally Weighted Scatterplot Smoothing) smoother to the data. That’s not what we’re after, though. Here is a fit using a generalized additive model (GAM)
|>
time_series_confirmed_long_daily filter (Country_Region == "US") |>
ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
geom_smooth(method = "gam", se = FALSE) +
ggtitle("COVID-19 Confirmed Cases")
Animated graphs when down right have a great visual impact. You can do this in R and have your animations embedded on your web page. Essentially gganimate creates a series of files that are encompassed in a gif file. In addition to having this gif as part of your report file, you can save the gif and use in a slide or other presentations. It just takes a few lines of code to covert and existing ggplot graph into an animation. See Tutorial for Getting Started with gganimate and gganimate: How to Create Plots with Beautiful Animation in R.
This are some important gganimate functions:
gifski
is a package for creating a gif file from gganimate
. This file can be embedded in a presentation or website. Yesterday, there were several hiccups in getting gganimate
to install without errors. Thankfully the folks at Posit came through with fix. Before installing gganimate
first install a fix for a package sf
that is required by gganimate
::install_github("r-spatial/sf") remotes
I am hoping that this works for people doing the lab on their own computers. This fix was not needed for installing gganimate
on Unity.
library(gganimate)
library(gifski)
theme_set(theme_bw())
<- time_series_confirmed_long_daily |>
daily_counts filter (Country_Region == "US")
<- ggplot(daily_counts, aes(x = Date, y = Daily, color = Country_Region)) +
p geom_point() +
ggtitle("Confirmed COVID-19 Cases") +
# gganimate lines
geom_point(aes(group = seq_along(Date))) +
transition_reveal(Date)
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)
You can change the output to a gif file that can be used in slide presentations or an instagram post. After you make the gif set eval=FALSE in your report so that it doesn’t recreate the gif (this takes a fair amount of time) each time you Knit.
anim_save("daily_counts_US.gif", p)
# This download may take about 5 minutes. You only need to do this once so set eval=false in your Rmd file
download.file(url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv",
destfile = "data/time_series_covid19_deaths_global.csv")
<- read_csv("data/time_series_covid19_deaths_global.csv")|>
time_series_deaths_confirmed rename(Province_State = "Province/State", Country_Region = "Country/Region")
<- time_series_deaths_confirmed |>
time_series_deaths_long pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "Confirmed")
$Date <- mdy(time_series_deaths_long$Date) time_series_deaths_long
<- time_series_deaths_long |>
p filter (Country_Region %in% c("US","Canada", "Mexico","Brazil","Egypt","Ecuador","India", "Netherlands", "Germany", "China" )) |>
ggplot(aes(x=Country_Region, y=Confirmed, color= Country_Region)) +
geom_point(aes(size=Confirmed)) +
transition_time(Date) +
labs(title = "Cumulative Deaths: {frame_time}") +
ylab("Deaths") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)
Pay attention to how your graphs look in today’s final knitted lab report. You will be docked points if the graphs do not look nice (e.g. overlapping column names, truncated legends, ets.)
Instead of making a graph of 5 countries on the same graph as in the above example, use facet_wrap
with scales="free_y"
as we did in lab 4.
Using the daily count of confirmed cases, make a single graph with 5 countries of your choosing.
Plot the cumulative deaths in the US, Canada and Mexico (you will need to download time_series_covid19_deaths_global.csv)
Make the same graph as above with the daily deaths. Use a generalized additive model (GAM) gam
for making the graph.
Make a graph with the countries of your choice using the daily deaths data
Make an animation of your choosing (do not use a graph with geom_smooth)