Lab 5 : Data Transformation and Visualization with COVID-19 reporting data

I recognize, and fully understand, that this data maybe emotionally difficult to work. My intention is to make these lab relevant, allowing you to gather your own insights directly from new visualizations of the data. Please let me know if you would rather not work with the data.

Learning Objectives

  • Understanding the sources of SARS-CoV-2 incidence reports
  • Accessing data remotely
  • Wide and long table formats
  • More data visualization with ggplot2
  • Animation

Visualizing COVID-19 cases, deaths and recoveries

The virus has been recently renamed based on phylogenetic analysis severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease caused by the virus is coronavirus disease (COVID-19). In this lab we will work with reporting data on COVID-19 cases, deaths and recoveries.

Introduction to JHU case tracking data

Researchers (Ensheng Dong, Hongru Du, Lauren Gardner) at John Hopkins University developed an interactive dashboard to visual data and track reported cases of coronavirus disease 2019 (SARS-CoV-2) in real time. The underlying data is collated from the following sources was updated several times a day until March 2023 (For more recent views of the data see the (CDC tracker and NY Times Tracker)

It is important to understand that this data is only as accurate as the reporting and many cases of the disease go unreported because of a lack of testing. This some countries may have have confirmed cases because of more comprehensive testing. Thus, the reporting data represent a minimum number of cases.

JHU researchers make data that goes into the dashboard available on Github repo for Novel Coronavirus (COVID-19) Cases. In this lab we will work with this data.

Let’s take a look at the files and the structure of data in the files.

  • csse_covid_19_data
    • csse_covid_19_daily_reports
      • 03-11-2020.csv

Open up the file to look at the structure

The file contains the columns

Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude

It is important to note that for some countries there is only one row, while for others (e.g. China and US) there are multiple rows representing different provinces or states. Thus, we will need to sum these rows to get a total count for the US and China when we make graphs. From experience in making this tutorial I know the Column names with / will cause errors in ggplot ().

On the Computer

Let’s start by loading tidyverse and a package lubridate for working with dates.

R code
library(tidyverse)
library(lubridate)

Loading data from a github repository

We could load data directly into R each time we knit, it would make sure we have the most current data, but it is a time consuming step.

This code is for dynamically loading the data from the site each time you run the chunk or render.

R code
time_series_confirmed <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")) |>
  rename(Province_State = "Province/State", Country_Region = "Country/Region") 

This code is for loading the data from the site just once to your data folder.

R code
 download.file(url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", 
               destfile = "data/time_series_covid19_confirmed_global.csv")

In this case it is best to have the data on your computer and then load the file into R. I recommend do this and then setting eval=FALSE so that you do not download the file each time your render (this will take a long time).

Then we can load the data.

R code
time_series_confirmed <- read_csv("data/time_series_covid19_confirmed_global.csv")|>
  rename(Province_State = "Province/State", Country_Region = "Country/Region")

Check the table properties to make sure the data imported as we expected. Click on the file in the top right corner of R Studio under Environment.

Data Tidying - Pivoting

Today we will go over Chapter 5 in R for Data Sciences - Data Tiyding and Pivot

As noted above this data is in wide format. To convert to long format we can usepivot_longer

R code
time_series_confirmed_long <- time_series_confirmed |> 
               pivot_longer(-c(Province_State, Country_Region, Lat, Long),
                            names_to = "Date", values_to = "Confirmed") 

Dates and time

Let’s also change the format of Date to something that is easier to work with in graphs. See Chapter 17 in R for Data Sciences - Data Tiyding and Pivot

R code
time_series_confirmed_long$Date <- mdy(time_series_confirmed_long$Date)

Let’s look at the format of the data frame of time_series_confirmed_long by clicking on it in the top right corner of R Studio under Environment.

Making Graphs from the time series data

To make a times series graph of the confirmed cases we need to summarize the Country date to count up the individual state data for the US.

R code
time_series_confirmed_long|> 
  group_by(Country_Region, Date) |> 
  summarise(Confirmed = sum(Confirmed)) |> 
  filter (Country_Region == "US") |> 
  ggplot(aes(x = Date,  y = Confirmed)) + 
    geom_point() +
    geom_line() +
    ggtitle("US COVID-19 Confirmed Cases")

Now several countries on the same graph

R code
time_series_confirmed_long |> 
    group_by(Country_Region, Date) |> 
    summarise(Confirmed = sum(Confirmed)) |> 
    filter (Country_Region %in% c("China","France","Italy", 
                                "Korea, South", "US")) |> 
    ggplot(aes(x = Date,  y = Confirmed, color = Country_Region)) + 
      geom_point() +
      geom_line() +
      ggtitle("COVID-19 Confirmed Cases")

The above graphs using the cumulative counts. Let’s make a new table with the daily counts using the tidverse/dyplr lag function which subtracts a row from the previous row.

R code
time_series_confirmed_long_daily <-time_series_confirmed_long |> 
    group_by(Country_Region, Date) |> 
    summarise(Confirmed = sum(Confirmed)) |> 
    mutate(Daily = Confirmed - lag(Confirmed, default = first(Confirmed )))

Now for a graph with the US data

R code
time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_point() +
      ggtitle("COVID-19 Confirmed Cases")

A line graph version of the above

R code
time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_line() +
      ggtitle("COVID-19 Confirmed Cases")

Now with a curve fit

R code
time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_smooth() +
      ggtitle("COVID-19 Confirmed Cases")

By default, geom_smooth() adds a LOESS/LOWESS (Locally Weighted Scatterplot Smoothing) smoother to the data. That’s not what we’re after, though. Here is a fit using a generalized additive model (GAM)

R code
time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_smooth(method = "gam", se = FALSE) +
      ggtitle("COVID-19 Confirmed Cases")

Animated Graphs with gganimate

Animated graphs when down right have a great visual impact. You can do this in R and have your animations embedded on your web page. Essentially gganimate creates a series of files that are encompassed in a gif file. In addition to having this gif as part of your report file, you can save the gif and use in a slide or other presentations. It just takes a few lines of code to covert and existing ggplot graph into an animation. See Tutorial for Getting Started with gganimate and gganimate: How to Create Plots with Beautiful Animation in R.

This are some important gganimate functions:

  • transition_*() defines how the data should be spread out and how it relates to itself across time.
  • view_*() defines how the positional scales should change along the animation.
  • shadow_*() defines how data from other points in time should be presented in the given point in time.
  • enter_()/exit_() defines how new data should appear and how old data should disappear during the course of the animation.
  • ease_aes() defines how different aesthetics should be eased during transitions.

Installing gganimate and gifski

gifski is a package for creating a gif file from gganimate. This file can be embedded in a presentation or website. Yesterday, there were several hiccups in getting gganimate to install without errors. Thankfully the folks at Posit came through with fix. Before installing gganimate first install a fix for a package sf that is required by gganimate

R code
remotes::install_github("r-spatial/sf")

I am hoping that this works for people doing the lab on their own computers. This fix was not needed for installing gganimate on Unity.

R code
library(gganimate)
library(gifski)
theme_set(theme_bw())

An animation of the confirmed cases in select countries

R code
daily_counts <- time_series_confirmed_long_daily |> 
      filter (Country_Region == "US")

p <- ggplot(daily_counts, aes(x = Date,  y = Daily, color = Country_Region)) + 
        geom_point() +
        ggtitle("Confirmed COVID-19 Cases") +
# gganimate lines  
        geom_point(aes(group = seq_along(Date))) +
        transition_reveal(Date) 

# make the animation
 animate(p, renderer = gifski_renderer(), end_pause = 15)

You can change the output to a gif file that can be used in slide presentations or an instagram post. After you make the gif set eval=FALSE in your report so that it doesn’t recreate the gif (this takes a fair amount of time) each time you Knit.

R code
anim_save("daily_counts_US.gif", p)

Animation of confirmed deaths

Download the data

R code
# This download may take about 5 minutes. You only need to do this once so set eval=false in your Rmd file
download.file(url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", 
  destfile = "data/time_series_covid19_deaths_global.csv")

Data tidying, pivot and time

R code
time_series_deaths_confirmed <- read_csv("data/time_series_covid19_deaths_global.csv")|>
  rename(Province_State = "Province/State", Country_Region = "Country/Region")

time_series_deaths_long <- time_series_deaths_confirmed |> 
    pivot_longer(-c(Province_State, Country_Region, Lat, Long),
        names_to = "Date", values_to = "Confirmed") 

time_series_deaths_long$Date <- mdy(time_series_deaths_long$Date)

Making the animated graph

R code
p <- time_series_deaths_long |>
  filter (Country_Region %in% c("US","Canada", "Mexico","Brazil","Egypt","Ecuador","India", "Netherlands", "Germany", "China" )) |>
  ggplot(aes(x=Country_Region, y=Confirmed, color= Country_Region)) + 
    geom_point(aes(size=Confirmed)) + 
    transition_time(Date) + 
    labs(title = "Cumulative Deaths: {frame_time}") + 
    ylab("Deaths") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)

Exercises

Pay attention to how your graphs look in today’s final knitted lab report. You will be docked points if the graphs do not look nice (e.g. overlapping column names, truncated legends, ets.)

Exercise 1

Instead of making a graph of 5 countries on the same graph as in the above example, use facet_wrap with scales="free_y" as we did in lab 4.

Exercise 2

Using the daily count of confirmed cases, make a single graph with 5 countries of your choosing.

Exercise 3

Plot the cumulative deaths in the US, Canada and Mexico (you will need to download time_series_covid19_deaths_global.csv)

Exercise 4

Make the same graph as above with the daily deaths. Use a generalized additive model (GAM) gam for making the graph.

Exercise 5

Make a graph with the countries of your choice using the daily deaths data

Exercise 6

Make an animation of your choosing (do not use a graph with geom_smooth)