I recognize, and fully understand, that this data may be emotionally difficult to work with. My intention is to make these labs relevant, allowing you to gather your own insights directly from new visualizations of the data. Please let me know if you would rather not work with the data.

Learning Objectives

Visualizing COVID-19 cases, deaths and recoveries

The virus has recently been renamed, based on phylogenetic analysis (more on this next week), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease caused by the virus is coronavirus disease (COVID-19). In this lab we will work with reporting data on COVID-19 cases, deaths and recoveries.

Introduction to JHU case tracking data

Researchers (Ensheng Dong, Hongru Du, Lauren Gardner) at Johns Hopkins University developed an interactive dashboard to visualize data and track reported cases of coronavirus disease 2019 (COVID-19) in real time. The underlying data is collated from the following sources and updated several times a day.

World Health Organization (WHO) | DXY.cn. Pneumonia. 2020 | BNO News | National Health Commission of the People’s Republic of China (NHC) | China CDC (CCDC) | Hong Kong Department of Health | Macau Government | Taiwan CDC | US CDC | Government of Canada | Australia Government Department of Health | European Centre for Disease Prevention and Control (ECDC) | Ministry of Health Singapore (MOH) | Italy Ministry of Health | 1Point3Acres | Worldometers

  • It is important to understand that this data is only as accurate as the reporting, and many cases of the disease go unreported because of a lack of testing. Thus, some countries may have more confirmed cases because of more comprehensive testing, and the reporting data represent a minimum number of cases. See recent posts and an article by UMass statistician Prof. Nick Reich.

JHU’s Github repo

JHU researchers make the data that goes into the dashboard available in a GitHub repo for Novel Coronavirus (COVID-19) Cases. In this lab we will work with this data.

You could start by creating a new Rproj and cloning the JHU repo as you did in Lab 6. However, we will not need all of this data, and it is easier to access the individual files directly since some are updated daily.

Let’s take a look at the files and the structure of data in the files.

  • csse_covid_19_data
    • csse_covid_19_daily_reports
      • 03-11-2020.csv

Open up the file to look at the structure

The file contains the columns

Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude

It is important to note that for some countries there is only one row, while for others (e.g. China and US) there are multiple rows representing different provinces or states. Thus, we will need to sum these rows to get a total count for the US and China when we make graphs. From experience in making this tutorial I know the column names containing / will cause errors in ggplot().

Next, look at the time series data, which is a compilation of the daily reports for each reporting type (Confirmed, Deaths, Recovered).

  • csse_covid_19_data
    • csse_covid_19_time_series
      • time_series_covid19_confirmed_global.csv

Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20…

This data is in what is called wide format, where values of the variable being observed (confirmed cases) are spread out across columns (here, one column per day). Another way of describing this is that there is more than one measurement per row. Wide format works well for data entry, and sometimes for analysis, but can be cumbersome when working in R. The tidyr package allows us to quickly switch between wide format and what is called long format, where there is only one column for the cases.
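To sketch the idea on a toy table (the country names and values here are made up purely for illustration):

```r
library(tidyr)

# A small made-up wide table: one row per country, one column per day
wide <- tibble::tibble(
  Country    = c("A", "B"),
  `1/22/20`  = c(0, 1),
  `1/23/20`  = c(2, 3)
)

# pivot_longer() gathers the date columns into a Date column and a Cases column,
# giving one row per country per day
long <- pivot_longer(wide, -Country, names_to = "Date", values_to = "Cases")
```

After pivoting, `long` has four rows (2 countries × 2 days) instead of two.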

Making graphs from the daily reports

We are going to work with the data using tidyverse functions and use the lubridate library for reformatting the dates.
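A typical setup chunk for this lab might look like the following (assuming the packages are already installed):

```r
library(tidyverse)  # loads readr, dplyr, tidyr and ggplot2, among others
library(lubridate)  # for parsing and reformatting dates
```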

Let’s start by plotting data from our last class on 3/11/2020 (the file we looked at above). We can read the data directly from the GitHub site.

  1. Go to the file you want to download.
  2. Click it to view the contents within the GitHub UI.
  3. In the top right, right click the Raw button.
  4. Save as… or copy the url

Below I have used the url so that the code reads the file from the repo each time it is run. This method also loads the file directly into R (memory) and doesn’t save it on your computer.

daily_report <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-11-2020.csv"))

Check the table properties to make sure the data imported as we expected
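One way to check, assuming the file was read into an object called daily_report (a name chosen here for illustration):

```r
head(daily_report)  # first six rows of the tibble
str(daily_report)   # column names, types and the readr column spec
```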

## # A tibble: 6 x 8
##   Province_State Country_Region `Last Update`       Confirmed Deaths Recovered
##   <chr>          <chr>          <dttm>                  <dbl>  <dbl>     <dbl>
## 1 Hubei          China          2020-03-11 10:53:02     67773   3046     49134
## 2 <NA>           Italy          2020-03-11 21:33:02     12462    827      1045
## 3 <NA>           Iran           2020-03-11 18:52:03      9000    354      2959
## 4 <NA>           Korea, South   2020-03-11 21:13:18      7755     60       288
## 5 France         France         2020-03-11 22:53:03      2281     48        12
## 6 <NA>           Spain          2020-03-11 20:53:02      2277     54       183
## # … with 2 more variables: Latitude <dbl>, Longitude <dbl>
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 216 obs. of  8 variables:
##  $ Province_State: chr  "Hubei" NA NA NA ...
##  $ Country_Region: chr  "China" "Italy" "Iran" "Korea, South" ...
##  $ Last Update   : POSIXct, format: "2020-03-11 10:53:02" "2020-03-11 21:33:02" ...
##  $ Confirmed     : num  67773 12462 9000 7755 2281 ...
##  $ Deaths        : num  3046 827 354 60 48 ...
##  $ Recovered     : num  49134 1045 2959 288 12 ...
##  $ Latitude      : num  31 43 32 36 46.2 ...
##  $ Longitude     : num  112.27 12 53 128 2.21 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Province/State` = col_character(),
##   ..   `Country/Region` = col_character(),
##   ..   `Last Update` = col_datetime(format = ""),
##   ..   Confirmed = col_double(),
##   ..   Deaths = col_double(),
##   ..   Recovered = col_double(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double()
##   .. )
  • Note that data from the older reports is in a different format than the new format just released yesterday.
## # A tibble: 6 x 12
##   FIPS  Admin2 Province_State Country_Region Last_Update           Lat  Long_
##   <chr> <chr>  <chr>          <chr>          <dttm>              <dbl>  <dbl>
## 1 45001 Abbev… South Carolina US             2020-03-23 23:19:34  34.2  -82.5
## 2 22001 Acadia Louisiana      US             2020-03-23 23:19:34  30.3  -92.4
## 3 51001 Accom… Virginia       US             2020-03-23 23:19:34  37.8  -75.6
## 4 16001 Ada    Idaho          US             2020-03-23 23:19:34  43.5 -116. 
## 5 19001 Adair  Iowa           US             2020-03-23 23:19:34  41.3  -94.5
## 6 21001 Adair  Kentucky       US             2020-03-23 23:19:34  37.1  -85.3
## # … with 5 more variables: Confirmed <dbl>, Deaths <dbl>, Recovered <dbl>,
## #   Active <dbl>, Combined_Key <chr>
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3415 obs. of  12 variables:
##  $ FIPS          : chr  "45001" "22001" "51001" "16001" ...
##  $ Admin2        : chr  "Abbeville" "Acadia" "Accomack" "Ada" ...
##  $ Province_State: chr  "South Carolina" "Louisiana" "Virginia" "Idaho" ...
##  $ Country_Region: chr  "US" "US" "US" "US" ...
##  $ Last_Update   : POSIXct, format: "2020-03-23 23:19:34" "2020-03-23 23:19:34" ...
##  $ Lat           : num  34.2 30.3 37.8 43.5 41.3 ...
##  $ Long_         : num  -82.5 -92.4 -75.6 -116.2 -94.5 ...
##  $ Confirmed     : num  1 1 1 13 1 0 2 0 18 0 ...
##  $ Deaths        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Active        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Combined_Key  : chr  "Abbeville, South Carolina, US" "Acadia, Louisiana, US" "Accomack, Virginia, US" "Ada, Idaho, US" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   FIPS = col_character(),
##   ..   Admin2 = col_character(),
##   ..   Province_State = col_character(),
##   ..   Country_Region = col_character(),
##   ..   Last_Update = col_datetime(format = ""),
##   ..   Lat = col_double(),
##   ..   Long_ = col_double(),
##   ..   Confirmed = col_double(),
##   ..   Deaths = col_double(),
##   ..   Recovered = col_double(),
##   ..   Active = col_double(),
##   ..   Combined_Key = col_character()
##   .. )
  • One change is the column titles, so we are no longer required to rename them. Note the addition of new columns to work with, including county (“Admin2”) level information within a state. You will need to use summarize to get state or country counts.

Let’s start with a plot of confirmed cases in US states as of March 11th.

If we try to graph by Country_Region, there are multiple entries with the same value (e.g. China and US), so we need to summarise the data we are interested in graphing.
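A sketch of the summarise step, assuming the daily report was read into an object called daily_report (an illustrative name):

```r
daily_report %>%
  group_by(Country_Region) %>%          # one group per country
  summarise(Deaths = sum(Deaths)) %>%   # sum the province/state rows
  arrange(desc(Deaths))                 # countries with most deaths first
```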

## # A tibble: 116 x 2
##    Country_Region Deaths
##    <chr>           <dbl>
##  1 China            3161
##  2 Italy             827
##  3 Iran              354
##  4 Korea, South       60
##  5 Spain              54
##  6 France             48
##  7 US                 36
##  8 Japan              15
##  9 United Kingdom      8
## 10 Cruise Ship         7
## # … with 106 more rows

There are 116 countries, which is a lot to put on a single graph, so let’s just start with the countries with the most cases. In past labs we have only made bar graphs from histogram data, but values can be used directly, where the height of the bar represents the value in a column of the data frame. This is done by using stat="identity" instead of the default, stat="bin".
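A minimal sketch of such a bar graph, using the summarised deaths table from above (object names are illustrative, and the choice of 10 countries is arbitrary):

```r
daily_report %>%
  group_by(Country_Region) %>%
  summarise(Deaths = sum(Deaths)) %>%
  arrange(desc(Deaths)) %>%
  slice(1:10) %>%                         # keep the 10 countries with the most deaths
  ggplot(aes(x = Country_Region, y = Deaths)) +
  geom_bar(stat = "identity")             # bar height = value in the Deaths column
```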

Working with the time series data

Data Wrangling

There is some data wrangling required for working with the time series data. As of March 26, 2020 it still used the old column headers so we need to rename them. I expect this will change soon.

Start by loading the most recent time series data for confirmed cases.
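Assuming the raw-file URL pattern shown earlier, this step might be sketched as follows (note the rename to remove the / from the column names):

```r
time_series_confirmed <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") %>%
  rename(Province_State = "Province/State",
         Country_Region = "Country/Region")
```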

Check the table properties to make sure the data imported as we expected

## # A tibble: 6 x 81
##   Province_State Country_Region   Lat   Long `1/22/20` `1/23/20` `1/24/20`
##   <chr>          <chr>          <dbl>  <dbl>     <dbl>     <dbl>     <dbl>
## 1 <NA>           Afghanistan     33    65            0         0         0
## 2 <NA>           Albania         41.2  20.2          0         0         0
## 3 <NA>           Algeria         28.0   1.66         0         0         0
## 4 <NA>           Andorra         42.5   1.52         0         0         0
## 5 <NA>           Angola         -11.2  17.9          0         0         0
## 6 <NA>           Antigua and B…  17.1 -61.8          0         0         0
## # … with 74 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
## #   `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
## #   `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
## #   `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
## #   `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
## #   `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
## #   `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>,
## #   `2/21/20` <dbl>, `2/22/20` <dbl>, `2/23/20` <dbl>, `2/24/20` <dbl>,
## #   `2/25/20` <dbl>, `2/26/20` <dbl>, `2/27/20` <dbl>, `2/28/20` <dbl>,
## #   `2/29/20` <dbl>, `3/1/20` <dbl>, `3/2/20` <dbl>, `3/3/20` <dbl>,
## #   `3/4/20` <dbl>, `3/5/20` <dbl>, `3/6/20` <dbl>, `3/7/20` <dbl>,
## #   `3/8/20` <dbl>, `3/9/20` <dbl>, `3/10/20` <dbl>, `3/11/20` <dbl>,
## #   `3/12/20` <dbl>, `3/13/20` <dbl>, `3/14/20` <dbl>, `3/15/20` <dbl>,
## #   `3/16/20` <dbl>, `3/17/20` <dbl>, `3/18/20` <dbl>, `3/19/20` <dbl>,
## #   `3/20/20` <dbl>, `3/21/20` <dbl>, `3/22/20` <dbl>, `3/23/20` <dbl>,
## #   `3/24/20` <dbl>, `3/25/20` <dbl>, `3/26/20` <dbl>, `3/27/20` <dbl>,
## #   `3/28/20` <dbl>, `3/29/20` <dbl>, `3/30/20` <dbl>, `3/31/20` <dbl>,
## #   `4/1/20` <dbl>, `4/2/20` <dbl>, `4/3/20` <dbl>, `4/4/20` <dbl>,
## #   `4/5/20` <dbl>, `4/6/20` <dbl>, `4/7/20` <dbl>

As noted above, this data is in wide format. To convert it to long format:
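A sketch of the conversion with pivot_longer, keeping the identifying columns and gathering every date column (time_series_confirmed_long is an illustrative name):

```r
time_series_confirmed_long <- time_series_confirmed %>%
  pivot_longer(-c(Province_State, Country_Region, Lat, Long),
               names_to  = "Date",       # old column names become dates
               values_to = "Confirmed")  # cell values become counts
```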

Let’s look at the format of the data frame (tibble) now

## # A tibble: 6 x 6
##   Province_State Country_Region   Lat  Long Date    Confirmed
##   <chr>          <chr>          <dbl> <dbl> <chr>       <dbl>
## 1 <NA>           Afghanistan       33    65 1/22/20         0
## 2 <NA>           Afghanistan       33    65 1/23/20         0
## 3 <NA>           Afghanistan       33    65 1/24/20         0
## 4 <NA>           Afghanistan       33    65 1/25/20         0
## 5 <NA>           Afghanistan       33    65 1/26/20         0
## 6 <NA>           Afghanistan       33    65 1/27/20         0

It would be convenient to have the confirmed cases and deaths in the same table. We can create another table with the deaths and then join the two tables.

Let’s get the time series data for deaths

and convert it to long format
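The same two steps for the deaths file might be sketched as (URL pattern and object names assumed):

```r
time_series_deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv") %>%
  rename(Province_State = "Province/State",
         Country_Region = "Country/Region")

time_series_deaths_long <- time_series_deaths %>%
  pivot_longer(-c(Province_State, Country_Region, Lat, Long),
               names_to = "Date", values_to = "Deaths")
```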

## # A tibble: 6 x 6
##   Province_State Country_Region   Lat  Long Date    Deaths
##   <chr>          <chr>          <dbl> <dbl> <chr>    <dbl>
## 1 <NA>           Afghanistan       33    65 1/22/20      0
## 2 <NA>           Afghanistan       33    65 1/23/20      0
## 3 <NA>           Afghanistan       33    65 1/24/20      0
## 4 <NA>           Afghanistan       33    65 1/25/20      0
## 5 <NA>           Afghanistan       33    65 1/26/20      0
## 6 <NA>           Afghanistan       33    65 1/27/20      0

To join two tables we need a common column in which each row is a unique name. This is often called the Key. None of the columns in this data set meets that criterion, but we can create a column to serve as our key by merging the names from the Province_State, Country_Region and Date columns using unite.
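For the confirmed-cases table this might look like the following sketch (remove = FALSE keeps the original columns alongside the new Key):

```r
time_series_confirmed_long <- time_series_confirmed_long %>%
  unite(Key, Province_State, Country_Region, Date,
        sep = ".", remove = FALSE)   # e.g. "NA.Afghanistan.1/22/20"
```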

## # A tibble: 6 x 7
##   Key                 Province_State Country_Region   Lat  Long Date   Confirmed
##   <chr>               <chr>          <chr>          <dbl> <dbl> <chr>      <dbl>
## 1 NA.Afghanistan.1/2… <NA>           Afghanistan       33    65 1/22/…         0
## 2 NA.Afghanistan.1/2… <NA>           Afghanistan       33    65 1/23/…         0
## 3 NA.Afghanistan.1/2… <NA>           Afghanistan       33    65 1/24/…         0
## 4 NA.Afghanistan.1/2… <NA>           Afghanistan       33    65 1/25/…         0
## 5 NA.Afghanistan.1/2… <NA>           Afghanistan       33    65 1/26/…         0
## 6 NA.Afghanistan.1/2… <NA>           Afghanistan       33    65 1/27/…         0

For the second table we can do the same and eliminate the columns that are redundant so that we just have the Key and Deaths columns
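A sketch of the same step for the deaths table, this time dropping the redundant columns (object names are illustrative):

```r
time_series_deaths_long <- time_series_deaths_long %>%
  unite(Key, Province_State, Country_Region, Date, sep = ".") %>%
  select(Key, Deaths)   # keep only the key and the death counts
```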

Now to join the tables. We will use a full_join, which means that keys common to both tables will be joined and will have counts for both Confirmed and Deaths; where there is no matching value, NA is returned for the missing one. select can be used to remove the key after the join since we don’t have further need for it.
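A sketch of the join, assuming the two long tables carry the Key column created above:

```r
time_series_long_joined <- full_join(time_series_confirmed_long,
                                     time_series_deaths_long,
                                     by = "Key")
```

The Key column can later be dropped with select(-Key) once it is no longer needed.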

## # A tibble: 6 x 8
##   Key           Province_State Country_Region   Lat  Long Date  Confirmed Deaths
##   <chr>         <chr>          <chr>          <dbl> <dbl> <chr>     <dbl>  <dbl>
## 1 NA.Afghanist… <NA>           Afghanistan       33    65 1/22…         0      0
## 2 NA.Afghanist… <NA>           Afghanistan       33    65 1/23…         0      0
## 3 NA.Afghanist… <NA>           Afghanistan       33    65 1/24…         0      0
## 4 NA.Afghanist… <NA>           Afghanistan       33    65 1/25…         0      0
## 5 NA.Afghanist… <NA>           Afghanistan       33    65 1/26…         0      0
## 6 NA.Afghanist… <NA>           Afghanistan       33    65 1/27…         0      0

Check to make sure neither the Confirmed nor the Deaths counts have NA as a value (both tables have the same number of rows, but we should check to make sure).
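One way to check, assuming the joined table is named time_series_long_joined:

```r
# which() returns the row positions of any NA values; integer(0) means none
which(is.na(time_series_long_joined$Confirmed))
which(is.na(time_series_long_joined$Deaths))
```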

## integer(0)
## integer(0)

There are no NA values.

Following the same set of steps to add the Recovered Data

## # A tibble: 6 x 8
##   Province_State Country_Region   Lat  Long Date    Confirmed Deaths Recovered
##   <chr>          <chr>          <dbl> <dbl> <chr>       <dbl>  <dbl>     <dbl>
## 1 <NA>           Afghanistan       33    65 1/22/20         0      0         0
## 2 <NA>           Afghanistan       33    65 1/23/20         0      0         0
## 3 <NA>           Afghanistan       33    65 1/24/20         0      0         0
## 4 <NA>           Afghanistan       33    65 1/25/20         0      0         0
## 5 <NA>           Afghanistan       33    65 1/26/20         0      0         0
## 6 <NA>           Afghanistan       33    65 1/27/20         0      0         0

Now we are in good shape to proceed to using these tables for graphing, but first let’s reformat the Date column.
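This is where lubridate comes in; a minimal sketch, assuming the joined table name from above:

```r
time_series_long_joined <- time_series_long_joined %>%
  mutate(Date = mdy(Date))   # lubridate: parse "1/22/20" as the date 2020-01-22
```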

Confirmed, Deaths and Recovered are all count data, and there may be times when we want to plot all three on the same graph. There are multiple ways to accomplish this; one is to create a single column with all three using pivot_longer.
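A sketch of that pivot, with time_series_long_joined2 as an illustrative name for the result:

```r
time_series_long_joined2 <- time_series_long_joined %>%
  pivot_longer(c(Confirmed, Deaths, Recovered),
               names_to  = "Report_Type",  # which count a row holds
               values_to = "Counts")
```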

## # A tibble: 6 x 7
##   Province_State Country_Region   Lat  Long Date       Report_Type Counts
##   <chr>          <chr>          <dbl> <dbl> <date>     <chr>        <dbl>
## 1 <NA>           Afghanistan       33    65 2020-01-22 Confirmed        0
## 2 <NA>           Afghanistan       33    65 2020-01-22 Deaths           0
## 3 <NA>           Afghanistan       33    65 2020-01-22 Recovered        0
## 4 <NA>           Afghanistan       33    65 2020-01-23 Confirmed        0
## 5 <NA>           Afghanistan       33    65 2020-01-23 Deaths           0
## 6 <NA>           Afghanistan       33    65 2020-01-23 Recovered        0

Making Graphs from the time series data

To make a time series graph of the confirmed cases we need to summarize the country data, to count up the individual state data for the US.
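A minimal sketch of a US time series plot, assuming the joined long table from above:

```r
time_series_long_joined %>%
  group_by(Country_Region, Date) %>%
  summarise(Confirmed = sum(Confirmed)) %>%  # sum state rows to a country total
  filter(Country_Region == "US") %>%
  ggplot(aes(x = Date, y = Confirmed)) +
  geom_point() +
  geom_line() +
  ggtitle("US COVID-19 Confirmed Cases")
```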

Let’s look at the US data in the context of a few other countries

Now several countries on the same graph

We can use the alternative data frame with the column Report_Type to show Confirmed, Deaths and Recovered. Because the counts for Deaths and Recovered are low relative to Confirmed, a log scale works best here for seeing the pattern.
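A sketch of such a plot, assuming the Report_Type table created earlier (note that log-scaling a y-axis containing zeros will trigger warnings about infinite values):

```r
time_series_long_joined2 %>%
  group_by(Country_Region, Report_Type, Date) %>%
  summarise(Counts = sum(Counts)) %>%
  filter(Country_Region == "US") %>%
  ggplot(aes(x = Date, y = Counts, color = Report_Type)) +
  geom_line() +
  scale_y_log10()   # compress the scale so Deaths and Recovered remain visible
```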

Visualizing data on maps

Here are two examples provided by Prof. Chris Sutherland and Anisa Dhana. They both rely on the maps package. For ideas on how to build on these examples, here is a nice tutorial on Maps in R using maps by Eric Anderson.

Adaptation of a post from Anisa Dhana on R-Bloggers

  • This code uses data in wide format from the time series above.

Load the additional libraries (and install if necessary)

Get the world map
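This step might be sketched as follows; map_data() is a ggplot2 helper that converts the maps package's outlines into a plottable data frame:

```r
library(ggplot2)
library(maps)

# A data frame of country polygon outlines (long, lat, group, region, ...)
world <- map_data("world")
```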

Cutoffs based on the number of cases

Plot of data from a month ago on Feb 25th

## Warning: Transformation introduced infinite values in discrete y-axis

## Warning: Transformation introduced infinite values in discrete y-axis
## Warning in sqrt(x): NaNs produced
## Warning: Removed 187 rows containing missing values (geom_point).

Adaptation of code from Prof. Chris Sutherland

Get data, pivot to longer format, mutate and summarize

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.

View the table

## # A tibble: 6 x 6
## # Groups:   Place [1]
##   Place           Date       cumulative_cases     Lat  Long Pandemic_day
##   <chr>           <date>                <dbl>   <dbl> <dbl>        <dbl>
## 1 -0.0236_37.9062 2020-01-21               NA -0.0236  37.9            0
## 2 -0.0236_37.9062 2020-01-22               NA -0.0236  37.9            1
## 3 -0.0236_37.9062 2020-01-23               NA -0.0236  37.9            2
## 4 -0.0236_37.9062 2020-01-24               NA -0.0236  37.9            3
## 5 -0.0236_37.9062 2020-01-25               NA -0.0236  37.9            4
## 6 -0.0236_37.9062 2020-01-26               NA -0.0236  37.9            5

Plot the data

## Warning: Removed 803 rows containing missing values (geom_point).

Exercises

The above graphs are minimal and meant to provide some starter code. I am sure you can make these richer and with greater impact. Create at least 2 new graphs (different from the ones above) from each of the categories of daily reports, time series and geographical maps that best convey aspects of the pandemic relevant to you.

Submit this exercise as a link from your Github page (Submit your Github website URL in Moodle).