I recognize, and fully understand, that this data maybe emotionally difficult to work. My intention is to make these lab relevant, allowing you to gather your own insights directly from new visualizations of the data. Please let me know if you would rather not work with the data.
The virus has been recently renamed based on phylogenetic analysis (more on this next week) severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease caused by the virus is coronavirus disease (COVID-19). In this lab we will work with reporting data on COVID-19 cases, deaths and recoveries.
Researchers (Ensheng Dong, Hongru Du, Lauren Gardner) at John Hopkins University developed an interactive dashboard to visual data and track reported cases of coronavirus disease 2019 (SARS-CoV-2) in real time. The underlying data is collated from the following sources and updated several times a day.
World Health Organization (WHO) | DXY.cn. Pneumonia. 2020 | BNO News | National Health Commission of the People’s Republic of China (NHC) | China CDC (CCDC) | Hong Kong Department of Health | Macau Government | Taiwan CDC | US CDC | Government of Canada | Australia Government Department of Health | European Centre for Disease Prevention and Control (ECDC) | Ministry of Health Singapore (MOH) | Italy Ministry of Health | 1Point3Arces | WorldoMeters
JHU researchers make data that goes into the dashboard available on Github repo for Novel Coronavirus (COVID-19) Cases. In this lab we will work with this data.
You could start by creating a new Rproj and cloning the JHU repo as you did in Lab 6. However, we will not need all this data and it is easier to access the individual files directly as some are updated daily
Let’s take a look at the files and the structure of data in the files.
Open up the file to look at the structure
The file contains the columns
Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude
It is important to note that for some countries there is only one row, while for others (e.g. China and US) there are multiple rows representing different provinces or states. Thus, we will need to sum these rows to get a total count for the US and China when we make graphs. From experience in making this tutorial I know the Column names with / will cause errors in ggplot ().
Next look at the time series data which is a complilation of the daily reports for each reporting type (Confirmed, Deaths, Recovered)
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20…
This data is what is called wide format, where values of the variable being observed (Confirmed cases) are spread out across columns (Here: columns for each day). Another way of describing this is that there is more than one measurement per row. This wide format works well for data entry and sometimes works well for analysis but can be cumbersome when using R. The tidyr package allows us to quickly switch between wide format and what is called long format where there is only 1 column for the cases.
We are going to work with the data using tidyverse functions and use a the library lubridate for reformatting the dates
Let’s start by plotting data from our last class on 3/11/2020 (the file looked at above). We can read data directly from the Github site.
Below I have used the url so that it will get the most current version. This method also loads the file into directly in R (memory) and doesn’t save it on your computer.
time_series_confirmed_long2 <- read_csv(url(“https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv”))
report_03_11_2020 <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-11-2020.csv")) %>%
rename(Country_Region = "Country/Region", Province_State = "Province/State")Check the table properties to make sure the data imported as we expected
## # A tibble: 6 x 8
## Province_State Country_Region `Last Update` Confirmed Deaths Recovered
## <chr> <chr> <dttm> <dbl> <dbl> <dbl>
## 1 Hubei China 2020-03-11 10:53:02 67773 3046 49134
## 2 <NA> Italy 2020-03-11 21:33:02 12462 827 1045
## 3 <NA> Iran 2020-03-11 18:52:03 9000 354 2959
## 4 <NA> Korea, South 2020-03-11 21:13:18 7755 60 288
## 5 France France 2020-03-11 22:53:03 2281 48 12
## 6 <NA> Spain 2020-03-11 20:53:02 2277 54 183
## # … with 2 more variables: Latitude <dbl>, Longitude <dbl>
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 216 obs. of 8 variables:
## $ Province_State: chr "Hubei" NA NA NA ...
## $ Country_Region: chr "China" "Italy" "Iran" "Korea, South" ...
## $ Last Update : POSIXct, format: "2020-03-11 10:53:02" "2020-03-11 21:33:02" ...
## $ Confirmed : num 67773 12462 9000 7755 2281 ...
## $ Deaths : num 3046 827 354 60 48 ...
## $ Recovered : num 49134 1045 2959 288 12 ...
## $ Latitude : num 31 43 32 36 46.2 ...
## $ Longitude : num 112.27 12 53 128 2.21 ...
## - attr(*, "spec")=
## .. cols(
## .. `Province/State` = col_character(),
## .. `Country/Region` = col_character(),
## .. `Last Update` = col_datetime(format = ""),
## .. Confirmed = col_double(),
## .. Deaths = col_double(),
## .. Recovered = col_double(),
## .. Latitude = col_double(),
## .. Longitude = col_double()
## .. )
## # A tibble: 6 x 12
## FIPS Admin2 Province_State Country_Region Last_Update Lat Long_
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 45001 Abbev… South Carolina US 2020-03-23 23:19:34 34.2 -82.5
## 2 22001 Acadia Louisiana US 2020-03-23 23:19:34 30.3 -92.4
## 3 51001 Accom… Virginia US 2020-03-23 23:19:34 37.8 -75.6
## 4 16001 Ada Idaho US 2020-03-23 23:19:34 43.5 -116.
## 5 19001 Adair Iowa US 2020-03-23 23:19:34 41.3 -94.5
## 6 21001 Adair Kentucky US 2020-03-23 23:19:34 37.1 -85.3
## # … with 5 more variables: Confirmed <dbl>, Deaths <dbl>, Recovered <dbl>,
## # Active <dbl>, Combined_Key <chr>
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3415 obs. of 12 variables:
## $ FIPS : chr "45001" "22001" "51001" "16001" ...
## $ Admin2 : chr "Abbeville" "Acadia" "Accomack" "Ada" ...
## $ Province_State: chr "South Carolina" "Louisiana" "Virginia" "Idaho" ...
## $ Country_Region: chr "US" "US" "US" "US" ...
## $ Last_Update : POSIXct, format: "2020-03-23 23:19:34" "2020-03-23 23:19:34" ...
## $ Lat : num 34.2 30.3 37.8 43.5 41.3 ...
## $ Long_ : num -82.5 -92.4 -75.6 -116.2 -94.5 ...
## $ Confirmed : num 1 1 1 13 1 0 2 0 18 0 ...
## $ Deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Active : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Combined_Key : chr "Abbeville, South Carolina, US" "Acadia, Louisiana, US" "Accomack, Virginia, US" "Ada, Idaho, US" ...
## - attr(*, "spec")=
## .. cols(
## .. FIPS = col_character(),
## .. Admin2 = col_character(),
## .. Province_State = col_character(),
## .. Country_Region = col_character(),
## .. Last_Update = col_datetime(format = ""),
## .. Lat = col_double(),
## .. Long_ = col_double(),
## .. Confirmed = col_double(),
## .. Deaths = col_double(),
## .. Recovered = col_double(),
## .. Active = col_double(),
## .. Combined_Key = col_character()
## .. )
Let’s start with a plot of confirmed cases in US states as of March 11th.
report_03_11_2020 %>%
filter (Country_Region == "US") %>%
ggplot(aes(x = Confirmed, y = reorder(Province_State, Confirmed))) +
geom_point() +
ggtitle("Confirmed cases for each US State") +
ylab("Country/Region") +
xlab("Confirmed Cases")If we try to graph by Country_Region there are entries that have the same character (e.g. China and US) so we need to summarise the data we are interested in graphing
report_03_11_2020 %>%
group_by(Country_Region) %>%
summarise(Deaths = sum(Deaths)) %>%
arrange(desc(Deaths))## # A tibble: 116 x 2
## Country_Region Deaths
## <chr> <dbl>
## 1 China 3161
## 2 Italy 827
## 3 Iran 354
## 4 Korea, South 60
## 5 Spain 54
## 6 France 48
## 7 US 36
## 8 Japan 15
## 9 United Kingdom 8
## 10 Cruise Ship 7
## # … with 106 more rows
There are 116 countries which are a lot to put on a single graph, so let’s just start with the countries with the most cases. In past labs we have only made bar graphs from histogram data, but values can be used where height of the bar will represent the value in a column of the data frame. This is done by using stat=“identity” instead of the default, stat=“bin”.
report_03_11_2020 %>%
group_by(Country_Region) %>%
summarise(Deaths = sum(Deaths)) %>%
arrange(desc(Deaths)) %>%
slice(1:20) %>%
ggplot(aes(x = Deaths, y = reorder(Country_Region, Deaths))) +
geom_bar(stat = 'identity') +
ggtitle("The 20 countries with the most reported COV19-related deaths") +
ylab("Country/Region") +
xlab("Deaths")There is some data wrangling required for working with the time series data. As of March 26, 2020 it still used the old column headers so we need to rename them. I expect this will change soon.
Start by loading the most recent times series data for confirmed cases
time_series_confirmed <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")) %>%
rename(Province_State = "Province/State", Country_Region = "Country/Region")Check the table properties to make sure the data imported as we expected
## # A tibble: 6 x 81
## Province_State Country_Region Lat Long `1/22/20` `1/23/20` `1/24/20`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> Afghanistan 33 65 0 0 0
## 2 <NA> Albania 41.2 20.2 0 0 0
## 3 <NA> Algeria 28.0 1.66 0 0 0
## 4 <NA> Andorra 42.5 1.52 0 0 0
## 5 <NA> Angola -11.2 17.9 0 0 0
## 6 <NA> Antigua and B… 17.1 -61.8 0 0 0
## # … with 74 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
## # `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
## # `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
## # `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
## # `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
## # `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
## # `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>,
## # `2/21/20` <dbl>, `2/22/20` <dbl>, `2/23/20` <dbl>, `2/24/20` <dbl>,
## # `2/25/20` <dbl>, `2/26/20` <dbl>, `2/27/20` <dbl>, `2/28/20` <dbl>,
## # `2/29/20` <dbl>, `3/1/20` <dbl>, `3/2/20` <dbl>, `3/3/20` <dbl>,
## # `3/4/20` <dbl>, `3/5/20` <dbl>, `3/6/20` <dbl>, `3/7/20` <dbl>,
## # `3/8/20` <dbl>, `3/9/20` <dbl>, `3/10/20` <dbl>, `3/11/20` <dbl>,
## # `3/12/20` <dbl>, `3/13/20` <dbl>, `3/14/20` <dbl>, `3/15/20` <dbl>,
## # `3/16/20` <dbl>, `3/17/20` <dbl>, `3/18/20` <dbl>, `3/19/20` <dbl>,
## # `3/20/20` <dbl>, `3/21/20` <dbl>, `3/22/20` <dbl>, `3/23/20` <dbl>,
## # `3/24/20` <dbl>, `3/25/20` <dbl>, `3/26/20` <dbl>, `3/27/20` <dbl>,
## # `3/28/20` <dbl>, `3/29/20` <dbl>, `3/30/20` <dbl>, `3/31/20` <dbl>,
## # `4/1/20` <dbl>, `4/2/20` <dbl>, `4/3/20` <dbl>, `4/4/20` <dbl>,
## # `4/5/20` <dbl>, `4/6/20` <dbl>, `4/7/20` <dbl>
As noted above this data is in wide format. To convert to long format
time_series_confirmed_long <- time_series_confirmed %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "Confirmed") Let’s look at the format of the data frame (tibble) now
## # A tibble: 6 x 6
## Province_State Country_Region Lat Long Date Confirmed
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 <NA> Afghanistan 33 65 1/22/20 0
## 2 <NA> Afghanistan 33 65 1/23/20 0
## 3 <NA> Afghanistan 33 65 1/24/20 0
## 4 <NA> Afghanistan 33 65 1/25/20 0
## 5 <NA> Afghanistan 33 65 1/26/20 0
## 6 <NA> Afghanistan 33 65 1/27/20 0
It would be convenient to have the confirmed cases and deaths in the same table. We can create another table with the deaths and then join the two tables.
Let’s get the times series data for deaths
time_series_deaths <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")) %>%
rename(Province_State = "Province/State", Country_Region = "Country/Region")and convert it to long format
time_series_deaths_long <- time_series_deaths %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "Deaths")
head(time_series_deaths_long)## # A tibble: 6 x 6
## Province_State Country_Region Lat Long Date Deaths
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 <NA> Afghanistan 33 65 1/22/20 0
## 2 <NA> Afghanistan 33 65 1/23/20 0
## 3 <NA> Afghanistan 33 65 1/24/20 0
## 4 <NA> Afghanistan 33 65 1/25/20 0
## 5 <NA> Afghanistan 33 65 1/26/20 0
## 6 <NA> Afghanistan 33 65 1/27/20 0
To join 2 tables we need a common column in which each row is a unique name. This is often called the Key. None of the columns in this data set meet that criteria, but we can create a column to serve as our key by mergering the names for the Province_State, Country_Region and Date columns using unite
time_series_confirmed_long <- time_series_confirmed_long %>%
unite(Key, Province_State, Country_Region, Date, sep = ".", remove = FALSE)
head(time_series_confirmed_long)## # A tibble: 6 x 7
## Key Province_State Country_Region Lat Long Date Confirmed
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 NA.Afghanistan.1/2… <NA> Afghanistan 33 65 1/22/… 0
## 2 NA.Afghanistan.1/2… <NA> Afghanistan 33 65 1/23/… 0
## 3 NA.Afghanistan.1/2… <NA> Afghanistan 33 65 1/24/… 0
## 4 NA.Afghanistan.1/2… <NA> Afghanistan 33 65 1/25/… 0
## 5 NA.Afghanistan.1/2… <NA> Afghanistan 33 65 1/26/… 0
## 6 NA.Afghanistan.1/2… <NA> Afghanistan 33 65 1/27/… 0
For the second table we can do the same and eliminate the columns that are redundant so that we just have the Key and Deaths columns
time_series_deaths_long <- time_series_deaths_long %>%
unite(Key, Province_State, Country_Region, Date, sep = ".") %>%
select(Key, Deaths)Now to join the tables. A full_join which means that the keys that are common to the tables will be joined and there will be counts for both Confirmed and Deaths. Where there are not matching values, returns NA for the one missing. select can be used to remove the key after the join since we don’t have further need for it.
time_series_long_joined <- full_join(time_series_confirmed_long,
time_series_deaths_long, by = c("Key"))
head(time_series_long_joined)## # A tibble: 6 x 8
## Key Province_State Country_Region Lat Long Date Confirmed Deaths
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 NA.Afghanist… <NA> Afghanistan 33 65 1/22… 0 0
## 2 NA.Afghanist… <NA> Afghanistan 33 65 1/23… 0 0
## 3 NA.Afghanist… <NA> Afghanistan 33 65 1/24… 0 0
## 4 NA.Afghanist… <NA> Afghanistan 33 65 1/25… 0 0
## 5 NA.Afghanist… <NA> Afghanistan 33 65 1/26… 0 0
## 6 NA.Afghanist… <NA> Afghanistan 33 65 1/27… 0 0
Check to make sure neither the Cofirmed or Death counts have NA as a value (both tables have the same number of rows, but we should check to make sure)
## integer(0)
## integer(0)
There are no NA values.
Following the same set of steps to add the Recovered Data
### download the file
time_series_recovered <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")) %>%
rename(Province_State = "Province/State", Country_Region = "Country/Region")
### convert to long format
time_series_recovered_long <- time_series_recovered %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "Recovered")
### Create the Key
time_series_recovered_long <- time_series_recovered_long %>%
unite(Key, Province_State, Country_Region, Date, sep = ".") %>%
select(Key, Recovered)
### Merge with prior table (only this time will delete the Key column
### because we have no further use for it)
time_series_long_joined <- full_join(time_series_long_joined,
time_series_recovered_long, by = c("Key")) %>%
select(-Key)
head(time_series_long_joined)## # A tibble: 6 x 8
## Province_State Country_Region Lat Long Date Confirmed Deaths Recovered
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 <NA> Afghanistan 33 65 1/22/20 0 0 0
## 2 <NA> Afghanistan 33 65 1/23/20 0 0 0
## 3 <NA> Afghanistan 33 65 1/24/20 0 0 0
## 4 <NA> Afghanistan 33 65 1/25/20 0 0 0
## 5 <NA> Afghanistan 33 65 1/26/20 0 0 0
## 6 <NA> Afghanistan 33 65 1/27/20 0 0 0
Now we are in good shape to proceed to using these table for graphing, but first reformat Date
Confirmed, Deaths and Recovered are all count data, and there may be times when we want to plot all three on the same graph. There are multiple ways to accomplish this and one is through creating a column will all 3 using _pivot_long
time_series_long_joined_counts <- time_series_long_joined %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long, Date),
names_to = "Report_Type", values_to = "Counts")
head(time_series_long_joined_counts)## # A tibble: 6 x 7
## Province_State Country_Region Lat Long Date Report_Type Counts
## <chr> <chr> <dbl> <dbl> <date> <chr> <dbl>
## 1 <NA> Afghanistan 33 65 2020-01-22 Confirmed 0
## 2 <NA> Afghanistan 33 65 2020-01-22 Deaths 0
## 3 <NA> Afghanistan 33 65 2020-01-22 Recovered 0
## 4 <NA> Afghanistan 33 65 2020-01-23 Confirmed 0
## 5 <NA> Afghanistan 33 65 2020-01-23 Deaths 0
## 6 <NA> Afghanistan 33 65 2020-01-23 Recovered 0
To make a times series graph of the confirmed cases we need to summarize the Country date to count up the individual state data for the US.
time_series_long_joined %>%
group_by(Country_Region,Date) %>%
summarise_at(c("Confirmed", "Deaths", "Recovered"), sum) %>%
filter (Country_Region == "US") %>%
ggplot(aes(x = Date, y = Confirmed)) +
geom_point() +
geom_line() +
ggtitle("US Confirmed COVID-19 Cases")Let’s look at the US data in the context of a few other countries
time_series_long_joined %>%
group_by(Country_Region,Date) %>%
summarise_at(c("Confirmed", "Deaths", "Recovered"), sum) %>%
filter (Country_Region %in% c("China","Japan", "Korea, South",
"Italy","Spain", "US")) %>%
ggplot(aes(x = Date, y = Confirmed)) +
geom_point() +
geom_line() +
ggtitle("Confirmed COVID-19 Cases") +
facet_wrap(~Country_Region, ncol=2, scales="free_y")Now several countries on the same graph
time_series_long_joined %>%
group_by(Country_Region,Date) %>%
summarise_at(c("Confirmed", "Deaths", "Recovered"), sum) %>%
filter (Country_Region %in% c("China","France","Italy",
"Korea, South", "US")) %>%
ggplot(aes(x = Date, y = Confirmed, color = Country_Region)) +
geom_point() +
geom_line() +
ggtitle("Confirmed COVID-19 Cases")We can use the alternative data frame with column Report_Type to show Confirmed, Deaths and Recovered. Because the counts for Deaths and Recovered are low relative to Confirmed a log scale works best here for seeing the pattern
time_series_long_joined_counts %>%
group_by(Country_Region, Report_Type, Date) %>%
summarise(Counts = sum(Counts)) %>%
filter (Country_Region == "US") %>%
filter (Report_Type %in% c("Confirmed", "Deaths")) %>%
ggplot(aes(x = Date, y = log2(Counts), fill = Report_Type, color = Report_Type)) +
geom_point() +
geom_line() +
ggtitle("US COVID-19 Cases")Here are two examples provided by Prof. Chris Sutherland and Anisa Dhana. They both rely on using the maps package. For ideas on how to build on these examples here is a nice tutorial on Maps in R using maps by Eric Anderson
Load the additional libraries (and install if necessary)
Get the world map
Cutoffs based on the number of cases
Plot of data from a month ago on Feb 25th
ggplot() +
geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
geom_point(data=time_series_confirmed, aes(x=Long, y=Lat, size=`2/25/20`, color=`2/25/20`),stroke=F, alpha=0.7) +
scale_size_continuous(name="Cases", trans="log", range=c(1,7),breaks=mybreaks, labels = c("1-19", "20-99", "100-999", "1,000-49,999", "50,000+")) +
# scale_alpha_continuous(name="Cases", trans="log", range=c(0.1, 0.9),breaks=mybreaks) +
scale_color_viridis_c(option="inferno",name="Cases", trans="log",breaks=mybreaks, labels = c("1-19", "20-99", "100-999", "1,000-49,999", "50,000+")) +
theme_void() +
guides( colour = guide_legend()) +
labs(caption = "") +
theme(
legend.position = "bottom",
text = element_text(color = "#22211d"),
plot.background = element_rect(fill = "#ffffff", color = NA),
panel.background = element_rect(fill = "#ffffff", color = NA),
legend.background = element_rect(fill = "#ffffff", color = NA)
)## Warning: Transformation introduced infinite values in discrete y-axis
## Warning: Transformation introduced infinite values in discrete y-axis
## Warning in sqrt(x): NaNs produced
## Warning: Removed 187 rows containing missing values (geom_point).
Get data, pivot to longer format, mutate and summarize
time_series_confirmed_long2 <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")) %>%
rename(Province_State = "Province/State", Country_Region = "Country/Region") %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "cumulative_cases") %>%
mutate(Date = mdy(Date) - days(1),
Place = paste(Lat,Long,sep="_")) %>%
group_by(Place,Date) %>%
summarise(cumulative_cases = ifelse(sum(cumulative_cases)>0,
sum(cumulative_cases),NA_real_),
Lat = mean(Lat),
Long = mean(Long)) %>%
mutate(Pandemic_day = as.numeric(Date - min(Date)))## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
View the table
## # A tibble: 6 x 6
## # Groups: Place [1]
## Place Date cumulative_cases Lat Long Pandemic_day
## <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 -0.0236_37.9062 2020-01-21 NA -0.0236 37.9 0
## 2 -0.0236_37.9062 2020-01-22 NA -0.0236 37.9 1
## 3 -0.0236_37.9062 2020-01-23 NA -0.0236 37.9 2
## 4 -0.0236_37.9062 2020-01-24 NA -0.0236 37.9 3
## 5 -0.0236_37.9062 2020-01-25 NA -0.0236 37.9 4
## 6 -0.0236_37.9062 2020-01-26 NA -0.0236 37.9 5
Plot the data
static <- ggplot(subset(time_series_confirmed_long2,Date %in% seq(min(Date),max(Date),14)),
aes(x = Long, y = Lat, size = cumulative_cases/1000)) +
borders("world", colour = NA, fill = "grey90") +
theme_bw() +
geom_point(shape = 21, color='purple', fill='purple', alpha = 0.5) +
labs(title = 'COVID-19 spread',x = '', y = '',
size="Cases (x1000))") +
theme(legend.position = "right") +
coord_fixed(ratio=1)+
facet_wrap(.~Date,ncol=1)
static## Warning: Removed 803 rows containing missing values (geom_point).
The above graphs are minimal and meant to provide some starter code. I am sure you can make these richer and with greater impact. Create at least 2 new graphs (different from the ones above) each from the categories of daily reports, time series and geographical maps that best convey aspects of the pandemic relevant to you.
Submit this exercise as a link from your Github page (Submit your Github website URL in Moodle).