Lab 6 : ggplotting with NEON MAG data

Learning objectives

  • Be able to add title, axis labels, legends, colors to ggplot graphs
  • Resize graphs within R code chunks
  • Print graphics to a file (e.g. jpeg, pdf)
  • Loading images into Markdown
  • Making interactive graphs

Fine tuning ggplots

Today we will build upon the graphing approaches in the with all the Data Carpentry ggplot tutorial

The Cookbook for R by Winston Chang is also great for tidying up our graphs.

Here are a couple of cheat sheets that can be useful

First load the libraries

R code
library(tidyverse)
library(janitor)
library(plotly)
library(DT)
library(lubridate)

Basic graph labels

R code
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(color=Species, shape=Species)) +
    labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width", color = "Plant Species", shape = "Plant Species") 

Themes

https://r-charts.com/ggplot2/themes/

R code
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(color=Species, shape=Species)) +
    labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width", color = "Plant Species", shape = "Plant Species") +
  theme_classic()

Colors

R code
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(color = "red", aes(shape = Species))+
    labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") 

R code
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(color = Species, shape = Species)) +
    scale_color_manual(values=c("blue", "purple", "red")) +
    labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") 

R code
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(color = Species, shape = Species)) +
    scale_color_brewer(palette="Dark2") +
    labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") 

R code
library(viridisLite)
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(fill = Species), color = "black", pch=21) +
    labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") 

viridisLite - Colorblind-Friendly Color Maps for R](https://sjmgarnier.github.io/viridisLite/)

R code
library(viridisLite)
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(color = Species, shape = Species)) +
    scale_colour_viridis_d() +
    labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") 

Understanding directory paths

Here’s an overview of Unix directory paths, which are fundamental to navigating and managing files in Unix-like operating systems (such as on Unity). There directory paths that are listed in examples will often be different than the ones you need to use on Unity.

Absolute vs. Relative Paths

Absolute Path: Starts from the root directory (/) and specifies the full path to a file or directory.

Example: /home/jlb/images/star_virus.png

Relative Path: Starts from the current working directory.

Example: images/star_virus.png (if you’re already in /home/jlb)

To find your absolute path in RStudiio. Type in the console (bottow left window) getwd()

> getwd()
[1] "/home/jlb_umass_edu"

Often I will put the data sets we are working with in our course directory which has the path

/work/pi_bio678_umass_edu/data_NEON/

therefore to load the data into RStudio

NEON_MAGs <- read_tsv("/work/pi_bio678_umass_edu/data_NEON/exported_img_bins_Gs0166454_NEON.tsv")

I suggest you make a new folder for today’s lab in names images in your home directory. When you create or plot a graph you can use the relative path

"images/iris_example_plot1.pdf"

or the absolute path

"home/jlb/images/iris_example_plot1.pdf"

Controlling graph size in Markdown

The dimensions of an individual graph in the Markdown document be adjusted by specifying the graph dimensions in the header for the r code chunk.

#| fig-height: 20
#| fig-width: 8

This is similar to what we used in previous labs where we wanted to show but not run the code

#| eval: false

If you want to run the code but not show the code

#| echo: false

Graphic Output

You may have realized that you can export plots in R Studio by clicking on Export in the Plots window that appears after you make a graph. You can save as a pdf, svg, tiff, png, bmp, jpeg and eps. You can also write the output directly to a file. This is particularly useful for controling the final dimensions in a reproducible way and for manuscripts.

R code
# Plot graph to a pdf outputfile
pdf("../images/iris_example_plot1.pdf", width=6, height=3)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() +
  labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") 
dev.off()
png 
  2 
R code
# Plot graph to a png outputfile
ppi <- 300
png("../images/iris_example_plot2.png", width=6*ppi, height=4*ppi, res=ppi)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point()
dev.off()
png 
  2 

For more details on sizing output Cookbook for R - Output to a file - PDF, PNG, TIFF, SVG

Markdown loading images

Sometimes it is useful in controlling the image layout for a report to file with the graph and then subsequently load it into the .qmd file. This works with png files, but not pdfs. You can also upload images made with other bioinformatic tools into your report.

![Iris example plot](../images/iris_example_plot2.png) 

Iris example plot

Another way to present a graph without the code is adding echo = FALSE within the r{} chunk - {r echo = FALSE}. This prevents code, but not the results from appearing in the knitr file.

Interactive graphs

With plotly/ggplotly (https://plot.ly/ggplot2/) you can make interactive graphs in your lab report.

R code
library(plotly)
R code
# Version 1
ggplotly(
  ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
    geom_point()
 )
R code
# Version 2
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point()
ggplotly(p)

Examples with the NEON data

Let’s load the table into R. This week I have made a number of changes to the data table to get it into a format in which we can work with both the soil and freshwater MAGs

This first block of code is for both the freshwater and soil MAGs

R code
# This is the location used for Github
NEON_MAGs_prelim <- read_tsv("../data/NEON_metadata/exported_img_bins_Gs0166454_NEON.tsv") |> 
# This is the location used for the class data directory on Unity
# NEON_MAGs_prelim <- read_tsv("/work/pi_bio678_umass_edu/data_NEON/exported_img_bins_Gs0166454_NEON.tsv") |>
  
  clean_names() |> 
  
  # Add a new column community corresponding to different communities names in the genome_name
  mutate(community = case_when(
    str_detect(genome_name, "Freshwater sediment microbial communities") ~ "Freshwater sediment microbial communitie",
    str_detect(genome_name, "Freshwater biofilm microbial communities") ~ "Freshwater biofilm microbial communities",
    str_detect(genome_name, "Freshwater microbial communities") ~ "Freshwater microbial communities",
    str_detect(genome_name, "Soil microbial communities") ~ "Soil microbial communities",
    TRUE ~ NA_character_
  )) |> 
  
  # Create a column type that is either Freshwater or Soil
  mutate(type = case_when(
    str_detect(genome_name, "Freshwater sediment microbial communities") ~ "Freshwater",
    str_detect(genome_name, "Freshwater biofilm microbial communities") ~ "Freshwater",
    str_detect(genome_name, "Freshwater microbial communities") ~ "Freshwater",
    str_detect(genome_name, "Soil microbial communities") ~ "Soil",
    TRUE ~ NA_character_
  )) |> 
  
  # Get rid of the communities strings
  mutate_at("genome_name", str_replace, "Freshwater sediment microbial communities from ", "") |> 
  mutate_at("genome_name", str_replace, "Freshwater biofilm microbial communities from", "") |> 
  mutate_at("genome_name", str_replace, "Freshwater microbial communities from ", "") |> 
  mutate_at("genome_name", str_replace, "Soil microbial communities from ", "") |> 

  # separate site from sample name 
  separate(genome_name, c("site","sample_name"), " - ") |>   
  
  # Deal with these unknow fields in the sample name by creating a new column and removing them from the sample name
  mutate(sample_unknown = case_when(
    str_detect(sample_name, ".SS.") ~ "SS",
    str_detect(sample_name, ".C0.") ~ "C0",
    str_detect(sample_name, ".C1.") ~ "C1",
    str_detect(sample_name, ".C2.") ~ "C2",
    TRUE ~ NA_character_
  )) |> 

# These fields are all associated with "Freshwater microbial communities from...   
# SS - near stream sensor
# C0 - non-stratified lake/river surface near buoy
# C1 - stratified lake surface/epilimnion near buoy
# C2 - stratified lake hypolimnion near buoy
# EPIPSAMMON  - biofilm on sand/silt
# EPILITHON - biofilm on rocks/cobbles
# EPIPHYTON - biofilm that grows on the stems and leaves of aquatic plants
  
# These fields are all associated with "Freshwater biofilm microbial communities from
# EPILITHON - biofilm on rocks/cobbles
  
# These fields are all associated with "Freshwater sediment microbial communities from 
# EPIPSAMMON - biofilm on sand/silt
  
  mutate_at("sample_name", str_replace, ".SS", "") |> 
  mutate_at("sample_name", str_replace, ".C0", "") |> 
  mutate_at("sample_name", str_replace, ".C1", "") |> 
  mutate_at("sample_name", str_replace, ".C2", "") |> 
  

  # Get rid of the the common strings at the end of sample names
  mutate_at("sample_name", str_replace, "-GEN-DNA1", "") |>  
  mutate_at("sample_name", str_replace, "-COMP-DNA1", "") |>  
  mutate_at("sample_name", str_replace, "-COMP-DNA2", "") |>  
  mutate_at("sample_name", str_replace, ".DNA-DNA1", "") |>
  mutate_at("sample_name", str_replace, "_v2", "") |>
  mutate_at("sample_name", str_replace, " \\(version 2\\)", "") |>
  mutate_at("sample_name", str_replace, " \\(version 3\\)", "") |>
  
# Separate out the taxonomy groups
  separate(gtdb_taxonomy_lineage, c("domain", "phylum", "class", "order", "family", "genus"), "; ", remove = FALSE)

This block is for separating out the freshwater sample names

R code
NEON_MAGs_freshwater <- NEON_MAGs_prelim |>
  filter(type == "Freshwater") |>
  # separate the Sample Name into Site ID and plot info
  separate(sample_name, c("site_ID","date", "layer", "subplot"), "\\.", remove = FALSE) |>
  mutate(quadrant = NA_character_)

This block is for separating out the soil sample names

R code
NEON_MAGs_soil <- NEON_MAGs_prelim |>
  filter(type == "Soil") |>
  # separate the Sample Name into Site ID and plot info
  separate(sample_name, c("site_ID","subplot.layer.date"), "_", remove = FALSE,) |> 
  # some sample names have 3 fields while others have a fourth field for the quadrant. This code create a field for the quadrant when present and adds na for samples from combined cores.
  extract(
    subplot.layer.date,
    into = c("subplot", "layer", "quadrant", "date"),
    regex = "^([^-]+)-([^-]+)(?:-([^-]+))?-([^-]+)$",
    remove = FALSE
  ) |>
  mutate(quadrant = na_if(quadrant, "")) |>
  select(-subplot.layer.date)

This combines the soil and freshwater data frames

R code
NEON_MAGs <-bind_rows(NEON_MAGs_freshwater, NEON_MAGs_soil) |>
  # This changes the format of the data column
  mutate(date = ymd(date))

Finding rows with NA

In some cases you will have NA on your graphs. If you would like to see which rows in a column contain NA use

R code
filter(is.na(column_name))

These represent potential novel taxonomic groups. If you would like to replace NA with unknown or novel

R code
mutate(column_name = replace_na(column_name, "novel"))

NEON graphs

Bar plots

Counts produced by ggplot

Note that in this graph ggplot produces the count automatically

R code
NEON_MAGs |> 
mutate(phylum = replace_na(phylum, "***NOVEL***")) |>
ggplot(aes(x = phylum)) +
  geom_bar() +
  coord_flip()

Use the forcats package in tidyverse to put the counts in descending order

R code
NEON_MAGs |> 
ggplot(aes(x = fct_infreq(phylum))) +
  geom_bar() +
  coord_flip()

Counts passed to ggplot

This is different code that creates the same graph as above. Note in this case the counts were first calculated in dplyr then passed to ggplot. Both x and y values are needed. Within geom_bar stat is set to “identify”

R code
NEON_MAGs |> 
  count(phylum) |> 
ggplot(aes(x = phylum, y = n)) +
  geom_col(stat = "identity") +
  coord_flip()

To put in descending order

jgy"Sepal Width #| fig-height: 10 #| fig-width: 8 NEON_MAGs |> count(phylum) |> ggplot(aes(x = reorder(phylum, n), y = n)) + geom_col(stat = "identity") + coord_flip()

Stacked vs multiple bar plots

R code
NEON_MAGs |> 
  mutate(phylum = replace_na(phylum, "***UNKNOWN***")) |> 
ggplot(aes(x = fct_rev(fct_infreq(phylum)), fill = type)) +
  geom_bar() +
  coord_flip() +
  labs(title = "2023 and 2024 NEON MAG Taxonomic Distribution", x = "Count", y = "Phylum") 

R code
NEON_MAGs |> 
mutate(phylum = replace_na(phylum, "***NOVEL***")) |>
ggplot(aes(x = fct_rev(fct_infreq(phylum)), fill = type)) +
  geom_bar(position = "dodge") +
  coord_flip()

Notice that the bars are of different width. This can be adjusted by setting the width

R code
NEON_MAGs |> 
mutate(phylum = replace_na(phylum, "***NOVEL***")) |> 
ggplot(aes(x = fct_rev(fct_infreq(phylum)), fill = type)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  coord_flip()

Multiple panels (facet_wrap)

R code
NEON_MAGs |> 
ggplot(aes(x = phylum, fill = type)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  coord_flip() +
  facet_wrap(vars(site_ID), scales = "free", ncol = 2)

Histogram

R code
NEON_MAGs |> 
ggplot(aes(x = total_number_of_bases)) +
  geom_histogram(bins = 50) 

Box plot

R code
NEON_MAGs  |>    
ggplot(aes(x = fct_infreq(phylum), y = total_number_of_bases)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))

Showing each point in the above plot

R code
NEON_MAGs |>   
ggplot(aes(x = fct_infreq(phylum), y = total_number_of_bases)) +
  geom_point() +
  coord_flip()

Exercises

For all exercises make complete graphs that are report ready. Relabel the x-axis, y-axis and legend for clarity, add a title, add color and size appropriately. The NAs in the taxonomy indicate a novel species starting with the highest level. For example a NA in a class that has an assigned phylum Proteobacteria would be a novel class in the phylum Proteobacteria. To filter Class and Order based on NA.

R code
NEON_MAGs |> 
  filter(is.na(class) | is.na(order))

Exercise 1

What are the overall class MAG counts?

Exercise 2

What are the MAG counts for each subplot. Color by site ID.

Exercise 3

How many novel bacteria were discovered (Show that number of NAs for each site)?

Exercise 4

How many novel bacterial MAGs are high quality vs medium quality?

Exercise 5

What phyla have novel bacterial genera?

Exercise 6

Make a stacked bar plot of the total number of MAGs at each site using Phylum as the fill.

Exercise 7

Using facet_wrap make plots of the total number of MAGs at each site for each phylum (e.g. similar to the example above but using the site ID and separating each graph by phylum.)

Exercise 8

What is the relationship between MAGs genome size and the number of genes? Color by Phylum.

Exercise 9

What is the relationship between scaffold count and MAG completeness?

Exercise 10

Separate out bin_id (e.g 3300078752_s0) into 2 columns metagenome_id and bin_num.

Exercise 11

The site column has strings like

  • Rio Cupeyes NEON Field Site, San Germán, Puerto Rico
  • Sycamore Creek NEON Field Site, Rio Verde, Arizona, USA
  • San Joaquin Experimental Range NEON Field site, Yosemite Lakes, California, USA
  • San Joaquin Experimental Range NEON Field Site, California, USA
  • Oak Ridge NEON Field Station, Tennessee, USA

Separate out this string into 4 columns site name (e.g. Rio Cupeyes), ’NEON field site(e.g. NEON Field Site/Station),region(e.g. Yosemite Lakes) andstate`.

This is a tough one that may require back and forth with copilot.

Exercise 12

Make a graph showing the number of MAGs in each state.