Lab 5 : Working with NEON Metagenome Assembled Genomes (MAGs / bins)

Learning objectives

  • Introduction to the National Ecological Observatory Network
  • Introduction to the JGI and the Integrated Microbial Genomes and Microbiomes (IMG/MER)
  • Creating nicely displayed and interactive tables for reports
  • Wrangling the NEON data

Background

We will going over the following in class

Our Project Space

NEON has produced metagenomic data as data product since 2014. The Joint Genome Institute has recently annotated version all reads prior to 20222 Gs0144570. However, these metagenomes have a low sequencing depth and therefore are difficult to use for assembling bacterial genomes.

Last year JGI and NEON collaborated to produce metagenomes which are ~10-fold deeper which allow for better assemble of reads into gene length and greater fragments. We will work with the NEON 2023 and 2024 data Gs0166454 study sets.

On the Computer

R code
library(tidyverse)
library(knitr)
library(janitor)
library(DT)

kable

Don’t display tables with thousands of rows. If you do you may run out of memory and/or you will generate a very large html file.

By default, Quarto displays data frames and matrixes as they would be in the R terminal (in a monospaced font). You have seen already what they look like. Here is an example for the iris data set that is preloaded when you start R. We will make a subset of the table.

R code
iris_setosa <- iris |> 
filter(Species == "setosa") |> 
filter(Sepal.Length > 5)

To make a table more readable in a report the knitr::kable function works nice for simple customizable tables.

R code
library(knitr)
kable(iris_setosa)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
5.4 3.7 1.5 0.2 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
5.1 3.3 1.7 0.5 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
5.5 3.5 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.1 3.8 1.9 0.4 setosa
5.1 3.8 1.6 0.2 setosa
5.3 3.7 1.5 0.2 setosa

kable works for small tables, but even 22 rows is too much to display in a report. If you have larger tables and/or want to make them interactive, the DT works well.

DT

R code
library(DT)
datatable(iris_setosa)

There are two options for using datatable in your R code chunk. (1) Bound the code chunk you want to present by datatable

R code
datatable(
  iris |> 
    filter(Species == "setosa") |> 
    filter(Sepal.Length > 5)
)

Or create a new object

R code
iris_setosa <- iris |> 
  filter(Species == "setosa") |> 
  filter(Sepal.Length > 5)

datatable(iris_setosa)

Examples using the NEON data table

Our data that we will work with today can be found by searching metagenome bins associated with GOLD Study ID Gs0166454 IMG/MER

A description of the column headers for the file we will work with

  • Bin ID - Metagenome Assemble Genome (MAG) ID
  • Genome Name - The metagenome sample name
  • IMG Genome ID - The metagenome sample ID
  • Bin Quality - An estimate of the quality of the bin or MAG
  • Bin Lineage - Taxonomic lineage using the JGI system
  • GTDB-Tk Taxonomy Lineage - Taxonomic lineage using GTDB
  • Bin Methods - The methods for binning contigs and quality control
  • Created By - The process by which the bins were created
  • Date Added - Date sample/metagenome was added
  • Bin Completeness - An estimate of the completeness of the MAG
  • Bin Contamination - An estimate of the contamination of the MAG
  • Total Number of Bases - MAG size in bases
  • 5s rRNA - Count of 5s rRNAs in the MAG
  • 16s rRNA - Count of 16S rRNAs in the MAG
  • 23s rRNA - Count of 23s rRNAs in the MAG
  • tRNA Genes - Count of tRNA genes in the MAG
  • Gene Count - Count of number of genes in the MAG
  • Scaffold Count - Number of separate scaffold comprising the MAG. The ideal would be 1
  • GOLD Study ID - The ID in the JGI GOLD database

Let’s load the table into R

R code
# This is the location used for Github
NEON_MAGs <- read_tsv("../data/NEON_metadata/exported_img_bins_Gs0166454_NEON.tsv")
# This is the location used for the class data directory on Unity
# NEON_MAGs <- read_tsv("/work/pi_bio678_umass_edu/data_NEON/exported_img_bins_Gs0166454_NEON.tsv")

As always in the Environment window check to see if the table loaded as expect and what the object types are. Or you can do it in your R console

R code
head(NEON_MAGs)
# A tibble: 6 × 21
  bin_oid     `Bin ID` `Genome Name` `IMG Genome ID` `Bin Quality` `Bin Lineage`
  <chr>       <chr>    <chr>                   <dbl> <chr>         <chr>        
1 3300075492… 3300075… Soil microbi…      3300075492 MQ            Bacteria     
2 3300075492… 3300075… Soil microbi…      3300075492 MQ            Bacteria     
3 3300075492… 3300075… Soil microbi…      3300075492 MQ            <NA>         
4 3300075492… 3300075… Soil microbi…      3300075492 MQ            Bacteria; Ac…
5 3300075492… 3300075… Soil microbi…      3300075492 MQ            Bacteria; Ac…
6 3300075492… 3300075… Soil microbi…      3300075492 MQ            Bacteria; Ac…
# ℹ 15 more variables: `GTDB Taxonomy Lineage` <chr>, `Bin Methods` <chr>,
#   `Created By` <chr>, `Date Added` <date>, `Bin Completeness` <dbl>,
#   `Bin Contamination` <dbl>, `Average Coverage` <lgl>,
#   `Total Number of Bases` <dbl>, `5s rRNA` <dbl>, `16s rRNA` <dbl>,
#   `23s rRNA` <dbl>, `tRNA Genes` <dbl>, `Gene Count` <dbl>,
#   `Scaffold Count` <dbl>, `GOLD Study ID` <chr>
R code
str(NEON_MAGs)
spc_tbl_ [16,669 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ bin_oid              : chr [1:16669] "3300075492_s0" "3300075492_s1" "3300075492_s100" "3300075492_s106" ...
 $ Bin ID               : chr [1:16669] "3300075492_s0" "3300075492_s1" "3300075492_s100" "3300075492_s106" ...
 $ Genome Name          : chr [1:16669] "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ ...
 $ IMG Genome ID        : num [1:16669] 3.3e+09 3.3e+09 3.3e+09 3.3e+09 3.3e+09 ...
 $ Bin Quality          : chr [1:16669] "MQ" "MQ" "MQ" "MQ" ...
 $ Bin Lineage          : chr [1:16669] "Bacteria" "Bacteria" NA "Bacteria; Actinomycetota; Thermoleophilia; Solirubrobacterales" ...
 $ GTDB Taxonomy Lineage: chr [1:16669] "Bacteria; Acidobacteriota; Terriglobia; Acidoferrales; UBA7541; Acidoferrum" "Bacteria; Desulfobacterota_B; Binatia; Binatales; Binataceae; Binatus; Binatus soli" "Archaea; Thermoplasmatota; Thermoplasmata; UBA184; UBA184; UBA184" "Bacteria; Actinomycetota; Thermoleophilia; Solirubrobacterales; Solirubrobacteraceae; Palsa-744" ...
 $ Bin Methods          : chr [1:16669] "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" ...
 $ Created By           : chr [1:16669] "IMG_PIPELINE" "IMG_PIPELINE" "IMG_PIPELINE" "IMG_PIPELINE" ...
 $ Date Added           : Date[1:16669], format: "2025-01-23" "2025-01-23" ...
 $ Bin Completeness     : num [1:16669] 95.9 99.6 57.9 74 67.3 ...
 $ Bin Contamination    : num [1:16669] 6.31 0 2.39 8.19 3.95 6.54 5.24 0.63 0.61 4.56 ...
 $ Average Coverage     : logi [1:16669] NA NA NA NA NA NA ...
 $ Total Number of Bases: num [1:16669] 5600425 3706224 1233791 2084993 3809196 ...
 $ 5s rRNA              : num [1:16669] 1 1 1 1 0 0 0 1 1 0 ...
 $ 16s rRNA             : num [1:16669] 2 0 0 1 0 0 0 0 0 0 ...
 $ 23s rRNA             : num [1:16669] 2 1 1 1 0 0 0 1 0 0 ...
 $ tRNA Genes           : num [1:16669] 57 51 30 33 31 33 29 25 30 30 ...
 $ Gene Count           : num [1:16669] 4976 3617 1344 2284 3848 ...
 $ Scaffold Count       : num [1:16669] 26 45 210 322 398 85 527 435 243 409 ...
 $ GOLD Study ID        : chr [1:16669] "Gs0166454" "Gs0166454" "Gs0166454" "Gs0166454" ...
 - attr(*, "spec")=
  .. cols(
  ..   bin_oid = col_character(),
  ..   `Bin ID` = col_character(),
  ..   `Genome Name` = col_character(),
  ..   `IMG Genome ID` = col_double(),
  ..   `Bin Quality` = col_character(),
  ..   `Bin Lineage` = col_character(),
  ..   `GTDB Taxonomy Lineage` = col_character(),
  ..   `Bin Methods` = col_character(),
  ..   `Created By` = col_character(),
  ..   `Date Added` = col_date(format = ""),
  ..   `Bin Completeness` = col_double(),
  ..   `Bin Contamination` = col_double(),
  ..   `Average Coverage` = col_logical(),
  ..   `Total Number of Bases` = col_double(),
  ..   `5s rRNA` = col_double(),
  ..   `16s rRNA` = col_double(),
  ..   `23s rRNA` = col_double(),
  ..   `tRNA Genes` = col_double(),
  ..   `Gene Count` = col_double(),
  ..   `Scaffold Count` = col_double(),
  ..   `GOLD Study ID` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

You might also notice that many columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names. To refer to these variables, you need to surround them with backticks, `. Last week in Chapter 7 we used janitor::clean_names() to use some heuristics to turn them all into snake case at once.

R code
NEON_MAGs <- NEON_MAGs |> janitor::clean_names()

Count the number of MQ and HQ genomes

R code
NEON_MAGs |> 
  count(bin_quality, sort = TRUE) 
# A tibble: 2 × 2
  bin_quality     n
  <chr>       <int>
1 MQ          15369
2 HQ           1300

Make a knitr::kable table of the bin quality counts

R code
kable(
  NEON_MAGs |> 
   count(bin_quality) 
)
bin_quality n
HQ 1300
MQ 15369

Filter so that Bin Quality = HQ and display in DT::datatable

R code
datatable(
  NEON_MAGs|> 
    filter(bin_quality == "HQ")
)

Select the GTDB taxonomy and the MAGs genome size then filter to all MAGs greater than 10,000,000 bases

R code
kable(
NEON_MAGs |> 
  select(c(gtdb_taxonomy_lineage, total_number_of_bases)) |> 
  filter(total_number_of_bases > 10000000)
)
gtdb_taxonomy_lineage total_number_of_bases
Bacteria; Pseudomonadota; Gammaproteobacteria; Steroidobacterales; Steroidobacteraceae; 13-2-20CM-66-19 10115899
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae 11932805
Bacteria; Actinomycetota; Actinomycetes; Streptomycetales; Catenulisporaceae; Catenulispora 12420050
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae; JAJPJC01 13151516
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae 12046820
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae; JAQGHR01 12217509
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae 12942605
Bacteria; Actinomycetota; Actinomycetes; Mycobacteriales; Pseudonocardiaceae; Actinophytocola 11293845
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae 14411455
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae 10890912
Bacteria; Planctomycetota; Planctomycetia; Pirellulales; JAICIG01; JAICLL01 11379784
Bacteria; Acidobacteriota; Terriglobia; Bryobacterales; Bryobacteraceae; PALSA-243 10475466
NA 10248256
Bacteria; Pseudomonadota; Alphaproteobacteria; Rhizobiales; Xanthobacteraceae 10436042
Bacteria; Planctomycetota; Planctomycetia; Gemmatales; Gemmataceae 13087515
Bacteria; Actinomycetota; Actinomycetes; Mycobacteriales; Pseudonocardiaceae; Actinophytocola 10431101
Bacteria; Myxococcota; Polyangia; Polyangiales; JAFGIB01 10030262
Bacteria; Myxococcota; Polyangia; Polyangiales; Polyangiaceae; JANYGI01 12560711
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae 11026377
Bacteria; Planctomycetota; Planctomycetia; Gemmatales; Gemmataceae 12479517
Bacteria; Acidobacteriota; Terriglobia; Bryobacterales; Bryobacteraceae; Solibacter 10617884
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae; JAQGHR01 10431455
Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae 12864765
Bacteria; Actinomycetota; Actinomycetes; Mycobacteriales; Mycobacteriaceae; Mycobacterium 10583130

We can use the stringr package to filter based on a word or string of characters in a column

R code
datatable(
NEON_MAGs |> 
  filter(str_detect(gtdb_taxonomy_lineage, 'Bacteroidota'))
)

Filter to include only the samples from Yellowstone NP

R code
datatable(
NEON_MAGs |> 
  filter(str_detect(genome_name, 'Yellowstone NP'))
)

Since the the taxonomic categories in GTDB-Tk Taxonomy Lineage are separated by the ; we can use the separate function to create new columns for each of the taxonomic categories. The remove = FALSE keeps the original GTDB-Tk Taxonomy Lineage column

R code
NEON_MAGs_tax <- NEON_MAGs |> 
  separate(gtdb_taxonomy_lineage, c("domain", "phylum", "class", "order", "family", "genus"), "; ", remove = FALSE) 

Count the number of MAGs for each Phylum and display in DT::datatable

R code
datatable(
  NEON_MAGs_tax |> 
    count(phylum, sort = TRUE)
)

Note that there is one category with no name. This were MAGs that were not annotated by GTDB using the JGI pipline (they are Archaea)

There is a lot of information in genome_name. Let’s unpack it into separate columns. Note here where the double quotes and grave accents are used

R code
NEON_MAGs_tax_sample <- NEON_MAGs_tax |> 
  # Get rid of the the common string "Soil microbial communities from "
  mutate_at("genome_name", str_replace, "Terrestrial soil microbial communities from ", "") |> 
  # Use the first `-` to split the column in two
  separate(genome_name, c("site","sample_name"), " - ") |> 
  # Get rid of the the common string "S-comp-1"
  mutate_at("sample_name", str_replace, "-comp-1", "") |>
  # separate the Sample Name into Site ID and plot info
  separate(sample_name, c("site_ID","subplot.layer.date"), "_", remove = FALSE,) |> 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("subplot", "layer", "date"), "-") 

Check this out in the Environment window to make sure with got it right.

Let’s see which Site has the most MAGs

R code
datatable(
  NEON_MAGs_tax_sample |> 
    count(site, sort = TRUE)
)

Exercises

Exercise 1

Use view(iris) to see the whole data table. Subset the table based on a different species than was used in the example. Display the table using DT::datatable

Exercise 2

Display using DT::datatable the NEON MAGs that have at least 1 16S rRNA

Exercise 3

Display a table of the MAGs from Lower Tombigbee with only the columns for the genome_name, gtdb_taxonomy_lineage, and estimated MAG genome size (total_number_of_bases).

Exercise 4

Display a table with the counts of each class names at Lyndon B. Johnson National Grasslands.

Exercise 5

Display a table with the counts for the Phylum Actinomycetota at each Site.