R code
library(tidyverse)
library(knitr)
library(janitor)
library(DT)We will going over the following in class
NEON has produced metagenomic data as data product since 2014. The Joint Genome Institute has recently annotated version all reads prior to 20222 Gs0144570. However, these metagenomes have a low sequencing depth and therefore are difficult to use for assembling bacterial genomes.
Last year JGI and NEON collaborated to produce metagenomes which are ~10-fold deeper which allow for better assemble of reads into gene length and greater fragments. We will work with the NEON 2023 and 2024 data Gs0166454 study sets.
Don’t display tables with thousands of rows. If you do you may run out of memory and/or you will generate a very large html file.
By default, Quarto displays data frames and matrixes as they would be in the R terminal (in a monospaced font). You have seen already what they look like. Here is an example for the iris data set that is preloaded when you start R. We will make a subset of the table.
To make a table more readable in a report the knitr::kable function works nice for simple customizable tables.
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| 5.8 | 4.0 | 1.2 | 0.2 | setosa |
| 5.7 | 4.4 | 1.5 | 0.4 | setosa |
| 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 5.7 | 3.8 | 1.7 | 0.3 | setosa |
| 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa |
| 5.1 | 3.7 | 1.5 | 0.4 | setosa |
| 5.1 | 3.3 | 1.7 | 0.5 | setosa |
| 5.2 | 3.5 | 1.5 | 0.2 | setosa |
| 5.2 | 3.4 | 1.4 | 0.2 | setosa |
| 5.4 | 3.4 | 1.5 | 0.4 | setosa |
| 5.2 | 4.1 | 1.5 | 0.1 | setosa |
| 5.5 | 4.2 | 1.4 | 0.2 | setosa |
| 5.5 | 3.5 | 1.3 | 0.2 | setosa |
| 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 5.1 | 3.8 | 1.9 | 0.4 | setosa |
| 5.1 | 3.8 | 1.6 | 0.2 | setosa |
| 5.3 | 3.7 | 1.5 | 0.2 | setosa |
kable works for small tables, but even 22 rows is too much to display in a report. If you have larger tables and/or want to make them interactive, the DT works well.
There are two options for using datatable in your R code chunk. (1) Bound the code chunk you want to present by datatable
Or create a new object
Here are a few other popular table making packages
Our data that we will work with today can be found by searching metagenome bins associated with GOLD Study ID Gs0166454 IMG/MER
A description of the column headers for the file we will work with
Let’s load the table into R
As always in the Environment window check to see if the table loaded as expect and what the object types are. Or you can do it in your R console
# A tibble: 6 × 21
bin_oid `Bin ID` `Genome Name` `IMG Genome ID` `Bin Quality` `Bin Lineage`
<chr> <chr> <chr> <dbl> <chr> <chr>
1 3300075492… 3300075… Soil microbi… 3300075492 MQ Bacteria
2 3300075492… 3300075… Soil microbi… 3300075492 MQ Bacteria
3 3300075492… 3300075… Soil microbi… 3300075492 MQ <NA>
4 3300075492… 3300075… Soil microbi… 3300075492 MQ Bacteria; Ac…
5 3300075492… 3300075… Soil microbi… 3300075492 MQ Bacteria; Ac…
6 3300075492… 3300075… Soil microbi… 3300075492 MQ Bacteria; Ac…
# ℹ 15 more variables: `GTDB Taxonomy Lineage` <chr>, `Bin Methods` <chr>,
# `Created By` <chr>, `Date Added` <date>, `Bin Completeness` <dbl>,
# `Bin Contamination` <dbl>, `Average Coverage` <lgl>,
# `Total Number of Bases` <dbl>, `5s rRNA` <dbl>, `16s rRNA` <dbl>,
# `23s rRNA` <dbl>, `tRNA Genes` <dbl>, `Gene Count` <dbl>,
# `Scaffold Count` <dbl>, `GOLD Study ID` <chr>
spc_tbl_ [16,669 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ bin_oid : chr [1:16669] "3300075492_s0" "3300075492_s1" "3300075492_s100" "3300075492_s106" ...
$ Bin ID : chr [1:16669] "3300075492_s0" "3300075492_s1" "3300075492_s100" "3300075492_s106" ...
$ Genome Name : chr [1:16669] "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ "Soil microbial communities from University of Notre Dame Environmental Research Center NEON Field Site, Michiga"| __truncated__ ...
$ IMG Genome ID : num [1:16669] 3.3e+09 3.3e+09 3.3e+09 3.3e+09 3.3e+09 ...
$ Bin Quality : chr [1:16669] "MQ" "MQ" "MQ" "MQ" ...
$ Bin Lineage : chr [1:16669] "Bacteria" "Bacteria" NA "Bacteria; Actinomycetota; Thermoleophilia; Solirubrobacterales" ...
$ GTDB Taxonomy Lineage: chr [1:16669] "Bacteria; Acidobacteriota; Terriglobia; Acidoferrales; UBA7541; Acidoferrum" "Bacteria; Desulfobacterota_B; Binatia; Binatales; Binataceae; Binatus; Binatus soli" "Archaea; Thermoplasmatota; Thermoplasmata; UBA184; UBA184; UBA184" "Bacteria; Actinomycetota; Thermoleophilia; Solirubrobacterales; Solirubrobacteraceae; Palsa-744" ...
$ Bin Methods : chr [1:16669] "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" "SemiBin2:v2.1.0, CheckM2:v1.0.2, GTDB-Tk:v2.4.0, GTDB-Tk-database:release220" ...
$ Created By : chr [1:16669] "IMG_PIPELINE" "IMG_PIPELINE" "IMG_PIPELINE" "IMG_PIPELINE" ...
$ Date Added : Date[1:16669], format: "2025-01-23" "2025-01-23" ...
$ Bin Completeness : num [1:16669] 95.9 99.6 57.9 74 67.3 ...
$ Bin Contamination : num [1:16669] 6.31 0 2.39 8.19 3.95 6.54 5.24 0.63 0.61 4.56 ...
$ Average Coverage : logi [1:16669] NA NA NA NA NA NA ...
$ Total Number of Bases: num [1:16669] 5600425 3706224 1233791 2084993 3809196 ...
$ 5s rRNA : num [1:16669] 1 1 1 1 0 0 0 1 1 0 ...
$ 16s rRNA : num [1:16669] 2 0 0 1 0 0 0 0 0 0 ...
$ 23s rRNA : num [1:16669] 2 1 1 1 0 0 0 1 0 0 ...
$ tRNA Genes : num [1:16669] 57 51 30 33 31 33 29 25 30 30 ...
$ Gene Count : num [1:16669] 4976 3617 1344 2284 3848 ...
$ Scaffold Count : num [1:16669] 26 45 210 322 398 85 527 435 243 409 ...
$ GOLD Study ID : chr [1:16669] "Gs0166454" "Gs0166454" "Gs0166454" "Gs0166454" ...
- attr(*, "spec")=
.. cols(
.. bin_oid = col_character(),
.. `Bin ID` = col_character(),
.. `Genome Name` = col_character(),
.. `IMG Genome ID` = col_double(),
.. `Bin Quality` = col_character(),
.. `Bin Lineage` = col_character(),
.. `GTDB Taxonomy Lineage` = col_character(),
.. `Bin Methods` = col_character(),
.. `Created By` = col_character(),
.. `Date Added` = col_date(format = ""),
.. `Bin Completeness` = col_double(),
.. `Bin Contamination` = col_double(),
.. `Average Coverage` = col_logical(),
.. `Total Number of Bases` = col_double(),
.. `5s rRNA` = col_double(),
.. `16s rRNA` = col_double(),
.. `23s rRNA` = col_double(),
.. `tRNA Genes` = col_double(),
.. `Gene Count` = col_double(),
.. `Scaffold Count` = col_double(),
.. `GOLD Study ID` = col_character()
.. )
- attr(*, "problems")=<externalptr>
You might also notice that many columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names. To refer to these variables, you need to surround them with backticks, `. Last week in Chapter 7 we used janitor::clean_names() to use some heuristics to turn them all into snake case at once.
Count the number of MQ and HQ genomes
# A tibble: 2 × 2
bin_quality n
<chr> <int>
1 MQ 15369
2 HQ 1300
Make a knitr::kable table of the bin quality counts
Filter so that Bin Quality = HQ and display in DT::datatable
Select the GTDB taxonomy and the MAGs genome size then filter to all MAGs greater than 10,000,000 bases
| gtdb_taxonomy_lineage | total_number_of_bases |
|---|---|
| Bacteria; Pseudomonadota; Gammaproteobacteria; Steroidobacterales; Steroidobacteraceae; 13-2-20CM-66-19 | 10115899 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae | 11932805 |
| Bacteria; Actinomycetota; Actinomycetes; Streptomycetales; Catenulisporaceae; Catenulispora | 12420050 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae; JAJPJC01 | 13151516 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae | 12046820 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae; JAQGHR01 | 12217509 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae | 12942605 |
| Bacteria; Actinomycetota; Actinomycetes; Mycobacteriales; Pseudonocardiaceae; Actinophytocola | 11293845 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae | 14411455 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae | 10890912 |
| Bacteria; Planctomycetota; Planctomycetia; Pirellulales; JAICIG01; JAICLL01 | 11379784 |
| Bacteria; Acidobacteriota; Terriglobia; Bryobacterales; Bryobacteraceae; PALSA-243 | 10475466 |
| NA | 10248256 |
| Bacteria; Pseudomonadota; Alphaproteobacteria; Rhizobiales; Xanthobacteraceae | 10436042 |
| Bacteria; Planctomycetota; Planctomycetia; Gemmatales; Gemmataceae | 13087515 |
| Bacteria; Actinomycetota; Actinomycetes; Mycobacteriales; Pseudonocardiaceae; Actinophytocola | 10431101 |
| Bacteria; Myxococcota; Polyangia; Polyangiales; JAFGIB01 | 10030262 |
| Bacteria; Myxococcota; Polyangia; Polyangiales; Polyangiaceae; JANYGI01 | 12560711 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae | 11026377 |
| Bacteria; Planctomycetota; Planctomycetia; Gemmatales; Gemmataceae | 12479517 |
| Bacteria; Acidobacteriota; Terriglobia; Bryobacterales; Bryobacteraceae; Solibacter | 10617884 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae; JAQGHR01 | 10431455 |
| Bacteria; Planctomycetota; Planctomycetia; Isosphaerales; Isosphaeraceae | 12864765 |
| Bacteria; Actinomycetota; Actinomycetes; Mycobacteriales; Mycobacteriaceae; Mycobacterium | 10583130 |
We can use the stringr package to filter based on a word or string of characters in a column
Filter to include only the samples from Yellowstone NP
Since the the taxonomic categories in GTDB-Tk Taxonomy Lineage are separated by the ; we can use the separate function to create new columns for each of the taxonomic categories. The remove = FALSE keeps the original GTDB-Tk Taxonomy Lineage column
Count the number of MAGs for each Phylum and display in DT::datatable
Note that there is one category with no name. This were MAGs that were not annotated by GTDB using the JGI pipline (they are Archaea)
There is a lot of information in genome_name. Let’s unpack it into separate columns. Note here where the double quotes and grave accents are used
NEON_MAGs_tax_sample <- NEON_MAGs_tax |>
# Get rid of the the common string "Soil microbial communities from "
mutate_at("genome_name", str_replace, "Terrestrial soil microbial communities from ", "") |>
# Use the first `-` to split the column in two
separate(genome_name, c("site","sample_name"), " - ") |>
# Get rid of the the common string "S-comp-1"
mutate_at("sample_name", str_replace, "-comp-1", "") |>
# separate the Sample Name into Site ID and plot info
separate(sample_name, c("site_ID","subplot.layer.date"), "_", remove = FALSE,) |>
# separate the plot info into 3 columns
separate(`subplot.layer.date`, c("subplot", "layer", "date"), "-") Check this out in the Environment window to make sure with got it right.
Let’s see which Site has the most MAGs
Use view(iris) to see the whole data table. Subset the table based on a different species than was used in the example. Display the table using DT::datatable
Display using DT::datatable the NEON MAGs that have at least 1 16S rRNA
Display a table of the MAGs from Lower Tombigbee with only the columns for the genome_name, gtdb_taxonomy_lineage, and estimated MAG genome size (total_number_of_bases).
Display a table with the counts of each class names at Lyndon B. Johnson National Grasslands.
Display a table with the counts for the Phylum Actinomycetota at each Site.