R code
library(tidyverse)
library(lubridate)
library(phyloNEON)
library(DT)
library(viridis)
This is from https://github.com/NEONScience/phyloNEON/blob/main/README.md and https://github.com/NEONScience/phyloNEON/blob/main/docs/metagenomic/README.md
A set of tools in R and Python to run phylogenetic and taxonomic analyses on NEON and related data
To install phyloNEON, you will need the devtools package.
library(devtools)
install_github("NEONScience/phyloNEON/phyloNEON")
NEON offers several data products that include genetic data. This repository is being developed to include tools and guidelines to help users of NEON data to better utilize the genetic data.
DNA is extracted from NEON soil and aquatic samples and sequenced with a shotgun sequencing library prep. Through collaborations with the Joint Genome Institute (JGI) and the National Microbiome Data Collaborative (NMDC), most of the metagenomic sequencing data are available on the data portals of these organizations. Connections to these external data sources are being built into NEON data releases. The phyloNEON package also offers some tools and guidelines to help the user find and analyze NEON metagenomic data on the JGI/NMDC data portals.
This page in the repo (docs/metagenomic/README.md) will help you get started.
We have provided some tools and guidelines to help users access NEON metagenomic data on the JGI and NMDC data portals.
A table (neon.metaDB) has been added to the phyloNEON package that contains over 1,800 NEON metagenome samples that are on the JGI IMG data portal. This includes legacy data as well as all samples that are part of the JGI CSP award, which covers deep sequencing and analysis by JGI of all NEON metagenome samples collected in 2023 and 2024. Included in the table are several fields with JGI metadata and statistics for each sample, such as Sequencing Method, GenomeSize, GeneCount, and number of bins (metaBATbinCount). Also included are some NEON variables such as siteID and collectDate, as well as multiple environmental terms assigned to each sample according to ENVO specifications (e.g. Ecosystem Category, Ecosystem Type, Specific Ecosystem). The table also has reference codes for the Genomes OnLine Database (GOLD), including GOLD Analysis Project ID and GOLD Study ID, and the taxon OID (imgGenomeID) that allows accessing the sample on the JGI IMG data portal.
This table is available when you load the package phyloNEON.
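Several column names in neon.metaDB contain spaces (for example Ecosystem Category and Sequencing Method), so they have to be wrapped in backticks in dplyr code. The sketch below illustrates this on a two-row stand-in tibble; `toy` and `aquatic` are hypothetical names, and the values are copied from the structure output shown further down rather than read from the real table.

```r
library(dplyr)
library(tibble)

# Two-row stand-in for neon.metaDB (assumption: the real table has 32 columns;
# only a handful are reproduced here for illustration).
toy <- tibble(
  dnaSampleID = c("ONAQ_044-M-20190619-COMP-DNA1",
                  "TECR.20230821.EPILITHON.8.DNA-DNA1"),
  siteID = c("ONAQ", "TECR"),
  `Ecosystem Category` = c("Terrestrial", "Aquatic"),
  `Sequencing Method` = c("Illumina NextSeq 550", "Illumina NovaSeq X 10B"),
  metaBATbinCount = c(0, 11)
)

# Column names with spaces need backticks; plain names do not.
aquatic <- toy |>
  filter(`Ecosystem Category` == "Aquatic") |>
  select(dnaSampleID, siteID, `Sequencing Method`, metaBATbinCount)
```

With phyloNEON loaded, the same filter and select calls should apply to neon.metaDB directly.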
Save a copy of the table to a file so there is a record of the version used
To view the table neon.metaDB interactively (Note: set eval = FALSE for this chunk, or leave it out of your R code, or you will get an error when rendering)
To view the structure of neon.metaDB
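The code for this step is not shown, so the following is only one way it could be done: write the table to a date-stamped CSV with readr. The file naming scheme, the use of tempdir(), and the one-row `snapshot` stand-in are all assumptions made so the sketch runs without the package installed.

```r
library(readr)
library(tibble)

# Hypothetical one-row stand-in; with phyloNEON loaded, use neon.metaDB instead.
snapshot <- tibble(
  dnaSampleID = "ONAQ_044-M-20190619-COMP-DNA1",
  siteID = "ONAQ"
)

# Date-stamped file name so the saved version can be identified later.
# tempdir() is used here only to keep the example self-contained.
out_file <- file.path(
  tempdir(),
  paste0("neon_metaDB_", format(Sys.Date(), "%Y%m%d"), ".csv")
)
write_csv(snapshot, out_file)

# Reading the file back recovers the saved version of the table.
restored <- read_csv(out_file, show_col_types = FALSE)
```

In a real project the file would go into the project directory rather than tempdir(), so that it persists between sessions.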
tibble [1,834 × 32] (S3: tbl_df/tbl/data.frame)
$ dnaSampleID : chr [1:1834] "ONAQ_044-M-20190619-COMP-DNA1" "PUUM_031-O-20210104-COMP-DNA1" "TECR.20230821.EPILITHON.8.DNA-DNA1" "KONZ_003-M-20170710-COMP-DNA1" ...
$ imgGenomeID : num [1:1834] 3.3e+09 3.3e+09 3.3e+09 3.3e+09 3.3e+09 ...
$ jgiProjectID : num [1:1834] NA NA 1506438 0 1500369 ...
$ ITS Proposal ID : num [1:1834] NA NA 509938 NA 509938 ...
$ Sequencing Status : chr [1:1834] "Permanent Draft" "Permanent Draft" "Permanent Draft" "Permanent Draft" ...
$ Study Name : chr [1:1834] "Terrestrial soil microbial communities from various locations - NEON" "Terrestrial soil microbial communities from various locations - NEON" "Soil and water microbial communities from various NEON Field Sites across the United States" "Terrestrial soil microbial communities from various locations - NEON" ...
$ GenomeName : chr [1:1834] "Terrestrial soil microbial communities from Onaqui, Utah, USA - ONAQ_044-M-20190619-COMP-DNA1" "Terrestrial soil microbial communities from Puu Makaala Natural Area Reserve, Hawaii, USA - PUUM_031-O-20210104-COMP-DNA1" "Freshwater microbial communities from Teakettle 2 Creek NEON Field Site, Sierra National Forest, CA, USA - TECR"| __truncated__ "Terrestrial soil microbial communities from Konza Prairie Biological Station, Prairie Peninsula, KS, USA - KONZ"| __truncated__ ...
$ Sequencing Center : chr [1:1834] "Battelle Memorial Institute" "Battelle Memorial Institute" "DOE Joint Genome Institute (JGI)" "Battelle Memorial Institute" ...
$ GOLD Analysis Project ID : chr [1:1834] "Ga0620072" "Ga0619546" "Ga0672972" "Ga0428256" ...
$ GOLD Analysis Project Type: chr [1:1834] "Metagenome Analysis" "Metagenome Analysis" "Metagenome Analysis" "Metagenome Analysis" ...
$ GOLD Sequencing Project ID: chr [1:1834] "Gp0766640" "Gp0766114" "Gp0812633" "Gp0476824" ...
$ GOLD Study ID : chr [1:1834] "Gs0144570" "Gs0144570" "Gs0166454" "Gs0144570" ...
$ Funding Program : chr [1:1834] NA NA "CSP" NA ...
$ Sequencing Method : chr [1:1834] "Illumina NextSeq 550" "Illumina NextSeq 550" "Illumina NovaSeq X 10B" "Illumina NextSeq 550" ...
$ Sequencing Quality : chr [1:1834] "Level 1: Standard Draft" "Level 1: Standard Draft" "Level 1: Standard Draft" "Level 1: Standard Draft" ...
$ siteID : chr [1:1834] "ONAQ" "PUUM" "TECR" "KONZ" ...
$ collectDate : chr [1:1834] "20190619" "20210104" "20230821" "20170710" ...
$ Ecosystem : chr [1:1834] "Environmental" "Environmental" "Environmental" "Environmental" ...
$ Ecosystem Category : chr [1:1834] "Terrestrial" "Terrestrial" "Aquatic" "Terrestrial" ...
$ Ecosystem Subtype : chr [1:1834] "Unclassified" "Forest" "Creek" "Unclassified" ...
$ Ecosystem Type : chr [1:1834] "Soil" "Soil" "Freshwater" "Soil" ...
$ Specific Ecosystem : chr [1:1834] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
$ GenomeSize : num [1:1834] 3.57e+06 8.86e+04 8.33e+08 2.23e+07 7.62e+08 ...
$ GeneCount : num [1:1834] 11955 313 1296700 63840 1113040 ...
$ ScaffoldCount : num [1:1834] 10633 277 671686 56485 522571 ...
$ metaBATbinCount : num [1:1834] 0 0 11 0 23 5 0 7 0 0 ...
$ eukCCbinCount : num [1:1834] 0 0 2 0 2 0 0 0 0 0 ...
$ estNumberGenomes : num [1:1834] 0 0 147 0 151 215 0 132 0 0 ...
$ avgGenomeSize : num [1:1834] 0 0 5663975 0 5046004 ...
$ numberFilteredReads : num [1:1834] 0.00 0.00 1.87e+08 0.00 3.15e+08 ...
$ numberMappedReads : num [1:1834] 0.00 0.00 1.03e+08 0.00 2.78e+08 ...
$ pctAssembledReads : num [1:1834] 0 0 55.4 0 88.2 ...
Convert the collectDate from character to date format
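A minimal sketch of this conversion, assuming lubridate's ymd() is used because collectDate is stored as text in the form "20190619". It runs on a small stand-in tibble (`toy`, a hypothetical name) with values taken from the structure output above. The object name neon.metaDB.my matches the one used later in this document, but the commented line showing how it is created is an assumption.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Stand-in rows using siteID and collectDate values from the str() output.
toy <- tibble(
  siteID = c("ONAQ", "PUUM", "TECR", "KONZ"),
  collectDate = c("20190619", "20210104", "20230821", "20170710")
)

# ymd() parses "YYYYMMDD" strings into Date objects.
toy.my <- toy |>
  mutate(collectDate = ymd(collectDate))

# Assumed equivalent on the real table:
# neon.metaDB.my <- neon.metaDB |> mutate(collectDate = ymd(collectDate))
```

Once collectDate is a Date, functions such as lubridate::year() can be applied to it, as the plotting code later on does.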
# A tibble: 9 × 2
year mean_GenomeSize
<dbl> <dbl>
1 2013 8635723.
2 2014 7670580.
3 2016 16242281.
4 2017 17172060.
5 2018 14100181.
6 2019 16735397.
7 2020 37402350.
8 2021 1334660078.
9 2023 1310246892.
# A tibble: 6 × 2
year mean_GenomeSize
<dbl> <dbl>
1 2013 4721159.
2 2016 17096867.
3 2017 29942423.
4 2019 13990976.
5 2020 30567005.
6 2023 2236533461.
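The two summary tables above have the columns year and mean_GenomeSize, which is what a group_by() and summarise() on the collection year would produce. The code that generated them is not shown, so the sketch below is a guess at the approach, run on made-up stand-in rows (`toy` and its GenomeSize values are hypothetical). The filter used for the second, shorter table is unknown; the Ecosystem Category filter here is purely illustrative.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical stand-in rows; the GenomeSize values are invented for illustration.
toy <- tibble(
  collectDate = ymd(c("20190619", "20190702", "20230821")),
  `Ecosystem Category` = c("Terrestrial", "Terrestrial", "Aquatic"),
  GenomeSize = c(10, 30, 500)
)

# Mean assembled genome size per collection year.
mean_by_year <- toy |>
  group_by(year = year(collectDate)) |>
  summarise(mean_GenomeSize = mean(GenomeSize))

# Same summary on a subset (the real filter is not shown in the source).
mean_by_year_soil <- toy |>
  filter(`Ecosystem Category` == "Terrestrial") |>
  group_by(year = year(collectDate)) |>
  summarise(mean_GenomeSize = mean(GenomeSize))
```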
To reformat the dnaSampleID column for terrestrial samples (this does not work for the aquatic samples)
neon.metaDB.my.soil <- neon.metaDB.my |>
  # Keep terrestrial samples and drop combined assemblies
  filter(`Ecosystem Category` == "Terrestrial") |>
  filter(`GOLD Analysis Project Type` != "Combined Assembly") |>
  # Split off the site code, e.g. "ONAQ" from "ONAQ_044-M-20190619-COMP-DNA1"
  separate(`dnaSampleID`, c("dnaSampleID.site","dnaSampleID.sub"), "_", remove=FALSE) |>
  # Split off the sample type suffix (starts with COMP or GEN)
  mutate_at("dnaSampleID.sub", str_replace, "-COMP", "_COMP") |>
  mutate_at("dnaSampleID.sub", str_replace, "-GEN", "_GEN") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.sub","dnaSampleID.type"), "_") |>
  # Split off the plot number, which precedes the layer code (M or O)
  mutate_at("dnaSampleID.sub", str_replace, "-M", "_M") |>
  mutate_at("dnaSampleID.sub", str_replace, "-O", "_O") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.plot","dnaSampleID.sub"), "_") |>
  # Split off the layer code
  mutate_at("dnaSampleID.sub", str_replace, "M-", "M_") |>
  mutate_at("dnaSampleID.sub", str_replace, "O-", "O_") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.layer","dnaSampleID.sub"), "_") |>
  # Separate any subplot information from the date (which starts with 201 or 202)
  mutate_at("dnaSampleID.sub", str_replace, "-201", "201") |>
  mutate_at("dnaSampleID.sub", str_replace, "-202", "202") |>
  mutate_at("dnaSampleID.sub", str_replace, "201", "_201") |>
  mutate_at("dnaSampleID.sub", str_replace, "202", "_202") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.subplot","dnaSampleID.date"), "_") |>
  # Build plotID as site_plot, keeping the component columns
  unite(plotID, c(dnaSampleID.site, dnaSampleID.plot), sep='_', remove=FALSE)
neon.metaDB.my.soil$dnaSampleID.data <- as.numeric(neon.metaDB.my.soil$dnaSampleID.date)
neon.metaDB.my.soil$dnaSampleID.date <- ymd(neon.metaDB.my.soil$dnaSampleID.date)
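The chain of str_replace and separate calls above can be hard to follow, so here is an alternative sketch that pulls the same pieces out with one regular expression via stringr::str_match(). It is not the method used in this document. The pattern is an assumption based only on the three soil IDs visible in the structure output, plus the GEN type and the optional subplot segment implied by the code above; the names `ids`, `pattern`, and `parsed` are hypothetical.

```r
library(stringr)
library(tibble)
library(lubridate)

# Soil sample IDs copied from the str() output earlier in this document.
ids <- c("ONAQ_044-M-20190619-COMP-DNA1",
         "PUUM_031-O-20210104-COMP-DNA1",
         "KONZ_003-M-20170710-COMP-DNA1")

# Assumed layout: SITE_PLOT-LAYER-[SUBPLOT-]YYYYMMDD-TYPE-DNAn
pattern <- "^([A-Z]{4})_(\\d{3})-([MO])-(?:(.+)-)?(\\d{8})-(COMP|GEN)-(DNA\\d+)$"

# str_match() returns a matrix: column 1 is the full match, then one column per group.
m <- str_match(ids, pattern)

parsed <- tibble(
  dnaSampleID = ids,
  site    = m[, 2],
  plot    = m[, 3],
  layer   = m[, 4],
  subplot = m[, 5],      # NA when the ID has no subplot segment
  date    = ymd(m[, 6]),
  type    = m[, 7],
  extract = m[, 8]
)
```

IDs that do not fit the assumed layout come back as rows of NA, which makes unexpected formats easy to spot before any further analysis.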
To reformat the dnaSampleID column for aquatic samples
neon.metaDB.my.aquatic <- neon.metaDB.my |>
  # Keep aquatic samples and drop combined assemblies
  filter(`Ecosystem Category` == "Aquatic") |>
  filter(`GOLD Analysis Project Type` != "Combined Assembly") |>
  # Work on a copy so the original dnaSampleID is preserved
  mutate(dnaSampleID.sub = dnaSampleID) |>
  # Split everything before the date (site, plus an optional code) from the rest
  mutate_at("dnaSampleID.sub", str_replace, ".202", "_202") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.site","dnaSampleID.sub"), "_") |>
  separate(`dnaSampleID.site`, c("dnaSampleID.site","dnaSampleID.code"), "\\.") |>
  # Split off the trailing DNA type, e.g. "DNA-DNA1"
  mutate_at("dnaSampleID.sub", str_replace, ".DNA", "_DNA") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.sub","dnaSampleID.type"), "_") |>
  # Split the remainder into date, niche (e.g. EPILITHON) and sample number
  separate(`dnaSampleID.sub`, c("dnaSampleID.data","dnaSampleID.niche", "dnaSampleID.num"), "\\.") |>
  # Merge code and niche into one column and strip the "NA" left by the missing part
  unite(dnaSampleID.niche, c(dnaSampleID.code, dnaSampleID.niche)) |>
  mutate_at("dnaSampleID.niche", str_replace, "NA_", "") |>
  mutate_at("dnaSampleID.niche", str_replace, "_NA", "")
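The aquatic pipeline above passes patterns such as ".202" and ".DNA" to str_replace, where an unescaped dot is a regular-expression wildcard rather than a literal period. On the aquatic ID shown in the structure output this makes no difference, but it can on other strings. The sketch below compares the wildcard form with stringr's fixed(); `odd_id` is a made-up string used only to show the difference, not a real NEON sample ID.

```r
library(stringr)

# Aquatic sample ID copied from the str() output earlier in this document.
id <- "TECR.20230821.EPILITHON.8.DNA-DNA1"

# Unescaped "." matches any character; here it happens to hit the intended dot.
regex_result <- str_replace(id, ".202", "_202")

# fixed() treats the pattern as literal text, so only a real ".202" matches.
fixed_result <- str_replace(id, fixed(".202"), "_202")

# Hypothetical ID where the wildcard matches earlier than intended.
odd_id <- "X202.20230821.DNA-DNA1"
odd_regex <- str_replace(odd_id, ".202", "_202")         # replaces "X202"
odd_fixed <- str_replace(odd_id, fixed(".202"), "_202")  # replaces ".202"
```

Escaping the dot as "\\.202" has the same effect as fixed() for these patterns.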
neon.metaDB.my.soil |>
  filter(siteID == "HARV") |>
  group_by(Year = lubridate::year(collectDate), dnaSampleID.plot) |>
  count() |>
  # values_fill = 0 records plot-year combinations without metagenomes as zero
  # (replaces the deprecated mutate_all(funs(replace_na(., 0))) step)
  pivot_wider(names_from = dnaSampleID.plot, values_from = n, values_fill = 0) |>
  pivot_longer(!Year, names_to = "plot", values_to = "metagenomes") |>
  ggplot(aes(x=Year, y = plot)) +
  geom_tile(aes(fill = metagenomes)) +
  scale_fill_viridis(discrete=FALSE, direction = -1) +
  scale_x_continuous(breaks = seq(2013, 2023, by = 1))
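The heat-map code above fills in plot-year combinations that have no metagenomes by pivoting wide and then back to long. As an alternative sketch (not the approach used in this document), tidyr::complete() can add the missing combinations in one step. The `toy` data below are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical sample records: one row per metagenome.
toy <- tibble(
  Year = c(2013, 2013, 2014, 2016),
  plot = c("001", "002", "001", "002")
)

counts <- toy |>
  # One row per observed Year/plot combination, with its count
  count(Year, plot, name = "metagenomes") |>
  # Add every unobserved Year/plot combination with a count of zero
  complete(Year, plot, fill = list(metagenomes = 0))
```

The resulting `counts` table is already in the long format that geom_tile() expects, so it could be piped straight into ggplot().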
All sites are missing data from 2021 and 2022. That should be in IMG soon. What about 2018?
neon.metaDB.my.soil |>
  group_by(siteID, Year = lubridate::year(collectDate), dnaSampleID.plot) |>
  count() |>
  # values_fill = 0 records plot-year combinations without metagenomes as zero
  # (replaces the deprecated mutate_all(funs(replace_na(., 0))) step)
  pivot_wider(names_from = Year, values_from = n, values_fill = 0) |>
  pivot_longer(!c(siteID, dnaSampleID.plot), names_to = "Year", values_to = "metagenomes") |>
  ggplot(aes(x=Year, y = dnaSampleID.plot)) +
  geom_tile(aes(fill = metagenomes)) +
  scale_fill_viridis(discrete=FALSE, direction = -1) +
  # scale_x_continuous(breaks = seq(2013, 2023, by = 1)) +
  facet_wrap(~siteID, scales ="free_y", ncol = 3) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))