phyloNEON

This is from https://github.com/NEONScience/phyloNEON/blob/main/README.md and https://github.com/NEONScience/phyloNEON/blob/main/docs/metagenomic/README.md

A set of tools in R and Python to run phylogenetic and taxonomic analyses on NEON and related data

Installation

To install phyloNEON, you will need the devtools package.

library(devtools)

install_github("NEONScience/phyloNEON/phyloNEON")

Accessing and using NEON genetic data

NEON offers several data products that include genetic data. This repository is being developed to include tools and guidelines to help users of NEON data to better utilize the genetic data.

NEON metagenomic data

DNA is extracted from NEON soil and aquatic samples and sequenced with a shotgun sequencing library prep. Through collaborations with the Joint Genome Institute (JGI) and the National Microbiome Data Collaborative (NMDC), most of the metagenomic sequencing data are available on the data portals of these organizations. Connections to these external data sources are being built into NEON data releases. The phyloNEON package also offers some tools and guidelines to help the user find and analyze NEON metagenomic data on the JGI and NMDC data portals.

The page docs/metagenomic/README.md in the repo, reproduced below, will help you get started.

Getting started with NEON metagenomic data

We have provided some tools and guidelines to help users access NEON metagenomic data on the JGI and NMDC data portals.

Accessing NEON samples on the JGI IMG data portal

NEON metagenome database

A table (neon.metaDB) has been added to the phyloNEON package that contains over 1,800 NEON metagenome samples that are on the JGI IMG data portal. This includes legacy data as well as all samples that are part of the JGI CSP award, which covers deep sequencing and analysis by JGI of all NEON metagenome samples collected in 2023 and 2024. The table includes several fields with JGI metadata and statistics for each sample, such as Sequencing Method, GenomeSize, GeneCount, and number of bins (metaBATbinCount). It also includes some NEON variables such as siteID and collectDate, as well as multiple environmental terms assigned to each sample according to ENVO specifications (e.g. Ecosystem Category, Ecosystem Type, Specific Ecosystem). Finally, the table has reference codes for the Genomes OnLine Database (GOLD), including GOLD Analysis Project ID and GOLD Study ID, and the taxon OID (imgGenomeID) that allows accessing the sample on the JGI IMG data portal.

This table is available when you load the package phyloNEON.

R code

library(tidyverse)
library(lubridate)
library(phyloNEON)
library(DT)
library(viridis)

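Save a dated copy of the table to keep a record of the version used

R code

dir.create("data/NEON_metadata", recursive = TRUE, showWarnings = FALSE)  # write_csv() does not create missing folders
write_csv(neon.metaDB, "data/NEON_metadata/neon.metaDB_20250701.csv")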

To view the table neon.metaDB interactively (note: View() opens a viewer window and causes an error when a document is rendered, so set eval = FALSE for this chunk or leave it out of rendered code)

R code

View(neon.metaDB)

To view the structure of neon.metaDB

R code

str(neon.metaDB)

tibble [1,834 × 32] (S3: tbl_df/tbl/data.frame)
 $ dnaSampleID               : chr [1:1834] "ONAQ_044-M-20190619-COMP-DNA1" "PUUM_031-O-20210104-COMP-DNA1" "TECR.20230821.EPILITHON.8.DNA-DNA1" "KONZ_003-M-20170710-COMP-DNA1" ...
 $ imgGenomeID               : num [1:1834] 3.3e+09 3.3e+09 3.3e+09 3.3e+09 3.3e+09 ...
 $ jgiProjectID              : num [1:1834] NA NA 1506438 0 1500369 ...
 $ ITS Proposal ID           : num [1:1834] NA NA 509938 NA 509938 ...
 $ Sequencing Status         : chr [1:1834] "Permanent Draft" "Permanent Draft" "Permanent Draft" "Permanent Draft" ...
 $ Study Name                : chr [1:1834] "Terrestrial soil microbial communities from various locations - NEON" "Terrestrial soil microbial communities from various locations - NEON" "Soil and water microbial communities from various NEON Field Sites across the United States" "Terrestrial soil microbial communities from various locations - NEON" ...
 $ GenomeName                : chr [1:1834] "Terrestrial soil microbial communities from Onaqui, Utah, USA - ONAQ_044-M-20190619-COMP-DNA1" "Terrestrial soil microbial communities from Puu Makaala Natural Area Reserve, Hawaii, USA - PUUM_031-O-20210104-COMP-DNA1" "Freshwater microbial communities from Teakettle 2 Creek NEON Field Site, Sierra National Forest, CA, USA - TECR"| __truncated__ "Terrestrial soil microbial communities from Konza Prairie Biological Station, Prairie Peninsula, KS, USA - KONZ"| __truncated__ ...
 $ Sequencing Center         : chr [1:1834] "Battelle Memorial Institute" "Battelle Memorial Institute" "DOE Joint Genome Institute  (JGI)" "Battelle Memorial Institute" ...
 $ GOLD Analysis Project ID  : chr [1:1834] "Ga0620072" "Ga0619546" "Ga0672972" "Ga0428256" ...
 $ GOLD Analysis Project Type: chr [1:1834] "Metagenome Analysis" "Metagenome Analysis" "Metagenome Analysis" "Metagenome Analysis" ...
 $ GOLD Sequencing Project ID: chr [1:1834] "Gp0766640" "Gp0766114" "Gp0812633" "Gp0476824" ...
 $ GOLD Study ID             : chr [1:1834] "Gs0144570" "Gs0144570" "Gs0166454" "Gs0144570" ...
 $ Funding Program           : chr [1:1834] NA NA "CSP" NA ...
 $ Sequencing Method         : chr [1:1834] "Illumina NextSeq 550" "Illumina NextSeq 550" "Illumina NovaSeq X 10B" "Illumina NextSeq 550" ...
 $ Sequencing Quality        : chr [1:1834] "Level 1: Standard Draft" "Level 1: Standard Draft" "Level 1: Standard Draft" "Level 1: Standard Draft" ...
 $ siteID                    : chr [1:1834] "ONAQ" "PUUM" "TECR" "KONZ" ...
 $ collectDate               : chr [1:1834] "20190619" "20210104" "20230821" "20170710" ...
 $ Ecosystem                 : chr [1:1834] "Environmental" "Environmental" "Environmental" "Environmental" ...
 $ Ecosystem Category        : chr [1:1834] "Terrestrial" "Terrestrial" "Aquatic" "Terrestrial" ...
 $ Ecosystem Subtype         : chr [1:1834] "Unclassified" "Forest" "Creek" "Unclassified" ...
 $ Ecosystem Type            : chr [1:1834] "Soil" "Soil" "Freshwater" "Soil" ...
 $ Specific Ecosystem        : chr [1:1834] "Unclassified" "Unclassified" "Unclassified" "Unclassified" ...
 $ GenomeSize                : num [1:1834] 3.57e+06 8.86e+04 8.33e+08 2.23e+07 7.62e+08 ...
 $ GeneCount                 : num [1:1834] 11955 313 1296700 63840 1113040 ...
 $ ScaffoldCount             : num [1:1834] 10633 277 671686 56485 522571 ...
 $ metaBATbinCount           : num [1:1834] 0 0 11 0 23 5 0 7 0 0 ...
 $ eukCCbinCount             : num [1:1834] 0 0 2 0 2 0 0 0 0 0 ...
 $ estNumberGenomes          : num [1:1834] 0 0 147 0 151 215 0 132 0 0 ...
 $ avgGenomeSize             : num [1:1834] 0 0 5663975 0 5046004 ...
 $ numberFilteredReads       : num [1:1834] 0.00 0.00 1.87e+08 0.00 3.15e+08 ...
 $ numberMappedReads         : num [1:1834] 0.00 0.00 1.03e+08 0.00 2.78e+08 ...
 $ pctAssembledReads         : num [1:1834] 0 0 55.4 0 88.2 ...
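Note that str() shows imgGenomeID in abbreviated scientific notation (3.3e+09), which hides the digits needed to look a sample up on IMG. The following is a minimal base R sketch of printing the full ID, assuming the IDs are whole numbers stored as doubles. The value used here is made up for illustration and is not a real taxon OID.

```r
# Hypothetical taxon OID (not a real IMG ID), stored as a double
# in the same way as the imgGenomeID column.
img_id <- 3300012345

# format() with scientific = FALSE returns every digit as a string,
# which is safer for look-ups and joins than the abbreviated display.
img_id_chr <- format(img_id, scientific = FALSE, trim = TRUE)
print(img_id_chr)

# Doubles represent whole numbers exactly up to 2^53,
# so the stored value itself has not lost any digits.
stopifnot(img_id < 2^53)
```

The same call works on a whole column, for example format(neon.metaDB$imgGenomeID, scientific = FALSE, trim = TRUE).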

Convert the collectDate from character to date format

R code

neon.metaDB.my <- neon.metaDB
neon.metaDB.my$collectDate <- as.numeric(neon.metaDB.my$collectDate)
neon.metaDB.my$collectDate <- ymd(neon.metaDB.my$collectDate)
str(neon.metaDB.my$collectDate)

 Date[1:1834], format: "2019-06-19" "2021-01-04" "2023-08-21" "2017-07-10" "2023-07-12" ...
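The character dates can also be parsed directly, without the intermediate as.numeric() step, and it may be worth checking that no value failed to parse. A minimal base R sketch on a toy vector follows. The first three values come from the output above, and the last one is a deliberately malformed, hypothetical entry.

```r
# Toy collectDate values in the YYYYMMDD character format shown by str().
# "unknown" is a hypothetical bad entry used to show the check.
collect_chr <- c("20190619", "20210104", "20230821", "unknown")

# as.Date() with an explicit format parses the strings directly;
# values that do not fit the format become NA instead of raising an error.
collect_date <- as.Date(collect_chr, format = "%Y%m%d")

# Flag entries that were present in the input but could not be parsed.
failed <- is.na(collect_date) & !is.na(collect_chr)
print(collect_chr[failed])
```

With lubridate loaded, ymd(collect_chr) behaves similarly and warns about values that fail to parse.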

Table of mean genome size per year

R code

neon.metaDB.my |> 
  filter(`GOLD Analysis Project Type` != "Combined Assembly") |> 
  group_by(year = lubridate::year(collectDate)) |> 
  summarize(mean_GenomeSize = mean(GenomeSize))

# A tibble: 9 × 2
   year mean_GenomeSize
  <dbl>           <dbl>
1  2013        8635723.
2  2014        7670580.
3  2016       16242281.
4  2017       17172060.
5  2018       14100181.
6  2019       16735397.
7  2020       37402350.
8  2021     1334660078.
9  2023     1310246892.
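The means for 2021 and 2023 are roughly two orders of magnitude larger than those of earlier years, and a mean alone does not show how many samples are behind it. The sketch below, in base R on invented toy values, reports the sample count and median alongside the mean. The data frame toy and its numbers are hypothetical; only the column name GenomeSize comes from the table.

```r
# Invented toy values; one missing value is included on purpose.
toy <- data.frame(
  year = c(2019, 2019, 2019, 2023, 2023),
  GenomeSize = c(1e7, 2e7, NA, 8e8, 1.2e9)
)

# Summarise each year separately, then stack the per-year rows.
summary_by_year <- do.call(rbind, lapply(split(toy, toy$year), function(d) {
  data.frame(
    year = d$year[1],
    n = sum(!is.na(d$GenomeSize)),                       # samples with a value
    mean_GenomeSize = mean(d$GenomeSize, na.rm = TRUE),
    median_GenomeSize = median(d$GenomeSize, na.rm = TRUE)
  )
}))
print(summary_by_year)
```

The na.rm = TRUE argument guards against missing values, should there be any. Without it, a single NA makes the mean for that year NA.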

Table of mean genome size per year at HARV

R code

neon.metaDB.my |> 
  filter(siteID == "HARV") |> 
  group_by(year = lubridate::year(collectDate)) |> 
  summarize(mean_GenomeSize = mean(GenomeSize))

# A tibble: 6 × 2
   year mean_GenomeSize
  <dbl>           <dbl>
1  2013        4721159.
2  2016       17096867.
3  2017       29942423.
4  2019       13990976.
5  2020       30567005.
6  2023     2236533461.

Plot of genome size by collection date

R code

neon.metaDB.my |> 
  ggplot(aes(x=collectDate, y = GenomeSize)) +
  geom_col(colour = "maroon", fill = "maroon") +
  coord_flip()

To split the dnaSampleID column into its components for terrestrial samples (this does not work for the aquatic samples, which use a different ID format)

R code

neon.metaDB.my.soil <- neon.metaDB.my |> 
  filter(`Ecosystem Category` == "Terrestrial") |> 
  filter(`GOLD Analysis Project Type` != "Combined Assembly") |> 
  
  separate(`dnaSampleID`, c("dnaSampleID.site","dnaSampleID.sub"), "_", remove=FALSE) |> 
  
  mutate_at("dnaSampleID.sub", str_replace, "-COMP", "_COMP") |>
  mutate_at("dnaSampleID.sub", str_replace, "-GEN", "_GEN") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.sub","dnaSampleID.type"), "_") |> 
  
  mutate_at("dnaSampleID.sub", str_replace, "-M", "_M") |>
  mutate_at("dnaSampleID.sub", str_replace, "-O", "_O") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.plot","dnaSampleID.sub"), "_") |> 
  
  mutate_at("dnaSampleID.sub", str_replace, "M-", "M_") |>
  mutate_at("dnaSampleID.sub", str_replace, "O-", "O_") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.layer","dnaSampleID.sub"), "_") |> 

  mutate_at("dnaSampleID.sub", str_replace, "-201", "201") |>
  mutate_at("dnaSampleID.sub", str_replace, "-202", "202") |>
  mutate_at("dnaSampleID.sub", str_replace, "201", "_201") |>
  mutate_at("dnaSampleID.sub", str_replace, "202", "_202") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.subplot","dnaSampleID.date"), "_") |> 

  unite(plotID, c(dnaSampleID.site, dnaSampleID.plot), sep='_', remove=FALSE)

neon.metaDB.my.soil$dnaSampleID.date <- as.numeric(neon.metaDB.my.soil$dnaSampleID.date)
neon.metaDB.my.soil$dnaSampleID.date <- ymd(neon.metaDB.my.soil$dnaSampleID.date)
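The chain of replace-and-separate steps above can also be expressed as a single regular expression. The sketch below uses strcapture() from base R (utils) and assumes the terrestrial IDs follow the pattern SITE_PLOT-LAYER-[SUBPLOT-]DATE-TYPE. The first ID is taken from the str() output above. The second is a hypothetical legacy-style ID with subplot coordinates, used only to exercise the optional part.

```r
ids <- c(
  "ONAQ_044-M-20190619-COMP-DNA1",         # from the str() output above
  "HARV_004-O-10.5-38-20140709-GEN-DNA1"   # hypothetical legacy-style ID
)

# Capture groups: site, plot, layer, optional subplot (with trailing "-"),
# 8-digit date, and the remaining type string.
pattern <- "^([A-Z]{4})_([0-9]{3})-([MO])-(.*)([0-9]{8})-(.+)$"

# One zero-length column per capture group; names mirror the
# dnaSampleID.* columns created above.
proto <- data.frame(
  site = character(), plot = character(), layer = character(),
  subplot = character(), date = character(), type = character(),
  stringsAsFactors = FALSE
)

parsed <- strcapture(pattern, ids, proto)
parsed$subplot <- sub("-$", "", parsed$subplot)          # drop trailing hyphen
parsed$date <- as.Date(parsed$date, format = "%Y%m%d")   # character to Date
parsed$plotID <- paste(parsed$site, parsed$plot, sep = "_")
print(parsed)
```

IDs that do not match the assumed pattern come back as a row of NA values, which makes unexpected formats easy to find with is.na(parsed$site).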

To split the dnaSampleID column into its components for aquatic samples

R code

neon.metaDB.my.aquatic <- neon.metaDB.my |> 
  filter(`Ecosystem Category` == "Aquatic") |> 
  filter(`GOLD Analysis Project Type` != "Combined Assembly") |> 
  
  mutate(dnaSampleID.sub = dnaSampleID) |> 
  mutate_at("dnaSampleID.sub", str_replace, "\\.202", "_202") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.site","dnaSampleID.sub"), "_") |> 
  separate(`dnaSampleID.site`, c("dnaSampleID.site","dnaSampleID.code"), "\\.") |> 
  
  mutate_at("dnaSampleID.sub", str_replace, "\\.DNA", "_DNA") |>
  separate(`dnaSampleID.sub`, c("dnaSampleID.sub","dnaSampleID.type"), "_") |> 
  
  separate(`dnaSampleID.sub`, c("dnaSampleID.date","dnaSampleID.niche", "dnaSampleID.num"), "\\.") |> 

  unite(dnaSampleID.niche, c(dnaSampleID.code, dnaSampleID.niche)) |> 
  mutate_at("dnaSampleID.niche", str_replace, "NA_", "") |>
  mutate_at("dnaSampleID.niche", str_replace, "_NA", "")
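An alternative for the aquatic IDs is to split on the dots and locate the 8-digit date token, which does not depend on the decade the sample was collected in. The sketch below is base R and assumes IDs of the form SITE[.CODE].DATE[.NICHE.NUM].TYPE. The helper parse_aquatic_id() is hypothetical and not part of phyloNEON. The first ID is from the str() output above, and the second is an invented ID with a sample-type code.

```r
ids <- c(
  "TECR.20230821.EPILITHON.8.DNA-DNA1",  # from the str() output above
  "ARIK.SS.20230705.DNA-DNA1"            # hypothetical ID with a code
)

# Hypothetical helper: split one aquatic dnaSampleID into its parts.
parse_aquatic_id <- function(id) {
  parts <- strsplit(id, ".", fixed = TRUE)[[1]]
  last <- length(parts)
  # Position of the first token made of exactly 8 digits (the date).
  date_pos <- grep("^[0-9]{8}$", parts)[1]
  # Tokens between the site and the date (may be empty).
  code <- parts[seq_len(date_pos - 1)][-1]
  # Tokens between the date and the final type token (may be empty).
  rest <- parts[seq_len(last - 1)][-seq_len(date_pos)]
  data.frame(
    site = parts[1],
    date = as.Date(parts[date_pos], format = "%Y%m%d"),
    niche = paste(c(code, head(rest, 1)), collapse = "_"),
    num = if (length(rest) >= 2) rest[2] else NA_character_,
    type = parts[last],
    stringsAsFactors = FALSE
  )
}

parsed_aquatic <- do.call(rbind, lapply(ids, parse_aquatic_id))
print(parsed_aquatic)
```

As in the pipeline above, the code and the niche end up in one niche column. The helper stops with an error if an ID contains no 8-digit token, so unexpected formats are not silently mis-parsed.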

HARV metagenomes by year and plot

R code

datatable(
neon.metaDB.my.soil |> 
  filter(siteID == "HARV") |> 
  group_by(Year = lubridate::year(collectDate), dnaSampleID.plot) |> 
  count() |> 
  pivot_wider(names_from = dnaSampleID.plot, values_from = n) |> 
  mutate(across(everything(), ~ replace_na(.x, 0)))  # funs() is deprecated in dplyr
)
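For a quick check of the same year-by-plot counts without pivoting, base R's table() cross-tabulates two vectors and fills absent combinations with zero. A minimal sketch on invented toy data follows. The year and plot values are hypothetical stand-ins for lubridate::year(collectDate) and dnaSampleID.plot.

```r
# Invented toy data: one row per metagenome.
toy <- data.frame(
  year = c(2016, 2016, 2017, 2017, 2017),
  plot = c("001", "004", "001", "001", "033")
)

# Cross-tabulate year against plot; combinations with no samples are 0.
counts <- table(Year = toy$year, plot = toy$plot)
print(counts)

# Convert to a plain wide data frame (rows = years, columns = plots),
# which can then be passed to datatable() if desired.
wide <- as.data.frame.matrix(counts)
print(wide)
```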

WREF metagenomes by year and plot

R code

datatable(
neon.metaDB.my.soil |> 
  filter(siteID == "WREF") |> 
  group_by(Year = lubridate::year(collectDate), dnaSampleID.plot) |> 
  count() |> 
  pivot_wider(names_from = dnaSampleID.plot, values_from = n) |> 
  mutate(across(everything(), ~ replace_na(.x, 0)))  # funs() is deprecated in dplyr
)

Plot of HARV samples per plot per year

R code

neon.metaDB.my.soil |> 
  filter(siteID == "HARV") |> 
  group_by(Year = lubridate::year(collectDate), dnaSampleID.plot) |> 
  count() |> 
  pivot_wider(names_from = dnaSampleID.plot, values_from = n) |> 
  mutate(across(everything(), ~ replace_na(.x, 0))) |> 
  pivot_longer(!Year, names_to = "plot", values_to = "metagenomes") |> 
  ggplot(aes(x=Year, y = plot)) +
  geom_tile(aes(fill = metagenomes)) +
  scale_fill_viridis(discrete=FALSE, direction = -1) +
  scale_x_continuous(breaks = seq(2013, 2023, by = 1))

Missing years at HARV

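The HARV tables above have no samples from 2014, 2015, 2018, 2021, or 2022. The 2021 and 2022 data should be in IMG soon, but what about 2018? The counts below, for all terrestrial sites combined, show that 2022 is absent everywhere, while 2018 and 2021 are present at some sites (2018 with only 45 samples).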

Samples per year based on collectDate

R code

neon.metaDB.my.soil |> 
  group_by(Year = lubridate::year(collectDate)) |> 
  count()

# A tibble: 9 × 2
# Groups:   Year [9]
   Year     n
  <dbl> <int>
1  2013    63
2  2014   106
3  2016   229
4  2017   326
5  2018    45
6  2019   231
7  2020   185
8  2021   117
9  2023   303

Samples per year based on dnaSampleID.date

R code

neon.metaDB.my.soil |> 
  group_by(Year = lubridate::year(dnaSampleID.date)) |> 
  count()

# A tibble: 9 × 2
# Groups:   Year [9]
   Year     n
  <dbl> <int>
1  2013    63
2  2014   106
3  2016   229
4  2017   326
5  2018    45
6  2019   231
7  2020   185
8  2021   117
9  2023   303

Plot of samples per plot per year at all sites

R code

neon.metaDB.my.soil |> 
  group_by(siteID, Year = lubridate::year(collectDate), dnaSampleID.plot) |> 
  count() |> 
  pivot_wider(names_from = Year, values_from = n) |> 
  mutate(across(everything(), ~ replace_na(.x, 0))) |> 
  pivot_longer(!c(siteID, dnaSampleID.plot), names_to = "Year", values_to = "metagenomes") |> 
  ggplot(aes(x=Year, y = dnaSampleID.plot)) +
  geom_tile(aes(fill = metagenomes)) +
  scale_fill_viridis(discrete=FALSE, direction = -1) +
 # scale_x_continuous(breaks = seq(2013, 2023, by = 1)) +
  facet_wrap(~siteID, scales ="free_y", ncol = 3) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))