Environmental Correlations in Microbial Communities (NEON MAGs)

Author

Tutorial

Published

December 3, 2025

Introduction

This tutorial explores environmental correlations in microbial communities using NEON MAGs data: - Correlate soil chemistry (pH, moisture, temperature, elevation) with microbial diversity and taxa. - Perform gradient analysis to identify taxa enriched along environmental gradients. - Use Canonical Correspondence Analysis (CCA) to link community composition to environmental variables.

Step 1: Load and Prepare Data

R code

df <- read_csv("NEON_soilMAGs_soilChem.csv")

# Select relevant columns
df <- df %>%
  filter(!is.na(genomicsSampleID)) %>%
  mutate(siteID = site_ID,
         sampleID = genomicsSampleID) %>%
  select(sampleID, siteID, phylum, soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation)

Step 2: Aggregate Data by Sample

R code

tax_profile <- df %>%
  group_by(sampleID, phylum) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = phylum, values_from = count, values_fill = 0)

# Merge with environmental data
env_data <- df %>% select(sampleID, soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation) %>% distinct()
merged_data <- left_join(tax_profile, env_data, by = "sampleID")

Step 3: Correlation Analysis

Explanation: Correlate soil variables with diversity metrics and selected taxa.

R code

# Compute diversity metrics
abundance_matrix <- merged_data %>% select(-sampleID, -soilMoisture, -soilTemp, -soilInWaterpH, -soilInCaClpH, -elevation)
shannon <- diversity(abundance_matrix, index = "shannon")
simpson <- diversity(abundance_matrix, index = "simpson")
richness <- specnumber(abundance_matrix)

# Combine with environmental data
corr_df <- merged_data %>% select(sampleID, soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation) %>%
  mutate(Shannon = shannon, Simpson = simpson, Richness = richness)

# Correlation plot
ggplot(corr_df, aes(x = soilMoisture, y = Shannon)) +
  geom_point() + geom_smooth(method = "lm") +
  labs(title = "Correlation: Soil Moisture vs Shannon Diversity")

Step 4: Gradient Analysis

Explanation: Identify taxa enriched along pH or moisture gradients.

R code

# Example: Correlation of Acidobacteriota abundance with pH
target_taxon <- "Acidobacteriota"
merged_data %>%
  ggplot(aes(x = soilInWaterpH, y = !!sym(target_taxon))) +
  geom_point() + geom_smooth(method = "lm") +
  labs(title = paste("Gradient Analysis: ", target_taxon, " vs pH"))

Step 5: Canonical Correspondence Analysis (CCA)

Explanation: Link community composition to environmental variables.

R code

# Prepare matrices
abundance_matrix <- merged_data %>% select(-sampleID, -soilMoisture, -soilTemp, -soilInWaterpH, -soilInCaClpH, -elevation)
env_matrix <- merged_data %>% select(soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation)

# Perform CCA
cca_model <- cca(abundance_matrix ~ ., data = env_matrix)
summary(cca_model)


Call:
cca(formula = abundance_matrix ~ soilMoisture + soilTemp + soilInWaterpH +      soilInCaClpH + elevation, data = env_matrix) 

Partitioning of scaled Chi-square:
              Inertia Proportion
Total          2.6009     1.0000
Constrained    0.3828     0.1472
Unconstrained  2.2182     0.8528

Eigenvalues, and their contribution to the scaled Chi-square 

Importance of components:
                         CCA1    CCA2    CCA3     CCA4     CCA5    CA1     CA2
Eigenvalue            0.19931 0.12342 0.04250 0.010087 0.007432 0.6531 0.15254
Proportion Explained  0.07663 0.04745 0.01634 0.003878 0.002857 0.2511 0.05865
Cumulative Proportion 0.07663 0.12409 0.14043 0.144304 0.147161 0.3983 0.45690
                          CA3     CA4     CA5     CA6     CA7     CA8     CA9
Eigenvalue            0.13116 0.10967 0.10706 0.09812 0.08784 0.07540 0.06901
Proportion Explained  0.05043 0.04217 0.04116 0.03773 0.03377 0.02899 0.02653
Cumulative Proportion 0.50733 0.54949 0.59065 0.62838 0.66215 0.69114 0.71767
                         CA10    CA11    CA12    CA13    CA14    CA15    CA16
Eigenvalue            0.06175 0.05868 0.05130 0.04824 0.04689 0.04333 0.04170
Proportion Explained  0.02374 0.02256 0.01973 0.01855 0.01803 0.01666 0.01603
Cumulative Proportion 0.74141 0.76398 0.78370 0.80225 0.82028 0.83694 0.85297
                         CA17    CA18    CA19    CA20    CA21    CA22    CA23
Eigenvalue            0.03881 0.03710 0.03476 0.03293 0.03150 0.02997 0.02699
Proportion Explained  0.01492 0.01426 0.01336 0.01266 0.01211 0.01152 0.01038
Cumulative Proportion 0.86789 0.88216 0.89552 0.90818 0.92029 0.93182 0.94219
                         CA24     CA25     CA26     CA27     CA28     CA29
Eigenvalue            0.02606 0.024659 0.022182 0.016794 0.015542 0.013283
Proportion Explained  0.01002 0.009481 0.008529 0.006457 0.005976 0.005107
Cumulative Proportion 0.95221 0.961693 0.970222 0.976679 0.982654 0.987762
                          CA30     CA31     CA32     CA33      CA34
Eigenvalue            0.011622 0.009967 0.005175 0.003759 0.0013079
Proportion Explained  0.004469 0.003832 0.001990 0.001445 0.0005029
Cumulative Proportion 0.992230 0.996062 0.998052 0.999497 1.0000000

Accumulated constrained eigenvalues
Importance of components:
                        CCA1   CCA2   CCA3    CCA4     CCA5
Eigenvalue            0.1993 0.1234 0.0425 0.01009 0.007432
Proportion Explained  0.5207 0.3225 0.1110 0.02635 0.019416
Cumulative Proportion 0.5207 0.8432 0.9542 0.98058 1.000000

R code

# Interactive CCA plot
cca_sites <- scores(cca_model, display = "sites")
cca_env <- scores(cca_model, display = "bp")
cca_df <- as.data.frame(cca_sites)
cca_df$sampleID <- rownames(cca_df)
cca_env_df <- as.data.frame(cca_env)
cca_env_df$variable <- rownames(cca_env_df)

p_cca <- ggplot() +
  geom_point(data = cca_df, aes(x = CCA1, y = CCA2, text = sampleID), color = "blue") +
  geom_segment(data = cca_env_df, aes(x = 0, y = 0, xend = CCA1, yend = CCA2, text = variable),
               arrow = arrow(length = unit(0.3, "cm")), color = "red") +
  geom_text(data = cca_env_df, aes(x = CCA1, y = CCA2, label = variable), vjust = -0.5) +
  labs(title = "CCA: Community vs Environment") + theme_minimal()

ggplotly(p_cca, tooltip = c("text"))

Interpretation

Correlation plots show relationships between soil variables and diversity.
Gradient analysis highlights taxa responding to environmental gradients.
CCA links overall community composition to environmental drivers.