Introduction
This tutorial explores environmental correlations in microbial communities using NEON MAGs data: - Correlate soil chemistry (pH, moisture, temperature, elevation) with microbial diversity and taxa. - Perform gradient analysis to identify taxa enriched along environmental gradients. - Use Canonical Correspondence Analysis (CCA) to link community composition to environmental variables.
Step 1: Load and Prepare Data
R code
df <- read_csv ("NEON_soilMAGs_soilChem.csv" )
# Select relevant columns
df <- df %>%
filter (! is.na (genomicsSampleID)) %>%
mutate (siteID = site_ID,
sampleID = genomicsSampleID) %>%
select (sampleID, siteID, phylum, soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation)
Step 2: Aggregate Data by Sample
R code
tax_profile <- df %>%
group_by (sampleID, phylum) %>%
summarise (count = n (), .groups = "drop" ) %>%
pivot_wider (names_from = phylum, values_from = count, values_fill = 0 )
# Merge with environmental data
env_data <- df %>% select (sampleID, soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation) %>% distinct ()
merged_data <- left_join (tax_profile, env_data, by = "sampleID" )
Step 3: Correlation Analysis
Explanation : Correlate soil variables with diversity metrics and selected taxa.
R code
# Compute diversity metrics
abundance_matrix <- merged_data %>% select (- sampleID, - soilMoisture, - soilTemp, - soilInWaterpH, - soilInCaClpH, - elevation)
shannon <- diversity (abundance_matrix, index = "shannon" )
simpson <- diversity (abundance_matrix, index = "simpson" )
richness <- specnumber (abundance_matrix)
# Combine with environmental data
corr_df <- merged_data %>% select (sampleID, soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation) %>%
mutate (Shannon = shannon, Simpson = simpson, Richness = richness)
# Correlation plot
ggplot (corr_df, aes (x = soilMoisture, y = Shannon)) +
geom_point () + geom_smooth (method = "lm" ) +
labs (title = "Correlation: Soil Moisture vs Shannon Diversity" )
Step 4: Gradient Analysis
Explanation : Identify taxa enriched along pH or moisture gradients.
R code
# Example: Correlation of Acidobacteriota abundance with pH
target_taxon <- "Acidobacteriota"
merged_data %>%
ggplot (aes (x = soilInWaterpH, y = !! sym (target_taxon))) +
geom_point () + geom_smooth (method = "lm" ) +
labs (title = paste ("Gradient Analysis: " , target_taxon, " vs pH" ))
Step 5: Canonical Correspondence Analysis (CCA)
Explanation : Link community composition to environmental variables.
R code
# Prepare matrices
abundance_matrix <- merged_data %>% select (- sampleID, - soilMoisture, - soilTemp, - soilInWaterpH, - soilInCaClpH, - elevation)
env_matrix <- merged_data %>% select (soilMoisture, soilTemp, soilInWaterpH, soilInCaClpH, elevation)
# Perform CCA
cca_model <- cca (abundance_matrix ~ ., data = env_matrix)
summary (cca_model)
Call:
cca(formula = abundance_matrix ~ soilMoisture + soilTemp + soilInWaterpH + soilInCaClpH + elevation, data = env_matrix)
Partitioning of scaled Chi-square:
Inertia Proportion
Total 2.6009 1.0000
Constrained 0.3828 0.1472
Unconstrained 2.2182 0.8528
Eigenvalues, and their contribution to the scaled Chi-square
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5 CA1 CA2
Eigenvalue 0.19931 0.12342 0.04250 0.010087 0.007432 0.6531 0.15254
Proportion Explained 0.07663 0.04745 0.01634 0.003878 0.002857 0.2511 0.05865
Cumulative Proportion 0.07663 0.12409 0.14043 0.144304 0.147161 0.3983 0.45690
CA3 CA4 CA5 CA6 CA7 CA8 CA9
Eigenvalue 0.13116 0.10967 0.10706 0.09812 0.08784 0.07540 0.06901
Proportion Explained 0.05043 0.04217 0.04116 0.03773 0.03377 0.02899 0.02653
Cumulative Proportion 0.50733 0.54949 0.59065 0.62838 0.66215 0.69114 0.71767
CA10 CA11 CA12 CA13 CA14 CA15 CA16
Eigenvalue 0.06175 0.05868 0.05130 0.04824 0.04689 0.04333 0.04170
Proportion Explained 0.02374 0.02256 0.01973 0.01855 0.01803 0.01666 0.01603
Cumulative Proportion 0.74141 0.76398 0.78370 0.80225 0.82028 0.83694 0.85297
CA17 CA18 CA19 CA20 CA21 CA22 CA23
Eigenvalue 0.03881 0.03710 0.03476 0.03293 0.03150 0.02997 0.02699
Proportion Explained 0.01492 0.01426 0.01336 0.01266 0.01211 0.01152 0.01038
Cumulative Proportion 0.86789 0.88216 0.89552 0.90818 0.92029 0.93182 0.94219
CA24 CA25 CA26 CA27 CA28 CA29
Eigenvalue 0.02606 0.024659 0.022182 0.016794 0.015542 0.013283
Proportion Explained 0.01002 0.009481 0.008529 0.006457 0.005976 0.005107
Cumulative Proportion 0.95221 0.961693 0.970222 0.976679 0.982654 0.987762
CA30 CA31 CA32 CA33 CA34
Eigenvalue 0.011622 0.009967 0.005175 0.003759 0.0013079
Proportion Explained 0.004469 0.003832 0.001990 0.001445 0.0005029
Cumulative Proportion 0.992230 0.996062 0.998052 0.999497 1.0000000
Accumulated constrained eigenvalues
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5
Eigenvalue 0.1993 0.1234 0.0425 0.01009 0.007432
Proportion Explained 0.5207 0.3225 0.1110 0.02635 0.019416
Cumulative Proportion 0.5207 0.8432 0.9542 0.98058 1.000000
R code
# Interactive CCA plot
cca_sites <- scores (cca_model, display = "sites" )
cca_env <- scores (cca_model, display = "bp" )
cca_df <- as.data.frame (cca_sites)
cca_df$ sampleID <- rownames (cca_df)
cca_env_df <- as.data.frame (cca_env)
cca_env_df$ variable <- rownames (cca_env_df)
p_cca <- ggplot () +
geom_point (data = cca_df, aes (x = CCA1, y = CCA2, text = sampleID), color = "blue" ) +
geom_segment (data = cca_env_df, aes (x = 0 , y = 0 , xend = CCA1, yend = CCA2, text = variable),
arrow = arrow (length = unit (0.3 , "cm" )), color = "red" ) +
geom_text (data = cca_env_df, aes (x = CCA1, y = CCA2, label = variable), vjust = - 0.5 ) +
labs (title = "CCA: Community vs Environment" ) + theme_minimal ()
ggplotly (p_cca, tooltip = c ("text" ))
Interpretation
Correlation plots show relationships between soil variables and diversity.
Gradient analysis highlights taxa responding to environmental gradients.
CCA links overall community composition to environmental drivers.