Human Genome Analysis Lab 9 : Linking DTC SNPs to the NHGRI-EBI Catalog of published genome-wide association studies

IMPORTANT POINTS FOR TODAY’S LAB FOR UNITY USERS

  1. bioconductor and gwascat take a long time to install (20-30 min).
  2. The GWAS catalog is getting very large (375 MB) and difficult to download with intermittent or not fast internet connections. FOR UNITY USERS it is available on Canvas (Github doesn’t allow files greater than 100 MB). Put it in your data folder.

Learning objectives

  • Regular expressions
  • Mapping DTC raw data to phenotypes and the literature using gwascat

Introduction

Regular expressions

Several columns in the gwas table, in particular the study name, disease and traits, are entered by the authors and do not conform a particular ontology. Thus, it will be better to search for a term in the column rather than the complete string (name) in the column as we have done in previous labs. To do this we will need to use regular expressions. Understanding how to construct regular expressions is critical for data tidying and transformation. At the start of the lab we will go over Chapter * regular expressions in R for data science. In this lab we will make use of the str_detect function.

Bioconductor

Bioconductor is an open source, open development software project for genomic data analysis. There are thousands of software packages available and an active user community. R/Bioconductor is one of the primary mechanisms for publishing new genomic data analysis tools. We will use Bioconductor in the this and the following weeks labs.

Installing Bioconductor and GWASCAT (only for students working on Unity or with R download to their computer)

Bioconductor is already installed on RStudio Cloud. Install Bioconductor locally on your computer by typing in the R console

R code
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.19")

Individual packages that are not part of the BioConductor core can be installed using

R code
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# BiocManager::install("name_of_package")
# for example
BiocManager::install("gwascat")

On the computer

Libraries for today’s lab.

Note if you are working on your own computer (and not RStudio Cloud) you will need to install Bioconductor before installing gwascat.

R code
# Load Libraries
library(tidyverse)
library(DT)
library(gwascat)
# If you are downloading gwascat to your computer increase the default download time
options(timeout=600)

Loading our genetic data

23andMe

R code
# Load SNP file and convert genotype and chromosome to factors
mySNPs <- read_tsv("data/23andMe_raw.tsv", comment = '#', col_names = FALSE) |> 
  setNames(c("rsid", "chromosome", "position", "genotype")) |> 
  mutate(chromosome = as.factor(chromosome)) |> 
  mutate(genotype = as.factor(genotype)) 

AncestryDNA

R code
mySNPs <- read_tsv("data/AncestryDNA_raw.tsv", comment = '#') %>%  
  unite(genotype, c("allele1", "allele2"), sep='') %>% 
  mutate(chromosome = as.factor(chromosome)) %>% 
  mutate(genotype = as.factor(genotype))

MyHeritageDNA

R code
mySNPs <- read_csv("data/MyHeritage_raw.csv", comment = '#') |>
  # rename column labels to be consistent with 23andME and AncestryDNA
  rename(rsid = RSID) |> 
  rename(chromosome = CHROMOSOME) |> 
  rename(position = POSITION) |> 
  rename(genotype = RESULT) |> 
  # change to factors
  mutate(chromosome = as.factor(chromosome)) |> 
  mutate(genotype = as.factor(genotype))

FamilyTreeDNA

R code
mySNPs <- read_csv("data/FamilyTreeDNA_raw.csv", comment = '#') |>
  # rename column labels to be consistent with 23andME and AncestryDNA
  rename(rsid = RSID) |> 
  rename(chromosome = CHROMOSOME) |> 
  rename(position = POSITION) |> 
  rename(genotype = RESULT) |> 
  # change to factors
  mutate(chromosome = as.factor(chromosome)) |> 
  mutate(genotype = as.factor(genotype)) 

The GWAS catalog

Downloading the GWAS catalog

gwascat is an R packackage for working with the NHGRI-EBI Catalog of published genome-wide association studies.

The GWAS catalog is getting very large (~375 MB) and difficult to download with intermittent or slow internet connections. I have downloaded the table and put it in the data folder. FOR UNITY USERS it is also available in Canvas (Github doesn’t allow files greater than 100 MB).

R code
gwascat <- as.data.frame(makeCurrentGwascat())

Save table in case of internet issues

R code
write_tsv(gwascat, "data/updated_gwascat2.tsv") 

To get the catalog into a R data frame

R code
gwascat <- read_tsv("data/updated_gwas_data.tsv")

Exploring the traits associated with your genome using gwascat

The below examples and text are from a blog post by Adam Metcalf. I have stripped down the text to make it easier to follow the steps on your computer, but the post is an enjoyable read and provides additional background.

We can also use our rsid to link to NHGRI-EBI Catalog of published genome-wide association studies

To find the date of the most recent study

R code
max(gwascat$DATE.ADDED.TO.CATALOG)
[1] "2024-10-18"

To get the most recent studies

R code
gwascat |> 
  filter(DATE.ADDED.TO.CATALOG == max(gwascat$DATE.ADDED.TO.CATALOG)) |> 
  select(STUDY) |> 
  distinct(STUDY)
# A tibble: 3 × 1
  STUDY                                                                         
  <chr>                                                                         
1 Multivariate genomic analysis of 5***million people elucidates the genetic ar…
2 Imaging genetics of language network functional connectivity reveals links wi…
3 GWAS and polygenic risk score of severe COVID-19 in Eastern Europe.           

To find the link on Pubmed to those studies

R code
gwascat |> 
  filter(DATE.ADDED.TO.CATALOG == max(gwascat$DATE.ADDED.TO.CATALOG)) |>     
  select(LINK) |> 
  distinct(LINK)
# A tibble: 3 × 1
  LINK                                
  <chr>                               
1 www.ncbi.nlm.nih.gov/pubmed/39349817
2 www.ncbi.nlm.nih.gov/pubmed/39342056
3 www.ncbi.nlm.nih.gov/pubmed/39364016

To get studies associated with a term

R code
gwascat |> 
  filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |> 
  distinct(DISEASE.TRAIT)
# A tibble: 1,150 × 1
   DISEASE.TRAIT                                            
   <chr>                                                    
 1 Type 2 diabetes                                          
 2 Adult onset asthma or type 2 diabetes                    
 3 Nonatopic asthma or type 2 diabetes                      
 4 Diabetic neuropathy in type 2 diabetes                   
 5 Macrovascular complications in type 2 diabetes           
 6 Ophthalmic complications in type 2 diabetes              
 7 Chronic kidney disease x type 2 diabetes interaction     
 8 Peripheral arterial disease x type 2 diabetes interaction
 9 Coronary heart disease x type 2 diabetes interaction     
10 Acute ischemic stroke x type 2 diabetes interaction      
# ℹ 1,140 more rows

That is a lot of different DISEASE.TRAIT categories associated with diabetes. Now let’s look at Studies with diabetes in the title.

R code
gwascat |> 
  filter(str_detect(tolower(STUDY), "diabetes")) |> 
  distinct(STUDY)
# A tibble: 219 × 1
   STUDY                                                                        
   <chr>                                                                        
 1 Associations of Observational and Genetically Determined Caffeine Intake Wit…
 2 Identification of type 2 diabetes loci in 433,540 East Asian individuals.    
 3 Novel susceptibility loci identified in a genome-wide association study of t…
 4 Discovery of 318 new risk loci for type 2 diabetes and related vascular outc…
 5 Phenotypic and Genetic Characterization of Lower LDL-C and Increased Type-2 …
 6 Interactions of Habitual Coffee Consumption by Genetic Polymorphisms with th…
 7 Genetic Predisposition to Coronary Artery Disease in Type 2 Diabetes Mellitu…
 8 A variant within the FTO confers susceptibility to diabetic nephropathy in J…
 9 Identification of 28 new susceptibility loci for type 2 diabetes in the Japa…
10 A Genome-Wide Association Study of Diabetic Kidney Disease in Subjects With …
# ℹ 209 more rows

To find studies that include both diabetes and asthma

R code
gwascat |> 
  filter(str_detect(tolower(DISEASE.TRAIT), "diabetes") & str_detect(tolower(DISEASE.TRAIT), "asthma")) |>
  distinct(STUDY)
# A tibble: 1 × 1
  STUDY                                                                         
  <chr>                                                                         
1 Shared Genetic and Experimental Links between Obesity-Related Traits and Asth…
R code
gwascat |> 
  filter(str_detect(tolower(STUDY), "diabetes") & str_detect(tolower(STUDY), "asthma")) |>
  distinct(STUDY)
# A tibble: 1 × 1
  STUDY                                                                         
  <chr>                                                                         
1 Variants in <i>JAZF1</i> are associated with asthma, type 2 diabetes, and hei…

Note that we get different results depending on whether we search the DISEASE.TRAIT or STUDY.

Let’s look at how many alleles (SNPS) that are associated with diabetes

R code
gwascat |> 
  filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |> 
  distinct(SNPS)
# A tibble: 5,325 × 1
   SNPS       
   <chr>      
 1 rs10852123 
 2 rs8026714  
 3 rs61021634 
 4 rs79826452 
 5 rs2240885  
 6 rs117267808
 7 rs1421085  
 8 rs12600132 
 9 rs6416749  
10 rs2925979  
# ℹ 5,315 more rows

Joining mySNPs with the updated gwas data

The GWAS data lists each SNP (e.g. “rs9302874”) in a field called SNPS. Our imported 23andme data has the same info in the rsid field. Hence we can do a simple join, here using the dplyr library.

R code
mySNPs_gwas_table <- inner_join(mySNPs, gwascat, by = c("rsid" = "SNPS"))

Now that we have joined the genetic test with the gwas cat let’s visit how many SNPS/rsids we have associated with our genetic data

R code
mySNPs_gwas_table |> 
  filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |> 
  distinct(rsid)
# A tibble: 1,745 × 1
   rsid      
   <chr>     
 1 rs11583755
 2 rs10399665
 3 rs880315  
 4 rs4845987 
 5 rs4661310 
 6 rs4920461 
 7 rs2236835 
 8 rs3738126 
 9 rs10916780
10 rs6672420 
# ℹ 1,735 more rows

We have lost over 4,000 alleles that were associate with diabetes-related traits.

Note the consequences of the inner join here. 23andme analyses SNPs that don’t appear in ths GWAS database, and the GWAS database may contain SNPs that 23andme doesn’t provide for you. In either case, these will be removed in the above file result. There’ll just be rows for SNPs that 23andme does provide you, and that do have an entry in the GWAS database.

Also, the GWAS database may have several rows for a single SNP. It could be that several studies examined a single SNP, or that one study found many traits potentially associated with a SNP. This means your final “output_data” table above will have many rows per for some SNPs.

How shall we narrow down this data-mass to find something potentially interesting?

There are many fields in the GWAS database you might care about – the definitions being listed here. DISEASE.TRAIT and STRONGEST.SNP.RISK.ALLELE might be of most interest.

DISEASE.TRAIT gives you a genericish name for the trait that a study investigated whether there was an association with a given SNP (e.g. “Plasma omega-3 polyunsaturated fatty acid levels”, or “Systolic blood pressure”). Note that the values are not actually all “diseases” by the common-sense meaning – unless you consider traits like being tall a type of illness anyway.

STRONGEST.SNP.RISK.ALLELE gives you the specific allele of the SNP that was “most strongly” associated with that trait in the study (or potentially a ? if unknown, but let’s ignore those for now). The format here is to show the name of the SNP first, then append a dash and the allele of interest afterwards e.g. “rs10787517-A” or “rs7977462-C”.

Back to an exploration of your genome – the two most obvious approaches that come to mind are either: 1) check whether your 23andMe results suggest an association with a specific trait you’re interested in, or 2) check which traits your results may be associated with.

In either case, it’ll be useful to create a field that highlights whether your 23andme results indicate that you have the “strongest risk allele” for each study. This is one way to help narrow down towards the interesting traits you may have inherited.

The 23andme part of of your dataframe contains your personal allele results in the genotype field. There you’ll see entries like “AC” or “TT”. What we really want to do here is, for every combination of SNP and study, check to see if either of the letters in your genotype match up with the letter part of the strongest risk allele.

One method would be to separate out your “genotype” data field into two individual allele fields (so “AC” becomes “A” and “C”). Next, clean up the strongest risk allele so you only have the individual allele (so “rs10787517-A” becomes “A”). Finally check whether either or both of your personal alleles match the strongest risk allele. If they do, there might be something of interest here.

The below code will cover this.

R code
mySNPs_gwas_table_risk <- mySNPs_gwas_table

mySNPs_gwas_table_risk$risk_allele_clean <- str_sub(mySNPs_gwas_table$`STRONGEST.SNP.RISK.ALLELE`, -1)

mySNPs_gwas_table_risk$my_allele_1 <- str_sub(mySNPs_gwas_table$genotype, 1, 1)

mySNPs_gwas_table_risk$my_allele_2 <- str_sub(mySNPs_gwas_table$genotype, 2, 2)

mySNPs_gwas_table_risk$have_risk_allele_count <- if_else(mySNPs_gwas_table_risk$my_allele_1 == mySNPs_gwas_table_risk$risk_allele_clean, 1, 0) + if_else(mySNPs_gwas_table_risk$my_allele_2 == mySNPs_gwas_table_risk$risk_allele_clean, 1, 0)

Now you have your two individual alleles stored in my_allele_1 and my_allele_2, and the allele for the “highest risk” stored in risk_allele_clean. Risk_allele_clean is the letter part of the GWAS STRONGEST.SNP.RISK.ALLELE field. And finally, the have_risk_allele_count is either 0, 1 or 2 depending on whether your 23andme genotype result at that SNP contains 0, 1 or 2 of the risk alleles.

The previously mentioned DISEASE.TRAIT field contains a summary of the trait involved. So by filtering your dataset to only look for studies about a trait you care about, you can see a summary of the risk allele and whether or not you have it, and the relevant studies that elicited that connection.

R code
mySNPs_gwas_table_risk |> 
  filter(have_risk_allele_count >= 1) |>
  filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |> 
  distinct(rsid)
# A tibble: 754 × 1
   rsid      
   <chr>     
 1 rs11583755
 2 rs10399665
 3 rs6672420 
 4 rs11249215
 5 rs3753693 
 6 rs12116935
 7 rs12742756
 8 rs17106184
 9 rs12140153
10 rs2269247 
# ℹ 744 more rows
R code
mySNPs_gwas_table_risk |> 
  filter(have_risk_allele_count >= 1) |>
  filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |> 
  select(c(RISK.ALLELE.FREQUENCY, P.VALUE)) |> 
  arrange(P.VALUE)
# A tibble: 1,720 × 2
   RISK.ALLELE.FREQUENCY  P.VALUE
                   <dbl>    <dbl>
 1                 0.292 1  e-307
 2                 0.780 2  e-307
 3                 0.708 3  e-307
 4                 0.295 6  e-307
 5                NA     9  e-307
 6                 0.56  2  e-296
 7                 0.56  2  e-287
 8                 0.56  1  e-271
 9                 0.56  2  e-265
10                 0.632 2  e-245
# ℹ 1,710 more rows

That is a lot of risk alleles (rsids) that have a which have an association with diabetes

For example, let’s assume that by now we also inevitably developed a strong interest in omegas and fatty acids. Which SNPs may relate to that topic, and do we personally have the risk allele for any of them?

We can use the str_detect function of the stringr library in order to search for any entries that contain the substring “omega” or “fatty acid”. Then get the associated Studies.

R code
mySNPs_gwas_table_risk |> 
  select(rsid, DISEASE.TRAIT, STUDY, risk_allele = risk_allele_clean, your_geneotype = genotype) |> 
  filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) |>
  distinct(STUDY)
# A tibble: 36 × 1
   STUDY                                                                        
   <chr>                                                                        
 1 Characterising metabolomic signatures of lipid-modifying therapies through d…
 2 Role of circulating polyunsaturated fatty acids on cardiovascular diseases r…
 3 Genome-wide association study of plasma N6 polyunsaturated fatty acids withi…
 4 Genome-Wide Association Study for Serum Omega-3 and Omega-6 Polyunsaturated …
 5 Genome-wide characterization of circulating metabolic biomarkers.            
 6 Association between Human Genetic Variants and the Vaginal Bacteriome of Pre…
 7 A genome-wide association study of n-3 and n-6 plasma fatty acids in a Singa…
 8 Genetic loci associated with plasma phospholipid n-3 fatty acids: a meta-ana…
 9 Genetic Studies of Metabolomics Change After a Liquid Meal Illuminate Novel …
10 Metabolomic Investigation of Major Depressive Disorder Identifies a Potentia…
# ℹ 26 more rows

To find out if a particular allele (rsid) is associated with other traits

R code
mySNPs_gwas_table_risk |> 
  select(rsid, DISEASE.TRAIT, STUDY) |> 
  filter(rsid == "rs174537") |> 
  distinct(DISEASE.TRAIT)
# A tibble: 132 × 1
   DISEASE.TRAIT                                                                
   <chr>                                                                        
 1 Electrocardiographic traits (multivariate)                                   
 2 Sex hormone-binding globulin levels adjusted for BMI                         
 3 Sex hormone-binding globulin levels                                          
 4 Triglycerides                                                                
 5 HDL cholesterol                                                              
 6 Colorectal cancer                                                            
 7 Serum metabolite ratios in chronic kidney disease                            
 8 Serum metabolite concentrations in chronic kidney disease                    
 9 Phosphatidate(43:4)_[M-H]1- levels                                           
10 Phosphatidylinositol-O(36:0)_[M-H]1-/Phosphatidylglycerol(37:0)_[M+OAc]1- le…
# ℹ 122 more rows

Now onto the second approach – this time, you don’t have a specific trait in mind. You’re more interested in discovering which traits have risk alleles that match the respective part of your genome.

We already have our have_risk_allele_count field. If it’s 1 or 2 then you have some sort of match. So, the full list of your matches and the associated studies could be retrieved in a manner like this.

R code
mySNPs_gwas_table_risk |> 
  filter(have_risk_allele_count >= 1) |>
  select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )
# A tibble: 83,679 × 5
   rsid       your_genotype strongest_risk_allele DISEASE.TRAIT            STUDY
   <chr>      <fct>         <chr>                 <chr>                    <chr>
 1 rs2272756  AG            A                     Skin reflectance (Melan… A ge…
 2 rs1891910  GG            G                     Parental longevity (mot… A Pr…
 3 rs4072537  CT            C                     Serum phosphate levels   Gene…
 4 rs9442380  CC            C                     Urate levels             Gene…
 5 rs9442385  GG            G                     Clear cell renal cell c… Mult…
 6 rs1891905  TT            T                     Serum urate levels       Larg…
 7 rs3766186  CC            C                     Blood protein levels     Co-r…
 8 rs78446752 CC            C                     Schizophrenia            Mapp…
 9 rs4648727  AC            A                     Serum levels of protein… A ge…
10 rs4648727  AC            A                     Hemoglobin               A cr…
# ℹ 83,669 more rows

There are various other fields in the GWAS dataset you might consider using to filter down further. For example, you might be most interested in findings from studies that match your ethnicity, or occasions where you have risk alleles that are rare within the population. After all, we all like to think we’re special snowflakes, so if 95% of the general population have the risk allele for a trait, then that may be less interesting than one where you are in the lucky or unlucky 1%.

For the former, you might try searching within the INITIAL.SAMPLE.SIZE or REPLICATION.SAMPLE.SIZE fields, which has entries like: “272 Han Chinese ancestry individuals” or “1,180 European ancestry individuals from ~475 families”.

Similar to the caveats on searching the trait fields, one does need to be careful here if you’re looking for a comprehensive set of results. Some entries in the database have blanks in one of these fields, and others don’t specify ethnicities, having entries like “Up to 984 individuals”.

For the proportion of the studied population who had the risk allele, it’s the RISK.ALLELE.FREQUENCY field. Again, this can sometimes be blank or zero. But in theory, where it has a valid value, then, depending on the study design, you might find that lower frequencies are rarer traits.

We can use dplyr‘s arrange and filter functions to sort do the above sort of narrowing-down. For example: what are the top 10 trait / study / SNP combinations you have the risk allele for that were explicitly studied within European folk, ordered by virtue of them having the lowest population frequencies reported in the study?

R code
mySNPs_gwas_table_risk |> 
  filter(have_risk_allele_count > 0 & (str_detect(tolower(INITIAL.SAMPLE.SIZE), "european") |   str_detect(tolower(REPLICATION.SAMPLE.SIZE), "european")) & (RISK.ALLELE.FREQUENCY > 0 & !is.na(RISK.ALLELE.FREQUENCY))) |>
  arrange(RISK.ALLELE.FREQUENCY) |>
  select(rsid, your_genotype = genotype, DISEASE.TRAIT, INITIAL.SAMPLE.SIZE, RISK.ALLELE.FREQUENCY)
# A tibble: 50,791 × 5
   rsid    your_genotype DISEASE.TRAIT INITIAL.SAMPLE.SIZE RISK.ALLELE.FREQUENCY
   <chr>   <fct>         <chr>         <chr>                               <dbl>
 1 rs1224… TT            Airway respo… Up to 4,108 Europe…               0.0002 
 2 rs1690… TT            Airway respo… Up to 4,108 Europe…               0.0004 
 3 rs1711… AG            Diisocyanate… 74 European ancest…               0.003  
 4 rs1447… AA            LDL choleste… 215,551 European a…               0.0035 
 5 rs1143… TT            Dementia      2,263 African ance…               0.00379
 6 rs7661… TT            Dementia      2,263 African ance…               0.00519
 7 rs1433… AG            Genetically … 265,000 European a…               0.00598
 8 rs1270… AG            Genetically … 265,000 European a…               0.00667
 9 rs1888… CC            Ticagrelor l… 1,794 European anc…               0.007  
10 rs4710… CT            IgG glycosyl… 2,247 European anc…               0.00804
# ℹ 50,781 more rows

Or perhaps you’d prefer to prioritize the order of the traits you have the risk allele for, for example, based on the number of entries in the GWAS database for that trait where the highest risk allele is one you have. You might argue that these could be some of the most reliably associated traits, in the sense that they would bias towards those that have so far been studied the most, at least within this database.

R code
trait_entry_count <- mySNPs_gwas_table_risk |> 
  group_by(DISEASE.TRAIT) |>
  filter(have_risk_allele_count >= 1) |>
  summarise(count_of_entries = n()) 

trait_entry_count |> 
  filter(count_of_entries > 500) |> 
ggplot(aes(x = reorder(DISEASE.TRAIT, count_of_entries, sum), y = count_of_entries)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = "Example Figure 1 - Traits where the risk alleles have over 500 entries in the GWAS database?", y = "Count of entries", x = "Trait")

Now, as we’re counting combinations of studies and SNPs per trait here, this is obviously going to be somewhat self-fulfilling as some traits have been featured in way more studies than others. Likewise some traits may have been associated with many more SNPs than others. Also, recalling that many interesting traits seem to be related to a complex mix of SNPs, each of which may only have a tiny effect size, it might be that whilst you do have 10 of the risk alleles for condition X, you also don’t have the other 20 risk alleles that we discovered so far have an association (let alone the 100 weren’t even publish on yet and hence aren’t in this data!).

Maybe then we can sort our output in a different way. How about we count the number of distinct SNPs where you have the risk allele, and then express those as a proportion of the count of all the distinct SNPs for the given trait in the database, whether not you have the risk allele? This would let us say things such as, based (solely) on what in this database, you have 60% of the known risk alleles associated with this trait.

One thing noted in the data, both the 23andme genome data and the gwascat highest risk allele have unusual values in the allele fields – things like ?, -, R, D, I and some numbers based on the fact the “uncleaned” STRONGEST.SNP.RISK.ALLELE didn’t have a -A, -C, -G or -T at the end of the SNP it named. Some of these entries may be meaningful – for example the D and I in the 23andme data refer to deletes and insertions, but won’t match up with anything in the gwascat data. Others may be more messy or missing data, for example 23andme reports “–” if no genotype result was provided for a specific SNP call.

In order to avoid these inflating the proportion’s denominator we’ll just filter down so that we only consider entries where our gwascat-derived “risk_allele_clean” and 23andme-derived “my_allele_1” and “my_allele_2″ are all one of the standard A, C, G or T bases.

Let’s also color code the results by the rarity of the SNP variant within the studied population. That might provide some insight to exactly what sort of special exception we are as an individual – although some of the GWAS data is missing that field and basic averaging won’t necessarily give the correct weighting, so this part is extra…”directional”.

You are no doubt getting bored with the sheer repetition of caveats here – but it is so important. Whilst these are refinements of sorts, they are simplistic and flawed and you should not even consider concluding something significant about your personal health without seeking professional advice here. This is fun only. Well, for for those of us who could ever possibly classify data as fun anyway.

Here we go, building it up one piece at a time for clarity of some kind:

R code
# Summarise proportion of SNPs for a given trait where you have a risk allele

trait_snp_proportion <- mySNPs_gwas_table_risk |> 
  filter(risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) |>
  mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) |>
  group_by(DISEASE.TRAIT, you_have_risk_allele) |>
  summarise(count_of_snps = n_distinct(rsid)) |>
  mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) |>
  filter(you_have_risk_allele == 1) |>
  arrange(desc(proportion_of_snps_for_trait)) |>
  ungroup()

Count the studies per trait in the database

R code
trait_study_count <- mySNPs_gwas_table_risk |> 
  filter(risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) |>
  group_by(DISEASE.TRAIT) |>
  summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))
R code
# Merge the above together

trait_snp_proportion_study_count  <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")
R code
# Plot the traits where there were more than 2 studies and you have risk alleles for more than 70% of the SNPs studied

trait_snp_proportion_study_count |> 
  filter(count_of_studies > 5 & proportion_of_snps_for_trait > 70) |> 
ggplot(aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
  geom_col() +
  coord_flip() +
  theme_bw() + 
  labs(title = "Example Figure 2. Traits I have more than half of the risk\nalleles studied where > 5 studies involved", 
  y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency") 

Exercises

Only do these exercises after you have complete the above examples. The questions assume you already have the data objects loaded into R.

Exercise 1

How many unique alleles (rsids) are in your genetic test?

Exercise 2

How many unique alleles (SNPS) are in the gwas catalog?

Exercise 3

How many rsids are in the data set after joining your genetic test with the gwas catalog?

Exercise 4

What is the difference between DISEASE.TRAIT and MAPPED_TRAIT when you search for a term (e.g. diabetes)?

Exercise 5

Modify the below code to select for Type 2 Diabetes risk alleles (put the trait of interest in lower case). View mySNPs_gwas_table_risk for proper capitalization and spacing. How many distinct

R code
mySNPs_gwas_table_risk |> 
select(rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_geneotype = genotype) |> 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) |>
 distinct()

Exercise 6

Repeat Ex5 using a trait of your choice