R code
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install(version = "3.19") BiocManager
bioconductor
and gwascat
take a long time to install (20-30 min).data
folder.Several columns in the gwas table, in particular the study name, disease and traits, are entered by the authors and do not conform a particular ontology. Thus, it will be better to search for a term in the column rather than the complete string (name) in the column as we have done in previous labs. To do this we will need to use regular expressions. Understanding how to construct regular expressions is critical for data tidying and transformation. At the start of the lab we will go over Chapter * regular expressions in R for data science. In this lab we will make use of the str_detect
function.
Bioconductor is an open source, open development software project for genomic data analysis. There are thousands of software packages available and an active user community. R/Bioconductor is one of the primary mechanisms for publishing new genomic data analysis tools. We will use Bioconductor in the this and the following weeks labs.
Bioconductor is already installed on RStudio Cloud. Install Bioconductor locally on your computer by typing in the R console
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install(version = "3.19") BiocManager
Individual packages that are not part of the BioConductor core can be installed using
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# BiocManager::install("name_of_package")
# for example
::install("gwascat") BiocManager
Note if you are working on your own computer (and not RStudio Cloud) you will need to install Bioconductor before installing gwascat
.
# Load Libraries
library(tidyverse)
library(DT)
library(gwascat)
# If you are downloading gwascat to your computer increase the default download time
options(timeout=600)
# Load SNP file and convert genotype and chromosome to factors
<- read_tsv("data/23andMe_raw.tsv", comment = '#', col_names = FALSE) |>
mySNPs setNames(c("rsid", "chromosome", "position", "genotype")) |>
mutate(chromosome = as.factor(chromosome)) |>
mutate(genotype = as.factor(genotype))
<- read_tsv("data/AncestryDNA_raw.tsv", comment = '#') %>%
mySNPs unite(genotype, c("allele1", "allele2"), sep='') %>%
mutate(chromosome = as.factor(chromosome)) %>%
mutate(genotype = as.factor(genotype))
<- read_csv("data/MyHeritage_raw.csv", comment = '#') |>
mySNPs # rename column labels to be consistent with 23andME and AncestryDNA
rename(rsid = RSID) |>
rename(chromosome = CHROMOSOME) |>
rename(position = POSITION) |>
rename(genotype = RESULT) |>
# change to factors
mutate(chromosome = as.factor(chromosome)) |>
mutate(genotype = as.factor(genotype))
<- read_csv("data/FamilyTreeDNA_raw.csv", comment = '#') |>
mySNPs # rename column labels to be consistent with 23andME and AncestryDNA
rename(rsid = RSID) |>
rename(chromosome = CHROMOSOME) |>
rename(position = POSITION) |>
rename(genotype = RESULT) |>
# change to factors
mutate(chromosome = as.factor(chromosome)) |>
mutate(genotype = as.factor(genotype))
gwascat
is an R packackage for working with the NHGRI-EBI Catalog of published genome-wide association studies.
The GWAS catalog is getting very large (~375 MB) and difficult to download with intermittent or slow internet connections. I have downloaded the table and put it in the data
folder. FOR UNITY USERS it is also available in Canvas (Github doesn’t allow files greater than 100 MB).
<- as.data.frame(makeCurrentGwascat()) gwascat
Save table in case of internet issues
write_tsv(gwascat, "data/updated_gwascat2.tsv")
To get the catalog into a R data frame
<- read_tsv("data/updated_gwas_data.tsv") gwascat
gwascat
The below examples and text are from a blog post by Adam Metcalf. I have stripped down the text to make it easier to follow the steps on your computer, but the post is an enjoyable read and provides additional background.
We can also use our rsid to link to NHGRI-EBI Catalog of published genome-wide association studies
To find the date of the most recent study
max(gwascat$DATE.ADDED.TO.CATALOG)
[1] "2024-10-18"
To get the most recent studies
|>
gwascat filter(DATE.ADDED.TO.CATALOG == max(gwascat$DATE.ADDED.TO.CATALOG)) |>
select(STUDY) |>
distinct(STUDY)
# A tibble: 3 × 1
STUDY
<chr>
1 Multivariate genomic analysis of 5***million people elucidates the genetic ar…
2 Imaging genetics of language network functional connectivity reveals links wi…
3 GWAS and polygenic risk score of severe COVID-19 in Eastern Europe.
To find the link on Pubmed to those studies
|>
gwascat filter(DATE.ADDED.TO.CATALOG == max(gwascat$DATE.ADDED.TO.CATALOG)) |>
select(LINK) |>
distinct(LINK)
# A tibble: 3 × 1
LINK
<chr>
1 www.ncbi.nlm.nih.gov/pubmed/39349817
2 www.ncbi.nlm.nih.gov/pubmed/39342056
3 www.ncbi.nlm.nih.gov/pubmed/39364016
To get studies associated with a term
|>
gwascat filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |>
distinct(DISEASE.TRAIT)
# A tibble: 1,150 × 1
DISEASE.TRAIT
<chr>
1 Type 2 diabetes
2 Adult onset asthma or type 2 diabetes
3 Nonatopic asthma or type 2 diabetes
4 Diabetic neuropathy in type 2 diabetes
5 Macrovascular complications in type 2 diabetes
6 Ophthalmic complications in type 2 diabetes
7 Chronic kidney disease x type 2 diabetes interaction
8 Peripheral arterial disease x type 2 diabetes interaction
9 Coronary heart disease x type 2 diabetes interaction
10 Acute ischemic stroke x type 2 diabetes interaction
# ℹ 1,140 more rows
That is a lot of different DISEASE.TRAIT categories associated with diabetes. Now let’s look at Studies with diabetes in the title.
|>
gwascat filter(str_detect(tolower(STUDY), "diabetes")) |>
distinct(STUDY)
# A tibble: 219 × 1
STUDY
<chr>
1 Associations of Observational and Genetically Determined Caffeine Intake Wit…
2 Identification of type 2 diabetes loci in 433,540 East Asian individuals.
3 Novel susceptibility loci identified in a genome-wide association study of t…
4 Discovery of 318 new risk loci for type 2 diabetes and related vascular outc…
5 Phenotypic and Genetic Characterization of Lower LDL-C and Increased Type-2 …
6 Interactions of Habitual Coffee Consumption by Genetic Polymorphisms with th…
7 Genetic Predisposition to Coronary Artery Disease in Type 2 Diabetes Mellitu…
8 A variant within the FTO confers susceptibility to diabetic nephropathy in J…
9 Identification of 28 new susceptibility loci for type 2 diabetes in the Japa…
10 A Genome-Wide Association Study of Diabetic Kidney Disease in Subjects With …
# ℹ 209 more rows
To find studies that include both diabetes and asthma
|>
gwascat filter(str_detect(tolower(DISEASE.TRAIT), "diabetes") & str_detect(tolower(DISEASE.TRAIT), "asthma")) |>
distinct(STUDY)
# A tibble: 1 × 1
STUDY
<chr>
1 Shared Genetic and Experimental Links between Obesity-Related Traits and Asth…
|>
gwascat filter(str_detect(tolower(STUDY), "diabetes") & str_detect(tolower(STUDY), "asthma")) |>
distinct(STUDY)
# A tibble: 1 × 1
STUDY
<chr>
1 Variants in <i>JAZF1</i> are associated with asthma, type 2 diabetes, and hei…
Note that we get different results depending on whether we search the DISEASE.TRAIT
or STUDY
.
Let’s look at how many alleles (SNPS) that are associated with diabetes
|>
gwascat filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |>
distinct(SNPS)
# A tibble: 5,325 × 1
SNPS
<chr>
1 rs10852123
2 rs8026714
3 rs61021634
4 rs79826452
5 rs2240885
6 rs117267808
7 rs1421085
8 rs12600132
9 rs6416749
10 rs2925979
# ℹ 5,315 more rows
The GWAS data lists each SNP (e.g. “rs9302874”) in a field called SNPS. Our imported 23andme data has the same info in the rsid field. Hence we can do a simple join, here using the dplyr library.
<- inner_join(mySNPs, gwascat, by = c("rsid" = "SNPS")) mySNPs_gwas_table
Now that we have joined the genetic test with the gwas cat let’s visit how many SNPS/rsids we have associated with our genetic data
|>
mySNPs_gwas_table filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |>
distinct(rsid)
# A tibble: 1,745 × 1
rsid
<chr>
1 rs11583755
2 rs10399665
3 rs880315
4 rs4845987
5 rs4661310
6 rs4920461
7 rs2236835
8 rs3738126
9 rs10916780
10 rs6672420
# ℹ 1,735 more rows
We have lost over 4,000 alleles that were associate with diabetes-related traits.
Note the consequences of the inner join here. 23andme analyses SNPs that don’t appear in ths GWAS database, and the GWAS database may contain SNPs that 23andme doesn’t provide for you. In either case, these will be removed in the above file result. There’ll just be rows for SNPs that 23andme does provide you, and that do have an entry in the GWAS database.
Also, the GWAS database may have several rows for a single SNP. It could be that several studies examined a single SNP, or that one study found many traits potentially associated with a SNP. This means your final “output_data” table above will have many rows per for some SNPs.
How shall we narrow down this data-mass to find something potentially interesting?
There are many fields in the GWAS database you might care about – the definitions being listed here. DISEASE.TRAIT and STRONGEST.SNP.RISK.ALLELE might be of most interest.
DISEASE.TRAIT gives you a genericish name for the trait that a study investigated whether there was an association with a given SNP (e.g. “Plasma omega-3 polyunsaturated fatty acid levels”, or “Systolic blood pressure”). Note that the values are not actually all “diseases” by the common-sense meaning – unless you consider traits like being tall a type of illness anyway.
STRONGEST.SNP.RISK.ALLELE gives you the specific allele of the SNP that was “most strongly” associated with that trait in the study (or potentially a ? if unknown, but let’s ignore those for now). The format here is to show the name of the SNP first, then append a dash and the allele of interest afterwards e.g. “rs10787517-A” or “rs7977462-C”.
Back to an exploration of your genome – the two most obvious approaches that come to mind are either: 1) check whether your 23andMe results suggest an association with a specific trait you’re interested in, or 2) check which traits your results may be associated with.
In either case, it’ll be useful to create a field that highlights whether your 23andme results indicate that you have the “strongest risk allele” for each study. This is one way to help narrow down towards the interesting traits you may have inherited.
The 23andme part of of your dataframe contains your personal allele results in the genotype field. There you’ll see entries like “AC” or “TT”. What we really want to do here is, for every combination of SNP and study, check to see if either of the letters in your genotype match up with the letter part of the strongest risk allele.
One method would be to separate out your “genotype” data field into two individual allele fields (so “AC” becomes “A” and “C”). Next, clean up the strongest risk allele so you only have the individual allele (so “rs10787517-A” becomes “A”). Finally check whether either or both of your personal alleles match the strongest risk allele. If they do, there might be something of interest here.
The below code will cover this.
<- mySNPs_gwas_table
mySNPs_gwas_table_risk
$risk_allele_clean <- str_sub(mySNPs_gwas_table$`STRONGEST.SNP.RISK.ALLELE`, -1)
mySNPs_gwas_table_risk
$my_allele_1 <- str_sub(mySNPs_gwas_table$genotype, 1, 1)
mySNPs_gwas_table_risk
$my_allele_2 <- str_sub(mySNPs_gwas_table$genotype, 2, 2)
mySNPs_gwas_table_risk
$have_risk_allele_count <- if_else(mySNPs_gwas_table_risk$my_allele_1 == mySNPs_gwas_table_risk$risk_allele_clean, 1, 0) + if_else(mySNPs_gwas_table_risk$my_allele_2 == mySNPs_gwas_table_risk$risk_allele_clean, 1, 0) mySNPs_gwas_table_risk
Now you have your two individual alleles stored in my_allele_1 and my_allele_2, and the allele for the “highest risk” stored in risk_allele_clean. Risk_allele_clean is the letter part of the GWAS STRONGEST.SNP.RISK.ALLELE field. And finally, the have_risk_allele_count is either 0, 1 or 2 depending on whether your 23andme genotype result at that SNP contains 0, 1 or 2 of the risk alleles.
The previously mentioned DISEASE.TRAIT field contains a summary of the trait involved. So by filtering your dataset to only look for studies about a trait you care about, you can see a summary of the risk allele and whether or not you have it, and the relevant studies that elicited that connection.
|>
mySNPs_gwas_table_risk filter(have_risk_allele_count >= 1) |>
filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |>
distinct(rsid)
# A tibble: 754 × 1
rsid
<chr>
1 rs11583755
2 rs10399665
3 rs6672420
4 rs11249215
5 rs3753693
6 rs12116935
7 rs12742756
8 rs17106184
9 rs12140153
10 rs2269247
# ℹ 744 more rows
|>
mySNPs_gwas_table_risk filter(have_risk_allele_count >= 1) |>
filter(str_detect(tolower(DISEASE.TRAIT), "diabetes")) |>
select(c(RISK.ALLELE.FREQUENCY, P.VALUE)) |>
arrange(P.VALUE)
# A tibble: 1,720 × 2
RISK.ALLELE.FREQUENCY P.VALUE
<dbl> <dbl>
1 0.292 1 e-307
2 0.780 2 e-307
3 0.708 3 e-307
4 0.295 6 e-307
5 NA 9 e-307
6 0.56 2 e-296
7 0.56 2 e-287
8 0.56 1 e-271
9 0.56 2 e-265
10 0.632 2 e-245
# ℹ 1,710 more rows
That is a lot of risk alleles (rsids) that have a which have an association with diabetes
For example, let’s assume that by now we also inevitably developed a strong interest in omegas and fatty acids. Which SNPs may relate to that topic, and do we personally have the risk allele for any of them?
We can use the str_detect
function of the stringr
library in order to search for any entries that contain the substring “omega” or “fatty acid”. Then get the associated Studies.
|>
mySNPs_gwas_table_risk select(rsid, DISEASE.TRAIT, STUDY, risk_allele = risk_allele_clean, your_geneotype = genotype) |>
filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) |>
distinct(STUDY)
# A tibble: 36 × 1
STUDY
<chr>
1 Characterising metabolomic signatures of lipid-modifying therapies through d…
2 Role of circulating polyunsaturated fatty acids on cardiovascular diseases r…
3 Genome-wide association study of plasma N6 polyunsaturated fatty acids withi…
4 Genome-Wide Association Study for Serum Omega-3 and Omega-6 Polyunsaturated …
5 Genome-wide characterization of circulating metabolic biomarkers.
6 Association between Human Genetic Variants and the Vaginal Bacteriome of Pre…
7 A genome-wide association study of n-3 and n-6 plasma fatty acids in a Singa…
8 Genetic loci associated with plasma phospholipid n-3 fatty acids: a meta-ana…
9 Genetic Studies of Metabolomics Change After a Liquid Meal Illuminate Novel …
10 Metabolomic Investigation of Major Depressive Disorder Identifies a Potentia…
# ℹ 26 more rows
To find out if a particular allele (rsid) is associated with other traits
|>
mySNPs_gwas_table_risk select(rsid, DISEASE.TRAIT, STUDY) |>
filter(rsid == "rs174537") |>
distinct(DISEASE.TRAIT)
# A tibble: 132 × 1
DISEASE.TRAIT
<chr>
1 Electrocardiographic traits (multivariate)
2 Sex hormone-binding globulin levels adjusted for BMI
3 Sex hormone-binding globulin levels
4 Triglycerides
5 HDL cholesterol
6 Colorectal cancer
7 Serum metabolite ratios in chronic kidney disease
8 Serum metabolite concentrations in chronic kidney disease
9 Phosphatidate(43:4)_[M-H]1- levels
10 Phosphatidylinositol-O(36:0)_[M-H]1-/Phosphatidylglycerol(37:0)_[M+OAc]1- le…
# ℹ 122 more rows
Now onto the second approach – this time, you don’t have a specific trait in mind. You’re more interested in discovering which traits have risk alleles that match the respective part of your genome.
We already have our have_risk_allele_count field. If it’s 1 or 2 then you have some sort of match. So, the full list of your matches and the associated studies could be retrieved in a manner like this.
|>
mySNPs_gwas_table_risk filter(have_risk_allele_count >= 1) |>
select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )
# A tibble: 83,679 × 5
rsid your_genotype strongest_risk_allele DISEASE.TRAIT STUDY
<chr> <fct> <chr> <chr> <chr>
1 rs2272756 AG A Skin reflectance (Melan… A ge…
2 rs1891910 GG G Parental longevity (mot… A Pr…
3 rs4072537 CT C Serum phosphate levels Gene…
4 rs9442380 CC C Urate levels Gene…
5 rs9442385 GG G Clear cell renal cell c… Mult…
6 rs1891905 TT T Serum urate levels Larg…
7 rs3766186 CC C Blood protein levels Co-r…
8 rs78446752 CC C Schizophrenia Mapp…
9 rs4648727 AC A Serum levels of protein… A ge…
10 rs4648727 AC A Hemoglobin A cr…
# ℹ 83,669 more rows
There are various other fields in the GWAS dataset you might consider using to filter down further. For example, you might be most interested in findings from studies that match your ethnicity, or occasions where you have risk alleles that are rare within the population. After all, we all like to think we’re special snowflakes, so if 95% of the general population have the risk allele for a trait, then that may be less interesting than one where you are in the lucky or unlucky 1%.
For the former, you might try searching within the INITIAL.SAMPLE.SIZE or REPLICATION.SAMPLE.SIZE fields, which has entries like: “272 Han Chinese ancestry individuals” or “1,180 European ancestry individuals from ~475 families”.
Similar to the caveats on searching the trait fields, one does need to be careful here if you’re looking for a comprehensive set of results. Some entries in the database have blanks in one of these fields, and others don’t specify ethnicities, having entries like “Up to 984 individuals”.
For the proportion of the studied population who had the risk allele, it’s the RISK.ALLELE.FREQUENCY field. Again, this can sometimes be blank or zero. But in theory, where it has a valid value, then, depending on the study design, you might find that lower frequencies are rarer traits.
We can use dplyr‘s arrange and filter functions to sort do the above sort of narrowing-down. For example: what are the top 10 trait / study / SNP combinations you have the risk allele for that were explicitly studied within European folk, ordered by virtue of them having the lowest population frequencies reported in the study?
|>
mySNPs_gwas_table_risk filter(have_risk_allele_count > 0 & (str_detect(tolower(INITIAL.SAMPLE.SIZE), "european") | str_detect(tolower(REPLICATION.SAMPLE.SIZE), "european")) & (RISK.ALLELE.FREQUENCY > 0 & !is.na(RISK.ALLELE.FREQUENCY))) |>
arrange(RISK.ALLELE.FREQUENCY) |>
select(rsid, your_genotype = genotype, DISEASE.TRAIT, INITIAL.SAMPLE.SIZE, RISK.ALLELE.FREQUENCY)
# A tibble: 50,791 × 5
rsid your_genotype DISEASE.TRAIT INITIAL.SAMPLE.SIZE RISK.ALLELE.FREQUENCY
<chr> <fct> <chr> <chr> <dbl>
1 rs1224… TT Airway respo… Up to 4,108 Europe… 0.0002
2 rs1690… TT Airway respo… Up to 4,108 Europe… 0.0004
3 rs1711… AG Diisocyanate… 74 European ancest… 0.003
4 rs1447… AA LDL choleste… 215,551 European a… 0.0035
5 rs1143… TT Dementia 2,263 African ance… 0.00379
6 rs7661… TT Dementia 2,263 African ance… 0.00519
7 rs1433… AG Genetically … 265,000 European a… 0.00598
8 rs1270… AG Genetically … 265,000 European a… 0.00667
9 rs1888… CC Ticagrelor l… 1,794 European anc… 0.007
10 rs4710… CT IgG glycosyl… 2,247 European anc… 0.00804
# ℹ 50,781 more rows
Or perhaps you’d prefer to prioritize the order of the traits you have the risk allele for, for example, based on the number of entries in the GWAS database for that trait where the highest risk allele is one you have. You might argue that these could be some of the most reliably associated traits, in the sense that they would bias towards those that have so far been studied the most, at least within this database.
<- mySNPs_gwas_table_risk |>
trait_entry_count group_by(DISEASE.TRAIT) |>
filter(have_risk_allele_count >= 1) |>
summarise(count_of_entries = n())
|>
trait_entry_count filter(count_of_entries > 500) |>
ggplot(aes(x = reorder(DISEASE.TRAIT, count_of_entries, sum), y = count_of_entries)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(title = "Example Figure 1 - Traits where the risk alleles have over 500 entries in the GWAS database?", y = "Count of entries", x = "Trait")
Now, as we’re counting combinations of studies and SNPs per trait here, this is obviously going to be somewhat self-fulfilling as some traits have been featured in way more studies than others. Likewise some traits may have been associated with many more SNPs than others. Also, recalling that many interesting traits seem to be related to a complex mix of SNPs, each of which may only have a tiny effect size, it might be that whilst you do have 10 of the risk alleles for condition X, you also don’t have the other 20 risk alleles that we discovered so far have an association (let alone the 100 weren’t even publish on yet and hence aren’t in this data!).
Maybe then we can sort our output in a different way. How about we count the number of distinct SNPs where you have the risk allele, and then express those as a proportion of the count of all the distinct SNPs for the given trait in the database, whether not you have the risk allele? This would let us say things such as, based (solely) on what in this database, you have 60% of the known risk alleles associated with this trait.
One thing noted in the data, both the 23andme genome data and the gwascat highest risk allele have unusual values in the allele fields – things like ?, -, R, D, I and some numbers based on the fact the “uncleaned” STRONGEST.SNP.RISK.ALLELE didn’t have a -A, -C, -G or -T at the end of the SNP it named. Some of these entries may be meaningful – for example the D and I in the 23andme data refer to deletes and insertions, but won’t match up with anything in the gwascat data. Others may be more messy or missing data, for example 23andme reports “–” if no genotype result was provided for a specific SNP call.
In order to avoid these inflating the proportion’s denominator we’ll just filter down so that we only consider entries where our gwascat-derived “risk_allele_clean” and 23andme-derived “my_allele_1” and “my_allele_2″ are all one of the standard A, C, G or T bases.
Let’s also color code the results by the rarity of the SNP variant within the studied population. That might provide some insight to exactly what sort of special exception we are as an individual – although some of the GWAS data is missing that field and basic averaging won’t necessarily give the correct weighting, so this part is extra…”directional”.
You are no doubt getting bored with the sheer repetition of caveats here – but it is so important. Whilst these are refinements of sorts, they are simplistic and flawed and you should not even consider concluding something significant about your personal health without seeking professional advice here. This is fun only. Well, for for those of us who could ever possibly classify data as fun anyway.
Here we go, building it up one piece at a time for clarity of some kind:
# Summarise proportion of SNPs for a given trait where you have a risk allele
<- mySNPs_gwas_table_risk |>
trait_snp_proportion filter(risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) |>
mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) |>
group_by(DISEASE.TRAIT, you_have_risk_allele) |>
summarise(count_of_snps = n_distinct(rsid)) |>
mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) |>
filter(you_have_risk_allele == 1) |>
arrange(desc(proportion_of_snps_for_trait)) |>
ungroup()
<- mySNPs_gwas_table_risk |>
trait_study_count filter(risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) |>
group_by(DISEASE.TRAIT) |>
summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))
# Merge the above together
<- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT") trait_snp_proportion_study_count
# Plot the traits where there were more than 2 studies and you have risk alleles for more than 70% of the SNPs studied
|>
trait_snp_proportion_study_count filter(count_of_studies > 5 & proportion_of_snps_for_trait > 70) |>
ggplot(aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(title = "Example Figure 2. Traits I have more than half of the risk\nalleles studied where > 5 studies involved",
y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency")
Only do these exercises after you have complete the above examples. The questions assume you already have the data objects loaded into R.
How many unique alleles (rsids) are in your genetic test?
How many unique alleles (SNPS) are in the gwas catalog?
How many rsids are in the data set after joining your genetic test with the gwas catalog?
What is the difference between DISEASE.TRAIT and MAPPED_TRAIT when you search for a term (e.g. diabetes)?
Modify the below code to select for Type 2 Diabetes risk alleles (put the trait of interest in lower case). View mySNPs_gwas_table_risk
for proper capitalization and spacing. How many distinct
|>
mySNPs_gwas_table_risk select(rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_geneotype = genotype) |>
filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) |>
distinct()
Repeat Ex5 using a trait of your choice