Importance of Quality in Association Tests
SVS is a research application platform provided by Golden Helix that enables an array of computational analyses, including genome-wide association studies (GWAS). A GWAS is an observational study that can provide insight into the association of genetic variants with traits and complex disorders. GWAS rely on large cohorts genotyped with single nucleotide polymorphism (SNP) arrays, where the case/control phenotype or quantitative trait of each sample is known. If implemented correctly, GWAS can lead to novel insights into marker-phenotype associations and provide candidate gene targets for clinical and research applications.
In SVS, the ideal setup for GWAS analysis is a spreadsheet that contains sample IDs, phenotypes and SNPs. As an example, the spreadsheet in Figure 1 is a human study with one sample per row and a case/control column (set as the dependent variable), followed by demographics and SNPs. In total, this project contains 478 samples and over 500,000 SNPs, as shown in the top right corner of the image. Before running a GWAS, the next step would be a thorough quality control process to confirm the accuracy of the genotype data. Quality control steps for GWAS will be the focus of this blog series, but for now we will look at what happens when an association test is run without accounting for data quality.
The Genotype Association Tests tool can be accessed from the Genotype menu above the spreadsheet, which becomes available once you click on the dependent variable (the phenotype column highlighted in pink). In this test window, shown in Figure 2, alleles can be classified by allele frequency or by reference/alternate allele status as specified by a marker map field. You can also select which tests to perform based on the genotype model and choose which statistical results to output. Notice, too, the options to correct for population stratification with PCA. Correcting for population stratification is an important part of data quality management. To illustrate why, we will leave this option unselected.
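To make the idea of a per-SNP association test concrete, here is a minimal sketch of a basic allelic chi-square test for a single SNP in a case/control design. It is not SVS's implementation, and it is only one of many possible genotype models. The function name `allelic_chi_square` and the 0/1/2 minor-allele-count coding are assumptions made for this illustration.

```python
import math


def allelic_chi_square(case_genotypes, control_genotypes):
    """Basic allelic chi-square test (1 degree of freedom) for one SNP.

    Genotypes are assumed to be coded as minor-allele counts (0, 1 or 2),
    with None marking a missing call. Returns (chi2, p_value).
    """

    def allele_counts(genotypes):
        called = [g for g in genotypes if g is not None]
        minor = sum(called)
        major = 2 * len(called) - minor
        return minor, major

    a, b = allele_counts(case_genotypes)      # cases: minor, major alleles
    c, d = allele_counts(control_genotypes)   # controls: minor, major alleles
    n = a + b + c + d

    # A zero margin means the 2x2 table is degenerate (e.g. a monomorphic SNP).
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0

    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    cases = [2, 2, 1, 1]
    controls = [0, 0, 1, 1]
    chi2, p = allelic_chi_square(cases, controls)
    print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

Note that a test like this makes no adjustment for ancestry differences between cases and controls, which is exactly the situation in which population stratification can inflate the results.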
Not correcting for population stratification can distort one of the most sought-after outputs in SVS: the Manhattan plot. Manhattan plots are a common result of any GWAS, and SVS houses fantastic plotting capabilities, which you can see in Figure 3. The top plot shows the results without data quality management, while the bottom plot shows the results once data quality is accounted for. Notice that when data quality is left unchecked, many more SNPs appear to be strongly associated with the selected phenotype. That said, we can easily perform some quality assurance to check for this inflation of significance.
One simple method to assess the quality of the association test and data is the Q-Q plot. A Q-Q plot compares the observed p-values against the p-values expected under the null hypothesis of no association, making deviations easy to spot, as shown in Figure 4. Ideally, we would like to see a Q-Q plot like the one on the right, where most values fall along the diagonal. The Q-Q plot on the left, however, comes from the association test run without correcting for stratification, as described earlier in this blog. Its points rise well above the diagonal, so we can conclude that our p-values are inflated and that we need to manage the quality of our data to remove this bias and achieve accurate results.
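The expected-versus-observed comparison behind a Q-Q plot can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how SVS computes its plots: the function names `qq_points` and `genomic_inflation` are hypothetical, the p-values are assumed to come from 1-degree-of-freedom tests, and the genomic inflation factor (often written as lambda) is a commonly used summary that is added here for illustration rather than taken from the figures above.

```python
import math
import statistics
from statistics import NormalDist

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.25) ** 2


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Under the null hypothesis p-values are uniform on (0, 1), so the i-th
    smallest of n p-values is expected to be about (i + 0.5) / n.
    """
    observed = sorted(pvalues)
    n = len(observed)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]


def genomic_inflation(pvalues):
    """Median observed chi-square statistic divided by its null median.

    Each p-value (which must lie in (0, 1]) is converted back to a 1-df
    chi-square statistic. Values near 1 suggest little inflation; values
    well above 1 suggest inflated test statistics.
    """
    chi2 = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in pvalues]
    return statistics.median(chi2) / _CHI2_1DF_MEDIAN


if __name__ == "__main__":
    n = 1001
    well_behaved = [(i + 0.5) / n for i in range(n)]  # idealized null p-values
    inflated = [p ** 2 for p in well_behaved]         # artificially too small
    print("lambda (null):    ", round(genomic_inflation(well_behaved), 3))
    print("lambda (inflated):", round(genomic_inflation(inflated), 3))
```

Plotting the pairs returned by `qq_points` with any charting library gives the familiar picture: points along the diagonal for well-behaved p-values, and points lifting above it when the p-values are inflated.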
Hopefully this provided a helpful overview of GWAS quality considerations in SVS. This blog series will cover managing sample and marker quality, sample relatedness, and population structure with Principal Component Analysis to achieve high-quality association tests. As always, if you have any questions regarding our software or need clarification, please reach out to us at [email protected].
Thanks for reading and Happy New Year!