Genotype Imputation

Imputation is the process of inferring the genotypes of one or more markers from the correlation patterns of surrounding markers whose genotypes are known. It is often necessary when harmonizing disparate datasets generated on different platforms, or when expanding a smaller genotype array to a larger, more comprehensive SNP set using a reference panel.
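
To make the idea concrete, here is a toy sketch (with invented haplotype data, not any production algorithm) that infers the likely allele at an untyped marker from the allele observed at a correlated, typed neighbor, using haplotype frequencies from a small reference panel:

```python
# Toy sketch: infer an untyped allele from a typed neighbor via
# reference haplotype frequencies. Hypothetical data, for illustration only.
from collections import Counter

# Reference haplotypes over two markers: (typed_allele, untyped_allele).
reference_haplotypes = [
    ("A", "T"), ("A", "T"), ("A", "T"), ("A", "C"),
    ("G", "C"), ("G", "C"), ("G", "T"), ("G", "C"),
]

def impute_untyped(typed_allele):
    """Return P(untyped allele | typed allele) from reference counts."""
    matching = [u for t, u in reference_haplotypes if t == typed_allele]
    counts = Counter(matching)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# A study haplotype carrying "A" at the typed marker most likely
# carries "T" at the untyped one (P = 0.75 in this toy panel).
print(impute_untyped("A"))  # {'T': 0.75, 'C': 0.25}
```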

Virtually every GWAS performed today employs imputation.

Golden Helix offers a robust, cost-effective imputation service built on best practice workflows that have been optimized over the years.

We can work from any commercial GWAS chip as the starting point, and can impute to the reference panel of your choice, whether 1000 Genomes, HapMap, or otherwise. We have imputed from Affymetrix, Illumina and even Perlegen arrays.

Which Method to Use?

Choosing the most appropriate imputation method depends on the qualities most important to our clients. We have experience running imputation using a variety of methods and will make recommendations based on each individual situation. Read the following blog post to learn more about recent testing results of a few common methods: Comparing BEAGLE, IMPUTE2, and Minimac Imputation Methods for Accuracy, Computation Time, and Memory Usage »

Process

The following summarizes a standard imputation process with options for post-imputation analysis. The actual process may change from project to project.

Example Imputation Process

  1. Data Receipt, Preparation
    1. Confirm delivery of all data files, verify "complete kit"
    2. Confirm matching between clinical/phenotype and intensity/genotype data files
  2. Preliminary Quality Control Testing
    1. Sample quality assurance
      1. Assess call rates per individual
      2. Verify sample gender using X heterozygosity (both illustrated in the sample-QC sketch after this list)
      3. Principal Components Analysis (PCA) of SNP genotype data (see the PCA sketch after this list)
        1. Check principal components for agreement with reported ethnicity
        2. Analyze principal components for associations with confounding variables
        3. Remove subjects that appear as PCA outliers if necessary
      4. Perform relatedness tests to confirm expected familial relationships
    2. Marker Quality Assurance
      1. Assess per-marker call rates, minor allele frequencies, and Hardy-Weinberg equilibrium
  3. Imputation Preparation
    1. Collaborate with client to identify reference panel to target for imputation (Use all 1kG or Caucasian only? Impute all SNPs with MAF>1%, or other threshold?)
    2. Select appropriate set of SNPs from reference data to use as basis for imputation (based on MAF, call rate, presence in 1kG reference panel, etc.)
    3. Validate strand match with reference files (see the strand-check sketch after this list)
    4. Prepare input files, dividing them into smaller chunks for computational efficiency (see the chunking sketch after this list)
  4. Imputation
    1. Pre-phase genotype data (optional; can greatly improve overall turnaround time and reduce cost relative to imputing unphased data)
    2. Run BEAGLE imputation
    3. Collate output data chunks into whole chromosomes/whole genome for analysis.
  5. Analysis of Imputed Data
    1. Quality review: Assess genotype probability scores and accuracy metrics for imputed genotypes, comparing them to MAF and other relevant parameters (see the quality-metric sketch after this list).
    2. Genome-wide association testing: Base tests on imputed genotype dosage, using a linear regression framework to associate dosage with specified outcomes while adjusting for specified covariates (see the regression sketch after this list).
    3. Review results. Compare to original GWAS, check imputation quality metrics for any significant results.
  6. Report Results
    1. Prepare written summary of methods and results.
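
For illustration, here is a minimal sample-QC sketch in Python covering steps 2.1.1 and 2.1.2, assuming genotypes are coded 0/1/2 with -1 for missing in a samples-by-markers matrix; the thresholds and data are made up, not fixed recommendations:

```python
# Minimal sketch of per-sample QC (steps 2.1.1 and 2.1.2), assuming
# genotypes coded 0/1/2 (-1 = missing) in a samples x markers matrix.
# Thresholds below are illustrative, not fixed recommendations.
import numpy as np

def sample_call_rates(genotypes):
    """Fraction of non-missing genotypes per sample."""
    return (genotypes != -1).mean(axis=1)

def x_heterozygosity(x_genotypes):
    """Fraction of heterozygous calls per sample on X-chromosome markers.
    Genetic males should show near-zero X heterozygosity."""
    called = x_genotypes != -1
    het = x_genotypes == 1
    return het.sum(axis=1) / np.maximum(called.sum(axis=1), 1)

rng = np.random.default_rng(0)
geno = rng.choice([0, 1, 2, -1], size=(5, 1000), p=[0.4, 0.3, 0.25, 0.05])
x_geno = rng.choice([0, 1, 2, -1], size=(5, 200), p=[0.5, 0.2, 0.25, 0.05])

keep = sample_call_rates(geno) >= 0.95           # flag low-call-rate samples
inferred_male = x_heterozygosity(x_geno) < 0.05  # compare to reported sex
print(keep, inferred_male)
```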
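
Next, a minimal PCA sketch for step 2.1.3, using plain NumPy on a centered genotype matrix; real pipelines would typically LD-prune markers first, and the 6-SD outlier cutoff is illustrative:

```python
# Sketch of PCA on centered genotype data (step 2.1.3) and a simple
# outlier flag; real pipelines typically LD-prune markers first.
import numpy as np

def genotype_pcs(genotypes, n_pcs=2):
    """Top principal components of a samples x markers 0/1/2 matrix."""
    g = genotypes.astype(float)
    g -= g.mean(axis=0)                       # center each marker
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]           # sample scores on top PCs

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(50, 500))
pcs = genotype_pcs(geno)

# Flag samples more than 6 SD from the mean on either PC (illustrative).
z = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)
outliers = np.any(np.abs(z) > 6, axis=1)
print(pcs.shape, outliers.sum())
```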
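
The strand-check sketch for step 3.3 classifies each marker as matching the reference, needing a strand flip, or being strand-ambiguous (A/T and C/G SNPs); the allele encodings are assumptions for illustration:

```python
# Sketch of strand-match validation against a reference panel (step 3.3):
# classify each marker as matching, needing a strand flip, or ambiguous.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def strand_status(study_alleles, ref_alleles):
    """study_alleles / ref_alleles: tuples like ("A", "G")."""
    if set(study_alleles) == {COMPLEMENT[a] for a in study_alleles}:
        return "ambiguous"                 # A/T or C/G SNP: strand unresolvable
    if set(study_alleles) == set(ref_alleles):
        return "match"
    if {COMPLEMENT[a] for a in study_alleles} == set(ref_alleles):
        return "flip"                      # complement study alleles to match
    return "mismatch"                      # exclude from imputation basis

print(strand_status(("A", "G"), ("A", "G")))  # match
print(strand_status(("T", "C"), ("A", "G")))  # flip
print(strand_status(("A", "T"), ("A", "T")))  # ambiguous
```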
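
The chunking sketch for step 3.4 splits a chromosome into overlapping windows so imputation jobs can run in parallel; the window and overlap sizes are arbitrary examples, not recommendations:

```python
# Sketch of dividing a chromosome into overlapping chunks (step 3.4)
# so imputation jobs can run in parallel; sizes are illustrative.
def make_chunks(chrom_length, chunk_size=5_000_000, overlap=250_000):
    """Yield (start, end) base-pair windows covering the chromosome."""
    start = 0
    while start < chrom_length:
        end = min(start + chunk_size, chrom_length)
        yield (start, end)
        if end == chrom_length:
            break
        start = end - overlap           # overlap preserves edge accuracy

for start, end in make_chunks(20_000_000):
    print(start, end)
```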
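
The quality-metric sketch for step 5.1 shows one common accuracy measure: the ratio of observed dosage variance to the 2p(1-p) expected under Hardy-Weinberg equilibrium (the style of metric reported as Rsq by MaCH/minimac); the simulated data is illustrative:

```python
# Sketch of a per-marker imputation quality check (step 5.1): the
# ratio of observed dosage variance to the 2p(1-p) expected under
# Hardy-Weinberg (a MaCH/minimac "Rsq" style metric). Values near 1
# indicate well-imputed markers; low values suggest poor imputation.
import numpy as np

def dosage_rsq(dosages):
    """dosages: array of imputed allele dosages in [0, 2] for one marker."""
    p = dosages.mean() / 2                  # estimated allele frequency
    expected_var = 2 * p * (1 - p)
    if expected_var == 0:
        return 0.0                          # monomorphic marker
    return dosages.var() / expected_var

rng = np.random.default_rng(3)
well_imputed = rng.choice([0.0, 1.0, 2.0], size=1000, p=[0.49, 0.42, 0.09])
poorly_imputed = np.full(1000, well_imputed.mean())  # no variance: Rsq ~ 0
print(round(dosage_rsq(well_imputed), 2), round(dosage_rsq(poorly_imputed), 2))
```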
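
Finally, the regression sketch for step 5.2 regresses a simulated outcome on imputed allele dosage plus covariates using statsmodels OLS; the variable names and effect sizes are invented for illustration:

```python
# Sketch of dosage-based association testing (step 5.2): regress the
# outcome on imputed dosage (expected allele count in [0, 2]) plus
# covariates. Uses statsmodels OLS; simulated data for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
dosage = rng.uniform(0, 2, size=n)            # imputed allele dosage
age = rng.normal(50, 10, size=n)              # example covariate
sex = rng.integers(0, 2, size=n)              # example covariate
y = 0.3 * dosage + 0.02 * age + rng.normal(size=n)

X = sm.add_constant(np.column_stack([dosage, age, sex]))
fit = sm.OLS(y, X).fit()

# Column 1 of X is the dosage term; its p-value is the association test.
print(fit.params[1], fit.pvalues[1])
```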

Infrastructure

Golden Helix has been providing imputation services to academic and commercial organizations for many years and on numerous projects. Our infrastructure has evolved over time to include mature, efficient processes that ensure fast turnaround and cost advantages. Further, to ensure rapid completion of even large cohorts, Golden Helix utilizes a custom, cloud-based system that employs a multitude of servers on a single imputation project while incurring expense only for the actual CPU time used. This brings additional cost and time savings.