New Annotation Sources Available

In recent months we have been updating our public annotation library to include the most recent versions of existing sources and to add new ones. Each of these annotation sources is compatible with our three major products (SVS, GenomeBrowse and VarSeq) and can be used for visualization, annotation and filtering.

NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies 0.0.30, GHI

Annotations are available for both GRCh_37 and GRCh_38 human builds. The current EVS data release (ESP6500SI-V2) is taken from 6503 samples drawn from multiple ESP cohorts and represents all of the ESP exome variant data including both SNPs and Indels.

1kG Phase 3 – Variant Frequencies 5, GHI and 1kG Phase 3 – CNVs and Large Variants 5, GHI

The variant frequency annotation source provides the catalog of single nucleotide variant (SNV) “sites” called by the 1000 Genomes Project for 2504 individuals from the 2013-05-02 sequence and alignment release, mapped to GRCh_37. The CNVs and Large Variants source is the subset containing only variants with a length greater than 200 base pairs.
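As a rough illustration of the 200 base pair cutoff described above, the following sketch selects large variants from a list. It is not how the Golden Helix sources are built; the `Variant` class, the `large_variants` function and the 1-based inclusive coordinate convention are all assumptions made for the example.

```python
from dataclasses import dataclass

# Cutoff taken from the text: only variants longer than 200 bp are kept.
LARGE_VARIANT_MIN_BP = 200


@dataclass(frozen=True)
class Variant:
    """Hypothetical record; coordinates assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Inclusive coordinates, so a single-base variant has length 1.
        return self.end - self.start + 1


def large_variants(variants):
    """Return only the variants strictly longer than the cutoff."""
    return [v for v in variants if v.length > LARGE_VARIANT_MIN_BP]


if __name__ == "__main__":
    calls = [
        Variant("1", 100, 100),  # 1 bp
        Variant("1", 100, 299),  # exactly 200 bp, excluded
        Variant("1", 100, 300),  # 201 bp, kept
    ]
    for v in large_variants(calls):
        print(v.chrom, v.start, v.end, v.length)
```

Note that the comparison is strict, matching "greater than 200 base pairs" in the description.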

ExAC Variant Frequencies 0.3, BROAD and ExAC VEP Annotations 0.3, BROAD

The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set provided spans 61,486 unrelated individuals sequenced as part of various disease-specific and population genetic studies.

The GRCh_37 data from ExAC was split into two sources to better facilitate filtering and annotation. The Variant Frequencies source contains all of the population frequency and count information for each variant, while the VEP Annotations source contains the consequence type as predicted by Variant Effect Predictor (VEP) version 77.

ClinVar 2015-05-04, NCBI and ClinVar CNVs and Large Variants 2015-05-04, NCBI

The ClinVar database from NCBI contains information about the phenotypes and supporting evidence for variants in the dbSNP database. For variants with a length greater than 200 base pairs, we have created a separate interval source that can be used for annotation and filtering. Both sources are available with GRCh_37 and GRCh_38 coordinates.

dbNSFP Functional Predictions and Scores 2.9, GHI and dbNSFP Functional Predictions 2.9, GHI

dbNSFP is an integrated database of functional annotations from multiple sources for the comprehensive collection of human non-synonymous SNPs (NSs). Its current version includes a total of 87,361,054 NSs. To facilitate basic filtering on functional predictions, we have also provided a predictions-only subset source with results from SIFT, Polyphen2 (HVAR), MutationTaster, Mutation Assessor and FATHMM. Both sources are available with NCBI_36, GRCh_37 and GRCh_38 coordinates.

dbNSFP Gene Annotation with Entrez Gene Coordinates and MedGen 2.9, GHI

The dbNSFP gene database focuses on gene annotations. Gene positions were obtained from Entrez Gene. The dbNSFP gene table was merged with the MedGen table containing OMIM and HPO information from NCBI. Data is provided for both GRCh_37 and GRCh_38 coordinates.

dbSNP 142v2, NCBI

Annotations are available for both GRCh_37 and GRCh_38 human builds. These sources display single nucleotide polymorphisms (SNPs) from dbSNP build 142, release 2015-04-20.

RefSeq Genes 105v2, NCBI and RefSeq Genes 107, UCSC

For v2 of the RefSeq 105 release (GRCh_37), Locus Reference Genomic (LRG) identifiers were added for the applicable transcripts. These identifiers were also added to the new 107 release (GRCh_38).

Ensembl Genes 75v2, Ensembl and Ensembl Genes 79, Ensembl

For the human Ensembl releases 75v2 (GRCh_37) and 79 (GRCh_38), LRG identifiers were added for the applicable transcripts. Version 79 was also added for the following non-human genome builds:

  • Drosophila melanogaster, BDGP_6.0
  • Canis familiaris, BROADD_3.1
  • Equus caballus, EquCab_2
  • Felis catus, Felis_catus-6.2
  • Mus musculus, GRCm_38
  • Gallus gallus, ICGSC_4
  • Macaca mulatta, MMUL_1
  • Ovis aries, OAR_3.1
  • Rattus norvegicus, RGSC_5.0
  • Sus scrofa, Sscrofa_10.2
  • Bos taurus, UMD_3.1
  • Caenorhabditis elegans, WBcel235
  • Danio rerio, Zv9


This human GRCh_38 source contains the results of the GENCODE project, a gene set derived from manual curation, computational analysis and targeted experimental approaches.

DGV Variants 2014-10-16, DGV

These sources display variants from the Database of Genomic Variants, including results from studies that use CNV coordinates based on the human builds NCBI_36 (hg18), GRCh_37 (hg19) and GRCh_38 (hg38).

Pfam Domain Genes 2013-07-01, UCSC

This annotation source shows the high-quality, manually-curated Pfam-A domains found in transcripts located in the UCSC Genes source. The sequences from the knownGenePep table (see UCSC Genes description page) are submitted to the set of Pfam-A HMMs which annotate regions within the predicted peptide that are recognizable as Pfam protein domains. These regions are then mapped to the transcripts themselves using the pslMap utility. Data for both GRCh_37 and GRCh_38 human builds was obtained from the UCSC Table Browser.

CpG Islands 2009-03-08, UCSC

CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. They are common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. Data for NCBI_36, GRCh_37 and GRCh_38 human genome builds were obtained from the UCSC Table Browser.
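To make the idea of CpG rarity concrete, here is a minimal sketch that counts CpG dinucleotides in a sequence and compares the count with what base composition alone would predict. The observed/expected ratio used here is a commonly cited measure for CpG islands, but it is not taken from this bulletin, and the function name `cpg_stats` and its return keys are hypothetical.

```python
def cpg_stats(seq: str) -> dict:
    """Summarize CpG content of a DNA sequence (illustrative only)."""
    s = seq.upper()
    n = len(s)
    c_count = s.count("C")
    g_count = s.count("G")
    # "CG" cannot overlap itself, so str.count finds every CpG.
    cpg_count = s.count("CG")

    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently.
    expected = (c_count * g_count) / n if n else 0.0
    obs_exp = cpg_count / expected if expected else 0.0

    return {
        "length": n,
        "cpg_count": cpg_count,
        "gc_fraction": gc_fraction,
        "obs_exp": obs_exp,
    }


if __name__ == "__main__":
    print(cpg_stats("CGCGCG"))
    print(cpg_stats("ATATAT"))
```

A ratio well below 1 reflects the CpG depletion described above, while stretches with a ratio closer to or above 1 are the kind of region a CpG island track highlights.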

Reference Sequence BDGP_6.0, NCBI and Cytobands 2014-12-03, UCSC

The new genome assembly for Drosophila melanogaster (BDGP_6.0/dm6) was added, including the reference sequence as provided by NCBI and the cytoband source from the UCSC Table Browser.

To obtain all of these new and updated sources in SVS or VarSeq, go to Tools > Manage Data Sources and select them for download from the Public Annotations repository. In GenomeBrowse, the Public Annotations repository can be found by going to File > Add to open the Data Source Library. Once downloaded, the annotation sources will be available in your local annotations folder, whose default location (…/Golden Helix/Common Data/Annotations) is shared by all three Golden Helix products.

We are continuing to add new annotation sources, support new species, and update genome builds. If you would like to request that a particular database be converted into an annotation source, or would like to see a particular species or build made available in SVS and GenomeBrowse, please email us and let us know!



