Scalable Genetic Data Warehouse for VarSeq
As Precision Medicine takes off, the number of samples in a testing lab and the associated data volume are increasing exponentially. To organize the data and build a knowledge base of cases that can be used for future analysis as well as ongoing research, labs need to leverage state-of-the-art warehousing technology. Building on the algorithms and high-performance storage technology powering the VarSeq® software, VSWarehouse is a scalable, multi-project warehouse for NGS variant call sets, clinical reports and catalogs of variant assessments.
Ask VarSeq Warehouse
- Have I ever seen this variant in my previous test samples?
- At what frequency? (counts as well)
- Does this gene contain other rare variants in my cohort? In what samples?
- How many rare functional variants have I seen in this recently published gene?
- Did I provide a pathogenicity assessment for this variant? Has that changed?
- Has ClinVar changed since that assessment was initially made?
- Have I put this variant into a clinical report for any previous samples?
Organize Samples into Projects
Projects as Variant Frequency Annotations
Centralized VSReports Hosting
Variant Assessment Catalogs
Unlike the tables of a traditional relational database, genomic variant call sets do not have "updates" or "deletes" at the row level. In fact, the provenance of what a project contained when it was used as an annotation or analysis source is important to preserve and, ideally, to keep accessible or archived.
Rather than having a costly and mutable single large relational model, VSWarehouse builds on the highly-performant storage technology developed by VarSeq to allow your samples to be organized in as many fully-versioned projects as needed in a fraction of the space.
As new samples are uploaded from VarSeq's integrated VSWarehouse uploader, a background job is queued and run to create a new version of the project. The import job generates a fully merged matrix of unique variants across the previous and new samples, and runs the algorithms and annotations that are configured for the project. This new version of the project can be scheduled to come online as the new default or can be brought online by hand. You can explore the new version before publishing it, and if it is not wanted for whatever reason, it can be deleted without any side effects. This makes it easy to "undo" an erroneous sample upload or project change.
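The versioning idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VSWarehouse implementation: the `ProjectVersion` class, the `import_samples` function and the dictionary-based matrix are all hypothetical names and structures chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


@dataclass(frozen=True)
class ProjectVersion:
    """One snapshot of a project (hypothetical model, for illustration only)."""
    number: int
    # Maps each unique variant to {sample name: genotype string}.
    matrix: Dict[Variant, Dict[str, str]]


def import_samples(previous: Optional[ProjectVersion],
                   new_samples: Dict[str, Dict[Variant, str]]) -> ProjectVersion:
    """Build a new version by merging new samples into a copy of the old matrix."""
    old = previous.matrix if previous else {}
    # Copy the inner dicts so the previous version is never modified.
    matrix = {variant: dict(calls) for variant, calls in old.items()}
    for sample, calls in new_samples.items():
        for variant, genotype in calls.items():
            matrix.setdefault(variant, {})[sample] = genotype
    return ProjectVersion((previous.number + 1) if previous else 1, matrix)


v1 = import_samples(None, {"S1": {("1", 100, "A", "G"): "0/1"}})
v2 = import_samples(v1, {"S2": {("1", 100, "A", "G"): "1/1",
                                ("2", 200, "C", "T"): "0/1"}})

# "Publishing" is only a choice of which version is the default. Dropping v2
# leaves v1 exactly as it was, which is what makes an upload easy to undo.
default_version = v1
```

Because each import produces a new snapshot instead of editing the old one, discarding an unwanted version has no effect on the versions that came before it.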
Projects hosted on VSWarehouse can be used as annotation sources in VarSeq to be integrated into your custom variant annotation and interpretation workflow. This allows any new variant to be annotated and potentially filtered with the frequency of that variant in your warehouse projects.
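As a rough illustration of what a frequency annotation carries, the sketch below counts alternate alleles and carriers for one variant across a small cohort. The function name, the genotype string format (VCF-style `0/1`) and the handling of missing calls are assumptions made for the example; the exact fields VSWarehouse exposes may differ.

```python
def variant_frequency(genotypes):
    """Return (alt allele count, carrier sample count, alt allele frequency).

    `genotypes` maps sample name -> genotype string such as "0/1".
    Missing alleles (".") are left out of both the count and the denominator.
    """
    alt_count = 0
    called_alleles = 0
    carriers = 0
    for gt in genotypes.values():
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        alleles = [a for a in gt.replace("|", "/").split("/") if a != "."]
        called_alleles += len(alleles)
        n_alt = sum(1 for a in alleles if a != "0")
        alt_count += n_alt
        if n_alt:
            carriers += 1
    freq = alt_count / called_alleles if called_alleles else 0.0
    return alt_count, carriers, freq


# Hypothetical cohort of four samples for a single variant.
cohort = {"S1": "0/1", "S2": "1/1", "S3": "0/0", "S4": "./."}
print(variant_frequency(cohort))  # (3, 2, 0.5)
```

A new variant annotated against a project could then be filtered on the frequency, the raw counts, or both.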
The annotations are versioned with the projects, meaning just like our public annotations hosted on the cloud, you can always reproduce your analysis by using the exact same version perpetually or choose when to update to the latest version, which may have more samples.
As an annotation source, stored warehouse variants can be plotted in GenomeBrowse, allowing for the density and other attributes to be shown visually in the genomic context.
VSReports allows for customizable report templates to be completed on a sample-by-sample basis in VarSeq. These sample level decisions and the rendered report are saved at the project level (and exportable as HTML or PDF).
VSWarehouse provides the same user experience within VarSeq, but the reports are hosted, saved and indexed on the VSWarehouse server. All reports can then be queried at the variant or sample level, and the rendered reports hosted on the server are ready for download or integration with other internal systems.
VarSeq strives to provide all the high and low-level details needed for a variant scientist or medical professional to classify or QC variants for a specific sample or presenting phenotype. Our Assessment Catalog feature allows for a flexible way to capture lab-specific flags or classifications of variants outside of the single-project context, so it can be used as an annotation source for future projects.
VSWarehouse acts as the hosting server of these assessment catalogs, providing a web-interface in which to query and manage them.
At a technology level, it is easy for the table sizes involved in storing hundreds or thousands of NGS samples to slow down and ultimately break traditional databases.
When storing every sample-level field, such as the genotype, zygosity, read depth and quality of each sample-variant pair, tables will quickly reach billions and then tens of billions of rows.
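The arithmetic behind that growth is simple multiplication. The sketch below assumes the worst case, where every sample gets one row per unique variant; the first line uses the 17 exomes and 500K variants from the benchmark quoted on this page, while the larger cohort sizes are hypothetical figures chosen only to show the trend.

```python
def sample_variant_rows(n_samples: int, n_unique_variants: int) -> int:
    """Rows needed if every sample-variant pair is stored as its own row."""
    return n_samples * n_unique_variants


# Benchmark-sized project: 17 exomes, 500K unique variants.
print(sample_variant_rows(17, 500_000))         # 8500000

# Hypothetical larger cohorts (assumed numbers, for illustration only).
print(sample_variant_rows(1_000, 2_000_000))    # 2000000000
print(sample_variant_rows(10_000, 5_000_000))   # 50000000000
```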
Similar to Redshift, the enterprise-scale product that Amazon built out for the general-purpose data warehouse market, VSWarehouse is built on the Postgres database technology stack with a completely customized and optimized storage and query-execution layer.
Taking advantage of the matrix structure of genomic data, a very space-efficient columnar and compressed storage engine allows projects computed with VarSeq's mature NGS data wrangling and annotation algorithms to be stored at a fraction of the size of traditional databases while still allowing for the full power and utility of a mature SQL front-end.
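To see why a columnar layout helps, consider a single sample's genotype column: most entries are identical, so storing the field contiguously makes it highly compressible. The sketch below uses simple run-length encoding as a stand-in technique. The actual VSWarehouse storage and compression scheme is not described here, and the record layout and function names are assumptions made for the example.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeated values into [value, run length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(column)]


def run_length_decode(pairs):
    """Inverse of run_length_encode."""
    return [value for value, count in pairs for _ in range(count)]


# Row-oriented records: one dict per variant for a single sample.
rows = [
    {"variant": "1:100:A>G", "genotype": "0/0", "depth": 30},
    {"variant": "1:250:C>T", "genotype": "0/0", "depth": 28},
    {"variant": "2:075:G>A", "genotype": "0/0", "depth": 31},
    {"variant": "2:900:T>C", "genotype": "0/1", "depth": 40},
    {"variant": "3:120:A>T", "genotype": "0/0", "depth": 29},
    {"variant": "3:480:G>C", "genotype": "0/0", "depth": 33},
]

# Column-oriented: each field becomes its own contiguous list.
columns = {field: [row[field] for row in rows] for field in rows[0]}

encoded = run_length_encode(columns["genotype"])
print(encoded)  # [['0/0', 3], ['0/1', 1], ['0/0', 2]]

# The encoding is lossless: decoding gives back the original column.
assert run_length_decode(encoded) == columns["genotype"]
```

A query that touches only the genotype and depth fields also reads just those two columns, instead of scanning every field of every row.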
|Technology||Filter on Gene Effect + Sample Read Depth Time||Storage Size of Tables|
|PostgreSQL 9.4||6300 ms||2.5 GB|
|VSWarehouse||560 ms||150 MB|
VSWarehouse's Query and Space Efficiency on 17 Exomes with 500K Variants
Without losing a single piece of information in the VCFs, VSWarehouse creates a single annotated matrix of all unique variants for all uploaded samples that is accessible through multiple interfaces:
- Web-based interface with easy filter cascades and exports to Excel, text and VCF
- Annotation interface from VarSeq to access a project's aggregate variant counts and frequencies
- VSReports integration with hosted reports that can also be used as an annotation source
- VarSeq assessment catalog that reads and writes from the catalogs hosted on warehouse
- High level REST API to access the models used to construct the web interface
- Low level SQL access for data science tools and advanced integrations
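For a feel of what the low-level SQL access makes possible, here is a minimal sketch of a "gene effect plus sample read depth" filter, similar in spirit to the benchmark query above. It runs against an in-memory SQLite database as a stand-in so the example is self-contained. The table names, column names and data are hypothetical and do not reflect the actual VSWarehouse schema, which you would reach through a Postgres-compatible client.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, gene TEXT, effect TEXT);
CREATE TABLE sample_calls (sample TEXT, variant_id INTEGER,
                           genotype TEXT, read_depth INTEGER);
""")
conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", [
    (1, "BRCA1", "LoF"),
    (2, "BRCA1", "Missense"),
    (3, "TTN", "Other"),
])
conn.executemany("INSERT INTO sample_calls VALUES (?, ?, ?, ?)", [
    ("S1", 1, "0/1", 42),
    ("S2", 1, "0/1", 8),
    ("S1", 2, "1/1", 35),
    ("S2", 3, "0/1", 50),
])

# Keep functional variants whose call in a sample has enough read depth.
query = """
SELECT v.gene, v.effect, c.sample, c.read_depth
FROM variants AS v
JOIN sample_calls AS c USING (variant_id)
WHERE v.effect IN ('LoF', 'Missense') AND c.read_depth >= ?
ORDER BY v.variant_id, c.sample
"""
rows = conn.execute(query, (20,)).fetchall()
for row in rows:
    print(row)
```

The same kind of query can be issued from any data science tool that speaks SQL, once it is pointed at the real tables.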
Case Study - Chaim Jalas, Bonei Olam
“The ability to combine projects and query the data against so many annotation sources, from our own servers, was very impressive”
Webcast - Getting Started with VSWarehouse
Blog Post - Genomic Data is Big Data
We have big data in the field of genomics, yet all that crunching is not the hard part.
Genetic Data Warehousing eBook
Genetic Data Warehousing
by Dr. Andreas Scherer