Exploring Publicly Available NGS Data Sources

         October 31, 2024

Publicly available datasets play a crucial role in research and offer resources for validating and benchmarking workflows. In this post, I would like to point out and briefly discuss several notable, publicly available sources of next-generation sequencing (NGS) data. Each of these sources provides validated datasets that are useful for laboratories and institutions processing NGS samples. The sources covered here are:

  • EPI2ME from Oxford Nanopore
  • PacBio’s highly accurate long-read sequencing
  • 1000 Genomes Project Phase 3 data
  • The ICR96 exon CNV validation series (PMID: 28630945)

EPI2ME is a data analysis platform developed by Oxford Nanopore Technologies. It facilitates real-time analysis of sequencing data generated by Nanopore sequencers. Users can access a range of publicly available datasets, including the Genome in a Bottle samples, a T2T assembly dataset, and a tumor-normal pair dataset, among others. EPI2ME has a cloud-based infrastructure that allows for rapid processing and comparison, making it a good fit for researchers looking to validate their NGS workflows. The platform supports a variety of analysis types, so researchers can run their own pipelines on these reference datasets and compare the results against known outcomes.

Pacific Biosciences (PacBio) offers highly accurate long-read sequencing technologies that produce high-fidelity (HiFi) reads, which are crucial for resolving complex genomic regions. Datasets generated on these platforms are publicly available through the PacBio website and the NCBI Sequence Read Archive (SRA). PacBio datasets are instrumental in validating genome assembly, structural variant detection, and genome annotation methods. These datasets can be used to benchmark algorithms against high-quality reference sequences.

The 1000 Genomes Project represents a landmark effort to catalog human genetic variation. Phase 3 of this project includes sequencing data from diverse populations, making it a vital resource for studying human genetics. The dataset is available through the International Genome Sample Resource (IGSR) and the European Bioinformatics Institute (EBI). It contains millions of variants and is an excellent reference for validating NGS workflows, particularly in studies of population genetics and disease association. Its breadth allows researchers to compare their findings against a wide spectrum of genetic variants.
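One simple way to use a reference call set like this is to check how many of your own variant calls also appear in it. The sketch below illustrates the idea on toy, made-up VCF records held in memory; the function names (`vcf_keys`, `concordance`) and the example variants are assumptions for illustration, not part of any 1000 Genomes tooling. A real comparison would also need both call sets on the same reference build and with normalized variant representation.

```python
def vcf_keys(lines):
    """Return a set of (chrom, pos, ref, alt) keys from VCF-formatted text lines."""
    keys = set()
    for line in lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        # Split multi-allelic records so each ALT allele gets its own key.
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def concordance(called, reference):
    """Summarize the overlap between called variants and a reference set."""
    shared = called & reference
    return {
        "called": len(called),
        "shared": len(shared),
        "novel": len(called - reference),
        "fraction_shared": len(shared) / len(called) if called else 0.0,
    }


# Toy records (hypothetical, not real 1000 Genomes variants).
reference_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\trs1\tA\tG",
    "1\t2000\trs2\tC\tT,G",
    "2\t500\trs3\tG\tA",
]
called_lines = [
    "1\t1000\t.\tA\tG",
    "1\t2000\t.\tC\tG",
    "2\t750\t.\tT\tC",
]

summary = concordance(vcf_keys(called_lines), vcf_keys(reference_lines))
print(summary)
```

A variant missing from the reference set is not necessarily a false positive, so a count like `novel` is best read as a list of calls to review rather than as an error rate.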

The publication “The ICR96 exon CNV validation series: a resource for orthogonal assessment of exon CNV calling in NGS” (PMID: 28630945) describes a validation series of high-quality sequencing data from 96 samples: 66 contain at least one validated exon copy number variant (CNV) and 30 are validated negatives. It is one of the few publicly available datasets suitable for confirming accurate detection of exon-level deletions and duplications, especially single-exon events. The dataset is accessible through the European Genome-phenome Archive (EGA), enabling researchers to validate their own findings against published results.
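With a truth set split into validated positives and validated negatives, a CNV caller can be scored by its sensitivity and specificity. The following is a minimal sketch at the per-sample level, which is a simplification: an exon-level assessment would compare the specific genes and exons called. The sample IDs and the caller results are invented for illustration; only the 66/30 split comes from the description above.

```python
def per_sample_performance(truth, calls):
    """Score a CNV caller per sample.

    truth: dict mapping sample ID -> True if the sample has a validated exon CNV.
    calls: dict mapping sample ID -> True if the caller reported any exon CNV.
    Samples missing from `calls` are treated as having no CNV reported.
    """
    tp = fn = tn = fp = 0
    for sample, has_cnv in truth.items():
        called = calls.get(sample, False)
        if has_cnv and called:
            tp += 1
        elif has_cnv and not called:
            fn += 1
        elif not has_cnv and called:
            fp += 1
        else:
            tn += 1
    return {
        "tp": tp, "fn": fn, "tn": tn, "fp": fp,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical sample IDs: 66 validated positives followed by 30 validated negatives.
samples = [f"S{i:03d}" for i in range(1, 97)]
truth = {s: (i < 66) for i, s in enumerate(samples)}

# Hypothetical caller output: misses two positives and flags one negative.
calls = dict(truth)
calls["S001"] = False
calls["S002"] = False
calls["S096"] = True

perf = per_sample_performance(truth, calls)
print(perf)
```

In this made-up scenario the caller finds 64 of the 66 positive samples and correctly leaves 29 of the 30 negatives uncalled.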

Publicly available NGS datasets are invaluable for researchers and laboratories seeking to validate their methods and results. Using these datasets not only improves the reliability of genomic research but also fosters collaboration and innovation in the field of genomics. As NGS technologies continue to advance, these resources will remain essential for researchers aiming to deepen our understanding of genetics. To learn more about our NGS analysis suite, please email us at [email protected].
