First-place Abstract Competition Winner, Michael Iacocca, shared his research with the Golden Helix Community during our February webcast ‘Using NGS to detect CNVs in familial hypercholesterolemia’. In this webcast, he gave a great explanation of how our CNV caller aided his team in their research. If you were unable to join us for the event, you can find a recording on our site here.
Iacocca’s webcast generated a lot of questions. His answers help further explain his studies and the capabilities of VS-CNV.
Are you able to control the acceptable percent difference for your references when calling CNVs?
Yes, indeed! As I mentioned, we provide the VS-CNV Caller tool with >100 matched reference controls. The tool itself then selects the 30 controls (you can change this to 40, 50, etc.) with the lowest percent difference in coverage compared to the sample of interest. So, chances are that through this system you will get highly matched references to begin with. However, VS-CNV Caller flags the sample of interest if the average percent difference in your reference set is greater than 20%. You can also change this “acceptable value” of 20% to another value of your choice. So, yes, you do have personalized control over this important quality control step.
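To make the selection step concrete, here is a minimal sketch in Python. It is not the VS-CNV implementation. The function names, the symmetric percent-difference formula, and the toy coverage values are all assumptions for illustration; only the defaults of 30 references and a 20% flag threshold come from the answer above.

```python
from statistics import mean


def percent_difference(sample, reference):
    """Mean percent difference between two coverage profiles.

    Assumption: a symmetric percent difference per target region,
    averaged over regions. The real tool may define this differently.
    """
    diffs = [
        abs(s - r) / ((s + r) / 2) * 100
        for s, r in zip(sample, reference)
        if (s + r) > 0  # skip regions with no coverage in either profile
    ]
    return mean(diffs) if diffs else 0.0


def select_references(sample, candidates, n_refs=30, max_avg_pct_diff=20.0):
    """Pick the n_refs candidates closest to the sample; flag poor matches.

    candidates maps a (hypothetical) control name to its coverage profile.
    Returns (chosen_names, average_percent_difference, flagged).
    """
    if not candidates:
        raise ValueError("at least one reference control is required")
    scored = sorted(
        (percent_difference(sample, cov), name)
        for name, cov in candidates.items()
    )
    chosen = scored[:n_refs]
    avg = mean(score for score, _ in chosen)
    return [name for _, name in chosen], avg, avg > max_avg_pct_diff


# Toy data: three targets, three candidate controls.
sample = [1.0, 1.0, 1.0]
controls = {
    "ctrl_a": [1.0, 1.0, 1.0],
    "ctrl_b": [1.1, 0.9, 1.0],
    "ctrl_c": [2.0, 2.0, 2.0],
}
names, avg_diff, flagged = select_references(sample, controls, n_refs=2)
print(names, round(avg_diff, 2), flagged)
```

Raising `n_refs` pulls in less similar controls, which is why the average percent difference (and so the chance of a flag) tends to go up as the reference set grows.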
Is it possible to use VarSeq to identify CNVs in plants?
CNV calling is based on the normalized coverage data of your reference samples compared to any single sample of interest. VarSeq contains many genome assemblies that include plant-specific assemblies. It is possible to detect CNVs for many plant species in VarSeq and also SNP and Variation Suite (SVS).
Have you tried finding CNVs in a trio exome seq data?
Many users have, and it has generated great results! In general, the more reference samples you have for coverage normalization, the clearer the picture of CNV events you will get, but some users have simply used the Mother and Father as the reference samples when compared against the Proband.
What .BED file do you use?
We use a .BED file that is unique to our LipidSeq panel. A .BED file defines the chromosomal start/stop coordinates of all the genetic material you are sequencing with your specific NGS panel. Thus, of course, everyone’s .BED file will be unique to their exact NGS panel used.
The .bed file (interval track) used for your CNV project can be defined by a specific panel, or you can utilize the many .bed files that are available with VarSeq. Minimally, the .bed file needs chromosome, start, and stop positions. These define the regions over which you will normalize the coverage data.
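As a rough illustration of that minimal three-column layout, the sketch below reads BED-style lines into (chromosome, start, stop) tuples. The function name and the example coordinates are hypothetical, not taken from the LipidSeq panel or from any file shipped with VarSeq. Note that the BED format uses a 0-based start and an exclusive stop.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, stop) tuples.

    Only the first three columns are required; extra columns are ignored.
    BED starts are 0-based and stops are exclusive.
    """
    regions = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        # Skip blanks, comments, and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: need chrom, start, stop")
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Assumption: target regions must be non-empty.
        if start < 0 or stop <= start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{stop}")
        regions.append((chrom, start, stop))
    return regions


# Made-up coordinates, for illustration only.
example = [
    "track name=example_panel",
    "chr19\t1000\t1200\ttarget_1",
    "chr19\t5000\t5350\ttarget_2",
]
for chrom, start, stop in parse_bed(example):
    print(chrom, start, stop, stop - start)
```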
What about tandem duplication of a 1 Mb chromosomal segment?
To address the “tandem” part of your question first: VarSeq CNV analysis essentially just tells us if there is less (deletion) or more (duplication) genomic substrate at a given locus compared to the averaged coverage of the reference controls. In the case of deletions, you can infer that the deleted genomic material has been lost right from the specific genomic coordinates you are visualizing. But with duplications, especially in the case of whole duplicated genes, you cannot be certain that the duplicated genomic material is in tandem. This duplicated material may have been inserted anywhere else in the genome during its genesis, but during sequencing your probes will still bind to that sequence, and extra sequence data will, of course, be generated from it. VarSeq will detect this extra sequencing coverage and indicate a duplication. However, you do not know if it is in tandem or located somewhere else; you just know there is duplicated material present.
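The “less or more coverage than the references” idea can be sketched in a few lines of Python. This is a simplified illustration, not the VS-CNV algorithm: the function names and the 0.75 and 1.25 cutoffs are assumptions chosen for the example. In a diploid region, a heterozygous deletion is expected to give a ratio near 0.5 and a single-copy duplication a ratio near 1.5.

```python
from statistics import mean


def coverage_ratio(sample_cov, reference_covs):
    """Ratio of the sample's coverage to the mean reference coverage
    for one target region (inputs assumed to be already normalized)."""
    ref_mean = mean(reference_covs)
    if ref_mean == 0:
        raise ValueError("reference coverage is zero for this region")
    return sample_cov / ref_mean


def classify(ratio, loss_cutoff=0.75, gain_cutoff=1.25):
    """Label a region from its ratio. Cutoffs are illustrative assumptions."""
    if ratio <= loss_cutoff:
        return "deletion"
    if ratio >= gain_cutoff:
        return "duplication"
    return "neutral"


references = [300, 310, 290]  # toy per-region coverage of matched controls
for sample_cov in (150, 300, 450):
    ratio = coverage_ratio(sample_cov, references)
    print(sample_cov, round(ratio, 2), classify(ratio))
```

A “duplication” label here only says that more reads than expected mapped to the region. As described above, it carries no information about where the extra copy sits in the genome.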
To address the 1 Mb part of your question: With targeted NGS panels, you likely cannot detect affected regions on the scale of megabases unless you sequence specific genes in that size range, or sequence many genes that are located next to each other in the genome. With whole-exome or whole-genome sequencing and subsequent VarSeq-CNV analysis, then yes, you could detect duplications of 1 Mb and greater if many sequential genes are indeed affected.
Have you used these tools with whole genome data?
“We haven’t used these tools on whole genome data, but we have used them with whole exome data; we’ve sequenced a number of whole exomes in our lab and have applied the VarSeq CNV caller to this data. We were able to confirm CNVs that we had detected using our LipidSeq targeted NGS panel, so VS-CNV seems to be working great with whole exome data as well.”
VarSeq has an alternative CNV calling algorithm tailored for shallow whole-genome data, utilizing segmentation with bins defined as small as 10 kb. The normalization concept is also used for this segmentation approach, but across multiple bins spanning the genome rather than targeted regions defined in the .bed file.
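To picture what genome-wide bins look like compared with panel targets, here is a small sketch that tiles a chromosome into fixed-width bins. The function name and the toy chromosome length are hypothetical, and only the 10 kb bin size comes from the paragraph above. It covers the binning step alone, not the normalization or segmentation that follow.

```python
def make_bins(chrom_length, bin_size=10_000):
    """Yield (start, stop) bins covering 0..chrom_length.

    Intervals are 0-based with an exclusive stop, matching BED
    conventions. The final bin is truncated at the chromosome end.
    """
    if chrom_length < 0 or bin_size <= 0:
        raise ValueError("chrom_length must be >= 0 and bin_size > 0")
    for start in range(0, chrom_length, bin_size):
        yield start, min(start + bin_size, chrom_length)


# Toy 25 kb "chromosome": two full bins plus one partial bin.
for start, stop in make_bins(25_000):
    print(start, stop)
```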
What was the average sequencing depth of the matched reference controls?
“Our average was around 300-fold per base. All of these matched reference controls were sequenced on our LipidSeq panel which has an average depth of coverage of 300. So, it doesn’t matter if it’s our reference controls or the samples of interest – they all have that same average depth of coverage.”
Can MLPA detect duplications larger than 1 Mb?
“MLPA detection is based on exons, so it will detect whether each exon is duplicated or deleted. If these exons are in sequential order and all duplicated, then MLPA can detect and ‘put together’ multiple affected exonic regions. So, for example, if the LDLR gene was duplicated then MLPA could detect a duplication of that whole gene; however, the LDLR gene is 18 kb in total, which is far less than 1 Mb. So with MLPA, the maximum limit of detection is whatever the size of the gene is, in this case 18 kb (the entire LDLR gene). MLPA kits are made on a per-gene basis and not for many genes in the human genome, so detecting a duplication on the scale of megabases is not really possible with MLPA.”
Are you able to plot your coverage data from the BAM file as well?
“Yes, you can plot your coverage data as well. There are tons of options, which is great because you can play around with the program and personalize it as much as you want. In my presentation, I show what works for us and what we like to visualize: simple, clean, and showing us what we want and need to see. But there are lots of different options, including BAM coverage.”
If you have any other questions not covered in this blog post, please enter them into the comments below and we’d be happy to answer them for you!