I’ve had the pleasure of having a bioinformatics student from UCI, Ryan Bertwell, working with me this summer. He, along with my colleague Alessio, recently reviewed the paper on the CoNIFER algorithm. I thought it would be good to share their insights with our readers – below is their review of the algorithm.
With the growing use of NGS technologies such as exome sequencing for SNV detection, researchers are looking to see what else they can gain from the data already in hand, and many want to obtain copy numbers from the same sequence reads. Here we review the approach by Niklas Krumm et al. in their paper, Copy number variation detection and genotyping from exome sequence data, which explores the techniques and accuracy of their copy number calculation algorithm for exome sequencing data. They’ve dubbed their algorithm CoNIFER (Copy Number Inference from Exome Reads).
The authors used data from HapMap individuals, ASD (Autism Spectrum Disorders) trios, and the NHLBI Exome Sequencing Project (ESP). Sequencing was performed on the Illumina HiSeq 2000 or Illumina GAII platforms. Results were validated against data from arrayCGH, qPCR, whole-genome shotgun sequencing, and targeted clone sequencing.
Their main technique for determining copy number variation is singular value decomposition (SVD) normalization of exome read-depth data. This technique finds and removes large sources of bias from the exome sets and includes X chromosome normalization so that samples can be assayed independent of sex. For discovery of rare CNVs, the authors found that a large portion of the systematic bias and variance stemmed from the first 10-15 singular values of the decomposition, and that removing these strongest components helped the normalization. These components could be removed because the variation expected from CNVs and CNPs is small, while the variation captured by the first 10-15 components was orders of magnitude higher. For discovery of CNP regions, they removed fewer components (only five) to preserve the real signal from highly copy-number-polymorphic loci.
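The SVD step described above can be sketched as follows. This is a minimal illustration with NumPy, not the CoNIFER implementation: the probes-by-samples matrix layout, the function name `remove_top_components`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def remove_top_components(depth, n_remove):
    """Zero the n_remove largest singular values and rebuild the matrix.

    depth: 2-D array of standardized read depth, probes x samples
           (hypothetical layout chosen for this sketch).
    """
    u, s, vt = np.linalg.svd(depth, full_matrices=False)
    s = s.copy()
    s[:n_remove] = 0.0  # discard the strongest (assumed bias) components
    return (u * s) @ vt  # equivalent to u @ diag(s) @ vt


# Synthetic data: one strong batch effect, weak noise, one small deletion.
rng = np.random.default_rng(0)
n_probes, n_samples = 200, 30
probe_bias = rng.normal(0.0, 5.0, n_probes)
sample_bias = np.linspace(-1.0, 1.0, n_samples)
depth = np.outer(probe_bias, sample_bias)
depth += rng.normal(0.0, 0.1, (n_probes, n_samples))
depth[50:60, 15] -= 1.5  # simulated deletion spanning 10 probes in sample 15

cleaned = remove_top_components(depth, n_remove=1)
print("largest singular value before:", np.linalg.svd(depth, compute_uv=False)[0])
print("largest singular value after: ", np.linalg.svd(cleaned, compute_uv=False)[0])
print("mean signal in deleted region:", cleaned[50:60, 15].mean())
```

In this toy case the batch effect dominates the first component, so removing it leaves the much smaller deletion signal visible. Real exome data will have many overlapping sources of bias, which is why the authors remove 10-15 components rather than one.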
The authors propose that CoNIFER can be used to discover CNVs that might be missed by standard practices. Using a large sample set (366 exomes), they demonstrated that the copy number predictions from CoNIFER were highly accurate (94% overall) and correlated strongly with whole-genome data (average r² = 0.91). These assessments rest on experimentally derived CNV data from methods known to be accurate, including quantitative PCR and whole-genome sequencing.
Considerations when evaluating the CoNIFER algorithm for your own use:
- Samples that have undergone whole-genome amplification perform very poorly, so their use is not recommended.
- It is recommended to sequence exomes to a minimum coverage of 50 million on-target 36mers.
- It is recommended to use an aligner that has multiple mappings enabled.
- A primary feature of CoNIFER is the ability to mix capture reactions, experiments, and sequencing runs, but care should be taken when combining data across significantly different platforms—in these cases, only the common set of probes between platforms should be used in order to avoid false negatives.
- The method depends on concurrent analysis of multiple samples to identify batch-effect biases and then remove them, so it only works with large sample sets.
- There is a fine line between removing too many components of variation, which could remove real biological changes, and removing too few, which will retain batch effects and compromise the algorithm. The authors gave no clear guidance on how many components to discard; this is left to the user. The right number also depends on whether one is looking for rare CNVs or common CNPs. The authors found that removing up to 30 components for rare CNVs did not adversely affect the results.
- CoNIFER is not recommended for cancer samples. The authors state that the algorithm is unsuitable for detection of chromosomal aneuploidy as the algorithm assesses each chromosome separately and large events are likely to be normalized as part of the first few components. This would be an issue as most tumors are aneuploid. The algorithm was not run on cancer samples.
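One informal way to approach the component-count question from the list above is to look at how much of the total signal each singular component carries. The sketch below is an illustration under stated assumptions, not guidance from the paper: the function names, the 1% threshold, and the synthetic three-factor data are all hypothetical.

```python
import numpy as np


def variance_fractions(depth):
    """Fraction of total squared signal carried by each singular component."""
    s = np.linalg.svd(depth, compute_uv=False)  # sorted largest first
    power = s ** 2
    return power / power.sum()


def count_strong_components(depth, threshold):
    """Number of components whose fraction of the signal exceeds threshold."""
    return int(np.sum(variance_fractions(depth) > threshold))


# Synthetic probes x samples matrix: three batch factors plus weak noise.
rng = np.random.default_rng(1)
n_probes, n_samples = 200, 30
depth = rng.normal(0.0, 0.1, (n_probes, n_samples))
for scale in (5.0, 3.0, 2.0):
    probe_loading = rng.normal(0.0, scale, n_probes)
    sample_loading = rng.normal(0.0, 1.0, n_samples)
    depth += np.outer(probe_loading, sample_loading)

fractions = variance_fractions(depth)
print("leading fractions:", np.round(fractions[:5], 4))
print("components above 1%:", count_strong_components(depth, 0.01))
```

A sharp drop in these fractions suggests where the dominant bias ends, but a fixed threshold is only a starting point. Whether a given component is batch effect or biology still has to be judged against the data.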
If you already have exome sequencing data and a large cohort, CoNIFER may be a good way to take advantage of that data and capture copy number variations without much additional expense. If genome-wide detection of CNVs is the main goal, high-density aCGH or SNP arrays are the preferable route at this time; if you can also obtain exome data, CNVs derived from exome sequencing can complement the array data.
CoNIFER software: http://conifer.sourceforge.net/index.html