Abstract
Recent advances in high-throughput sequencing technologies have made it possible to accurately assign copy number (CN) at CN-variable loci. However, current analytic methods often perform poorly in regions in which complex CN variation is observed. Here we report the development of a read depth-based approach, CNVrd2, for investigation of CN variation using high-throughput sequencing data. This methodology was developed using 1000 Genomes Project data from the CCL3L1 locus, and tested using data from the DEFB103A locus. In both cases, samples were selected for which paralog ratio test data were also available for comparison. The CNVrd2 method first uses observed read-count ratios to refine segmentation results in one population. A linear regression model is then applied to adjust the results across multiple populations, in combination with a Bayesian normal mixture model that clusters segmentation scores into groups corresponding to individual
CN counts. The performance of CNVrd2 was compared to that of two other read depth-based methods (CNVnator, cn.mops) at the CCL3L1 and DEFB103A loci. CNVrd2 showed the highest concordance with the paralog ratio test method: 77.8%/90.4% for CNVrd2, 36.7%/4.8% for cn.mops and 7.2%/1% for CNVnator at CCL3L1/DEFB103A, respectively. CNVrd2 is available as an R package as part of the Bioconductor project: http://www.bioconductor.org/packages/release/bioc/html/CNVrd2.html.
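
To illustrate the clustering step described above, the following is a minimal sketch of grouping one-dimensional segmentation scores with a normal mixture and then measuring concordance against reference calls. It is not the CNVrd2 implementation or its API: it fits the mixture by maximum-likelihood EM rather than the Bayesian model used by CNVrd2, the scores are simulated, and all function names (`fit_normal_mixture`, `assign_groups`, `concordance`) are hypothetical. Groups are simply ranked by mean score; mapping ranks to actual CN counts is left out.

```python
import math
import random


def posterior(x, weights, means, sds):
    """Posterior probability of each mixture component for one score."""
    dens = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, sds)
    ]
    total = sum(dens)
    if total == 0.0:  # score far from every component: fall back to uniform
        return [1.0 / len(dens)] * len(dens)
    return [d / total for d in dens]


def fit_normal_mixture(scores, k, n_iter=100):
    """Fit a k-component (k >= 2) 1-D normal mixture by EM."""
    lo, hi = min(scores), max(scores)
    # Start the component means evenly spaced across the observed range.
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    sds = [max((hi - lo) / (2 * k), 1e-3)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each score.
        resp = [posterior(x, weights, means, sds) for x in scores]
        # M-step: re-estimate weight, mean and SD of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(scores)
            means[j] = sum(r[j] * x for r, x in zip(resp, scores)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, scores)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)
    return weights, means, sds


def assign_groups(scores, weights, means, sds):
    """Label each score with the rank (by mean) of its most probable component."""
    order = sorted(range(len(means)), key=lambda j: means[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = []
    for x in scores:
        post = posterior(x, weights, means, sds)
        labels.append(rank[max(range(len(post)), key=lambda j: post[j])])
    return labels


def concordance(calls, reference):
    """Fraction of samples whose call equals the reference call."""
    return sum(a == b for a, b in zip(calls, reference)) / len(reference)


# Simulated data (assumption): three groups with mean scores -1, 0 and 1.
rng = random.Random(0)
truth = [g for g in range(3) for _ in range(50)]
scores = [rng.gauss(g - 1.0, 0.08) for g in truth]

weights, means, sds = fit_normal_mixture(scores, k=3)
labels = assign_groups(scores, weights, means, sds)
print("concordance with simulated truth:", concordance(labels, truth))
```

In this sketch the number of groups `k` is fixed in advance, and well-separated simulated groups make the assignment easy; real segmentation scores at complex loci overlap far more, which is where the multi-population adjustment and Bayesian priors described above would matter.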