When I got my results I started looking at the CNV VCF file. I found a duplication of the region chr17:44163320-44357656 which encompasses the first 3 exons of the KANSL1 gene. This caught my attention as KANSL1 is in Spark's Autism Gene List and I was diagnosed with Autism a few years ago.

After upgrading to GenomeZ I got a lot more duplications in my file, and this particular duplication roughly tripled in size to chr17:44159016-44792000 and now encompasses four exons of KANSL1. There were also a few other duplications close to this one, all in a region referred to as 17q21.3.
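For anyone who wants to check the arithmetic behind "tripled in size", here is a minimal Python sketch. It assumes the coordinates are 1-based and inclusive (an assumption about how the CNV VCF reports them), and the helper names `parse_region`, `region_length` and `overlap_bp` are made up for illustration.

```python
def parse_region(region):
    """Parse 'chr17:44163320-44357656' (commas allowed) into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, start, end


def region_length(region):
    """Length in bases, assuming 1-based inclusive coordinates."""
    _, start, end = parse_region(region)
    return end - start + 1


def overlap_bp(a, b):
    """Number of bases shared by two regions (0 if on different chromosomes)."""
    chrom_a, start_a, end_a = parse_region(a)
    chrom_b, start_b, end_b = parse_region(b)
    if chrom_a != chrom_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


first_call = "chr17:44163320-44357656"   # duplication in the first results
second_call = "chr17:44159016-44792000"  # duplication after the GenomeZ upgrade

print(region_length(first_call))    # 194337
print(region_length(second_call))   # 632985
print(round(region_length(second_call) / region_length(first_call), 2))  # 3.26
# The first call lies entirely inside the second one:
print(overlap_bp(first_call, second_call) == region_length(first_call))  # True
```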

Has anyone else found any similarly interesting CNVs and how accurate do you think this is? I am upgrading to GenomeH so hopefully that will confirm the duplication and clarify how large it actually is.
The region is flanked by segmental duplications, so I'd check the raw reads (BAM file in IGV software) to confirm. 
I haven't downloaded my BAM file yet; the size is a bit problematic. How does looking at it in IGV help? What specifically would I be looking for?
Do you have the “DNA explorer” option when you log into the Dante Genome Portal? That links directly to IGV (awesome addition to the platform!).
Yes, I have access to that and it is useful for looking at variants in the SNP and INDEL VCF files. However, it doesn't show you anything from the SV and CNV VCF files.

I have found some sites where one can look up a CNV region to see whether it, or any overlapping CNVs, have been recorded anywhere.

It looks like the duplication in my current data is fairly common in the general population. I definitely think I need the GenomeH results to clarify the size of the duplication. 

Here's how you can use the DNA Explorer to look at the CNV raw reads. Your CNV is a large region, so you'll have to look at left and right ends rather than the full region. The reads are 150 bases long with insert size ~300, so we'll want to include ~1,000 bases up- and down-stream of the breakpoint. Note that with CNVs and SVs, often the predicted breakpoint isn't exact. The actual breakpoint may be slightly further up/downstream.

Region: chr17:44,163,320-44,357,656
Left breakpoint (+/- 1,000 bases): chr17:44,162,320-44,164,320
Right breakpoint (+/- 1,000 bases): chr17:44,356,656-44,358,656
  1. Click on "DNA Explorer" in the left menu bar (photo1.png)
  2. In the top bar, to the left of the magnifying glass (photo2.png), paste the left breakpoint window: "chr17:44,162,320-44,164,320"
  3. In the right bar, click on the gear icon. In the pop-up menu, click: "pair orientation", "View as pairs", and "Show soft clips" (photo3.png)
  4. Interpretation (see below)
    • Are ~1/2 of the reads colored (in the same way, in the same general area) up/downstream of the breakpoint? (see Broad's guide)
    • Then, color by "insert size (TLEN)" (in the right bar, click on the gear icon). Are the reads colored then? (see Broad's guide)
    • Are there "rainbow" colors at the end of reads, at the breakpoint?
    • Is the coverage higher inside the CNV region than in the flanking region? (The CNV region is downstream of the Left breakpoint and upstream of the Right breakpoint.) (photo4)
  5. Repeat steps 2-4 for the right breakpoint (chr17:44,356,656-44,358,656)
You can save the IGV images by clicking on "Save SVG" in the top bar. I'm happy to help you interpret directly if you want; I've been studying SVs in WGS for years (PhD, postdoc, and currently).
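If you want to get the two breakpoint windows for a different CNV, here is a rough Python sketch of the coordinate arithmetic. It uses a 1,000-base flank on each side of each reported breakpoint, which reproduces the two windows listed above. The function name `breakpoint_windows` and the `flank` parameter are made up for illustration; they are not part of IGV or the DNA Explorer.

```python
def breakpoint_windows(region, flank=1000):
    """Return (left_window, right_window) strings to paste into a genome viewer.

    `region` looks like 'chr17:44,163,320-44,357,656' (commas optional).
    Each window covers `flank` bases on either side of one reported breakpoint.
    """
    chrom, span = region.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))

    def window(pos):
        lo = max(1, pos - flank)  # never go below the first base of the chromosome
        hi = pos + flank
        return f"{chrom}:{lo:,}-{hi:,}"

    return window(start), window(end)


left, right = breakpoint_windows("chr17:44,163,320-44,357,656")
print(left)   # chr17:44,162,320-44,164,320
print(right)  # chr17:44,356,656-44,358,656
```

Since the predicted breakpoint often isn't exact, you can pass a larger `flank` (say 2,000) if nothing obvious shows up in the default window.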

There are five types of "evidence" for determining whether a CNV (or any other structural variant; SV) is "real":
  1. Discordant pairs, due to either mis-orientation or a larger/smaller insert size (# bases between the 1st and 2nd reads in a read pair) than the expected ~300 bases
  2. Split reads, due to mismatched bases in reads at the breakpoints (these reads "span" the CNV ends)
    • Visualized by the "rainbow" colors at the end of the reads, since mismatched bases are colored (A=green;T=red;G=orange;C=blue)
  3. Coverage (# reads) across the entire region
    • For Duplications ("Insertions") - the CNV region should have more reads than the flanking regions (upstream of the Left breakpoint and downstream of the Right breakpoint). For example, if your data is 30X coverage and it's a duplication with an extra copy on one allele (so you have 3 copies total instead of 2), then the flanking regions should be ~30X and the CNV region should be ~45X (30X × 3/2). Zoom out to visualize this.
    • For Deletions - the region should have fewer reads. If a deletion is on one chromosome (heterozygous), then you'll have ~1/2 the # of reads in the CNV region. If the region is deleted on both chromosomes (homozygous), then there will be no reads in the CNV region.
  4. Unmapped mates for reads up/downstream from the breakpoints, due to those reads having so many mismatches that they don't align well to the reference genome
  5. No mapping errors caused by "noisy" sequence in the region
    • Is the coverage SUPER high (e.g. 100s or 1000s of reads) at/near the breakpoints? That is likely due to repetitive sequences. You can check by looking at those positions in the UCSC Genome Browser:
      1. Go to the UCSC Genome Browser home page
      2. Click on "human" icon on left
      3. Select hg19 (not hg38!) in "Human Assembly" pulldown menu on the right (photo5 red box)
      4. Paste the region in the box (photo5 blue box)
      5. Once the next page loads, scroll down to the bottom of the page and, under the "Repeats" header, select "full" in the "RepeatMasker" and "Segmental Dups" pulldown menus. Then click the "refresh" button.
    • Are there strings of "N" bases at/near the breakpoints? This means there is a "gap" in the genome sequence (i.e. we don't know what the sequence is). Gaps can also be viewed using the UCSC Genome Browser (under the "Mapping and Sequencing" header, select "full" in the "Gap" pulldown menu, then click the "refresh" button).
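The coverage evidence (item 3) comes down to simple proportions: read depth scales with the number of copies you carry relative to the normal two. Here is a minimal Python sketch of that, assuming a diploid (autosomal) region and roughly uniform coverage. The function names `expected_coverage` and `estimate_copies` are hypothetical, not from any CNV caller.

```python
def expected_coverage(baseline, copies, normal_copies=2):
    """Expected read depth in a region carrying `copies` copies,
    given the depth `baseline` seen in normal two-copy flanking sequence."""
    return baseline * copies / normal_copies


def estimate_copies(observed, baseline, normal_copies=2):
    """Rough copy-number estimate from observed depth vs. flanking depth."""
    return round(observed / baseline * normal_copies)


baseline = 30  # e.g. 30X whole-genome data

print(expected_coverage(baseline, 3))  # 45.0 -> duplication on one allele (3 copies)
print(expected_coverage(baseline, 4))  # 60.0 -> 4 copies total
print(expected_coverage(baseline, 1))  # 15.0 -> heterozygous deletion
print(expected_coverage(baseline, 0))  # 0.0  -> homozygous deletion

print(estimate_copies(45, baseline))   # 3
```

Real coverage is noisy, so treat these as ballpark figures: a region hovering around 1.5x the flanking depth is what a one-allele duplication should look like.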
Thanks, that's really interesting. I shall have to spend some more time looking into this.