Reports - quantitative assessment of SVs

svviz2 generates a number of statistics for each event and sample it analyzes. The resulting reports are designed to be parsed and analyzed automatically by downstream applications. A separate report is generated for each event, with filename <event_id>.<coordinates>.report.tsv. The report contains four columns:

  • sample - which sample is being analyzed, corresponding to an input bam file
  • allele - the allele being reported; this is one of alt, ref or amb, or is empty for a few overall descriptors such as genotype
  • key - the name of the statistic
  • value - the value of the statistic
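
As an illustration of downstream parsing, the following is a minimal sketch that groups a report's rows by sample and allele. It assumes the file is tab-separated with exactly four fields per row and a single header line (pass has_header=False if there is none); the function name load_report and the example rows are made up for illustration and are not part of svviz2.

```python
import csv
import io
from collections import defaultdict


def load_report(handle, has_header=True):
    """Return {(sample, allele): {key: value}} from a four-column TSV report.

    Hypothetical helper: the header line is an assumption about the file layout.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed header line
    stats = defaultdict(dict)
    for row in reader:
        if len(row) != 4:
            continue  # ignore blank or malformed lines
        sample, allele, key, value = row
        stats[(sample, allele)][key] = value  # values are kept as strings
    return dict(stats)


# Made-up rows, for illustration only.
example = (
    "sample\tallele\tkey\tvalue\n"
    "sampleA\talt\tcount\t12\n"
    "sampleA\tref\tcount\t3\n"
    "sampleA\t\tGQ_count\t25\n"
)
report = load_report(io.StringIO(example))
print(report[("sampleA", "alt")]["count"])
```

Rows with an empty allele field end up under the key (sample, ""), which keeps the overall descriptors separate from the per-allele statistics.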

A partial description of the different statistics generated follows:

count
The simple count of reads supporting the given allele.
weighted_count
The weighted count, where each read contributes mapq/40.0 to the sum (see weighted mapq). For example, a read supporting the alt allele with mapq 30 would contribute 30/40=0.75 to the weighted count.
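
The arithmetic can be sketched as follows. This only restates the mapq/40.0 rule described above; the function name is hypothetical, and whether svviz2 caps mapq values above 40 is not stated here, so this sketch applies no cap.

```python
def weighted_count(mapqs, scale=40.0):
    """Sum of mapq/scale over the reads supporting one allele (no capping assumed)."""
    return sum(mapq / scale for mapq in mapqs)


single = weighted_count([30])          # one read with mapq 30
several = weighted_count([40, 30, 10])  # 1.0 + 0.75 + 0.25
print(single, several)
```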
GL_count
The simple genotype likelihood.
GQ_count
The simple genotype quality.
GT_count
The genotype inferred from GQ_count.
GL_mapq
The genotype likelihood treating the mapq as a simple probability. For example, a read supporting the alt allele with mapq 30 would be presumed to have derived from the alt allele with probability 99.9%
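
The 99.9% figure matches the standard Phred-scale reading of mapq, where the probability that the mapping is wrong is 10^(-mapq/10). A minimal sketch of that conversion, assuming svviz2 uses this standard interpretation (the function name is hypothetical):

```python
def mapq_to_probability(mapq):
    """Probability that a read is correctly assigned, under the Phred-scale convention."""
    error_probability = 10.0 ** (-mapq / 10.0)
    return 1.0 - error_probability


p30 = mapq_to_probability(30)
print(round(p30, 4))
```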
GQ_mapq
The genotype quality treating mapq as a simple probability.
GT_mapq
The genotype inferred from GQ_mapq.
GL_weighted
The genotype likelihood using the proportional weighted mapq/40 value treated as a probability (see here for more info)
GQ_weighted
The corresponding genotype quality using the proportional weighted mapq
GT_weighted
The genotype inferred from GQ_weighted
<locus>_n_mismatch_low
The number of putative SNPs in the region of the breakpoint; high mismatch values suggest the variant is in a segmental duplication. This is inferred by examining each position within 100 bp upstream or downstream of the breakpoints, as well as the regions between breakpoints, and identifying positions where at least 20% of reads show a specific mismatch to the reference.
<locus>_n_mismatch_high
The number of putative SNPs in the region of the breakpoint; high mismatch values suggest the variant is in a segmental duplication. This is similar to *_n_mismatch_low, except 80% of reads at a given position must show a specific mismatch to the reference to count as a putative SNP.
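
The difference between the low and high thresholds can be sketched as below. The input layout (a read depth plus a dictionary of per-base mismatch counts for each position) and the function name are assumptions made for illustration; svviz2's internal representation may differ.

```python
def count_putative_snps(positions, min_fraction):
    """Count positions where one specific mismatch reaches min_fraction of reads.

    positions: iterable of (depth, {mismatched_base: read_count}) tuples
    (a hypothetical layout chosen for this sketch).
    """
    n = 0
    for depth, mismatch_counts in positions:
        if depth == 0 or not mismatch_counts:
            continue  # nothing to evaluate at this position
        most_common = max(mismatch_counts.values())
        if most_common / depth >= min_fraction:
            n += 1
    return n


# Made-up pileup summary: four positions, ten reads each.
positions = [
    (10, {"T": 9}),          # 90% of reads show the same mismatch
    (10, {"G": 3}),          # 30%
    (10, {"A": 1, "C": 1}),  # 10% each, likely sequencing noise
    (10, {}),                # no mismatches
]
n_mismatch_low = count_putative_snps(positions, 0.2)
n_mismatch_high = count_putative_snps(positions, 0.8)
print(n_mismatch_low, n_mismatch_high)
```

With these made-up values, two positions pass the 20% threshold but only one passes the 80% threshold.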
<locus>_n_indel_low
The number of putative insertions/deletions near the breakpoints. At least 20% of reads must support an indel at a given position for it to be counted.
<locus>_n_indel_high
The number of putative insertions/deletions near the breakpoints. At least 80% of reads must support an indel at a given position for it to be counted.
<locus>_n_total
The number of positions assessed for SNPs and indels.
overlap_<allele>:<coordinate>
The mean number of bases that reads supporting <allele> extend across the breakpoint at <coordinate>. For example, a 50bp read with 46bp to the left of the breakpoint and 4bp to the right of the breakpoint would be considered to extend 4bp (the lesser of 46 and 4). In general, a low number here indicates that most reads don’t extend particularly far into the SV, suggesting either that these reads are mismapped or that the sequence or coordinates of the SV are incorrect.
extension_<allele>:<coord>
The mean number of bases that reads extend to the right of the breakpoint at <coord>. This is similar to the overlap_* statistic above, except that instead of taking the lesser of the left and right values, only the number of bases to the right of the breakpoint is used (4bp in the example above).
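
The relationship between the overlap_* and extension_* statistics can be sketched as follows. Each read is represented here as a (left, right) pair giving the bases on either side of the breakpoint; this representation, the function names and the example reads are assumptions made for illustration.

```python
def mean_overlap(reads):
    """Mean of min(left, right) over reads given as (left, right) base counts."""
    return sum(min(left, right) for left, right in reads) / len(reads)


def mean_extension(reads):
    """Mean number of bases to the right of the breakpoint."""
    return sum(right for _left, right in reads) / len(reads)


# Made-up 50bp reads; the first matches the 46bp/4bp example in the text.
reads = [(46, 4), (25, 25), (10, 40)]
overlap_value = mean_overlap(reads)      # (4 + 25 + 10) / 3
extension_value = mean_extension(reads)  # (4 + 25 + 40) / 3
print(overlap_value, extension_value)
```

The two values agree only for reads with fewer bases to the right of the breakpoint than to the left; the third read contributes 10 to the overlap but 40 to the extension.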
count_<allele>.<coord>_seq
The number of reads whose sequence spans the breakpoint at <coord>.
count_<allele>.<coord>_pair
The number of reads whose mate pairs span the breakpoint at <coord>. Typically, true events should have both *_seq and *_pair counts in paired-end datasets, although complex sequence patterns surrounding the breakpoints may mean this isn’t always true.