Miscellaneous Definitions

Within the instance-wide output table produced by ScaleHD, there are many flags and data entities that require explanation. During the development of ScaleHD, we identified a range of characteristics that indicate how believable the data produced from amplicon sequencing are when genotyping with our multiple-reference-library method. These characteristics provide a heuristic for determining whether an attempt at automated genotyping was successful. Here, we define each characteristic and what it means in literal terms. They vary in importance, but all contribute to a picture of data quality. Some are self-explanatory but are defined anyway for completeness.

A further note on SNP calling: either Freebayes or GATK may detect variants in allele contigs that are not relevant to the actual alleles of a given sample. For example, a sample with alleles 17_1_1_7_2 and 23_1_1_10_2 may have variants reported in an unrelated contig, 40_1_1_5_2. SNPs are only reported within InstanceResults.csv if they are found within the appropriate contigs for that sample; other ‘irrelevant’ variant reports are written to IrrelevantVariants.txt in the sample’s specific output folder. An individual SNP is reported in the format “{original base}->{mutated base}: @{base pair position in read}”, e.g. “C->T: @36”.
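
The reporting rule and SNP string format described above can be sketched in Python as follows. This is a minimal illustration and not ScaleHD's own code: the names SnpCall, parse_snp and split_variants are hypothetical, and the accepted base alphabet (A, C, G, T, N) is an assumption.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class SnpCall(NamedTuple):
    """One SNP as written in the report, e.g. "C->T: @36"."""
    original_base: str
    mutated_base: str
    position: int  # base pair position in the read


# Assumption: bases are limited to A, C, G, T or N; whitespace is tolerated.
_SNP_PATTERN = re.compile(r"^\s*([ACGTN])\s*->\s*([ACGTN])\s*:\s*@\s*(\d+)\s*$")


def parse_snp(text: str) -> SnpCall:
    """Parse a "{original base}->{mutated base}: @{position}" string."""
    match = _SNP_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised SNP string: {text!r}")
    return SnpCall(match.group(1), match.group(2), int(match.group(3)))


def split_variants(
    variants: Iterable[Tuple[str, str]],
    sample_alleles: Iterable[str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split (contig, snp_string) pairs into relevant and irrelevant lists.

    A variant counts as relevant only if its contig is one of the
    sample's own alleles; everything else is treated as irrelevant.
    """
    wanted = set(sample_alleles)
    relevant, irrelevant = [], []
    for contig, snp in variants:
        (relevant if contig in wanted else irrelevant).append((contig, snp))
    return relevant, irrelevant
```

With the alleles from the example above, a variant on contig 17_1_1_7_2 would land in the first list (the InstanceResults.csv case) and a variant on 40_1_1_5_2 in the second (the IrrelevantVariants.txt case).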

Update: as of SHD 0.322, only Freebayes is used.

The significance levels are as follows:

  • N/A – This means the flag contains discrete information and does not need to be interpreted with regard to genotyping quality.
  • Dependent – This flag may be significant, depending on other flags. For example, a high level of somatic mosaicism may be an indicator of poor genotyping quality when the CAG repeat tract size is within the non-HD-causing allele size range.
  • Minor – This entity is of minor significance and in the vast majority of samples will not be a determining factor for genotyping quality.
  • Moderate – This entity is of moderate significance. It is unlikely to render a sample’s genotype invalid on its own but may contribute to inaccurate genotyping.
  • Major – This entity is of major significance and is strongly associated with genotyping quality. If any major informative flags are raised, it is recommended to manually inspect the alignment/mapping outputs for that sample.

For maximum genotyping accuracy we recommend manual inspection for all samples for which any major flag was raised and for alleles with >47 CAGs.

*n denotes the number of reads for the modal allele; **“very low reads” is defined as an n value of <=200 reads
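
As a rough illustration, the manual-inspection recommendation and the “very low reads” definition above could be expressed as below. This is a sketch under assumptions: the Significance enum, the function names and the shape of the inputs (a mapping of raised flag names to their significance, plus a list of CAG repeat sizes) are hypothetical and are not taken from ScaleHD itself.

```python
from enum import Enum
from typing import Iterable, Mapping


class Significance(Enum):
    """The five significance levels defined in this section."""
    NA = "N/A"
    DEPENDENT = "Dependent"
    MINOR = "Minor"
    MODERATE = "Moderate"
    MAJOR = "Major"


CAG_INSPECTION_THRESHOLD = 47  # alleles with >47 CAGs warrant inspection
VERY_LOW_READ_LIMIT = 200      # "very low reads" means n <= 200


def is_very_low_reads(modal_read_count: int) -> bool:
    """True when the modal allele's read count n is <= 200."""
    return modal_read_count <= VERY_LOW_READ_LIMIT


def needs_manual_inspection(
    raised_flags: Mapping[str, Significance],
    cag_sizes: Iterable[int],
) -> bool:
    """Apply the recommendation: any major flag, or any allele with >47 CAGs.

    raised_flags maps a (hypothetical) flag name to its significance level
    and should contain only flags that were actually raised for the sample.
    """
    any_major = any(level is Significance.MAJOR for level in raised_flags.values())
    any_large = any(cag > CAG_INSPECTION_THRESHOLD for cag in cag_sizes)
    return any_major or any_large
```

The enum deliberately carries no ordering, since the text does not rank Dependent against the other levels.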

Confidence Calculation

For each allele, ScaleHD calculates a confidence level for the genotyping result it provides. This score draws on a variety of sources and attempts to give an evidence-based picture of the data quality and the resulting genotype confidence. Each allele starts with 100% confidence, and penalties are applied when certain data characteristics are discovered during the genotyping process. The following evidence is used to determine each allele’s confidence level:

  • Whether the First Order Differential peak confirmation stage had to re-run itself with a lower threshold. More re-calls result in a higher penalty.
  • Rare characteristics, such as homozygous haplotypes, or neighbouring/diminished peaks, incur a penalty.
  • Atypical alleles are treated with more caution, and scores are weighted slightly more severely than typical alleles.
  • Simple data aspects such as total read count within a sample/distribution/peak are used.
  • Mapping percentages are taken into account, albeit as a minor factor within this algorithm.
  • “Fatal” errors, such as Differential Confusion, incur a significant penalty.

Any confidence score is capped at 100%. If the data quality in a particular sample is high enough for an allele to be awarded a confidence score higher than 100%, the score is still reported as 100%. Generally, a ‘good’ score is anything over 80%, and we have found that samples scoring over 60% are generally believable. Anything less than this may justify manual inspection.
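
The general shape of this calculation (start at 100%, subtract penalties, cap at 100%, then interpret against the 80% and 60% guides) can be sketched as follows. The penalty sizes, the atypical-allele weighting factor, the floor at 0% and all names here are hypothetical placeholders chosen for illustration; ScaleHD's actual weights are not given in this section, and read-count and mapping-percentage evidence is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class AlleleEvidence:
    """Illustrative subset of the evidence listed above."""
    fod_recalls: int = 0            # re-runs of First Order Differential peak confirmation
    rare_characteristics: int = 0   # e.g. homozygous haplotype, neighbouring/diminished peaks
    atypical: bool = False          # atypical alleles are penalised more severely
    fatal_errors: int = 0           # e.g. Differential Confusion


# Hypothetical weights, NOT ScaleHD's real values.
RECALL_PENALTY = 5.0
RARE_PENALTY = 10.0
FATAL_PENALTY = 40.0
ATYPICAL_WEIGHT = 1.25


def confidence(evidence: AlleleEvidence, bonus: float = 0.0) -> float:
    """Return a confidence percentage, capped at 100.

    bonus stands in for high-quality data that would otherwise push the
    score past 100%; the cap means it is never reported above 100.
    The lower bound of 0 is an assumption of this sketch.
    """
    penalty = (
        evidence.fod_recalls * RECALL_PENALTY
        + evidence.rare_characteristics * RARE_PENALTY
        + evidence.fatal_errors * FATAL_PENALTY
    )
    if evidence.atypical:
        penalty *= ATYPICAL_WEIGHT
    score = 100.0 + bonus - penalty
    return max(0.0, min(100.0, score))


def interpret(score: float) -> str:
    """Map a score onto the rough guides given in the text."""
    if score > 80:
        return "good"
    if score > 60:
        return "believable"
    # The text leaves a score of exactly 60 undefined; this sketch groups it here.
    return "consider manual inspection"
```

Under these placeholder weights, an allele needing two re-calls scores 90% ("good"), while a single fatal error alone brings a score down to 60%, which this sketch marks for possible manual inspection.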