Indelligent v.1.2

© 2008 Dmitry Dmitriev & Roman Rakitov

 

Help

Direct sequencing of a diploid DNA template containing a heterozygous insertion or deletion produces a difficult-to-interpret mixed trace, formed by two allelic traces superimposed on each other with a phase shift. Indelligent uses a dynamic optimization algorithm to output the pair of maximally similar allelic strings that can be superimposed to produce the observed pattern of peaks. When multiple optimal solutions are possible, some of the mixed sites remain unresolved in the output. The method yields accurate reconstructions when (1) the analyzed trace has been formed by highly similar allelic sequences, (2) the indel is small relative to the length of the analyzed fragment, and (3) multiple indels, if present, are well spaced. Failed analyses result in a large number of mismatching and ambiguous sites in the output. The user can adjust the parameters of the analysis iteratively until a satisfactory reconstruction is obtained. The method is still under development and should be used for research only, not in diagnostic procedures.

 

Disclaimer

Indelligent is a free program. Not for use in diagnostic procedures.

 

How to cite the program

Please cite the program as

Dmitriev, D.A. & Rakitov, R.A. 2008. Decoding of superimposed traces produced by direct sequencing of heterozygous indels. PLoS Comput. Biol. 4(7): e1000113. doi:10.1371/journal.pcbi.1000113

Dmitriev, D.A. & Rakitov, R.A. 2008 onwards. Indelligent v.1.2. Http://imperialis.inhs.illinois.edu/dmitriev/indel.asp

 

Contact the authors

Questions and comments on the program are welcome. Please email Dmitry Dmitriev (dmitrievinhs.uiuc.edu).

 

Source code

The source code is available free of charge to non-commercial users.

 

Input

The sequence to be analyzed is input as a text string, which can be typed directly into the input window or copied and pasted from another application. Programs that can call primary and secondary peaks on sequencing chromatograms include Sequencher and PHRED. The peaks are called using the standard IUPAC symbols for mixed bases. Indelligent can analyze sequences containing symbols for double and triple peaks, as well as unknown sites (N). Example: “TKGKKSCMWN”. Alternatively, the sequence can be entered as pairs of superimposed base calls separated by spaces. The order of symbols within a pair is unimportant; for unambiguous peaks, two identical symbols have to be entered. Example: “AC TG GG AC”. Because the run time for scripts on the server is limited, the sequence should not exceed 1,000 bp. It is recommended that the chromatograms be inspected for basecalling errors prior to analysis.
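The two input formats carry the same information for double peaks. As a minimal sketch (not part of Indelligent; the function name `pairs_to_iupac` and the lookup table are illustrative assumptions), the pair format can be converted to IUPAC symbols like this:

```python
# Hypothetical helper for illustration; Indelligent's own parser is not shown here.
# Standard IUPAC codes for single bases and two-base mixtures.
PAIR_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def pairs_to_iupac(text: str) -> str:
    """Convert space-separated pairs of base calls to one IUPAC string."""
    symbols = []
    for pair in text.upper().split():
        if len(pair) != 2:
            raise ValueError(f"expected two base calls, got {pair!r}")
        # A frozenset ignores the order within a pair and collapses "GG" to "G".
        symbols.append(PAIR_CODES[frozenset(pair)])
    return "".join(symbols)


print(pairs_to_iupac("AC TG GG AC"))  # MKGM
```

Using a set as the lookup key reflects the statement that the order of symbols within a pair is unimportant.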

 

Parameters

Maximum phase shift. Indels cause shifts in positional homologies between bases of two allelic strings, referred to as phase shifts. Consider a pair of allelic strings being sequenced simultaneously. A 3 bp insertion in one string will cause a 3 bp phase shift downstream in the trace. An additional 1 bp insertion in the same string downstream of the first one will produce a 4 bp shift further downstream. Alternatively, a 1 bp insertion in the opposite allelic string will produce a 2 bp shift. The parameter is the maximum magnitude of potential phase shifts considered during the analysis. Therefore, a mixed trace resulting from a 12 bp insertion, or from two 6 bp insertions in the same string, cannot be reconstructed when the parameter is set to 10 bp. At the same time, setting the parameter too high can prevent the program from finding correct solutions with small phase shifts. Because the majority of indels are small, it is advisable to run the first analysis with a small value of Max. phase shift, no larger than 1/10 of the fragment length. The value can be progressively increased in subsequent runs until a satisfactory reconstruction is obtained. However, the parameter cannot be set larger than 1/2 of the fragment length. If the current value in the parameter window exceeds 1/2 of the input fragment length, it will be scaled down automatically when the “Submit” button is pressed. The default value is 15 bp.
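The arithmetic in these examples can be sketched in a few lines. This is an illustration only, not Indelligent's code: `phase_shifts` and `effective_max_shift` are hypothetical names, insertions are encoded as signed lengths (positive for one allele, negative for the opposite allele), and the automatic scaling is assumed to cap the value at half the fragment length, rounded down.

```python
from itertools import accumulate


def phase_shifts(insertions):
    """Phase shift magnitude downstream of each successive insertion.

    Each entry is a signed insertion length: positive for an insertion in
    one allele, negative for an insertion in the opposite allele.
    """
    return [abs(total) for total in accumulate(insertions)]


def effective_max_shift(requested: int, fragment_length: int) -> int:
    """Cap the requested maximum at half the fragment length (assumed floor)."""
    return min(requested, fragment_length // 2)


print(phase_shifts([3, 1]))   # [3, 4]  two insertions in the same string
print(phase_shifts([3, -1]))  # [3, 2]  second insertion in the opposite string
print(max(phase_shifts([6, 6])) > 10)  # True: out of reach with a 10 bp limit
print(effective_max_shift(15, 20))     # 10
```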

Shift change penalty. The program detects indels as changes of phase shift along the sequence, including transitions from or to a clean single-peaked trace, which has a phase shift of zero. The shift change penalty is the cost of a transition from one phase shift to another, analogous to the gap opening cost in alignment algorithms. The default value is 2. When the allelic strings are reconstructed with multiple indels separated by short distances, the user can try larger values.

Fix shifts. This advanced option restricts analysis to phase shifts of selected magnitudes. One or several values separated by a space or a comma can be entered in the parameter window. Example: “1, 8”. In some cases when the mixed trace is formed by two indels, the parts upstream and downstream of the second indel are easy to reconstruct when analyzed separately, but analyzing the combined sequence to reconstruct the indel in between is problematic, unless the analysis is restricted to the appropriate phase shifts.

 

Output view options

Align alleles. The output of the program is a pair of reconstructed allelic sequences. When the box is checked, the sequences are output in the aligned form.

Floating indel alignment. When an insertion begins or ends with a base identical to the base following or preceding the insertion, respectively, multiple alignments are possible. The user can choose between aligning such floating indels in the extreme right or the extreme left positions:

Right aligned

ATCAT....TGCC

ATCATCGATTGCC

Left aligned

ATC....ATTGCC

ATCATCGATTGCC
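The two placements above can be derived from each other by sliding the gap one column at a time for as long as the base entering the gap equals the base leaving it. The sketch below illustrates that idea under stated assumptions: `slide_gap` is a hypothetical name, the shorter allele contains a single contiguous run of dots, and the two alleles are identical outside the gap. It is not taken from Indelligent's source.

```python
def slide_gap(gapped: str, full: str, direction: str) -> str:
    """Move a single gap (run of '.') to its extreme left or right position."""
    start = gapped.index(".")
    end = start + gapped.count(".")  # assumes one contiguous gap
    if direction == "left":
        # The gap may move left while the base just before it equals
        # the last base currently opposite the gap.
        while start > 0 and full[start - 1] == full[end - 1]:
            start, end = start - 1, end - 1
    elif direction == "right":
        # Mirror image: compare the first base opposite the gap with
        # the base just after it.
        while end < len(full) and full[start] == full[end]:
            start, end = start + 1, end + 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    # Flanks are assumed identical, so the gapped allele is rebuilt from `full`.
    return full[:start] + "." * (end - start) + full[end:]


print(slide_gap("ATCAT....TGCC", "ATCATCGATTGCC", "left"))   # ATC....ATTGCC
print(slide_gap("ATC....ATTGCC", "ATCATCGATTGCC", "right"))  # ATCAT....TGCC
```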

Display “long indels”. A transition between two phase shifts (except transitions from or to the zero phase shift) can be explained alternatively by a long insertion in one string or a short insertion in the opposite string:

Mixed fragment

TYWSRKKWYWMYMMYMTMYAACKWYGYWKYAYWRYRGTSRWSAW

Solution with “short” insertion

..TCAGGTTACTACCATCTA.CAACGTTGCATTACAGTGGTCAAGAT

TTTCAGGTTACTACCATCTAACTACGTTGCATTACAGTGGTCAA...

Solution with “long” insertion

..TCAGGTTACTACCATCTACAACTACGTTGCATTACAGTGGTCAA...

TTTCAGGTTACTACCATCTA.....ACGTTGCATTACAGTGGTCAAGAT

Depending on the parameters, the program can select the solution with a short insertion as optimal even if it contains mismatches, as in the example above. If the reconstructed allelic strings contain mismatches in the vicinity of an indel (which should always be considered suspicious), repeat the analysis with the “Display long indels” option checked. Analysis of the reverse sequence of the same mixed fragment can be used to verify that the indel has been reconstructed correctly.

Resolve ambiguities. This parameter is currently displayed as an option only for the purpose of testing the program. Use the default setting (the box checked) in all analyses.

 

Simulating indels

This advanced tool allows the user to simulate mixed fragments resulting from one or two indel events. Pairs of identical strings composed of the letters A, C, G, and T, selected randomly with equal probability, are generated and shifted with respect to each other by inserting additional bases into one or both strings. To simulate single nucleotide polymorphisms (SNPs), a specified number of point differences between the strings are introduced at randomly chosen sites. When the “Generate” button is pressed, the strings are generated and their consensus, except the overhanging parts at the beginning and the end, is analyzed. The output summarizes differences between the generated and the reconstructed strings.
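A simplified version of this procedure can be sketched as follows. The sketch makes several assumptions that are not taken from Indelligent: the function name `simulate_mixed_fragment` and its parameters are hypothetical, only one insertion is simulated, it is placed in the second string, and the overhang therefore occurs only at the end.

```python
import random

# Standard IUPAC codes for single bases and two-base mixtures.
PAIR_CODES = {
    frozenset(bases): code
    for bases, code in [
        ("A", "A"), ("C", "C"), ("G", "G"), ("T", "T"), ("AG", "R"),
        ("CT", "Y"), ("CG", "S"), ("AT", "W"), ("GT", "K"), ("AC", "M"),
    ]
}


def simulate_mixed_fragment(length, insertion_length, n_snps, seed=None):
    """Return (upper, lower, mixed, position) for one simulated insertion."""
    rng = random.Random(seed)
    upper = [rng.choice("ACGT") for _ in range(length)]
    lower = list(upper)
    # Introduce point differences (SNPs) at distinct random sites.
    for site in rng.sample(range(length), n_snps):
        lower[site] = rng.choice([b for b in "ACGT" if b != upper[site]])
    # Insert random bases into the lower string at a random position.
    position = rng.randrange(1, length)
    lower[position:position] = [
        rng.choice("ACGT") for _ in range(insertion_length)
    ]
    # zip() stops at the shorter string, which drops the overhanging end.
    mixed = "".join(PAIR_CODES[frozenset(pair)] for pair in zip(upper, lower))
    return "".join(upper), "".join(lower), mixed, position


upper, lower, mixed, position = simulate_mixed_fragment(40, 3, 1, seed=1)
print(mixed)
```

Upstream of `position` the mixed string shows double peaks only at SNP sites; downstream of it, the 3 bp phase shift produces double peaks wherever the shifted bases differ.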

 

Interpreting results: mismatches and ambiguities

The mixed sequence below has been decoded as a pair of allelic strings with one mismatching site. That site represents either a single nucleotide polymorphism (SNP), or an incorrectly called peak on the sequencing chromatogram.

Mixed fragment

CCYWMYKSCMARRAYKGRWTKKWRS

Resolved fragment

.CCTACTGCCAAGAATGGATTGTAGC

CCCTACTGCCAAGACTGGATTGTAG.
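A reconstruction like this one can be checked by superimposing the two resolved strings again and comparing the result with the mixed fragment. The sketch below is illustrative: the function names are hypothetical, the strings are assumed to be fully resolved (A, C, G, T only), and dots are assumed to mark gaps and overhangs, so that removing them recovers each allele as it is read from the start of the trace.

```python
# Standard IUPAC codes for single bases and two-base mixtures.
PAIR_CODES = {
    frozenset(bases): code
    for bases, code in [
        ("A", "A"), ("C", "C"), ("G", "G"), ("T", "T"), ("AG", "R"),
        ("CT", "Y"), ("CG", "S"), ("AT", "W"), ("GT", "K"), ("AC", "M"),
    ]
}


def superimpose(upper: str, lower: str) -> str:
    """Combine two unaligned allelic strings site by site into IUPAC codes."""
    return "".join(PAIR_CODES[frozenset(pair)] for pair in zip(upper, lower))


def check_reconstruction(mixed, upper_aligned, lower_aligned):
    """Return (reproduces_trace, mismatch_count) for an aligned solution."""
    upper = upper_aligned.replace(".", "")
    lower = lower_aligned.replace(".", "")
    reproduces_trace = superimpose(upper, lower) == mixed
    # Mismatches are counted in aligned columns where both alleles have a base.
    mismatches = sum(
        1
        for a, b in zip(upper_aligned, lower_aligned)
        if "." not in (a, b) and a != b
    )
    return reproduces_trace, mismatches


print(check_reconstruction(
    "CCYWMYKSCMARRAYKGRWTKKWRS",
    ".CCTACTGCCAAGAATGGATTGTAGC",
    "CCCTACTGCCAAGACTGGATTGTAG.",
))  # (True, 1)
```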

In the example below, each reconstructed allelic string contains two mixed bases (W):

Mixed fragment

WSWMYMMSWSWCTYTYYYYYKMSAY

Resolved fragment

..ACTACCAGWCWCTTTCCTTCGACAT

TGACTACCWGWCTCTTTCCTTCGAC..

This is so because three different pairs of strings provide equally optimal reconstructions of the analyzed sequence, each containing one mismatching site:

1.

..ACTACCAGTCACTTTCCTTCGACAT

TGACTACCAGTCTCTTTCCTTCGAC..

2.

..ACTACCAGTCTCTTTCCTTCGACAT

TGACTACCAGACTCTTTCCTTCGAC..

3.

..ACTACCAGACTCTTTCCTTCGACAT

TGACTACCTGACTCTTTCCTTCGAC..

When such multiple optimal reconstructions are possible, the upper and the lower strings in the output of Indelligent represent strict consensuses of, respectively, the upper and the lower allelic strings of the individual reconstructions:

 

              Upper strings                  Lower strings

1.            ..ACTACCAGTCACTTTCCTTCGACAT    TGACTACCAGTCTCTTTCCTTCGAC..

2.            ..ACTACCAGTCTCTTTCCTTCGACAT    TGACTACCAGACTCTTTCCTTCGAC..

3.            ..ACTACCAGACTCTTTCCTTCGACAT    TGACTACCTGACTCTTTCCTTCGAC..

Consensus:    ..ACTACCAGWCWCTTTCCTTCGACAT    TGACTACCWGWCTCTTTCCTTCGAC..
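The strict consensus can be reproduced column by column: a site keeps its base when all reconstructions agree and receives the IUPAC symbol for the set of observed bases otherwise. The following sketch is an illustration, not Indelligent's implementation. The name `strict_consensus` is hypothetical, and the reconstructions are assumed to have equal length, contain only A, C, G, T, and dots, and share the same gap positions.

```python
# Standard IUPAC nucleotide codes, plus '.' for gap or overhang columns.
CODES = {
    frozenset(bases): code
    for bases, code in [
        ("A", "A"), ("C", "C"), ("G", "G"), ("T", "T"),
        ("AG", "R"), ("CT", "Y"), ("CG", "S"), ("AT", "W"),
        ("GT", "K"), ("AC", "M"), ("CGT", "B"), ("AGT", "D"),
        ("ACT", "H"), ("ACG", "V"), ("ACGT", "N"), (".", "."),
    ]
}


def strict_consensus(strings):
    """Collapse equally optimal reconstructions of one allele into one string."""
    # zip(*strings) yields one tuple per column across all reconstructions.
    return "".join(CODES[frozenset(column)] for column in zip(*strings))


upper = [
    "..ACTACCAGTCACTTTCCTTCGACAT",
    "..ACTACCAGTCTCTTTCCTTCGACAT",
    "..ACTACCAGACTCTTTCCTTCGACAT",
]
lower = [
    "TGACTACCAGTCTCTTTCCTTCGAC..",
    "TGACTACCAGACTCTTTCCTTCGAC..",
    "TGACTACCTGACTCTTTCCTTCGAC..",
]
print(strict_consensus(upper))  # ..ACTACCAGWCWCTTTCCTTCGACAT
print(strict_consensus(lower))  # TGACTACCWGWCTCTTTCCTTCGAC..
```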

 

Significance of reconstructions

Currently the program lacks a test to estimate the statistical significance of reconstructions. The program will output a two-string reconstruction for any input sequence, even a randomly generated sequence of IUPAC symbols. However, a random input generally will result in a smaller proportion of ambiguous sites decoded. The reported maximum proportions of resolved ambiguous sites expected by chance are estimated from the maximum values observed in experiments in which fragments of variable length, formed by superimposition of randomly generated strings containing equal proportions of A, G, C, and T, were artificially generated (1,000 replicates for each length tested) and processed by Indelligent. If the value reported for a particular reconstruction exceeds the one expected by chance, the comparison supports the hypothesis that the analyzed mixed fragment has indeed resulted from superimposition of allelic sequences containing a heterozygous indel. The opposite may mean that the analyzed sequence is too short to distinguish between random vs. nonrandom nature of the fragment, inadequate parameters of the analysis have been chosen, or that the original trace contains no heterozygous indels. Because the comparison lacks statistical power its results should be interpreted cautiously. Note that the reported percent of resolved ambiguities is calculated based on the number of ambiguous sites remaining in the output "combined" sequence.

 

Troubleshooting

Problem

Solutions to try

Large number of ambiguous and mismatching sites in the output.

1. Increase the value of Maximum Phase Shift.

2. Inspect the chromatogram for basecalling errors.

3. Analyze a larger fragment.

Scattered mismatches or ambiguities. Clusters of mismatches or ambiguities.

1. Inspect the chromatogram for basecalling errors.

A part of the fragment has been reconstructed nicely, but the rest produced numerous mismatches and ambiguities.

1. Analyze a larger fragment.

2. Reduce Maximum Phase Shift.

3. Increase Maximum Phase Shift.

4. Analyze the two parts separately, record phase shifts detected in each case, and reanalyze the entire fragment with the corresponding phase shifts fixed as parameters.

Mismatches or ambiguities near an indel.

1. Reanalyze with the Display “Long” Indels option selected.

2. Analyze the two parts of the sequence separately.

3. Try somewhat increasing or decreasing Shift Change Penalty.

Reconstructed strings contain multiple gaps separated by short distances.

1. Increase Shift Change Penalty.