Interval Bio | Allegro(R) Bioinformatics

Bioinformatics for Allegro®

Introduction

This page describes how to perform analysis on data generated from the Tecan Allegro Targeted Sequencing platform. This information is based on Interval Bio's extensive experience processing Allegro data, and is intended to provide some practical insight to supplement the official documentation.

As with all bioinformatics, there are many tradeoffs to be made in designing a pipeline for your specific circumstances. Interval Bio has partnered with Tecan to help many Allegro customers with data analysis, and we would be happy to help you as well. If you have further questions, please don't hesitate to contact us. We have a broad range of experience with Allegro data, ranging from the more straightforward types of genotyping tasks described here to more sophisticated applications.

Finally, we would note that beyond a certain point, the bioinformatics problems will have been conquered, and the difficulties will be in the areas of scaling and data organization. Managing these types of problems is our primary expertise at Interval Bio, and we would be happy to work with you to address them.

1. Basic Analysis Steps

Allegro is based on next-generation sequencing technologies, and the analysis steps are very similar to those used in whole-genome sequencing, exome sequencing, and so forth. However, there are some distinct differences to be aware of — we will discuss the most important in this writeup. Note that this page describes bioinformatics for Allegro, not its biochemical basis. For more details about how Allegro works in the lab, see the Allegro User Guide.

In the remainder of this note, we will assume that you are starting with FASTQ files from a sequencing provider and wish to call genotypes and/or identify variants in your data. A very high-level outline of a processing pipeline for a single sample is shown in the figure below. On the left, you will find a list of the file formats that are in play at each stage, and on the right you will see a representative selection of tools that are employed. The stage marked [filter] is an optional step that depends on your strategy for variant calling. See Section 4 on this page.

A typical high-level pipeline for Allegro

The merge stage concatenates multiple files related to a sample into a single FASTQ. In the trimming stage, we trim the sequencing adapters, possibly a linking adapter in the case of paired-end sequencing, and possibly the probe sequences on each read. For information on this last item, see Section 4. The tools typically used here include trimmomatic, trimgalore, and cutadapt.
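The merge stage described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular pipeline: the function name `merge_fastq` and the lane file names are hypothetical. It assumes all inputs for a sample are either uncompressed or all gzip-compressed (concatenated gzip members form a valid gzip stream), and that for paired-end data the R1 and R2 files are merged separately and in the same order.

```python
import shutil
import tempfile
from pathlib import Path


def merge_fastq(parts, merged_path):
    """Concatenate one sample's FASTQ files, in the order given, into merged_path."""
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Byte-level copy: records are not parsed or validated here.
                shutil.copyfileobj(src, out)
    return Path(merged_path)


# Small demonstration with two hypothetical lane files for one sample.
workdir = Path(tempfile.mkdtemp())
lane1 = workdir / "sample_L001.fastq"
lane2 = workdir / "sample_L002.fastq"
lane1.write_text("@read1\nACGT\n+\nIIII\n")
lane2.write_text("@read2\nTTGA\n+\nIIII\n")

merged = merge_fastq([lane1, lane2], workdir / "sample.fastq")
merged_text = merged.read_text()
```

Because the copy is byte-for-byte, a truncated or malformed input file passes straight through, so it is worth validating record counts before or after merging.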

Alignment to the reference is performed using a standard aligner such as bwa or bowtie2. In practice, we have found very little difference in effectiveness between these aligners. Allegro can usually be run against a standard reference, but in some use cases additional work is necessary to prepare the reference for alignment. In some cases, the "reference" may be nothing more than a set of small sequences. In other cases, it may be necessary to build a masked reference to produce results that are comparable to genotyping arrays or to achieve good coverage in the presence of excessive homology. The optional [filter] step may be performed depending on the calling strategy, as described in Section 4.

The genotyping/variant calling stage is likely the most variable part of the pipeline. Some callers are better than others for particular purposes. For example, freebayes handles polyploidy, whereas bcftools does not. Some users prefer to stick to GATK. Tuning the parameters for maximum performance in this stage can be time-consuming. All of the standard callers produce genotypes or variants in the VCF file format, which may or may not be appropriate for every application.

In the final export stage, you can convert the VCF format to a variety of comma- or tab-delimited text files, Excel spreadsheets, or other formats. This largely depends on how the data are going to be analyzed downstream, whether they need to be compared to genotyping array data in a concordance, etc.
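As one illustration of the export stage, the sketch below flattens a VCF into a tab-delimited genotype table. It is a simplified example under stated assumptions: the function name `vcf_to_tsv` is hypothetical, the VCF is assumed to contain a FORMAT column with a GT key and at least one sample, and all other annotations are dropped.

```python
import csv
import io


def vcf_to_tsv(vcf_lines, out):
    """Write CHROM, POS, REF, ALT and one genotype column per sample as TSV."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the tenth column of the header line.
            writer.writerow(["CHROM", "POS", "REF", "ALT"] + fields[9:])
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
        writer.writerow([chrom, pos, ref, alt] + genotypes)


# Hypothetical one-sample, one-variant input.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:14\n",
]
buffer = io.StringIO()
vcf_to_tsv(example_vcf, buffer)
table = buffer.getvalue()
```

A table like this is convenient for spreadsheets and concordance checks, but it discards depth and quality fields, so keep the original VCF alongside it.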

One typical final step that is not shown above is to merge sample-wise information into a single file for easy analysis. Tools such as bcftools merge are useful here.

The foregoing is a fairly general description that applies to most next-generation sequencing pipelines. The following sections describe some specifics related to Allegro that may change the workflow or require different parameters from the more standard pipelines.

2. The Structure of Allegro Reads

The figure below shows the layout of several aligned single-ended reads, typical of what you will encounter in processing Allegro data. The data would be the output of the align step in the next-generation sequencing pipeline described above.

Allegro reads targeting position (A). A clean flanking variant is located at position (B). The variants at (C) and (D) require some additional bioinformatics to call accurately.

This simplified diagram shows a series of single-ended reads that are aligned to a reference genome in the neighborhood of a specific target, labeled "A". On either side of the target, there is a yellow probe region. It is typical for each Allegro target to be bracketed by two probes, as shown here. The probe on the left of the target is designed to match a sequence on the positive strand, while the probe on the right matches the reverse strand.

It is important to note that every read shown in the figure above contains probe sequence as the first 40 bases on the 5' end. These probe sequences are distinct from sequencing adapters; they are unique for each probe and are designed to match the reference sequence (or its reverse complement) at positions near a target. The arrangement of the reads may resemble that of paired-end sequencing, but these are single-ended reads, and every read contains a probe sequence. We will discuss the case of paired-end sequencing below.

3. De-duplication

It is common in next-generation sequencing pipelines to perform a de-duplication step to remove redundant reads that arise in PCR. Tecan does not advise performing this step, and their rationale is clear from the figure above: the targeted reads align almost perfectly with their probe sequences, which can lead to many identical reads in the same place. In other words, these reads may look like duplicates, but they are not. Standard de-duplication tools will remove most of your data!

Note that some variant callers, such as platypus, will assume by default that reads in the same position are duplicates. This feature can be disabled in some cases; please consult the documentation for your specific variant caller.

4. Calling Genotypes

Referring to the illustration above, we can describe a couple of issues that you must consider carefully when calling genotypes with Allegro. First, note that there is a variant at position (B) in the diagram. This variant appears to be homozygous, and because it is in a position that does not overlap any probes, it is a "clean" flanking variant: basically, free additional information. Note, however, that based on its position, the number of reads that contain this variant is about half the number of reads that contain the target (A).

The variants at positions (C) and (D) are potentially more problematic. As noted above, the probe sequences are designed to match 40-mers on the reference. As such, bases in the yellow regions in the figure are all reference bases (setting aside minor sequencing, hybridization, and alignment errors). In the example shown in the figure, if you attempt to call a variant at position (C), you will find 3 variant alleles and 11 reference alleles in your pileup. This may mean that the genotype — which appears to be heterozygous — is called as homozygous ref, i.e. no variant at all. Similarly, at position (D), a variant that appears to be homozygous alt may be called as heterozygous.
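The arithmetic behind the reference bias at position (C) can be made explicit. The 3 variant and 11 reference counts come from the example above; the split of reference bases between probe sequence and genuine sample sequence is not stated in the text, so the value of `probe_ref` below is a hypothetical chosen purely for illustration.

```python
def alt_fraction(alt_count, ref_count):
    """Fraction of pileup bases that support the alternate allele."""
    total = alt_count + ref_count
    return alt_count / total if total else 0.0


# Counts at position (C) as given in the text: 3 variant, 11 reference.
observed = alt_fraction(3, 11)

# Hypothetical split, for illustration only: suppose 8 of the 11 reference
# bases come from probe sequence rather than from the sample itself.
probe_ref = 8
corrected = alt_fraction(3, 11 - probe_ref)

print(f"with probe bases:    {observed:.3f}")
print(f"without probe bases: {corrected:.3f}")
```

Under that assumed split, the alternate-allele fraction moves from roughly 0.21 (easily called homozygous reference) to 0.50 (a textbook heterozygote) once the probe-derived bases are excluded.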

Whether or not calling errors like these occur depends on many factors, including read depths, settings in your caller, etc. — but they do happen with real data. The effect, if you are not careful, is a bias toward the reference allele.

The good news is that this problem is relatively simple to address, at least with single-ended reads. There are three basic approaches: (a) hard-trim the probe sequences, (b) soft-clip the probe sequences so that the caller will disregard them, or (c) ignore all information in all reads that fall within the probe ranges. The first approach can be accomplished with a variety of off-the-shelf tools, including trimmomatic, trimgalore, cutadapt, and others: you simply hard-trim the first 40 bases of each read before alignment. As a result, the probe regions shown in yellow in the figure are simply not there to interfere with the work of the caller.
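To show what the hard-trimming approach does to each record, here is a minimal pure-Python sketch for single-ended reads. The helper names `read_fastq` and `trim_probe` are hypothetical, the parser assumes strict four-line FASTQ records, and dropping reads that are no longer than the probe is a design choice made for this example.

```python
import io

PROBE_LENGTH = 40  # probe length stated in the text


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_probe(records, probe_length=PROBE_LENGTH):
    """Remove the first probe_length bases (and qualities) from each read."""
    for header, seq, qual in records:
        if len(seq) <= probe_length:
            continue  # nothing but probe sequence would remain
        yield header, seq[probe_length:], qual[probe_length:]


# Hypothetical 45-base read: 40 probe bases followed by 5 sample bases.
fastq_text = "@read1\n" + "A" * 40 + "CGTAC\n+\n" + "I" * 45 + "\n"
trimmed = list(trim_probe(read_fastq(io.StringIO(fastq_text))))
```

In practice you would use one of the trimmers named above rather than custom code. For example, cutadapt's `--cut 40` option and Trimmomatic's `HEADCROP:40` step perform the same fixed-length removal; check the documentation for the version you are running.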

It is not a good idea to try to enumerate all the probe sequences and feed those as inputs to a trimmer for removal: there are many thousands of distinct sequences, they may change as you iterate your design, and this would make trimming very computationally expensive. It's much faster and easier to simply remove the first 40 bases.

One additional word of caution: some variant callers (such as platypus) will ignore variants near the ends of reads by default, because the bases at the ends of reads typically have a higher error rate than bases in the middle. This filtering may interfere with the detection of incidental findings. Please see the manual for your specific variant caller to determine whether this is a problem. The soft-clipping approach described below may help in this case.

A second approach is to suppress the probe regions by soft clipping. In the soft-clipping approach, the probes are not eliminated from the reads, but they are marked so that downstream variant callers will ignore them. There is some evidence that alignment accuracy can be improved on shorter read lengths by not trimming the probe sequence from the read. This approach also has an advantage in dealing with paired-end reads, as described below. If you're interested in trying this approach, Interval Bio makes a probefilter tool that operates on the post-alignment BAM file.
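To make the soft-clipping idea concrete, the sketch below rewrites an alignment's CIGAR string so that the first 40 read bases are marked as soft-clipped. It illustrates the general technique only and is not the implementation of the probefilter tool. The function names are hypothetical, BAM input and output are left out, and a real tool would also need to handle reads that end up fully clipped and to update dependent fields such as mate positions and the NM/MD tags.

```python
import re

QUERY_OPS = set("MIS=X")  # CIGAR operations that consume read bases
REF_OPS = set("MDN=X")    # CIGAR operations that consume reference bases


def parse_cigar(cigar):
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def format_cigar(ops):
    return "".join(f"{n}{op}" for n, op in ops)


def soft_clip_start(ops, n):
    """Soft-clip the first n read bases of ops; return (new_ops, ref_shift)."""
    ops = list(ops)
    hard = []
    while ops and ops[0][1] == "H":
        hard.append(ops.pop(0))  # keep leading hard clips in place
    clipped = 0    # read bases now covered by the soft clip
    ref_shift = 0  # reference bases no longer covered by the alignment
    while ops and ops[0][1] != "H":
        length, op = ops[0]
        is_match = op in "M=X"
        if clipped >= n and is_match:
            break
        # Split aligned blocks at the clip boundary; absorb or drop anything
        # else (I, S, D, N, P) so the alignment restarts on an aligned base.
        take = min(length, n - clipped) if is_match else length
        if op in QUERY_OPS:
            clipped += take
        if op in REF_OPS:
            ref_shift += take
        if take < length:
            ops[0] = (length - take, op)
        else:
            ops.pop(0)
    clip = [(clipped, "S")] if clipped else []
    return hard + clip + ops, ref_shift


def soft_clip_probe(pos, cigar, is_reverse, probe_length=40):
    """Return (new_pos, new_cigar) with the read's 5' probe soft-clipped."""
    ops = parse_cigar(cigar)
    if is_reverse:
        # The 5' end of a reverse-strand read is at the right of the CIGAR,
        # so the leftmost position does not move.
        new_ops, _ = soft_clip_start(ops[::-1], probe_length)
        return pos, format_cigar(new_ops[::-1])
    new_ops, shift = soft_clip_start(ops, probe_length)
    return pos + shift, format_cigar(new_ops)


forward = soft_clip_probe(1000, "100M", is_reverse=False)
reverse = soft_clip_probe(1000, "100M", is_reverse=True)
gapped = soft_clip_probe(1000, "5S30M2D65M", is_reverse=False)
```

For a forward-strand read the leftmost aligned position moves right by the number of reference bases clipped, while for a reverse-strand read only the tail of the CIGAR changes.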

Finally, the third approach (which we do not recommend) is to ignore all information in the probe regions, regardless of which read they are on. This approach can work in some cases, for example where you only require genotypes on the targets and the targets are sparse enough that they never overlap another probe region, but it most likely prevents the use of nearby flanking variants. If you find yourself considering this option, know that it is almost certainly easier to trim the 40-mer on the 5' end of each read.

5. Paired-end Sequencing

In general, paired-end sequencing is an unambiguous win because it (a) typically costs less than single-ended sequencing, (b) improves mapping accuracy, and (c) gives you potentially even more flanking variants.

However, there are some potential issues to be aware of with paired-end sequencing as well. First, you need to remove a 15-base probe linker from the 3' end of your reverse reads. More details about this step are given in section III.G Data Analysis Guidance in Tecan's User Guide for Allegro. Second, the probe sequences can still interfere with variant calling, and the geometry of the problem is even more complex than in the single-ended case discussed above.

A single read pair from paired-end sequencing. The probe appears on the 5' end of R1. There is no probe sequence on R2, but in this case, R2 extends into the probe sequence on R1. The target is at (A); position (B) is a flanking variant.

The illustration above shows a single read pair, with the probe sequence in yellow at the 5' end of R1. In this example, the insert size is well below its expected value, such that the reverse read (R2) actually encroaches into the probe region on the forward read (R1). As a result, the bases in the region marked "trim" in the figure are probe sequence, i.e. reference bases. They do not represent real biological information about the individual you're sequencing. Overlaps like this are not necessarily a common occurrence in paired-end data, but when they happen, they compromise your ability to use the data in the overlapping area for genotyping or variant calling.

As in the single-ended case, probe sequence must be cleanly removed. Unlike the single-ended case, unfortunately, off-the-shelf trimmers will not help. The specific number of bases that should be trimmed from the 3' end of R2 depends on how that specific read aligns relative to the probe region on R1. Standard tools are not designed to handle this specialized circumstance. We at Interval Bio have created a probefilter tool that does perform these calculations, operating on the aligned BAM file. As noted above, it uses the "soft clipping" approach to ensure that downstream callers do not consider bases in the probe regions.
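The geometry of the calculation can be sketched as a simple interval overlap. This is an illustration of the idea only, not the probefilter implementation: the function names are hypothetical, coordinates are half-open reference intervals, and the sketch assumes gapless alignments so that reference positions map one-to-one onto read bases.

```python
PROBE_LENGTH = 40  # probe length stated in the text


def probe_interval(r1_start, r1_end, r1_is_reverse, probe_length=PROBE_LENGTH):
    """Reference interval [start, end) covered by the probe at the 5' end of R1."""
    if r1_is_reverse:
        return max(r1_end - probe_length, r1_start), r1_end
    return r1_start, min(r1_start + probe_length, r1_end)


def r2_probe_overlap(probe, r2_start, r2_end):
    """Number of reference positions of R2 that fall inside the probe interval."""
    probe_start, probe_end = probe
    return max(0, min(probe_end, r2_end) - max(probe_start, r2_start))


# Hypothetical pair: R1 forward at [1000, 1100), so the probe is [1000, 1040).
probe = probe_interval(1000, 1100, r1_is_reverse=False)

# Short insert: R2 aligns to [1025, 1125) and reaches back into the probe.
short_insert = r2_probe_overlap(probe, 1025, 1125)

# Typical insert: R2 aligns to [1150, 1250) and never touches the probe.
typical_insert = r2_probe_overlap(probe, 1150, 1250)
```

In the short-insert case, 15 bases at the 3' end of R2 are probe-derived and would need to be soft-clipped; in the typical case the count is zero and R2 is left alone.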

The impact of such overlaps may be minor in your case. They occur only on the extreme "small" end of the insert size distribution, so they may be below your noise threshold. If you care only about targets and not flanking variants, then this will not be a problem at all. However, the potential for errors like these has led some users to consider only single-ended reads, to restrict attention to targets only, or to use a soft-clipping approach as discussed above.

6. Interval Bio

Interval Bio incorporates all of the best practices described above into our own processes. If you would prefer to focus more on downstream analysis of your data and less on the mechanics of generating those data, Interval Bio is ready, willing, and able to help. Please contact us.