Overview

This page describes the technical steps in iRweb’s bioinformatic pipeline. If you are unfamiliar with data analysis for next-generation sequencing, we suggest you visit our NGS overview page. For a general introduction to iRweb, see our Learning Center article about data analysis. For more detail than this page provides, download the data analysis guide.

Read filtering

After reads are demultiplexed according to the primer system used (arm-PCR or dam-PCR), paired reads (R1 and R2) are trimmed for quality based on average Q scores. The trimmed reads are then overlapped and stitched. If trimming is excessive (i.e., the reads have been trimmed so far that they can no longer overlap), the reads are discarded. After stitching, reads are mapped to the IMGT database; only reads that map to reference sequences and contain canonical CDR3 motifs move forward to our SMART filters.
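
As an illustration of trimming on average Q scores, the sketch below cuts a read at the first window whose mean Phred score falls below a threshold. The function name, the window size of 4, and the threshold of 20 are assumptions chosen for the example; the actual values used by iRweb are not given on this page.

```python
from typing import List, Tuple


def trim_by_average_quality(
    seq: str,
    quals: List[int],
    window: int = 4,          # hypothetical window size
    min_mean_q: float = 20.0,  # hypothetical mean Q-score threshold
) -> Tuple[str, List[int]]:
    """Cut the read at the first window whose mean Q score is too low."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(0, len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_mean_q:
            # Keep everything before the first low-quality window.
            return seq[:start], quals[:start]
    # No low-quality window found: the read is returned unchanged.
    return seq, quals
```

A read trimmed this way may become too short to overlap its mate, which is the case where the pair is discarded.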

We use five software filters to further eliminate PCR and sequencing errors, as well as other noise, among the mapped reads. Our SMART sequence filtering system includes a Sequencing Error Filter, a Mosaic Sequence Filter, an Amplification Filter, a Reference Filter, and a Frequency Threshold Filter.

1) Sequencing Error Filter 

The Sequencing Error Filter uses identical matches at overlapping regions to remove sequencing artifacts. When paired-end sequencing is used, the reads are discarded if the overlapping region of the stitched R1 and R2 is not 100% identical in both directions.
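
The overlap check can be sketched as follows. This is a simplified illustration, not the iRweb implementation: it assumes R2 is reverse-complemented before comparison, looks for the longest exact suffix/prefix overlap, and uses a hypothetical minimum overlap of 10 nucleotides. The names `reverse_complement` and `stitch_exact` are made up for the example.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def stitch_exact(r1: str, r2: str, min_overlap: int = 10) -> Optional[str]:
    """Stitch R1 to reverse-complemented R2 if their overlap is identical.

    Returns the stitched sequence, or None when no exact overlap of at
    least min_overlap bases exists (the pair would be discarded).
    """
    r2_rc = reverse_complement(r2)
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-k:] == r2_rc[:k]:
            return r1 + r2_rc[k:]
    return None
```

With this sketch, a single mismatch anywhere in the overlap makes `stitch_exact` return `None`, which mirrors the all-or-nothing behaviour described above.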

2) Mosaic Sequence Filter 

The Mosaic Sequence Filter detects and removes chimeric sequences generated during PCR.  

3) Amplification Filter

The Amplification Filter removes insertion, deletion, and substitution errors introduced by PCR. It works by finding the major and minor sequence distributions in the variable N region (within CDR3). If a minor sequence occurs less than 5% as frequently as the most frequent clone, the associated reads are removed.
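
The 5% rule can be illustrated with a short sketch. It assumes the caller has already grouped the N-region sequences that should be compared with each other (how iRweb forms these groups is not described here), and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def amplification_filter(
    n_regions: Iterable[str], min_ratio: float = 0.05
) -> Dict[str, int]:
    """Keep sequences seen at least min_ratio times as often as the top clone."""
    counts = Counter(n_regions)
    if not counts:
        return {}
    top = max(counts.values())
    # Minor sequences below 5% of the most frequent clone are dropped.
    return {seq: n for seq, n in counts.items() if n >= min_ratio * top}
```

For example, if the most frequent sequence has 100 reads, a variant with 4 reads is removed while a variant with 6 reads is kept.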

4) Reference Filter 

Reads passing the S filter are then compared to the reference sequences they align to. If the read is not identical to the VDJ reference in GenBank, it is filtered out here.  

5) Frequency Threshold Filter

As a final step, reads are filtered based on their frequency. Reads are collapsed based on identity to generate a frequency for each unique CDR3 and gene combination. If, after multiple rounds of filtering and collapsing, a sequence still occurs only once (a frequency of 1), it is considered noise and removed.
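
A minimal sketch of the collapse-and-threshold step is shown below. Each read is represented as a hypothetical `(cdr3, v_gene, j_gene)` tuple, and the function name is an assumption for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Clonotype = Tuple[str, str, str]  # (cdr3, v_gene, j_gene), assumed layout


def collapse_and_threshold(
    reads: Iterable[Clonotype], min_count: int = 2
) -> Dict[Clonotype, int]:
    """Collapse identical reads and drop those seen fewer than min_count times."""
    counts = Counter(reads)
    # A frequency of 1 is treated as noise, so the default minimum is 2.
    return {clone: n for clone, n in counts.items() if n >= min_count}
```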

SMART filter application

Our SMART filters are applied mainly to TCR sequences. Because TCRs are thought not to undergo somatic hypermutation, we can apply the Reference Filter (removing reads with a mismatch to the reference sequence in the CDR3 region) and collapse sequence reads into one consensus. Collapsing is required to obtain the frequency of each CDR3; without it, we cannot apply the PCR error (Amplification) Filter or the Mosaic Sequence Filter. Therefore, for B cells, whose receptors do undergo somatic hypermutation, we apply only the Sequencing Error Filter (through the overlapping region).

Sequence assignment details 

We use the publicly available sequences from IMGT, the international ImMunoGeneTics information system, to make assignments using our in-house bioinformatics tools. The best-aligning V, D, J, and C segments are assigned to each sequencing read.

We use a modified Smith-Waterman algorithm for local sequence alignment between sequencing reads and the germline reference (human consensus from IMGT). The parameters for the alignment are: match = 1, mismatch = 3, gap_open = 5, gap_extension = 2. The cutoff score is 50 for a V match and 20 for a J match. In addition, the alignment checks for proper conserved motif sequences around the CDR3 region.
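
To make the scoring parameters concrete, here is a sketch of a standard Smith-Waterman local alignment score with affine gaps (the Gotoh recurrence). It is not the modified algorithm used by iRweb, whose changes are not described on this page. It assumes that the mismatch and gap values are penalties subtracted from the score, and that a gap of length k costs gap_open + (k - 1) * gap_extension; the function name is hypothetical.

```python
def local_alignment_score(
    read: str,
    ref: str,
    match: int = 1,
    mismatch: int = 3,
    gap_open: int = 5,
    gap_extension: int = 2,
) -> int:
    """Best local alignment score between read and ref (score only)."""
    neg_inf = float("-inf")
    n = len(ref)
    h_prev = [0] * (n + 1)        # best score ending at previous row
    f_prev = [neg_inf] * (n + 1)  # gap state carried down each column
    best = 0
    for i in range(1, len(read) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        e = neg_inf               # gap state carried along the current row
        for j in range(1, n + 1):
            # Open a new gap or extend an existing one, in each direction.
            e = max(h_cur[j - 1] - gap_open, e - gap_extension)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extension)
            step = match if read[i - 1] == ref[j - 1] else -mismatch
            # Local alignment: scores never drop below zero.
            h_cur[j] = max(0, h_prev[j - 1] + step, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

Under these assumptions, a read needs a score of at least 50 against a V reference and at least 20 against a J reference to count as a match, so a single mismatch costs as much as four matching bases gain.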

It can be difficult to detect D genes by sequence alignment. D genes are utilized in the recombination that leads to the heavy chain. Quoting Roitt, Brostoff and Male, “The D segment is highly variable both in the number of codons and in the sequence of base pairs…More than one D segment may join to form an enlarged D region.” There are mechanistic constraints that preclude the use of D genes that are 3′ of the selected J gene. There are also possible insertions and deletions and other noise in the region where V and J join. When the available nucleotides in the sample are insufficient for our pipeline to be able to distinguish amongst D-genes or even make a call for a given D-gene, an asterisk is used and the gene isn’t called. 

Please note that the isotype can be called accurately, but the subisotype may have more than one reasonable call. This is because the software uses a best-case-last alignment: if the aligned genes are very similar, the read is assigned to the last one called.
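
One way to read "best-case-last" is that a later reference with an equal score replaces an earlier one. The sketch below shows that tie-breaking behaviour; the function name and the scores in the usage note are hypothetical, and the actual selection logic in the pipeline may differ.

```python
from typing import Iterable, Optional, Tuple


def best_case_last(scores: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Return the best-scoring gene; on a tie, the last one evaluated wins."""
    best_name: Optional[str] = None
    best_score = float("-inf")
    for name, score in scores:
        # ">=" (rather than ">") lets a later tie overwrite an earlier call.
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

For instance, if IGHG1 and IGHG2 both score 40 and IGHG2 is evaluated second, the call is IGHG2, even though IGHG1 is an equally reasonable assignment.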

Analyzing your data 

While we cannot make the primer sequences available, you will have access to the full nucleotide sequence which should give you the information that you need in lieu of the primer sequence.  

It is important to note that our primer sequences begin within the first framework region, and therefore, the first 20-30 nucleotides are missing. We can usually infer this sequence using the IMGT reference. Most people design a primer to pick up the target from the plate based on the sequence reported in the iPair Analyzer and add on this missing portion to complete the FR1 portion. 

Our mouse long-read primers for BCRs cover from within FR2 through the beginning of the C-region, so CDR2 and CDR3 will have full coverage, but there won’t be any data for FR1 or CDR1. When our bioinformatics department designed primers across all mouse strains, many of the sequences in the database were truncated. To ensure maximum coverage, the primers were designed from a position that covers the majority of sequences.