B cell diversification is an integral part of the normal immune response and includes processes of somatic hypermutation (SHM) and isotype class switching within the germinal centres of the lymph nodes . Understanding the origin and clonal evolution of B cells is critical to the overall understanding of the immune response in disease. A number of studies [3, 4, 5] have demonstrated the usefulness of analysing clonally-related immunoglobulin gene sequence sets in this context.
We designed a tool for identifying clonally related sets as a component in a pipeline for processing high-throughput rearranged immunoglobulin gene sequence data. This tool requires as input a set of sequences partitioned into their likely component germline gene using iHMMune-align, and labels these sequences with information about their likely clonally-related sequence cluster membership. iHMMune-align is used to obtain the identity of the IGHV and IGHJ genes and was selected as it is a component of our analysis pipeline and the only currently available utility suitable for scriptable high-throughput analysis. However there is no reason why the clustering program cannot be modified to make use of other partitioning programs such as IMGT/V-QUEST or SoDA if these are available in high-throughput versions.
Standard phylogeny inference methods are not suitable for exploring clonal relationships within an immunoglobulin gene sequence dataset as antibodies diversify through processes that differ substantially from those of long time scale evolutionary events. However, string comparison methods that are widely used in sequence analysis are still applicable: sequence insertions, deletions and substitutions can be thought of as editing operations that transform one string into another. Pairwise distances between sequences can be calculated from string edit distances. Clonally related sequences are similar in length and have many common mutations, and this is reflected in their pairwise edit distances. A dendrogram representing the hierarchical structure of the relationship between sequences can then be constructed.
The major differences between different hierarchical clustering algorithms are the measure of similarity between each pair of clusters and the underlying modelling of the clusters. Distance measures which can accurately represent relationships between sequences result in improved accuracy in the identification of clonally related sets. The Levenshtein distance is the most commonly used metric for measuring the dissimilarity between strings. However, it is not very suitable for strings of different lengths since it lacks normalization for appropriately weighting edit operations relative to the length of the strings being compared. It should be clear that one difference between two short strings is more critical than one difference between two longer strings.
There are two well-known approaches for normalizing the Levenshtein distance: one based on the editing path lengths (normalized edit distance, NED), and the other on the string lengths (post normalized edit distance, PNED). Post normalized edit distance (PNED) normalizes the editing transformation by the length of one of the strings, while NED normalizes the editing transformation by the alignment path length. Marzal and Vidal  demonstrated the improved accuracy of NED relative to PNED when comparing string patterns. Our results also demonstrated that NED resulted in improved accuracy when clustering clonally-related immunoglobulin gene sequences relative to PNED and to the un-normalized Levenshtein distance.
The clustering accuracy was further improved by incorporating information about common usage of V and J genes between sequences, in addition to the CDR3 similarity. However CDR3 similarity remains the main criterion, as it is less sensitive to potential errors in germline gene identification. The resulting NED_VJ distance produced a clustering of clonally-related sequences in the benchmark set that, when examined by domain experts, was found to improve on their initial classification by visual inspection of the iHMMune-align outputs. This demonstrates the value of our approach not only in automating the painstaking task of identifying clonally-related sequence sets by visual inspection but also in improving the accuracy of that identification.
Various combinations of values for the gap, V mismatch and J mismatch penalty parameters were tested using the PNG dataset in order to select the best performing values. The PNG dataset covers a wide range of germline gene usage and mutation level and is very likely to be representative of most heavy chain gene sets. The resulting values are therefore likely to be suitable for all immunoglobulin heavy chain gene sets. However the user has the option of modifying these values if deemed necessary.
One issue with clonally-related sequence identification is the presence of sets of sequences which appear clonally-related but are only similar as a result of chance matching. This is particularly an issue with sequences containing very short IGHD genes where CDR3 similarity can more readily occur by chance and is less likely to reflect a common clonal origin. This issue was addressed by selecting a conservative distance threshold below which a cluster is considered to be a clonally-related sequence set, although this may decrease the sensitivity of our method. Some improvements may be obtained by taking into account IGHD gene identity and number of shared mutations, when these can be determined with confidence .
Another source of error in the classification of the benchmark dataset was the presence of chimeric sequences in the dataset. These chimeric sequences are formed from recombination of two sequences and may have identical CDR3 sequences, but different V genes. Our method relies primarily on CDR3 differences for identifying clonally related sets, and therefore cannot avoid erroneous clustering of chimeric sequences. This problem was minimised by inclusion of IGHV identity in distance calculations.