We provide here the most precise estimates of gene segment usage frequency currently available. The quantity of data that we assembled and analyzed has enabled us to estimate V, D, and J segment usage frequencies with tight confidence intervals. These data potentially give insight into the structural basis for differential segment usage in terms of either raw expression or somatic selection, though such elucidations are left for further research.
In addition, comparison of our results with previously published usage frequencies provides validation of our data collection methods and confidence that our P sequence dataset is representative of natural diversity as intended. In particular, an extensive study of Ig CDR3 diversity based on de novo sampling of Ig using a primer for a single VH gene shows D segment usage results remarkably similar to our own, based on a Spearman rank correlation score of 0.93 [18]. This in spite of the fact that D segments are notoriously challenging to identify within Ig genes due to recombination site choice, flanking 5' and 3' n-nucleotide addition [10], and somatic mutation [19, 20, 21]. With J segments, furthermore, our data are consistent with published findings that indicate that segment J4 is used most frequently, followed in descending order by segments J6, J5, J3, J1, and J2 [11, 22, 23] (Fig. 5).
For V segments, our data again provide statistical evidence in support of published findings. With individual segments, our data support previous results showing that segment V3-23 is the most frequently used [11] in productive rearrangements, and that gene V4-34, which we found to be used second most frequently, has high usage within adult peripheral lymphocytes [23]. Like the J segments, individual segment usage can vary, but in spite of that, segment usage at the family level approximates expected usage based on literature. Our data support findings that show that segments in family V3 are used most frequently, followed in descending order by V4, V1, V5, V2, V6, and lastly V7 [11, 12]. We have also shown consistency with findings that, with some variation, the distribution of V gene usage by family shows similarity to germline complexity of the known segments [11] (Fig. 7). Our data showed this to be especially true for families V1, V3, and V4.
The NP sequences showed an enhancement of segment usage from family V4 at the expense of segments from family V1, due primarily to a 67% increase in usage of segment V4-34 from what was expected. Segment V4-34 has been reported to be over-represented in the adult human repertoire [24], and has also been implicated in generating autoreactive B-cells in SLE patients and against cold agglutinins [25, 26, 27]. Since the NP sequences are not subject to selection, those sequences coding for autoreactive receptors would not be deleted from the repertoire. Also, V4-34 has been shown previously to be limited by selection in the expressed human Ig repertoire due to lowered usage of this segment between IgM and IgG populations [28]. Thus, V4-34 is likely not enhanced in autoimmune disorders, but instead is selectively limited in the P sequences.
Having validated our data collection methods, we focused on analyzing the genetic mechanisms involved in V(D)J recombination. One such mechanism is n-nucleotide addition by TdT. The zero-inflated negative binomial model fits these data well enough for us to seek an interpretation of its three parameters. We develop this interpretation in terms of two states: TdT attached to one of the unjoined DNA ends, or unattached. The probability that TdT never attaches is the first parameter, the zero-inflation factor. When attached, TdT either adds another nucleotide or becomes detached, with probabilities p and 1 - p, respectively. In this context, the final parameter, r, has a natural interpretation as the number of times TdT detaches before the joint is closed.
We found that for the P sequences, r < 2 for the D-J junction but r > 2 for the V-D junction (Table 2). This pattern is consistent with a greater TdT concentration during the V-D joining process relative to that during the D-J process.
Studies of TdT expression during B-cell ontogeny show high levels TdT mRNA during the pro-B and late pro-B stages of development – the stages in which the D-to-J and the V-to-DJ rearrangements occur, respectively [29, 30]. Specifically, it has been shown that TdT expression is upregulated as the B-cell moves from the pre-pro-B stage, undergoing D-to-J recombination, and that expression peaks as the V-to-DJ rearrangement occurs in the late pro-B stage [30]. TdT expression then quickly declines as the cell progresses into the pre-B stage. This observation is consistent with our result, that there are more detachments (and hence more attachments) before end-joining in the V-D junction relative to the D-J junction (Table 2).
We also investigated the mechanisms involved in gene segment recombination. Our findings regarding D-J segment correlations raise an interesting hypothesis that multiple successive D-J rearrangements may occur prior to recombination with a V segment. Previously, Reth et.al. tested the possibility of this hypothesis in murine 300-19 cells cultured in vitro by assaying for the presence of a designated D-J insert and found that such multiple successive recombinations can and do occur [16]. Other studies analyzing nonproductive human Ig rearrangements have hypothesized, based on their observations, that multiple successive D-J rearrangements at the human heavy locus are likely [14, 17]. We here provide evidence for this hypothesis for human Ig. This rearrangement mechanism differs from that observed in receptor editing in the heavy chain via V
H
replacement [31] or at the light chain loci by secondary de novo rearrangements [32, 33]. Our analysis suggests that multiple D-J rearrangements may occur up to 15% of the time prior to the V-to-DJ rearrangement, with each successive D-J recombination replacing the previous one via excision.
The processes involved in D-J recombination are complex and likely require more parameters to better model the system. Still, the results of our modeling, with such extreme differences in p-value and chi-square values, are sufficient to support our hypothesis for the observed patterns in our P sequences. These data provide the first statistically supported observations of multiple successive recombinations in productive human Ig sequences. Considering V-D pairings, we did not perform a similar contingency table analysis since the greater number of possible pairs dramaticallyreduces the statistical power. For the NP sequences, the relatively low number of sequences in this set did not allow for this analysis.
These analyses prompted us to speculate about the observed J segment frequencies. Our multiple recombination model can help explain the lower usage frequency of segments J1 and J2, but prompts one to question why V5 is not used as frequently as V6, yet instead has a similar frequency to V3. Of the remaining four segments, J4 and J6 are used most frequently, followed by J5 and J3. It is possible that there are structural reasons for these observations concerning DNA access and histone acetylation. We propose, however, that the observed trends may instead be due to selection for tyrosine residues. Analyses of the 5' portion of the functional J segments, up to the invariant tryptophan residue, show that both J3 and J5 lack tyrosine residues, while J4 has two and J6 has five. Tyrosine has biochemical and structural properties that make it beneficial in protein binding interfaces, such as CDR [34]. Also, studies of amino acid profiles in human Ig have shown that tyrosine is one of the most abundant residues found in CDR, and specifically within CDR3, it locates most often at the C-terminus end of the CDR3 loop [34, 35]. Any residues contributed to CDR3 by J segments would be found at the C-terminus end of CDR3. The desirability of tyrosine residues and their frequent location at the 3' end of CDR3 suggests biased selection toward proteins comprised of J segments that contribute such residues, namely J4 and J6.
With regard to CDR3 length, we found that the P sequences had a statistically shorted mean compared with the NP sequences. The higher mean CDR3 observed in the NP sequences may be due to a lack of selection. It has been previously shown that negative selection occurs in the bone marrow against B-cells presenting Ig with long CDR3 [36]. This may be because Ig with long CDR3 have been correlated with polyreactive specificity, including specificity for self peptide [37]. Since the NP sequences are not subject to selection in the bone marrow, these data provide evidence that negative selection restricts CDR3 length in the human Ig repertoire.