Immune responses are initiated and regulated by major histocompatibility complex (MHC) molecules, which bind short peptides derived from antigens and display them on the cell surface for recognition by T cell receptors. The specificity of this binding can be predicted from the amino acid sequence of a peptide. Such predictions can be used to select epitopes for rational vaccine design and to improve our understanding of the roles the immune system plays in infectious diseases, autoimmune diseases, and cancers.
There are two types of MHC molecules, class I and class II, and both are highly polymorphic. The core binding subsequence for both MHC I and MHC II is approximately 9 amino acids long. However, MHC I molecules rarely bind peptides much longer than 9 amino acids, while MHC II molecules can accommodate longer peptides of 10–30 residues [1, 2, 3]. The presence of a binding core of uniform length makes the prediction of peptide-MHC I binding comparatively easy. Many methods have been developed for predicting peptide-MHC binding, including simple binding motifs, quantitative matrices, hidden Markov models, and artificial neural networks [4, 5, 6, 7, 8]. These methods can be readily applied to MHC I molecules, since the binding motif is well characterized and most natural peptides that bind MHC I molecules are of nearly equal length.
The prediction of MHC class II binding peptides is a difficult classification problem. MHC class II molecules bind peptides that are 10–30 amino acids long, with a core region of 13 amino acids containing primary and secondary anchor residues [2, 9, 6, 10, 11]. Analysis of binding motifs suggests that a core of only 9 amino acids within a peptide is essential for peptide-MHC binding. Reported binding peptides usually have variable lengths, and the core region of each peptide is undetermined. A search for the binding core region can therefore circumvent the problem of variable length.
Efforts have focused on aligning the peptides so that a block of the alignment can be identified as the binding cores. The alignment has been searched with evolutionary algorithms [12], with the Gibbs sampling method [13], and, more recently, with a method motivated by the ant colony search strategy [14]. The first looks for a position scoring matrix with the highest fitness score (predictive power) through the genetic operator of mutation. The latter two attempt to find an optimal local alignment, by Monte Carlo Metropolis sampling in the alignment space and by the collective search strategy of ant colony systems, respectively. Binding cores of equal length are identified from the alignment, and a scoring matrix used for prediction is established from these cores. In the work of Brusic et al. [12], the alignment of peptides is treated as a pre-processing step: once the binding cores are determined, a binary classifier is learned with artificial neural networks, using the amino acid sequences of the binding cores as the positive training set and non-binding peptides as the negative training set. In Nielsen et al. [13] and Karpenko et al. [14], a position scoring matrix is obtained from the best alignment and used to score peptides. Most of these alignment-based predictors have achieved reasonably good performance. However, a common complication of these methods is the choice of the associated parameters, which can be complicated to tune. In a similar vein, Bhasin et al. used a pre-processing procedure called MOTs to filter putative binding cores from binding peptide sequences and then trained a classifier based on the support vector machine (SVM) [15] with those binding core sequences and random sequences [16]. Another iterative approach has been developed based on a stepwise discriminant analysis model [17, 18]. More recently, a model based on Bayesian neural networks has been developed [19].
This work is motivated by a machine learning model designed in text mining for training with only positive and unlabeled examples. This type of training set arises in various applications in which identifying positive training examples is labor intensive and time consuming. The basic idea developed for this learning task is to use a binary classifier to filter positive examples out of the unlabeled set and move them into the positive set through an iterative procedure [20, 21]. At each iteration a classifier is trained by simply assigning positive examples the label 1 and unlabeled examples the label -1 to form a normal binary training set. The classifier can be learned with different binary classification methods, such as naive Bayes or support vector machines.
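A minimal sketch of this iterative scheme, in the text-mining setting, is given below. The function `train_classifier` is a hypothetical placeholder for any binary learner that returns a model with a `predict` method; it is not part of the cited methods, and the loop is an illustration of the idea rather than the exact procedure of [20, 21].

```python
import numpy as np

def pu_iterate(positive, unlabeled, train_classifier, n_iter=5):
    """Sketch of iterative learning from positive and unlabeled data:
    repeatedly train a binary classifier (positives = +1, unlabeled = -1)
    and promote predicted positives from the unlabeled to the positive set."""
    P, U = list(positive), list(unlabeled)
    for _ in range(n_iter):
        X = np.vstack(P + U)
        y = np.concatenate([np.ones(len(P)), -np.ones(len(U))])
        model = train_classifier(X, y)        # any binary learner (placeholder)
        pred = model.predict(np.vstack(U))    # predicted +1 / -1 labels
        P += [u for u, p in zip(U, pred) if p > 0]   # move to positive set
        U = [u for u, p in zip(U, pred) if p <= 0]
        if not U:
            break
    return P, U
```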
The unlabeled and labeled examples in the prediction of peptide-MHC binding arise naturally through the encoding mechanism. A sliding window of length 9 is applied to the binding peptides, breaking each peptide into a set of nonamers. The binding core, which is unknown, is one of these nonamers. The nonamers from all binding peptides serve as the unlabeled examples, among which the positive examples, i.e., the nonamers of the binding cores, are included. Similarly, all nonamers obtained from the non-binding peptides serve as negative examples. Note that this situation is the opposite of the one in text mining: here a negative set and an unlabeled set containing the potential positive examples are given. Nevertheless, the same strategy applies; the approach is to iteratively filter non-binding nonamers out of the unlabeled set. This iterative learning model uses the non-binder information to identify the binding cores and to build the predictor simultaneously, in contrast to the three alignment-based methods mentioned earlier, in which the identification of binding cores relies only on the binding peptides.
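Concretely, the sliding-window encoding enumerates every candidate core of a peptide; the peptide sequence below is a toy example.

```python
def nonamers(peptide, core_len=9):
    """Slide a window of length 9 over a peptide to enumerate
    all candidate binding cores."""
    return [peptide[i:i + core_len] for i in range(len(peptide) - core_len + 1)]

# A 12-residue binder yields four candidate nonamers, one of which is the
# (unknown) binding core; all four enter the unlabeled set.
print(nonamers("ACDEFGHIKLMN"))
# ['ACDEFGHIK', 'CDEFGHIKL', 'DEFGHIKLM', 'EFGHIKLMN']
```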
The linear programming (LP) model proposed by Bennett and Mangasarian [22] is used as the learning model for binary classification at each iteration. This model has several advantages over other learning methods such as support vector machines, naive Bayes, and artificial neural networks. First, it has only a few parameters, and they are easy to tune. Second, a linear program can be solved very quickly and has favorable properties that allow sensitivity analysis. Thus, if the subsequent linear program differs only in a small number of constraints, its optimal solution can be found through a re-optimization procedure that exploits the current optimal solution. This is particularly important for the iterative learning procedure, as only a small number of nonamers is removed from the positive training set at each iteration.
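For concreteness, the sketch below states the robust LP of [22] for point sets A (positive class) and B (negative class) and solves it with an off-the-shelf solver; the use of scipy here is an assumption of this sketch, not part of the original work.

```python
import numpy as np
from scipy.optimize import linprog

def rlp_train(A, B):
    """Robust LP of Bennett and Mangasarian [22] (sketch):
        min  (1/m) sum(y) + (1/k) sum(z)
        s.t. y_i >= 1 - (a_i.w - gamma),  z_j >= 1 + (b_j.w - gamma),  y, z >= 0.
    Classify a point x as positive if x.w - gamma > 0."""
    m, n = A.shape
    k = B.shape[0]
    # Variable vector: [w (n), gamma (1), y (m), z (k)].
    c = np.concatenate([np.zeros(n + 1), np.full(m, 1.0 / m), np.full(k, 1.0 / k)])
    # Rows for  -a_i.w + gamma - y_i <= -1
    top = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    # Rows for   b_j.w - gamma - z_j <= -1
    bot = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    res = linprog(c, A_ub=np.vstack([top, bot]), b_ub=-np.ones(m + k),
                  bounds=[(None, None)] * (n + 1) + [(0, None)] * (m + k),
                  method="highs")
    w, gamma = res.x[:n], res.x[n]
    return w, gamma
```

In the iterative procedure, successive LPs differ only in the rows corresponding to the removed nonamers, which is exactly what makes warm-started re-optimization attractive.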
This model was evaluated on benchmark datasets from MHCBench against other major existing methods. The computational study demonstrates that, overall, the method achieves performance comparable or superior to that of competing predictors such as the Gibbs sampler [13] and TEPITOPE [10]. The average areas under the ROC (receiver operating characteristic) curve [23] obtained with one variant of our model are 0.753 and 0.715 for the original and homology-reduced benchmark sets, respectively. The corresponding values are 0.744 and 0.673 for the Gibbs sampler and 0.702 and 0.667 for TEPITOPE.
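For reference, the area under the ROC curve used in this comparison can be computed from binding labels and prediction scores as follows; the values are toy data, and scikit-learn is an assumed dependency rather than part of the evaluation in [23].

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 0, 1, 0]                     # 1 = binder, 0 = non-binder (toy labels)
y_score = [0.90, 0.30, 0.20, 0.35, 0.70, 0.10]  # predicted binding scores (toy values)
print(roc_auc_score(y_true, y_score))           # area under the ROC curve
```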