### Determination of prediction tools

We have identified a total of 16 different prediction tools from 12 different research groups. Where there are two tools from the same group, they differ either in the method used to predict binding affinity or in the data used to train the model. The tools tested are as follows: Bimas [9], Rankpep [10], SYFPEITHI [11], NetMHC 2.0 ANN and NetMHC 2.0 Matrix [5, 12, 13], SVMHC SYFPEITHI and SVMHC MHCPEP [14], HLA Ligand [15], Predep [16], SMM [2], MHCPred (position only) and MHCPred (interactions) [17, 18], Multipred HMM and Multipred ANN [19, 20, 21], ARB Matrix [7], and a locally implemented logistic regression-based tool [22].

### Creating a collection of peptides for evaluating the predictive performance of each tool

Prediction of peptide binding was evaluated for three different alleles: HLA-A*0201, HLA-B*3501, and H-2Kd. These alleles differ substantially in the number of available tools that make predictions for them: all of the aforementioned tools predict for HLA-A*0201, eleven make predictions for HLA-B*3501, and just four predict for H-2Kd. Thus, these alleles were chosen so that the performance of our combined tool (HBM) and linear discriminate analysis (LDA) could be evaluated when different numbers of individual tools are employed.

Two sources of data were used for comparative analysis of prediction tools in this study. The first was the community binding resource [6], a large, recently published database containing experimentally determined affinity values for the binding of peptides to many different MHC-I alleles. This dataset of testing peptides could potentially be expanded further by incorporating peptides from such online databases as SYFPEITHI [11], MHCPEP [23], HLA Ligand [15], and EPIMHC [24]. However, the use of the latter online databases presents a problem for the current study. As the models underlying many existing prediction tools were trained using data from these latter databases, the subsequent testing of the individual tools with these same peptides may result in an inaccurate estimation of each tool's predictive performance. For instance, tool A may be judged better than tool B merely because tool A was trained using the same peptides with which it was tested, while tool B was not. As combining the scores of the individual tools relies on an accurate appraisal of the performance of each tool, it is necessary to avoid the use of peptides with which the individual tools have been trained. Thus, we used only the community binding resource as our source of binding-affinity data. Only peptides of length 9 were considered, because all tools make predictions for peptides of this length. Peptides with *IC*
_{50} < 500 nM were classified as binders, while those having *IC*
_{50} > 500 nM were classified as nonbinders. In total, there were 1184 binders and 1905 non-binders to HLA-A*0201, 211 binders and 525 nonbinders to HLA-A*3501, and 60 binders and 116 nonbinders to H-2Kd.

For comparison purposes, the tools were also tested using an independent dataset consisting of peptides gathered only from published literature [25, 26, 27, 28, 29, 30, 31, 32, 33]. Again, only nonamers were chosen. Classifying a given peptide as a binder or a nonbinder was performed as follows: if *IC*
_{50} values were reported (as in the community binding database and most literature sources), then the standard binding threshold of 500 nM was used; where some other type of assay was done to determine binding affinity, the classification given by the authors was used. In the latter case, if no classification was given by the authors, the peptides were not used. Finally, to avoid bias in the data, peptides were filtered such that where two peptides differed at fewer than two residues, one peptide was randomly removed. The resultant dataset consisted of 108 binders and 108 nonbinders to HLA-A*0201, and are given in Additional File 1. Due to scarcity of published data, it was not possible to construct similar datasets for HLA-B*3501 or H-2Kd.

### Performance measures

Binding prediction programs give a numeric score to each considered peptide. Each score can be converted to a binary prediction by comparing against a tool-specific threshold – if the score is greater or equal, then the peptide is a predicted binder; otherwise, it is a predicted nonbinder.

Sensitivity is the proportion of experimentally determined binders that are predicted as binders and is defined as *true positives*/(*true positives + false negatives*). Specificity is the proportion of experimentally determined nonbinders that are predicted as nonbinders, and is defined as *true negatives*/(*true negatives + false positives*). The traditional way to measure the performance of a classifier is to use a receiver operating characteristic (ROC) curve. However, ROC curves do not always give a good measure of practical utility. For a researcher scanning a large proteome for potential epitopes, specificity may be much more important than sensitivity. Imagine scanning a proteome consisting of 10,000 overlapping nonamers, 50 of which (unbeknownst to the experimenter) are good binders to the MHC-I allele of interest. Consider further that prediction tool A has 0.70 sensitivity at 0.80 specificity and 0.05 sensitivity at 0.99 specificity.

Tool B has 0.50 sensitivity at 0.80 specificity and 0.20 sensitivity at 0.99 specificity. While tools A and B might have the same area under the ROC curve (*A*
_{
ROC
}), tool A is superior at 0.80 specificity and tool B is superior at 0.99 specificity. If tool A is used at a threshold corresponding to 0.80 specificity, then approximately 2000 peptides must be tested in order to find 35 of the high-affinity binders. In contrast, if tool B is used at a threshold corresponding to 0.99 specificity, only about 100 peptides would have to be tested in order to find 10 of the high-affinity binders. Due to the high cost of experimental testing, and because knowledge of all the binders in a given proteome is usually not needed, the latter scenario would be preferable. We therefore conclude that good sensitivity at very high specificity is a more practical measure of a tool's usefulness than the *A*
_{
ROC
} value, and have thus used sensitivity at high values of specificity as the primary assessor of the practical utility of each tool. For completeness, however, we also include each tool's *A*
_{
ROC
} value.

### Combining the scores of the individual tools

We propose a heuristic-based method (HBM) for combining scores from individual prediction tools to make a better prediction. This method takes advantage of the observation that most of the individual tools make very few false positive predictions when the classification threshold is set sufficiently high, but correspondingly make few predictions of positives. If the tools identify different actual binders, combining such predictions may result in a greater number of rrue positives. The method also tries to take advantage of the "collective wisdom" of a group of predictive tools. The individual tools are based on a variety of techniques. Instead of trying to find the "best" technique, we try to combine the best that each technique has to offer. This is an extension of the idea used by prediction tools such as MULTIPRED [19] which combine predictions made by a few methods.

Our proposed combined prediction tool ("HBM") takes a protein sequence as input, queries all of the individual prediction tools getting from each the predicted binding affinity for all nonamers in the protein, computes a combined score for each nonamer, and finally predicts binders based on the combined scores for all nonamers. The tool is implemented as a Perl script.

The first step in our HBM is to select a specificity for the individual tools. Each tool is then weighted according to its sensitivity at that specificity. Next, the score given to each peptide by a given prediction tool is compared to the tool-specific threshold value for that specificity. If the score is better than or equal to the threshold score, then that tool predicts the peptide as a binder, and the weight (sensitivity at the chosen specificity) for that tool is added to the total score for the peptide. Otherwise, the peptide's total score remains unchanged. For peptide *x* and each prediction tool *t*, we have

where *B*
_{
t
}(*x*) is 1 if peptide *x* is predicted to bind by tool *t* and 0 otherwise, and *W*
_{
t
} is the weight of tool *t*. CombinedScore(*x*) is then compared to a threshold in order to classify *x* as either a predicted binder or a predicted nonbinder.

The performance of the HBM was determined using 10-fold cross-validation: in each fold, 90% of the peptides (the "training peptides") were used to determine the performances of the individual tools, and these performance data were used by the HBM as described above to make predictions for the remaining 10% (the "testing peptides"). Each peptide was used as a testing peptide exactly once. The scores given to each testing peptide were then used to calculate specificity and sensitivity values for the HBM in the same manner as was described for the individual tools. To minimize experimental error due to the random partitioning of the peptides into training and testing sets, the entire process described above was repeated ten times, and the HBM sensitivity at each specificity was taken to be the average of its sensitivity in the ten trials. While *A*
_{
ROC
} values are shown for the individual tools and for the LDA, no such values could be computed for the HBM. The reason for this is that, at high individual tool specificity parameters, most nonbinding peptides get an HBM score of zero, and therefore the ROC curve contains no points for specificities between 0 and approximately 0.85–0.90.

### Comparison technique

A standard method for combining variables to distinguish two categories is linear discriminant analysis (LDA) [34]. If *y* is the vector of scores from all the tools for a particular peptide, it is classified according to the value of the linear discriminant

(*μ*
_{1} - *μ*
_{0})'∑^{-1}
*y*,

where *μ*
_{0} and *μ*
_{1} are the vectors of means for non-epitopes and epitopes, respectively, and ∑ is the average covariance matrix of the scores within the two groups. This method is optimal (in the sense of minimizing the probability of misclassification) if the scores have a multivariate normal distribution with the same covariance matrix for epitopes and non-epitopes. More sophisticated methods have been developed without the normality assumption, but doubts have been expressed about their advantage [35]. The separation between the groups can then be quantified by

*δ*^{2} = (*μ*_{1} - *μ*_{0})'∑^{-1}(*μ*_{
1
} - *μ*_{0}).

Under the normality assumption, if the specificity is fixed at 1 - *α*, then the sensitivity will be

Φ (*δ* + Φ^{-1}(*α*)),

where Φ is the cumulative distribution function (cdf) of the standard normal distribution. *A*
_{
ROC
} can be calculated as Φ (*δ*/
). The threshold for classification is determined by the prior probability *p*
_{1} that a peptide is an epitope, which is related to the specificity by

*p*_{1} = [1 + exp (-*δ*^{2}/2 - *δ*Φ^{-1}(*α*))]^{-1}.

A number of the tools displayed notably non-normal distributions. Most of these were highly skewed, but became close to normal when transformed to logarithms. The scores of three tools (NetMHC 2.0 ANN, Multipred ANN, and the logistic regression-based tool) had sigmoidal distributions. These became approximately normal when converted to scaled logits. A "logit" is a transformation of a probability *p* (between 0 and 1) to log(*p*/(1 - *p*)). For a variable *y* which is restricted between *a* and *b*, a "scaled logit" can be calculated via log((*y* - *a* + *ε*)/(*b* - *y* - *δ*)), where *ε* and *δ* are small adjustments to avoid zeros. *ε* = (*y*
_{-} - *a*)/2 and *δ* = (*b* - *y*
_{+})/2, *y*
_{-} and *y*
_{+} being the smallest and largest observed values greater or less than *a* or *b*, respectively. The actual performance of the linear discriminant on the transformed scores was estimated using ten-fold cross-validation. Computations were done using S-PLUS version 7.0.0. Figures were created with MATLAB 7.

Except for the H-2Kd dataset, the cross-validated specificities fell short of the nominal ones. To realize specificities of 0.99 and 0.90, the threshold was adjusted to a nominal specificity such that the cross-validated values were as close as possible to the target values. Figure 1 shows the distributions of the LDA scores for the community HLA-A*0201 data set. The diagonal lines indicate where the points are expected to fall for perfectly normal data. A specificity of 0.99 corresponds to a horizontal line such that 99% of the non-epitopes fall below this line. Because of the slight upward curvature of the non-epitope distribution, a nominal specificity of 0.99 falls short of this goal, but the larger nominal value of 0.9975 gives the correct threshold. About 32% of the epitopes give LDA scores above this value. Distributions of LDA scores for the the other datasets are given in Additional Files 2, 3 and 4.