Generation of a B-cell epitope (antigenicity) classifier
The set of 505 unique amino acid propensity scales taken from the AAINDEX database forms a 20 × 505 descriptor matrix of rank 19. Calculating sequence properties from propensity scales can be expressed as a matrix multiplication of the amino acid sequence, in matrix representation, with the descriptor matrix. Hence, the resulting property matrix is of rank 19 as well, corresponding to 19 linearly independent vectors. Consequently, only 19 coefficients plus the intercept can be estimated in regression models. Therefore, Principal Component Analysis (PCA) was employed to transform the propensity scales from the 505-dimensional space to a 19-dimensional subspace spanned by all 19 principal components with non-zero eigenvalues, comparable to how parameter reduction has been done before [34, 35]. The 20 × 505 descriptor matrix was centered and scaled to unit variance before PCA was applied. Because no component with a non-zero eigenvalue is omitted, the full information contained in the original 20 × 505 descriptor matrix is retained. From this PCA transformation, a new 20 × 19 descriptor matrix was built. These 19 PCA-derived propensity scales were in turn used to calculate sequence characteristics for the data set. The values were averaged over a sliding window of nine amino acids. Logistic regression employing the logit function was used to build models for the classification of single amino acid residues as epitopic or non-epitopic.
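The dimensionality reduction and the window averaging described above can be illustrated with a minimal sketch. It assumes a random stand-in for the 20 × 505 descriptor matrix (not the real AAINDEX scales), performs PCA through a singular value decomposition, and uses a hypothetical helper `window_features`; the logistic regression step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 20 x 505 AAINDEX descriptor matrix (random values, NOT real scales).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
descriptors = rng.normal(size=(20, 505))

# Center each scale and scale it to unit variance across the 20 amino acids.
z = (descriptors - descriptors.mean(axis=0)) / descriptors.std(axis=0, ddof=1)

# PCA via SVD; centering 20 rows leaves at most 19 non-zero singular values.
u, s, _ = np.linalg.svd(z, full_matrices=False)
keep = s > 1e-8 * s.max()
pca_scales = u[:, keep] * s[keep]          # 20 x 19 matrix of PCA-derived scales


def window_features(sequence, scales, window=9):
    """Average each scale over a sliding window; returns (len - window + 1) x n_scales."""
    index = np.array([AMINO_ACIDS.index(aa) for aa in sequence])
    per_residue = scales[index]            # one row of scale values per residue
    kernel = np.ones(window) / window
    return np.column_stack([
        np.convolve(per_residue[:, k], kernel, mode="valid")
        for k in range(scales.shape[1])
    ])


# Arbitrary example sequence, used only to show the shapes involved.
features = window_features("MKTAYIAKQRQISFVKSHFSRQ", pca_scales)
```

Each row of `features` would then serve as the 19-dimensional input for one window position in a logistic regression model.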
In order to estimate the generalization error, bootstrap validation was employed. Error estimates were acquired from out-of-bag validation, i.e. using those residues that are not part of the bootstrap sample as validation data. Fifty replicates were calculated for the validation of each investigated model.
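The out-of-bag scheme can be sketched as follows. This is a minimal illustration that only generates the index sets; the function name `bootstrap_oob_splits` and the sample size of 1000 are assumptions for the example, and model fitting is left out.

```python
import numpy as np


def bootstrap_oob_splits(n_samples, n_replicates=50, seed=0):
    """Yield (in_bag, out_of_bag) index arrays for each bootstrap replicate."""
    rng = np.random.default_rng(seed)
    all_indices = np.arange(n_samples)
    for _ in range(n_replicates):
        in_bag = rng.integers(0, n_samples, size=n_samples)   # draw with replacement
        out_of_bag = np.setdiff1d(all_indices, in_bag)        # never drawn: validation data
        yield in_bag, out_of_bag


splits = list(bootstrap_oob_splits(n_samples=1000))
oob_fraction = np.mean([len(oob) / 1000 for _, oob in splits])
```

On average roughly 37% of the observations fall outside a bootstrap sample, so each replicate leaves a sizeable validation set.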
The Blythe and Flower dataset
For validation of antigenicity classifiers, a published antigenicity validation set by Blythe et al. (2005) was used. The list of the 48 proteins we used from this dataset (those for which the mapping was clear) can be found in the supplementary material [see Additional file 2]. The data itself can be requested from Blythe et al.
Capturing protein variability
To numerically represent the variability of a protein at a given amino acid position, an information-entropy measure was applied. For each protein whose variability was to be determined, the sequence was searched with BLAST against the non-redundant protein database (nr), and hits were selected manually. The aim was to choose a set of sequences that is diverse, but not too diverse, in order to optimally represent the degree of evolutionary freedom of each amino acid position. The selected proteins were downloaded and multiply aligned using clustalw. For each alignment column, a variability value was calculated as follows:
Randomly draw 100 samples of size 30 from each column (i.e. 100 times 30 amino acids, drawn with repetition), independent of how many sequences have been aligned. This makes the evaluation less dependent on the number of homologues, since for some proteins only five sequences can be found whereas for others hundreds are available. For each sample, determine the most abundant amino acid in the column, then calculate the Shannon entropy weighted by the EMBOSS [36] EBLOSUM variant of the BLOSUM62 [37] substitution scores between each amino acid and the most abundant one in the column. After averaging over all 100 samples to obtain the mean variability, the final variability score f(j) at alignment column j is computed from f_x, the frequency of the most abundant letter x in column j, f_i, the frequency of letter i in column j, and W_ix, the substitution weight between letters i and x. For each alignment, variability scores are independently rescaled between 0 and 1. The variability score ultimately used for protectivity scoring is 1 – rescaled f(j).
Note that each individual gap is regarded as a new character that occurs only once. This extends the 20-letter alphabet by a potentially large number of characters, so that strongly gapped positions are perceived as highly variable.
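The resampling step and the gap handling can be illustrated with a short sketch. It is an assumption-laden simplification: it uses the plain, unweighted Shannon entropy, and the substitution weighting is deliberately left out because its exact form would have to be assumed. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def shannon_entropy(symbols):
    """Unweighted Shannon entropy (in bits) of a list of symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def resampled_entropy(column, n_samples=100, sample_size=30, seed=0):
    """Mean entropy over repeated draws with replacement from one alignment column."""
    rng = random.Random(seed)
    # Every gap becomes its own unique symbol, so gapped columns look highly variable.
    symbols = [f"gap{k}" if aa == "-" else aa for k, aa in enumerate(column)]
    draws = (rng.choices(symbols, k=sample_size) for _ in range(n_samples))
    return sum(shannon_entropy(d) for d in draws) / n_samples


conserved = resampled_entropy("AAAAAAAA")
variable = resampled_entropy("ACDEFGHI")
gapped = resampled_entropy("AA------")
```

Because every draw has the same size, columns from alignments with five sequences and with several hundred sequences are scored on a comparable footing.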
As an independent strategy, an evolutionary constraint index (ECI) was calculated for each alignment column. This index is essentially the difference between the standard Shannon entropy and the entropy after reducing the amino acid alphabet according to the same substitution matrix as above. Briefly, amino acid identities were grouped as follows: E <= E, D, Q, K, R; I <= I, L, M, V; Y <= Y, W, F, H; S <= S, T, A. Other amino acids remained ungrouped. The ECI is primarily discussed when parameter correlations are analyzed.
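A minimal sketch of the ECI for a single column is given below. It uses the grouping listed above and assumes that the index is taken as full-alphabet entropy minus reduced-alphabet entropy, measured in bits; the sign convention and the function names are assumptions.

```python
import math
from collections import Counter

# Reduced alphabet as listed in the text; all other residues map to themselves.
GROUPS = {"E": "EDQKR", "I": "ILMV", "Y": "YWFH", "S": "STA"}
REDUCE = {member: rep for rep, members in GROUPS.items() for member in members}


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def eci(column):
    """Full-alphabet entropy minus reduced-alphabet entropy for one column."""
    reduced = [REDUCE.get(aa, aa) for aa in column]
    return entropy(column) - entropy(reduced)


conservative = eci("ILMVILMV")   # substitutions stay within one group
radical = eci("GPCNGPCN")        # ungrouped residues: reduction changes nothing
```

Here `conservative` evaluates to 2 bits and `radical` to 0, so under this convention a large value indicates that the observed variation is confined to similar residues.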
Posttranslational protein modifications
To predict posttranslational protein modifications, PROSITE patterns were used. In particular, we considered patterns PS00001, PS00002, PS00003, PS00004, PS00005, PS00006, PS00007, PS00008, PS00009, PS00010, PS00012, PS00013, PS00294, PS00409. All sequences in the previously created multiple alignments were searched for occurrences of these patterns. For each hit, the amino acid putatively carrying the modification was marked, and for each column in the alignment the ratio m of modified amino acids was calculated. As for sequence variability, the value used for protectivity scoring is 1 – m. The advantage of using motif predictions on aligned sequences is the possibility to derive the degree of conservation of the motif. Combined with the assumption that conserved post-translational modification motifs are more reliable, whereas such motifs otherwise carry a high false-positive rate, this increases the weight of the prediction.
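As an illustration, the per-column ratio m can be sketched for a single pattern, PS00001 (N-glycosylation, PROSITE syntax N-{P}-[ST]-{P}), written here as a regular expression. The toy alignment, the function name, and the choice to divide by the number of aligned sequences (gapped rows included) are assumptions of this sketch.

```python
import re

# PS00001 (N-glycosylation), PROSITE syntax N-{P}-[ST]-{P}, written as a regex.
# The lookahead makes overlapping hits visible; the modified residue is the leading N.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")


def modified_ratio(alignment):
    """Per-column ratio m of residues marked as carrying the modification."""
    n_columns = len(alignment[0])
    hits = [0] * n_columns
    for row in alignment:
        # Map ungapped sequence positions back to alignment columns.
        columns = [c for c, aa in enumerate(row) if aa != "-"]
        ungapped = row.replace("-", "")
        for match in N_GLYCOSYLATION.finditer(ungapped):
            hits[columns[match.start()]] += 1
    return [h / len(alignment) for h in hits]


alignment = ["MNGSA-K", "MNGTAQK", "MNPSA-K", "M-GSAQK"]
m = modified_ratio(alignment)
ptm_scores = [1 - value for value in m]
```

In this toy alignment two of the four sequences carry the motif at the second column, so m is 0.5 there and 0 elsewhere.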
Data for training antigenicity classifiers
A reference data set was generated from the antibody binding site repository BCIPEP. This database holds a collection of experimentally determined B-cell epitopes. The BCIPEP data set is highly redundant, with many entries related to more than one source protein.
To obtain a non-redundant data set, homologous proteins of this collection were grouped. Members of two different groups differed by at least 30% in sequence identity. After this partitioning, the number of epitopes with a length between 6 and 30 amino acids that could be localized on each protein was determined. Finally, the protein bearing the largest number of epitopes in each group was selected as the representative of that group. This procedure leads to a diverse set of protein sequences with experimentally determined immunogenic regions for each protein.
This data set holds in total 197 proteins with an average length of 449 amino acids. The prevalence of amino acids being part of an epitope is 7.6%. The mean length of an epitope is 16 residues. Out of the 197 proteins 60 originate from bacteria, 55 from viruses, 14 from fungi, 11 from human, 3 from allergens, and 54 from other sources (e.g. eukaryotic parasites). The data set holds in total 401 continuous epitopes, i.e. about 2 epitopes per protein.
Since no systematic epitope mapping of all the proteins in the data set has been performed, categorizing sequence regions as non-antigenic is problematic, as these could be false negatives. Entries classified as non-epitopes do exist, yet it is difficult to discern whether these result from organism-specific or individual-specific effects (i.e. responses varying from individual to individual) or from other biases. We chose to declare the non-defined regions of our reference dataset as non-epitopes. Regions not part of an epitope were randomized while maintaining the average amino acid frequency of BciPep. The resulting data set was used for training B-cell epitope classifiers.
Each amino acid served as the central residue of a 9-mer peptide and passed its class (antigenic or non-antigenic) on to that peptide, which was then used for parameterization and training/validation. Nine-mers were chosen as an intermediate between common assumptions about the size of continuous epitopes (usually between 7 and 15 amino acids).
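The construction of labelled 9-mers can be sketched as follows. The sequence and the epitope annotation are made up for illustration, and the sketch assumes that terminal residues without a complete window are simply skipped.

```python
def labelled_ninemers(sequence, epitope_mask, window=9):
    """Return (peptide, label) pairs; the label is that of the central residue."""
    half = window // 2
    examples = []
    for center in range(half, len(sequence) - half):
        peptide = sequence[center - half:center + half + 1]
        examples.append((peptide, epitope_mask[center]))
    return examples


sequence = "MKTAYIAKQRQISFVK"
# Hypothetical annotation: residues 5..11 (0-based) belong to an epitope.
mask = [5 <= i <= 11 for i in range(len(sequence))]
examples = labelled_ninemers(sequence, mask)
```

Each peptide in `examples` carries the class of its fifth residue, so neighbouring peptides overlap in eight positions but may differ in label at epitope boundaries.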
Antigenicity scores for alignments
Each sequence in the generated multiple alignments was independently scored for its antigenicity using the PCA19 classifier. PCA19 classification resulted in the assignment of an antigenicity value for each amino acid in the multiple alignment with the exception of the flanking 5 amino acids due to a window effect. For each alignment column overall antigenicity was calculated as the average antigenicity over the corresponding amino acids of individual sequences.
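Averaging per-residue scores over alignment columns can be sketched as below. The scores are invented numbers, not PCA19 output, and the sketch assumes that gaps and unscored flanking residues (marked `None`) are excluded from the column average.

```python
def column_antigenicity(alignment, scores):
    """Average per-residue scores over each alignment column, skipping gaps and None."""
    n_columns = len(alignment[0])
    sums = [0.0] * n_columns
    counts = [0] * n_columns
    for row, row_scores in zip(alignment, scores):
        residue = 0                       # index into the ungapped sequence
        for col, aa in enumerate(row):
            if aa == "-":
                continue
            value = row_scores[residue]
            residue += 1
            if value is not None:         # None marks unscored flanking residues
                sums[col] += value
                counts[col] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


alignment = ["AC-DE", "ACGDE"]
scores = [[0.2, 0.4, 0.6, 0.8], [0.4, 0.6, 1.0, 0.2, None]]
profile = column_antigenicity(alignment, scores)
```

Columns in which no sequence contributes a score are returned as `None` rather than as zero, so that they are not mistaken for non-antigenic positions.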
Data for validating predictions of protectivity
Proteins with known B-cell determinants were downloaded from the "Immune Epitope Database and Analysis Resource" (IEDB) [40, 41]. The IEDB allows filtering by various criteria. We applied the following step-wise filters to obtain a protectivity-related dataset:
1. 'Assay Group' = 'Ab binding leading to biological activity' (767 proteins remained)
2. 'Epitope Structure Chemical Type' = 'Peptide/Protein' (735 proteins remained)
3. 'Epitope Source Species' must be defined (679 proteins remained)
4. Exclude linear fragments shorter than 3 and linear fragments longer than 50 aa and structural epitopes (381 proteins remained)
5. 'Qualitative Measurement' = 'Positive' (235 proteins remained)
6. Remove 'Epitope Source Species' = 'Homo Sapiens' (227 proteins remained)
7. Remove identical rows, obvious redundancies, and identical sequences in different strains of the same organism (184 proteins remained)
8. Remove entries from non-pathogens (including pathogenic plants).
SRC6129 and SRC6623 were removed because no UniProt or GenBank IDs were specified. A neutral protease (gi 30260755) of Bacillus anthracis str. Ames was manually added from a literature source. All proteins were then clustered, and the identified groups were multiply aligned, using the standard tools blastclust and clustalw, respectively. Epitopes of all sequences present in an alignment were manually mapped to the homologous sequence that required the fewest remapping steps, or to the one on which all epitopes could be represented, as can be the case for large deletions or proteins with precursor variants. The process is thus similar to the one applied at the Los Alamos National Laboratory HIV database, where all reported epitopes are remapped to the reference strain HXB2.
After removal of all redundancies and impractically short proteins, 57 entries (31785 amino acids) with an average peptide coverage (and thus protectivity prevalence) of 7.25% remained; these are from now on referred to as the "protectivity dataset". It has to be cautioned that the functional effect of antibodies directed against these determinants is classified only as "leading to biological activity", not necessarily as protectivity. For our purposes we consider this a sufficiently close approximation. This dataset can be found in the supplementary materials [see Additional file 1].
Calculation of validation characteristics and Machine-Learning
Validation results were analyzed using the ROCR package, where specifically the calculation of the AROC (area under the curve of true-positive rate versus false-positive rate plots) was most relevant.
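For orientation, the AROC can be computed from scores and binary labels with the rank-sum (Mann-Whitney) formulation. The sketch below is a stand-alone Python illustration and not the ROCR implementation; the function name `auc` and the example values are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1            # tied scores share their average rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


perfect = auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
mixed = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The value equals the probability that a randomly chosen positive residue receives a higher score than a randomly chosen negative one, which is why it is insensitive to the low prevalence of epitopic residues.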
The WEKA package was used where machine-learning functions were needed, in particular a C4.5 and a Random Forest implementation.
From a practical point of view predictions of continuous epitopes should be measured by the number of synthesized peptides required to cover known epitopes. The Synthesis Score is defined as the number of peptides required to cover at least five epitopic amino-acids in the protectivity validation dataset. Five has been selected as a minimum requirement for an epitope.
Machine Learning datasets
To analyse the relevance of the parameters used, simple machine learning techniques were applied as implemented in the WEKA package. For these analyses, a dataset was generated based on the entire protectivity dataset (i.e. not the antigenicity dataset) after exclusion of likely inaccessible regions. Essentially, all residues that were likely immune-accessible according to the rules mentioned earlier were represented by the sum score (individually rescaled between 0 and 1 for each protein), antigenicity, PTM pattern conservation and variability. This dataset of 21293 residues, with 1485 antigenic (protective) and 19808 non-antigenic residues (baseline prediction 6.97%), will be termed the complete machine learning set (CML set). In a second step, for each antigenic (protective) residue a non-antigenic residue was randomly sampled to obtain an equilibrated set. This set was then randomly re-sampled into two stratified sets representing 80% and 20% of the equilibrated set for training and validation, respectively. The training set (2376 instances) and validation set (594 instances) are termed MTS and MVS, respectively.
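The set sizes quoted above can be reproduced with a short sketch of the balancing and splitting steps. Only class labels are simulated here; the random seed, the helper `stratified_split`, and the per-class 80/20 cut are assumptions about how the stratification might be carried out.

```python
import random

rng = random.Random(0)

# Stand-in CML set: only the class labels matter here (1 = protective).
labels = [1] * 1485 + [0] * 19808
positives = [i for i, y in enumerate(labels) if y == 1]
negatives = [i for i, y in enumerate(labels) if y == 0]

# One randomly sampled non-antigenic residue per antigenic residue.
balanced_negatives = rng.sample(negatives, len(positives))


def stratified_split(indices, train_fraction, rng):
    """Shuffle one class and cut it at the training fraction."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


pos_train, pos_valid = stratified_split(positives, 0.8, rng)
neg_train, neg_valid = stratified_split(balanced_negatives, 0.8, rng)
mts = pos_train + neg_train      # training set
mvs = pos_valid + neg_valid      # validation set
baseline = len(positives) / len(labels)
```

With 1485 residues per class, the equilibrated set holds 2970 instances, of which 2376 fall into the training set and 594 into the validation set.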