One of the most obvious ways to identify potential new antigens in newly sequenced microbial genomes is through similarity searching. Assuming that we know the sequence of one or more extant antigens, we can make use of sequence searching programs of varying complexity and sophistication, such as BLAST or FASTA, to identify similar sequences in the target genome. This set of selected candidate antigens can then be prioritised for further theoretical and ultimately experimental validation. Obviously, all the same caveats that exist for any sequence search hold here also: which thresholds are appropriate? Are apparently high-scoring matches real or an artefact of search methodology? This process also presupposes that enough known antigens are available for such searches to be comprehensive and thus effective. Compiling such antigens is the role played by the database.
In the last decade, factory-scale experimentation allied to extensive literature mining has generated many functional immunology databases. Databases such as SYFPEITHI [65, 66], which focus on cellular immunology and look primarily at data relevant to MHC processing, presentation, and T-cell recognition, have existed since the mid-1990s. Arguably, the best such database is the HIV Molecular Immunology Database, although clearly the depth of the database comes at the expense of breadth and generality. It archives CD4+ and CD8+ T-cell and B-cell epitopes derived from the virus. It also includes detailed biological information regarding the response to the epitope, including its impact on long-term survival, common escape mutations, whether an epitope is recognized in early infection, and curated alignments summarizing the epitope’s global variability.
Other recent databases include MHCBN [68, 69], which contains 18,790 MHC-binding peptides, 3,227 MHC nonbinding peptides, 1,053 TAP binders and nonbinders, and 6,548 T-cell epitopes. EPIMHC is a relational database of naturally occurring MHC-binding peptides and T-cell epitopes. Presently, the database includes 4,867 distinct peptide sequences from various sources, including 84 tumor antigens. Two databases in particular warrant special attention, albeit for different reasons. They are AntiJen, formerly known as Jenpep [72, 73], and IEDB.
AntiJen seeks to integrate a wider range of data than is archived by other databases. Implemented as a relational PostgreSQL database, AntiJen is sourced from the primary literature and contains over 24,000 entries; it includes quantitative kinetic, thermodynamic, functional, and cellular data within the context of immunology and vaccinology. As well as T-cell and MHC binding data, AntiJen holds over 3,500 entries for linear and discontinuous B-cell epitopes, and includes measurements of peptide interactions with the TAP transporter and of peptide-MHC complex interactions with T-cell receptors (TCR), as well as immunological protein-protein interactions.
IEDB is a database lavishly funded by the NIH, which addresses issues of biodefence. As we have said, it is on a much larger scale than any other similar database, and benefits significantly from the input of 13 dedicated epitope sequencing projects. These exist, in part at least, to populate the database. IEDB has largely eclipsed other efforts in functional immune databases. However, it does not, as a priority, address antigenicity at the whole protein level.
At this point it is worth discussing the distinction between functional annotation and the objective discussed here. Generally speaking, the function that a protein performs within the context of its organism of origin is irrelevant to its status as an antigen. Here the ubiquity and multiple meanings of the word antigen are of little if any help. A protective antigen is a protein which is recognised and recalled by the host. This characteristic does not seem to be linked to the fact that a protein is an enzyme or a DNA binding protein, nor does logic require such a link. Thus possessing a particular function is not a necessary condition for a protein to be an antigen, though the unequivocal identification of certain functions, such as being a virulence factor, greatly increases the probability that it will be one.
Concerning antigens, however, these databases, although replete with information concerning individual B cell epitopes, T cell epitopes, and Major Histocompatibility Complex (MHC) binding peptides, remain otherwise partial and incomplete. Their focus is on the epitope, not the antigen. There are many antigens for which specific epitope or MHC binding information is not currently available, yet many such antigens are known experimentally to induce innate or adaptive immune responses, or both. Fortunately for the future of vaccine design and discovery, specific antigen-orientated, rather than epitope-orientated, databases are now becoming available.
Arguably, the clearest and most unequivocal example of an antigenic protein is the so-called virulence factor (VF). These proteins are able to undertake the colonization of a host organism and/or induce disease. They are the front-line weapons in the pathogenic arsenal. Analysis of known pathogenic species, such as Vibrio cholerae or Streptococcus pyogenes, has enabled the recognition of recurrent “systems” of VFs and toxins that may total 40 or more distinct proteins. These may exist as discrete pathogenicity islands or be spread more widely in the genome.
Traditionally, classification of VFs has categorised them as belonging to several thematic groups: adherence/colonization factors, invasins, exotoxins, transporters, iron-binding siderophores, and miscellaneous cell surface factors. A broader definition groups VFs into three classes: (1) “true” virulence factors; (2) VFs associated with the expression and regulation of class 1 VF genes; and (3) VFs required for the colonization of the host.
A number of databases that archive VFs have been reported. The Virulence Factor Database (VFDB; URL: http://www.mgc.ac.cn/VFs/) contains 16 characterized bacterial genomes with an emphasis on functional and structural biology and can be searched using text, BLAST, or functional queries [75, 76]. TVFac (Los Alamos National Laboratory Toxin and Virulence Factor database; URL: http://www.tvfac.lanl.gov/) contains genetic information on over 250 organisms and separate records for thousands of virulence genes and associated factors. The Fish Pathogen Database (URL: http://www.fishpathogens.eu/vhsv/index.php), set up by the Bacteriology and Fish Diseases Laboratory, has identified over 500 virulence genes using fish as a model system. Pathogens studied include Aeromonas hydrophila, Edwardsiella tarda, and many Vibrio species. Candida albicans virulence factor (CandiVF) is a small species-specific database of VFs, which may be searched using BLAST or an HLA-DR hotspot prediction server. PHI-base is a noteworthy development, as it integrates VFs from a variety of pathogens of plants and animals into a cohesive whole.
Obviously, antigens need not be VFs: they need only be accessible to immune surveillance and need not be directly or indirectly involved in infectivity. Because of this, other types of database are required, able to capture and contain a wider tranche of relevant data. In the recent past, another database has been developed: AntigenDB contains a compilation of over 500 antigens drawn from the literature and other immunological resources. It marks a new beginning in immunoinformatics, signalling a switch away from the peptide epitope and toward the whole protein antigen. These antigens come from 44 important pathogenic species. In AntigenDB, a database entry contains information regarding the sequence, structure, origin, etc. of an antigen, with additional information such as B- and T-cell epitopes, MHC binding, function, gene expression, and post-translational modifications, where available. AntigenDB also provides links to major internal and external databases. AntigenDB will be updated on a rolling basis, with the regular addition of antigens from other organisms. This database will form the core of future attempts to predict antigens both by sequence similarity and using more recondite analysis.
At this point it is worth mentioning the issue of thresholds. Clearly, when one runs a sequence search, using BLAST for example, one might generate huge lists of near-identical proteins or get no hits at all; and, for that matter, we could also obtain almost any intermediate result. The issue is to judge which result is useful and which is not. This typically equates to setting a threshold, above which we anticipate usefulness and below which we might expect a lack of utility. Setting such thresholds is, however, no easy task. They are dependent on the nature of the family in question. For the lipocalin family [80, 81, 82], for example, hits are still valid in terms of structure and function at levels of similarity that would simply be noise for most other protein families. Thus thresholds are family dependent, as well as problem dependent. Empirically selected cut-offs are thus the order of the day, but much thought and experimentation are needed in order to select appropriate values.
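As an illustration of family-dependent cut-offs, the following minimal Python sketch filters hits from the standard 12-column tabular output of BLAST (-outfmt 6) using a per-family threshold table. The numerical cut-offs and the family labels in THRESHOLDS are purely hypothetical placeholders chosen for illustration; they are not recommended values and would need to be set empirically, as discussed above.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float    # percent identity (column 3)
    evalue: float    # expectation value (column 11)
    bitscore: float  # bit score (column 12)


def parse_tabular(text: str) -> List[Hit]:
    """Parse BLAST tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11])))
    return hits


# Hypothetical cut-offs: a permissive setting for a highly divergent
# family and a stricter default for everything else.
THRESHOLDS = {
    "default": {"max_evalue": 1e-5, "min_pident": 30.0},
    "lipocalin": {"max_evalue": 1e-2, "min_pident": 15.0},
}


def filter_hits(hits: List[Hit], family: str = "default") -> List[Hit]:
    """Keep hits that pass the cut-offs assigned to the given family."""
    t = THRESHOLDS.get(family, THRESHOLDS["default"])
    return [
        h for h in hits
        if h.evalue <= t["max_evalue"] and h.pident >= t["min_pident"]
    ]
```

Under these assumed settings, a hit with 18% identity and an E-value of 1e-3 would be discarded by the default filter but retained when the query is treated as a member of a divergent family, which is the behaviour the text argues for.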
As well as seeking similarity to known antigens, there is another, quite prevalent, idea that is deserving of comment: that the immunogenicity of a protein is determined solely by its lack of similarity to the host. What we search for is some quantitatively meaningful measure of the “foreignness” of a protein which correlates highly with its immanent immunogenicity. In this context, what we mean by the word “foreignness” is the evolutionary distance between the host (a man, a cow, or a mouse) and the candidate antigen, or the organism that produced it, or both. Some consider this to be the prime factor determining the potential immunogenicity of a protein [83, 84]. Clearly, since we are dealing with proteins and carbohydrates and the like, this evolutionary distance must be specified in terms of their molecular structures, or more likely their sequences, rather than in terms of morphological features.
The potential importance of such a concept is supported by the observation that immune systems are actively educated to lack reactivity when presented with self-proteins, a process, often called immune tolerance, which is generated via epitope-specific mechanisms including clonal anergy, receptor editing, and clonal deletion. But how can we progress beyond this rather inexactly specified philosophical standpoint to something which is practically useful when we select vaccine candidates?
A potentially more useful way to express this conjecture is that the likelihood that a protein is immunogenic is solely a function of that protein’s dissimilarity to the whole host proteome at the sequence level. Or, to be more precise, it is a function of how close, in terms of sequence similarity, the candidate is to the closest or most similar member of the host proteome. Most sequence search software is well suited to solving this problem, since it is precisely the problem that it was written to address. More difficult is assessing average dissimilarity to the whole proteome, a problem compounded when we use the similarity of overlapping peptide fragments rather than looking at similarity at the level of global sequence alignment. In terms of choosing the length of such fragments, the epitope would seem to be the most logical choice, since this immunological quantum is the moiety most likely to be recognised during the immune response.
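The fragment-based view of foreignness described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an established method: the function names are hypothetical, the fragment length k is a free parameter (an epitope-sized value would be a natural choice), and only exact fragment matches against the host are counted.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all overlapping fragments of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def host_kmer_index(host_proteins: Iterable[str], k: int) -> Set[str]:
    """Collect every length-k fragment found anywhere in the host proteome."""
    index: Set[str] = set()
    for protein in host_proteins:
        index |= kmers(protein, k)
    return index


def foreign_fraction(candidate: str, host_index: Set[str], k: int) -> float:
    """Fraction of the candidate's overlapping k-mers with no exact
    match in the host index (1.0 = entirely foreign, 0.0 = entirely self)."""
    n = len(candidate) - k + 1
    if n <= 0:
        raise ValueError("candidate is shorter than the fragment length k")
    foreign = sum(
        1 for i in range(n) if candidate[i:i + k] not in host_index
    )
    return foreign / n
```

Exact matching is the strictest possible criterion for "self"; a scheme that tolerates near-matches would report lower foreignness for the same candidate, so the value returned here is best read as an upper bound under this simplification.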
Yet even at the level of the epitope, a peptide of, say, 10 amino acids, even one mismatch in an otherwise perfect match can be significant, since such a sequence difference, comprising a single amino acid, may enhance or abrogate the binding of neutralizing antibodies to a particular antigenic protein. Moreover, the cross-reactivity of a single high-affinity monoclonal antibody is rather different in nature to the cross-reactivity of a large set of lower-affinity polyclonal antibodies, and so too is their ability to tolerate amino acid mismatches. Cross-reactivity will also vary between individuals, since the immunization history of each organism will dictate to a large extent the recognition of epitopes.
Our understanding of epitopes can inform our understanding of mismatch tolerance, since the affinity of T-cell epitopes is more dependent on the possession of suitable anchor residues than it is on the possession of non-anchor residues. Having said that, the dogma of anchor-dependent affinity was long ago debunked, since all residues make some kind of contribution to affinity, entropic or enthalpic, although generally it is right to say that so-called anchors do make more significant contributions. Our understanding of the structural basis that determines the affinity of antibody-mediated epitopes is much less assured and complete, and the underlying thermodynamic causes of affinity, if strict causes they are, typically only become clear when high-resolution structural data are combined with measured thermodynamic metrics.
Likewise, when one looks not at a representative individual, but at the whole population, then the deletion of a single protein, within one host versus another, can render candidates previously valid and immunogenic as suddenly neither. Again, these are difficult issues; as yet, they remain unresolved.
Given the hypothesis that immunogenicity is in some sense mediated by the level of similarity between a pathogenic protein and the host proteome, we have, in as-yet-unpublished work, sought to benchmark sequence similarity analysis as a means of quantifying the differences between populations of antigens and non-antigens. To that end, we identified sets of 100 known antigenic and 100 non-antigenic protein sequences derived from a variety of sources: allergens, bacteria, fungi, parasites, tumours and viruses. These were compared to the human and mouse genomes using standard sequence similarity searching protocols. Whole pathogen proteomes were also aligned to these host proteomes. Most antigenic and non-antigenic sequences were observed to be non-redundant; this implies a lack of clear homologues between pathogens and the human or mouse proteomes, although a number of parasite antigens were found to have a much higher level of similarity. These proteins comprised heat shock proteins, catalases, and enzymes involved in hydrolysis. These protein families are structurally conserved, though they might display significant functional diversity.
We also used statistical approaches such as the non-parametric Mann-Whitney test to assess whether differences between the two populations were significant. The statistical null hypothesis could not be rejected in most cases, indicating that any apparent difference between the populations could be attributed to chance. The Mann-Whitney test thus supported the observations from sequence similarity analysis. We were unable to determine a threshold or cut-off based on the hypothesis of non-redundancy to the host’s proteome. These results suggest that this is not in itself a solution to the identification of antigens. A variant based on fragments may be more successful, and this is clearly an issue crying out for further, deeper research.
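For readers unfamiliar with the test, the comparison of two populations of similarity scores can be sketched as below. This is a generic, self-contained implementation of the Mann-Whitney U statistic using the normal approximation with a tie correction (and no continuity correction); it is not the code used in the work described, and the function name and interface are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value) for samples x and y,
    using the normal approximation with a correction for tied values."""
    n1, n2 = len(x), len(y)
    pooled = sorted(
        [(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0]
    )
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of equal values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for m in range(i, j + 1):
            ranks[m] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided
```

In use, x and y would hold, for example, the best-hit similarity scores against the host proteome for the antigen and non-antigen sets respectively; a p-value above the chosen significance level means the null hypothesis of no difference cannot be rejected. The normal approximation is reasonable for samples of the size described (100 per group) but not for very small samples.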
There are, of course, many other approaches to identifying antigenic proteins. One notable approach is to look for the horizontal transfer of so-called pathogenicity islands, clusters of pathogenic proteins acquired by transfer between micro-organisms. Detection of such islands, which are typically large gene clusters with an atypical yet characteristic G + C content, can in turn lead to the identification of antigenic proteins [88, 89, 90, 91]. Analysis at the nucleic acid level rather than at the protein level can facilitate the discovery of virulence proteins, perhaps using approaches similar to those used to identify CpG islands.
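The idea of flagging regions with atypical G + C content can be illustrated with a simple sliding-window scan. The sketch below is a deliberately naive illustration, not one of the published island-detection methods cited above: the window size, step, and z-score cut-off are arbitrary assumptions, and the function names are hypothetical.

```python
import statistics
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(
    genome: str, window: int = 1000, step: int = 500, z_cutoff: float = 2.0
) -> List[Tuple[int, float, float]]:
    """Return (start, gc, z) for windows whose G + C content deviates from
    the mean over all windows by at least z_cutoff standard deviations."""
    starts = list(range(0, len(genome) - window + 1, step))
    values = [gc_content(genome[s:s + window]) for s in starts]
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # perfectly uniform composition: nothing stands out
    flagged = []
    for s, v in zip(starts, values):
        z = (v - mean) / sd
        if abs(z) >= z_cutoff:
            flagged.append((s, v, z))
    return flagged
```

Adjacent flagged windows would then be merged into candidate islands and the genes within them examined further. Compositional bias alone is weak evidence, so such a scan is best treated as a first-pass filter rather than a predictor in its own right.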
However, rather than look at nucleic acid sequences, or at protein sequences directly, a new approach, based upon alignment-free techniques, has been developed which shows significant potential; we examine this next.