Collecting human immune system related genes and proteins and their orthologs
From articles, textbooks and electronic sources we collected altogether 847 human genes that are involved in immunology-related processes, or which are essential for the life of immunological cells and organs [4]. The variable chains of the immunoglobulins (Igs), B and T cell receptors (BCRs and TCRs) and major histocompatibility complexes (MHCs) were not included since these proteins are not coded by conventionally structured genes but by gene fragments. These gene fragments and their products are already exclusively collected and listed in IMGT, the international ImMunoGeneTics information system at the National Computer Centre of Higher Education [6] and the European Bioinformatics Institute [7]. ImmTree contains the genes and proteins that are required for processing these gene fragments. In the ImmTree database, Entrez Gene [8] identifiers were used to refer to genes. Protein sequences were downloaded from NCBI GenBank [9]. Ortholog sequences are from the Eukaryotic Gene Orthologs (EGO) [10], HomoloGene [11] and OrthoMCL [12] databases. HomoloGene contains groups of homologs for completely sequenced eukaryotic genomes, while EGO has (tentative) ortholog groups of the eukaryotic sequences in the TIGR sequence database. OrthoMCL contains sequences exclusively from 55 complete genomes and therefore the number of sequences from the different branches is limited. The releases used were EGO version 9.0, released 15 February 2005; HomoloGene build 50.1, released 25 July 2006; and OrthoMCL version 1.0, released 19 October 2005.
The nucleotide sequences of ortholog groups were taken from EGO and the protein sequences from HomoloGene and OrthoMCL. The sequences were aligned using ClustalW [13] with the default parameters. Phylogenetic trees were reconstructed for all three types of ortholog groups using the PAUP* program package [14] when the group contained at least three sequences. We thus created three trees for most of the ortholog groups, one for the data from each of the three independent databases. A simple neighbour-joining method was applied if the ortholog group contained only three taxa; otherwise, bootstrap analysis was applied with the parsimony method, heuristic tree search, and 1000 replications. The number of bootstrap replicates was reduced to 100 for OrthoMCL ortholog groups that contained more than 50 sequences. Similarly, the number of replicates was reduced even further, to 50, where the number of sequences exceeded 100. This was necessary due to computational time requirements, since some OrthoMCL groups contain numerous paralogs. In these cases, tree construction becomes very CPU-intensive without any further phylogenetic advantage.
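The selection rules above can be summarised as a small decision function. This is only an illustrative sketch: the function name and the returned dictionary are hypothetical, and it assumes that the reduction to 50 replicates, like the reduction to 100, applied only to OrthoMCL groups, which the text does not state explicitly.

```python
def choose_tree_settings(n_sequences, source):
    """Return illustrative tree-building settings for one ortholog group.

    n_sequences: number of sequences (taxa) in the group.
    source: "EGO", "HomoloGene" or "OrthoMCL".
    """
    if n_sequences < 3:
        # No tree was built for groups with fewer than three sequences.
        return None
    if n_sequences == 3:
        # Only three taxa: simple neighbour-joining, no bootstrap.
        return {"method": "neighbour-joining", "bootstrap_replicates": 0}

    replicates = 1000
    # Assumption: both reductions apply to OrthoMCL groups only.
    if source == "OrthoMCL":
        if n_sequences > 100:
            replicates = 50
        elif n_sequences > 50:
            replicates = 100

    return {
        "method": "parsimony",
        "search": "heuristic",
        "bootstrap_replicates": replicates,
    }
```

The thresholds are strict ("more than 50", "exceeded 100"), so a group of exactly 50 sequences keeps the full 1000 replicates.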
For a general overview of the ortholog groups, we generated a fourth tree. This tree represents protein sequences from all the species in any of the three datasets. Moreover, each species is represented by just one sequence, preventing the accumulation of identical sequences from multiple data sources. This way the large paralog groups from the OrthoMCL database are represented by just a single sequence.
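The idea of keeping one sequence per species can be sketched as follows. The text does not say how the representative was chosen when a species occurred in several datasets, so this sketch simply keeps the first record encountered; that rule, the function name and the record layout are all assumptions made for illustration.

```python
def pick_one_per_species(records):
    """Keep a single sequence per species from pooled ortholog records.

    records: iterable of (species, source_database, protein_sequence) tuples.
    Returns a dict mapping species -> (source_database, protein_sequence).
    Assumption: the first record seen for a species is kept.
    """
    representatives = {}
    for species, source, sequence in records:
        if species not in representatives:
            representatives[species] = (source, sequence)
    return representatives


# Hypothetical pooled records for one ortholog group.
pooled = [
    ("Homo sapiens", "HomoloGene", "MKTAYIAK"),
    ("Mus musculus", "HomoloGene", "MKTAYLAK"),
    ("Mus musculus", "OrthoMCL", "MKTAYLAK"),   # duplicate from a second source
    ("Mus musculus", "OrthoMCL", "MKSAYLGK"),   # paralog in the same species
    ("Danio rerio", "EGO", "MRTAYVAK"),
]
overview = pick_one_per_species(pooled)
```

With this rule the order of the input decides which database supplies the representative, so in practice the records would be sorted by whatever priority is preferred.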
The nucleotide sequences from the EGO database were translated to amino acids so that the representative protein sequences from the three databases could be aligned. The translation was done in all six frames, and all six translations were aligned with the human protein sequence using bl2seq from the BLAST package [15]. Only the translation with the longest identical stretch with the human ortholog was retained for further analysis. The protein sequences collected this way were aligned and phylogenetic trees were constructed as described above.
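The frame-selection step can be illustrated with a self-contained sketch. It does not call bl2seq: as a stand-in, each of the six translations is scored by the longest exact, ungapped stretch it shares with the human protein (a longest-common-substring search), which is a simplification of the alignment-based criterion described above. All function names are hypothetical, and the standard genetic code is assumed.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(nt):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translations(nt):
    """Return {(strand, offset): protein} for all six reading frames."""
    nt = nt.upper()
    reverse = nt.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", nt), ("-", reverse)):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def longest_identical_stretch(a, b):
    """Length of the longest exact ungapped stretch shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def best_frame(nt, human_protein):
    """Pick the frame whose translation shares the longest stretch."""
    frames = six_frame_translations(nt)
    key = max(frames, key=lambda f: longest_identical_stretch(frames[f], human_protein))
    return key, frames[key]
```

For a nucleotide sequence whose coding region starts one base into the forward strand, `best_frame` would be expected to return the frame `("+", 1)` together with its translation.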
Comparison of the human-mouse ortholog pairs
In 603 cases, orthologs were present in both the mouse and human genomes in the HomoloGene database. These pairs were further analysed in detail. The cDNA sequences of the human and mouse genes were translated to protein sequences and then aligned using the bl2seq program. The corresponding cDNA sequences were then aligned on the basis of the amino acid sequence alignment with proprietary Perl scripts, some of which utilize modules from the Bioperl Project [16]. The number of synonymous mutations per synonymous site (Ks or dS) and the number of non-synonymous mutations per non-synonymous site (Ka or dN) were estimated [17]. Z values and the Ka/Ks quotients describe the conservation of given genes since the human-mouse divergence.
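Aligning cDNA on the basis of a protein alignment amounts to threading codons onto the gapped amino acid sequences. The original scripts were written in Perl and are not shown here; the Python sketch below only illustrates the general idea. The function name is hypothetical, and it assumes that the cDNA passed in has already been trimmed so that its first codon corresponds to the first aligned residue. The Ka and Ks estimation itself [17] is not reproduced.

```python
def thread_cdna(aligned_protein, cdna, gap="-"):
    """Project a cDNA onto a gapped protein sequence, codon by codon.

    aligned_protein: one row of a protein alignment, gaps marked with `gap`.
    cdna: coding sequence whose first codon matches the first residue
          (assumption: already trimmed to the aligned region).
    Returns the codon-level aligned nucleotide string.
    """
    usable = len(cdna) - len(cdna) % 3
    codons = [cdna[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if n_residues > len(codons):
        raise ValueError("cDNA is too short for the aligned protein")

    codon_iter = iter(codons)
    aligned = []
    for ch in aligned_protein:
        # A gap in the protein becomes a three-base gap in the cDNA.
        aligned.append(gap * 3 if ch == gap else next(codon_iter))
    return "".join(aligned)


# Hypothetical human-mouse pair with one gap in the mouse sequence.
human_row = thread_cdna("MKTA", "ATGAAAACCGCC")
mouse_row = thread_cdna("MK-A", "ATGAAGGCT")
```

Because gaps are always inserted in multiples of three, the two rows keep codons in register, which is what codon-based Ka and Ks estimators require.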