Integrating Gene Annotation into Genome-Wide Expression Data Analysis

Jun Li; Fan Meng; Yee-Man Chan; Jeremy Schmutz; P.V. Choudary; S.J. Evans; M.P. Vawter; H. Tomita; J. Meador-Woodruff; E.G. Jones; W.E. Bunney; S.J. Watson; H. Akil; R.M. Myers
Pacific Symposium on Biocomputing. 2003.


Most grossly observable phenotypes are likely to be associated with systemic and logical changes in gene expression rather than with sporadic changes in a few unrelated transcripts. This is likely true even when the original genetic cause of the phenotype is a single-gene mutation or misregulation. To discover coherent functional features in experimentally derived gene lists efficiently, we need tools that go beyond the manual compilation of unstructured gene annotation and make biological interpretation consistent and routine. To provide a quick, first-pass functional profile of any list of genes, we counted the occurrences of individual Gene Ontology (GO) terms and KEGG Metabolic Pathways associated with the list. For each annotation term, we calculated a nominal P value based on the expected distribution (i.e., hypergeometric) of term counts under random sampling. Because we evaluate a large number of annotation terms simultaneously, and many of them are correlated (i.e., they cover overlapping sets of genes), we assessed the significance of the nominal P values by empirically establishing a null distribution for P values through repeated random re-sampling of genes and pooling of the recalculated P values. We demonstrated the utility of these tools on two unpublished microarray datasets, one comparing human and chimpanzee lymphoblastoid cell lines, the other comparing human postmortem brain tissues of normal pH with those of low pH. We ranked all genes by t scores and analyzed the genes in the top or bottom 4%, 8%, or 15% of the list. We identified several GO terms and KEGG Pathways whose occurrences are significantly increased or reduced in the top or bottom of the ranked gene lists. We found that the effects not only usually persist among the top or bottom 15% of genes, but are also often stronger among the 15% than among the 4%. This result reflected extensive changes in expression, involving many genes, in the relevant pathways or functions, and underscored the importance of evaluating pathway-wide evidence to enhance the confidence and biological meaning of the conclusions.
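
The nominal P value and the pooled re-sampling null described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout of the annotation (`term_to_genes`), and the default number of re-samples are assumptions made for illustration, and only the over-representation (upper) tail is shown. A term whose count is reduced would use the lower tail in the same way.

```python
import bisect
import math
import random


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation term."""
    total = math.comb(N, n)
    hi = min(K, n)
    hits = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(max(k, 0), hi + 1)
    )
    return hits / total


def nominal_p_values(gene_list, term_to_genes, universe):
    """Nominal over-representation P value for every annotation term."""
    selected = set(gene_list)
    N, n = len(universe), len(selected)
    return {
        term: hypergeom_upper_tail(len(selected & genes), N, len(genes), n)
        for term, genes in term_to_genes.items()
    }


def pooled_null_p_values(universe, term_to_genes, list_size,
                         n_resamples=200, seed=0):
    """Draw random gene lists of the same size, recompute all term P values,
    and pool them into one sorted null distribution."""
    rng = random.Random(seed)
    genes = sorted(universe)
    pooled = []
    for _ in range(n_resamples):
        sample = rng.sample(genes, list_size)
        pooled.extend(nominal_p_values(sample, term_to_genes, universe).values())
    pooled.sort()
    return pooled


def empirical_significance(p_observed, pooled):
    """Fraction of pooled null P values at or below the observed one."""
    return bisect.bisect_right(pooled, p_observed) / len(pooled)


if __name__ == "__main__":
    # Toy data (hypothetical): 100 genes and two overlapping annotation terms.
    universe = {f"g{i}" for i in range(100)}
    term_to_genes = {
        "term_A": {f"g{i}" for i in range(0, 20)},
        "term_B": {f"g{i}" for i in range(10, 40)},
    }
    top_list = [f"g{i}" for i in range(0, 15)]  # stand-in for a top-ranked list

    observed = nominal_p_values(top_list, term_to_genes, universe)
    null = pooled_null_p_values(universe, term_to_genes, len(top_list))
    for term, p in observed.items():
        print(term, p, empirical_significance(p, null))
```

Because every term is re-evaluated on the same random gene list, the pooled null retains the overlap between terms, which is the reason given above for calibrating the nominal P values empirically.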