Identifying Gene and Protein Names from Biological Texts

W. Xuan; S. J. Watson; H. Akil; F. Meng
IEEE Computer Society Bioinformatics Conference Proceedings. 2003:639-644.


Extracting and identifying gene and protein names from the literature is a critical step in mining functional information about genes and proteins. Although extensive effort has been devoted to this task, most prior work has aimed at extracting gene/protein names per se, paying little attention to associating the extracted names with existing gene and protein database entries. We developed a simple and efficient method to identify gene and protein names in the literature using a combination of heuristic and statistical strategies. Our approach maps the extracted names to individual LocusLink entries, thus enabling the seamless integration of literature information with existing gene/protein databases. Evaluation on a test corpus shows that our method achieves both high recall and high precision. Given this performance, it can serve as a building block for large biomedical literature mining systems.