Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features

Mindy K. Ross; Ko-Wei Lin; Karen Truong; Abhishek Kumar; Mike Conway

JOURNAL

Biomedical Informatics Insights

Biomedical Informatics Insights 2013:6 35-45

Original Research

Published on 22 Jul 2013

DOI: 10.4137/BII.S11987

Abstract

The Database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contributions to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques for improving study retrieval in dbGaP. We trained standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) on dbGaP study text, incorporating n-gram features and study metadata, to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify the features that contributed most to classification performance, and we experimented with dbGaP-associated PubMed papers as a proxy for topicality. Classifier performance compared favorably with keyword-based search results. We conclude that text categorization is a useful complement to document retrieval techniques in dbGaP.
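
As a rough illustration of the χ² feature selection step mentioned in the abstract, the following minimal sketch scores word n-grams against a binary "heart, lung, and blood" label using the standard 2×2 contingency-table statistic. It is not the authors' implementation: the function names, the whitespace tokenizer, and the four toy study descriptions are all hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter


def ngrams(text, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) in a lowercased text."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def chi2_scores(docs, labels, max_n=2):
    """Score each n-gram with the 2x2 chi-squared statistic against a binary label."""
    n_docs = len(docs)
    n_pos = sum(labels)
    n_neg = n_docs - n_pos
    pos_df, neg_df = Counter(), Counter()  # document frequencies per class
    for text, label in zip(docs, labels):
        (pos_df if label else neg_df).update(ngrams(text, max_n))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # positive documents containing the term
        b = neg_df[term]   # negative documents containing the term
        c = n_pos - a      # positive documents lacking the term
        d = n_neg - b      # negative documents lacking the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term present in every document carries no signal; score it 0.
        scores[term] = n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy data: 1 = heart, lung, and blood study; 0 = other.
docs = [
    "cardiac arrest cohort study",
    "lung function genome study",
    "breast cancer genome study",
    "colon cancer cohort study",
]
labels = [1, 1, 0, 0]

scores = chi2_scores(docs, labels)
top_features = sorted(scores, key=scores.get, reverse=True)[:5]
```

On this toy data, terms that appear equally often in both classes (such as "study" or "cohort") score zero, while "cancer", which appears only in the negative class, ranks highest. The highest-scoring n-grams would then be the ones passed on to a classifier.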
