PubMed related articles: a probabilistic topic-based model for content similarity

BMC Bioinformatics. 2007 Oct 30;8:423. doi: 10.1186/1471-2105-8-423.

Abstract

Background: We present a probabilistic topic-based model for content similarity called pmra that underlies the related article search feature in PubMed. Whether or not a document is about a particular topic is computed from term frequencies, modeled as Poisson distributions. Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance; rather, our focus is "relatedness", the probability that a user would want to examine a particular document given known interest in another. We also describe a novel technique for estimating parameters that does not require human relevance judgments; instead, the process is based on the existence of MeSH in MEDLINE.
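The idea of deciding "aboutness" from Poisson-distributed term frequencies can be illustrated with a small sketch. This is not the pmra formulation itself, which the abstract does not spell out. It assumes a generic two-Poisson mixture in which a term occurs at one rate in documents about the topic and at a lower rate otherwise, and it applies Bayes' rule. The function names, rates, and prior below are hypothetical.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of observing k occurrences under a Poisson with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def prob_about_topic(k: int, rate_about: float, rate_not_about: float,
                     prior_about: float) -> float:
    """Posterior probability that a document is about a topic, given that the
    topic's term occurs k times (two-Poisson mixture; all parameters are
    illustrative assumptions, not values from the paper)."""
    likelihood_about = prior_about * poisson_pmf(k, rate_about)
    likelihood_not = (1.0 - prior_about) * poisson_pmf(k, rate_not_about)
    return likelihood_about / (likelihood_about + likelihood_not)


if __name__ == "__main__":
    # Hypothetical parameters: the term is rarer in off-topic documents.
    for count in range(6):
        print(count, round(prob_about_topic(count, 3.0, 0.5, 0.1), 4))
```

With a higher rate for on-topic documents, the posterior rises as the term count grows, which is the behavior such a model relies on.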

Results: The pmra retrieval model was compared against bm25, a competitive probabilistic model that shares theoretical similarities. Experiments using the test collection from the TREC 2005 genomics track show a small but statistically significant improvement of pmra over bm25 in terms of precision.
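For readers unfamiliar with the baseline, a minimal sketch of a common BM25 variant follows. It is not the configuration used in the experiments: the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF form, and the whitespace-free token lists are all assumptions made for illustration, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Assumes `docs` is a non-empty list of non-empty token lists. Uses a common
    BM25 variant with a smoothed, non-negative IDF; k1 and b are conventional
    defaults, not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["gene", "expression", "cancer"],
        ["protein", "folding"],
        ["gene", "gene", "therapy"],
    ]
    print(bm25_scores(["gene"], corpus))
```

In a related-article setting, one plausible use is to pass the tokens of a seed document as `query`, so that every other document is ranked by its similarity to the seed.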

Conclusion: Our experiments suggest that the pmra model provides an effective ranking algorithm for related article search.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, N.I.H., Intramural

MeSH terms

  • Algorithms*
  • Artificial Intelligence
  • Bibliometrics*
  • Data Interpretation, Statistical*
  • Models, Statistical*
  • Natural Language Processing*
  • Periodicals as Topic / statistics & numerical data*
  • PubMed*
  • Vocabulary, Controlled