Deep Question Answering for protein annotation

Gobeill, Julien (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale) ; Gaudinat, Arnaud (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale) ; Pasche, Emilie (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale, SIBTex group) ; Vishnyakova, Dina (University and Hospitals of Geneva, Division of Medical Information Sciences, Geneva, Switzerland) ; Gaudet, Pascale (Calipho group, Swiss Institute of Bioinformatics) ; Bairoch, Amos (Calipho group, Swiss Institute of Bioinformatics) ; Ruch, Patrick (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale)

Biomedical professionals have access to a huge amount of literature, but when they use a search engine they often face too many documents to find the appropriate information in a reasonable time. In this context, question-answering (QA) engines are designed to display answers that are automatically extracted from the retrieved documents. Standard QA engines in the literature process a user question, retrieve relevant documents and finally extract possible answers from these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions that can be adequately answered only with Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a supervised GO classifier. GOCat exploits the GOA database to propose GO concepts that curators annotated for similar abstracts. This approach is called deep QA, as it adds an original classification step and exploits curated biological data to infer answers that are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly, usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier such as GOCat to massively improve both the quantity and the quality of the answers, with a +100% improvement for both recall and precision.
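The classification step described in the abstract can be illustrated with a minimal sketch. It is not GOCat's implementation, which this record does not detail: it assumes a plain bag-of-words cosine similarity and a k-nearest-neighbour vote, and the names `propose_go_terms`, `answer_question` and `CURATED`, as well as the toy abstracts and their GO annotations, are hypothetical. The point is only to show how GO concepts annotated for similar curated abstracts can be proposed as answers even when the retrieved text never mentions them.

```python
import math
from collections import Counter

# Toy stand-in for curated (abstract, GO annotations) pairs; purely illustrative.
CURATED = [
    ("kinase phosphorylates substrate protein atp", ["GO:0016301"]),
    ("transcription factor binds dna promoter", ["GO:0003677"]),
    ("protein kinase activity atp binding", ["GO:0004672", "GO:0005524"]),
]


def bag_of_words(text):
    """Lower-cased whitespace tokenisation into term counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propose_go_terms(abstract, curated, k=2):
    """Score GO terms by the summed similarity of the k most similar curated abstracts."""
    query = bag_of_words(abstract)
    scored = [(cosine(query, bag_of_words(text)), terms) for text, terms in curated]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    scores = Counter()
    for similarity, terms in neighbours:
        if similarity > 0:  # ignore neighbours that share no vocabulary
            for term in terms:
                scores[term] += similarity
    return scores


def answer_question(retrieved_abstracts, curated, k=2, top_n=3):
    """Aggregate proposed GO terms over all abstracts retrieved for one question."""
    total = Counter()
    for abstract in retrieved_abstracts:
        total.update(propose_go_terms(abstract, curated, k))
    return [term for term, _ in total.most_common(top_n)]


if __name__ == "__main__":
    retrieved = [
        "the enzyme phosphorylates its substrate using atp",
        "kinase activity requires atp",
    ]
    print(answer_question(retrieved, CURATED, k=1, top_n=2))
```

In this toy run, the second retrieved abstract never contains the word "binding", yet the GO term attached to its nearest curated neighbour is still proposed. That is the sense in which the abstract speaks of inferring answers that are not explicitly mentioned in the retrieved documents.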


Article type:
scientific
Faculty:
Economics and Services
School:
HEG GE Haute école de gestion de Genève
Institute:
CRAG - Centre de Recherche Appliquée en Gestion
Classification:
Computer science
Information science
Date:
2015
Pages:
9 p.
Published in:
Database : the journal of biological databases and curation
Numbering (vol. no.):
2015
DOI:
ISSN:
1758-0463
Record created 2015-11-30, last modified 2018-08-31
