UPCLASS: a deep learning-based classifier for UniProtKB entry publications

Teodoro, Douglas (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale ; Text Mining Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland) ; Knafou, Julien (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale ; Text Mining Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland) ; Naderi, Nona (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale ; Text Mining Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland) ; Pasche, Emilie (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale ; Text Mining Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland) ; Gobeill, Julien (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale ; Text Mining Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland) ; Arighi, Cecilia N. (University of Delaware, Newark, USA) ; Ruch, Patrick (Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale ; Text Mining Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland)

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession.
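The abstract reports both a micro and a macro F1-score. The sketch below illustrates how the two averages can differ for a multi-label task like this one. It is a minimal illustration and not the authors' evaluation code: the function names, the three category labels (taken from the examples in the abstract) and the toy gold/predicted sets are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: Sequence[Set[str]],
    pred: Sequence[Set[str]],
    labels: List[str],
) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for multi-label predictions."""
    # Per-label counts of true positives, false positives, false negatives.
    counts: Dict[str, List[int]] = {c: [0, 0, 0] for c in labels}
    for g, p in zip(gold, pred):
        for c in labels:
            if c in p and c in g:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1

    # Micro: pool the counts over all labels, then compute one F1.
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    micro = f1(tp, fp, fn)

    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*v) for v in counts.values()) / len(labels)
    return micro, macro


# Hypothetical toy data: one category set per (publication, accession) pair.
labels = ["function", "interaction", "expression"]
gold = [{"function"}, {"function", "interaction"}, {"expression"}, {"function"}]
pred = [{"function"}, {"function"}, {"function"}, {"function", "interaction"}]

micro, macro = micro_macro_f1(gold, pred, labels)
print(f"micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")
```

In this toy data the frequent category is predicted well and the rare ones are missed, so the macro average, which weights every category equally, falls below the micro average. A gap in the same direction appears in the reported scores (0.72 micro versus 0.62 macro).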


Article Type:
Scientific
Faculty:
Economie et Services
School:
HEG - Genève
Institute:
CRAG - Centre de Recherche Appliquée en Gestion
Subject(s):
Information science
Date:
2020-04
Pagination:
13 p.
Published in:
Database
Numeration (vol. no.):
2020, vol. 2020, baaa026, pp. 1-13
DOI:
ISSN:
1758-0463

 Record created 2020-07-20, last modified 2020-07-24
