000002288 001__ 2288
000002288 005__ 20190205105841.0
000002288 022__ $$a1758-0463
000002288 0247_ $$2DOI$$a10.1093/database/bax083
000002288 037__ $$aARTICLE
000002288 041__ $$aeng
000002288 245__ $$aImproving average ranking precision in user searches for biomedical research datasets
000002288 260__ $$c2017
000002288 269__ $$a2017-11
000002288 300__ $$a18 p.
000002288 506__ $$avisible
000002288 520__ $$aThe availability of research datasets is a keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries. Our system provides competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP among the participants, 22.3% higher than the median infAP of the participants’ best submissions. Overall, it ranks in the top 2 when an aggregated metric using the best official measures per participant is considered. The query expansion method had a positive impact on the system’s performance, improving on our baseline by up to 5.0% and 3.4% for the infAP and infNDCG metrics, respectively. Our similarity measure algorithm seems to be robust, in particular compared to the Divergence From Randomness framework, showing smaller performance variations under different training conditions. Finally, the result categorization did not have a significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results.$$9eng
000002288 546__ $$aEnglish
000002288 540__ $$acorrect
000002288 592__ $$aHEG - Genève
000002288 592__ $$bCRAG - Centre de Recherche Appliquée en Gestion
000002288 592__ $$cEconomie et Services
000002288 655__ $$ascientifique
000002288 65017 $$aSciences de l'information
000002288 700__ $$uSIB Swiss Institute of Bioinformatics, Geneva ; Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale$$aTeodoro, Douglas
000002288 700__ $$uSIB Swiss Institute of Bioinformatics, Geneva ; Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale$$aMottin, Luc
000002288 700__ $$uSIB Swiss Institute of Bioinformatics, Geneva ; Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale$$aGobeill, Julien
000002288 700__ $$uSIB Swiss Institute of Bioinformatics, Geneva ; Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale$$aGaudinat, Arnaud
000002288 700__ $$uNovartis Institutes for BioMedical Research, Basel$$aVachon, Thérèse
000002288 700__ $$aRuch, Patrick$$uSIB Swiss Institute of Bioinformatics, Geneva ; Haute école de gestion de Genève, HES-SO // Haute Ecole Spécialisée de Suisse Occidentale
000002288 773__ $$g2017, vol. 2017, pp. 1-18$$tDatabase
000002288 8564_ $$uhttps://hesso.tind.io/record/2288/files/Teodoro_2017_improving_average_ranking.pdf$$s1062331
000002288 8564_ $$uhttps://hesso.tind.io/record/2288/files/Teodoro_2017_improving_average_ranking.pdf?subformat=pdfa$$s2380811$$xpdfa
000002288 906__ $$aGREEN
000002288 909CO $$pHEG_GE_ARTICLES_SCIENTIFIQUES$$pGLOBAL_SET$$ooai:hesso.tind.io:2288
000002288 950__ $$aI2
000002288 980__ $$ascientifique