Abstract

The rise of big data and artificial intelligence techniques such as deep learning has led to an exponential increase in stored data in various fields, including medical imaging, genetics and financial trading. Sharing these growing amounts of data for research is challenging, as privacy risks grow with the size of the data. Physically moving very large datasets to researchers is inconvenient, as neither downloading them nor shipping physical hard disks is practical. Research on sensitive data is often impossible because sharing the data is not legal. The popularity of container-based technologies such as Docker has revolutionized the way applications are deployed, owing to their self-sufficient, lightweight and portable nature. In this paper, we propose a novel distributed platform that uses containers for simple execution and evaluation of research applications on the data owner’s infrastructure, bringing the algorithms to the data. This approach avoids the cumbersome transfer of large datasets and can help circumvent problems linked to non-shareable data by providing a sandboxed execution environment with read-only access to the data. At no point do the data leave the data owner’s site; researchers receive only their evaluation results, not the data themselves. The presented proof of concept confirms the feasibility of a distributed container-based evaluation platform for large and/or sensitive data. The platform has several advantages, including execution of code instead of submission of result files and availability of otherwise inaccessible data. The container architecture allows for minimal computational overhead, no software dependency management on the infrastructure, a distributed runtime environment and isolation of processes from the underlying host system.
A version that addresses the identified architectural and security-related challenges could be deployed in a production setting, allowing researchers to gain insights from previously inaccessible data. One goal is to target hospitals, which have increasingly strong local storage and computation infrastructure as required for artificial-intelligence-based decision support (for example in genetics and imaging).
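
As an illustration of the sandboxed, read-only execution described above, the following minimal sketch builds a `docker run` invocation that a data owner could use to evaluate a submitted container. It is not the platform's actual implementation: the function name, the image name and the `/data` and `/output` mount points are hypothetical, and the choice of flags (no network, read-only root filesystem, read-only data mount) is an assumption about how such isolation might be configured.

```python
import shlex
from pathlib import Path


def build_sandbox_command(image, data_dir, output_dir):
    """Return a `docker run` argument list for one sandboxed evaluation run.

    Hypothetical helper: the mount points and flag selection are assumptions,
    not taken from the platform described in the abstract.
    """
    data = Path(data_dir).resolve()
    output = Path(output_dir).resolve()
    return [
        "docker", "run",
        "--rm",                     # remove the container once it exits
        "--network", "none",        # no network, so data cannot be sent off-site
        "--read-only",              # container root filesystem is read-only
        "-v", f"{data}:/data:ro",   # dataset is mounted read-only
        "-v", f"{output}:/output",  # only writable location: evaluation results
        image,
    ]


if __name__ == "__main__":
    # Hypothetical image and host paths, printed rather than executed.
    cmd = build_sandbox_command(
        "researcher/algorithm:latest", "/srv/dataset", "/srv/results"
    )
    print(shlex.join(cmd))
```

The command is only printed here; a data owner would pass the list to `subprocess.run` on a host with Docker installed and return the contents of the output directory to the researcher after review.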
