Abstract

Computing servers have played a key role in the development and deployment of emerging compute-intensive applications in recent years. However, they must operate energy-efficiently while maximizing the performance and lifetime of their hottest components (i.e., cores and caches). Previous methods focused either on improving energy efficiency by adopting new hybrid-cache architectures that combine resistive random-access memory (RRAM) and static random-access memory (SRAM) at the hardware level, or on exploring trade-offs between the lifetime limitations and performance of multi-core processors under stable workload conditions. As a result, no work has so far proposed a co-optimization method for hybrid-cache-based server architectures in real-life dynamic scenarios that simultaneously accounts for scalability, performance, lifetime reliability, and energy efficiency. In this paper, we first formulate a reliability model for the hybrid-cache architecture to enable precise lifetime reliability management and energy efficiency optimization. We also account for the performance and energy overheads of cache-mode switching, and optimize the benefits of hybrid-cache usage for better energy efficiency and performance. We then propose a runtime Q-Learning-based reliability management and performance optimization approach for multi-core microprocessors with the hybrid-cache architecture, combined with a dynamic preemptive priority queue management method that improves overall task performance by aiming to meet task deadlines. Experimental results show that our proposed method achieves up to 44% average performance improvement (i.e., in task execution time), while keeping the whole-system design lifetime above 5 years, compared to the latest state-of-the-art energy efficiency optimization and reliability management methods for computing servers.
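To give a concrete sense of the kind of runtime controller the abstract describes, the sketch below shows a minimal tabular Q-Learning loop that selects a cache mode (SRAM or RRAM) and a voltage-frequency level at each decision epoch, together with a deadline-ordered preemptive priority queue for task dispatch. This is a hypothetical Python illustration and not the paper's implementation: the state discretization, the action space, the reward weights, and all names used here are assumptions made only for the example.

```python
# Hypothetical sketch (not the authors' implementation): tabular Q-Learning
# over (cache mode, V-f level) actions, plus a deadline-driven priority queue.
import heapq
import random
from collections import defaultdict

# Assumed action space: hybrid-cache mode x discrete voltage-frequency level.
ACTIONS = [(cache, vf) for cache in ("SRAM", "RRAM") for vf in (0, 1, 2)]


class QController:
    def __init__(self, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = defaultdict(float)      # Q[(state, action)] -> value
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, state):
        # Epsilon-greedy choice over (cache mode, V-f level) pairs.
        if random.random() < self.eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-Learning update.
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error


def reward(perf, energy, aging, w_perf=1.0, w_energy=0.5, w_aging=2.0):
    # Assumed reward shape: favour throughput, penalise energy and lifetime wear.
    return w_perf * perf - w_energy * energy - w_aging * aging


class DeadlineQueue:
    """Preemptive priority queue sketch: tasks with earlier deadlines are
    dispatched first, so a newly arrived tight-deadline task runs next."""

    def __init__(self):
        self._heap = []

    def push(self, deadline, task):
        heapq.heappush(self._heap, (deadline, task))

    def pop(self):
        return heapq.heappop(self._heap) if self._heap else None


if __name__ == "__main__":
    ctrl = QController()
    state = ("temp_mid", "load_high")            # assumed discretised state
    action = ctrl.act(state)
    ctrl.update(state, action,
                reward(perf=1.0, energy=0.4, aging=0.05),
                ("temp_high", "load_high"))
    queue = DeadlineQueue()
    queue.push(12.0, "task_A")
    queue.push(5.0, "task_B")
    print(action, queue.pop())                   # tighter-deadline task first
```

In the actual method, the reward signal would be driven by the paper's reliability model and by measured performance and energy, and the scheduler would also handle preemption of the running task; the sketch only combines assumed performance, energy, and aging terms to show the control loop's shape.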
