PEPR Cloud Internal Day
Detailed programme
Tuesday 3 December 2024
Centre Inria de l’Université Grenoble Alpes
655 Av. de l’Europe, 38330 Montbonnot-Saint-Martin
How to get to the Centre Inria de l’Université Grenoble Alpes
The objective is to identify potential complementarities between the ongoing activities, to highlight visible work, and to use these meetings to build group momentum.
8h45 – 9h15
Welcome coffee
9h15 – 9h30
Opening remarks
Frédéric Desprez, Director of the Centre Inria de l’Université Grenoble Alpes
Adrien Lebre, Inria / Jean-Noël Patillon, CEA List
9h30 – 10h00
Live Migration of Virtual Machines on Heterogeneous Processors
Alain Tchana, DIVA
Abstract: Live migration of virtual machines is widely used in clouds for various reasons, such as server upgrades or consolidation to reduce energy consumption. This migration, however, faces several challenges when taking place between servers with heterogeneous processors. It is important to properly characterize this heterogeneity and to evaluate its impact on virtual machine migration. Our work aims at providing an extensive characterization of migration issues related to processor heterogeneity and proposes an improvement to the virtual machine migration algorithm in the context of the Xen hypervisor.
10h00 – 10h30
Towards Efficient Learning on the Computing Continuum: Advancing Dynamic Adaptation of Federated Learning
Alexandru Costan and Thomas Badts, STEEL
Abstract: A common yet impractical assumption in existing Federated Learning (FL) approaches is that the deployment environment is static, which is rarely true in heterogeneous and highly volatile environments like the Edge-Cloud Continuum, where FL is typically executed. While most current FL approaches process data in an online fashion, and are therefore adaptive by nature, they only support adaptation at the ML/DL level (e.g., through continual learning to tackle data and concept drift), putting aside the effects of system variance and the real-world complexities and dynamics of actual deployments. In this talk we take a first step towards addressing these challenges. We devise a set of design principles for FL systems that can smartly adjust their strategies for aggregation, communication, privacy, and security in response to changing system conditions. To illustrate the benefits envisioned by these strategies, we present the results of a set of initial experiments on large-scale testbeds, which are reproducible by means of the E2Clab framework. These experiments show how existing FL systems are strongly affected by changes in their operational environment. Based on these insights, we propose a set of takeaways for the FL community, towards further research into FL systems that are not only accurate and scalable but also able to dynamically adapt to real-world deployment unpredictability.
10h30 – 11h00
Scheduling Machine Learning Compressible Inference Tasks with Limited Energy Budget
Frédéric Giroire, CareCloud
Abstract: With the advent and growing usage of Machine Learning as a Service (MLaaS), cloud and network systems now offer the possibility to deploy ML tasks on heterogeneous clusters. Network and cloud operators then have to schedule these tasks, determining both when and on which devices to execute them. In parallel, several solutions, such as neural network compression, have been proposed to build small models that can run on limited hardware. These solutions make it possible to choose the model size at inference time for any targeted processing time, without having to re-train the network.
To implement such solutions, the first task is to study how much ML models can be sparsified. We made two contributions on this matter. First, we carried out experiments to extend the Once-For-All solution to a full range of processing times and derived the full trade-off between accuracy and processing time [CCGrid 2024]. Second, in [NeurIPS 2024] we provided new results on the Strong Lottery Ticket Hypothesis (SLTH), which states that a random neural network N contains subnetworks capable of accurately approximating any given neural network that is sufficiently smaller than N, without any training. We provide the first proof of the SLTH in classical settings, such as dense and equivariant networks, with guarantees on the sparsity of the subnetworks.
The second task is to decide how to schedule ML models. We considered the Deadline Scheduling with Compressible Tasks (DSCT) problem [CCGrid 2024], a novel scheduling problem with task deadlines where the tasks can be compressed. We also considered a variant, the Deadline Scheduling with Compressible Tasks-Energy Aware (DSCT-EA) problem [ICPP 2024], which addresses the scheduling of compressible machine learning tasks on several machines, with different speeds and energy efficiencies, under an energy budget constraint. For both problems, we propose approximation algorithms with proven guarantees and validate their efficiency with extensive simulations on deep learning classification jobs, achieving near-optimal results. Experimental results show that our approach saves up to 70% of the energy budget of image classification tasks while losing only 2% of accuracy compared to not using compression.
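The core DSCT idea can be illustrated with a toy sketch (all names, and the single-machine earliest-deadline-first strategy, are illustrative assumptions, not the approximation algorithms of the papers): run tasks in deadline order and compress each one only as much as needed to finish on time.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    full_time: float   # processing time of the uncompressed model
    min_time: float    # processing time at maximum compression
    deadline: float

def schedule_compressible(tasks):
    """Toy EDF-style scheduler on one machine: tasks run in deadline
    order and are compressed (shortened) just enough to meet their
    deadlines.  Returns a list of (name, start, duration), or None if
    even maximal compression cannot satisfy some deadline."""
    plan, t = [], 0.0
    for task in sorted(tasks, key=lambda x: x.deadline):
        slack = task.deadline - t
        if slack < task.min_time:
            return None  # infeasible even when fully compressed
        duration = min(task.full_time, slack)  # compress only if needed
        plan.append((task.name, t, duration))
        t += duration
    return plan
```

In this sketch, the accuracy loss of a task grows as its duration shrinks towards `min_time`, which is the trade-off the approximation algorithms optimize.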
11h00 – 11h30
Coffee break
11h30 – 12h00
Anomaly Detection in Delay-Sensitive Applications (Cloud Gaming)
Joel Roman KY, SPIREC
Abstract: Detecting abnormal network events is an important activity for Internet Service Providers, particularly when running critical applications (e.g., ultra-low-latency applications in mobile wireless networks). Abnormal events can stress the infrastructure and lead to severe degradation of the user experience. Machine Learning (ML) models have demonstrated their relevance in many tasks, including Anomaly Detection (AD). While promising remarkable performance compared to manual or threshold-based detection, applying ML-based AD methods is challenging for operators due to the proliferation of ML models and the lack of a well-established methodology and metrics to evaluate them and select the most appropriate one. We present a comprehensive evaluation of eight unsupervised ML models, selected from different classes of ML algorithms, applied to AD in the context of cloud gaming applications. We collect cloud gaming Key Performance Indicator (KPI) time-series datasets under real-world network conditions, evaluate and compare the selected ML models using the same methodology, and assess their robustness to data contamination, their efficiency, and their computational complexity.
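For readers unfamiliar with the threshold-based detection that ML models are compared against, here is a minimal illustrative baseline (not one of the eight evaluated models): flag KPI samples whose z-score exceeds a fixed threshold.

```python
import statistics

def zscore_anomalies(kpi, threshold=3.0):
    """Simple threshold-based anomaly detector: flag the indices of
    KPI samples lying more than `threshold` standard deviations from
    the series mean."""
    mean = statistics.fmean(kpi)
    std = statistics.pstdev(kpi)
    if std == 0:
        return []  # constant series: nothing stands out
    return [i for i, v in enumerate(kpi) if abs(v - mean) / std > threshold]
```

Such baselines are cheap but brittle (a single global threshold, sensitivity to contamination of the training window), which is precisely what motivates the evaluation of unsupervised ML alternatives.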
12h00 – 12h30
Juunansei: Para-Virtualized Guest Parallel Application Runtimes for Oversubscribed Hosts
Himadri Chhaya-Shailesh, DIVA
Abstract: It is common for modern applications to achieve parallelization using parallel application runtimes, in order to exploit the parallel compute capacity of multi-core hardware. The degree of parallelization for such applications is chosen based on the number of CPUs available on the hardware. While this is satisfactory for applications running on bare metal, using the same strategy for applications running inside a guest virtual machine is problematic, because the vCPUs of the guest can get preempted on the host at any time, especially when the host is oversubscribed. We show that the current strategy leads to suboptimal performance for guest applications, especially when the host is experiencing overload, a common side effect of CPU oversubscription. We argue that parallel application runtimes should incorporate para-virtualized task scheduling information in order to optimize guest application performance on oversubscribed hosts. We propose Juunansei, a technique that dynamically adapts the degree of parallelization in the guest depending on the runtime conditions on the host, using para-virtualized task scheduling information. We implement Juunansei for the popular parallel application runtime OpenMP and show that it can significantly improve the performance of the NAS parallel benchmarks.
12h30 – 13h00
Scheduling with lightweight predictions in power-constrained HPC platforms
Igor Fontana de Nardin, CareCloud
Abstract: With the increasing demand for computing resources and the struggle to provide the necessary energy, power-aware resource management is becoming a major issue for the high-performance computing (HPC) community. Including reliable energy management in a supercomputer’s resource and job management system (RJMS) is not an easy task. The energy consumption of jobs is rarely known in advance, and the workload of every machine is unique and different from the others.
We argue that the first step towards properly managing power is to deeply understand the power consumption of the workload, which involves predicting the workload power consumption and exploiting it through smart power-aware scheduling algorithms. Crucial questions are (i) how sophisticated a prediction method needs to be to provide accurate workload power predictions, and (ii) to what extent an accurate workload power prediction translates into efficient power management.
In this work, we propose a method to predict and exploit the power consumption of HPC workloads, with the objective of reducing a supercomputer’s power consumption while maintaining the management (scheduling) performance of the RJMS. Our method exploits workload submission logs together with power monitoring data, and relies on a mix of lightweight power prediction methods and a heuristic inspired by classical EASY Backfilling. We then model the power-capping scheduling problem as a knapsack problem and solve it with a greedy algorithm. This algorithm improves Quality of Service and avoids starvation while keeping the solution lightweight.
We base this study on logs of Marconi 100, a 980-node supercomputer. We show through simulation that a lightweight history-based prediction method can provide power predictions accurate enough to improve the energy management of a large-scale supercomputer compared to energy-unaware scheduling algorithms. These improvements have no significant negative impact on performance.
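The greedy knapsack pass can be sketched as follows (the job values and the value-per-watt ranking are illustrative assumptions, not the exact heuristic of the paper): under a power cap, admit jobs in decreasing order of value per predicted watt, where the value could encode waiting time so that long-waiting jobs are favoured and starvation is avoided.

```python
def greedy_power_cap(jobs, power_cap):
    """Toy greedy-knapsack selection: `jobs` maps a job id to a
    (predicted_power_watts, value) pair.  Jobs are ranked by value
    per watt and admitted while the total predicted power stays
    under the cap.  Returns (selected ids, total power used)."""
    chosen, used = [], 0.0
    ranked = sorted(jobs.items(),
                    key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    for job_id, (power, _value) in ranked:
        if used + power <= power_cap:
            chosen.append(job_id)
            used += power
    return chosen, used
```

The real system would re-run such a pass at each scheduling event, feeding it the lightweight history-based power predictions described above.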
13h00 – 14h15
Lunch break
14h15 – 14h45
Modeling to improve hardware validation: formal models and experimentation
Radu Mateescu and Zachary Assoumani, Archi-CESAM
Abstract: SoC architectures are complex and notoriously hard to verify. Current industrial practice is still mainly based on testing, which is made increasingly rigorous by developing SoC models oriented towards testing. In this talk, we will present our ongoing work on improving both test generation and formal modeling.
Concerning test generation, we present recent work based on PSS (Portable Test and Stimulus Standard), defined by the Accellera consortium to facilitate the generation of system-level tests. By formally expressing the behavior of a PSS model as a composition of communicating labeled transition systems, we propose to improve the PSS methodology in two ways: on the one hand, by making it possible to formally verify temporal logic properties of the model using the CADP toolbox, thereby increasing confidence in the model and the generated tests, and, on the other hand, by improving the coverage of the generated tests using conformance testing techniques.
Concerning formal modeling, we present ongoing work on building a formal model of the HPDCache, a high-performance L1 data cache designed for RISC-V processors. The HPDCache is highly reconfigurable, equipped with out-of-order and replay execution features, and has interfaces compatible with AMBA AXI 5 and RISC-V. We present a formal model of the HPDCache architecture in the LNT language, considering several configurations depending on the activated HPDCache functionalities. This formal model will serve as a basis for checking correctness properties, generating conformance tests, and estimating the performance of the cache.
14h45 – 15h15
Observability of cloud-native 5GC network functions
Nadjib AIT-SAADI, SPIREC
Abstract: In this talk, the concept of observability will be introduced, covering all its critical stages: (i) telemetry data collection, (ii) feature engineering, (iii) fault detection, (iv) root cause analysis, and (v) automated mitigation. Each stage will be explored to provide a comprehensive understanding of how observability enables proactive system management and troubleshooting. Following this, we will delve into a specific use case within the 5G core network, examining how observability can be effectively applied within this 3GPP architecture. This example will highlight the unique challenges and opportunities that observability brings to complex, high-performance networks. Finally, the talk will discuss the SPIREC project as part of the PEPR Cloud initiative, presenting its objectives and how it aims to advance observability practices. The role of SPIREC within PEPR Cloud will be highlighted, along with a discussion of its expected contributions to the field.
15h15 – 15h45
Designing Moving Target Defense (MTD) to enhance the trust in the cloud
Françoise Sailhan and Pierre Charreaux, TrustinCloudS
Abstract: Ensuring trust in a cloud environment is a critical factor in daily cloud operation. In this regard, Moving Target Defense (MTD) is an efficient technique that modifies the cloud state (e.g., by changing ports, virtualised services/OS, etc.) with the aim of increasing the attacker’s uncertainty and reducing the attack surface. While several MTD strategies can, and will, be applied as part of the TrustinCloudS project, we herein focus on the solution developed so far, which shuffles IP addresses and ports so as to disrupt the attacker’s exploration phase, increase the probing cost, and render his or her reconnaissance obsolete. In particular, we introduce a model formalising attacker-defender interactions as a weakly coupled Markov decision process, which we solve numerically to identify the optimal strategy. We then discuss our forthcoming work as part of the TrustinCloudS project, opening up avenues of research that could lead to collaborations.
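The shuffling mechanism can be sketched as follows (function names, the port range, and the uniform random draw are hypothetical illustrations, not the optimal strategy computed from the Markov decision process): periodically redraw a random service-to-(IP, port) mapping, so that endpoints discovered during the attacker's probing phase quickly become stale.

```python
import random

def shuffle_mapping(services, ip_pool, port_range=(20000, 60000), seed=None):
    """Assign each service a fresh random (IP, port) pair, so that
    addresses probed by an attacker go stale at the next shuffle.
    Ports are drawn without replacement to keep the mapping
    collision-free."""
    rng = random.Random(seed)
    ports = rng.sample(range(*port_range), k=len(services))
    return {svc: (rng.choice(ip_pool), port)
            for svc, port in zip(services, ports)}
```

In the actual defense, *when* and *how often* to redraw this mapping is exactly what the attacker-defender Markov decision process optimizes, balancing disruption of the attacker against the reconfiguration cost for legitimate clients.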
15h45 – 16h15
Advances in confidential cloud storage
Sonia Ben Mokhtar, STEEL
Abstract: Key-Value Stores (KVSs), commonly used for storing sensitive data, face significant security challenges when deployed in untrusted cloud environments. These environments are susceptible to various types of attacks, exploiting for instance a compromised OS. To protect sensitive data from such attacks, distributed KVSs relying on Trusted Execution Environments (TEEs) have been proposed. However, these solutions are still vulnerable to side-channel attacks that may compromise any node and leak all its data at once. In the context of the STEEL project, we are investigating a practical distributed in-memory KVS that integrates TEEs (Intel SGX) and Shamir Secret Sharing (SS) to provide security against highly privileged spyware while tolerating side-channel attacks on a fraction of the storage nodes.
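As a reminder of the Shamir Secret Sharing building block, here is a toy split/reconstruct over a prime field (the field size and API are illustrative, and the TEE integration is out of scope): a value split into n shares can be recovered from any k of them, so leaking fewer than k storage nodes reveals nothing.

```python
import random

P = 2**61 - 1  # a Mersenne prime defining the finite field GF(P)

def split(secret, n, k):
    """Split `secret` into n Shamir shares; any k shares suffice to
    reconstruct it.  The secret is the constant term of a random
    degree-(k-1) polynomial, and each share is a point on it."""
    coeffs = [secret % P] + [random.randrange(P) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over GF(P) recovers the
    constant term, i.e. the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret
```

Fewer than k points determine nothing about the constant term, which is what lets the KVS tolerate side-channel compromise of a fraction of its nodes.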
16h15 – 16h45
Coffee break
16h45 – 17h15
Engineering Trust and Acceptance using a Human-Centered Perspective on Security Policies and Security Mechanisms
Philippe Palanque and Nathan Mansoro, TrustinCloudS
Abstract: Designing for trust and acceptance by the end users of security mechanisms is key to ensuring that security policies will be adopted and followed. Cybersecurity training is one of the most prominent countermeasures against cybersecurity threats, whose types and occurrences are reported to be increasing. Several approaches to developing cybersecurity training have been proposed, but a careful analysis of these approaches highlights limitations in the identification of the required knowledge and skills, in the description of users’ tasks (the job they have to perform), and in the adaptation of the training to diverse user groups. This presentation proposes a systematic process to tune cybersecurity training for diverse user groups, and in particular to support the development of cybersecurity training programs for different learning groups (built from the analysis of the diverse user groups). We illustrate this process on the concrete case of phishing attacks. We will also show that training and participatory design are key contributing factors to the trust in and acceptance of security mechanisms.
17h15 – 17h45
IaC Provisioning Engines – Study of Terraform and Pulumi Safety
Eloi Perdereau, TARANIS
Abstract: This talk will present ongoing work that aims to study the behavior of the provisioning tools Terraform and Pulumi (a class of Infrastructure-as-Code languages). The presentation will introduce the purpose and usage of these languages, and will progressively highlight their behavioral complexity and, in some cases, their unexpected divergences, which lead to safety issues. This motivates scientific contributions to formally understand how these tools work and to make their underlying abstractions explicit. Leveraging this knowledge could ultimately lead to new programming constructs or verification solutions for users.
17h45 – 18h15
General overview and status update on the deployment of the French node
Christian Perez, SILECS
Abstract: general overview of the platform’s objectives, envisioned use cases, current status, and discussion with the audience.
18h15 – 18h30
Wrap-up: summary of the day and closing remarks
Adrien Lebre
18h30
End of the day