Machine Learning for Autonomic System Operation in the Heterogeneous Edge-Cloud Continuum



Modern computing applications are becoming increasingly distributed and complex, spanning different stationary and mobile IoT nodes as well as more powerful nodes at the edge and/or in the cloud. Such systems are inherently heterogeneous and may exhibit considerable dynamics in terms of resource availability and application workload. This complexity makes efficient and effective resource and application management more challenging than ever.

To start with, one must deal with non-trivial resource provisioning and orchestration as well as application mapping and deployment issues across the edge-cloud continuum, while ensuring the required allocation and time/space sharing of the available computing, networking and storage resources as a function of application workload. In this respect, it is important to harness the full capabilities of modern hardware platforms (which may include multi-core CPUs, GPUs, FPGAs, TPUs or even ASICs), networking technology (such as efficient short-range wireless communication at the edge, and fast optical networks in datacenters), and hierarchical storage architectures (tiers along the different layers of the continuum, hot/cold storage, encoding and organization options) in order to achieve good utilization and improve energy efficiency.

Moreover, as several parts/nodes of the system reside outside well-protected datacenters, cross-cutting concerns such as security and trust become more crucial than ever. On the one hand, it is important not only to avoid security incidents but also to detect them early on, so that appropriate countermeasures are taken as soon as possible. On the other hand, since nodes that belong to different parties are not guaranteed to operate properly or to provide correct information, one needs to adopt a flexible approach regarding the trustworthiness of such nodes and the degree to which they should be involved in certain operations.

Last but not least, statically optimal solutions are not sufficient to handle the dynamics of such systems. IoT nodes may appear and disappear or, if mobile, follow different paths; the wireless links between IoT nodes and edge nodes can vary significantly in terms of bandwidth, quality and availability; computing and storage nodes in edge and cloud datacenters may behave in different ways; new jobs may be submitted for execution while others are withdrawn; and the application itself may go through very different phases and require significant reconfiguration to perform at the desired quality of service (QoS) level.

To address the above problems, the MLSysOps project will employ machine-learning (ML) methods to achieve true autonomic system operation, enabling practically zero-touch yet effective application and resource management for heterogeneous and dynamic edge-cloud environments.

Fraunhofer Portugal AICOS contributes to the project by enabling the integration of far-edge nodes into the autonomic system operation, going beyond the current state of the art, where such automation is applicable only in the cloud and at the edge.






For any additional information, please contact us using the inquiries form.