First EnergySFE Workshop: Grenoble - September 1st, 2016

Minatec Campus - 17 rue des Martyrs - DRT/LETI/DACLE - Building 51 C


08h45 - 09h00 Opening Session

09h00 - 10h30

  • CORSE Team Overview, Jean-François Méhaut (CORSE) [slides]
  • Using Data Dependencies to Improve Task Based Scheduling Strategies on NUMA Architectures, Philippe Virouleau (CORSE)
  • Adaptation of a HPC System to FPGA, Georgios Christodoulis (CORSE) [slides]

10h30 - 12h00

  • Distributed Processing and Energy Saving Techniques in Mobile Crowd Sensing, Enrique Carrera (SAPyC/WiCom) [slides]
  • Evaluation of Many-core Processors in HPC, Pablo Ramos (TIMA/SAPyC) [slides]
  • Fault Tolerance for Multi-core and Many-core Processors in HPC, Vanessa Vargas (TIMA/SAPyC) [slides]

12h00 - 13h30 Lunch - Brasserie du Carré - 53 Rue Pierre Semard, 38000 Grenoble

14h00 - 15h00 Guided Tour in the GENEPI2 Facility

15h30 - 16h30

  • Current Efforts in Scheduling and Fault Tolerance, Laércio Pilla (ECL) [slides]
  • Green HPC with Low-power Manycores, Márcio Castro (LaPeSD) [slides]

16h30 - 17h00 Coffee break

17h00 - 18h00

  • Three-Level Screening Designs for Fast Energy Savings, Lucas Schnorr (GPPD) [slides]
  • Energy-Aware Autotuning for HPC Kernels, Luis Felipe Millani (CORSE) [slides]

18h00 - 18h30 Discussions

19h30 Dinner (location to be announced)

Detailed Program

  • CORSE Team Overview
    • Speaker: Jean-François Méhaut (CORSE)
    • Abstract: Languages, compilers, and run-time systems are some of the most important components to bridge the gap between applications and hardware. With the continuously increasing power of computers, expectations are evolving, with more and more ambitious, computationally intensive and complex applications. As desktop PCs are becoming a niche and servers mainstream, three categories of computing impose themselves for the next decade: mobile, cloud, and super-computing. Diversity, heterogeneity (even on a single chip) and, consequently, hardware virtualization are putting more and more pressure on both compilers and run-time systems. Moreover, because of the energy wall, architectures are becoming more and more complex and parallelism ubiquitous at every level. Unfortunately, the memory-CPU gap continues to increase and energy consumption remains an important issue for future platforms. To address the challenge of performance and energy consumption raised by silicon companies, compilers and run-time systems must evolve and, in particular, interact, taking into account the complexity of the target architecture. The overall objective of CORSE is to address this challenge by combining static and dynamic compilation techniques, with more interactive embedding of programs and compiler environment in the run-time system.
  • Using Data Dependencies to Improve Task Based Scheduling Strategies on NUMA Architectures
    • Speaker: Philippe Virouleau (CORSE)
    • Abstract: The recent addition of data dependencies to the OpenMP 4.0 standard provides the application programmer with a more flexible way of synchronizing tasks. Using such an approach allows both the compiler and the runtime system to know exactly which data are read or written by a given task, and how these data will be used through the program lifetime. Data placement and task scheduling strategies have a significant impact on performance on NUMA architectures. While numerous papers focus on these topics, none of them has made extensive use of the information available through dependencies. One can use this information to modify the behavior of the application at several levels: during initialization to control data placement, and during execution to dynamically control both the task placement and the task-stealing strategy, depending on the topology. This talk introduces several heuristics for these strategies and their implementations in our OpenMP runtime. We also evaluate their performance on linear algebra applications executed on a 192-core NUMA machine, reporting noticeable performance improvements when considering both the architecture topology and the tasks' data dependencies. Finally, we compare them to strategies previously proposed in related work.
  • Adaptation of a HPC System to FPGA
    • Speaker: Georgios Christodoulis (CORSE)
    • Abstract: Multicore architectures emerged to tackle the unsustainable growth in power consumption of single-core CPUs. The next step in this direction is the use of accelerators for the execution of tasks with certain characteristics (e.g. GPUs running OpenCL/CUDA kernels). Within the HEAVEN project, we are developing a heterogeneous system that will enable task acceleration using FPGAs, exploiting their outstanding energy efficiency. In our approach, applications would be programmed using OpenMP, the standard programming environment for shared memory architectures, hiding low-level hardware-specific mechanisms from the application developer (e.g. memory transfers between the host and the device). The HLS tool we are using to generate the corresponding VHDL for the configuration of the FPGA is AUGH, because of its ability to very quickly provide an RTL description of the design under resource constraints. We also extend StarPU, a runtime system that provides mechanisms for heterogeneous scheduling, data transfers, and intranode communication, to support the new device.
  • Distributed Processing and Energy Saving Techniques in Mobile Crowd Sensing
    • Speaker: Enrique Carrera (SAPyC/WiCom)
    • Abstract: Mobile Crowd Sensing (MCS) is a large-scale sensing paradigm based on the power of user-companioned devices, including mobile phones, smart vehicles, wearable devices, and so on. The deployment of MCS applications in large-scale environments is not a trivial task because of their data aggregation requirements and limited battery resources. Thus, we propose a hierarchical distributed architecture where processing is pushed to the edge without increasing the energy consumption of battery-powered devices. A simulation of the distributed architecture has been implemented and complemented with actual power consumption measurements on selected smartphones. We conclude that there are many research opportunities related to distributed processing and energy consumption issues in the field of MCS.
  • Evaluation of Many-core Processors in HPC
    • Speaker: Pablo Ramos (TIMA/SAPyC)
    • Abstract: The design of multi-core and many-core processors is becoming a new challenge, since manufacturers have to face critical factors such as performance, reliability and power consumption. The exceptional computational capabilities of these devices make them very attractive for the implementation of high-performance applications in scientific and commercial fields. However, continuous technology shrinking, combined with design complexity, increases their vulnerability to natural radiation, especially to Single Event Effects (SEEs). A significant advantage of many-core processors in facing this vulnerability is their inherent redundancy, which makes them ideal for implementing fault-tolerance techniques. In addition, to improve device reliability, complementary protection mechanisms such as Error Correcting Codes (ECC) and parity are commonly implemented in memory cells. Nevertheless, implementing additional protections requires extra area, which leads to higher power consumption and performance degradation. The dependability of multi-core and many-core processors is a crucial issue to consider, especially if the devices are intended to be used for safety-critical applications. It is thus mandatory to evaluate the SEE sensitivity of such devices. This work evaluates the sensitivity of the multi-core PowerPC P2041RDB and the many-core KALRAY MPPA-256.
  • Fault Tolerance for Multi-core and Many-core Processors in HPC
    • Speaker: Vanessa Vargas (TIMA/SAPyC)
    • Abstract: The current technological trend in HPC systems is the use of multi-core and many-core processors in order to satisfy the growing demand for performance and reliability without a critical increase in power consumption. In fact, the Sunway TaihuLight supercomputer, which ranked first on the TOP500 list in June 2016, is based on the many-core processor ShenWei SW26010 (260 cores). The inherent redundancy of many-core processors makes them ideal for implementing fault-tolerance techniques such as N-modular redundancy, which applies majority voting. Moreover, these devices provide great flexibility because they allow implementing different multi-processing modes and programming paradigms. This talk presents our work on studying and improving the reliability of multi-core and many-core processors, carried out on the multi-core PowerPC P2041RDB and the many-core KALRAY MPPA-256.
  • Current Efforts in Scheduling and Fault Tolerance
    • Speaker: Laércio Pilla (ECL)
    • Abstract: This presentation reviews matters related to current and future work in global scheduling and reliability for High Performance Computing systems. During the first half of the presentation, global scheduling algorithms that focus on load balance, communication, performance and energy efficiency will be explored. In the second half, radiation experiments regarding the reliability of parallel algorithms, the effects of hardening solutions, the effects of code optimizations, and the comparison of parallel accelerators will be discussed.
  • Green HPC with Low-power Manycores
    • Speaker: Márcio Castro (LaPeSD)
    • Abstract: Although we have observed a steady increase in the processing capabilities of HPC platforms, their energy efficiency is still lagging behind. Recently, a new class of highly-parallel processors called lightweight manycore processors was unveiled. Tilera Tile-Gx and Kalray MPPA-256 are examples of such processors, providing high levels of parallelism with hundreds or even thousands of cores. Although they present better energy efficiency than state-of-the-art general-purpose multicore processors, they can also make the development of efficient scientific parallel applications a challenging task due to their architectural idiosyncrasies. Some of these processors are built and optimized for certain classes of embedded applications like signal processing, video decoding and routing. Additionally, processors such as MPPA-256 have important memory constraints, e.g., a limited amount of directly addressable memory and the absence of cache coherence protocols. Furthermore, efficient execution on these processors requires data transfers to conform to the Network-on-Chip (NoC) topology to mitigate the otherwise high communication costs. In this talk, I will present our efforts on using the MPPA-256 manycore processor in the HPC domain.
  • Three-Level Screening Designs for Fast Energy Savings
    • Speaker: Lucas Schnorr (GPPD)
    • Abstract: Dynamic Voltage and Frequency Scaling is often used to save energy by selecting the best frequency to execute code regions of HPC applications. If a memory-bound code region is correctly identified, one can potentially save energy with minimal performance losses by lowering the frequency. The problem this work addresses comes from two sources: the first is that current processors have many available frequency levels; the second is that HPC applications are complex, with many code regions. Detecting the best frequency to run each code region is very time consuming, especially if one wants to account for measurement variability through replications. Our strategy to reduce the time to detect such frequencies is twofold: to adopt three-level screening designs and to automatically detect memory-bound code regions using hardware counters (L2, L3, IPC). We plan to apply this strategy to a number of applications, such as those available in the Mantevo and CORAL benchmark suites, using a rigorous experimental plan. Since this work is still in its infancy, we are looking for potential partners who might be interested in collaborating.
  • Energy-Aware Autotuning for HPC Kernels
    • Speaker: Luis Felipe Millani (CORSE)
    • Abstract: Energy consumption is a growing concern in HPC. Increasingly complex platforms with several cores and accelerators are difficult to optimize for, both in GFLOPS and GFLOPS/Watt. Autotuners are often used to broaden the search space and obtain portable performance across these different platforms. It is therefore of interest for autotuners to also consider the energy efficiency of the solutions. BOAST is a framework for comparing the time efficiency of different implementations of a computing kernel. We extend the BOAST framework to also consider the energy efficiency of the different implementations. This gives the user a better understanding of the time-energy trade-offs. In this work we present the time-energy trade-offs of different optimizations found semi-automatically by BOAST for a few HPC kernels.