Season 3: Sept. 2016 – Jul. 2017


29 September 2016 – Seminar No. 50: Automatic Decomposition of Parallel Programs for Performance Optimization and Prediction.

Presented by: Mihail Popov, PhD student at the Université de Versailles Saint-Quentin-en-Yvelines.

Salle Paul Gauguin, Ter@tec, 10:00 a.m. – duration: 45 min.

Abstract

In high-performance computing, many benchmark programs are used to measure the efficiency of computers, compilers, and performance optimizations. Reference benchmark suites often gather computational programs taken from industry and can be very long to run. Benchmarking a new computing architecture or an optimization is therefore costly.

Most benchmarks are made up of a set of independent computational kernels. Often the benchmarker is only interested in a subset of these kernels, so it would be useful to be able to run them separately; this makes it easier and faster to apply local optimizations to the benchmarks. Moreover, benchmarks contain many redundant computational kernels: some operations, although measured several times, bring no additional information about the system under study. By detecting the similarities between kernels and eliminating the redundant ones, the cost of benchmarking is reduced without loss of information.

This thesis proposes a method to automatically decompose an application into a set of performance kernels, which we call codelets. The proposed method makes it possible to replay the codelets in isolation, under different experimental conditions, in order to benchmark their performance. The thesis studies to what extent kernel decomposition reduces the cost of the benchmarking and optimization process. It also evaluates the benefit of local optimizations compared with a global approach.
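As a loose illustration of the redundancy-elimination idea above (not the thesis's actual tooling), the following Python sketch groups kernels whose performance profiles are nearly identical, so that only one representative per group needs to be benchmarked; the kernel names and feature values are invented.

```python
# Illustrative sketch: keep one representative per group of kernels whose
# (hypothetical) performance profiles are nearly identical.
import math

# Hypothetical per-kernel features: (cycles per instruction, cache miss ratio)
kernels = {
    "loop_A": (0.52, 0.01),
    "loop_B": (0.53, 0.01),   # nearly identical to loop_A -> redundant
    "loop_C": (1.80, 0.12),
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dedup(kernels, threshold=0.05):
    """Greedy clustering: keep one representative per group of similar kernels."""
    representatives = {}
    for name, feat in kernels.items():
        match = next((r for r, rf in representatives.items()
                      if distance(feat, rf) < threshold), None)
        if match is None:
            representatives[name] = feat
    return representatives

print(dedup(kernels))   # e.g. {'loop_A': ..., 'loop_C': ...}
```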

Distribution flyer


06 October 2016 – Seminar No. 51: Balance-enforced multi-level algorithm for multi-criteria graph partitioning.

Presented by: Rémi Barat, PhD student at CEA.

CCIE training room, ground floor, Ter@tec, 10:00 a.m. – duration: 20 min.

Abstract

Nowadays, numerical simulations model increasingly complex phenomena. They require deeply coupled multi-physics codes that are designed to run on large distributed-memory computers. On this kind of architecture, data decomposition is critical to achieve good performance. When distributing the data between the processes, two challenges must be addressed: balancing the workload across all processes while minimizing the communication between them.

Load-balanced partitioning of a mesh with minimal communication can be reduced to a graph or hypergraph partitioning problem. Most current approaches do not strictly enforce the constraints, returning partitions that are not always valid with respect to them. We present an algorithm which strictly enforces the balance constraints for all criteria. We will compare it with Scotch and Metis, and comment on their respective behaviors on small meshes.
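To make the balance constraint concrete, here is a minimal, hypothetical Python check (not the algorithm presented in the talk): for every criterion and every part, the part's weight must stay within a tolerance of the ideal average; the weights and tolerance below are invented.

```python
# Minimal sketch of the balance constraint in multi-criteria partitioning:
# for every criterion c and every part p, the weight of p must not exceed
# (1 + epsilon) times the ideal average weight.
def is_balanced(weights, parts, num_parts, epsilon=0.05):
    """weights[v] is a tuple of weights (one per criterion) for vertex v;
    parts[v] is the part assigned to v."""
    num_criteria = len(next(iter(weights.values())))
    totals = [sum(w[c] for w in weights.values()) for c in range(num_criteria)]
    part_load = [[0.0] * num_criteria for _ in range(num_parts)]
    for v, w in weights.items():
        for c in range(num_criteria):
            part_load[parts[v]][c] += w[c]
    for c in range(num_criteria):
        cap = (1 + epsilon) * totals[c] / num_parts
        if any(part_load[p][c] > cap for p in range(num_parts)):
            return False
    return True

# Two criteria (e.g. computation and memory), four vertices, two parts.
weights = {0: (1.0, 2.0), 1: (1.0, 2.0), 2: (1.0, 2.0), 3: (1.0, 2.0)}
print(is_balanced(weights, {0: 0, 1: 0, 2: 1, 3: 1}, num_parts=2))  # True
print(is_balanced(weights, {0: 0, 1: 0, 2: 0, 3: 1}, num_parts=2))  # False
```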

Distribution flyer


13 October 2016 – Seminar No. 52: Study and Design of Heuristics and of Exact and Approximate Algorithms for a Mesh Partitioning Problem under Memory Constraints.

Presented by: Sebastien Morais, PhD student at CEA.

Salle Paul Gauguin, Ter@tec, 10:00 a.m. – duration: 30 min.

Abstract

In many scientific areas, the size and complexity of numerical simulations lead to intensive use of massively parallel runs on High Performance Computing (HPC) architectures. Such computers consist of a set of processing units (PUs) with distributed memory. The distribution of simulation data is therefore crucial: it has to minimize the computation time of the simulation while ensuring that the data allocated to every PU can be stored locally in memory.

For most numerical simulations, the physical and numerical data are attached to a mesh, that is, a discretization of the geometric domain of study into simple geometric elements, the cells. Computations are then performed at the cell level (for example within triangles and quadrilaterals in 2D, or within tetrahedra and hexahedra in 3D), and a computing and memory cost can be associated with each cell. In our context, where the mathematical methods used are finite elements or finite volumes, performing the computation associated with a cell may require information carried by neighboring cells.

The standard implementation locally stores the useful data of this neighborhood on the PU, even if the cells of this neighborhood are not computed locally. Such stored-but-not-computed cells are called ghost cells, and they can have a significant impact on the memory consumption of a PU. The problem to solve is therefore not only to partition a mesh into several parts, assigning each cell to one and only one part while taking into account the computational load assigned to each part. It is also necessary to ensure that the memory load of both the cells where the computations are performed and their neighbors fits into PU memory. This leads to partitioning the computations while the mesh is distributed with overlaps. Explicitly taking these data overlaps into account is the problem we propose to study.
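As a hedged sketch of the memory model above (not the actual partitioner studied in the thesis), the following Python snippet computes the memory load of each part as its owned cells plus the ghost cells it must store; the mesh, partition, and per-cell memory cost are invented.

```python
# Memory load of a part = owned cells + ghost cells (neighbors owned elsewhere).
def memory_per_part(adjacency, parts, mem_per_cell=1.0):
    parts_set = set(parts.values())
    owned = {p: {c for c, q in parts.items() if q == p} for p in parts_set}
    load = {}
    for p, cells in owned.items():
        ghosts = {n for c in cells for n in adjacency[c]} - cells
        load[p] = (len(cells) + len(ghosts)) * mem_per_cell
    return load

# A 1D chain of 6 cells split into two parts of 3 cells each.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
parts = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(memory_per_part(adjacency, parts))  # {0: 4.0, 1: 4.0}: 3 owned + 1 ghost each
```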

Distribution flyer


03 November 2016 – Seminar No. 53: Support of a Unified Parallel Runtime in a Compiler: Adding MPC Extended Thread-Local Storage support in PGI Compilers

Presented by: Jean Perier, former intern at PGI.

Salle Paul Gauguin, Ter@tec, 10:00 a.m. – duration: 25 min.

Abstract

To gain performance in a parallel application, a developer has to choose the compiler and the runtime libraries that will get the most out of the code and hardware. Yet it is not always possible to pick any combination of runtime libraries and compiler, because a runtime can require specific support from the compiler. This is the case of the Multi-Processor Computing (MPC) runtime framework.

The goal of the work that will be presented is to move towards the possibility of taking advantage of both NVIDIA's PGI compilers and MPC. More precisely, the goal is to add support for MPC's Extended Thread-Local Storage (ETLS) feature inside the PGI compilers.
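For readers unfamiliar with the underlying mechanism, the following Python sketch loosely illustrates ordinary thread-local storage semantics, in which each thread sees its own private copy of a variable; it does not model MPC's extended TLS levels or the compiler lowering that the talk is about.

```python
# Loose illustration of ordinary thread-local storage semantics (not MPC's
# extended TLS levels): each thread sees and mutates its own private copy.
import threading

tls = threading.local()

def worker(value):
    tls.counter = value          # private to this thread
    tls.counter += 1
    print(threading.current_thread().name, tls.counter)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```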

Distribution flyer


09 February 2017 – Seminar No. 54: Rewriting System for Profile-Guided Data Layout Transformations on Binaries

Presented by: Christopher Haine, PhD student at Inria Bordeaux Sud-Ouest.

Salle Paul Gauguin, Ter@tec, 10:00 a.m. – duration: 30 min.

Abstract

Careful data layout design is crucial for achieving high performance, as nowadays processors waste a considerable amount of time stalled on memory transactions; in particular, spatial and temporal locality have to be optimized. However, data layout transformation is an area left largely unexplored by state-of-the-art compilers, due to the difficulty of evaluating the possible performance gains of transformations. Moreover, optimizing the data layout is time-consuming and error-prone, and layout transformations are too numerous to experiment with by hand in the hope of discovering a high-performance version.

We propose to guide application programmers through data layout restructuring with extensive feedback. First, we provide a comprehensive multidimensional description of the initial layout, built by analyzing memory traces collected from the application binary, ultimately aiming to pinpoint problematic strides at the instruction level, independently of the input language. We focus on layout transformations that are translatable to a C formalism to aid user understanding. We apply and assess them on a case study composed of two representative multithreaded real-life applications, a cardiac wave simulation and a lattice QCD simulation, with different inputs and parameters. The performance prediction of the different transformations matches (within 5%) hand-optimized layout code, and the speed-ups obtained are as high as 28x (for the cardiac wave simulation) and 2.5x for lattice QCD, when combined with SIMDization.
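As a purely hypothetical illustration of one step of this feedback (not the actual rewriting system), the sketch below scans a per-instruction trace of accessed addresses, reports the dominant stride, and flags it when it exceeds a cache line; the trace and thresholds are invented.

```python
# Report the dominant access stride of one instruction and flag poor
# spatial locality when the stride exceeds a cache line.
from collections import Counter

CACHE_LINE = 64  # bytes, assumption

def dominant_stride(addresses):
    strides = Counter(b - a for a, b in zip(addresses, addresses[1:]))
    stride, _ = strides.most_common(1)[0]
    return stride

# Invented trace: one 8-byte field read out of a 128-byte record
# (array-of-structures layout) -> stride of 128 bytes.
trace = [0x1000 + 128 * i for i in range(16)]
stride = dominant_stride(trace)
if stride > CACHE_LINE or stride < 0:
    print(f"stride {stride}: likely spatial-locality problem")
else:
    print(f"stride {stride}: fits within a cache line of {CACHE_LINE} B")
```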

Biography

After a Master's degree in High Performance Computing (MIHPS) at the Université de Versailles / École Centrale Paris, Christopher Haine worked on parallel solution methods for eigenvalue problems during an internship at LBNL (Berkeley Lab). He then did an internship at Inria Bordeaux Sud-Ouest on performance modeling of computational kernels. He is currently a PhD student at Inria Bordeaux Sud-Ouest, working on the automated restructuring of computational kernels.

Distribution flyer


23 February 2017 – Seminar No. 55: Conciliating programmability and efficiency by vectorizing SPMD programs at microarchitecture level

Presented by: Sylvain Collange, Research Scientist at Inria Rennes.

Salle Paul Gauguin, Ter@tec, 11:00 a.m. – duration: 45 min.

Abstract

GPUs are now established parallel accelerators for high-performance computing applications and machine learning. Part of their success is based on their so-called SIMT execution model. SIMT binds together threads of parallel applications so they perform the same instruction at the same time, in order to execute their instructions on energy-efficient SIMD units. Unfortunately, current GPU architectures lack the flexibility to work with standard instruction sets like x86 or ARM. Their implementation of SIMT requires special instruction sets with control-flow reconvergence annotations, and they do not support complex control flow like exceptions, context switches and thread migration.

In this talk, I will present how we can generalize the SIMT execution model of GPUs to general-purpose processors and multi-threaded applications, and then use it to design new CPU-GPU hybrid cores. These hybrid cores will form the building blocks of heterogeneous architectures mixing CPU-like cores and GPU-like cores that all share the same instruction set and programming model. Besides improving programmability, generalized SIMT enables key improvements that were not possible in the traditional SIMD model, such as simultaneous execution of divergent paths. It opens the way for a whole spectrum of new architectures, hybrids of latency-oriented superscalar processors and throughput-oriented SIMT GPUs.
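As a conceptual sketch of the SIMT idea discussed above (not a model of any real microarchitecture), the Python snippet below executes both sides of a divergent branch over a vector of lanes, using a per-lane active mask to decide which lanes commit results before reconverging; the data and the branch are invented.

```python
# Conceptual SIMT execution of a divergent branch: all lanes step through
# both paths, and a per-lane active mask selects which lanes commit results.
def simt_branch(values):
    mask_then = [v % 2 == 0 for v in values]        # lanes taking the 'then' path
    mask_else = [not m for m in mask_then]          # lanes taking the 'else' path
    out = list(values)
    # 'then' path executed by the whole warp, committed only where mask is set
    for i, active in enumerate(mask_then):
        if active:
            out[i] = values[i] // 2
    # 'else' path, same mechanism
    for i, active in enumerate(mask_else):
        if active:
            out[i] = 3 * values[i] + 1
    return out  # reconvergence point: all lanes active again

print(simt_branch([1, 2, 3, 4]))  # [4, 1, 10, 2]
```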

Biography

Sylvain Collange is a Research Scientist at Inria Rennes. His research interests include the architecture of throughput processors and Graphics Processing Units, compiler optimizations for GPUs, and computer arithmetic. His key contributions to these areas include scalarization for GPU architectures, dynamic SPMD vectorization, and the Barra GPU simulator.

Distribution flyer


27 April 2017 – Seminar No. 56: Scaling Parallel Seismic Raytracing

Presented by: Allen Malony, Professor at the University of Oregon.

Salle Paul Gauguin, Ter@tec, 11:00 a.m. – duration: 45 min.

Abstract

Marine geologists use seismic tomography techniques to determine the 3D geophysical structure of the ocean floor. At the heart of seismic tomography methods is a forward solver used to compute minimum travel times from all locations in an earth model to the sensors used in seismic experiments. The Stingray seismic raytracer developed at the University of Oregon was originally based on Dijkstra's single-source shortest-path (SSSP) algorithm. Unfortunately, the algorithm's inherent sequential nature limits its scalability. SSSP problems can also be solved in an iterative, data-parallel fashion based on the Bellman-Ford-Moore (BFM) algorithm. To overcome the inherent scaling problems (both in time and space), a data-parallel algorithm for seismic raytracing was developed and implemented. It allows for a scalable partitioning of the seismic model in multiple dimensions and high degrees of concurrency. However, it requires multiple iterations to converge. The tradeoff between greater parallelism potential and convergence cost governs performance. Results are presented for OpenMP, CUDA, and MPI experiments on seismic models of significantly larger size than Stingray has processed before.
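The following Python sketch shows the iterative, data-parallel idea behind a Bellman-Ford-Moore style solver on a tiny invented graph: every edge is relaxed in sweeps until the travel times stop changing, and each sweep is independent work that could in principle be partitioned across threads or ranks. It illustrates the general algorithm, not the Stingray implementation.

```python
# Iterative Bellman-Ford-Moore relaxation: sweep all edges until convergence.
import math

def bfm_travel_times(num_nodes, edges, source):
    """edges: list of (u, v, travel_time); returns minimum times from source."""
    time = [math.inf] * num_nodes
    time[source] = 0.0
    changed = True
    while changed:                      # iterate until convergence
        changed = False
        for u, v, w in edges:           # one sweep; parallelizable in principle
            if time[u] + w < time[v]:
                time[v] = time[u] + w
                changed = True
    return time

edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 2.0), (2, 1, 2.0), (0, 2, 5.0), (2, 0, 5.0)]
print(bfm_travel_times(3, edges, source=0))  # [0.0, 1.0, 3.0]
```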

The talk will also include a discussion of future directions, including the building of an integrated environment for geoinformatics that supports the full data management, analytics, simulation modeling, computational integration (e.g., geodynamics), visualization, and workflow needs of seismologists and earth scientists.

Biography

Dr. Allen D. Malony is a Professor in the Department of Computer and Information Science at the University of Oregon. Malony received the B.S. and M.S. degrees in Computer Science from the University of California, Los Angeles in 1980 and 1982, respectively. He received the Ph.D. degree from the University of Illinois at Urbana-Champaign in October 1990. From 1981 to 1985, Malony worked at Hewlett-Packard Laboratories in Palo Alto, California. From 1986 to 1991, he was a Senior Software Engineer at the University of Illinois Center for Supercomputing Research and Development, where he was the leader of the performance evaluation project for the Cedar multiprocessor. In 1991, Malony joined the faculty at Oregon, spending his first year as a Fulbright Research Scholar and visiting Professor at Utrecht University in The Netherlands. Dr. Malony was awarded the NSF National Young Investigator award in 1994. In 1999 he was a Fulbright Research Scholar to Austria, resident at the University of Vienna. Dr. Malony was awarded the prestigious Alexander von Humboldt Research Award for Senior U.S. Scientists by the Alexander von Humboldt Foundation in 2002. He was promoted to Full Professor in 2004.

Distribution flyer


4 May 2017 – Seminar No. 57: Floating-Point Computation and Manycore Architectures

Presented by: David Defour, Associate Professor at the Université de Perpignan.

Salle Paul Gauguin, Ter@tec, 11:00 a.m. – duration: 30 min.

Abstract

The computations found in scientific applications rely on intensive use of IEEE-754 floating-point arithmetic. This standard defines the representation formats of floating-point numbers and the behavior of the usual arithmetic operations. It is implemented in hardware and makes traditional scientific codes portable, predictable, and provable. While the standard is intended to be generic, it does not address all the problems encountered by users, especially in the context of future exascale architectures.

In this talk, we will present some of these problems and the possible solutions. We will address the representativeness of information and the use of "alternative" arithmetics, the efficiency of computation with configurable units, performance with the various floating-point formats and the exploitation of regularity, and the numerical quality of results, covering the issues of numerical reproducibility and fault tolerance.
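As a small illustration of the reproducibility issue mentioned above (invented values, not taken from the talk), the Python snippet below shows that IEEE-754 addition is not associative, so the same reduction evaluated in a different order (as happens when the number of threads changes) can produce different results; compensated summation is one classical mitigation.

```python
# IEEE-754 addition is correctly rounded but not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))    # False

vals = [1e16, 1.0, 1.0, -1e16]                   # exact sum is 2.0
print(sum(vals))                                  # 0.0: the small terms are absorbed
print(sum([1.0, 1.0, 1e16, -1e16]))               # 2.0: same values, another order

# Compensated (Kahan) summation recovers accuracy and is far less
# sensitive to the order of the terms.
def kahan_sum(xs):
    s, c = 0.0, 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

print(kahan_sum(vals))                            # 2.0
```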

Distribution flyer


15 June 2017 – Seminar No. 58: System-wide Power Management for Overprovisioned HPC Systems

Presented by: Allen Malony, Professor at the University of Oregon.

Salle Paul Gauguin, Ter@tec, 11:00 a.m. – duration: 45 min.

Abstract

Power is quickly becoming a first-class resource management concern in HPC. Current trends for high-performance systems are leading towards hardware overprovisioning, where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. Thus, power must be scheduled for all applications currently executing. The safety of any power scheduler for deployment can be proven by analyzing the scheduler's algorithm and mechanism with respect to a power scheduling invariant. The standard practice of static power scheduling avoids violation of the power caps, but is likely to lead to inefficiencies, with over- and under-provisioning of power to components at runtime.

This talk presents the performance and scalability of an application-agnostic runtime power scheduler (POWsched) that is capable of enforcing a system-wide power limit. POWsched provides a proof by construction that power scheduling can be done safely and effectively, without application-specific models, using a simple feedback mechanism. Experimental results show POWsched is robust, has negligible overhead, and can take advantage of opportunities to dynamically shift (every second) wasted power to more power-intensive applications, improving overall workload runtime compared to a static job scheduler and without application-specific profiling.
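As a conceptual sketch of feedback-based power shifting (not POWsched's actual algorithm), the Python snippet below reassigns, at each interval, the power left unused by some jobs to jobs that are hitting their cap, while keeping the sum of caps under the system-wide bound; the wattages and margin are invented.

```python
# Feedback-based power shifting: reclaim unused headroom, give it to
# power-constrained jobs, never exceed the system-wide bound.
SYSTEM_BOUND = 300.0   # watts, invented
MARGIN = 5.0           # headroom kept above observed draw

def rebalance(caps, measured):
    """caps/measured: dict job -> watts.  Returns new per-job caps."""
    new_caps = {}
    slack = 0.0
    starved = []
    for job, cap in caps.items():
        draw = measured[job]
        if draw < cap - MARGIN:            # job is not using its allocation
            new_caps[job] = draw + MARGIN
            slack += cap - new_caps[job]
        else:                              # job is power-constrained
            new_caps[job] = cap
            starved.append(job)
    for job in starved:                    # hand the reclaimed power out evenly
        new_caps[job] += slack / len(starved)
    assert sum(new_caps.values()) <= SYSTEM_BOUND + 1e-9
    return new_caps

caps = {"jobA": 150.0, "jobB": 150.0}
measured = {"jobA": 90.0, "jobB": 149.0}   # invented power samples
print(rebalance(caps, measured))           # jobA's cap shrinks, jobB's grows
```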

The talk also discusses a power simulation framework (PowSim) that gives a proof by construction that the generalized effects of power scheduling on runtime can be efficiently simulated at scale. PowSim thus provides critical simulation infrastructure for researchers exploring power scheduling at scale. Using simulation, power scheduling strategies are studied, and dynamic power scheduling is shown to outperform static and reservation-based techniques.

Biography

Dr. Allen D. Malony is a Professor in the Department of Computer and Information Science at the University of Oregon. Malony received the B.S. and M.S. degrees in Computer Science from the University of California, Los Angeles in 1980 and 1982, respectively. He received the Ph.D. degree from the University of Illinois at Urbana-Champaign in October 1990. From 1981 to 1985, Malony worked at Hewlett-Packard Laboratories in Palo Alto, California. From 1986 to 1991, he was a Senior Software Engineer at the University of Illinois Center for Supercomputing Research and Development, where he was the leader of the performance evaluation project for the Cedar multiprocessor. In 1991, Malony joined the faculty at Oregon, spending his first year as a Fulbright Research Scholar and visiting Professor at Utrecht University in The Netherlands. Dr. Malony was awarded the NSF National Young Investigator award in 1994. In 1999 he was a Fulbright Research Scholar to Austria, resident at the University of Vienna. Dr. Malony was awarded the prestigious Alexander von Humboldt Research Award for Senior U.S. Scientists by the Alexander von Humboldt Foundation in 2002. He was promoted to Full Professor in 2004.

Distribution flyer


6 July 2017 – Seminar No. 59: Studying a 40 Tb/s data acquisition system for the LHCb experiment

Presented by: Sebastien Valat, Fellow at CERN.

Salle Paul Gauguin, Ter@tec, 11:00 a.m. – duration: 25 min.

Abstract

For the 2019-2020 upgrade, the CERN LHCb experiment decided to build a data acquisition system running without any hardware trigger. This means reading all the data generated by the detector for every collision and transporting it to the software trigger, which consists of around 1000-2000 nodes. This will increase the physics quality of the data filtering, but it requires building a system capable of handling 40 Tb/s of input data. To this end, we are currently studying the high-speed networks coming from the HPC field: InfiniBand, OmniPath, and 100G Ethernet. As these networks offer 100 Gb/s links, the final system performing the data aggregation needs to scale to around 500 nodes. This is the part we will discuss in this presentation. The communication pattern we need to handle is close to a continuous all-to-all, or many simultaneous gathers, which is not kind to the network. I will present the results we obtained with our application on existing HPC sites.
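As a hedged sketch of the aggregation pattern described above (not the actual LHCb event builder), the snippet below uses mpi4py's all-to-all exchange so that every rank receives the event fragments all other ranks produced for it; the fragment contents and the choice of the builder rank are invented, and an MPI installation is required to run it (e.g. with mpirun).

```python
# Continuous all-to-all aggregation sketch; run e.g. `mpirun -n 4 python this_file.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

for collision in range(3):                       # stand-in for a continuous stream
    # One (tiny, invented) fragment destined to each peer rank.
    fragments = [f"frag(src={rank}, dst={dst}, evt={collision})" for dst in range(size)]
    # All-to-all: rank r receives the fragments every rank produced for r.
    gathered = comm.alltoall(fragments)
    if rank == collision % size:                 # pretend this rank builds the event
        print(f"event {collision} assembled from {len(gathered)} fragments")
```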

Biography

Sebastien Valat started his training with a Master's degree in particle physics, doing his internships on the OPERA and LHCb detectors on decay channel reconstruction and analysis. He continued his studies with a second Master's degree in computer science, which led to a PhD on memory management in an HPC context at CEA, while working on the MPC project. He then did a one-year post-doc at the Exascale Computing lab, working on a memory profiling tool (MALT). He has now been a Fellow at CERN for two years, working on the future data acquisition system for the LHCb detector.

Distribution flyer