Commit 0fb2ce70 authored by Alan O'Cais

Add HPC related guidelines

.. _eurohpc:

Future HPC Hardware in Europe
=============================

The European HPC Technology Platform, ETP4HPC, is an industry-led think-tank comprising European HPC technology
stakeholders: technology vendors, research centres and end-users. The main objective of ETP4HPC is to define
research priorities and action plans in the area of HPC technology provision (i.e. the provision of supercomputing
systems). It has been responsible for the production and maintenance of the `European HPC Technology Strategic Research
Agenda (SRA) <>`_, a document that serves as a
mechanism to provide contextual guidance to European researchers and businesses as well as to guide EU priorities for
research in the Horizon 2020 HPC programme, i.e. it represents a roadmap for the achievement of European exascale
capabilities.
We have had numerous discussions of the E-CAM community software needs through our exchanges with ETP4HPC
during the course of our contributions to the SRA. The particular contribution from our discussion related to the
software needs for exascale computing within the ETP4HPC SRA report is shown in the paragraphs below:
E-CAM has not committed itself to a single set of applications or use cases that can be represented in such a
manner; it is instead driven by the needs of the industrial pilot projects within the project (as well as the
wider community). Taking into consideration the CECAM community and the industrial collaborations
targeted by E-CAM, probably the largest exa-scale challenge is ensuring that the skillsets of the application
developers from academia and industry are sufficiently up to date and aligned with programming
best practices. This means that they are at least competent in the latest relevant language specification
(Fortran 2015, C++17, ...) and aware of additional tools and libraries that are necessary (or useful) for application
development at the exa-scale. For application users, this means that they have sufficient knowledge
of the architecture, software installation and the typical supercomputing environment to build, test and run
application software optimised for the target.
While quite specific "key applications" are under continuous support by other CoEs, this is not the current
model of E-CAM. E-CAM is more likely to support and develop a software installation framework (such as
`EasyBuild <>`_) that simplifies building the (increasingly non-trivial) software stack of a particular application
in a reliable, reproducible and portable way. Industry has already shown significant interest in this and
E-CAM is particularly interested in extending the capabilities of EasyBuild to EsD architectures, performance
analysis workflows and to new scientific software packages. Such an effort could easily be viewed
as transversal since such developments could be leveraged by any other CoE.
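
To give a flavour of what such a reproducible build recipe looks like, below is a minimal, purely illustrative EasyBuild easyconfig sketch; the application name, version, URLs and dependency version are hypothetical, not taken from any real E-CAM package:

```python
# Illustrative easyconfig sketch (hypothetical file: QuantumApp-1.0-foss-2017b.eb).
# All names, versions and URLs here are invented, for illustration only.
easyblock = 'ConfigureMake'

name = 'QuantumApp'
version = '1.0'

homepage = 'https://example.org/quantumapp'
description = "Hypothetical simulation code used to illustrate the easyconfig format."

# The toolchain pins the compilers and MPI/maths libraries, which is what makes
# the resulting build reliable, reproducible and portable across sites.
toolchain = {'name': 'foss', 'version': '2017b'}

sources = [SOURCE_TAR_GZ]
source_urls = ['https://example.org/downloads']

dependencies = [('FFTW', '3.3.6')]

# EasyBuild verifies the installation against these paths after building.
sanity_check_paths = {
    'files': ['bin/quantumapp'],
    'dirs': [],
}

moduleclass = 'chem'
```

Each easyconfig pins the full software stack of one application build, so installing it on a new machine is a single `eb` invocation rather than a manual rebuild of every dependency.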
One important focus of the SRA is the development of the Extreme-Scale Demonstrators (EsDs) that are vehicles to
optimise and synergise the effectiveness of the entire HPC H2020 programme, through the integration of R&D outcomes
into fully integrated HPC system prototypes.
The work in developing the EsDs will fill critical gaps in the H2020 programme, including the following activities:
* Bring technologies from FET-HPC projects closer to commercialisation.
* Combine the results from targeted R&D efforts into a complete system (the European HPC technology ecosystem).
* Provide the missing link between the three HPC pillars: technology providers, user communities (e.g. E-CAM)
and infrastructure.
As one of the CoEs, E-CAM should aim to provide insight and input into the requirements of future exascale systems
based on lessons learnt from activities within E-CAM (e.g. software development and relevant performance optimisation
and scaling work). This entails building further knowledge and understanding within E-CAM of how to exploit current
multi-petaflop infrastructures and of what future exascale architectures may look like, as well as interaction and close collaboration
between E-CAM and other projects (i.e. the projects shown in Figure 12); these are also covered in subsequent
sections of this paper.

Emerging hardware architectures relevant to exascale computing
--------------------------------------------------------------

The European Commission supports a number of projects developing and testing innovative architectures for next
generation supercomputers, aimed at tackling some of the biggest challenges on the road to exascale computing. These
projects often rely on co-design between HPC technologists, hardware vendors and code developer/end-user communities
in order to develop prototype systems. Some of these projects include:
* The `DEEP <>`_ (Dynamic Exascale Entry Platform) projects (DEEP, DEEP-ER and DEEP-EST)
* The `Mont-Blanc <>`_ projects (Mont-Blanc 1, 2 and 3)
* The `PRACE PCP <>`_ (Pre-Commercial Procurement) initiative

.. _fpga:

FPGA
----

Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are based around a matrix of configurable
logic blocks (CLBs) connected via programmable interconnects. FPGAs can be reprogrammed to desired application
or functionality requirements after manufacturing. This feature distinguishes FPGAs from Application Specific
Integrated Circuits (ASICs), which are custom manufactured for specific design tasks.
Xilinx UltraScale FPGAs and ARM processors have been proposed by the EuroEXA project as a new path towards exascale.
EuroEXA is an EU-funded, 20 million euro `ARM+FPGA Exascale Project
<>`_ intended to lead Europe
towards exascale, together with the `ExaNeSt <>`_, `EcoScale <>`_ and `ExaNoDe
<>`_ projects, scaling peak performance to 400 PFLOPS in a peak
system power envelope of 30 MW; over four times the performance at four times the energy efficiency of today's HPC
platforms.

Feedback for software developers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Despite their high efficiency in performance and power consumption, FPGAs are known for being difficult to program.
`OpenCL for FPGA <>`_ is one example of a programming model
for FPGAs that we recommend, particularly considering that these new technologies will soon be available within the
E-CAM community through the EuroEXA project.

.. _knl:

Intel Many-core
---------------

The 2nd Generation Intel Xeon Phi platform, known as Knights Landing (KNL), has been released on the market in
Q2 of 2016. The chip, based on a 14nm lithography, contains up to 72 cores @1.5GHz with a maximum memory
bandwidth of 115.2 GB/s. One of the main features is the increased AVX512 ISE (Instruction Set Extensions) which
includes SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions). More details are available at `Intel
Knights Landing <>`_. The same component (Intel Xeon Phi
7250-F) is available in the JURECA Booster as part of the `DEEP
<>`_ and DEEP-ER projects.
Knights Hill is the codename for the third-generation MIC architecture and it will be manufactured in a 10 nm process.
Intel announced the first details at SC14; however, since then no further details have been released and the DoE's Aurora
project has been delayed (see `Some Surprises in the 2018 DoE Budget for Supercomputing <>`_).

`Knights Mill <>`_ is Intel's codename for a Xeon
Phi product specialised for deep learning. It is expected to support reduced variable precision, which has been used to
accelerate machine learning in other products, such as the half-precision floating-point variables in Nvidia's Tesla.

Feedback for software developers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on the latest hardware developments specified above (and the AVX512 instruction set used by this hardware),
we strongly advise software developers to take into consideration the importance of enhancing performance through
vectorization, both at the level of the numerical algorithm and at the compiler level. Intel provides very good tools to
achieve this through compiler flags (which allow you to obtain a full report on the vectorization efficiency) or more
sophisticated software like `Intel Advisor <>`_.
At node level the recommended parallelism is via shared memory. In this case `OpenMP <>`_ is the
de facto standard, and Intel provides good tools like `VTune <>`_.
Many training courses and documents are available online (see `Intel Advisor training
<>`_ and `VTune training <>`_).

.. _gpu:

GPU
---

The new NVIDIA `Tesla V100 <>`_ accelerator
incorporates the new Volta GV100 GPU. Equipped with 21 billion transistors, Volta delivers over 7.5 teraflops
of double precision performance, a ~1.5x increase compared to its predecessor, the Pascal GP100 GPU. Moreover,
architectural improvements include:
* A tensor core: a unit that multiplies two 4×4 FP16 matrices, and then adds a third FP16 or FP32 matrix to the
  result by using fused multiply-add operations, obtaining an FP32 result that can optionally be demoted to
  an FP16 result. Tensor cores are intended to speed up the training of neural networks.
* Tesla V100 uses a faster and more efficient HBM2 implementation. HBM2 memory is composed of memory
  stacks located on the same physical package as the GPU, providing substantial power and area savings compared
  to traditional GDDR5 memory designs, thus permitting more GPUs to be installed in servers. In addition
  to the higher peak DRAM bandwidth on Tesla V100 compared to Tesla P100, the HBM2 efficiency on V100 GPUs
  has been significantly improved as well. The combination of a new generation HBM2 memory from Samsung
  and a new generation memory controller in Volta provides 1.5x the delivered memory bandwidth of
  Pascal GP100, and greater than 95% memory bandwidth efficiency running many workloads.
* NVLink 2.0, a high-bandwidth bus between multiple GPUs, and between the CPU and GPU. Compared to NVLink
  on Pascal, NVLink 2.0 on V100 increases the signaling rate from 20 to 25 Gigabits/second. Each link now provides
  25 Gigabytes/second in each direction. The number of links supported has been increased from four to six, pushing
  the supported GPU NVLink bandwidth to 300 Gigabytes/second. The links can be used exclusively for GPU-to-GPU
  communication, as in the DGX-1 with V100 topology shown in Figure 2, or for some combination of GPU-to-GPU and
  GPU-to-CPU communication, as shown in Figure 3 (currently only available in combination with Power8/9 processors).
The tensor core of the Volta was explicitly added for deep learning workloads. The `NVIDIA Deep Learning SDK
<>`_ provides
powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications. It includes
libraries for deep learning primitives, inference, video analytics, linear algebra, sparse matrices, and multi-GPU
communications.

Feedback for software developers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Several approaches have been developed to exploit the full power of GPUs: from `CUDA 9.0
<>`_, a parallel computing platform and application programming interface specific to NVIDIA GPUs, to the latest
version of `OpenMP 4.5 <>`_, which
contains directives to offload computational work from the CPU to the GPU. While CUDA is currently likely to achieve the
best performance from the device, OpenMP allows for better portability of the code across different architectures.
Finally, the `OpenACC <>`_ open standard is an intermediate between the two: more
similar to OpenMP than CUDA, but allowing better usage of the GPU. Developers are strongly advised to look into these
language paradigms.
Moreover, it is important to consider that there are several issues linked to hybrid architectures, such as CPU-GPU and
GPU-GPU communication bandwidth (the latter greatly improved through NVLink), direct access through `Unified Virtual
Addressing <>`_, and the presence
of new APIs for programming (such as `Tensor Core
<>`_ multiplications, specifically designed for deep
learning algorithms).
Finally, it is important to stress the improvements made by NVIDIA to the implementation of `Unified Memory
<>`_. This
allows the system to automatically migrate data allocated in Unified Memory between host and device, so that it looks
like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU, greatly simplifying
programmability.
At this stage, GPU programming is quite mainstream and there are many training courses available online; see for
example the `NVIDIA education site <>`_ for material related to CUDA and
OpenACC. Material for OpenMP is more limited, but as an increasing number of compilers begin to support the OpenMP 4.5
standard, we expect the amount of such material to grow (see `this presentation on the performance of the Clang OpenMP 4.5
implementation on NVIDIA GPUs
<>`_ for a status
report as of 2016).

.. _hpc_hardware:

Currently Available Hardware
============================

For the last decade, power and thermal management has been of high importance. The entire market focus has moved
from achieving better performance through single-thread optimizations, e.g., speculative execution, towards simpler
architectures that achieve better performance per watt, provided that vast parallelism exists. The HPC community,
particularly at the higher end, focuses on the flops/watt metric since the running cost of high-end HPC systems is
so significant. It is the potential power requirements of exa-scale systems that are the limiting factor (given currently
available technologies).
The practical outcome of this is the rise of accelerating co-processors and many-core systems. In the following sections
we will discuss three such technologies that are likely to form the major computing components of the first
generation of exa-scale machines:
.. toctree::
   :maxdepth: 2
We will outline the current generation of technologies in this space and also describe the (currently) most-productive
programming model for each. We will not discuss other new CPU technologies (such as Power 9, Intel Skylake, or
ARMv8) since in comparison to these technologies they would be expected to only provide ~10% or less of the compute
power of potential exa-scale systems.
The problem with the current three-pronged advance is that it is not always easy to develop parallel programs for these
technologies and, moreover, those parallel programs are not always performance portable between each technology,
meaning that each time the architecture changes the code may have to be rewritten. While there are open standards
available for each technology, each product currently has a different preferred standard, championed by the
individual vendor (and therefore the best performing on that product).
In general, we see a clear trend towards more complex systems, which is expected to continue over the next decade.
These developments will significantly increase software complexity, demanding more and more intelligence across
the programming environment, including compiler, run-time and tool intelligence driven by appropriate programming
models. Manual optimization of the data layout, placement, and caching will become uneconomic and time
consuming, and will, in any case, most likely soon exceed the abilities of the best human programmers.

Impact of Deep Learning
-----------------------

Traditional machine learning uses handwritten feature extraction and modality-specific machine learning algorithms
to label images or recognize voices. However, this method has several drawbacks in both time-to-solution and accuracy.
Today’s advanced deep neural networks use algorithms, big data, and the computational power of the GPU (and
other technologies) to change this dynamic. Machines are now able to learn at a speed, accuracy, and scale that are
driving true artificial intelligence and AI Computing.
Deep learning is used in the research community and in industry to help solve many big data problems such as computer
vision, speech recognition, and natural language processing. Practical examples include:
* Vehicle, pedestrian and landmark identification for driver assistance
* Image recognition
* Speech recognition and translation
* Natural language processing
* Life sciences
The influence of deep-learning on the market is significant with the design of commodity products such as the Intel
MIC and NVIDIA Tesla being heavily impacted. Silicon is being dedicated to deep learning workloads and the scientific
workloads for these products will need to adapt to leverage this silicon.

.. _hpc_resources:

Accessing HPC Resources in Europe
=================================

As far as the Partnership for Advanced Computing in Europe (PRACE) initiative is concerned, the complete list of
available resources is shown in the figure below.

.. image:: ../images/PRACE-Resources.png
   :width: 80%
   :align: center
Access to PRACE resources can be obtained by application to the `PRACE calls <>`_.
Moreover, the Distributed European Computing Initiative (DECI) is designed for projects requiring access to resources
not currently available in the PI’s own country but where those projects do not require resources on the very largest
(Tier-0) European Supercomputers or very large allocations of CPU. To obtain resources from the DECI program,
applications should be made via the `DECI calls <>`_.

.. _hpc_guidelines:

HPC Programming Guidelines
==========================

Once we begin to discuss high performance computing (HPC), we necessarily must discuss not only the latest
hardware technologies, but also the latest software technologies that make exploiting the capabilities of that hardware
possible.

Hardware Developments
---------------------

There are a number of different organisations and projects that are generating roadmaps about the current and future
technologies that are (or may be in the future) available in the HPC space. `Eurolab-4-HPC
<>`_ has summarised many of these in the `Eurolab-4-HPC Long-Term Vision on
High-Performance Computing <>`_. Here
we focus on a small subset of the content of these roadmaps (primarily from `Eurolab-4-HPC
<>`_ and `ETP4HPC <>`_) that are most likely to impact the target
community of E-CAM in the 3-5 year horizon.
.. toctree::
   :maxdepth: 2

Software Developments
---------------------

It is clear that the hardware developments described above will greatly impact the software development practices of
the E-CAM development community. For this reason, we highlight the language standards, runtime environments,
workflows and software tools that can help E-CAM developers to deliver high quality, resilient software for current
and next generation machines.
.. toctree::
   :maxdepth: 2

Accessing Resources
-------------------

All of the above is academic unless you have access to resources with which to develop and test new software. There are many
potential ways to access HPC resources; we simply highlight a limited set of the possibilities here.
.. toctree::
   :maxdepth: 2

.. _programming_paradigms:

Programming for HPC
===================

C++17
-----

C++17 is the name for the most recent revision of the `ISO/IEC 14882 <>`_
standard for the C++ programming language.
Previous C++ versions showed very limited parallel processing capabilities on multi/many-core architectures.
This situation changes with C++17, in which a parallelised version of the `Standard Template Library
<>`_ is
included. The STL is a software library for C++ which has four components: algorithms, containers, functors
and iterators. `"Parallel STL advances the evolution of C++, adding vectorization and parallelization capabilities
without resorting to nonstandard or proprietary extensions, and leading to code modernization and the development of
new applications on modern architectures." <>`_
A `multi-threading programming model for C++ <>`_ has been
supported since C++11.

Fortran 2015
------------

`Fortran 2015 <>`_ is a minor revision of Fortran 2008 (the revision in which
Fortran became a Partitioned Global Address Space (PGAS) language, with the introduction of `coarrays
<>`_). The revisions mostly target additional parallelisation features and
increased interoperability with C.
Most Fortran-based software E-CAM sees in practice is implemented in Fortran 95, and there appears to be little awareness
of the parallel features of the latest Fortran standards. E-CAM is considering organising a workshop to address
this lack of awareness (similar to the "`Software Engineering and Parallel Programming in Modern Fortran <>`_"
course held at Cranfield University).

It should be noted that `compiler support for the latest Fortran standards is limited
<>`_. This is most likely due to the fact
that Fortran is not widely used outside of scientific research (limiting its commercial scope).

The (potential) role of Python
------------------------------

Given that it is an interpreted language (i.e., it is only compiled at runtime), Python is not usually discussed much
in the HPC space since there is limited scope for control over many factors that influence performance. Where we
are observing a lot of growth is where applications are being written in languages like C++ under the hood but are
intended to be primarily used via their Python interfaces.
This is a valuable, and user friendly, development model that allows users to leverage Python for fast prototyping while
maintaining the potential for high performance application codes.
A warning to would-be users: `Python 2 will stop being developed in 2020 <>`_, so please make
sure that your code is Python 3 compliant.

Open Standards
--------------

We describe here some of the open standards that are most likely to be leveraged on next generation HPC resources.

MPI
~~~

Now more than 25 years old, the Message Passing Interface (MPI) is still with us and remains the de facto standard for
internode communication (though it is not the only option; alternatives such as `GASNet <>`_
exist). `MPI-3.1 <>`_ was approved
by the MPI Forum on June 4, 2015. It was mainly an errata release for MPI 3.0, which had included some important enhancements
to MPI:
* Nonblocking collectives
* Sparse and scalable irregular collectives
* Enhancements to one-sided communication (very important for extreme scalability)
* Shared memory extensions (on clusters of SMP nodes)
* Fortran interface
Maintaining awareness of the `scope of past and future updates to the MPI standard
<>`_ is
important since it is the latest features that target the latest architectural developments.

OpenMP
~~~~~~

OpenMP is also 20 years old and remains the most portable option for on-node workloads. The standard has introduced
new features to deal with increasing node-level heterogeneity (device offloading, such as to the GPU, in
particular) and varied workloads (task-level parallelism).

From GCC 6.1, OpenMP 4.5 is fully supported for C and C++ (with Fortran support coming in the GCC 7 series). The
`level of OpenMP support among other compilers <>`_ varies.

OpenACC
~~~~~~~

`OpenACC <>`_ (for open accelerators) is a programming standard for parallel computing
developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous
CPU/GPU systems. Since the paradigm is very similar to the latest OpenMP specs, a future merger into OpenMP is not unlikely.
It should be noted that CUDA (with the `nvcc compiler
<>`_) is still the most commonly used (and highest performing)
library for programming NVIDIA GPUs.

OpenCL
~~~~~~

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms
consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs),
field-programmable gate arrays (FPGAs, see Section 2.4 for the extreme relevance of this) and other processors or
hardware accelerators.
OpenCL 2.2 brings the OpenCL C++ kernel language into the core specification for significantly enhanced parallel
programming productivity. When releasing OpenCL version 2.2, the Khronos Group announced that OpenCL would
be merging into `Vulkan <>`_ (which targets high-performance realtime
3D graphics applications) in the future, leaving some uncertainty as to how this may affect the HPC space.

Runtime System Approaches
-------------------------

As noted already, programming paradigm standards are moving forward to adapt to the technologies that we see in the
market place. The complexity of the hardware infrastructure necessarily brings complexity to the implementation of the
programming standards.
There are a number of programming models under development that leverage runtime systems. They promise to abstract
away the hardware during the development process, with the proviso that tuning at runtime may be required. Our
experience to date with these systems is limited, so we simply provide a list of three such systems here (which is certainly
not exhaustive), in no particular order:
* `HPX <>`_, a C++ Standard Library for concurrency and parallelism. The goal of the HPX project is to create a high
quality, freely available, open source implementation of `ParalleX <>`_ concepts for conventional and future systems
by building a modular and standards conforming runtime system for SMP and distributed application environments.
(Most recent release: v1.0, April 2017)
* `Kokkos <>`_ implements a programming model in C++ for writing performance portable applications targeting all
major HPC platforms. For that purpose it provides abstractions for both parallel execution of code and data
management. Kokkos is designed to target complex node architectures with N-level memory hierarchies and
multiple types of execution resources. It currently can use OpenMP, Pthreads and CUDA as backend programming
models. (Most recent release: v2.04.04, 11 Sept 2017)
* `OmpSs <>`_ is an effort to integrate features from the StarSs programming model developed at Barcelona Supercomputing
Centre (BSC) into a single programming model. In particular, the objective is to extend OpenMP with
new directives to support asynchronous parallelism and heterogeneity (devices like GPUs). However, it can
  also be understood as new directives extending other accelerator-based APIs like CUDA or OpenCL. The OmpSs
  environment is built on top of BSC's Mercurium compiler and Nanos++ runtime system. (Most recent release:
  v17.06, June 2017)

Feedback for software developers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Awareness of the latest standards and the status of their implementations is critical at all times during application
development. The adoption of new features from standards is likely to have a large impact on the scalability of application
codes, precisely because it is very likely that these features exist to target the scalability challenges of modern
systems. Nevertheless, you should be aware that there can be very long gaps between the publication of a standard
and its implementation in compilers (which is frequently also biased by who is pushing which standard and why: Intel
pushes OpenMP because of their Xeon Phi line, NVIDIA, who now own PGI, push OpenACC because of their GPUs,
and AMD pushed OpenCL for their own GPUs to compete with CUDA). The likelihood of there being a single common (set
of) standards that performs well on all architectures is not high in the immediate future. For the typical developers that
we see in E-CAM, MPI+OpenMP remains the safest bet and is likely to perform well, as long as the latest standards are
used.
More disruptive software technologies (such as GASNet) are more likely to gain traction if they are used by popular
abstraction layers (which could be PGAS languages, runtime systems or even domain specific languages) "under
the hood". This would make the transition to new paradigms an implementation issue for the abstraction layer. Indeed,
given the expected complexity of next generation machines, new programming paradigms that help remove the
performance workload from the shoulders of scientific software developers will gain increasing importance.
As you may have noticed in the previous discussion, the computer scientists developing these abstractions are working
mostly in C++, and the implementation of new standards in compilers is also seen first for C++. From a practical
perspective this has some clear implications: if you want access to the latest software technologies, then you had better
consider C++ for your application. This may appear harsh given that the Fortran standard has clear capabilities in this
space, but it is a current reality that cannot be ignored. Also, given that the vast majority of researchers will eventually
transition to industry (because there simply aren't enough permanent jobs in academia), it is more responsible to
ensure they have programming expertise in a language that is heavily used in the commercial space. Finally, the ecosystem
surrounding C++ (IDEs, testing frameworks, libraries, ...) is much richer because of its use in industry and
computer science.
Taking all of the above into consideration, if you are starting out with an application we would distil the discussion into
the following advice: prototype your application using Python leveraging the Python APIs to the libraries you need;
write unit tests as you go; and, when you start doing computationally intensive work, use C++ with Python interfaces
to allow you to squeeze out maximal performance using the latest software technologies.

Programming for an HPC Environment
==================================

.. Incorporate the output of D7.3 to generate this section
Due to the nature of the HPC environment (novel hardware, latest techniques, remote resources, ...), there are many
specific things that need to be considered that impact the software development process.
.. toctree::
   :maxdepth: 3
.. * Software
.. * Programming Paradigms
.. * Languages
.. * Applications and Libraries
.. * Portability and Optimisation
.. * Scalability
.. * Reproducability
.. * Hardware
.. * General Architecture
.. * Where is the compute power on modern HPC machines?
.. * Memory
.. * I/O
.. * Workflow
.. * Acquiring resources (getting compute time)
.. * Accessing resources (ssh, gui's,...)
.. * Resource management (slurm, UNICORE, ...)
.. * Moving data