Adam Włodarczyk (Wrocław Centre of Networking and Supercomputing),
Alan O'Cais (Juelich Supercomputing Centre)
.. _htc:
#######################################
E-CAM High Throughput Computing Library
#######################################
.. contents::
    :local:
E-CAM is interested in the challenge
of bridging timescales. To study molecular dynamics with atomistic detail, timesteps on
the order of a femtosecond must be used. Many problems in biological chemistry, materials science, and other
fields involve events that only occur spontaneously after a millisecond or longer (for example,
biomolecular conformational changes or nucleation processes). This means that around :math:`10^{12}` time
steps would be needed to see a single millisecond-scale event. This is the problem of "rare
events" in theoretical and computational chemistry.
Modern supercomputers are beginning to make it
possible to obtain trajectories long enough to observe some of these processes, but to fully
characterize a transition with proper statistics, many examples are needed. To obtain many
examples, the same application must be run many thousands of times with varying inputs. Managing
this kind of computation requires a task-scheduling high throughput computing (HTC) library, whose main
elements are task definition, task scheduling and task execution.
While HTC workloads have traditionally been looked down upon in the HPC
space, the scientific use case for extreme-scale resources exists, and algorithms that require a
coordinated approach make efficient libraries that implement
this approach increasingly important there. The 5-Petaflop booster technology of `JURECA <http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/JURECA_node.html>`_
is an interesting platform in this respect, since its approach of offloading heavy
computation fits the concept outlined here very well.
Purpose of Module
_________________
This module is the first in a sequence that will form the overall capabilities of the library. In particular, this module
provides a set of decorators that wrap the `Dask-Jobqueue <https://jobqueue.dask.org/en/latest/>`_
Python library, with the aim of lowering the development cost of leveraging it for our use cases.
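To make the decorator idea concrete, here is a minimal, self-contained sketch of the pattern. The names ``mpi_task`` and ``resources`` are illustrative assumptions here; the actual decorators in ``decorators.py`` integrate with Dask-Jobqueue clusters and differ in detail:

```python
from functools import wraps

def mpi_task(cores=1, nodes=1):
    """Illustrative decorator that attaches resource requirements to a task.

    A real implementation would submit the wrapped function to a
    Dask-Jobqueue cluster with the requested resources; this sketch
    simply records them and calls the function directly.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Scheduling metadata a task scheduler could inspect later
        wrapper.resources = {"cores": cores, "nodes": nodes}
        return wrapper
    return decorator

@mpi_task(cores=48, nodes=2)
def simulate(seed):
    # Stand-in for one expensive ensemble member of the HTC workload
    return 2 * seed
```

The benefit of this design is that the scientific function body stays free of scheduling concerns; varying the inputs across thousands of runs then reduces to calling the decorated task with different arguments.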
Background Information
______________________
The initial motivation for this library is driven by the ensemble-type calculations that are required in many scientific
fields, and in particular in the materials science domain in which the E-CAM Centre of Excellence operates. The scope
for parallelisation is best contextualised by the `Dask <https://dask.org/>`_ documentation:
    A common approach to parallel execution in user-space is task scheduling. In task scheduling we break our program
    into many medium-sized tasks or units of computation, often a function call on a non-trivial amount of data. We
    represent these tasks as nodes in a graph with edges between nodes if one task depends on data produced by another.
    We call upon a task scheduler to execute this graph in a way that respects these data dependencies and leverages
    parallelism where possible, multiple independent tasks can be run simultaneously.

    Many solutions exist. This is a common approach in parallel execution frameworks. Often task scheduling logic hides
    within other larger frameworks (Luigi, Storm, Spark, IPython Parallel, and so on) and so is often reinvented.

    Dask is a specification that encodes task schedules with minimal incidental complexity using terms common to all
    Python projects, namely dicts, tuples, and callables. Ideally this minimum solution is easy to adopt and understand
    by a broad community.
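The quoted specification can be made concrete in a few lines of plain Python. The toy executor below is our own illustration of the dict/tuple/callable encoding; Dask's real schedulers are far more sophisticated:

```python
from operator import add, mul

# A Dask-style task graph: a dict whose values are either literals or
# tasks. A task is a tuple (callable, *args); a string argument that is
# also a key refers to the output of another task.
dsk = {
    "x": 1,
    "y": (add, "x", 2),    # y = x + 2
    "z": (mul, "y", 10),   # z = y * 10
}

def get(graph, key):
    """Minimal recursive executor that respects data dependencies."""
    value = graph[key]
    if isinstance(value, tuple):                    # a task: (callable, *args)
        func, *args = value
        resolved = [get(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return value                                    # a literal
```

Calling ``get(dsk, "z")`` walks the dependencies ``x`` → ``y`` → ``z`` and returns ``30``; independent branches of a larger graph could be executed in parallel, which is exactly what a task scheduler exploits.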
While we were attracted by this approach, Dask did not support *task-level* parallelisation (in particular
multi-node tasks). We researched other options (including Celery, PyCOMPSs, IPyParallel and others) and organised a
workshop that explored some of these (see https://www.cecam.org/workshop-0-1650.html for further details).
Building and Testing
____________________
The library is a Python module and can be installed with

::

    python setup.py install
More details about how to install a Python package can be found at, for example, `Install Python packages on the
research computing systems at IU <https://kb.iu.edu/d/acey>`_
To run the tests for the decorators within the library, you need the ``pytest`` Python package. You can run all the
relevant tests from the ``jobqueue_features`` directory with
::

    pytest tests/test_decorators.py
Examples of usage can be found in the ``examples`` directory.
Source Code
___________
The latest version of the library is available on the `jobqueue_features GitHub repository
<https://github.com/E-CAM/jobqueue_features>`_, the file specific to this module
is `decorators.py <https://github.com/E-CAM/jobqueue_features/blob/master/jobqueue_features/decorators.py>`_.
(The code that was originally created for this module can be seen in the specific commit ``4590a0e427112f``.)
See usage examples in the ``examples`` directory of the source code.
Licence
_______

GNU Lesser General Public License v3.0

Author of Module
________________

Emine Kucukbenli
.. contents::
    :local:
Purpose of Module
_________________
The FFTXlib module is a collection of driver routines for complex 3D fast Fourier transform (FFT) libraries,
to be used within planewave-based electronic structure calculation software.
Generally speaking, an FFT algorithm requires a data array to act on, a clear description of the
input-output sequence, and the transform domains.
In the context of planewave based electronic structure calculations, the data array may hold elements such as
electronic wavefunction :math:`\psi` or charge density :math:`\rho` or their functions.
The transform domains are direct (real) and reciprocal space,
the discretization in real space is represented as a uniform grid of the unit cell and
the discretization of the reciprocal space is in the basis of planewaves whose wavevectors
are multiples of reciprocal space vectors :math:`(\mathbf G)` .
To understand the main motivation behind FFTXlib routines we need to clarify the differences between the representation
of wavefunction and charge density in planewave based codes:
In these codes, the expansion of the wavefunction in the planewave basis is
truncated at a cut-off wavevector :math:`\mathbf G_{max}`.
Since the density is the norm-square of the wavefunction, the expansion that is consistent with
that of the wavefunctions requires a cut-off wavevector twice as large: :math:`2 \mathbf G_{max}`.
Meanwhile, the real space FFT domain is often defined by a single uniform grid of the unit cell,
so the array sizes of both :math:`\rho` and :math:`\psi` in their real space representation are the same.
Therefore, to boost optimization and to reduce numerical noise, the library implements two options for performing the FFT:
in one ('Wave') the wavevectors beyond :math:`\mathbf G_{max}` are ignored;
in the other ('Rho') no such assumption is made.
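The need for the doubled cut-off can be illustrated in one dimension with NumPy (a self-contained illustration, independent of FFTXlib's actual Fortran data structures): squaring a function whose spectrum is confined to :math:`\pm \mathbf G_{max}` produces components up to :math:`\pm 2 \mathbf G_{max}`, and nothing beyond.

```python
import numpy as np

# 1-D toy model: a "wavefunction" whose spectrum is confined to |G| <= Gmax.
n = 64                     # points in the real-space grid
gmax = 5                   # wavefunction cut-off (in units of the smallest G)
rng = np.random.default_rng(0)
c = np.zeros(n, dtype=complex)
c[: gmax + 1] = rng.standard_normal(gmax + 1)   # components G = 0 .. +Gmax
c[-gmax:] = rng.standard_normal(gmax)           # components G = -Gmax .. -1
psi = np.fft.ifft(c)                            # real-space wavefunction
rho = psi * psi.conj()                          # "density" |psi|^2, same grid
rho_g = np.fft.fft(rho)                         # back to reciprocal space
# rho_g is non-zero only for |G| <= 2*Gmax: representing the density
# consistently requires twice the wavefunction cut-off.
```

Inspecting ``rho_g`` shows the entries with :math:`|G| > 2 G_{max}` vanish to machine precision, which is precisely the assumption the 'Wave' option exploits.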
Another crucial feature of FFTXlib is that some approximations in electronic structure calculations
(such as the use of non-normconserving pseudopotentials) imply that the density is not just the
norm-square of the wavefunctions, but has spatially localized extra components. In that case,
these localized contributions may require G-vector components beyond those needed for the norm-square density
(:math:`> 2 \mathbf G_{max}`).
Hence, in such systems, the density array in reciprocal space has more elements
than in the norm-conserving case (in other words, a finer resolution, or a denser grid, is needed in real space),
while the resolution needed to represent the wavefunctions is left unchanged.
To accommodate these different grid-size requirements, and to be able to perform Fourier transforms back and forth between the grids,
the FFTXlib routines explicitly require descriptor arguments which define the grids to be used. For example,
if the potential is obtained from the density, the FFT operations on it should use the denser grid,
while FFTs on wavefunctions should use the smoother grid (corresponding to :math:`2\mathbf G_{max}` as defined before).
When the Hamiltonian's action on the wavefunctions is being calculated, the potential should be
brought from the dense to the smooth grid;
when the density is being calculated, the wavefunction norm-square should be carried from the smooth to the dense grid.
A final important feature of FFTXlib is the index mapping. In the simple case of no parallelization,
the reciprocal space arrays are, by convention, ordered in increasing order of :math:`|G|^2`,
while the real space arrays are sorted in column-major order.
Therefore, for an FFT to be performed, a map between these two orderings must be known.
This index map is created and preserved by FFTXlib.
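The idea of such a map can be sketched in a few lines of NumPy (an illustration only; FFTXlib builds and stores its maps in Fortran, inside its descriptor types):

```python
import numpy as np

# Integer G-vectors of a small 4x4x4 grid, in FFT ("wrapped") order per axis.
n1, n2, n3 = 4, 4, 4
axes = [np.fft.fftfreq(m, 1.0 / m).astype(int) for m in (n1, n2, n3)]
g1, g2, g3 = np.meshgrid(*axes, indexing="ij")

# |G|^2 of every grid point, flattened in column-major (Fortran) order,
# matching the real-space array layout described above.
g2norm = (g1**2 + g2**2 + g3**2).ravel(order="F")

# Map between the two orderings: the i-th element of the reciprocal-space
# array (sorted by increasing |G|^2) lives at column-major grid index nl[i].
nl = np.argsort(g2norm, kind="stable")
```

With such a map, an FFT routine can gather a sorted reciprocal-space array into natural grid order before transforming, and scatter the result back afterwards.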
In summary, FFTXlib allows the user to perform complex 3D fast Fourier transforms in the context of
planewave-based electronic structure software. It contains routines to initialize the array structures
and to calculate the desired grid shapes. It imposes the underlying size assumptions and provides
correspondence maps for indices between the two transform domains.
Once this data structure is constructed, forward or inverse in-place FFTs can be performed.
For this purpose FFTXlib can either use a local copy of an earlier version of FFTW (a commonly used open source FFT library),
or serve as a wrapper to external FFT libraries via conditional compilation using pre-processor directives.
It supports both MPI and OpenMP parallelization.
FFTXlib is currently employed within the Quantum ESPRESSO package, a widely used suite of codes
for electronic structure calculations and materials modeling at the nanoscale, based on
planewaves and pseudopotentials. FFTXlib is also interfaced with the "miniPWPP" module,
which solves the Kohn-Sham equations in a planewave basis and is soon to be released as part of the E-CAM Electronic Structure Library.
Background Information
______________________
FFTXlib is mainly a rewrite and optimization of the earlier FFT-related routines of Quantum ESPRESSO (pre-v6),
and ultimately their replacement.
This may shed light on some of the variable name choices, as well as on the default of the :math:`2\mathbf G_{max}` cut-off
for the expansion of the smooth part of the charge density, and on the format required for the lattice parameters in order to build the
FFT domain descriptor.
Despite many similarities, the current version of FFTXlib dramatically changes the parallel FFT strategy,
from the 1D+2D FFT performed in QE pre-v6
to a 1D+1D+1D one, to allow for greater flexibility in parallelization.
Building and Testing
____________________
A stable version of the module can be downloaded using `this link <https://gitlab.com/kucukben/fftxlib-esl-ecam>`_.
.. when fftxlib has its own repo, this link can be moved there.
Current installation and testing are done with gfortran compiler, version 4.4.7.
The configuration uses GNU Autoconf 2.69.
The commands for installation are::

    $ ./configure
    $ make libfftx

As a result, the library archive ``libfftx.a`` is produced in the ``src`` directory,
and symbolically linked to a ``lib`` directory.
.. To test whether the library is working as expected, run::
.. $ make FFTXtest
.. Besides the PASS/FAIL status of the test, by changing the bash script in the tests directory, you can perform your custom tests. Read the README.test documentation in the tests subdirectory for further details about the tests.
To see how the library works in a realistic scenario of an electronic structure calculation, run::

    $ make FFTXexamples
.. Besides the PASS/FAIL status of the example, by changing the bash script in the examples directory, you can create your custom examples.
A mini-app will be compiled in the ``src`` directory and symbolically linked into the ``bin`` directory.
The mini-app simulates an FFT scenario with a test unit cell and a planewave expansion cutoff.
It creates the FFT structures and tests forward and backward transforms on a sample array, reporting timings.
Read the README.examples documentation in the ``examples`` subdirectory for further details.
Source Code
____________
The FFTXlib bundle corresponding to the stable release can be downloaded from this `link <https://gitlab.com/kucukben/fftxlib-esl-ecam>`_.
The source code itself can be found under the subdirectory ``src``.
Development is ongoing.
The version that corresponds to the examples and tests can be obtained with SHA ``31a6f4ecbb7ce474b0c87702c716713758f99a0a``; this will soon be replaced with a version tag.
Further Information
____________________
This documentation can be found inside the ``docs`` subdirectory.
The FFTXlib is developed with the contributions of C. Cavazzoni, S. de Gironcoli,
P. Giannozzi, F. Affinito, P. Bonfa', Martin Hilgemans, Guido Roma, Pascal Thibaudeau,
Stephane Lefranc, Nicolas Lacorne, Filippo Spiga, Nicola Varini, Jason Wood, Emine Kucukbenli.
The first Meso- and Multi-scale ESDW was held in Barcelona, Spain, in July 2017. The following modules have been produced:
GC-AdResS
---------
These modules are connected to the Adaptive Resolution Simulation implementation in GROMACS.
.. toctree::
    :glob:
    :maxdepth: 1

    ./modules/DL_MESO_DPD/sionlib_dlmeso_dpd/readme
GC-AdResS
---------
Adaptive Resolution Simulation: Implementation in GROMACS

.. toctree::
    :glob:
    :maxdepth: 1

    ./modules/GC-AdResS/Abrupt_AdResS/readme
    ./modules/GC-AdResS/AdResS_RDF/readme
    ./modules/GC-AdResS/Abrupt_Adress_forcecap/readme
    ./modules/GC-AdResS/AdResS_TF/readme
    ./modules/GC-AdResS/LocalThermostat_AdResS/readme
    ./modules/GC-AdResS/Analyse_Tools/readme
    ./modules/GC-AdResS/Analyse_VACF/readme
.. _ALL_background:
ALL (A Load-balancing Library)
------------------------------
Most modern parallelized (classical) particle simulation programs are based on a spatial decomposition method as an
underlying parallel algorithm: different processors administrate different spatial regions of the simulation domain and
keep track of those particles that are located in their respective region. Processors exchange information

* in order to compute interactions between particles located on different processors,
* to exchange particles that have moved to a region administrated by a different processor.

This implies that the workload of a given processor is very much determined by its number of particles, or, more
precisely, by the number of interactions that are to be evaluated within its spatial region.
Certain systems of high physical and practical interest (e.g. condensing fluids) dynamically develop into a state where
the distribution of particles becomes spatially inhomogeneous. Unless special care is being taken, this results in a
substantially inhomogeneous distribution of the processors’ workload. Since the work usually has to be synchronized
between the processors, the runtime is determined by the slowest processor (i.e. the one with highest workload). In the
extreme case, this means that a large fraction of the processors is idle during these waiting times. This problem
becomes particularly severe if one aims at strong scaling, where the number of processors is increased at constant
problem size: Every processor administrates smaller and smaller regions and therefore inhomogeneities will become more
and more pronounced. This will eventually saturate the scalability of a given problem, already at a processor number
that is still so small that communication overhead remains negligible.
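The synchronization argument can be quantified with a toy calculation (illustrative numbers only):

```python
# Four spatial domains with an inhomogeneous particle distribution.
loads = [120, 80, 30, 10]         # work units (~particles) per processor

step_time = max(loads)            # synchronization: the slowest processor wins
utilisation = sum(loads) / (len(loads) * step_time)
idle_fraction = 1.0 - utilisation # fraction of processor time spent waiting
```

Here half of the available processor time is wasted waiting; with a perfectly balanced distribution (``[60, 60, 60, 60]``) the same total work would finish in half the wall-clock time per step.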
The solution to this problem is the inclusion of dynamic load balancing techniques. These methods redistribute the
workload among the processors, by lowering the load of the most busy cores and enhancing the load of the most idle ones.
Fortunately, several successful techniques are known already to put this strategy into practice. Nevertheless, dynamic
load balancing that is both efficient and widely applicable implies highly non-trivial coding work. Therefore it has
not yet been implemented in a number of important codes of the E-CAM community, e.g. DL_Meso, DL_Poly, Espresso,
Espresso++, to name a few. Other codes (e.g. LAMMPS) have implemented somewhat simpler schemes, which however might turn
out to lack sufficient flexibility to accommodate all important cases. Therefore, the ALL library was created in the
context of an Extended Software Development Workshop (ESDW) within E-CAM (see `ALL ESDW event details <https://www.e-cam2020.eu/legacy_event/extended-software-development-workshop-for-atomistic-meso-and-multiscale-methods-on-hpc-systems/>`_
), where code developers of CECAM community codes were invited together with E-CAM postdocs, to work on the
implementation of load balancing strategies. The goal of this activity was to increase the scalability of these
applications to a larger number of cores on HPC systems, for spatially inhomogeneous systems, and thus to reduce the