Dear QE users,


I need some help in optimizing the different parallelization levels of my QE 
calculations. Unfortunately, our HPC center is going to start billing the 
research groups for our calculations so I'm currently working on making our QE 
calculations as efficient as possible to avoid large bills at the end of the 
year. In our HPC center we have two clusters at our disposal with different 
architectures and different billing amounts, so I wanted to figure out which 
one to use and how many nodes to request per calculation.



The two clusters have the following architecture:
- Cluster 1 (Leibniz): 152 compute nodes containing 2 Xeon E5-2680v4 
CPUs @ 2.4 GHz (Broadwell), 14 cores each (28 cores per node in total)
- Cluster 2 (Vaughan): 152 compute nodes containing 2 AMD Epyc 7452 
CPUs @ 2.35 GHz (Rome), 32 cores each (64 cores per node in total)



From the QE documentation I understand that only a few parameters are 
important for the parallelization. These parameters are given below with their 
values for my systems:
- No. of k-points = 2

- 3rd dimension in the smooth FFT grid = 405
- 3rd dimension in the dense FFT grid = 720

- No. of KS states = 457



I am currently using rather arbitrary parallelization settings: I request 8 
nodes (of 28 cores each) per calculation, set the k-point parallelization to 2 
pools (i.e., -nk 2), and use the serial algorithm for subspace diagonalization 
(i.e., -nd 1) to make the calculations complete within a reasonable timescale.



I've already read a lot about the parallelization implemented in QE, but I 
still have several questions relating to the different levels of 
parallelization:



k-point parallelization:

From what I understand, having only 2 k-points in my calculations means that I 
can subdivide the processors into at most 2 pools, as the number of pools 
cannot exceed the number of k-points, so that each pool of processors handles 
a single k-point. Taking more pools would be detrimental to performance, as 
multiple pools would handle a single k-point, resulting in heavy communication 
between these pools. Therefore, I am wondering whether it is also bad to 
request more than 2 nodes for my calculations, considering I only have two 
k-points and subdivide my processors into 2 pools. Requesting more than 2 
nodes would mean every pool contains processors spread across multiple nodes, 
so each pool would require inter-node communication for its computations, 
which would slow down the calculation.
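To make the question concrete, here is a small sketch of how I picture the 
ranks being split into pools (the numbers are just my current setup, 8 
Broadwell nodes with -nk 2; this is my own arithmetic, not anything extracted 
from the QE source):

```python
# Sketch: how MPI ranks would split into k-point pools.
# Assumes my current setup: 8 nodes x 28 cores, -nk 2.
nodes, cores_per_node, npool = 8, 28, 2

total_ranks = nodes * cores_per_node                # MPI ranks launched
ranks_per_pool = total_ranks // npool               # ranks per k-point pool
nodes_per_pool = ranks_per_pool / cores_per_node    # nodes each pool spans

print(total_ranks, ranks_per_pool, nodes_per_pool)  # 224 112 4.0
```

So with 8 nodes each pool spans 4 nodes, which is exactly the inter-node 
communication I am worried about.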



FFT parallelization:

It is stated in the documentation of pw.x that the parallelization on PWs 
yields best results when the number of processors in a pool is a divisor of the 
3rd dimension of the smooth (nr3s) and dense (nr3) FFT grids. Unfortunately, in 
my case the greatest common divisor of the two dimensions is 45, which is a 
poor match for the core counts of the nodes on either cluster available to me 
(28 and 64, respectively). Therefore, I was wondering whether it is okay to 
just manually alter the third dimensions with nr3=X and nr3s=X, to make sure 
that the number of processors per pool is a common divisor of the third 
dimensions of the FFT grids?
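To double-check my arithmetic, a quick sketch listing the divisors shared by 
the two third dimensions (any common divisor of nr3s and nr3 must divide 
their gcd):

```python
import math

nr3s, nr3 = 405, 720  # 3rd dimensions of the smooth and dense FFT grids

# Common divisors of both dimensions are exactly the divisors of the gcd.
g = math.gcd(nr3s, nr3)
common = [n for n in range(1, g + 1) if g % n == 0]

print(g, common)  # 45 [1, 3, 5, 9, 15, 45]
```

None of 1, 3, 5, 9, 15, or 45 lines up with 28 or 64 cores per node, hence 
my question about altering the grids.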



Bands and tasks parallelization:

If I'm not mistaken, I should not use band or task-group parallelization, 
because band parallelization is only useful when using hybrid functionals 
(which I don't use) and task-group parallelization is only necessary when the 
number of processors exceeds the number of FFT planes (which is not the case 
here, unless I request an excessive number of nodes, which would already be 
detrimental due to the inter-node communications).
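As a sanity check on that reasoning with my current numbers (assuming the 
relevant comparison is ranks per pool versus FFT planes along the third 
dimension, which is my reading of the documentation, not a verified fact):

```python
# Task groups (-ntg) should only start to matter once a pool has more
# MPI ranks than FFT planes to distribute among them.
# Numbers below assume my current setup: 8 nodes x 28 cores, 2 pools.
ranks_per_pool = 8 * 28 // 2   # = 112 ranks per k-point pool
nr3s = 405                     # planes along the smooth grid's 3rd dimension

needs_task_groups = ranks_per_pool > nr3s
print(ranks_per_pool, needs_task_groups)  # 112 False
```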



OpenMP parallelization:

OpenMP cannot be used to coordinate processes across multiple nodes, so if I 
wanted to use this level of parallelization, I would have to make sure that 
the number of processors in a pool is lower than or equal to the number of 
processors on a single node, right?
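In case it helps to make this concrete, these are the hybrid MPI x OpenMP 
layouts I have in mind, i.e., the (ranks per node, threads per rank) pairs 
that fill a node exactly (just the factorizations of the core counts, not a 
claim about which layouts QE performs well with):

```python
def hybrid_layouts(cores_per_node):
    """All (MPI ranks per node, OpenMP threads per rank) pairs that
    use every core on a node exactly once."""
    return [(r, cores_per_node // r)
            for r in range(1, cores_per_node + 1)
            if cores_per_node % r == 0]

print(hybrid_layouts(28))  # includes e.g. (4, 7) and (28, 1)
print(hybrid_layouts(64))  # includes e.g. (8, 8) and (64, 1)
```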



Any help on the subject would be greatly appreciated!



Thanks in advance,

Léon Luntadila Lufungula

Structural Chemistry Group

University of Antwerp, Belgium

