Dear Ye, dear Paolo,

I re-ran the benchmarks for my test case: a single MD step of a smallish supercell of a certain oxide semiconductor, with PBE and PAW datasets from PSlibrary. The previous timings were measured from the start of the MD run until the end of the 1st SCF iteration of the 2nd MD step, and the new ones below use the same window.
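For reference, the runs were launched along these lines (a sketch, not my exact script: the input/output file names are placeholders, and -ndiag 49 merely makes explicit the default 7*7 diagonalization sub-group reported in the output quoted further down; the ELPA/ScaLAPACK choice is made at build time, not on the command line):

    # 4-node case: 4 nodes x 56 MPI ranks = 224 ranks, one pool per k-point
    mpirun -np 224 pw.x -npool 4 -ndiag 49 -input md.in > md.out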
Interestingly, ELPA gave no advantage over ScaLAPACK here (ScaLAPACK was faster in every case), and diago_david_ndim=2 made things significantly slower. The ScaLAPACK build is QE 6.5; the ELPA build is last month's development version. Both were compiled with Intel 2020 and Intel MPI. Here are the numbers (speedup = 1-node time / 4-node time for the same solver and diago_david_ndim):

    MPI per node  npool  nodes  solver     diago_david_ndim  time / s  speedup vs 1 node
    56            4      1      ELPA       4                     1335          -
    56            4      1      ELPA       2                     1931          -
    56            4      1      ScaLAPACK  4                      976          -
    56            4      1      ScaLAPACK  2                     1486          -
    56            4      4      ELPA       4                      367       3.64
    56            4      4      ELPA       2                      729       2.65
    56            4      4      ScaLAPACK  4                      357       2.73
    56            4      4      ScaLAPACK  2                      555       2.68

Best,
Michal

On Wed, 27 May 2020 at 15:47, Ye Luo <[email protected]> wrote:
> 3.26x seems possible to me. It can be caused by load imbalance in the
> iterative solver among the 4 k-points.
> Could you list the times in seconds with 1 node and 4 nodes, i.e. the
> ones you used to calculate the 3.26x?
> Could you also try diago_david_ndim=2 under "&ELECTRONS" and provide the
> 1- and 4-node times in seconds?
>
> In addition, you may try ELPA, which usually gives better performance
> than ScaLAPACK.
>
> Thanks,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
> On Wed, May 27, 2020 at 9:27 AM Michal Krompiec <[email protected]>
> wrote:
>
>> Hello,
>> How can I minimize inter-node MPI communication in a pw.x run? My
>> system doesn't have InfiniBand, and inter-node MPI can easily become
>> the bottleneck.
>> Let's say I'm running a calculation with 4 k-points on 4 nodes, with
>> 56 MPI tasks per node. I would then use -npool 4 to create 4 pools
>> for the k-point parallelization. However, it seems that the
>> diagonalization is by default parallelized imperfectly (or is it?):
>>
>>   Subspace diagonalization in iterative solution of the eigenvalue problem:
>>     one sub-group per band group will be used
>>     scalapack distributed-memory algorithm (size of sub-group: 7*7 procs)
>>
>> So far, the speedup on 4 nodes vs 1 node is 3.26x. Is that normal, or
>> does it look like it can be improved?
>>
>> Best regards,
>>
>> Michal Krompiec
>> Merck KGaA
>> Southampton, UK
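P.S. In case anyone finds this thread in the archives: diago_david_ndim is a pw.x input parameter and belongs in the &ELECTRONS namelist. A minimal sketch of the relevant block (all other keywords omitted; the values shown are just the ones compared in the table above):

    &ELECTRONS
       diagonalization  = 'david'  ! Davidson iterative solver (the default)
       diago_david_ndim = 2        ! max workspace size, in units of the number
                                   ! of bands; the QE 6.5 default is 4
    /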
_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users
