3.26x seems plausible to me. It can be caused by load imbalance in the iterative solver among the 4 k-point pools. Could you list the times in seconds for the 1-node and 4-node runs, i.e. the numbers you used to calculate 3.26x? Could you also try diago_david_ndim=2 under &ELECTRONS (a minimal sketch follows) and provide the 1- and 4-node times in seconds as well?
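For reference, a minimal sketch of the input change, assuming the rest of your &ELECTRONS namelist stays as it is. diago_david_ndim caps the dimension of the Davidson workspace; lowering it to 2 trades a few extra iterations for less memory and less communication per diagonalization:

    &ELECTRONS
      diago_david_ndim = 2
    /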
In addition, you may try ELPA, which usually gives better performance than ScaLAPACK (see the P.S. below for a configure sketch).

Thanks,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Wed, May 27, 2020 at 9:27 AM Michal Krompiec <[email protected]> wrote:
> Hello,
> How can I minimize inter-node MPI communication in a pw.x run? My
> system doesn't have InfiniBand, and inter-node MPI can easily become
> the bottleneck.
> Let's say I'm running a calculation with 4 k-points, on 4 nodes, with
> 56 MPI tasks per node. I would then use -npool 4 to create 4 pools for
> the k-point parallelization. However, it seems that the
> diagonalization is by default parallelized imperfectly (or isn't it?):
>
>     Subspace diagonalization in iterative solution of the eigenvalue problem:
>       one sub-group per band group will be used
>       scalapack distributed-memory algorithm (size of sub-group: 7*7 procs)
>
> So far, the speedup on 4 nodes vs 1 node is 3.26x. Is that normal, or
> does it look like it can be improved?
>
> Best regards,
> Michal Krompiec
> Merck KGaA
> Southampton, UK
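P.S. Two hedged sketches in case they help. First, building pw.x against ELPA: recent QE releases expose configure flags along these lines, but the exact names can vary between versions, so check ./configure --help; the paths below are placeholders for your installation:

    ./configure --with-scalapack=yes \
                --with-elpa-include=/path/to/elpa/include \
                --with-elpa-lib=/path/to/elpa/lib/libelpa.a

Second, an example launch line for your 4-node case (4 nodes x 56 tasks = 224 MPI ranks, one pool per k-point). pw.x rounds the linear-algebra group down to a square grid, and -ndiag 49 = 7 x 7 matches the sub-group size reported in your output:

    mpirun -np 224 pw.x -npool 4 -ndiag 49 -input pw.in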
_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
users mailing list
[email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users
