A 3.26x speedup seems plausible to me. The gap from the ideal 4x can be
caused by load imbalance in the iterative solver among the 4 k-points.
Could you list the wall times in seconds for the 1-node and 4-node runs,
i.e. the values you used to calculate the 3.26x?
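Just so we are computing the same thing, this is the arithmetic I have in
mind (the wall times below are hypothetical placeholders, not your numbers):

```python
# Hypothetical wall times; replace with your measured values.
t_1node = 1000.0   # seconds on 1 node
t_4node = 306.75   # seconds on 4 nodes

nodes = 4
speedup = t_1node / t_4node      # observed speedup vs. 1 node
efficiency = speedup / nodes     # fraction of ideal linear scaling

print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.1%}")
```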
Could you also try diago_david_ndim=2 under "&ELECTRONS" and provide the 1-
and 4-node times in seconds?
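In case it helps, this is where the setting goes in the input file (the
conv_thr value is just a placeholder; keep your own settings):

```fortran
&ELECTRONS
   conv_thr         = 1.0d-8   ! placeholder; keep your own value
   diago_david_ndim = 2        ! smaller Davidson subspace, so smaller
                               ! matrices in the subspace diagonalization
/
```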

In addition, you may try ELPA, which usually gives better performance than
ScaLAPACK.
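For reference, something like the following command line is what I have in
mind; the core count matches your 4 nodes x 56 tasks, and the -ndiag value
of 49 (= 7*7) is an assumption based on the sub-group size in your output.
Note that ELPA must also be enabled when configuring QE (e.g. via the
--with-elpa-* configure options), otherwise pw.x falls back to ScaLAPACK:

```
mpirun -np 224 pw.x -npool 4 -ndiag 49 -input pw.in > pw.out
```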

Thanks,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


On Wed, May 27, 2020 at 9:27 AM Michal Krompiec <[email protected]>
wrote:

> Hello,
> How can I minimize inter-node MPI communication in a pw.x run? My
> system doesn't have InfiniBand, and inter-node MPI can easily become
> the bottleneck.
> Let's say, I'm running a calculation with 4 k-points, on 4 nodes, with
> 56 MPI tasks per node. I would then use -npool 4 to create 4 pools for
> the k-point parallelization. However, it seems that the
> diagonalization is by default parallelized imperfectly (or is it?):
>      Subspace diagonalization in iterative solution of the eigenvalue
> problem:
>      one sub-group per band group will be used
>      scalapack distributed-memory algorithm (size of sub-group:  7*  7
> procs)
> So far, speedup on 4 nodes vs 1 node is 3.26x. Is it normal or does it
> look like it can be improved?
>
> Best regards,
>
> Michal Krompiec
> Merck KGaA
> Southampton, UK
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
> users mailing list [email protected]
> https://lists.quantum-espresso.org/mailman/listinfo/users
>
