In conclusion, and if I understand correctly: PWscf runs only with 1 MPI process per core, but it runs 800+ atoms on as few as 32 processes or as many as 512, as I was expecting.
This confirms my opinion that the problem is on the BG side and not on the PWscf side, since there is NOTHING in the Fortran code that depends upon where the MPI processes are running. Of course one can never rule out the possibility that some obscure bug is triggered only in such special cases, but it seems to me highly unlikely.

Implementation of mixed MPI-OpenMP parallelization is under development, but it will take some time. In the meantime, if you can link OpenMP-aware mathematical libraries, you might get some speedup. If you do not need k-points, and if you know how to deal with metallic systems, you might try CP instead of PWscf - it is better tested for large systems - but I don't expect a different behavior, since the routines performing parallel subspace diagonalization are the same ones that perform iterative orthonormalization, so the trouble is likely to move from "cholesky" to "ortho".

You might try to find out what is wrong, since you have two cases that should yield exactly the same results but don't. It may take a lot of time and lead to no result, though. You may also try to raise this issue with the technical staff of the computing center.

Paolo

--
Paolo Giannozzi, Democritos and University of Udine, Italy
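P.S. In case it is useful to see what "mixed MPI-OpenMP parallelization" means in practice, below is a minimal, generic Fortran sketch. It is not taken from the PWscf sources (which, as said above, do not yet implement it); the program name, array, and loop are just placeholders. The idea is that each MPI process gets its share of the data, and the loop local to each process is spread over OpenMP threads; an OpenMP-aware mathematical library does roughly the same inside its own routines.

program hybrid_sketch
  ! Generic illustration of mixed MPI-OpenMP parallelism;
  ! NOT taken from the PWscf sources.
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  integer :: ierr, provided, rank, nprocs, i
  real(8) :: local_sum, total_sum
  real(8), allocatable :: v(:)

  ! Ask for FUNNELED support: only the master thread makes MPI calls
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Each process holds its own (placeholder) share of the data
  allocate(v(n))
  v = 1.0d0 / real(nprocs, 8)

  ! Work local to one MPI process is spread over OpenMP threads ...
  local_sum = 0.0d0
  !$omp parallel do reduction(+:local_sum)
  do i = 1, n
     local_sum = local_sum + v(i)
  end do
  !$omp end parallel do

  ! ... while the sum across processes is still a plain MPI reduction
  call MPI_Reduce(local_sum, total_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)
  if (rank == 0) print '(a,f12.2,a,i0,a,i0,a)', 'sum =', total_sum, &
       ' on ', nprocs, ' MPI processes x ', omp_get_max_threads(), ' OpenMP threads'

  deallocate(v)
  call MPI_Finalize(ierr)
end program hybrid_sketch

Such a program would typically be built with the MPI compiler wrapper plus the compiler's OpenMP flag, and the number of threads per MPI process is set at run time with the OMP_NUM_THREADS environment variable.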
