Dear everyone, I seem to have a misunderstanding of how the Diag.ParallelOverK feature works; any comment would be much appreciated.
I've got a large metallic cell, though still with 9 k-points, that runs on a quad-core PC; moreover, routine diagkp shows that k-points are distributed round-robin among processes. Thus I was expecting "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...", as judged from the elapsed time of individual SCF steps. Clearly, in the latter case, the 9th k-point would be taken by process 0 while the other three would remain waiting, right? However, my expectations turned out to be wrong; in fact, the second alternative ("mpirun -np 4 ...") appears to be a tiny bit faster. Why? Thanks in advance, Roberto P.
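
P.S. To make the expectation above concrete, here is a small Python sketch of the round-robin assignment as I understand it. It is only my own model of the distribution, not code taken from diagkp, and the function names are made up for illustration. It counts how many k-points the busiest process would handle, which is what I assumed would set the time per SCF step.

```python
# Illustrative sketch only: models a round-robin assignment of k-points to
# MPI processes. The function names are hypothetical, not SIESTA routines.

def round_robin(n_kpoints, n_procs):
    """Return, for each process rank, the k-point indices it would handle."""
    return [list(range(rank, n_kpoints, n_procs)) for rank in range(n_procs)]


def rounds_needed(n_kpoints, n_procs):
    """Largest per-process load, i.e. the number of sequential rounds."""
    return max(len(ks) for ks in round_robin(n_kpoints, n_procs))


if __name__ == "__main__":
    for n_procs in (4, 5):
        print(n_procs, round_robin(9, n_procs), rounds_needed(9, n_procs))
```

By this count, 9 k-points need three rounds on 4 processes and two rounds on 5. The model assumes every process has a core to itself, which is not the case for 5 processes on a quad-core machine.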
