Without knowing much: n=4 would lead to a square 2x2 processor grid in the ScaLAPACK routines, while n=5 (a prime) would force a linear 5x1 grid, which would make ScaLAPACK and BLACS communication less efficient in the second case... (?)
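To illustrate the point above: BLACS-based setups commonly pick the most nearly square nprow x npcol factorization of the process count, and a prime count like 5 can only give a degenerate 1x5 (or 5x1) row. This is a hypothetical sketch of that grid-selection heuristic, not code from SIESTA or ScaLAPACK:

```python
# Hypothetical sketch (not SIESTA/ScaLAPACK source): choose the most
# nearly square processor grid nprow x npcol for n MPI processes.
def square_grid(n):
    # largest divisor of n that is <= sqrt(n)
    nprow = int(n ** 0.5)
    while n % nprow != 0:
        nprow -= 1
    return nprow, n // nprow

print(square_grid(4))  # (2, 2): a square grid
print(square_grid(5))  # (1, 5): 5 is prime, so the grid degenerates to a row
```

A 1xN grid tends to communicate less efficiently than a square grid in dense linear algebra, which is the speculation in the reply above.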
Best regards,
-Salvador
________________________________________
From: [email protected] <[email protected]> on behalf of RCP <[email protected]>
Sent: Monday, November 2, 2015 12:42 PM
To: SIESTA
Subject: [SIESTA-L] Puzzled about ParallelOverK feature

Dear everyone,

I seem to have a misunderstanding of how the Diag.ParallelOverK feature works; any comment would be much appreciated. I have a large metallic cell, though still with 9 k-points, that runs on a quad PC; moreover, routine diagkp shows that the k-points are distributed round robin among the processes. Thus I was expecting "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...", as judged from the elapsed time of individual SCF steps. Clearly, in the latter case, the 9th k-point would be taken by process 0 while the other three processes would remain waiting, right? However, my expectations turned out to be wrong; in fact, the second alternative (four processes) appears to be a tiny bit faster. Why?

Thanks in advance,
Roberto P.
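The load-balance reasoning in the question can be made concrete with a small sketch. This is a hypothetical model of round-robin k-point distribution (the pattern the question attributes to diagkp), where the process holding the most k-points sets the pace of each SCF step:

```python
# Hypothetical sketch (not SIESTA source): round-robin assignment of
# nk k-points over nprocs processes; k-point i goes to process i % nprocs.
# The most-loaded process bounds the time per SCF step.
def max_kpoints_per_proc(nk, nprocs):
    counts = [0] * nprocs
    for k in range(nk):
        counts[k % nprocs] += 1
    return max(counts)

print(max_kpoints_per_proc(9, 4))  # 3: process 0 handles k-points 0, 4, 8
print(max_kpoints_per_proc(9, 5))  # 2: no process holds more than two
```

By this count alone, five processes should finish each SCF step in roughly 2/3 of the four-process time, which is why the observed result is puzzling and why the reply looks for an explanation elsewhere (the processor-grid shape).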
