Without knowing much about it: n=4 would lead to a square processor grid (2x2)
in the ScaLAPACK routines, while n=5 would require a linear processor grid
(5x1), which would make ScaLAPACK and BLACS communication less efficient in the
second case... (?)
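
To illustrate the grid shapes mentioned above, here is a minimal sketch that picks the most nearly square factorization of the process count. The helper name `most_square_grid` is hypothetical, and this is only an assumption about how a grid might be chosen, not the actual grid selection done by SIESTA or ScaLAPACK.

```python
import math


def most_square_grid(n):
    """Return (rows, cols) with rows * cols == n, as close to square as possible.

    Hypothetical helper for illustration only; not SIESTA's or ScaLAPACK's logic.
    """
    # Try the largest divisor not exceeding sqrt(n) first, then work downwards.
    for cols in range(math.isqrt(n), 0, -1):
        if n % cols == 0:
            return n // cols, cols


for n in (4, 5):
    print(n, most_square_grid(n))
```

Since 5 is prime, its only factorizations are 5x1 and 1x5, so under this assumption no two-dimensional grid is available for n=5.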

Best regards,
-Salvador
________________________________________
From: [email protected] <[email protected]> on behalf of RCP 
<[email protected]>
Sent: Monday, November 2, 2015 12:42 PM
To: SIESTA
Subject: [SIESTA-L] Puzzled about ParallelOverK feature

Dear everyone,

I seem to have misunderstood how the Diag.ParallelOverK
feature works; any comment would be much appreciated.

I've got a large metallic cell, though still with 9 k-points, that
runs on a quad-core PC; moreover, routine diagkp shows k-points are
distributed round robin among processes. Thus I was expecting
"mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...",
as judged from the elapsed time of individual scf steps.
Clearly, in the latter case, the 9th k-point would be taken by
process 0 while the other three would remain waiting, right?
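
The counting behind this expectation can be sketched as follows. The function `round_robin` is a hypothetical illustration of a round-robin assignment, not the code in diagkp, and it assumes every k-point costs the same and ignores hardware and communication effects.

```python
def round_robin(n_kpoints, n_procs):
    """Map each process rank to the k-point indices it would own.

    Hypothetical illustration of a round-robin split; not taken from diagkp.
    """
    return {rank: list(range(rank, n_kpoints, n_procs)) for rank in range(n_procs)}


for n_procs in (4, 5):
    owned = round_robin(9, n_procs)
    # The busiest process determines how many k-point "rounds" a step needs.
    busiest = max(len(kpts) for kpts in owned.values())
    print(n_procs, owned, busiest)
```

Under these assumptions, 4 processes need 3 rounds (process 0 owns three k-points) while 5 processes need only 2, which is why a speedup with "-np 5" seemed plausible.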

However, my expectations turned out to be wrong; in fact the
2nd alternative appears to be a tiny bit faster.
Why?

Thanks in advance,

Roberto P.
