I just caught the n=5 on a quad-core, as emphasized by Nick: that means
two of your five MPI processes must share one physical core (your mpirun
asks for n=5, but you only have 4 physical cores, so one core is forced
to pick up the extra process). Each of those two processes then runs at
half speed at most, and all the others have to wait on them.
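
A back-of-the-envelope model of the two effects together (a minimal
sketch in Python: it assumes the round-robin distribution diagkp
reports, a perfect 50/50 split of the oversubscribed core, and equal
cost per k-point; real scheduling overhead would only make n=5 worse):

  def step_time(nk, nprocs, ncores, t_k=1.0):
      # k-points per rank under a round-robin distribution
      loads = [len(range(r, nk, nprocs)) for r in range(nprocs)]
      # place ranks on cores round-robin; ranks sharing a core slow down
      core_of = [r % ncores for r in range(nprocs)]
      per_core = [core_of.count(c) for c in range(ncores)]
      # wall time per SCF step is set by the slowest rank
      return max(loads[r] * per_core[core_of[r]] * t_k for r in range(nprocs))

  print(step_time(9, 4, 4))  # 3.0: rank 0 holds 3 k-points on its own core
  print(step_time(9, 5, 4))  # 4.0: rank 0 holds 2 k-points at half speed

So even in this idealized model the 5-process run comes out slower.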

-Salvador
________________________________________
From: [email protected] <[email protected]> on behalf of Salvador 
Barraza-Lopez <[email protected]>
Sent: Monday, November 2, 2015 1:21 PM
To: SIESTA
Subject: Re: [SIESTA-L] Puzzled about ParallelOverK feature

Without knowing much: n=4 would lead to a square processor grid (2x2)
in the ScaLAPACK routines, while n=5 would require a linear processor
grid (5x1), which would make ScaLAPACK and BLACS communication less
efficient in the second case... (?)
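
For what it's worth, a minimal sketch of the usual "most square"
factorization heuristic for choosing such a grid (an assumption on my
part; SIESTA's diagonalizer may pick its grid differently):

  import math

  def proc_grid(n):
      # largest divisor of n not exceeding sqrt(n) gives the most
      # square nprow x npcol grid; a prime n degenerates to n x 1
      npcol = max(d for d in range(1, math.isqrt(n) + 1) if n % d == 0)
      return n // npcol, npcol

  print(proc_grid(4))  # (2, 2): square grid, balanced BLACS rows/columns
  print(proc_grid(5))  # (5, 1): linear grid, all communication in one column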

 Best regards,
-Salvador
________________________________________
From: [email protected] <[email protected]> on behalf of RCP 
<[email protected]>
Sent: Monday, November 2, 2015 12:42 PM
To: SIESTA
Subject: [SIESTA-L] Puzzled about ParallelOverK feature

Dear everyone,

I seem to have a misunderstanding of how the Diag.ParallelOverK
feature works; any comment would be much appreciated.

I've got a large metallic cell, though still with 9 k-points, that
runs on a quad-core PC; moreover, the routine diagkp shows that the
k-points are distributed round-robin among the processes. Thus I was
expecting "mpirun -np 5 ..." to run significantly faster than
"mpirun -np 4 ...", judging from the elapsed time of individual
SCF steps. Clearly, in the latter case, the 9th k-point would be
taken by process 0 while the other three processes would sit
waiting, right?
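
To spell out that expectation (a minimal sketch, assuming the
round-robin order diagkp reports):

  # which ranks get which of the 9 k-points, for 4 and 5 processes
  for nprocs in (4, 5):
      assign = {r: [k for k in range(1, 10) if (k - 1) % nprocs == r]
                for r in range(nprocs)}
      print(nprocs, assign)
  # nprocs=4: rank 0 gets k-points 1, 5, 9 (three), the others get two
  # nprocs=5: no rank gets more than two k-points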

However, my expectations turned out to be wrong; in fact, the
second alternative appears to be a tiny bit faster.
Why?

Thanks in advance,

Roberto P.
