I just caught the "n=5 on a quad-core" issue, as emphasized by Nick: with five MPI processes on only four physical cores, one core is forced to pick up two processes, so each of those two runs at 50% speed at most, and all the other processes have to wait on them.
-Salvador
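To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python (a toy model, not SIESTA code; it assumes busy-waiting MPI ranks, a naive rank-to-core placement, and that the per-k-point diagonalization dominates the SCF step):

# Toy model of Diag.ParallelOverK's round-robin k-point distribution
# on an oversubscribed node. Purely illustrative -- not SIESTA code.
# Assumptions: ranks busy-wait, so a core shared by R ranks runs each
# of them at 1/R speed, and rank p is pinned to core p mod ncores.

def wall_time(nk, nprocs, ncores):
    # Round-robin distribution: rank p gets k-points p, p+nprocs, ...
    kpts = [len(range(p, nk, nprocs)) for p in range(nprocs)]
    # Count how many ranks land on each core.
    ranks_per_core = [0] * ncores
    for p in range(nprocs):
        ranks_per_core[p % ncores] += 1
    # Elapsed time of rank p = (its k-points) / (its fractional core speed).
    elapsed = [kpts[p] * ranks_per_core[p % ncores] for p in range(nprocs)]
    return max(elapsed)  # all ranks synchronize at the end of the step

print(wall_time(nk=9, nprocs=4, ncores=4))  # 3 units: rank 0 holds 3 k-points
print(wall_time(nk=9, nprocs=5, ncores=4))  # 4 units: rank 0 shares a core

Under these assumptions np=5 costs about 4 time units per step against 3 for np=4, i.e. oversubscribing is predicted to be slower, in line with Roberto's observation below.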
________________________________________
From: [email protected] <[email protected]> on behalf of Salvador Barraza-Lopez <[email protected]>
Sent: Monday, November 2, 2015 1:21 PM
To: SIESTA
Subject: Re: [SIESTA-L] Puzzled about ParallelOverK feature

Without knowing much: n=4 leads to a square processor grid (2x2) in the ScaLAPACK routines, while n=5 requires a linear processor grid (5x1), which would make ScaLAPACK and BLACS communication less efficient in the second case... (?)

Best regards,
-Salvador
________________________________________
From: [email protected] <[email protected]> on behalf of RCP <[email protected]>
Sent: Monday, November 2, 2015 12:42 PM
To: SIESTA
Subject: [SIESTA-L] Puzzled about ParallelOverK feature

Dear everyone,

I seem to have a misunderstanding of how the Diag.ParallelOverK feature works; any comment would be much appreciated. I have a large metallic cell, though still with 9 k-points, that runs on a quad-core PC; moreover, routine diagkp shows that the k-points are distributed round-robin among the processes. I was therefore expecting "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...", as judged from the elapsed time of individual SCF steps. Clearly, in the latter case, the 9th k-point would be taken by process 0 while the other three processes remain waiting, right? However, my expectations turned out to be wrong; in fact the second alternative (np = 4) appears to be a tiny bit faster. Why?

Thanks in advance,
Roberto P.
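As an illustration of the processor-grid argument in the reply above, here is a toy sketch of the usual "most square grid" heuristic (shown only for illustration; it is not the actual BLACS/ScaLAPACK grid-setup code):

# Pick the "most square" nprow x npcol grid for n ranks: the largest
# divisor of n not exceeding sqrt(n) gives the squarest factorization.
import math

def most_square_grid(n):
    nprow = max(d for d in range(1, math.isqrt(n) + 1) if n % d == 0)
    return nprow, n // nprow

print(most_square_grid(4))  # (2, 2): square grid, balanced communication
print(most_square_grid(5))  # (1, 5): 5 is prime, so the grid degenerates

Since 5 is prime, the only possible grids are 1x5 or 5x1, so the communication pattern collapses onto a single row or column of processes.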
