Thank you Nick and Salvador for your comments. So Nick, basically you are saying that the diagonalization time might play no role. That is at variance, for instance, with Wien2k, where diagonalization is the most time-consuming step. In fact, my expectation holds there; I verified it with a similar cell and 9 k-points.
In that case "top" shows a first stage of 5 processes running at about 4/5=80% CPU power (and more or less stable) and a 2nd stage of 4 procs, running at 100%. This is not MPI, but a parallel strategy based on scripts (hope you are aware). The same experiment performed with "mpirun -np 5 ..." and Siesta, shows more jumpy figures for CPU usage. One task might be at 100%, another at 60%, and so on, as if Linux were playing with tasks like a juggler. To give you some feeling, please look at the numbers here, ----------------------------------------------------------------------- * Running on 4 nodes in parallel ... snipped ... siesta: iscf Eharris(eV) E_KS(eV) FreeEng(eV) dDmax Ef(eV) siesta: 1 -124261.2908 -124261.2891 -124261.2891 0.0001 -2.5494 timer: Routine,Calls,Time,% = IterSCF 1 1637.906 99.72 elaps: Routine,Calls,Wall,% = IterSCF 1 410.919 99.72 * Running on 5 nodes in parallel ... snipped ... siesta: iscf Eharris(eV) E_KS(eV) FreeEng(eV) dDmax Ef(eV) siesta: 1 -124261.2908 -124261.2891 -124261.2891 0.0001 -2.5494 timer: Routine,Calls,Time,% = IterSCF 1 1654.558 99.64 elaps: Routine,Calls,Wall,% = IterSCF 1 415.150 99.64 ------------------------------------------------------------------------ Those elapsed times are so close ... there must be an easy explanation. Best, Roberto On 11/02/2015 04:14 PM, Nick Papior wrote:
On 11/02/2015 04:14 PM, Nick Papior wrote:

Basically:

Diag.ParallelOverK false  uses ScaLAPACK to diagonalize the Hamiltonian.
Diag.ParallelOverK true   uses LAPACK to diagonalize the Hamiltonian.

If you have a very large system, you will not get anything out of the latter option (other than using an enormous amount of memory; see the rough memory sketch below). Only for an _extreme_ number of k-points is the latter favourable, though there are exceptions; it is intended for small bulk calculations with many k-points.

Lastly, you have a quad-core machine, run "mpirun -np 5", and expect that to run faster. That is a wrong assumption. And diagonalization is not everything in the program: check your TIMES file to figure out whether it _is_ the diagonalization or a mixture.

2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:

Dear everyone,

I seem to have a misunderstanding of how the Diag.ParallelOverK feature works; any comment would be much appreciated. I have a large metallic cell, though still with 9 k-points, that runs on a quad-core PC; moreover, routine diagkp shows the k-points are distributed round robin among the processes. Thus I was expecting "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...", as judged from the elapsed time of individual SCF steps. Clearly, in the latter case, the 9th k-point would be taken by process 0 while the other three processes remain waiting, right? However, my expectations turned out to be wrong; in fact the second alternative (-np 4) appears to be a tiny bit faster. Why?

Thanks in advance,
Roberto P.

--
Kind regards
Nick
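To put rough numbers on the memory trade-off Nick describes, here is a small sketch (plain Python, not Siesta code; the basis size is a made-up number, not taken from the calculation in this thread):

    # Rough per-rank memory for the two diagonalization modes.
    # Diag.ParallelOverK false -> ScaLAPACK: each k-point's H and S matrices are
    #   block-distributed over all P ranks (memory per rank ~ N^2 / P).
    # Diag.ParallelOverK true  -> LAPACK: every rank holds and diagonalizes
    #   full N x N matrices for the k-points it owns (memory per rank ~ N^2).
    N = 20_000            # number of basis orbitals (hypothetical)
    P = 4                 # MPI ranks
    BYTES = 16            # one complex double-precision matrix element

    full_matrix_gb = N * N * BYTES / 1e9

    print(f"ParallelOverK false: ~{full_matrix_gb / P:.1f} GB per rank per matrix")
    print(f"ParallelOverK true : ~{full_matrix_gb:.1f} GB per rank per matrix")

With Diag.ParallelOverK true each rank must hold full dense matrices on its own, which is why the option only pays off for small cells with many k-points.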
