Thank you Nick and Salvador for your comments. So Nick, basically you are saying that the diagonalization time might play no role. That is at variance, for instance, with Wien2k, where diagonalization is the most time-consuming step. In fact, my expectation holds there; I verified it with a similar cell and 9 k-points.
In that case "top" shows a first stage of 5 processes running at about 4/5=80% CPU power (and more or less stable) and a 2nd stage of 4 procs, running at 100%. This is not MPI, but a parallel strategy based on scripts (hope you are aware). The same experiment performed with "mpirun -np 5 ..." and Siesta, shows more jumpy figures for CPU usage. One task might be at 100%, another at 60%, and so on, as if Linux were playing with tasks like a juggler. To give you some feeling, please look at the numbers here, ----------------------------------------------------------------------- * Running on 4 nodes in parallel ... snipped ... siesta: iscf Eharris(eV) E_KS(eV) FreeEng(eV) dDmax Ef(eV) siesta: 1 -124261.2908 -124261.2891 -124261.2891 0.0001 -2.5494 timer: Routine,Calls,Time,% = IterSCF 1 1637.906 99.72 elaps: Routine,Calls,Wall,% = IterSCF 1 410.919 99.72 * Running on 5 nodes in parallel ... snipped ... siesta: iscf Eharris(eV) E_KS(eV) FreeEng(eV) dDmax Ef(eV) siesta: 1 -124261.2908 -124261.2891 -124261.2891 0.0001 -2.5494 timer: Routine,Calls,Time,% = IterSCF 1 1654.558 99.64 elaps: Routine,Calls,Wall,% = IterSCF 1 415.150 99.64 ------------------------------------------------------------------------ Those elapsed times are so close ... there must be an easy explanation. Best, Roberto On 11/02/2015 04:14 PM, Nick Papior wrote:
On 11/02/2015 04:14 PM, Nick Papior wrote:

Basically:

Diag.ParallelOverK false  uses ScaLAPACK to diagonalize the Hamiltonian.
Diag.ParallelOverK true   uses LAPACK to diagonalize the Hamiltonian.

If you have a very large system, you will not get anything out of the latter option (other than using an enormous amount of memory; see the rough memory sketch below). Only for an _extreme_ number of k-points is the latter favourable, though there are exceptions; it is intended for small bulk calculations with many k-points.

Lastly, you have a quad-core machine, run "mpirun -np 5", and expect that to run faster. That is a wrong assumption. And diagonalization is not everything in the program: check your TIMES file to figure out whether it _is_ the diagonalization or a mixture.

2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:

Dear everyone,

I seem to have a misunderstanding of how the Diag.ParallelOverK feature works; any comment would be much appreciated. I have a large metallic cell, though still with 9 k-points, that runs on a quad-core PC; moreover, routine diagkp shows the k-points are distributed round robin among the processes. Thus I was expecting "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...", as judged from the elapsed time of individual SCF steps. Clearly, in the latter case, the 9th k-point would be taken by process 0 while the other three processes remain waiting, right? However, my expectations turned out to be wrong; in fact the second alternative (-np 4) appears to be a tiny bit faster. Why?

Thanks in advance,
Roberto P.

--
Kind regards
Nick
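To put rough numbers on the memory trade-off Nick describes, here is a small sketch (plain Python, not Siesta code; the basis size is a made-up number, not taken from the calculation in this thread):

    # Rough per-rank memory for the two diagonalization modes.
    # Diag.ParallelOverK false -> ScaLAPACK: each k-point's H and S matrices are
    #   block-distributed over all P ranks (memory per rank ~ N^2 / P).
    # Diag.ParallelOverK true  -> LAPACK: every rank holds and diagonalizes
    #   full N x N matrices for the k-points it owns (memory per rank ~ N^2).
    N = 20_000            # number of basis orbitals (hypothetical)
    P = 4                 # MPI ranks
    BYTES = 16            # one complex double-precision matrix element

    full_matrix_gb = N * N * BYTES / 1e9

    print(f"ParallelOverK false: ~{full_matrix_gb / P:.1f} GB per rank per matrix")
    print(f"ParallelOverK true : ~{full_matrix_gb:.1f} GB per rank per matrix")

With Diag.ParallelOverK true each rank must hold full dense matrices on its own, which is why the option only pays off for small cells with many k-points.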
