2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:

> Thank you Nick and Salvador for your comments.
>
> So Nick, basically you're saying that diagonalization time might
> be playing no role. That is at variance, for instance, with Wien2k,
> where diagonalization is the most time-consuming step. In fact,
> my expectation holds there; verified with a similar cell
> and 9 k-points.
>
No, I am definitely not saying that! But I have no idea about how your
system is set up.
Diagonalization _is_ a big part of the computation.
How have you specified the k-points? Is it 9 k-points in total, or a 9
k-point Monkhorst-Pack grid?
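If it helps, this is roughly how a Monkhorst-Pack grid is specified in the
fdf input (illustrative values only; a 3x3x1 grid nominally gives 9 points
before symmetry reduction, so it may not match your actual setup):

  %block kgrid_Monkhorst_Pack
    3  0  0   0.0
    0  3  0   0.0
    0  0  1   0.0
  %endblock kgrid_Monkhorst_Pack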

>
> In that case "top" shows a first stage of 5 processes running at
> about 4/5 = 80% CPU power (and more or less stable) and a second stage
> of 4 processes running at 100%. This is not MPI, but a parallel
> strategy based on scripts (as you probably know).
>
wien2k is not siesta.
If wien2k is script-based, i.e. it runs steps sequentially and manages its
own processes, then sure, they behave _very_ differently and wien2k may well
give you the desired speedup. Your figures sound like hyperthreading to me.

> The same experiment performed with "mpirun -np 5 ..." and Siesta
> shows more jumpy figures for CPU usage. One task might be at 100%,
> another at 60%, and so on, as if Linux were playing with tasks
> like a juggler.
>
You are still implying usage of a quad-core machine (quad == 4), and 4 < 5.
If you _only_ have 4 physical cores (Intel hyperthreads do _not_ count as
processors), then your assumption is not correct.
How would you expect a speedup from using one more process than you have
cores on your system?
If you see this juggling, it sounds like quad == 4 and not 5.
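A quick way to check the actual core count (a sketch; the exact labels
depend on your distribution):

  $ lscpu | grep -E '^CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
  # physical cores = Socket(s) x Core(s) per socket;
  # if Thread(s) per core is 2, the extra "CPUs" are hyperthreads, not cores.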

>
> To give you some feeling, please look at the numbers here,
>
> -----------------------------------------------------------------------
> * Running on    4 nodes in parallel
>
>  ... snipped ...
>
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
> timer: Routine,Calls,Time,% = IterSCF        1    1637.906  99.72
> elaps: Routine,Calls,Wall,% = IterSCF        1     410.919 99.72
>
>
> * Running on    5 nodes in parallel
>
>  ... snipped ...
>
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
> timer: Routine,Calls,Time,% = IterSCF        1    1654.558  99.64
> elaps: Routine,Calls,Wall,% = IterSCF        1     415.150  99.64
> ------------------------------------------------------------------------
>
> Those elapsed times are so close ... there must be an easy explanation.
>
Yes, if you are using mpirun -np 5 on a quad-core machine, then the
explanation is easy and your numbers are irrelevant.
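If you want a meaningful comparison, stay within the physical cores. With 9
k-points distributed round robin, 3 processes is actually the balanced case
(3 k-points each), while with 4 processes the first one carries 3 k-points
and becomes the bottleneck. A sketch (file names are placeholders, and I am
assuming the usual stdin/stdout invocation):

  mpirun -np 3 siesta < RUN.fdf > out.np3   # 3+3+3 k-points, balanced
  mpirun -np 4 siesta < RUN.fdf > out.np4   # 3+2+2+2 k-points
  grep 'elaps.*IterSCF' out.np3 out.np4     # compare wall time per SCF step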

>
> Best,
>
> Roberto
>
>
> On 11/02/2015 04:14 PM, Nick Papior wrote:
>
>> Basically:
>> Diag.ParallelOverK false
>>   uses ScaLAPACK to diagonalize the Hamiltonian
>> Diag.ParallelOverK true
>>   uses LAPACK to diagonalize the Hamiltonian
>>
>> If you have a very large system, you will not get anything out of using
>> the latter option (other than an enormous memory footprint).
>> Only for an _extreme_ number of k-points is the latter favourable,
>> though there are exceptions.
>>
>> The latter is intended for small bulk calculations with many k-points.
>>
>> Lastly, you have a quad-core machine and run mpirun -np 5, and expect
>> that to run faster. That is a wrong assumption.
>> Also, diagonalization is not everything in the program; check your
>> TIMES file to figure out whether it _is_ the diagonalization or a
>> mixture.
>>
>>
>> 2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>
>>
>>     Dear everyone,
>>
>>     I seem to have a misunderstanding of how the Diag.ParallelOverK
>>     feature works; any comment would be much appreciated.
>>
>>     I've got a large metallic cell, though still with 9 k-points, that
>>     runs on a quad-core PC; moreover, routine diagkp shows the k-points
>>     are distributed round robin among processes. Thus I was expecting
>>     "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...",
>>     as judged from the elapsed time of individual SCF steps.
>>     Clearly, in the latter case, the 9th k-point would be taken by
>>     process 0 while the other three would remain waiting, right?
>>
>>     However, my expectations turned out to be wrong; in fact the
>>     second alternative appears to be a tiny bit faster.
>>     Why?
>>
>>     Thanks in advance,
>>
>>     Roberto P.
>>
>>
>>
>>
>> --
>> Kind regards Nick
>>
>


-- 
Kind regards Nick
