2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:

> Thank you Nick and Salvador for your comments.
>
> So Nick, basically you're saying that diagonalization time might
> be playing no role. That is at variance, for instance, with Wien2k,
> where diagonalization is the most time-consuming step. In fact,
> my expectation is correct for it; verified with a similar cell
> and 9 k-points.

No, I am definitely not saying that! But I have no idea about how your
system is set up. Diagonalization _is_ a big part of the computation.
How have you specified the k-points? Is it 9 k-points in total, or 9
k-points in the Monkhorst-Pack grid?
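For reference, the two ways of ending up with "9 k-points" look roughly
like this in the fdf input (a minimal sketch only; the 3x3x1 grid and
the cutoff value below are assumptions for illustration, not your actual
input):

------------------------------------------------------------------------
# Explicit Monkhorst-Pack grid; the number of k-points actually
# diagonalized is what remains after time-reversal/symmetry reduction,
# not necessarily 3*3*1 = 9.
%block kgrid_Monkhorst_Pack
  3  0  0   0.0
  0  3  0   0.0
  0  0  1   0.0
%endblock kgrid_Monkhorst_Pack

# Alternative: let SIESTA choose the grid from a real-space cutoff.
kgrid_cutoff  10.0 Ang
------------------------------------------------------------------------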
> In that case "top" shows a first stage of 5 processes running at
> about 4/5=80% CPU power (and more or less stable) and a 2nd stage of
> 4 procs running at 100%. This is not MPI, but a parallel strategy
> based on scripts (hope you are aware).

Wien2k is not SIESTA. If Wien2k is script-based, i.e. running sequential
executables and self-managing the processes, then sure, they behave
_very_ differently and Wien2k should give you the desired speedup.
Your figures sound like hyperthreading to me.

> The same experiment performed with "mpirun -np 5 ..." and Siesta
> shows more jumpy figures for CPU usage. One task might be at 100%,
> another at 60%, and so on, as if Linux were playing with tasks
> like a juggler.

You are still implying usage of a quad-core machine (quad == 4), and
4 < 5. If you _only_ have 4 processors (Intel hyperthreads do _not_
count as processors), then your assumption is not correct. How would
you expect a speedup by using one more process than you have cores on
your system? If you see this juggling, it sounds like quad == 4 and
not 5 (see the sketch at the end of this mail).

> To give you some feeling, please look at the numbers here,
>
> -----------------------------------------------------------------------
> * Running on 4 nodes in parallel
>
> ... snipped ...
>
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)  dDmax  Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
> timer: Routine,Calls,Time,%  = IterSCF  1  1637.906  99.72
> elaps: Routine,Calls,Wall,%  = IterSCF  1   410.919  99.72
>
>
> * Running on 5 nodes in parallel
>
> ... snipped ...
>
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)  dDmax  Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
> timer: Routine,Calls,Time,%  = IterSCF  1  1654.558  99.64
> elaps: Routine,Calls,Wall,%  = IterSCF  1   415.150  99.64
> ------------------------------------------------------------------------
>
> Those elapsed times are so close ... there must be an easy explanation.

Yes, if you are using mpirun -np 5 on a quad-core machine, then the
explanation is easy and your numbers are irrelevant.

> Best,
>
> Roberto
>
>
> On 11/02/2015 04:14 PM, Nick Papior wrote:
>
>> Basically:
>> Diag.ParallelOverK false
>>   uses ScaLAPACK to diagonalize the Hamiltonian
>> Diag.ParallelOverK true
>>   uses LAPACK to diagonalize the Hamiltonian
>>
>> If you have a very large system, you will not get anything out of
>> using the latter option (other than using an enormous amount of
>> memory). Only for an _extreme_ number of k-points is the latter
>> favourable, though there are exceptions.
>>
>> The latter is intended for small bulk calculations with many k-points.
>>
>> Lastly, you have a quad-core machine and run mpirun -np 5, and expect
>> that to run faster. That is a wrong assumption.
>> Secondly, diagonalization is not everything in the program; check your
>> TIMES file to figure out whether it _is_ the diagonalization or a
>> mixture.
>>
>>
>> 2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>
>>
>> Dear everyone,
>>
>> I seem to have a misunderstanding of how the Diag.ParallelOverK
>> feature works; any comment would be much appreciated.
>>
>> I've got a large metallic cell, though still with 9 k-points, that
>> runs on a quad-core PC; moreover, routine diagkp shows k-points are
>> distributed round-robin among processes. Thus I was expecting
>> "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...",
>> as judged from the elapsed time of individual scf steps.
>> Clearly, in the latter case, the 9th k-point would be taken by
>> process 0 while the other three would remain waiting, right?
>>
>> However, my expectations turned out to be wrong; in fact the
>> 2nd alternative appears to be a tiny bit faster.
>> Why?
>>
>> Thanks in advance,
>>
>> Roberto P.
>>
>>
>> --
>> Kind regards Nick
>
--
Kind regards Nick
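P.S. To make the counting argument explicit, here is a sketch of how the
9 k-points would be shared out under Diag.ParallelOverK, assuming a
plain round-robin order like the one routine diagkp reports (the exact
assignment order here is an assumption for illustration only):

------------------------------------------------------------------------
mpirun -np 4 ...   (4 MPI processes on 4 physical cores)
  process 0: k-points 1, 5, 9   <- one extra k-point, sets the pace
  process 1: k-points 2, 6
  process 2: k-points 3, 7
  process 3: k-points 4, 8

mpirun -np 5 ...   (5 MPI processes on 4 physical cores)
  process 0: k-points 1, 6
  process 1: k-points 2, 7
  process 2: k-points 3, 8
  process 3: k-points 4, 9
  process 4: k-point  5         <- but two processes now share one core
------------------------------------------------------------------------

With 4 processes, process 0 does indeed carry a third k-point, but with
5 processes two of them have to time-share a single physical core, so
the elapsed time per SCF step ends up essentially unchanged, which is
what your IterSCF numbers show.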
