Hi Nick,

Please take my word for it: I'm not a computer guru, but I started using computers before the PC era :-). I know hyperthreading is bad for scientific calculations; it is even disabled in the BIOS. It is not that.
Why I am saying that np=5 should take less time than np=4, even though my PC is a quad core, is as follows. The distribution of k-points is round robin, and assume the k-points (the trimmed, real ones, not the Monkhorst-Pack grid) all take about the same time to process. Thus for np=4 I need 3 "time steps" to get the job done, namely (4 + 4 + 1) when seen from the k-point perspective. On the other hand, for np=5 the time taken would be something like 2 * 1/0.80 = 2.5, or even shorter, 1/0.80 + 1 = 2.25. What is flawed with this argument?

Best regards,
Roberto

On 11/02/2015 05:50 PM, Nick Papior wrote:
2015-11-02 21:37 GMT+01:00 RCP <[email protected] <mailto:[email protected]>>:

> Thank you Nick and Salvador for your comments. So Nick, basically you're
> saying that diagonalization time might be playing no role. That is at
> variance, for instance, with Wien2k, where diagonalization is the most
> time-consuming step. In fact, my expectation is correct for it; verified
> with a similar cell and 9 k-points.

No, I am definitely not saying that! But I have no idea about how your system is set up. Diagonalization _is_ a big part of the computation. How have you specified the k-points? Is it 9 k-points, or 9 k-points in the Monkhorst-Pack grid?

> In that case "top" shows a first stage of 5 processes running at about
> 4/5 = 80% CPU power (and more or less stable) and a 2nd stage of 4
> processes running at 100%. This is not MPI, but a parallel strategy
> based on scripts (I hope you are aware).

Wien2k is not SIESTA. If Wien2k is script based, i.e. running sequentially and self-managing the processes, then sure, they behave _very_ differently and Wien2k should give you the desired speedup. Your figures sound like hyperthreading to me.

> The same experiment performed with "mpirun -np 5 ..." and SIESTA shows
> jumpier CPU-usage figures. One task might be at 100%, another at 60%,
> and so on, as if Linux were playing with the tasks like a juggler.

You are still implying usage of a quad core machine (quad == 4), and 4 < 5. If you _only_ have 4 processors (Intel hyperthreads do _not_ count as processors), then your assumption is not correct. How would you expect a speedup by using 1 more process than you have processors on your system? If you see this juggling, it sounds like quad == 4, and not 5.

> To give you some feeling, please look at the numbers here:
>
> -----------------------------------------------------------------------
> * Running on  4 nodes in parallel
> ... snipped ...
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
> timer: Routine,Calls,Time,% = IterSCF  1  1637.906  99.72
> elaps: Routine,Calls,Wall,% = IterSCF  1   410.919  99.72
>
> * Running on  5 nodes in parallel
> ... snipped ...
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
> timer: Routine,Calls,Time,% = IterSCF  1  1654.558  99.64
> elaps: Routine,Calls,Wall,% = IterSCF  1   415.150  99.64
> ------------------------------------------------------------------------
>
> Those elapsed times are so close ... there must be an easy explanation.

Yes, if you are using mpirun -np 5 on a quad core machine, then the explanation is easy and your numbers are irrelevant.

> Best,
> Roberto
>
> On 11/02/2015 04:14 PM, Nick Papior wrote:
>
>> Basically:
>>
>> Diag.ParallelOverK false -> uses ScaLAPACK to diagonalize the Hamiltonian
>> Diag.ParallelOverK true  -> uses LAPACK to diagonalize the Hamiltonian
>>
>> If you have a very large system you will not get anything out of using
>> the latter option (other than using an enormous amount of memory). Only
>> for an _extreme_ number of k-points is the latter favourable, and there
>> are exceptions. The latter is intended for small bulk calculations with
>> many k-points.
>>
>> Lastly, you have a quad core machine and run mpirun -np 5, and expect
>> that to run faster. That is a wrong assumption. Secondly,
>> diagonalization is not everything in the program; check your TIMES file
>> to figure out whether it _is_ the diagonalization or a mixture.
>>
>> 2015-11-02 19:42 GMT+01:00 RCP <[email protected] <mailto:[email protected]>>:
>>
>>> Dear everyone,
>>>
>>> I seem to have a misunderstanding of how the Diag.ParallelOverK
>>> feature works; any comment would be much appreciated.
>>>
>>> I've got a large metallic cell, though still with 9 k-points, that
>>> runs on a quad PC; moreover, routine diagkp shows that the k-points
>>> are distributed round robin among the processes. Thus I was expecting
>>> "mpirun -np 5 ..." to run significantly faster than "mpirun -np 4 ...",
>>> as judged from the elapsed time of individual SCF steps. Clearly, in
>>> the latter case, the 9th k-point would be taken by process 0 while the
>>> other three processes remained waiting, right? However, my
>>> expectations turned out to be wrong; in fact the 2nd alternative
>>> appears to be a tiny bit faster. Why?
>>>
>>> Thanks in advance,
>>> Roberto P.
>>
>> --
>> Kind regards Nick

--
Kind regards Nick
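P.S. Roberto's round-robin timing argument is easy to check with a back-of-the-envelope script. The sketch below is not SIESTA code; it is a toy model that assumes each k-point costs one time unit on a dedicated core, and that running np ranks on a quad core slows every rank down by a factor np/4 once np > 4 (the ~80% figure seen in top). The function names (`rr_counts`, `wall_time`) are invented for illustration.

```python
# Toy model of the round-robin k-point scheduling discussed above.
# Assumptions (not SIESTA internals): every k-point costs 1 time unit on a
# dedicated core, and oversubscribing `cores` physical cores with nprocs
# ranks slows each rank down by a uniform factor nprocs/cores.

def rr_counts(nk, nprocs):
    """Round-robin distribution: k-point i goes to process i % nprocs."""
    counts = [0] * nprocs
    for i in range(nk):
        counts[i % nprocs] += 1
    return counts

def wall_time(nk, nprocs, cores=4):
    """Idealized wall time: the most loaded process dominates."""
    slowdown = max(1.0, nprocs / cores)
    return max(rr_counts(nk, nprocs)) * slowdown

print(rr_counts(9, 4))   # [3, 2, 2, 2]    -> 3 "time steps"
print(rr_counts(9, 5))   # [2, 2, 2, 2, 1] -> 2 "time steps"
print(wall_time(9, 4))   # 3.0
print(wall_time(9, 5))   # 2 * 1/0.80 = 2.5
```

Under this idealized model np=5 would indeed finish in 2.5 units versus 3.0 for np=4, exactly as the (4 + 4 + 1) argument says. What the model leaves out, and what Nick's reply points to, is everything outside the per-k-point diagonalization: MPI synchronization, the non-k-point parts of the SCF step, and the OS time-slicing 5 ranks onto 4 cores, none of which is captured by a uniform np/cores slowdown factor. That would be consistent with the nearly identical measured elapsed times.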
