Oh, one last thing: the timing can be dubious, as it is typically inferred from clock cycles in the CPU, hence I would advise you to time it yourself. For instance by doing:

date
...
date

It depends on the underlying timing function used.
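To make that concrete, a minimal sketch of such a wall-clock measurement; the input/output file names and the -np value below are only placeholders, adjust them to your actual run:

  # wall-clock stamps around the whole run; the difference is the real elapsed time
  date
  mpirun -np 4 siesta < input.fdf > output.out
  date

  # or let the shell do the bookkeeping in one go
  time mpirun -np 4 siesta < input.fdf > output.out

Either way you get an elapsed time that does not depend on how the internal timer/elaps routines count cycles.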

2015-11-03 14:03 GMT+01:00 Nick Papior <[email protected]>:

> That is crazy, and it is _amazing_!
>
> Nevertheless, I would still not recommend you do these kind of things.
>
> 2015-11-03 13:37 GMT+01:00 RCP <[email protected]>:
>
>> Good morning!,
>>
>> Please have a look at the outcome of the crazy "mpirun -np 9 ..."
>> exercise,
>>
>> ----------------------------------------------------------------------
>> * Running on 9 nodes in parallel
>>
>> ... snipped ...
>>
>> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>> timer: Routine,Calls,Time,% = IterSCF      1   1733.886  99.53
>> elaps: Routine,Calls,Wall,% = IterSCF      1    435.030  99.53
>> ----------------------------------------------------------------------
>>
>> Amazing: the elaps row is pretty close to the 410.0 (or so) of my
>> previous posts.
>>
>> However, yes, I seem to have a misunderstanding about the inner
>> workings of the code (well, in a sense, this was my first question)
>> because the timing info at the end of the output file says "diagon"
>> is taking only about 1/3 of the total time,
>>
>> ----------------------------------------------------------------------
>> elaps: ELAPSED times:
>> elaps:  Routine     Calls   Time/call    Tot.time        %
>> elaps:  siesta          1     712.814     712.814    100.00
>> ...
>> elaps:  diagon          1     225.603     225.603     31.65
>> elaps:  cdiag           2      47.763      95.527     13.40
>> elaps:  cdiag1          2       0.920       1.840      0.26
>> elaps:  cdiag2          2       3.335       6.670      0.94
>> elaps:  cdiag3          2      42.961      85.922     12.05
>> elaps:  cdiag4          2       0.512       1.025      0.14
>> elaps:  DHSCF4          1      71.874      71.874     10.08
>> elaps:  dfscf           1      70.398      70.398      9.88
>> elaps:  overfsm         1       0.269       0.269      0.04
>> elaps:  optical         1       0.000       0.000      0.00
>> ----------------------------------------------------------------------
>>
>> Take care,
>>
>> Roberto
>>
>> On 11/03/2015 08:28 AM, Nick Papior wrote:
>>
>>> 2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:
>>>
>>>   Hi,
>>>
>>>   Thanks for your time and sharing of wisdom.
>>>   In general terms I do agree with you Nick, in the sense that
>>>   running several sequential independent tasks (wien2k)
>>>   simultaneously is not equivalent to running a set of
>>>   inter-communicated, MPI, tasks.
>>>
>>>   However here we're talking about a peculiar situation,
>>>   namely, parallelization over k-points is, essentially, an
>>>   embarrassingly parallel problem, at least for my rather
>>>   large cell (97 atoms). The sequential gathering of
>>>   results from different k-points, building the new charge
>>>   density and so on, should take negligible time compared
>>>   to the time spent by a single task in diagonalizing a large
>>>   matrix.
>>>
>>> ! ! NO ! ! ;)
>>> Parallelization across k-points in siesta is NOT the same as an
>>> embarrassingly parallel problem across k-points.
>>> The _only_ thing in siesta that is parallelized embarrassingly is the
>>> diagonalisation part (after having communicated all Hamiltonian
>>> elements to all other nodes). Everything else is MPI parallelized:
>>> grid operations, construction of the Hamiltonian, etc. etc.!
>>> Yes, even though the diagonalization is embarrassingly parallel and it
>>> _should_ take the longest time, your assumption that the
>>> diagonalization part is still the most time consuming becomes wrong.
>>> Furthermore, 96 atoms ~ 1000 orbitals, not that big a matrix to
>>> diagonalize.
>>>
>>> Please look at the timing output for clarity of this.
>>>
>>>   Of course, oversubscribing the CPUs must hurt performance
>>>   at some point, and this is most likely worse for MPI tasks than
>>>   for truly independent ones. But to me 5 MPI tasks competing for
>>>   4 cores does not look a scenario that terrible.
>>>
>>> MPI is not sequential programming and any assumption on oversubscribing
>>> you have is wrong, put simply. The 5 MPI tasks do not compete for 4
>>> cores; siesta MPI tasks are linearly dependent on each other (as
>>> written in the last mail) and hence they have to keep up all the time.
>>> If the MPI program was fully embarrassingly parallelized, then yes, you
>>> could, perhaps, have a point, but siesta is not such a code.
>>> How you can keep saying that oversubscribing cannot be that damaging
>>> for performance (in fact improve it) is really baffling to me :)
>>>
>>>   Moreover, np=5 and np=4 resulted in almost the same elapsed
>>>   time. It is hard to believe that my expected time win for
>>>   np=5 was (almost) exactly compensated by performance loss.
>>>
>>> Try doubling your system size and do the same calculation.
>>>
>>>   Nice discussion guys. I'll do a little more research and let you
>>>   know if something worth comes out.
>>>
>>>   Take care,
>>>
>>>   Roberto
>>>
>>>   On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:
>>>
>>>     Could not be clearer Nick.
>>>
>>>     Ricardo, if you type top on your machine, you'll see two SIESTA
>>>     processes competing for one core's time, and performing at 50%
>>>     at most.
>>>
>>>     Other cores will wait for these processes when an operation among
>>>     all cores is necessary in the algorithm (i.e., a sum or a
>>>     distributed matrix product)... thus these other cores will just
>>>     have to wait for the tasks competing for the same core's time to
>>>     end, thus degrading performance.
>>>
>>>     -Salvador
>>>
>>>     ------------------------------------------------------------------
>>>     From: [email protected] <[email protected]> on behalf of
>>>     Nick Papior <[email protected]>
>>>     Sent: Monday, November 2, 2015 4:08 PM
>>>     To: [email protected]
>>>     Subject: Re: [SIESTA-L] Puzzled about ParallelOverK feature
>>>
>>>       2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
>>>
>>>         Hi Nick,
>>>
>>>         Please take my word: I'm not a computer guru but started
>>>         using computers before the PC era :-).
>>>         I know hyperthreading is evil for scientific calculations;
>>>         they're even disabled in BIOS. It is not that.
>>>
>>>         Why I'm saying np=5 should take less time than np=4, even if
>>>         my PC is a quad, is as follows.
>>>
>>>       This is a wrong statement!
>>>       By this argument everything that can be embarrassingly
>>>       parallelized will take less or equal time when using the number
>>>       of sequential divisions.
>>>
>>>         Distribution of k-points is round robin, and assume k-points
>>>         (the, trimmed, real ones, not M&K grid) take about the same
>>>         time to process.
>>>         Thus for np=4 I need 3 "time steps" to get the job done,
>>>         namely (4 + 4 + 1) when seen from the k-points perspective.
>>>         On the other hand, for np=5 the time taken would be something
>>>         like 2 * 1/0.80 = 2.5, or even shorter, 1/0.80 + 1 = 2.25.
>>>         ¿What is flawed with this argument?
>>>
>>>       Your flaw lies in using more cores than available; this has
>>>       nothing to do with the number of k-points, and your figures are
>>>       based on a sequential program governed by the OS, not a parallel
>>>       program (from what I've gathered).
>>>       You should try running a simple openmp program with
>>>       OMP_NUM_THREADS=4 and 5 and see if that also degrades
>>>       performance.
>>>
>>>       Oversubscribing your CPU is _heavily_ hurting performance and,
>>>       yes, oversubscribing can make your program run worse than the
>>>       number of cores, especially when using MPI.
>>>       By your argument you would get the same performance by doing
>>>       mpirun -np 9, no? Try that and you will see that it will be
>>>       slower and slower the more processors you throw at it.
>>>       MPI is not sequential, and comparing the execution of a parallel
>>>       and a sequential program is, at best, erroneous.
>>>
>>>       The reason it runs _perfect_ for your wien2k calculations (from
>>>       what you say they are sequential programs) is that the processors
>>>       there make NO communication with each other, meaning that each
>>>       process can be halted/resumed at any time without notifying
>>>       anything but the running process. With your wien2k np=5 the OS
>>>       can pause and resume processes as it pleases with *relatively*
>>>       little impact on the performance; there is some, but not that
>>>       much. This is because each process is not dependent on the others
>>>       and it will try and finish some before moving on.
>>>
>>>       With MPI (siesta) this is _very_ wrong. Most MPI programs are
>>>       communication bounded (i.e. not embarrassingly parallelized using
>>>       MPI). The data is distributed and every process is dependent on
>>>       each other; no process can progress without informing the other
>>>       processors.
>>>       This means 1) every processor does some work, 2) all processors
>>>       communicate with each other, 3) repeat from step 1). Now do steps
>>>       1 to 3 a couple of million times and the OS becomes flooded with
>>>       stop/resumes (basically, not in its entirety, but for brevity).
>>>       Whenever you use MPI you should never use more processors than
>>>       you have available.
>>>       (https://www.open-mpi.org/faq/?category=running#oversubscribing)
>>>       If you time your execution with timings of the MPI calls you
>>>       should most likely see immense increases in communication times
>>>       as the processes wait all the time; test this if you want more
>>>       clear proof!
>>>
>>>       Bottom line: never use more MPI processors than you have physical
>>>       processors.
>>>       If you still want more explanations, turn to MPI developers for
>>>       more technical details; all I can say is, never use more MPI
>>>       processors than you have physical cores.
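Concretely, the over-subscription test I suggest above takes two minutes to try from the shell. A rough sketch, where ./omp_test stands for any OpenMP-enabled (or otherwise threaded) binary you happen to have at hand:

  # first with as many threads as there are physical cores ...
  export OMP_NUM_THREADS=4
  time ./omp_test

  # ... then oversubscribed by one thread on the same quad-core box
  export OMP_NUM_THREADS=5
  time ./omp_test

The second run should be no faster, and for a communication-bound MPI code such as siesta the corresponding mpirun -np 4 versus -np 5 comparison is expected to be at least as unfavourable, for the reasons given above.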
>>> >>>        So Nick, basically you're saying that >>> diagonalization time >>>     might >>>        be playing no role. That is at variance, for >>> instance, with >>>     Wien2k, >>>        where diagonalization is the most time >>> consuming step. In fact, >>>        my expectation is correct for it; veryfied >>> with a similar cell >>>        and 9 k-points. >>> >>>     No, I am definitely not saying that! But I have no >>> idea about >>>     how your >>>     system is setup. >>>     Diagonalization _is_ a big part of the computation. >>>     How have you specified the k-points? Is it 9 kpoints >>> or 9 >>>     kpoints in the >>>     monkhorst pack grid? >>> >>> >>>        In that case "top" shows a first stage of 5 >>> processes >>>     running at >>>        about 4/5=80% CPU power (and more or less >>> stable) and a 2nd >>>     stage of >>>        4 procs, running at 100%. This is not MPI, >>> but a parallel >>>     strategy >>>        based on scripts (hope you are aware). >>> >>>     wien2k is not siesta. >>>     If wien2k is script based, i.e. sequential running >>> and >>>     self-managing the >>>     processes, then sure they behave _very_ differently >>> and wien2k >>>     should >>>     give you the desired speedup. Your figures sounds >>> like >>>     hyperthreading to me. >>> >>>        The same experiment performed with "mpirun >>> -np 5 ..." and >>>     Siesta, >>>        shows more jumpy figures for CPU usage. One >>> task might be >>>     at 100%, >>>        another at 60%, and so on, as if Linux >>> were playing with >>>     tasks >>>        like a juggler. >>> >>>     You are still implying usage of a quad core machine >>> (quad == 4) >>>     and 4<5. >>>     If you _only_ have 4 processors (intel hyperthreads >>> do _not_ >>>     count as a >>>     processes) then your assumption is not correct. >>>     How would you expect a speedup by using 1 more >>> process than you >>>     have on >>>     your system? >>>     If you see this juggling it sounds like quad == 4 >>> and not 5. >>> >>> >>>        To give you some feeling, please look at the >>> numbers here, >>> >>> >>>     >>> >>> ----------------------------------------------------------------------- >>>        * Running on  4 nodes in parallel >>> >>>        Â ... snipped ... >>> >>>        siesta: iscf  Eharris(eV)  >>>  E_KS(eV)  FreeEng(eV) >>>        Â dDmax Ef(eV) >>>        siesta:  1 -124261.2908 >>> -124261.2891 >>>     -124261.2891 0.0001 >>>        -2.5494 >>>        timer: Routine,Calls,Time,% = IterSCF >>>    1  >>>     1637.906 99.72 >>>        elaps: Routine,Calls,Wall,% = IterSCF >>>    1   >>>     410.919 >>>        99.72 <tel:410.919%20%2099.72> >>> >>> >>>        * Running on  5 nodes in parallel >>> >>>        Â ... snipped ... >>> >>>        siesta: iscf  Eharris(eV)  >>>  E_KS(eV)  FreeEng(eV) >>>        Â dDmax Ef(eV) >>>        siesta:  1 -124261.2908 >>> -124261.2891 >>>     -124261.2891 0.0001 >>>        -2.5494 >>>        timer: Routine,Calls,Time,% = IterSCF >>>    1  >>>     1654.558 99.64 >>>        elaps: Routine,Calls,Wall,% = IterSCF >>>    1   >>>     415.150 >>>        99.64 >>> >>>     >>> >>> ------------------------------------------------------------------------ >>> >>>        Those elapsed times are so close ... there >>> must be an easy >>>     explanation. >>> >>>     Yes, if you are using mpirun -np 5 on a quad core >>> machine, then the >>>     explanation is easy and your numbers are >>> irrelevant. 
>>>
>>>             Best,
>>>
>>>             Roberto
>>>
>>>             On 11/02/2015 04:14 PM, Nick Papior wrote:
>>>
>>>               Basically:
>>>               Diag.ParallelOverK false
>>>                 uses scalapack to diagonalize the Hamiltonian
>>>               Diag.ParallelOverK true
>>>                 uses lapack to diagonalize the Hamiltonian
>>>
>>>               If you have a very large system, you will not get
>>>               anything out of using the latter option (other than
>>>               using an enormous amount of memory). Only for an
>>>               _extreme_ number of k-points is the latter favourable;
>>>               there are exceptions.
>>>
>>>               The latter is intended for small bulk calculations with
>>>               many k-points.
>>>
>>>               Lastly, you have a quad core machine and run
>>>               mpirun -np 5, and expect that to run faster. That is a
>>>               wrong assumption.
>>>               Secondly, diagonalization is not everything in the
>>>               program; check your TIMES file to figure out whether it
>>>               _is_ the diagonalization or a mixture.
>>>
>>>               2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>>
>>>                 Dear everyone,
>>>
>>>                 I seem to have a misunderstanding on how the
>>>                 Diag.ParallelOverK feature works; any comment would
>>>                 be much appreciated.
>>>
>>>                 I've got a large metallic cell, though still with 9
>>>                 k-points, that runs on a quad PC; moreover, routine
>>>                 diagkp shows k-points are distributed round robin
>>>                 among processes. Thus I was expecting
>>>                 "mpirun -np 5 ..." to run significantly faster than
>>>                 "mpirun -np 4 ...", as judged from the elapsed time
>>>                 of individual scf steps. Clearly, in the latter case,
>>>                 the 9th k-point would be taken by process 0 while the
>>>                 other three would remain waiting, right?
>>>
>>>                 However, my expectations turned out to be wrong; in
>>>                 fact the 2nd alternative appears to be a tiny bit
>>>                 faster.
>>>                 Why?
>>>
>>>                 Thanks in advance,
>>>
>>>                 Roberto P.
>>>
>>>               --
>>>               Kind regards Nick
>>>
>>>           --
>>>           Kind regards Nick
>>>
>>>       --
>>>       Kind regards Nick
>>>
>>>   --
>>>   |--------------------------------------------------------------|
>>>   |  Dr. Roberto C. Pasianot       Phone: 54 11 4839 6709         |
>>>   |  Gcia. Materiales, CAC-CNEA    FAX  : 54 11 6772 7362         |
>>>   |  Avda. Gral. Paz 1499          Email: [email protected]        |
>>>   |  1650 San Martin, Buenos Aires                                 |
>>>   |  ARGENTINA                                                     |
>>>   |--------------------------------------------------------------|
>>>
>>> --
>>> Kind regards Nick
>>
>
> --
> Kind regards Nick
>

--
Kind regards Nick
