That is crazy, and it is _amazing_! Nevertheless, I would still not recommend doing this kind of thing.
2015-11-03 13:37 GMT+01:00 RCP <[email protected]>:

> Good morning!
>
> Please have a look at the outcome of the crazy "mpirun -np 9 ..." exercise:
>
> ----------------------------------------------------------------------
> * Running on 9 nodes in parallel
>
> ... snipped ...
>
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax   Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001  -2.5494
> timer: Routine,Calls,Time,% = IterSCF  1  1733.886  99.53
> elaps: Routine,Calls,Wall,% = IterSCF  1   435.030  99.53
> -----------------------------------------------------------------------
>
> Amazing: the elaps row is pretty close to the 410.0 (or so) of my previous posts.
>
> However, yes, I seem to have a misunderstanding about the inner workings of
> the code (in a sense, this was my first question), because the timing info at
> the end of the output file says "diagon" takes only about 1/3 of the total time:
>
> ------------------------------------------------------------------
> elaps: ELAPSED times:
> elaps:  Routine    Calls  Time/call  Tot.time       %
> elaps:  siesta         1    712.814   712.814  100.00
> ...
> elaps:  diagon         1    225.603   225.603   31.65
> elaps:  cdiag          2     47.763    95.527   13.40
> elaps:  cdiag1         2      0.920     1.840    0.26
> elaps:  cdiag2         2      3.335     6.670    0.94
> elaps:  cdiag3         2     42.961    85.922   12.05
> elaps:  cdiag4         2      0.512     1.025    0.14
> elaps:  DHSCF4         1     71.874    71.874   10.08
> elaps:  dfscf          1     70.398    70.398    9.88
> elaps:  overfsm        1      0.269     0.269    0.04
> elaps:  optical        1      0.000     0.000    0.00
> -------------------------------------------------------------------
>
> Take care,
>
> Roberto
>
> On 11/03/2015 08:28 AM, Nick Papior wrote:
>
>> 2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:
>>
>>> Hi,
>>>
>>> Thanks for your time and sharing of wisdom.
>>> In general terms I do agree with you, Nick, in the sense that running
>>> several sequential, independent tasks (Wien2k) simultaneously is not
>>> equivalent to running a set of inter-communicating MPI tasks.
>>>
>>> However, here we're talking about a peculiar situation: parallelization
>>> over k-points is essentially an embarrassingly parallel problem, at least
>>> for my rather large cell (97 atoms). The sequential gathering of results
>>> from the different k-points, building the new charge density and so on,
>>> should take negligible time compared to the time spent by a single task
>>> in diagonalizing a large matrix.
>>
>> !! NO !! ;)
>> Parallelization across k-points in SIESTA is NOT the same as an
>> embarrassingly parallel problem across k-points.
>> The _only_ thing in SIESTA that is parallelized embarrassingly is the
>> diagonalization part (after all Hamiltonian elements have been communicated
>> to all other nodes). Everything else is MPI-parallelized: grid operations,
>> construction of the Hamiltonian, etc.!
>> So even though the diagonalization is embarrassingly parallel and _should_
>> take the longest time, your assumption that the diagonalization part is
>> still the most time-consuming step turns out to be wrong.
>> Furthermore, 96 atoms is roughly 1000 orbitals, which is not that big a
>> matrix to diagonalize.
>>
>> Please look at the timing output for clarity on this.
>>
>>> Of course, oversubscribing the CPUs must hurt performance at some point,
>>> and this is most likely worse for MPI tasks than for truly independent
>>> ones. But to me, 5 MPI tasks competing for 4 cores does not look like
>>> such a terrible scenario.
>>
>> MPI is not sequential programming, and any assumption you make about
>> oversubscribing is, put simply, wrong.
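For reference, the round-robin estimate being debated here (Roberto's argument appears further down in the thread) can be written out explicitly. The following is a minimal illustrative sketch of the arithmetic only, not of anything SIESTA does; the file name is hypothetical, and it assumes 9 equal-cost k-points on a 4-core machine and that core sharing is the only penalty for oversubscription. The measured IterSCF times quoted in the thread (roughly 411 s, 415 s and 435 s for np = 4, 5 and 9) suggest that it is exactly that last assumption which fails once the MPI ranks must stay in lock-step.

----------------------------------------------------------------------
/* kpt_rounds.c (hypothetical file name) -- illustrative arithmetic only.
 * Naive estimate: nk equal-cost k-points handed out round-robin to np
 * ranks on ncores physical cores; the only assumed cost of
 * oversubscription is the np/ncores core-sharing factor.            */
#include <stdio.h>

int main(void) {
    const int nk = 9, ncores = 4;
    for (int np = 4; np <= 9; np++) {
        int rounds = (nk + np - 1) / np;     /* ceil(nk/np) round-robin passes */
        double share = (np > ncores)         /* naive core-sharing slowdown    */
                     ? (double)np / ncores : 1.0;
        printf("np=%d  rounds=%d  naive time = %.2f units\n",
               np, rounds, rounds * share);
    }
    return 0;   /* prints 3.00 for np=4, 2.50 for np=5, 2.25 for np=9 */
}
----------------------------------------------------------------------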
>> The 5 MPI tasks do not compete for 4 cores as independent jobs: SIESTA's
>> MPI tasks are linearly dependent on each other (as written in the last
>> mail) and hence have to keep up with each other all the time. If the MPI
>> program were fully embarrassingly parallelized then yes, you could perhaps
>> have a point, but SIESTA is not such a code.
>> How you can keep saying that oversubscribing cannot be that damaging for
>> performance (or can in fact improve it) is really baffling to me :)
>>
>>> Moreover, np=5 and np=4 resulted in almost the same elapsed time. It is
>>> hard to believe that my expected time gain for np=5 was (almost) exactly
>>> compensated by the performance loss.
>>
>> Try doubling your system size and do the same calculation.
>>
>>> Nice discussion, guys. I'll do a little more research and let you know if
>>> something worthwhile comes out.
>>>
>>> Take care,
>>>
>>> Roberto
>>>
>>> On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:
>>>
>>>> Could not be clearer, Nick.
>>>>
>>>> Roberto, if you type top on your machine, you'll see two SIESTA
>>>> processes competing for one core's time, and performing at 50% at most.
>>>>
>>>> The other cores will have to wait for these processes whenever an
>>>> operation involving all cores is required by the algorithm (e.g. a sum
>>>> or a distributed matrix product); they simply sit idle until the
>>>> processes sharing the same core finish, thus degrading performance.
>>>>
>>>> -Salvador
>>>>
>>>> ------------------------------------------------------------------------
>>>> From: [email protected] <[email protected]> on behalf of
>>>> Nick Papior <[email protected]>
>>>> Sent: Monday, November 2, 2015 4:08 PM
>>>> To: [email protected]
>>>> Subject: Re: [SIESTA-L] Puzzled about ParallelOverK feature
>>>>
>>>>> 2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
>>>>>
>>>>>> Hi Nick,
>>>>>>
>>>>>> Please take my word: I'm not a computer guru, but I started using
>>>>>> computers before the PC era :-).
>>>>>> I know hyperthreading is evil for scientific calculations; it is even
>>>>>> disabled in the BIOS. It is not that.
>>>>>>
>>>>>> Why I'm saying np=5 should take less time than np=4, even if my PC is
>>>>>> a quad, is as follows.
>>>>>
>>>>> This is a wrong statement!
>>>>> By this argument, anything that can be embarrassingly parallelized
>>>>> would take less (or equal) time whenever you use as many processes as
>>>>> there are independent pieces of work, regardless of the number of
>>>>> cores.
>>>>>
>>>>>> Distribution of k-points is round-robin, and assume the k-points (the
>>>>>> trimmed, real ones, not the M&K grid) take about the same time to
>>>>>> process.
>>>>>> Thus for np=4 I need 3 "time steps" to get the job done, namely
>>>>>> (4 + 4 + 1) when seen from the k-point perspective.
>>>>>> On the other hand, for np=5 the time taken would be something like
>>>>>> 2 * 1/0.80 = 2.5, or even shorter, 1/0.80 + 1 = 2.25.
>>>>>> What is flawed with this argument?
>>>>>
>>>>> Your flaw lies in using more cores than available; this has nothing to
>>>>> do with the number of k-points, and your figures are based on a
>>>>> sequential program governed by the OS, not a parallel program (from
>>>>> what I've gathered).
>>>>> You should try running a simple OpenMP program with OMP_NUM_THREADS=4
>>>>> and 5 and see if that also degrades performance.
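A minimal test of the kind suggested above could look like the sketch below (hypothetical file name, unrelated to SIESTA; it assumes gcc with -fopenmp). Running it with OMP_NUM_THREADS=4 and then OMP_NUM_THREADS=5 on a quad-core machine shows how even a purely compute-bound threaded loop with a single synchronization point behaves when oversubscribed.

----------------------------------------------------------------------
/* omp_test.c (hypothetical file name) -- build: gcc -O2 -fopenmp omp_test.c
 * Run:  OMP_NUM_THREADS=4 ./a.out   and   OMP_NUM_THREADS=5 ./a.out      */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 400000000L;      /* enough work to take a few seconds */
    double sum = 0.0;
    double t0 = omp_get_wtime();

    /* every thread accumulates its share; the reduction forces all
     * threads to synchronize at the end of the parallel region */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;

    printf("threads = %d   sum = %.6f   wall = %.2f s\n",
           omp_get_max_threads(), sum, omp_get_wtime() - t0);
    return 0;
}
----------------------------------------------------------------------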
>>>>> Oversubscribing your CPU _heavily_ degrades performance, and yes,
>>>>> oversubscribing can make your program run slower than when you use
>>>>> exactly the number of cores you have, especially with MPI.
>>>>> By your argument you would get the same performance by doing
>>>>> mpirun -np 9, no? Try that and you will see that it only gets slower
>>>>> and slower the more processes you throw at it.
>>>>> MPI is not sequential, and comparing the execution of a parallel and a
>>>>> sequential program is, at best, erroneous.
>>>>>
>>>>> The reason it runs _perfectly_ for your Wien2k calculations (from what
>>>>> you say, those are sequential programs) is that the processes there do
>>>>> NO communication with each other, meaning that each process can be
>>>>> halted/resumed at any time without notifying anything but the running
>>>>> process. With your Wien2k np=5 the OS can pause and resume processes as
>>>>> it pleases with relatively little impact on performance; there is some,
>>>>> but not that much. This is because no process depends on the others,
>>>>> and the OS will tend to finish some of them before moving on.
>>>>>
>>>>> With MPI (SIESTA) this is _very_ wrong. Most MPI programs are
>>>>> communication bound (i.e. not embarrassingly parallelized using MPI).
>>>>> The data is distributed and every process depends on the others; no
>>>>> process can progress without informing the other processes. This means:
>>>>> 1) every process does some work, 2) all processes communicate with each
>>>>> other, 3) repeat from step 1). Now do steps 1 to 3 a couple of million
>>>>> times and the OS becomes flooded with stops/resumes (roughly speaking;
>>>>> not the whole story, but enough for brevity).
>>>>> Whenever you use MPI you should never use more processes than you have
>>>>> cores available
>>>>> (https://www.open-mpi.org/faq/?category=running#oversubscribing).
>>>>> If you time your execution, including timings of the MPI calls, you
>>>>> should most likely see immense increases in communication time as the
>>>>> processes wait all the time; test this if you want clearer proof!
>>>>>
>>>>> Bottom line: never use more MPI processes than you have physical
>>>>> processors. If you still want more explanations, turn to the MPI
>>>>> developers for the technical details; all I can say is, never use more
>>>>> MPI processes than you have physical cores.
>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Roberto
>>>>>>
>>>>>> On 11/02/2015 05:50 PM, Nick Papior wrote:
>>>>>>
>>>>>>> 2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:
>>>>>>>
>>>>>>>> Thank you Nick and Salvador for your comments.
>>>>>>>>
>>>>>>>> So Nick, basically you're saying that diagonalization time might be
>>>>>>>> playing no role. That is at variance, for instance, with Wien2k,
>>>>>>>> where diagonalization is the most time-consuming step. In fact, my
>>>>>>>> expectation is correct for it; verified with a similar cell and
>>>>>>>> 9 k-points.
>>>>>>>
>>>>>>> No, I am definitely not saying that! But I have no idea about how
>>>>>>> your system is set up.
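The "work, communicate, repeat" pattern described above can be illustrated with a minimal MPI sketch (hypothetical file name, not SIESTA code; it assumes a standard MPI installation with mpicc and mpirun). Every rank must reach the collective call before any rank can continue, so on a 4-core machine a fifth, time-sliced rank stalls all the others at every iteration; comparing -np 4 and -np 5 makes the oversubscription penalty visible.

----------------------------------------------------------------------
/* mpi_lockstep.c (hypothetical file name) -- build: mpicc -O2 mpi_lockstep.c
 * Run and compare:  mpirun -np 4 ./a.out   vs   mpirun -np 5 ./a.out       */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, global = 0.0;
    double t0 = MPI_Wtime();

    for (int iter = 0; iter < 5000; iter++) {
        /* 1) some local work */
        for (int i = 0; i < 100000; i++)
            local += 1e-9 * (rank + 1) * i;

        /* 2) a collective: no rank proceeds until every rank has arrived */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }                                   /* 3) repeat */

    if (rank == 0)
        printf("np = %d   wall = %.2f s\n", size, MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------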
>>>>>>> Diagonalization _is_ a big part of the computation.
>>>>>>> How have you specified the k-points? Is it 9 k-points, or 9 k-points
>>>>>>> in the Monkhorst-Pack grid?
>>>>>>>
>>>>>>>> In that case "top" shows a first stage of 5 processes running at
>>>>>>>> about 4/5 = 80% CPU power (and more or less stable), and a 2nd stage
>>>>>>>> of 4 processes running at 100%. This is not MPI, but a parallel
>>>>>>>> strategy based on scripts (I hope you are aware of that).
>>>>>>>
>>>>>>> Wien2k is not SIESTA.
>>>>>>> If Wien2k is script based, i.e. sequential runs with self-managed
>>>>>>> processes, then sure, they behave _very_ differently and Wien2k
>>>>>>> should give you the desired speedup. Your figures sound like
>>>>>>> hyperthreading to me.
>>>>>>>
>>>>>>>> The same experiment performed with "mpirun -np 5 ..." and SIESTA
>>>>>>>> shows more jumpy figures for CPU usage. One task might be at 100%,
>>>>>>>> another at 60%, and so on, as if Linux were playing with the tasks
>>>>>>>> like a juggler.
>>>>>>>
>>>>>>> You are still implying the use of a quad-core machine (quad == 4),
>>>>>>> and 4 < 5. If you _only_ have 4 processors (Intel hyperthreads do
>>>>>>> _not_ count as processors), then your assumption is not correct.
>>>>>>> How would you expect a speedup by using 1 more process than you have
>>>>>>> cores on your system?
>>>>>>> If you see this juggling, it sounds like quad == 4 and not 5.
>>>>>>>
>>>>>>>> To give you some feeling, please look at the numbers here:
>>>>>>>>
>>>>>>>> -----------------------------------------------------------------------
>>>>>>>> * Running on 4 nodes in parallel
>>>>>>>>
>>>>>>>> ... snipped ...
>>>>>>>>
>>>>>>>> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax   Ef(eV)
>>>>>>>> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001  -2.5494
>>>>>>>> timer: Routine,Calls,Time,% = IterSCF  1  1637.906  99.72
>>>>>>>> elaps: Routine,Calls,Wall,% = IterSCF  1   410.919  99.72
>>>>>>>>
>>>>>>>> * Running on 5 nodes in parallel
>>>>>>>>
>>>>>>>> ... snipped ...
>>>>>>>>
>>>>>>>> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax   Ef(eV)
>>>>>>>> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001  -2.5494
>>>>>>>> timer: Routine,Calls,Time,% = IterSCF  1  1654.558  99.64
>>>>>>>> elaps: Routine,Calls,Wall,% = IterSCF  1   415.150  99.64
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> Those elapsed times are so close ... there must be an easy
>>>>>>>> explanation.
>>>>>>>
>>>>>>> Yes, if you are using mpirun -np 5 on a quad-core machine, then the
>>>>>>> explanation is easy and your numbers are irrelevant.
>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Roberto
>>>>>>>>
>>>>>>>> On 11/02/2015 04:14 PM, Nick Papior wrote:
>>>>>>>>
>>>>>>>>> Basically:
>>>>>>>>> Diag.ParallelOverK false : uses ScaLAPACK to diagonalize the Hamiltonian
>>>>>>>>> Diag.ParallelOverK true  : uses LAPACK to diagonalize the Hamiltonian
>>>>>>>>>
>>>>>>>>> If you have a very large system, you will not get anything out of
>>>>>>>>> using the latter option (other than using an enormous amount of
>>>>>>>>> memory). Only for an _extreme_ number of k-points is the latter
>>>>>>>>> favourable, though there are exceptions.
>>>>>>>>> The latter is intended for small bulk calculations with many
>>>>>>>>> k-points.
>>>>>>>>>
>>>>>>>>> Lastly, you have a quad-core machine and run mpirun -np 5, and
>>>>>>>>> expect that to run faster. That is a wrong assumption.
>>>>>>>>> Secondly, diagonalization is not everything in the program; check
>>>>>>>>> your TIMES file to figure out whether it _is_ the diagonalization,
>>>>>>>>> or a mixture.
>>>>>>>>>
>>>>>>>>> 2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Dear everyone,
>>>>>>>>>>
>>>>>>>>>> I seem to have a misunderstanding of how the Diag.ParallelOverK
>>>>>>>>>> feature works; any comment would be much appreciated.
>>>>>>>>>>
>>>>>>>>>> I've got a large metallic cell, though still with 9 k-points, that
>>>>>>>>>> runs on a quad PC; moreover, routine diagkp shows the k-points are
>>>>>>>>>> distributed round-robin among the processes. Thus I was expecting
>>>>>>>>>> "mpirun -np 5 ..." to run significantly faster than
>>>>>>>>>> "mpirun -np 4 ...", as judged from the elapsed time of individual
>>>>>>>>>> SCF steps. Clearly, in the latter case, the 9th k-point would be
>>>>>>>>>> taken by process 0 while the other three processes remain waiting,
>>>>>>>>>> right?
>>>>>>>>>>
>>>>>>>>>> However, my expectations turned out to be wrong; in fact the 2nd
>>>>>>>>>> alternative appears to be a tiny bit faster. Why?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>>
>>>>>>>>>> Roberto P.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kind regards Nick
>>>>>>>
>>>>>>> --
>>>>>>> Kind regards Nick
>>>>>
>>>>> --
>>>>> Kind regards Nick
>>>
>>> --
>>>
>>> |---------------------------------------------------------------------|
>>> |  Dr. Roberto C. Pasianot      Phone: 54 11 4839 6709                 |
>>> |  Gcia. Materiales, CAC-CNEA   FAX:   54 11 6772 7362                 |
>>> |  Avda. Gral. Paz 1499         Email: [email protected]               |
>>> |  1650 San Martin, Buenos Aires                                       |
>>> |  ARGENTINA                                                           |
>>> |---------------------------------------------------------------------|
>>
>> --
>> Kind regards Nick

--
Kind regards Nick
