That is crazy, and it is _amazing_!

Nevertheless, I would still not recommend you do this kind of thing.
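
If you want to see the oversubscription penalty for yourself, here is a toy
test along those lines (just a sketch, not SIESTA code; it assumes a working
MPI installation with mpicc and mpirun). Every rank does a little local work
and then all ranks must meet in an MPI_Allreduce, i.e. the compute/communicate
pattern discussed below:

------------------------------------------------------------------
/* oversub.c -- hypothetical toy benchmark, NOT part of siesta.
 * Every rank must reach the MPI_Allreduce before anyone can continue,
 * so if the OS swaps one oversubscribed rank out, all ranks wait for it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, global = 0.0;
    double t0 = MPI_Wtime();
    for (int iter = 0; iter < 20000; iter++) {
        for (int i = 0; i < 5000; i++)             /* a bit of local "work" */
            local += 1e-9 * (double)(i + rank);
        /* global synchronisation point on every iteration */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("np = %d   wall time = %.3f s\n", size, t1 - t0);

    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------

Compile with "mpicc -O2 oversub.c -o oversub", then compare "mpirun -np 4
./oversub" against "mpirun -np 5 ./oversub" (add --oversubscribe if your
mpirun refuses to start more ranks than cores) on a quad-core machine; the
oversubscribed run should come out slower, not faster.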


2015-11-03 13:37 GMT+01:00 RCP <[email protected]>:

> Good morning!
>
> Please have a look at the outcome of the crazy "mpirun -np 9 ..."
> exercise:
>
> ----------------------------------------------------------------------
> * Running on    9 nodes in parallel
>
>  ... snipped ...
>
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
> timer: Routine,Calls,Time,% = IterSCF        1    1733.886  99.53
> elaps: Routine,Calls,Wall,% = IterSCF        1     435.030  99.53
> -----------------------------------------------------------------------
>
> Amazing: the elaps row is pretty close to the 410.0 (or so) of my
> previous posts.
>
> However, yes, I seem to have a misunderstanding about the inner
> workings of the code (in a sense, this was my first question),
> because the timing info at the end of the output file says "diagon"
> takes only about a third of the total time:
>
> ------------------------------------------------------------------
> elaps: ELAPSED times:
> elaps:  Routine       Calls   Time/call    Tot.time        %
> elaps:  siesta            1     712.814     712.814   100.00
> ...
> elaps:  diagon            1     225.603     225.603    31.65
> elaps:  cdiag             2      47.763      95.527    13.40
> elaps:  cdiag1            2       0.920       1.840     0.26
> elaps:  cdiag2            2       3.335       6.670     0.94
> elaps:  cdiag3            2      42.961      85.922    12.05
> elaps:  cdiag4            2       0.512       1.025     0.14
> elaps:  DHSCF4            1      71.874      71.874    10.08
> elaps:  dfscf             1      70.398      70.398     9.88
> elaps:  overfsm           1       0.269       0.269     0.04
> elaps:  optical           1       0.000       0.000     0.00
> -------------------------------------------------------------------
>
> Take care,
>
> Roberto
>
> On 11/03/2015 08:28 AM, Nick Papior wrote:
>
>>
>>
>> 2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:
>>
>>     Hi,
>>
>>     Thanks for your time and sharing of wisdom.
>>     In general terms I do agree with you, Nick, in the sense that
>>     running several sequential, independent tasks (wien2k)
>>     simultaneously is not equivalent to running a set of
>>     inter-communicating MPI tasks.
>>
>>     However, here we're talking about a peculiar situation,
>>     namely, parallelization over k-points is, essentially, an
>>     embarrassingly parallel problem, at least for my rather
>>     large cell (97 atoms). The sequential gathering of
>>     results from different k-points, building the new charge
>>     density, and so on, should take negligible time compared
>>     to the time spent by a single task diagonalizing a large
>>     matrix.
>>
>> ! ! NO ! ! ;)
>> Parallelization across k-points in siesta is NOT the same as an
>> embarrassingly parallel problem across k-points.
>> The _only_ thing in siesta that is embarrassingly parallel is the
>> diagonalisation part (after all Hamiltonian elements have been
>> communicated to all other nodes). Everything else is MPI parallelized:
>> grid operations, construction of the Hamiltonian, etc.!
>> So even though the diagonalization is embarrassingly parallel and it
>> _should_ take the longest time, your assumption that the diagonalization
>> part is still the most time-consuming turns out to be wrong.
>> Furthermore, 96 atoms ~ 1000 orbitals, which is not that big a matrix to
>> diagonalize.
>>
>> Please look at the timing output to see this clearly.
>>
>>
>>     Of course, oversubscribing the CPUs must hurt performance
>>     at some point, and this is most likely worse for MPI tasks than
>>     for truly independent ones. But to me, 5 MPI tasks competing for
>>     4 cores does not look like such a terrible scenario.
>>
>> MPI is not sequential programming, and any assumption you have about
>> oversubscribing is, put simply, wrong. The 5 MPI tasks do not merely
>> compete for 4 cores: siesta MPI tasks depend on one another (as written
>> in the last mail) and hence have to keep up with each other all the time.
>> If the MPI program were fully embarrassingly parallel, then yes, you
>> could perhaps have a point, but siesta is not such a code.
>> How you can keep saying that oversubscribing cannot be that damaging for
>> performance (or might in fact improve it) is really baffling to me :)
>>
>>     Moreover, np=5 and np=4 resulted in almost the same elapsed
>>     time. It is hard to believe that my expected time win for
>>     np=5 was (almost) exactly compensated by performance loss.
>>
>> Try doubling your system size and do the same calculation.
>>
>>
>>     Nice discussion, guys. I'll do a little more research and let you
>>     know if something worthwhile comes out.
>>
>>
>>     Take care,
>>
>>     Roberto
>>
>>     On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:
>>
>>         Could not be clearer, Nick.
>>
>>         Roberto, if you type top on your machine, you'll see two SIESTA
>>         processes competing for one core's time, and performing at 50%
>>         at most.
>>
>>
>>         Other cores will wait for these processes whenever an operation
>>         among all cores is necessary in the algorithm (e.g., a sum or a
>>         distributed matrix product); those cores will simply have to wait
>>         until the tasks competing for the same core's time finish, thus
>>         degrading performance.
>>
>>
>>         -Salvador
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>>         *From:* [email protected] <[email protected]> on behalf of
>>         Nick Papior <[email protected]>
>>         *Sent:* Monday, November 2, 2015 4:08 PM
>>         *To:* [email protected]
>>         *Subject:* Re: [SIESTA-L] Puzzled about ParallelOverK feature
>>
>>
>>         2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
>>
>>
>>             Hi Nick,
>>
>>             Please take my word: I'm not a computer guru, but I started
>>             using computers before the PC era :-).
>>             I know hyperthreading is evil for scientific calculations;
>>             it is even disabled in the BIOS. It is not that.
>>
>>             Why I'm saying np=5 should take less time than np=4, even if
>>             my PC is a quad, is as follows.
>>
>>         This is a wrong statement!
>>         By this argument, anything that can be embarrassingly parallelized
>>         would take less or equal time whenever you use as many processes
>>         as there are divisions of the work, regardless of the core count.
>>
>>             Distribution of k-points is round robin, and assume the
>>             k-points (the trimmed, real ones, not the M&K grid) take
>>             about the same time to process.
>>             Thus for np=4 I need 3 "time steps" to get the job done,
>>             namely (4 + 4 + 1) when seen from the k-points' perspective.
>>             On the other hand, for np=5 the time taken would be
>>             something like 2 * 1/0.80 = 2.5, or even shorter,
>>             1/0.80 + 1 = 2.25.
>>             What is flawed in this argument?
>>
>>
>>         Your flaw lies in using more cores than are available; this has
>>         nothing to do with the number of k-points, and your figures are
>>         based on a sequential program governed by the OS, not a parallel
>>         program (from what I've gathered).
>>         You should try running a simple OpenMP program with
>>         OMP_NUM_THREADS=4 and then with 5, and see if that also degrades
>>         performance.
>>
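>>         For instance, something along these lines (a hypothetical toy
>>         test, nothing to do with siesta; it assumes gcc with OpenMP
>>         support):
>>
>>         /* omp_test.c -- compile with "gcc -fopenmp -O2 omp_test.c -o omp_test",
>>          * then run it with OMP_NUM_THREADS=4 and with OMP_NUM_THREADS=5 on a
>>          * quad-core machine and compare the reported wall times. */
>>         #include <omp.h>
>>         #include <stdio.h>
>>
>>         int main(void) {
>>             const long n = 200000000L;
>>             double sum = 0.0;
>>             double t0 = omp_get_wtime();
>>             /* independent iterations, split across the available threads */
>>             #pragma omp parallel for reduction(+:sum)
>>             for (long i = 0; i < n; i++)
>>                 sum += 1.0 / (double)(i + 1);
>>             double t1 = omp_get_wtime();
>>             printf("threads = %d   sum = %.6f   wall = %.3f s\n",
>>                    omp_get_max_threads(), sum, t1 - t0);
>>             return 0;
>>         }
>>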
>>         Oversubscribing your CPU _heavily_ impacts performance, and yes,
>>         oversubscribing can make your program run slower than it would on
>>         just the available cores, especially when using MPI.
>>         By your argument you would get the same performance by doing
>>         mpirun -np 9, no? Try that and you will see that it gets slower
>>         and slower the more processes you throw at it.
>>         MPI is not sequential, and comparing the execution of a parallel
>>         and a sequential program is, at best, erroneous.
>>
>>         The reason it runs _perfectly_ for your wien2k calculations (from
>>         what you say, they are sequential programs) is that the processes
>>         there do NO communication with each other, meaning that each
>>         process can be halted/resumed at any time without notifying
>>         anything but the running process. With your wien2k np=5 the OS can
>>         pause and resume processes as it pleases with *relatively* little
>>         impact on performance; there is some, but not that much. This is
>>         because no process depends on the others, and the OS will let some
>>         finish before moving on.
>>
>>         With MPI (siesta) this picture is _very_ wrong. Most MPI programs
>>         are communication bound (i.e. not embarrassingly parallelized
>>         using MPI).
>>         The data is distributed and every process depends on the others;
>>         no process can progress without informing the rest.
>>         This means 1) every processor does some work, 2) all processors
>>         communicate with each other, 3) repeat from step 1). Now do steps
>>         1 to 3 a couple of million times and the OS becomes flooded with
>>         stops/resumes (put simply; not in its entirety, but for brevity).
>>         Whenever you use MPI you should never use more processes than you
>>         have available.
>>         (https://www.open-mpi.org/faq/?category=running#oversubscribing)
>>         If you time your execution with timings of the MPI calls, you
>>         should most likely see immense increases in communication times,
>>         as the processes wait all the time; test this if you want clearer
>>         proof!
>>
>>         Bottom line: never use more MPI processes than you have physical
>>         cores.
>>         If you still want more explanation, turn to the MPI developers for
>>         technical details; all I can say is, never use more MPI processes
>>         than you have physical cores.
>>
>>
>>             Best regards,
>>
>>             Roberto
>>
>>
>>             On 11/02/2015 05:50 PM, Nick Papior wrote:
>>
>>
>>
>>                 2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                      Thank you Nick and Salvador for your comments.
>>
>>                      So Nick, basically you're saying that diagonalization
>>                      time might be playing no role. That is at variance,
>>                      for instance, with Wien2k, where diagonalization is
>>                      the most time-consuming step. In fact, my expectation
>>                      is correct for it; verified with a similar cell and
>>                      9 k-points.
>>
>>                 No, I am definitely not saying that! But I have no idea
>>                 about how your system is set up.
>>                 Diagonalization _is_ a big part of the computation.
>>                 How have you specified the k-points? Is it 9 k-points, or
>>                 9 k-points in the Monkhorst-Pack grid?
>>
>>
>>                      In that case "top" shows a first stage of 5
>>                      processes running at about 4/5 = 80% CPU power (more
>>                      or less stable) and a 2nd stage of 4 procs running at
>>                      100%. This is not MPI, but a parallel strategy based
>>                      on scripts (hope you are aware).
>>
>>                 wien2k is not siesta.
>>                 If wien2k is script based, i.e. sequential runs with
>>                 self-managed processes, then sure, they behave _very_
>>                 differently and wien2k should give you the desired
>>                 speedup. Your figures sound like hyperthreading to me.
>>
>>                      The same experiment performed with "mpirun -np 5 ..."
>>                      and Siesta shows more jumpy figures for CPU usage.
>>                      One task might be at 100%, another at 60%, and so on,
>>                      as if Linux were playing with tasks like a juggler.
>>
>>                 You are still implying usage of a quad-core machine
>>                 (quad == 4), and 4 < 5.
>>                 If you _only_ have 4 processors (intel hyperthreads do
>>                 _not_ count as processors) then your assumption is not
>>                 correct.
>>                 How would you expect a speedup by using 1 more process
>>                 than you have on your system?
>>                 If you see this juggling, it sounds like quad == 4 and
>>                 not 5.
>>
>>
>>                      To give you some feeling, please look at the numbers
>>                      here,
>>
>>
>>                      ----------------------------------------------------
>>                      * Running on    4 nodes in parallel
>>
>>                       ... snipped ...
>>
>>                      siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>                      siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>                      timer: Routine,Calls,Time,% = IterSCF        1    1637.906  99.72
>>                      elaps: Routine,Calls,Wall,% = IterSCF        1     410.919  99.72
>>
>>                      * Running on    5 nodes in parallel
>>
>>                       ... snipped ...
>>
>>                      siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>                      siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>                      timer: Routine,Calls,Time,% = IterSCF        1    1654.558  99.64
>>                      elaps: Routine,Calls,Wall,% = IterSCF        1     415.150  99.64
>>                      ----------------------------------------------------
>>
>>                      Those elapsed times are so close ... there must be
>>                      an easy explanation.
>>
>>                 Yes, if you are using mpirun -np 5 on a quad-core machine,
>>                 then the explanation is easy and your numbers are
>>                 irrelevant.
>>
>>
>>                      Best,
>>
>>                      Roberto
>>
>>
>>                      On 11/02/2015 04:14 PM, Nick Papior wrote:
>>
>>                          Basically:
>>                          Diag.ParallelOverK false
>>                            -> uses scalapack to diagonalize the Hamiltonian
>>                          Diag.ParallelOverK true
>>                            -> uses lapack to diagonalize the Hamiltonian
>>
>>                          If you have a very large system, you will not get
>>                          anything out of using the latter option (other
>>                          than using an enormous amount of memory). Only
>>                          for an _extreme_ number of k-points is the latter
>>                          favourable; there are exceptions.
>>
>>                          The latter is intended for small bulk
>>                          calculations with many k-points.
>>
>>                          Lastly, you have a quad-core machine and run
>>                          mpirun -np 5, and expect that to run faster. That
>>                          is a wrong assumption.
>>                          Secondly, diagonalization is not everything in
>>                          the program; check your TIMES file to figure out
>>                          whether it _is_ the diagonalization or a mixture.
>>
>>
>>                          2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                              Dear everyone,
>>
>>                              I seem to have a misunderstanding of how the
>>                              Diag.ParallelOverK feature works; any comment
>>                              would be much appreciated.
>>
>>                              I've got a large metallic cell, though still
>>                              with 9 k-points, that runs on a quad PC;
>>                              moreover, routine diagkp shows k-points are
>>                              distributed round robin among processes. Thus
>>                              I was expecting "mpirun -np 5 ..." to run
>>                              significantly faster than "mpirun -np 4 ...",
>>                              as judged from the elapsed time of individual
>>                              scf steps. Clearly, in the latter case, the
>>                              9th k-point would be taken by process 0 while
>>                              the other three would remain waiting, right?
>>
>>                              However, my expectations turned out to be
>>                              wrong; in fact the 2nd alternative appears to
>>                              be a tiny bit faster. Why?
>>
>>                              Thanks in advance,
>>
>>                              Roberto P.
>>
>>
>>
>>
>>                          --
>>                          Kind regards Nick
>>
>>
>>
>>
>>                 --
>>                 Kind regards Nick
>>
>>
>>
>>
>>         --
>>         Kind regards Nick
>>
>>
>>     --
>>
>>
>>     |---------------------------------------------------------------------|
>>     |   Dr. Roberto C. Pasianot         Phone: 54 11 4839 6709           |
>>     |   Gcia. Materiales, CAC-CNEA      FAX  : 54 11 6772 7362           |
>>     |   Avda. Gral. Paz 1499            Email: [email protected]       |
>>     |   1650 San Martin, Buenos Aires                                     |
>>     |   ARGENTINA                                                         |
>>     |---------------------------------------------------------------------|
>>
>>
>>
>>
>> --
>> Kind regards Nick
>>
>


-- 
Kind regards Nick
