Good morning!,

Please have a look at the outcome of the crazy "mpirun -np 9 ..."
exercise,

----------------------------------------------------------------------
* Running on    9 nodes in parallel

 ... snipped ...

siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
timer: Routine,Calls,Time,% = IterSCF        1    1733.886  99.53
elaps: Routine,Calls,Wall,% = IterSCF        1     435.030  99.53
-----------------------------------------------------------------------

Amazing: the elapsed time in the elaps row is pretty close to the ~410 s
of my previous posts.

However, yes, I seem to have a misunderstanding about the inner
workings of the code (well, in a sense, this was my first question)
because the timing info at the end of the output file says "diagon"
is taking only about 1/3 of the total time,

------------------------------------------------------------------
elaps: ELAPSED times:
elaps:  Routine       Calls   Time/call    Tot.time        %
elaps:  siesta            1     712.814     712.814   100.00
...
elaps:  diagon            1     225.603     225.603    31.65
elaps:  cdiag             2      47.763      95.527    13.40
elaps:  cdiag1            2       0.920       1.840     0.26
elaps:  cdiag2            2       3.335       6.670     0.94
elaps:  cdiag3            2      42.961      85.922    12.05
elaps:  cdiag4            2       0.512       1.025     0.14
elaps:  DHSCF4            1      71.874      71.874    10.08
elaps:  dfscf             1      70.398      70.398     9.88
elaps:  overfsm           1       0.269       0.269     0.04
elaps:  optical           1       0.000       0.000     0.00
-------------------------------------------------------------------

Take care,

Roberto

On 11/03/2015 08:28 AM, Nick Papior wrote:


2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:

    Hi,

    Thanks for your time and sharing of wisdom.
    In general terms I do agree with you, Nick, in the sense that
    running several sequential, independent tasks (wien2k)
    simultaneously is not equivalent to running a set of
    inter-communicating MPI tasks.

    However, here we're talking about a peculiar situation:
    parallelization over k-points is essentially an
    embarrassingly parallel problem, at least for my rather
    large cell (97 atoms). The sequential gathering of
    results from different k-points, building the new charge
    density, and so on should take negligible time compared
    to the time spent by a single task in diagonalizing a large
    matrix.

! ! NO ! ! ;)
Parallelization across k-points in siesta is NOT the same as an
embarrassingly parallel problem across k-points.
The _only_ thing in siesta that is parallelized embarrassingly is the
diagonalisation part (after having communicated all Hamiltonian elements
to all other nodes). Everything else is MPI parallelized: grid
operations, construction of the Hamiltonian, etc.!
Yes, even though the diagonalization is embarrassingly parallel and it
_should_ take the longest time, your assumption that the diagonalization
part is still the most time-consuming one turns out to be wrong.
Furthermore, 96 atoms is roughly 1000 orbitals, not that big a matrix to
diagonalize.
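
(To make that division of labour concrete, here is a rough, hedged sketch
in plain MPI C, not SIESTA source, of what one SCF step looks like when
only the per-k-point diagonalizations are independent and everything else
goes through collectives; the array size and the "work" in the loops are
invented.)

--------------------------------------------------------------------
/* Sketch of one SCF step: distributed setup, per-rank k-point loop
 * (round robin), and a collective to rebuild the density matrix.
 * Compile: mpicc -O2 scf_sketch.c -o scf_sketch                      */
#include <mpi.h>
#include <stdio.h>

#define NK   9      /* number of k-points, as in this thread          */
#define NDM  1000   /* stand-in for the size of the density matrix    */

int main(int argc, char **argv) {
    int rank, nprocs;
    static double dm_local[NDM], dm_total[NDM];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* (1) Grid operations / Hamiltonian setup: the data is distributed,
     *     so every rank must exchange pieces with the others.          */
    double h_piece = rank + 1.0, h_sum = 0.0;
    MPI_Allreduce(&h_piece, &h_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* (2) k-point loop, round robin: rank r handles k = r, r+nprocs, ...
     *     Each of these "diagonalizations" is independent of the others. */
    for (int k = rank; k < NK; k += nprocs)
        for (int i = 0; i < NDM; i++)
            dm_local[i] += h_sum * 1e-6;   /* dummy work per k-point */

    /* (3) Combine the density-matrix contributions from all k-points:
     *     a collective call, so no rank continues until all arrive.    */
    MPI_Allreduce(dm_local, dm_total, NDM, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("dm_total[0] = %g\n", dm_total[0]);

    MPI_Finalize();
    return 0;
}
--------------------------------------------------------------------

Steps (1) and (3) are the parts that force all ranks to stay in step;
only step (2) is embarrassingly parallel.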

Please look at the timing output for clarity on this.


    Of course, oversubscribing the CPUs must hurt performance
    at some point, and this is most likely worse for MPI tasks than
    for truly independent ones. But to me, 5 MPI tasks competing for
    4 cores does not look like such a terrible scenario.

MPI is not sequential programming, and, put simply, any assumption you
have about oversubscribing is wrong. The 5 MPI tasks do not merely
compete for 4 cores: siesta's MPI tasks are dependent on each other (as
written in the last mail) and hence they have to keep in step all the
time. If the MPI program were fully embarrassingly parallelized, then
yes, you could perhaps have a point, but siesta is not such a code.
How you can keep saying that oversubscribing cannot be that damaging to
performance (or can in fact improve it) is really baffling to me :)

    Moreover, np=5 and np=4 resulted in almost the same elapsed
    time. It is hard to believe that my expected time gain for
    np=5 was (almost) exactly compensated by a performance loss.

Try doubling your system size and doing the same calculation.


    Nice discussion, guys. I'll do a little more research and let you
    know if something worthwhile comes out.


    Take care,

    Roberto

    On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:

        Could not be clearer, Nick.

        Ricardo, if you type top on your machine, you'll see two SIESTA
        processes competing for one core's time, and performing at 50%
        at most.


        Other cores will wait for these processes whenever an operation
        among all cores is necessary in the algorithm (e.g., a sum or a
        distributed matrix product); those other cores will simply have
        to wait for the processes competing for the same core's time to
        finish, thus degrading performance.


        -Salvador



        ------------------------------------------------------------------------
        *From:* [email protected] <[email protected]> on behalf of
        Nick Papior <[email protected]>
        *Sent:* Monday, November 2, 2015 4:08 PM
        *To:* [email protected]
        *Subject:* Re: [SIESTA-L] Puzzled about ParallelOverK feature


        2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:


            Hi Nick,

            Please take my word: I'm not a computer guru, but I started
            using computers before the PC era :-).
            I know hyperthreading is evil for scientific calculations;
            it is even disabled in the BIOS. It is not that.

            Why I'm saying np=5 should take less time than np=4, even if
            my PC is a quad core, is as follows.

        This is a wrong statement!
        By this argument, everything that can be embarrassingly
        parallelized would take less or equal time when using as many
        processes as there are sequential divisions.

            Distribution of k-points is round robin, and assume k-points
            (the trimmed, real ones, not the M&K grid) take about the same
            time to process.
            Thus for np=4 I need 3 "time steps" to get the job done,
            namely (4 + 4 + 1) when seen from the k-points' perspective.
            On the other hand, for np=5 the time taken would be
            something like 2 * 1/0.80 = 2.5, or even shorter,
            1/0.80 + 1 = 2.25.
            What is flawed with this argument?
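
        (As a back-of-the-envelope check, the arithmetic in the quoted
        paragraph can be written out as a tiny C program; it is only a
        sketch of that argument and bakes in its two assumptions: equal
        cost per k-point and ~80% throughput per process when 5 processes
        share 4 cores.)

        ------------------------------------------------------------
        #include <stdio.h>

        int main(void) {
            const int nk = 9;                  /* real k-points          */

            /* np = 4 at full speed: ceil(9/4) = 3 round-robin rounds    */
            int t_np4 = (nk + 3) / 4;

            /* np = 5, both rounds slowed to 80%: ceil(9/5)/0.8 = 2.50   */
            double t_np5_slow = ((nk + 4) / 5) / 0.80;

            /* np = 5, only the oversubscribed round slowed: 2.25        */
            double t_np5_opt = 1.0 / 0.80 + 1.0;

            printf("np=4: %d time steps\n", t_np4);
            printf("np=5: %.2f to %.2f time steps\n", t_np5_opt, t_np5_slow);
            return 0;
        }
        ------------------------------------------------------------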

        Your flaw lies in using more cores than are available; this has
        nothing to do with the number of k-points, and your figures are
        based on a sequential program governed by the OS, not a parallel
        program (from what I've gathered).
        You should try running a simple OpenMP program with
        OMP_NUM_THREADS=4 and then 5 and see whether that also degrades
        performance.
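
        (For instance, a minimal test along those lines; this is only a
        hedged sketch with dummy work in the loop, to be run with
        OMP_NUM_THREADS=4 and then 5 on a quad core and compared.)

        ------------------------------------------------------------
        /* Compile: gcc -fopenmp -O2 omp_test.c -o omp_test
         * Run:     OMP_NUM_THREADS=4 ./omp_test
         *          OMP_NUM_THREADS=5 ./omp_test                      */
        #include <omp.h>
        #include <stdio.h>

        int main(void) {
            const long n = 200000000L;  /* enough dummy work to time */
            double sum = 0.0;
            double t0 = omp_get_wtime();

            /* Embarrassingly parallel: no communication between threads */
            #pragma omp parallel for reduction(+:sum)
            for (long i = 1; i <= n; i++)
                sum += 1.0 / (double)i;

            double t1 = omp_get_wtime();
            printf("threads=%d  sum=%.6f  wall=%.3f s\n",
                   omp_get_max_threads(), sum, t1 - t0);
            return 0;
        }
        ------------------------------------------------------------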

        Oversubscribing your CPU _heavily_ hurts performance, and yes,
        oversubscribing can make your program run worse than using just
        the number of cores, especially when using MPI.
        By your argument you would get the same performance by doing
        mpirun -np 9, no? Try that and you will see that it only gets
        slower the more processes you throw at it.
        MPI is not sequential, and comparing the execution of a parallel
        and a sequential program is, at best, erroneous.

        The reason it runs _perfectly_ for your wien2k calculations (from
        what you say they are sequential programs) is that the processes
        there do NO communication with each other, meaning that each
        process can be halted/resumed at any time without notifying
        anything but the running process. With your wien2k np=5 the OS
        can pause and resume processes as it pleases with *relatively*
        little impact on the performance; there is some, but not that
        much. This is because each process is not dependent on the
        others, and the OS will try to finish some of them before moving
        on.

        With MPI (siesta) this is _very_ wrong. Most MPI programs are
        communication bound (i.e. not embarrassingly parallelized using
        MPI). The data is distributed and every process depends on the
        others; no process can progress without informing the other
        processes.
        This means 1) every processor does some work, 2) all processors
        communicate with each other, 3) repeat from step 1). Now do steps
        1 to 3 a couple of million times and the OS becomes flooded with
        stops/resumes (basically; not in its entirety, but for brevity).
        Whenever you use MPI you should never use more processors than
        you have available.
        (https://www.open-mpi.org/faq/?category=running#oversubscribing)
        If you time your execution with timings of the MPI calls, you
        should most likely see immense increases in communication times
        as the processes wait all the time; test this if you want clearer
        proof!
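
        (A hedged sketch of that kind of test: a toy MPI program,
        unrelated to siesta, that times its own MPI_Allreduce calls.
        Comparing "mpirun -np 4" against "mpirun -np 5" on a quad core
        should show the time spent waiting inside the collective grow.)

        ------------------------------------------------------------
        /* Compile: mpicc -O2 mpi_wait.c -o mpi_wait                 */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double work = 0.0, total = 0.0, t_comm = 0.0;
            for (int step = 0; step < 2000; step++) {
                /* local "work" phase (dummy) */
                for (int i = 0; i < 100000; i++)
                    work += 1e-9 * (i % 7);

                /* communication phase: nobody leaves before all arrive */
                double t0 = MPI_Wtime();
                MPI_Allreduce(&work, &total, 1, MPI_DOUBLE, MPI_SUM,
                              MPI_COMM_WORLD);
                t_comm += MPI_Wtime() - t0;
            }
            printf("rank %d: time inside MPI_Allreduce = %.3f s\n",
                   rank, t_comm);
            MPI_Finalize();
            return 0;
        }
        ------------------------------------------------------------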

        Bottom line: never use more MPI processes than you have physical
        processors.
        If you still want more explanations, turn to the MPI developers
        for the technical details; all I can say is, never use more MPI
        processes than you have physical cores.


            Best regards,

            Roberto


            On 11/02/2015 05:50 PM, Nick Papior wrote:



                2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:


                    Thank you Nick and Salvador for your comments.

                    So Nick, basically you're saying that diagonalization
                    time might be playing no role. That is at variance,
                    for instance, with Wien2k, where diagonalization is
                    the most time-consuming step. In fact, my expectation
                    is correct for it; verified with a similar cell and
                    9 k-points.

                No, I am definitely not saying that! But I have no idea
                about how your system is set up.
                Diagonalization _is_ a big part of the computation.
                How have you specified the k-points? Is it 9 k-points, or
                9 k-points in the Monkhorst-Pack grid?


                    In that case "top" shows a first stage of 5 processes
                    running at about 4/5 = 80% CPU power (and more or
                    less stable) and a 2nd stage of 4 procs running at
                    100%. This is not MPI, but a parallel strategy based
                    on scripts (hope you are aware).

                wien2k is not siesta.
                If wien2k is script based, i.e. sequential runs with
                self-managed processes, then sure, they behave _very_
                differently, and wien2k should give you the desired
                speedup. Your figures sound like hyperthreading to me.

                    The same experiment performed with "mpirun -np 5 ..."
                    and Siesta shows more jumpy figures for CPU usage.
                    One task might be at 100%, another at 60%, and so on,
                    as if Linux were playing with tasks like a juggler.

                You are still implying usage of a quad core machine
                (quad == 4) and 4 < 5.
                If you _only_ have 4 processors (intel hyperthreads do
                _not_ count as processors) then your assumption is not
                correct.
                How would you expect a speedup by using 1 more process
                than you have on your system?
                If you see this juggling, it sounds like quad == 4 and
                not 5.


                    To give you some feeling, please look at the numbers
                    here,

                -----------------------------------------------------------------------
                * Running on    4 nodes in parallel

                 ... snipped ...

                siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
                siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
                timer: Routine,Calls,Time,% = IterSCF        1    1637.906  99.72
                elaps: Routine,Calls,Wall,% = IterSCF        1     410.919  99.72


                * Running on    5 nodes in parallel

                 ... snipped ...

                siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
                siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
                timer: Routine,Calls,Time,% = IterSCF        1    1654.558  99.64
                elaps: Routine,Calls,Wall,% = IterSCF        1     415.150  99.64
                ------------------------------------------------------------------------

                    Those elapsed times are so close ... there must be an
                    easy explanation.

                Yes, if you are using mpirun -np 5 on a quad-core
                machine, then the explanation is easy and your numbers
                are irrelevant.


                    Best,

                    Roberto


                    On 11/02/2015 04:14 PM, Nick Papior wrote:

                        Basically:
                        Diag.ParallelOverK false
                          uses scalapack to diagonalize the Hamiltonian
                        Diag.ParallelOverK true
                          uses lapack to diagonalize the Hamiltonian

                        If you have a very large system, you will not get
                        anything out of using the latter option (other
                        than using an enormous amount of memory). Only
                        for an _extreme_ number of k-points is the latter
                        favourable; there are exceptions.

                        The latter is intended for small bulk
                        calculations with many k-points.
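
                        (In fdf terms, the two modes above are just the
                        two values of that flag; a minimal fragment, with
                        the comments only restating the explanation
                        above:)

                        ------------------------------------------------
                        # Each diagonalization distributed over all MPI
                        # processes (scalapack); the usual choice for
                        # large systems:
                        Diag.ParallelOverK  false

                        # Whole k-points handed to individual MPI
                        # processes, each diagonalized serially (lapack);
                        # meant for small bulk cells with many k-points:
                        # Diag.ParallelOverK  true
                        ------------------------------------------------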

                        Lastly, you have a quad core machine and run
                        mpirun -np 5, and expect that to run faster. That
                        is a wrong assumption.
                        Secondly, diagonalization is not everything in
                        the program; check your TIMES file to figure out
                        whether it _is_ the diagonalization or a mixture.


                        2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:


                            Dear everyone,

                            I seem to have a misunderstanding on how the
                            Diag.ParallelOverK feature works; any comment
                            would be much appreciated.

                            I've got a large metallic cell, though still
                            with 9 k-points, that runs on a quad PC;
                            moreover, routine diagkp shows k-points are
                            distributed round robin among processes. Thus
                            I was expecting "mpirun -np 5 ..." to run
                            significantly faster than "mpirun -np 4 ...",
                            as judged from the elapsed time of individual
                            scf steps. Clearly, in the latter case, the
                            9th k-point would be taken by process 0 while
                            the other three would remain waiting, right?

                            However, my expectations turned out to be
                            wrong; in fact the 2nd alternative appears to
                            be a tiny bit faster. Why?

                            Thanks in advance,

                            Roberto P.




                        --
                        Kind regards Nick




                --
                Kind regards Nick




        --
        Kind regards Nick


    --

    |---------------------------------------------------------------------|
    |   Dr. Roberto C. Pasianot          Phone: 54 11 4839 6709           |
    |   Gcia. Materiales, CAC-CNEA       FAX  : 54 11 6772 7362           |
    |   Avda. Gral. Paz 1499             Email: [email protected]        |
    |   1650 San Martin, Buenos Aires                                      |
    |   ARGENTINA                                                          |
    |---------------------------------------------------------------------|




--
Kind regards Nick
