Re: [SIESTA-L] Puzzled about ParallelOverK feature

Salvador Barraza-Lopez Mon, 02 Nov 2015 14:25:40 -0800

Could not be clearer, Nick.

Roberto, if you type top on your machine, you'll see two SIESTA processes 
competing for one core's time, each performing at 50% at most.

The other cores will wait for these processes whenever the algorithm needs an 
operation among all cores (e.g., a sum or a distributed matrix product). They 
have to wait until the two processes sharing a core finish their task, which 
degrades performance.
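
A toy model of that picture, as a rough sketch only. It assumes each process stays on one core, that processes sharing a core split its time equally, and that every step ends with a collective operation nobody can leave before the slowest process arrives. The function name and the unit of work are made up for illustration; real schedulers migrate processes, so actual numbers will differ.

```python
import math

def synchronized_step_time(n_procs, n_cores, work=1.0):
    """Wall time of one compute-then-synchronize step (toy model).

    Assumptions (not taken from SIESTA or MPI): processes are spread as
    evenly as possible over the cores and never migrate, and processes
    sharing a core each get an equal share of it.
    """
    # The busiest core hosts this many processes.
    procs_on_busiest_core = math.ceil(n_procs / n_cores)
    # Each of them runs at 1/procs_on_busiest_core of full speed, and the
    # collective operation cannot finish before the slowest one arrives.
    return work * procs_on_busiest_core

for n in (4, 5):
    print(n, "processes on 4 cores:", synchronized_step_time(n, 4))
```

In this model one extra process on a quad core doubles the time of every synchronized step, because all processes wait for the two that share a core.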


-Salvador



________________________________
From: [email protected] <[email protected]> on behalf of Nick 
Papior <[email protected]>
Sent: Monday, November 2, 2015 4:08 PM
To: [email protected]
Subject: Re: [SIESTA-L] Puzzled about ParallelOverK feature



2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
    Hi Nick,

    Please take my word: I'm not a computer guru, but I started
    using computers before the PC era :-).
    I know hyperthreading is evil for scientific calculations;
    it is even disabled in the BIOS. It is not that.

    Why I'm saying np=5 should take less time than np=4, even if
    my PC is a quad, is as follows.

This is a wrong statement!
By this argument, anything that can be embarrassingly parallelized would take 
less or equal time when run with as many processes as it has independent tasks.

    Distribution of k-points is round robin, and assume k-points
    (the real, trimmed ones, not the Monkhorst-Pack grid) take about
    the same time to process.
    Thus for np=4 I need 3 "time steps" to get the job done,
    namely (4 + 4 + 1) when seen from the k-points' perspective.
    On the other hand, for np=5 the time taken would be
    something like 2 * 1/0.80 = 2.5, or even shorter,
    1/0.80 + 1 = 2.25.
    What is flawed with this argument?

Your flaw lies in using more processes than you have cores available. This has 
nothing to do with the number of k-points, and your figures are based on a 
sequential program governed by the OS, not a parallel program (from what I've 
gathered).
You should try running a simple OpenMP program with OMP_NUM_THREADS=4 and 5 and 
see if that also degrades performance.
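
For reference, the estimate being disputed can be written out as follows. This is only a sketch of the idealized reasoning (the function name is made up here): it assumes round-robin distribution, equal cost per k-point, perfect time-sharing when processes outnumber cores, and, crucially, no communication or synchronization cost at all.

```python
def idealized_time(n_kpoints, n_procs, n_cores):
    """Estimated time in units of one k-point on one dedicated core.

    Assumes round-robin distribution, equal cost per k-point, perfect
    time-sharing when processes outnumber cores, and no communication.
    """
    total = 0.0
    remaining = n_kpoints
    while remaining > 0:
        active = min(n_procs, remaining)  # processes busy in this round
        # With more active processes than cores, each gets n_cores/active
        # of a core, so the round takes active/n_cores time units.
        total += max(1.0, active / n_cores)
        remaining -= active
    return total

for np_ in (4, 5, 9):
    print(f"np={np_}: {idealized_time(9, np_, 4)}")
```

For 9 k-points on 4 cores this gives 3.0 for np=4, 2.25 for np=5 and 2.25 for np=9. The model rewards oversubscription only because it charges nothing for processes waiting on each other.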

Oversubscribing your CPU _heavily_ degrades performance and yes, 
oversubscribing can make your program run worse than it does with one process 
per core, especially when using MPI.
By your argument you would get the same performance by doing mpirun -np 9, no? 
Try that and you will see that it gets slower and slower the more processes 
you throw at it.
MPI is not sequential, and comparing the execution of a parallel and a 
sequential program is, at best, misleading.

The reason it runs _perfectly_ for your wien2k calculations (from what you say 
they are sequential programs) is that the processes there do NO communication 
with each other, meaning that each process can be halted/resumed at any time 
without notifying anything but the running process. With your wien2k np=5 the 
OS can pause and resume processes as it pleases with *relatively* little impact 
on the performance; there is some, but not that much. This is because each 
process is independent of the others, and the OS will try to finish some before 
moving on.

With MPI (siesta) this is _very_ wrong. Most MPI programs are communication 
bound (i.e. not embarrassingly parallelized using MPI). The data is 
distributed and the processes depend on each other; no process can progress 
without informing the other processes.
This means 1) every process does some work, 2) all processes communicate 
with each other, 3) repeat from step 1). Now do steps 1 to 3 a couple of 
million times and the OS becomes flooded with stop/resumes (this is not the 
whole story, but it will do for brevity).
Whenever you use MPI you should never use more processes than you have 
processors available 
(https://www.open-mpi.org/faq/?category=running#oversubscribing).
If you time your execution, including timings of the MPI calls, you should most 
likely see immense increases in communication times, as the processes wait all 
the time. Test this if you want clearer proof!

Bottom line: never use more MPI processes than you have physical cores.
If you still want more explanations, turn to the MPI developers for more 
technical details. All I can say is: never use more MPI processes than you have 
physical cores.


    Best regards,

    Roberto


On 11/02/2015 05:50 PM, Nick Papior wrote:


2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:

    Thank you Nick and Salvador for your comments.

    So Nick, basically you're saying that diagonalization time might
    be playing no role. That is at variance, for instance, with Wien2k,
    where diagonalization is the most time-consuming step. In fact,
    my expectation is correct for it; I verified this with a similar cell
    and 9 k-points.

No, I am definitely not saying that! But I have no idea how your
system is set up.
Diagonalization _is_ a big part of the computation.
How have you specified the k-points? Is it 9 k-points, or 9 k-points in the
Monkhorst-Pack grid?
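
The distinction matters because the full grid and the trimmed list can differ in size. As a rough sketch, and assuming the only reduction applied is merging k with -k (time-reversal symmetry; how SIESTA actually trims the list is not stated in this thread), the two counts can be compared like this. The function name is made up for illustration.

```python
from itertools import product

def count_kpoints(n1, n2, n3):
    """Return (full, trimmed) k-point counts for an n1 x n2 x n3 grid.

    Points are u_r = (2r - n - 1) / (2n), r = 1..n, along each axis.
    "Trimmed" merges k with -k (time-reversal symmetry only), which is an
    assumption about how the k-point list is reduced.
    """
    sizes = (n1, n2, n3)
    seen = set()
    full = 0
    for r in product(*(range(1, n + 1) for n in sizes)):
        full += 1
        # Integer numerators over the denominators 2n, taken modulo 2n so
        # that points differing by a reciprocal lattice vector coincide.
        k = tuple((2 * ri - n - 1) % (2 * n) for ri, n in zip(r, sizes))
        minus_k = tuple((-m) % (2 * n) for m, n in zip(k, sizes))
        # Use the smaller of k and -k as the representative of the pair.
        seen.add(min(k, minus_k))
    return full, len(seen)

print(count_kpoints(3, 3, 1))
```

Under these assumptions a 3 x 3 x 1 grid has 9 points but only 5 distinct ones after trimming, so "9 k-points" can mean two quite different workloads.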


    In that case "top" shows a first stage of 5 processes running at
    about 4/5=80% CPU power (and more or less stable) and a 2nd stage of
    4 procs, running at 100%. This is not MPI, but a parallel strategy
    based on scripts (hope you are aware).

wien2k is not siesta.
If wien2k is script based, i.e. running sequential programs and managing the
processes itself, then sure, they behave _very_ differently and wien2k should
give you the desired speedup. Your figures sound like hyperthreading to me.

    The same experiment performed with "mpirun -np 5 ..." and Siesta,
    shows more jumpy figures for CPU usage. One task might be at 100%,
    another at 60%, and so on, as if Linux were playing with tasks
    like a juggler.

You are still implying usage of a quad core machine (quad == 4) and 4<5.
If you _only_ have 4 processors (Intel hyperthreads do _not_ count as
processors) then your assumption is not correct.
How would you expect a speedup from using 1 more process than you have cores on
your system?
If you see this juggling, it sounds like quad == 4 and not 5.


    To give you some feeling, please look at the numbers here,

    -----------------------------------------------------------------------
    * Running on    4 nodes in parallel

     ... snipped ...

    siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)    dDmax  Ef(eV)
    siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
    timer: Routine,Calls,Time,% = IterSCF        1    1637.906  99.72
    elaps: Routine,Calls,Wall,% = IterSCF        1     410.919  99.72


    * Running on    5 nodes in parallel

     ... snipped ...

    siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)    dDmax  Ef(eV)
    siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
    timer: Routine,Calls,Time,% = IterSCF        1    1654.558  99.64
    elaps: Routine,Calls,Wall,% = IterSCF        1     415.150  99.64
    ------------------------------------------------------------------------

    Those elapsed times are so close ... there must be an easy explanation.

Yes, if you are using mpirun -np 5 on a quad core machine, then the
explanation is easy and your numbers are irrelevant.
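
For what it is worth, the quoted figures can be compared directly. This is just arithmetic on the numbers in the two IterSCF lines; reading "timer:" as CPU time and "elaps:" as wall-clock time is an assumption based on the column labels, not something stated in the thread.

```python
# Figures copied from the two IterSCF lines in the quoted output.
runs = {
    4: {"timer": 1637.906, "elaps": 410.919},
    5: {"timer": 1654.558, "elaps": 415.150},
}

def timer_over_elaps(n_procs):
    """Ratio of the 'timer:' figure to the 'elaps:' figure for one run."""
    run = runs[n_procs]
    return run["timer"] / run["elaps"]

# Wall-clock time of the 5-process run relative to the 4-process run.
wall_ratio = runs[5]["elaps"] / runs[4]["elaps"]

for n in sorted(runs):
    print(f"np={n}: timer/elaps = {timer_over_elaps(n):.2f}")
print(f"elaps(np=5) / elaps(np=4) = {wall_ratio:.3f}")
```

The 5-process run is about 1% slower in elapsed time, and in both runs the timer figure is just under 4 times the elapsed figure, which would fit a machine with four cores kept busy either way.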


    Best,

    Roberto


    On 11/02/2015 04:14 PM, Nick Papior wrote:

        Basically:
        Diag.ParallelOverK false
          uses ScaLAPACK to diagonalize the Hamiltonian
        Diag.ParallelOverK true
          uses LAPACK to diagonalize the Hamiltonian

        If you have a very large system, you will not get anything out of
        using the latter option (other than using an enormous amount of
        memory).
        Only for an _extreme_ number of k-points is the latter favourable,
        though there are exceptions.

        The latter is intended for small bulk calculations with many
        k-points.
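
To put a rough number on the memory remark: LAPACK works on a matrix held entirely by the calling process, while ScaLAPACK distributes the matrix over the processes. The sketch below estimates the size of a single dense matrix per process in each mode. The function name and the 10,000-orbital example are hypothetical, the even split is an idealization, and a real run holds several such matrices plus workspace.

```python
import math

def matrix_bytes_per_process(n_orbitals, n_procs, parallel_over_k,
                             bytes_per_element=16):
    """Rough size of ONE dense n_orbitals x n_orbitals matrix per process.

    bytes_per_element=16 assumes double-precision complex entries.
    With parallel_over_k each process holds the whole matrix for its own
    k-point; otherwise the matrix is assumed to be split evenly.
    """
    full = bytes_per_element * n_orbitals ** 2
    if parallel_over_k:
        return full
    return math.ceil(full / n_procs)

for mode in (False, True):
    size_gb = matrix_bytes_per_process(10_000, 4, mode) / 1e9
    print(f"parallel_over_k={mode}: {size_gb:.1f} GB per matrix per process")
```

For this hypothetical system on 4 processes, one matrix costs 0.4 GB per process when distributed and 1.6 GB per process when every process keeps its own full copy.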

        Lastly, you have a quad core machine, run mpirun -np 5, and expect
        that to run faster. That is a wrong assumption.
        Secondly, diagonalization is not everything in the program; check
        your TIMES file to figure out whether it _is_ the diagonalization or
        a mixture.


        2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:


            Dear everyone,

            I seem to have a misunderstanding on how the Diag.ParallelOverK
            feature works; any comment would be much appreciated.

            I've got a large metallic cell, though still with 9 k-points, that
            runs on a quad PC; moreover, routine diagkp shows k-points are
            distributed round robin among processes. Thus I was expecting
            "mpirun -np 5 ..." to run significantly faster than
            "mpirun -np 4 ...", as judged from the elapsed time of individual
            scf steps.
            Clearly, in the latter case, the 9th k-point would be taken by
            process 0 while the other three would remain waiting, right?

            However, my expectations turned out to be wrong; in fact the
            2nd alternative appears to be a tiny bit faster.
            Why?

            Thanks in advance,

            Roberto P.




        --
        Kind regards Nick




