Oh, one last thing: timing can be dubious, as it is typically inferred
from clock cycles in the CPU, hence I would advise you to time it
yourself, for instance by doing:
date
...
date
It depends on the underlying timing function used.
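A minimal sketch of what I mean (the input/output file names are just placeholders for your own setup):

    date
    mpirun -np 4 siesta < input.fdf > out.np4
    date

The difference between the two date stamps then gives you the wall-clock time independently of siesta's internal timers.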
2015-11-03 14:03 GMT+01:00 Nick Papior <[email protected]>:
That is crazy, and it is _amazing_!
Nevertheless, I would still not recommend doing these kinds of things.
2015-11-03 13:37 GMT+01:00 RCP <[email protected]>:
Good morning!
Please have a look at the outcome of the crazy "mpirun -np 9 ..." exercise:
----------------------------------------------------------------------
* Running on  9 nodes in parallel
 ... snipped ...
siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
timer: Routine,Calls,Time,% = IterSCF        1    1733.886  99.53
elaps: Routine,Calls,Wall,% = IterSCF        1     435.030  99.53
-----------------------------------------------------------------------
Amazing: the elaps row is pretty close to the 410.0 (or so) of my
previous posts.
However, yes, I do seem to have a misunderstanding about the inner
workings of the code (well, in a sense, this was my first question),
because the timing info at the end of the output file says "diagon"
takes only about 1/3 of the total time:
------------------------------------------------------------------
elaps: ELAPSED times:
elaps: Routine            Calls   Time/call   Tot.time        %
elaps: siesta                 1     712.814    712.814    100.00
...
elaps: diagon                 1     225.603    225.603     31.65
elaps: cdiag                  2      47.763     95.527     13.40
elaps: cdiag1                 2       0.920      1.840      0.26
elaps: cdiag2                 2       3.335      6.670      0.94
elaps: cdiag3                 2      42.961     85.922     12.05
elaps: cdiag4                 2       0.512      1.025      0.14
elaps: DHSCF4                 1      71.874     71.874     10.08
elaps: dfscf                  1      70.398     70.398      9.88
elaps: overfsm                1       0.269      0.269      0.04
elaps: optical                1       0.000      0.000      0.00
-------------------------------------------------------------------
Take care,
Roberto
On 11/03/2015 08:28 AM, Nick Papior wrote:
2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:
  Hi,
  Thanks for your time and sharing of wisdom.
  In general terms I do agree with you, Nick, in the sense that
  running several sequential, independent tasks (wien2k)
  simultaneously is not equivalent to running a set of
  inter-communicating MPI tasks.
  However, here we are talking about a peculiar situation:
  parallelization over k-points is, essentially, an
  embarrassingly parallel problem, at least for my rather
  large cell (97 atoms). The sequential gathering of
  results from the different k-points, building the new charge
  density, and so on, should take negligible time compared
  to the time spent by a single task in diagonalizing a large
  matrix.
! ! NO ! ! ;)
Parallelization across k-points in siesta is NOT the same as an
embarrassingly parallel problem across k-points.
The _only_ thing in siesta that is parallelized embarrassingly is the
diagonalisation part (after having communicated all Hamiltonian elements
to all other nodes). Everything else is MPI parallelized: grid
operations, construction of the Hamiltonian, etc.!
Yes, even though the diagonalization is embarrassingly parallel and it
_should_ take the longest time, your assumption that the diagonalization
part is still the most time-consuming turns out to be wrong.
Furthermore, 96 atoms ~ 1000 orbitals, which is not that big a matrix
to diagonalize.
 Please look at the timing output for clarity on this.
  Of course, oversubscribing the CPUs must hurt performance
  at some point, and this is most likely worse for MPI tasks than
  for truly independent ones. But to me, 5 MPI tasks competing for
  4 cores does not look like such a terrible scenario.
MPI is not sequential programming, and any assumption you have about
oversubscribing is, put simply, wrong. The 5 MPI tasks do not just
compete for 4 cores; siesta's MPI tasks are linearly dependent on each
other (as written in the last mail) and hence they have to keep up with
each other all the time. If the MPI program were fully embarrassingly
parallel, then yes, you could perhaps have a point, but siesta is not
such a code.
How you can keep saying that oversubscribing cannot be that damaging
for performance (in fact improves it) is really baffling to me :)
  Moreover, np=5 and np=4 resulted in almost the same elapsed
  time. It is hard to believe that my expected time win for
  np=5 was (almost) exactly compensated by the performance loss.
Try doubling your system size and doing the same calculation.
  Nice discussion, guys. I'll do a little more research and let you
  know if something worthwhile comes out.
  Take care,
  Roberto
  On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:
    Could not be clearer, Nick.
    Ricardo, if you type top on your machine, you'll see two SIESTA
    processes competing for one core's time, each performing at 50%
    at most.
    Other cores will wait for these processes whenever an operation
    among all cores is necessary in the algorithm (e.g., a sum or a
    distributed matrix product)... thus those other cores just have
    to wait for the processes that compete for the same core's time
    to finish, degrading performance.
    -Salvador
    ------------------------------------------------------------------------
    *From:* [email protected] <[email protected]> on behalf of Nick Papior <[email protected]>
    *Sent:* Monday, November 2, 2015 4:08 PM
    *To:* [email protected]
    *Subject:* Re: [SIESTA-L] Puzzled about ParallelOverK feature
    2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
      Hi Nick,
      Please take my word: I'm not a computer guru, but I started
      using computers before the PC era :-).
      I know hyperthreading is evil for scientific calculations;
      the threads are even disabled in the BIOS. It is not that.
      Why I'm saying np=5 should take less time than np=4, even if
      my PC is a quad core, is as follows.
    This is a wrong statement!
    By this argument, anything that can be embarrassingly parallelized
    would take less than or equal time whenever you use as many
    processes as there are sequential divisions, regardless of the
    number of cores available.
      Distribution of k-points is round robin, and assume the k-points
      (the trimmed, real ones, not the M&K grid) take about the same
      time to process.
      Thus for np=4 I need 3 "time steps" to get the job done,
      namely (4 + 4 + 1) when seen from the k-points' perspective.
      On the other hand, for np=5 the time taken would be
      something like 2 * 1/0.80 = 2.5, or even shorter,
      1/0.80 + 1 = 2.25.
      What is flawed in this argument?
    Your flaw lies in using more processes than there are cores
    available; this has nothing to do with the number of k-points, and
    your figures are based on a sequential program governed by the OS,
    not a parallel program (from what I've gathered).
    You should try running a simple OpenMP program with
    OMP_NUM_THREADS=4 and 5 and see whether that also degrades
    performance; something along the lines of the sketch below.
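    A minimal sketch, assuming you have some OpenMP test binary at
    hand (./omp_test is just a placeholder, not a real program):

        export OMP_NUM_THREADS=4
        time ./omp_test      # matches the 4 physical cores
        export OMP_NUM_THREADS=5
        time ./omp_test      # oversubscribed: expect slower, not faster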
    Oversubscribing your CPU is _heavily_ hurting performance and,
    yes, oversubscribing can make your program run worse than using
    only as many processes as you have cores, especially when using
    MPI.
    By your argument you would get the same performance by doing
    mpirun -np 9, no? Try that and you will see that it gets slower
    and slower the more processes you throw at it.
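    If you want the numbers side by side, a minimal sketch of such a
    scan (input/output file names are placeholders for your own
    setup):

        for np in 4 5 9 ; do
          date
          mpirun -np $np siesta < input.fdf > out.np$np
          date
        done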
    MPI is not sequential and comparing the
execution of a parallel and
    sequential program is, at best, erroneous.
    The reason it runs _perfectly_ for your wien2k calculations (from
    what you say they are sequential programs) is that the processes
    there do NO communication with each other, meaning that each
    process can be halted/resumed at any time without notifying
    anything but the running process. With your wien2k np=5 the OS can
    pause and resume processes as it pleases with *relatively* little
    impact on the performance; there is some, but not that much. This
    is because each process is not dependent on the others and will
    try to finish its own work before moving on.
    With MPI (siesta) this is _very_ wrong. Most MPI programs are
    communication bound (i.e. not embarrassingly parallelized using
    MPI). The data is distributed and every process is dependent on
    the others; no process can progress without informing the other
    processes.
    This means 1) every processor does some work, 2) all processors
    communicate with each other, 3) repeat from step 1). Now do
    steps 1 to 3 a couple of million times and the OS becomes flooded
    with stops/resumes (basically; not in its entirety, but for
    brevity).
    Whenever you use MPI you should never use more processes than
    you have cores available.
    (https://www.open-mpi.org/faq/?category=running#oversubscribing)
    If you time your execution with timings of the MPI calls, you
    should most likely see immense increases in communication times,
    as the processes wait all the time; test this if you want clearer
    proof!
    Bottom line: never use more MPI processes than you have physical
    processors.
    If you still want more explanations, turn to the MPI developers
    for the technical details. All I can say is: never use more MPI
    processes than you have physical cores.
      Best regards,
      Roberto
      On 11/02/2015 05:50 PM, Nick Papior wrote:
        2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:
          Thank you Nick and Salvador for your comments.
          So Nick, basically you're saying that diagonalization time
          might be playing no role. That is at variance, for instance,
          with Wien2k, where diagonalization is the most time-consuming
          step. In fact, my expectation is correct for it; verified
          with a similar cell and 9 k-points.
        No, I am definitely not saying that! But I have no idea about
        how your system is set up.
        Diagonalization _is_ a big part of the computation.
        How have you specified the k-points? Is it 9 k-points, or 9
        k-points in the Monkhorst-Pack grid?
          In that case "top" shows a first stage of 5 processes
          running at about 4/5 = 80% CPU power (and more or less
          stable) and a 2nd stage of 4 procs running at 100%. This is
          not MPI, but a parallel strategy based on scripts (hope you
          are aware).
        wien2k is not siesta.
        If wien2k is script based, i.e. sequential runs with
        self-managed processes, then sure, they behave _very_
        differently and wien2k should give you the desired speedup.
        Your figures sound like hyperthreading to me.
          The same experiment performed with "mpirun -np 5 ..." and
          Siesta shows more jumpy figures for CPU usage. One task might
          be at 100%, another at 60%, and so on, as if Linux were
          playing with the tasks like a juggler.
        You are still implying usage of a quad-core machine
        (quad == 4), and 4 < 5.
        If you _only_ have 4 processors (intel hyperthreads do _not_
        count as processors), then your assumption is not correct.
        How would you expect a speedup by using 1 more process than
        you have cores on your system?
        If you see this juggling, it sounds like quad == 4 and not 5.
          To give you some feeling, please look at the numbers here:
-----------------------------------------------------------------------
          * Running on  4 nodes in parallel
           ... snipped ...
          siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
          siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
          timer: Routine,Calls,Time,% = IterSCF        1    1637.906  99.72
          elaps: Routine,Calls,Wall,% = IterSCF        1     410.919  99.72

          * Running on  5 nodes in parallel
           ... snipped ...
          siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
          siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
          timer: Routine,Calls,Time,% = IterSCF        1    1654.558  99.64
          elaps: Routine,Calls,Wall,% = IterSCF        1     415.150  99.64
------------------------------------------------------------------------
          Those elapsed times are so close ... there must be an easy
          explanation.
        Yes, if you are using mpirun -np 5 on a quad-core machine,
        then the explanation is easy and your numbers are irrelevant.
          Best,
          Roberto
          On 11/02/2015 04:14 PM, Nick Papior wrote:
            Basically:
            Diag.ParallelOverK false
              uses scalapack to diagonalize the Hamiltonian
            Diag.ParallelOverK true
              uses lapack to diagonalize the Hamiltonian
            If you have a very large system, you will not get anything
            out of using the latter option (apart from using an
            enormous amount of memory). Only for an _extreme_ number of
            k-points is the latter favourable; there are exceptions.
            The latter is intended for small bulk calculations with
            many k-points.
            Lastly, you have a quad-core machine and run mpirun -np 5,
            and expect that to run faster. That is a wrong assumption.
            Secondly, diagonalization is not everything in the program;
            check your TIMES file to figure out whether it _is_ the
            diagonalization or a mixture.
            2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
              Dear everyone,
              I seem to have a misunderstanding about how the
              Diag.ParallelOverK feature works; any comment would be
              much appreciated.
              I've got a large metallic cell, though still with 9
              k-points, that runs on a quad PC; moreover, routine
              diagkp shows the k-points are distributed round robin
              among the processes. Thus I was expecting
              "mpirun -np 5 ..." to run significantly faster than
              "mpirun -np 4 ...", as judged from the elapsed time of
              the individual scf steps.
              Clearly, in the latter case, the 9th k-point would be
              taken by process 0 while the other three would remain
              waiting, right?
              However, my expectations turned out to be wrong; in fact
              the 2nd alternative appears to be a tiny bit faster.
              Why?
              Thanks in advance,
              Roberto P.
            --
            Kind regards Nick
        --
        Kind regards Nick
    --
    Kind regards Nick
  --
  |---------------------------------------------------------------------|
  |  Dr. Roberto C. Pasianot         Phone: 54 11 4839 6709             |
  |  Gcia. Materiales, CAC-CNEA      FAX  : 54 11 6772 7362             |
  |  Avda. Gral. Paz 1499            Email: [email protected]    |
  |  1650 San Martin, Buenos Aires                                      |
  |  ARGENTINA                                                          |
  |---------------------------------------------------------------------|
--
Kind regards Nick
--
Kind regards Nick
--
Kind regards Nick