Yes and no, it depends on the underlying system_clock, and that is system
dependent.
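
As a minimal sketch (not the actual SIESTA code) of what such wall-clock
timing looks like with the Fortran system_clock intrinsic; the count rate
and wrap-around point are implementation dependent, which is exactly why
the result is system dependent:

  program time_sketch
    implicit none
    integer :: c0, c1, rate, cmax
    real :: elapsed
    ! Resolution (rate) and wrap-around (cmax) are system dependent.
    call system_clock(count_rate=rate, count_max=cmax)
    call system_clock(c0)
    ! ... work to be timed ...
    call system_clock(c1)
    if (c1 < c0) c1 = c1 + cmax   ! counter wrapped around
    elapsed = real(c1 - c0) / real(rate)
    print *, 'Elapsed wall time (s):', elapsed
  end program time_sketch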

So yes, time outside siesta for "correct" timing of full application.

2015-11-03 14:49 GMT+01:00 RCP <[email protected]>:

> Yeah! I also suspected the clock. As far as I could follow the
> code, wall-clock time derives from the F95 SYSTEM_CLOCK intrinsic,
> which is handled by the master node. Ruling out programming
> mistakes, would that still be dubious?
> Well, at least I could time siesta from the shell, so there is no
> tinkering with the sources.
>
> On 11/03/2015 10:12 AM, Nick Papior wrote:
>
>> Oh, a last thing, timing can be dubious as that is typically inferred
>> from clock-cycles in the cpu, hence I would advise you to time it
>> yourself.
>> For instance by doing:
>>
>> date
>> ...
>> date
>>
>> It depends on the underlying timing function used.
>>
>>
>> 2015-11-03 14:03 GMT+01:00 Nick Papior <[email protected]>:
>>
>>     That is crazy, and it is _amazing_!
>>
>>     Nevertheless, I would still not recommend you do these kinds of things.
>>
>>
>>     2015-11-03 13:37 GMT+01:00 RCP <[email protected]>:
>>
>>         Good morning!
>>
>>         Please have a look at the outcome of the crazy "mpirun -np 9 ..."
>>         exercise,
>>
>>
>>         ----------------------------------------------------------------------
>>         * Running on    9 nodes in parallel
>>
>>          ... snipped ...
>>
>>         siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>         siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>         timer: Routine,Calls,Time,% = IterSCF        1    1733.886  99.53
>>         elaps: Routine,Calls,Wall,% = IterSCF        1     435.030  99.53
>>         -----------------------------------------------------------------------
>>
>>         Amazing: the elaps row is pretty close to the 410.0 (or so) of my
>>         previous posts.
>>
>>         However, yes, I seem to have a misunderstanding about the inner
>>         workings of the code (well, in a sense, this was my first question)
>>         because the timing info at the end of the output file says "diagon"
>>         is taking only about 1/3 of the total time,
>>
>>         ------------------------------------------------------------------
>>         elaps: ELAPSED times:
>>         elaps:  Routine       Calls   Time/call    Tot.time        %
>>         elaps:  siesta            1     712.814     712.814   100.00
>>         ...
>>         elaps:  diagon            1     225.603     225.603    31.65
>>         elaps:  cdiag             2      47.763      95.527    13.40
>>         elaps:  cdiag1            2       0.920       1.840     0.26
>>         elaps:  cdiag2            2       3.335       6.670     0.94
>>         elaps:  cdiag3            2      42.961      85.922    12.05
>>         elaps:  cdiag4            2       0.512       1.025     0.14
>>         elaps:  DHSCF4            1      71.874      71.874    10.08
>>         elaps:  dfscf             1      70.398      70.398     9.88
>>         elaps:  overfsm           1       0.269       0.269     0.04
>>         elaps:  optical           1       0.000       0.000     0.00
>>         -------------------------------------------------------------------
>>
>>         Take care,
>>
>>         Roberto
>>
>>         On 11/03/2015 08:28 AM, Nick Papior wrote:
>>
>>
>>
>>             2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:
>>
>>                 Hi,
>>
>>                 Thanks for your time and sharing of wisdom.
>>                 In general terms I do agree with you Nick, in the sense that
>>                 running several sequential independent tasks (wien2k)
>>                 simultaneously is not equivalent to running a set of
>>                 inter-communicating MPI tasks.
>>
>>                 However here we're talking about a peculiar situation,
>>                 namely, parallelization over k-points is, essentially, an
>>                 embarrassingly parallel problem, at least for my rather
>>                 large cell (97 atoms). The sequential gathering of
>>                 results from different k-points, building the new charge
>>                 density and so on, should take negligible time compared
>>                 to the time spent by a single task in diagonalizing a large
>>                 matrix.
>>
>>             ! ! NO ! ! ;)
>>             Parallelization across k-points in siesta is NOT the same as an
>>             embarrassingly parallel problem across k-points.
>>             The _only_ thing in siesta that is parallelized embarrassingly is
>>             the diagonalization part (after having communicated all
>>             Hamiltonian elements to all other nodes). Everything else is MPI
>>             parallelized: grid operations, construction of the Hamiltonian,
>>             etc. etc.!
>>             Yes, even though the diagonalization is embarrassingly parallel
>>             and it _should_ take the longest time, your assumption that the
>>             diagonalization part is still the most time-consuming step turns
>>             out to be wrong.
>>             Furthermore, 96 atoms ~ 1000 orbitals, which is not that big a
>>             matrix to diagonalize.
>>
>>             Please look at the timing output for clarification of this.
>>
>>
>>                 Of course, oversubscribing the CPUs must hurt performance
>>                 at some point, and this is most likely worse for MPI tasks
>>                 than for truly independent ones. But to me 5 MPI tasks
>>                 competing for 4 cores does not look like such a terrible
>>                 scenario.
>>
>>             MPI is not sequential programming, and any assumption you have
>>             about oversubscribing is, put simply, wrong. The 5 MPI tasks do
>>             not just compete for 4 cores: siesta MPI tasks are linearly
>>             dependent on each other (as written in the last mail) and hence
>>             they have to keep up all the time. If the MPI program were fully
>>             embarrassingly parallelized, then yes, you could, perhaps, have a
>>             point, but siesta is not such a code.
>>             How you can keep saying that oversubscribing cannot be that
>>             damaging for performance (and in fact improve it) is really
>>             baffling to me :)
>>
>>                 Moreover, np=5 and np=4 resulted in almost the same elapsed
>>                 time. It is hard to believe that my expected time win for
>>                 np=5 was (almost) exactly compensated by performance loss.
>>
>>             Try doubling your system size and do the same calculation.
>>
>>
>>                 Nice discussion guys. I'll do a little more research and let
>>                 you know if something worthwhile comes out.
>>
>>                 Take care,
>>
>>                 Roberto
>>
>>                 On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:
>>
>>                     Could not be clearer Nick.
>>
>>                     Ricardo, if you type top on your machine, you'll see two
>>                     SIESTA processes competing for one core's time, and
>>                     performing at 50% at most.
>>
>>                     Other cores will wait for these processes when an
>>                     operation among all cores is necessary in the algorithm
>>                     (i.e., a sum or a distributed matrix product)... thus
>>                     these other cores will just have to wait for the tasks of
>>                     the processes competing for the same core to end, thus
>>                     degrading performance.
>>
>>                     -Salvador
>>
>>
>>
>>                     ------------------------------------------------------------------------
>>                     *From:* [email protected] <[email protected]>
>>                     on behalf of Nick Papior <[email protected]>
>>                     *Sent:* Monday, November 2, 2015 4:08 PM
>>                     *To:* [email protected]
>>                     *Subject:* Re: [SIESTA-L] Puzzled about ParallelOverK feature
>>
>>
>>                     2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                         Hi Nick,
>>
>>                         Please take my word: I'm not a computer guru, but I
>>                         started using computers before the PC era :-).
>>                         I know hyperthreading is evil for scientific
>>                         calculations; it is even disabled in the BIOS. It is
>>                         not that.
>>
>>                         Why I'm saying np=5 should take less time than np=4,
>>                         even if my PC is a quad, is as follows.
>>
>>                     This is a wrong statement!
>>                     By this argument, everything that can be embarrassingly
>>                     parallelized would take less or equal time when using as
>>                     many processes as there are sequential divisions,
>>                     regardless of the number of cores.
>>
>>                         Distribution of k-points is round robin, and assume
>>                         k-points (the trimmed, real ones, not the M&K grid)
>>                         take about the same time to process.
>>                         Thus for np=4 I need 3 "time steps" to get the job
>>                         done, namely (4 + 4 + 1) when seen from the k-points
>>                         perspective.
>>                         On the other hand, for np=5 the time taken would be
>>                         something like 2 * 1/0.80 = 2.5, or even shorter,
>>                         1/0.80 + 1 = 2.25.
>>                         What is flawed with this argument?
>>
>>
>>                     Your flaw lies in using more cores than available; this
>>                     has nothing to do with the number of k-points, and your
>>                     figures are based on a sequential program governed by the
>>                     OS, not a parallel program (from what I've gathered).
>>                     You should try running a simple OpenMP program with
>>                     OMP_NUM_THREADS=4 and 5 and see if that also degrades
>>                     performance; see the sketch below.
>>
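>>                     As a minimal toy sketch (hypothetical test program, not
>>                     part of SIESTA) of such an OpenMP test: compile with,
>>                     e.g., gfortran -fopenmp, run once with OMP_NUM_THREADS=4
>>                     and once with OMP_NUM_THREADS=5 on a quad-core machine,
>>                     and compare the reported times.
>>
>>                       program omp_test
>>                         use omp_lib
>>                         implicit none
>>                         integer, parameter :: n = 50000000
>>                         integer :: i
>>                         real(8) :: s, t0, t1
>>                         s = 0.0d0
>>                         t0 = omp_get_wtime()
>>                         ! Simple compute-bound loop shared among the threads.
>>                         !$omp parallel do reduction(+:s)
>>                         do i = 1, n
>>                            s = s + sin(real(i,8))
>>                         end do
>>                         !$omp end parallel do
>>                         t1 = omp_get_wtime()
>>                         print *, 'threads =', omp_get_max_threads(), &
>>                                  ' wall time (s) =', t1 - t0
>>                       end program omp_test
>>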
>>                     Oversubscribing your CPU _heavily_ hurts performance,
>>                     and yes, oversubscribing can make your program run worse
>>                     than when using exactly the number of available cores,
>>                     especially when using MPI.
>>                     By your argument you would get the same performance by
>>                     doing mpirun -np 9, no? Try that and you will see that it
>>                     will be slower and slower the more processes you throw at
>>                     it.
>>                     MPI is not sequential, and comparing the execution of a
>>                     parallel and a sequential program is, at best, erroneous.
>>
>>                     The reason it runs _perfectly_ for your wien2k
>>                     calculations (from what you say they are sequential
>>                     programs) is that the processes there do NO communication
>>                     with each other, meaning that each process can be
>>                     halted/resumed at any time without notifying anything but
>>                     the running process. With your wien2k np=5 the OS can
>>                     pause and resume processes as it pleases with
>>                     *relatively* little impact on the performance; there is
>>                     some, but not that much. This is because each process is
>>                     not dependent on the others, and the OS will try to
>>                     finish some before moving on.
>>
>>                     With MPI (siesta) this is _very_ wrong. Most MPI programs
>>                     are communication bound (i.e. not embarrassingly
>>                     parallelized using MPI).
>>                     The data is distributed and every process is dependent on
>>                     the others; no process can progress without informing the
>>                     other processes.
>>                     This means 1) every process does some work, 2) all
>>                     processes communicate with each other, 3) repeat from
>>                     step 1). Now do steps 1 to 3 a couple of million times
>>                     and the OS becomes flooded with stops/resumes
>>                     (basically; not in its entirety, but for brevity).
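>>                     To make that pattern concrete, here is a hypothetical
>>                     sketch (not SIESTA code) of such a work/communicate loop;
>>                     if any one process is descheduled by the OS, every other
>>                     process stalls in the collective call until it catches
>>                     up:
>>
>>                       program sync_pattern
>>                         use mpi
>>                         implicit none
>>                         integer :: ierr, rank, step
>>                         real(8) :: local, global
>>                         call MPI_Init(ierr)
>>                         call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>>                         do step = 1, 1000000
>>                            ! 1) every process does some work
>>                            local = real(rank + step, 8)
>>                            ! 2) all processes communicate (collective sync)
>>                            call MPI_Allreduce(local, global, 1, &
>>                                 MPI_DOUBLE_PRECISION, MPI_SUM, &
>>                                 MPI_COMM_WORLD, ierr)
>>                         end do   ! 3) repeat
>>                         if (rank == 0) print *, 'done, last sum =', global
>>                         call MPI_Finalize(ierr)
>>                       end program sync_pattern
>>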
>>                     Whenever you use MPI you should never use more processes
>>                     than you have processors available
>>                     (https://www.open-mpi.org/faq/?category=running#oversubscribing).
>>                     If you time your execution with timings of the MPI calls
>>                     you should most likely see immense increases in
>>                     communication times as the processes wait all the time;
>>                     test this if you want clearer proof!
>>                     Bottom line: never use more MPI processes than you have
>>                     physical processors.
>>                     If you still want more explanations, turn to the MPI
>>                     developers for more technical details; all I can say is,
>>                     never use more MPI processes than you have physical
>>                     cores.
>>
>>
>>                         Best regards,
>>
>>                         Roberto
>>
>>
>>                         On 11/02/2015 05:50 PM, Nick Papior wrote:
>>
>>
>>
>>                             2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                                 Thank you Nick and Salvador for your
>>                                 comments.
>>
>>                                 So Nick, basically you're saying that
>>                                 diagonalization time might be playing no
>>                                 role. That is at variance, for instance, with
>>                                 Wien2k, where diagonalization is the most
>>                                 time-consuming step. In fact, my expectation
>>                                 is correct for it; verified with a similar
>>                                 cell and 9 k-points.
>>
>>                             No, I am definitely not saying that! But I have
>>                             no idea about how your system is set up.
>>                             Diagonalization _is_ a big part of the
>>                             computation.
>>                             How have you specified the k-points? Is it 9
>>                             k-points, or 9 k-points in the Monkhorst-Pack
>>                             grid?
>>
>>
>>                                 In that case "top" shows a first stage of 5
>>                                 processes running at about 4/5 = 80% CPU
>>                                 power (and more or less stable) and a 2nd
>>                                 stage of 4 procs running at 100%. This is not
>>                                 MPI, but a parallel strategy based on scripts
>>                                 (hope you are aware).
>>
>>                             wien2k is not siesta.
>>                             If wien2k is script based, i.e. sequential runs
>>                             with self-managed processes, then sure, they
>>                             behave _very_ differently and wien2k should give
>>                             you the desired speedup. Your figures sound like
>>                             hyperthreading to me.
>>
>>                                 The same experiment performed with "mpirun
>>                                 -np 5 ..." and Siesta shows more jumpy
>>                                 figures for CPU usage. One task might be at
>>                                 100%, another at 60%, and so on, as if Linux
>>                                 were playing with tasks like a juggler.
>>
>>                             You are still implying usage of a quad-core
>>                             machine (quad == 4), and 4 < 5.
>>                             If you _only_ have 4 processors (intel
>>                             hyperthreads do _not_ count as processors), then
>>                             your assumption is not correct.
>>                             How would you expect a speedup by using 1 more
>>                             process than you have on your system?
>>                             If you see this juggling, it sounds like quad ==
>>                             4 and not 5.
>>
>>
>>                                 To give you some feeling, please look at the
>>                                 numbers here,
>>
>>                                 -----------------------------------------------------------------------
>>                                 * Running on    4 nodes in parallel
>>
>>                                  ... snipped ...
>>
>>                                 siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>                                 siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>                                 timer: Routine,Calls,Time,% = IterSCF        1    1637.906  99.72
>>                                 elaps: Routine,Calls,Wall,% = IterSCF        1     410.919  99.72
>>
>>
>>                                 * Running on    5 nodes in parallel
>>
>>                                  ... snipped ...
>>
>>                                 siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>                                 siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>                                 timer: Routine,Calls,Time,% = IterSCF        1    1654.558  99.64
>>                                 elaps: Routine,Calls,Wall,% = IterSCF        1     415.150  99.64
>>                                 ------------------------------------------------------------------------
>>
>>                                 Those elapsed times are so close ... there
>>                                 must be an easy explanation.
>>
>>                             Yes, if you are using mpirun -np 5 on a quad-core
>>                             machine, then the explanation is easy and your
>>                             numbers are irrelevant.
>>
>>
>>                                 Best,
>>
>>                                 Roberto
>>
>>
>>                                 On 11/02/2015 04:14 PM, Nick Papior wrote:
>>
>>                                     Basically:
>>                                     Diag.ParallelOverK false
>>                                       uses scalapack to diagonalize the
>>                                       Hamiltonian
>>                                     Diag.ParallelOverK true
>>                                       uses lapack to diagonalize the
>>                                       Hamiltonian
>>
>>                                     If you have a very large system, you will
>>                                     not get anything out of using the latter
>>                                     option (apart from using an enormous
>>                                     amount of memory). Only for an _extreme_
>>                                     number of k-points is the latter
>>                                     favourable; there are exceptions.
>>
>>                                     The latter is intended for small bulk
>>                                     calculations with many k-points.
>>
>>                                     Lastly, you have a quad-core machine and
>>                                     run mpirun -np 5, and expect that to run
>>                                     faster. That is a wrong assumption.
>>                                     Secondly, diagonalization is not
>>                                     everything in the program; check your
>>                                     TIMES file to figure out whether it _is_
>>                                     the diagonalization or a mixture.
>>
>>
>>                                     2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                                         Dear everyone,
>>
>>                                         I seem to have a misunderstanding on
>>                                         how the Diag.ParallelOverK feature
>>                                         works; any comment would be much
>>                                         appreciated.
>>
>>                                         I've got a large metallic cell,
>>                                         though still with 9 k-points, that
>>                                         runs on a quad PC; moreover, routine
>>                                         diagkp shows k-points are distributed
>>                                         round robin among processes. Thus I
>>                                         was expecting "mpirun -np 5 ..." to
>>                                         run significantly faster than "mpirun
>>                                         -np 4 ...", as judged from the
>>                                         elapsed time of individual scf steps.
>>                                         Clearly, in the latter case, the 9th
>>                                         k-point would be taken by process 0
>>                                         while the other three would remain
>>                                         waiting, right?
>>
>>                                         However, my expectations turned out
>>                                         to be wrong; in fact the 2nd
>>                                         alternative appears to be a tiny bit
>>                                         faster.
>>                                         Why?
>>
>>                                         Thanks in advance,
>>
>>                                         Roberto P.
>>
>>
>>
>>
>>                                     --
>>                                     Kind regards Nick
>>
>>
>>
>>
>>                             --
>>                             Kind regards Nick
>>
>>
>>
>>
>>                     --
>>                     Kind regards Nick
>>
>>
>>                 --
>>
>>                 |---------------------------------------------------------------------|
>>                 |   Dr. Roberto C. Pasianot         Phone: 54 11 4839 6709            |
>>                 |   Gcia. Materiales, CAC-CNEA      FAX  : 54 11 6772 7362            |
>>                 |   Avda. Gral. Paz 1499            Email: [email protected]       |
>>                 |   1650 San Martin, Buenos Aires                                     |
>>                 |   ARGENTINA                                                         |
>>                 |---------------------------------------------------------------------|
>>
>>
>>
>>
>>             --
>>             Kind regards Nick
>>
>>
>>
>>
>>     --
>>     Kind regards Nick
>>
>>
>>
>>
>> --
>> Kind regards Nick
>>
>
> --
>
> |---------------------------------------------------------------------|
>
> |   Dr. Roberto C. Pasianot         Phone: 54 11 4839 6709            |
> |   Gcia. Materiales, CAC-CNEA      FAX  : 54 11 6772 7362            |
> |   Avda. Gral. Paz 1499            Email: [email protected]       |
> |   1650 San Martin, Buenos Aires                                     |
> |   ARGENTINA                                                         |
> |---------------------------------------------------------------------|
>



-- 
Kind regards Nick
