Oh, one last thing: the timing can be dubious, as it is typically inferred from clock cycles in the CPU, hence I would advise you to time it yourself. For instance by doing:

date
...
date

It depends on the underlying timing function used.
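To make that concrete, a minimal sketch of such a wall-clock measurement; the input/output file names and the -np value below are only placeholders, adjust them to your actual run:

  # wall-clock stamps around the whole run; the difference is the real elapsed time
  date
  mpirun -np 4 siesta < input.fdf > output.out
  date

  # or let the shell do the bookkeeping in one go
  time mpirun -np 4 siesta < input.fdf > output.out

Either way you get an elapsed time that does not depend on how the internal timer/elaps routines count cycles.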

2015-11-03 14:03 GMT+01:00 Nick Papior <[email protected]>:

> That is crazy, and it is _amazing_!
>
> Nevertheless, I would still not recommend you do these kind of things.
>
> 2015-11-03 13:37 GMT+01:00 RCP <[email protected]>:
>
>> Good morning!,
>>
>> Please have a look at the outcome of the crazy "mpirun -np 9 ..."
>> exercise,
>>
>> ----------------------------------------------------------------------
>> * Running on 9 nodes in parallel
>>
>> ... snipped ...
>>
>> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>> siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>> timer: Routine,Calls,Time,% = IterSCF      1   1733.886  99.53
>> elaps: Routine,Calls,Wall,% = IterSCF      1    435.030  99.53
>> ----------------------------------------------------------------------
>>
>> Amazing: the elaps row is pretty close to the 410.0 (or so) of my
>> previous posts.
>>
>> However, yes, I seem to have a misunderstanding about the inner
>> workings of the code (well, in a sense, this was my first question)
>> because the timing info at the end of the output file says "diagon"
>> is taking only about 1/3 of the total time,
>>
>> ----------------------------------------------------------------------
>> elaps: ELAPSED times:
>> elaps:  Routine     Calls   Time/call    Tot.time        %
>> elaps:  siesta          1     712.814     712.814    100.00
>> ...
>> elaps:  diagon          1     225.603     225.603     31.65
>> elaps:  cdiag           2      47.763      95.527     13.40
>> elaps:  cdiag1          2       0.920       1.840      0.26
>> elaps:  cdiag2          2       3.335       6.670      0.94
>> elaps:  cdiag3          2      42.961      85.922     12.05
>> elaps:  cdiag4          2       0.512       1.025      0.14
>> elaps:  DHSCF4          1      71.874      71.874     10.08
>> elaps:  dfscf           1      70.398      70.398      9.88
>> elaps:  overfsm         1       0.269       0.269      0.04
>> elaps:  optical         1       0.000       0.000      0.00
>> ----------------------------------------------------------------------
>>
>> Take care,
>>
>> Roberto
>>
>> On 11/03/2015 08:28 AM, Nick Papior wrote:
>>
>>> 2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:
>>>
>>>   Hi,
>>>
>>>   Thanks for your time and sharing of wisdom.
>>>   In general terms I do agree with you Nick, in the sense that
>>>   running several sequential independent tasks (wien2k)
>>>   simultaneously is not equivalent to running a set of
>>>   inter-communicated, MPI, tasks.
>>>
>>>   However here we're talking about a peculiar situation,
>>>   namely, parallelization over k-points is, essentially, an
>>>   embarrassingly parallel problem, at least for my rather
>>>   large cell (97 atoms). The sequential gathering of
>>>   results from different k-points, building the new charge
>>>   density and so on, should take negligible time compared
>>>   to the time spent by a single task in diagonalizing a large
>>>   matrix.
>>>
>>> ! ! NO ! ! ;)
>>> Parallelization across k-points in siesta is NOT the same as an
>>> embarrassingly parallel problem across k-points.
>>> The _only_ thing in siesta that is parallelized embarrassingly is the
>>> diagonalisation part (after having communicated all Hamiltonian
>>> elements to all other nodes). Everything else is MPI parallelized:
>>> grid operations, construction of the Hamiltonian, etc. etc.!
>>> Yes, even though the diagonalization is embarrassingly parallel and it
>>> _should_ take the longest time, your assumption that the
>>> diagonalization part is still the most time consuming becomes wrong.
>>> Furthermore, 96 atoms ~ 1000 orbitals, not that big a matrix to
>>> diagonalize.
>>>
>>> Please look at the timing output for clarity of this.
>>>
>>>   Of course, oversubscribing the CPUs must hurt performance
>>>   at some point, and this is most likely worse for MPI tasks than
>>>   for truly independent ones. But to me 5 MPI tasks competing for
>>>   4 cores does not look a scenario that terrible.
>>>
>>> MPI is not sequential programming and any assumption on oversubscribing
>>> you have is wrong, put simply. The 5 MPI tasks do not compete for 4
>>> cores; siesta MPI tasks are linearly dependent on each other (as
>>> written in the last mail) and hence they have to keep up all the time.
>>> If the MPI program was fully embarrassingly parallelized, then yes, you
>>> could, perhaps, have a point, but siesta is not such a code.
>>> How you can keep saying that oversubscribing cannot be that damaging
>>> for performance (in fact improve it) is really baffling to me :)
>>>
>>>   Moreover, np=5 and np=4 resulted in almost the same elapsed
>>>   time. It is hard to believe that my expected time win for
>>>   np=5 was (almost) exactly compensated by performance loss.
>>>
>>> Try doubling your system size and do the same calculation.
>>>
>>>   Nice discussion guys. I'll do a little more research and let you
>>>   know if something worth comes out.
>>>
>>>   Take care,
>>>
>>>   Roberto
>>>
>>>   On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:
>>>
>>>     Could not be clearer Nick.
>>>
>>>     Ricardo, if you type top on your machine, you'll see two SIESTA
>>>     processes competing for one core's time, and performing at 50%
>>>     at most.
>>>
>>>     Other cores will wait for these processes when an operation among
>>>     all cores is necessary in the algorithm (i.e., a sum or a
>>>     distributed matrix product)... thus these other cores will just
>>>     have to wait for the tasks competing for the same core's time to
>>>     end, thus degrading performance.
>>>
>>>     -Salvador
>>>
>>>     ------------------------------------------------------------------
>>>     From: [email protected] <[email protected]> on behalf of
>>>     Nick Papior <[email protected]>
>>>     Sent: Monday, November 2, 2015 4:08 PM
>>>     To: [email protected]
>>>     Subject: Re: [SIESTA-L] Puzzled about ParallelOverK feature
>>>
>>>       2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
>>>
>>>         Hi Nick,
>>>
>>>         Please take my word: I'm not a computer guru but started
>>>         using computers before the PC era :-).
>>>         I know hyperthreading is evil for scientific calculations;
>>>         they're even disabled in BIOS. It is not that.
>>>
>>>         Why I'm saying np=5 should take less time than np=4, even if
>>>         my PC is a quad, is as follows.
>>>
>>>       This is a wrong statement!
>>>       By this argument everything that can be embarrassingly
>>>       parallelized will take less or equal time when using the number
>>>       of sequential divisions.
>>>
>>>         Distribution of k-points is round robin, and assume k-points
>>>         (the, trimmed, real ones, not M&K grid) take about the same
>>>         time to process.
>>>         Thus for np=4 I need 3 "time steps" to get the job done,
>>>         namely (4 + 4 + 1) when seen from the k-points perspective.
>>>         On the other hand, for np=5 the time taken would be something
>>>         like 2 * 1/0.80 = 2.5, or even shorter, 1/0.80 + 1 = 2.25.
>>>         ¿What is flawed with this argument?
>>>
>>>       Your flaw lies in using more cores than available; this has
>>>       nothing to do with the number of k-points, and your figures are
>>>       based on a sequential program governed by the OS, not a parallel
>>>       program (from what I've gathered).
>>>       You should try running a simple openmp program with
>>>       OMP_NUM_THREADS=4 and 5 and see if that also degrades
>>>       performance.
>>>
>>>       Oversubscribing your CPU is _heavily_ hurting performance and,
>>>       yes, oversubscribing can make your program run worse than the
>>>       number of cores, especially when using MPI.
>>>       By your argument you would get the same performance by doing
>>>       mpirun -np 9, no? Try that and you will see that it will be
>>>       slower and slower the more processors you throw at it.
>>>       MPI is not sequential, and comparing the execution of a parallel
>>>       and a sequential program is, at best, erroneous.
>>>
>>>       The reason it runs _perfect_ for your wien2k calculations (from
>>>       what you say they are sequential programs) is that the processors
>>>       there make NO communication with each other, meaning that each
>>>       process can be halted/resumed at any time without notifying
>>>       anything but the running process. With your wien2k np=5 the OS
>>>       can pause and resume processes as it pleases with *relatively*
>>>       little impact on the performance; there is some, but not that
>>>       much. This is because each process is not dependent on the others
>>>       and it will try and finish some before moving on.
>>>
>>>       With MPI (siesta) this is _very_ wrong. Most MPI programs are
>>>       communication bounded (i.e. not embarrassingly parallelized using
>>>       MPI). The data is distributed and every process is dependent on
>>>       each other; no process can progress without informing the other
>>>       processors.
>>>       This means 1) every processor does some work, 2) all processors
>>>       communicate with each other, 3) repeat from step 1). Now do steps
>>>       1 to 3 a couple of million times and the OS becomes flooded with
>>>       stop/resumes (basically, not in its entirety, but for brevity).
>>>       Whenever you use MPI you should never use more processors than
>>>       you have available.
>>>       (https://www.open-mpi.org/faq/?category=running#oversubscribing)
>>>       If you time your execution with timings of the MPI calls you
>>>       should most likely see immense increases in communication times
>>>       as the processes wait all the time; test this if you want more
>>>       clear proof!
>>>
>>>       Bottom line: never use more MPI processors than you have physical
>>>       processors.
>>>       If you still want more explanations, turn to MPI developers for
>>>       more technical details; all I can say is, never use more MPI
>>>       processors than you have physical cores.
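Concretely, the over-subscription test I suggest above takes two minutes to try from the shell. A rough sketch, where ./omp_test stands for any OpenMP-enabled (or otherwise threaded) binary you happen to have at hand:

  # first with as many threads as there are physical cores ...
  export OMP_NUM_THREADS=4
  time ./omp_test

  # ... then oversubscribed by one thread on the same quad-core box
  export OMP_NUM_THREADS=5
  time ./omp_test

The second run should be no faster, and for a communication-bound MPI code such as siesta the corresponding mpirun -np 4 versus -np 5 comparison is expected to be at least as unfavourable, for the reasons given above.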
>>> >>>        So Nick, basically you're saying that >>> diagonalization time >>>     might >>>        be playing no role. That is at variance, for >>> instance, with >>>     Wien2k, >>>        where diagonalization is the most time >>> consuming step. In fact, >>>        my expectation is correct for it; veryfied >>> with a similar cell >>>        and 9 k-points. >>> >>>     No, I am definitely not saying that! But I have no >>> idea about >>>     how your >>>     system is setup. >>>     Diagonalization _is_ a big part of the computation. >>>     How have you specified the k-points? Is it 9 kpoints >>> or 9 >>>     kpoints in the >>>     monkhorst pack grid? >>> >>> >>>        In that case "top" shows a first stage of 5 >>> processes >>>     running at >>>        about 4/5=80% CPU power (and more or less >>> stable) and a 2nd >>>     stage of >>>        4 procs, running at 100%. This is not MPI, >>> but a parallel >>>     strategy >>>        based on scripts (hope you are aware). >>> >>>     wien2k is not siesta. >>>     If wien2k is script based, i.e. sequential running >>> and >>>     self-managing the >>>     processes, then sure they behave _very_ differently >>> and wien2k >>>     should >>>     give you the desired speedup. Your figures sounds >>> like >>>     hyperthreading to me. >>> >>>        The same experiment performed with "mpirun >>> -np 5 ..." and >>>     Siesta, >>>        shows more jumpy figures for CPU usage. One >>> task might be >>>     at 100%, >>>        another at 60%, and so on, as if Linux >>> were playing with >>>     tasks >>>        like a juggler. >>> >>>     You are still implying usage of a quad core machine >>> (quad == 4) >>>     and 4<5. >>>     If you _only_ have 4 processors (intel hyperthreads >>> do _not_ >>>     count as a >>>     processes) then your assumption is not correct. >>>     How would you expect a speedup by using 1 more >>> process than you >>>     have on >>>     your system? >>>     If you see this juggling it sounds like quad == 4 >>> and not 5. >>> >>> >>>        To give you some feeling, please look at the >>> numbers here, >>> >>> >>>     >>> >>> ----------------------------------------------------------------------- >>>        * Running on  4 nodes in parallel >>> >>>        Â ... snipped ... >>> >>>        siesta: iscf  Eharris(eV)  >>>  E_KS(eV)  FreeEng(eV) >>>        Â dDmax Ef(eV) >>>        siesta:  1 -124261.2908 >>> -124261.2891 >>>     -124261.2891 0.0001 >>>        -2.5494 >>>        timer: Routine,Calls,Time,% = IterSCF >>>    1  >>>     1637.906 99.72 >>>        elaps: Routine,Calls,Wall,% = IterSCF >>>    1   >>>     410.919 >>>        99.72 <tel:410.919%20%2099.72> >>> >>> >>>        * Running on  5 nodes in parallel >>> >>>        Â ... snipped ... >>> >>>        siesta: iscf  Eharris(eV)  >>>  E_KS(eV)  FreeEng(eV) >>>        Â dDmax Ef(eV) >>>        siesta:  1 -124261.2908 >>> -124261.2891 >>>     -124261.2891 0.0001 >>>        -2.5494 >>>        timer: Routine,Calls,Time,% = IterSCF >>>    1  >>>     1654.558 99.64 >>>        elaps: Routine,Calls,Wall,% = IterSCF >>>    1   >>>     415.150 >>>        99.64 >>> >>>     >>> >>> ------------------------------------------------------------------------ >>> >>>        Those elapsed times are so close ... there >>> must be an easy >>>     explanation. >>> >>>     Yes, if you are using mpirun -np 5 on a quad core >>> machine, then the >>>     explanation is easy and your numbers are >>> irrelevant. 
>>>
>>>             Best,
>>>
>>>             Roberto
>>>
>>>             On 11/02/2015 04:14 PM, Nick Papior wrote:
>>>
>>>               Basically:
>>>               Diag.ParallelOverK false
>>>                 uses scalapack to diagonalize the Hamiltonian
>>>               Diag.ParallelOverK true
>>>                 uses lapack to diagonalize the Hamiltonian
>>>
>>>               If you have a very large system, you will not get
>>>               anything out of using the latter option (other than
>>>               using an enormous amount of memory). Only for an
>>>               _extreme_ number of k-points is the latter favourable;
>>>               there are exceptions.
>>>
>>>               The latter is intended for small bulk calculations with
>>>               many k-points.
>>>
>>>               Lastly, you have a quad core machine and run
>>>               mpirun -np 5, and expect that to run faster. That is a
>>>               wrong assumption.
>>>               Secondly, diagonalization is not everything in the
>>>               program; check your TIMES file to figure out whether it
>>>               _is_ the diagonalization or a mixture.
>>>
>>>               2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>>
>>>                 Dear everyone,
>>>
>>>                 I seem to have a misunderstanding on how the
>>>                 Diag.ParallelOverK feature works; any comment would
>>>                 be much appreciated.
>>>
>>>                 I've got a large metallic cell, though still with 9
>>>                 k-points, that runs on a quad PC; moreover, routine
>>>                 diagkp shows k-points are distributed round robin
>>>                 among processes. Thus I was expecting
>>>                 "mpirun -np 5 ..." to run significantly faster than
>>>                 "mpirun -np 4 ...", as judged from the elapsed time
>>>                 of individual scf steps. Clearly, in the latter case,
>>>                 the 9th k-point would be taken by process 0 while the
>>>                 other three would remain waiting, right?
>>>
>>>                 However, my expectations turned out to be wrong; in
>>>                 fact the 2nd alternative appears to be a tiny bit
>>>                 faster.
>>>                 Why?
>>>
>>>                 Thanks in advance,
>>>
>>>                 Roberto P.
>>>
>>>               --
>>>               Kind regards Nick
>>>
>>>           --
>>>           Kind regards Nick
>>>
>>>       --
>>>       Kind regards Nick
>>>
>>>   --
>>>   |--------------------------------------------------------------|
>>>   |  Dr. Roberto C. Pasianot       Phone: 54 11 4839 6709         |
>>>   |  Gcia. Materiales, CAC-CNEA    FAX  : 54 11 6772 7362         |
>>>   |  Avda. Gral. Paz 1499          Email: [email protected]        |
>>>   |  1650 San Martin, Buenos Aires                                 |
>>>   |  ARGENTINA                                                     |
>>>   |--------------------------------------------------------------|
>>>
>>> --
>>> Kind regards Nick
>>
>
> --
> Kind regards Nick
>

--
Kind regards Nick
