Yes and no, it depends on the underlying system_clock, and that is system
dependent.
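
As a minimal sketch (not the actual SIESTA code) of what such wall-clock
timing looks like with the Fortran system_clock intrinsic; the count rate
and wrap-around point are implementation dependent, which is exactly why
the result is system dependent:

  program time_sketch
    implicit none
    integer :: c0, c1, rate, cmax
    real :: elapsed
    ! Resolution (rate) and wrap-around (cmax) are system dependent.
    call system_clock(count_rate=rate, count_max=cmax)
    call system_clock(c0)
    ! ... work to be timed ...
    call system_clock(c1)
    if (c1 < c0) c1 = c1 + cmax   ! counter wrapped around
    elapsed = real(c1 - c0) / real(rate)
    print *, 'Elapsed wall time (s):', elapsed
  end program time_sketch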

So yes, time outside siesta for "correct" timing of full application.

2015-11-03 14:49 GMT+01:00 RCP <[email protected]>:

> Yeah! I also suspected the clock. As far as I could follow the
> code, wall-clock time derives from the F95 SYSTEM_CLOCK intrinsic,
> which is handled by the master node. Ruling out programming
> mistakes, would that still be dubious?
> Well, at least I could time siesta from the shell, so there is no
> tinkering with the sources.
>
> On 11/03/2015 10:12 AM, Nick Papior wrote:
>
>> Oh, a last thing, timing can be dubious as that is typically inferred
>> from clock-cycles in the cpu, hence I would advise you to time it
>> yourself.
>> For instance by doing:
>>
>> date
>> ...
>> date
>>
>> It depends on the underlying timing function used.
>>
>>
>> 2015-11-03 14:03 GMT+01:00 Nick Papior <[email protected]>:
>>
>>     That is crazy, and it is _amazing_!
>>
>>     Nevertheless, I would still not recommend you do these kinds of things.
>>
>>
>>     2015-11-03 13:37 GMT+01:00 RCP <[email protected]>:
>>
>>         Good morning!
>>
>>         Please have a look at the outcome of the crazy "mpirun -np 9 ..."
>>         exercise,
>>
>>
>>         ----------------------------------------------------------------------
>>         * Running on    9 nodes in parallel
>>
>>          ... snipped ...
>>
>>         siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>         siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>         timer: Routine,Calls,Time,% = IterSCF        1    1733.886  99.53
>>         elaps: Routine,Calls,Wall,% = IterSCF        1     435.030  99.53
>>         -----------------------------------------------------------------------
>>
>>         Amazing: the elaps row is pretty close to the 410.0 (or so) of my
>>         previous posts.
>>
>>         However, yes, I seem to have a misunderstanding about the inner
>>         workings of the code (well, in a sense, this was my first question)
>>         because the timing info at the end of the output file says "diagon"
>>         is taking only about 1/3 of the total time,
>>
>>         ------------------------------------------------------------------
>>         elaps: ELAPSED times:
>>         elaps:  Routine       Calls   Time/call    Tot.time        %
>>         elaps:  siesta            1     712.814     712.814   100.00
>>         ...
>>         elaps:  diagon            1     225.603     225.603    31.65
>>         elaps:  cdiag             2      47.763      95.527    13.40
>>         elaps:  cdiag1            2       0.920       1.840     0.26
>>         elaps:  cdiag2            2       3.335       6.670     0.94
>>         elaps:  cdiag3            2      42.961      85.922    12.05
>>         elaps:  cdiag4            2       0.512       1.025     0.14
>>         elaps:  DHSCF4            1      71.874      71.874    10.08
>>         elaps:  dfscf             1      70.398      70.398     9.88
>>         elaps:  overfsm           1       0.269       0.269     0.04
>>         elaps:  optical           1       0.000       0.000     0.00
>>         -------------------------------------------------------------------
>>
>>         Take care,
>>
>>         Roberto
>>
>>         On 11/03/2015 08:28 AM, Nick Papior wrote:
>>
>>
>>
>>             2015-11-03 12:10 GMT+01:00 RCP <[email protected]>:
>>
>>                 Hi,
>>
>>                 Thanks for your time and sharing of wisdom.
>>                 In general terms I do agree with you Nick, in the sense that
>>                 running several sequential independent tasks (wien2k)
>>                 simultaneously is not equivalent to running a set of
>>                 inter-communicating MPI tasks.
>>
>>                 However here we're talking about a peculiar situation,
>>                 namely, parallelization over k-points is, essentially, an
>>                 embarrassingly parallel problem, at least for my rather
>>                 large cell (97 atoms). The sequential gathering of
>>                 results from different k-points, building the new charge
>>                 density and so on, should take negligible time compared
>>                 to the time spent by a single task in diagonalizing a large
>>                 matrix.
>>
>>             ! ! NO ! ! ;)
>>             Parallelization across k-points in siesta is NOT the same as an
>>             embarrassingly parallel problem across k-points.
>>             The _only_ thing in siesta that is parallelized embarrassingly is
>>             the diagonalization part (after having communicated all
>>             Hamiltonian elements to all other nodes). Everything else is MPI
>>             parallelized: grid operations, construction of the Hamiltonian,
>>             etc. etc.!
>>             Yes, even though the diagonalization is embarrassingly parallel
>>             and it _should_ take the longest time, your assumption that the
>>             diagonalization part is still the most time-consuming step turns
>>             out to be wrong.
>>             Furthermore, 96 atoms ~ 1000 orbitals, which is not that big a
>>             matrix to diagonalize.
>>
>>             Please look at the timing output for clarification of this.
>>
>>
>>                 Of course, oversubscribing the CPUs must hurt performance
>>                 at some point, and this is most likely worse for MPI tasks
>>                 than for truly independent ones. But to me 5 MPI tasks
>>                 competing for 4 cores does not look like such a terrible
>>                 scenario.
>>
>>             MPI is not sequential programming, and any assumption you have
>>             about oversubscribing is, put simply, wrong. The 5 MPI tasks do
>>             not just compete for 4 cores: siesta MPI tasks are linearly
>>             dependent on each other (as written in the last mail) and hence
>>             they have to keep up all the time. If the MPI program were fully
>>             embarrassingly parallelized, then yes, you could, perhaps, have a
>>             point, but siesta is not such a code.
>>             How you can keep saying that oversubscribing cannot be that
>>             damaging for performance (and in fact improve it) is really
>>             baffling to me :)
>>
>>                 Moreover, np=5 and np=4 resulted in almost the same elapsed
>>                 time. It is hard to believe that my expected time win for
>>                 np=5 was (almost) exactly compensated by performance loss.
>>
>>             Try doubling your system size and do the same calculation.
>>
>>
>>                 Nice discussion guys. I'll do a little more research and let
>>                 you know if something worthwhile comes out.
>>
>>                 Take care,
>>
>>                 Roberto
>>
>>                 On 11/02/2015 07:21 PM, Salvador Barraza-Lopez wrote:
>>
>>                     Could not be clearer Nick.
>>
>>                     Ricardo, if you type top on your machine, you'll see two
>>                     SIESTA processes competing for one core's time, and
>>                     performing at 50% at most.
>>
>>                     Other cores will wait for these processes when an
>>                     operation among all cores is necessary in the algorithm
>>                     (i.e., a sum or a distributed matrix product)... thus
>>                     these other cores will just have to wait for the tasks of
>>                     the processes competing for the same core to end, thus
>>                     degrading performance.
>>
>>                     -Salvador
>>
>>
>>
>>                     ------------------------------------------------------------------------
>>                     *From:* [email protected] <[email protected]>
>>                     on behalf of Nick Papior <[email protected]>
>>                     *Sent:* Monday, November 2, 2015 4:08 PM
>>                     *To:* [email protected]
>>                     *Subject:* Re: [SIESTA-L] Puzzled about ParallelOverK feature
>>
>>
>>                     2015-11-02 22:37 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                         Hi Nick,
>>
>>                         Please take my word: I'm not a computer guru, but I
>>                         started using computers before the PC era :-).
>>                         I know hyperthreading is evil for scientific
>>                         calculations; it is even disabled in the BIOS. It is
>>                         not that.
>>
>>                         Why I'm saying np=5 should take less time than np=4,
>>                         even if my PC is a quad, is as follows.
>>
>>                     This is a wrong statement!
>>                     By this argument, everything that can be embarrassingly
>>                     parallelized would take less or equal time when using as
>>                     many processes as there are sequential divisions,
>>                     regardless of the number of cores.
>>
>>                         Distribution of k-points is round robin, and assume
>>                         k-points (the trimmed, real ones, not the M&K grid)
>>                         take about the same time to process.
>>                         Thus for np=4 I need 3 "time steps" to get the job
>>                         done, namely (4 + 4 + 1) when seen from the k-points
>>                         perspective.
>>                         On the other hand, for np=5 the time taken would be
>>                         something like 2 * 1/0.80 = 2.5, or even shorter,
>>                         1/0.80 + 1 = 2.25.
>>                         What is flawed with this argument?
>>
>>
>>                     Your flaw lies in using more cores than available; this
>>                     has nothing to do with the number of k-points, and your
>>                     figures are based on a sequential program governed by the
>>                     OS, not a parallel program (from what I've gathered).
>>                     You should try running a simple OpenMP program with
>>                     OMP_NUM_THREADS=4 and 5 and see if that also degrades
>>                     performance; see the sketch below.
>>
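>>                     As a minimal toy sketch (hypothetical test program, not
>>                     part of SIESTA) of such an OpenMP test: compile with,
>>                     e.g., gfortran -fopenmp, run once with OMP_NUM_THREADS=4
>>                     and once with OMP_NUM_THREADS=5 on a quad-core machine,
>>                     and compare the reported times.
>>
>>                       program omp_test
>>                         use omp_lib
>>                         implicit none
>>                         integer, parameter :: n = 50000000
>>                         integer :: i
>>                         real(8) :: s, t0, t1
>>                         s = 0.0d0
>>                         t0 = omp_get_wtime()
>>                         ! Simple compute-bound loop shared among the threads.
>>                         !$omp parallel do reduction(+:s)
>>                         do i = 1, n
>>                            s = s + sin(real(i,8))
>>                         end do
>>                         !$omp end parallel do
>>                         t1 = omp_get_wtime()
>>                         print *, 'threads =', omp_get_max_threads(), &
>>                                  ' wall time (s) =', t1 - t0
>>                       end program omp_test
>>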
>>                     Oversubscribing your CPU _heavily_ hurts performance,
>>                     and yes, oversubscribing can make your program run worse
>>                     than when using exactly the number of available cores,
>>                     especially when using MPI.
>>                     By your argument you would get the same performance by
>>                     doing mpirun -np 9, no? Try that and you will see that it
>>                     will be slower and slower the more processes you throw at
>>                     it.
>>                     MPI is not sequential, and comparing the execution of a
>>                     parallel and a sequential program is, at best, erroneous.
>>
>>                     The reason it runs _perfectly_ for your wien2k
>>                     calculations (from what you say they are sequential
>>                     programs) is that the processes there do NO communication
>>                     with each other, meaning that each process can be
>>                     halted/resumed at any time without notifying anything but
>>                     the running process. With your wien2k np=5 the OS can
>>                     pause and resume processes as it pleases with
>>                     *relatively* little impact on the performance; there is
>>                     some, but not that much. This is because each process is
>>                     not dependent on the others, and the OS will try to
>>                     finish some before moving on.
>>
>>                     With MPI (siesta) this is _very_ wrong. Most MPI programs
>>                     are communication bound (i.e. not embarrassingly
>>                     parallelized using MPI).
>>                     The data is distributed and every process is dependent on
>>                     the others; no process can progress without informing the
>>                     other processes.
>>                     This means 1) every process does some work, 2) all
>>                     processes communicate with each other, 3) repeat from
>>                     step 1). Now do steps 1 to 3 a couple of million times
>>                     and the OS becomes flooded with stops/resumes
>>                     (basically; not in its entirety, but for brevity).
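>>                     To make that pattern concrete, here is a hypothetical
>>                     sketch (not SIESTA code) of such a work/communicate loop;
>>                     if any one process is descheduled by the OS, every other
>>                     process stalls in the collective call until it catches
>>                     up:
>>
>>                       program sync_pattern
>>                         use mpi
>>                         implicit none
>>                         integer :: ierr, rank, step
>>                         real(8) :: local, global
>>                         call MPI_Init(ierr)
>>                         call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>>                         do step = 1, 1000000
>>                            ! 1) every process does some work
>>                            local = real(rank + step, 8)
>>                            ! 2) all processes communicate (collective sync)
>>                            call MPI_Allreduce(local, global, 1, &
>>                                 MPI_DOUBLE_PRECISION, MPI_SUM, &
>>                                 MPI_COMM_WORLD, ierr)
>>                         end do   ! 3) repeat
>>                         if (rank == 0) print *, 'done, last sum =', global
>>                         call MPI_Finalize(ierr)
>>                       end program sync_pattern
>>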
>>                     Whenever you use MPI you should never use more processes
>>                     than you have processors available
>>                     (https://www.open-mpi.org/faq/?category=running#oversubscribing).
>>                     If you time your execution with timings of the MPI calls
>>                     you should most likely see immense increases in
>>                     communication times as the processes wait all the time;
>>                     test this if you want clearer proof!
>>                     Bottom line: never use more MPI processes than you have
>>                     physical processors.
>>                     If you still want more explanations, turn to the MPI
>>                     developers for more technical details; all I can say is,
>>                     never use more MPI processes than you have physical
>>                     cores.
>>
>>
>>                         Best regards,
>>
>>                         Roberto
>>
>>
>>                         On 11/02/2015 05:50 PM, Nick Papior wrote:
>>
>>
>>
>>                             2015-11-02 21:37 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                                 Thank you Nick and Salvador for your
>>                                 comments.
>>
>>                                 So Nick, basically you're saying that
>>                                 diagonalization time might be playing no
>>                                 role. That is at variance, for instance, with
>>                                 Wien2k, where diagonalization is the most
>>                                 time-consuming step. In fact, my expectation
>>                                 is correct for it; verified with a similar
>>                                 cell and 9 k-points.
>>
>>                             No, I am definitely not saying that! But I have
>>                             no idea about how your system is set up.
>>                             Diagonalization _is_ a big part of the
>>                             computation.
>>                             How have you specified the k-points? Is it 9
>>                             k-points, or 9 k-points in the Monkhorst-Pack
>>                             grid?
>>
>>
>>                                 In that case "top" shows a first stage of 5
>>                                 processes running at about 4/5 = 80% CPU
>>                                 power (and more or less stable) and a 2nd
>>                                 stage of 4 procs running at 100%. This is not
>>                                 MPI, but a parallel strategy based on scripts
>>                                 (hope you are aware).
>>
>>                             wien2k is not siesta.
>>                             If wien2k is script based, i.e. sequential runs
>>                             with self-managed processes, then sure, they
>>                             behave _very_ differently and wien2k should give
>>                             you the desired speedup. Your figures sound like
>>                             hyperthreading to me.
>>
>>                                 The same experiment performed with "mpirun
>>                                 -np 5 ..." and Siesta shows more jumpy
>>                                 figures for CPU usage. One task might be at
>>                                 100%, another at 60%, and so on, as if Linux
>>                                 were playing with tasks like a juggler.
>>
>>                             You are still implying usage of a quad-core
>>                             machine (quad == 4), and 4 < 5.
>>                             If you _only_ have 4 processors (intel
>>                             hyperthreads do _not_ count as processors), then
>>                             your assumption is not correct.
>>                             How would you expect a speedup by using 1 more
>>                             process than you have on your system?
>>                             If you see this juggling, it sounds like quad ==
>>                             4 and not 5.
>>
>>
>>                                 To give you some feeling, please look at the
>>                                 numbers here,
>>
>>                                 -----------------------------------------------------------------------
>>                                 * Running on    4 nodes in parallel
>>
>>                                  ... snipped ...
>>
>>                                 siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>                                 siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>                                 timer: Routine,Calls,Time,% = IterSCF        1    1637.906  99.72
>>                                 elaps: Routine,Calls,Wall,% = IterSCF        1     410.919  99.72
>>
>>
>>                                 * Running on    5 nodes in parallel
>>
>>                                  ... snipped ...
>>
>>                                 siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
>>                                 siesta:    1  -124261.2908  -124261.2891  -124261.2891  0.0001 -2.5494
>>                                 timer: Routine,Calls,Time,% = IterSCF        1    1654.558  99.64
>>                                 elaps: Routine,Calls,Wall,% = IterSCF        1     415.150  99.64
>>                                 ------------------------------------------------------------------------
>>
>>                                 Those elapsed times are so close ... there
>>                                 must be an easy explanation.
>>
>>                             Yes, if you are using mpirun -np 5 on a quad-core
>>                             machine, then the explanation is easy and your
>>                             numbers are irrelevant.
>>
>>
>>                                 Best,
>>
>>                                 Roberto
>>
>>
>>                                 On 11/02/2015 04:14 PM, Nick Papior wrote:
>>
>>                                     Basically:
>>                                     Diag.ParallelOverK false
>>                                       uses scalapack to diagonalize the
>>                                       Hamiltonian
>>                                     Diag.ParallelOverK true
>>                                       uses lapack to diagonalize the
>>                                       Hamiltonian
>>
>>                                     If you have a very large system, you will
>>                                     not get anything out of using the latter
>>                                     option (apart from using an enormous
>>                                     amount of memory). Only for an _extreme_
>>                                     number of k-points is the latter
>>                                     favourable; there are exceptions.
>>
>>                                     The latter is intended for small bulk
>>                                     calculations with many k-points.
>>
>>                                     Lastly, you have a quad-core machine and
>>                                     run mpirun -np 5, and expect that to run
>>                                     faster. That is a wrong assumption.
>>                                     Secondly, diagonalization is not
>>                                     everything in the program; check your
>>                                     TIMES file to figure out whether it _is_
>>                                     the diagonalization or a mixture.
>>
>>
>>                                     2015-11-02 19:42 GMT+01:00 RCP <[email protected]>:
>>
>>
>>                                         Dear everyone,
>>
>>                                         I seem to have a misunderstanding on
>>                                         how the Diag.ParallelOverK feature
>>                                         works; any comment would be much
>>                                         appreciated.
>>
>>                                         I've got a large metallic cell,
>>                                         though still with 9 k-points, that
>>                                         runs on a quad PC; moreover, routine
>>                                         diagkp shows k-points are distributed
>>                                         round robin among processes. Thus I
>>                                         was expecting "mpirun -np 5 ..." to
>>                                         run significantly faster than "mpirun
>>                                         -np 4 ...", as judged from the
>>                                         elapsed time of individual scf steps.
>>                                         Clearly, in the latter case, the 9th
>>                                         k-point would be taken by process 0
>>                                         while the other three would remain
>>                                         waiting, right?
>>
>>                                         However, my expectations turned out
>>                                         to be wrong; in fact the 2nd
>>                                         alternative appears to be a tiny bit
>>                                         faster.
>>                                         Why?
>>
>>                                         Thanks in advance,
>>
>>                                         Roberto P.
>>
>>
>>
>>
>>                                     --
>>                                     Kind regards Nick
>>
>>
>>
>>
>>                             --
>>                             Kind regards Nick
>>
>>
>>
>>
>>                     --
>>                     Kind regards Nick
>>
>>
>>                 --
>>
>>                 |---------------------------------------------------------------------|
>>                 |   Dr. Roberto C. Pasianot         Phone: 54 11 4839 6709            |
>>                 |   Gcia. Materiales, CAC-CNEA      FAX  : 54 11 6772 7362            |
>>                 |   Avda. Gral. Paz 1499            Email: [email protected]       |
>>                 |   1650 San Martin, Buenos Aires                                     |
>>                 |   ARGENTINA                                                         |
>>                 |---------------------------------------------------------------------|
>>
>>
>>
>>
>>             --
>>             Kind regards Nick
>>
>>
>>
>>
>>     --
>>     Kind regards Nick
>>
>>
>>
>>
>> --
>> Kind regards Nick
>>
>
> --
>
> |---------------------------------------------------------------------|
>
> |   Dr. Roberto C. Pasianot         Phone: 54 11 4839 6709            |
> |   Gcia. Materiales, CAC-CNEA      FAX  : 54 11 6772 7362            |
> |   Avda. Gral. Paz 1499            Email: [email protected]       |
> |   1650 San Martin, Buenos Aires                                     |
> |   ARGENTINA                                                         |
> |---------------------------------------------------------------------|
>



-- 
Kind regards Nick
