Re: [gmx-users] Different optimal pme grid ... coulomb cutoff values from identical input files

2014-02-06 Thread Szilárd Páll
Note that your above (CPU) runs had a far from optimal PP-PME balance
(pme mesh/force should be close to one).
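For reference, the ratio Szilárd mentions is printed in the DD load-report lines quoted later in this thread. A throwaway Python sketch (helper name and regex are mine, not GROMACS code) for pulling it out of a log line:

```python
import re

# The DD load report in a GROMACS log looks like (as quoted in this thread):
#   DD  step 24999  vol min/aver 0.834  load imb.: force  2.3%  pme mesh/force 0.687
# A mesh/force ratio well below 1 means the PME (mesh) ranks sit idle while the
# PP (force) ranks work, i.e. a poor PP-PME balance.

def pme_mesh_force_ratio(line):
    """Return the pme mesh/force ratio from a DD load line, or None."""
    m = re.search(r"pme mesh/force\s+([0-9.]+)", line)
    return float(m.group(1)) if m else None

line = ("DD  step 24999  vol min/aver 0.834  load imb.: force  2.3%  "
        "pme mesh/force 0.687")
ratio = pme_mesh_force_ratio(line)
print(ratio)                    # 0.687
print(abs(ratio - 1.0) > 0.2)   # True: far from the ideal value of ~1
```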

Performance instability can be caused by a busy network (how many
nodes are you running on?) or even incorrect affinity settings.
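On the affinity point: mdrun 4.6 can pin its threads itself, and the log warns when pinning is not in effect. Forcing it explicitly is worth a try (a sketch using the standard 4.6 mdrun options; your actual launch line will differ):

```
mdrun -pin on ...                # pin mdrun's threads to cores
mdrun -pin on -pinoffset 6 ...   # offset the pinning, e.g. for a second run sharing the node
```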

If you post a/some log files, we may be able to tell more.

Cheers,
--
Szilárd


On Thu, Feb 6, 2014 at 8:35 AM, yunshi11 . yunsh...@gmail.com wrote:
 On Wed, Feb 5, 2014 at 9:43 AM, Mark Abraham mark.j.abra...@gmail.comwrote:

 What's the network? If it's some kind of switched Infiniband shared with
 other users' jobs, then getting hit by the traffic does happen. You can see



 It indeed uses an InfiniBand 4X QDR (Quad Data Rate) 40 Gbit/s switched
 fabric, with a two-to-one blocking factor.

 And I tried running this again with the GPU version, which showed the same
 issue: every single run gets a different coulomb cutoff after automatic
 optimization.

 Since I am unlikely to get my own corner on a nation-wide supercomputer,
 are there any parameters that could prevent this from happening?
 Turning off load balancing sounds crazy.




 that the individual timings of the things the load balancer tries differ a
 lot between runs. So there must be an extrinsic factor (if the .tpr is
 functionally the same). Organizing yourself a quiet corner of the network
 is ideal, if you can do the required social engineering :-P

 Mark


 On Wed, Feb 5, 2014 at 6:22 PM, yunshi11 . yunsh...@gmail.com wrote:

  ...

Re: [gmx-users] Different optimal pme grid ... coulomb cutoff values from identical input files

2014-02-06 Thread Mark Abraham
On Feb 6, 2014 8:42 AM, yunshi11 . yunsh...@gmail.com wrote:

 On Wed, Feb 5, 2014 at 9:43 AM, Mark Abraham mark.j.abra...@gmail.com
wrote:

  What's the network? If it's some kind of switched Infiniband shared with
  other users' jobs, then getting hit by the traffic does happen. You can see
 


 It indeed uses an InfiniBand 4X QDR (Quad Data Rate) 40 Gbit/s switched
 fabric, with a two-to-one blocking factor.

 And I tried running this again with the GPU version, which showed the same
 issue: every single run gets a different coulomb cutoff after automatic
 optimization.

It is getting a different PME tuning, which is not surprising given the
noise in the timings it measures. There's probably a right tuning for you,
but you'd have to run each testing phase long enough to average over the
noise! The differences in result don't matter for correctness, only for
efficiency. If you decide on a setup you think is fastest on balance,
describe it in the .mdp and use mdrun -notunepme. That way you won't get a
stupid result from the tuner. It doesn't help that any single measurement
could be slow from noise, or because it is bad, and it is hard to tell the
difference without repeats.
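As a sketch of that suggestion (using the "optimal" values from the 1st log purely as an example, not a recommendation), the tuned cutoff and grid can be written into the .mdp with the standard options:

```
; example: fix the long-range setup the tuner chose in the 1st log
rcoulomb        = 1.073
fourier-nx      = 104
fourier-ny      = 104
fourier-nz      = 104
```

With these set and mdrun started with -notunepme, the run uses them as-is instead of re-timing grids at startup.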

 Since I am unlikely to get my own corner on a nation-wide supercomputer,
 are there any parameters that could prevent this from happening?

Only whether there are other users ;-) The next best thing is to ask the
scheduler to give you nodes at the same lowest level of the switch hierarchy.
This reduces your surface area by making you your own neighbour more often.
It will lead to longer queue times, of course, so weigh up efficiency vs.
throughput. Naturally, your scheduler won't support this request, but if you
don't ask for it, it never will! Likewise for a machine that can be
partitioned, given sufficient need.

 Turning off load balancing sounds crazy.

Yes. PME tuning and load balancing are different things! Neither is a
problem here, but both are affected by the runtime context.
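Averaging over the noise can also be done by hand: collect the tuner's "timed with pme grid" lines from several runs and average per setting before choosing. A rough sketch (parser and helper are mine; it assumes the 4.6 log format quoted in this thread):

```python
import re
from collections import defaultdict

# Tuner lines as printed in the logs quoted in this thread, e.g.:
#   step  220: timed with pme grid 104 104 104, coulomb cutoff 1.073: 210.9 M-cycles
PAT = re.compile(
    r"timed with pme grid (\d+ \d+ \d+), coulomb cutoff ([\d.]+): ([\d.]+)")

def best_setting(log_lines):
    """Average M-cycles per (grid, cutoff) setting; return the cheapest and all averages."""
    timings = defaultdict(list)
    for line in log_lines:
        m = PAT.search(line)
        if m:
            timings[(m.group(1), m.group(2))].append(float(m.group(3)))
    averages = {k: sum(v) / len(v) for k, v in timings.items()}
    return min(averages, key=averages.get), averages

lines = [
    "step   60: timed with pme grid 112 112 112, coulomb cutoff 1.000: 235.9 M-cycles",
    "step  140: timed with pme grid 112 112 112, coulomb cutoff 1.000: 223.9 M-cycles",
    "step  220: timed with pme grid 104 104 104, coulomb cutoff 1.073: 210.9 M-cycles",
    "step  420: timed with pme grid 104 104 104, coulomb cutoff 1.073: 215.1 M-cycles",
]
best, avgs = best_setting(lines)
print(best)   # ('104 104 104', '1.073'): lowest average cost in this sample
```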

Mark




  ...

Re: [gmx-users] Different optimal pme grid ... coulomb cutoff values from identical input files

2014-02-05 Thread Mark Abraham
What's the network? If it's some kind of switched Infiniband shared with
other users' jobs, then getting hit by the traffic does happen. You can see
that the individual timings of the things the load balancer tries differ a
lot between runs. So there must be an extrinsic factor (if the .tpr is
functionally the same). Organizing yourself a quiet corner of the network
is ideal, if you can do the required social engineering :-P

Mark


On Wed, Feb 5, 2014 at 6:22 PM, yunshi11 . yunsh...@gmail.com wrote:

 Hello all,

 I am doing a production MD run of a protein-ligand complex in explicit
 water with GROMACS 4.6.5.

 However, I got different coulomb cutoff values, as shown in the output log
 files.

 1st one:

 ...
 NOTE: Turning on dynamic load balancing

 step   60: timed with pme grid 112 112 112, coulomb cutoff 1.000: 235.9 M-cycles
 step  100: timed with pme grid 100 100 100, coulomb cutoff 1.116: 228.8 M-cycles
 step  100: the domain decompostion limits the PME load balancing to a coulomb cut-off of 1.162
 step  140: timed with pme grid 112 112 112, coulomb cutoff 1.000: 223.9 M-cycles
 step  180: timed with pme grid 108 108 108, coulomb cutoff 1.033: 219.2 M-cycles
 step  220: timed with pme grid 104 104 104, coulomb cutoff 1.073: 210.9 M-cycles
 step  260: timed with pme grid 100 100 100, coulomb cutoff 1.116: 229.0 M-cycles
 step  300: timed with pme grid 96 96 96, coulomb cutoff 1.162: 267.8 M-cycles
 step  340: timed with pme grid 112 112 112, coulomb cutoff 1.000: 241.4 M-cycles
 step  380: timed with pme grid 108 108 108, coulomb cutoff 1.033: 424.1 M-cycles
 step  420: timed with pme grid 104 104 104, coulomb cutoff 1.073: 215.1 M-cycles
 step  460: timed with pme grid 100 100 100, coulomb cutoff 1.116: 226.4 M-cycles
       optimal pme grid 104 104 104, coulomb cutoff 1.073
 DD  step 24999  vol min/aver 0.834  load imb.: force  2.3%  pme mesh/force 0.687

 ...


 2nd one:
 NOTE: Turning on dynamic load balancing

 step   60: timed with pme grid 112 112 112, coulomb cutoff 1.000: 187.1 M-cycles
 step  100: timed with pme grid 100 100 100, coulomb cutoff 1.116: 218.3 M-cycles
 step  140: timed with pme grid 112 112 112, coulomb cutoff 1.000: 172.4 M-cycles
 step  180: timed with pme grid 108 108 108, coulomb cutoff 1.033: 188.3 M-cycles
 step  220: timed with pme grid 104 104 104, coulomb cutoff 1.073: 203.1 M-cycles
 step  260: timed with pme grid 112 112 112, coulomb cutoff 1.000: 174.3 M-cycles
 step  300: timed with pme grid 108 108 108, coulomb cutoff 1.033: 184.4 M-cycles
 step  340: timed with pme grid 104 104 104, coulomb cutoff 1.073: 205.4 M-cycles
 step  380: timed with pme grid 112 112 112, coulomb cutoff 1.000: 172.1 M-cycles
 step  420: timed with pme grid 108 108 108, coulomb cutoff 1.033: 188.8 M-cycles
       optimal pme grid 112 112 112, coulomb cutoff 1.000
 DD  step 24999  vol min/aver 0.789  load imb.: force  4.7%  pme mesh/force 0.766

 ...




 The 2nd MD run turned out to be much faster (about 5 times); I only
 submitted the 2nd because the 1st was unexpectedly slow.

 I made sure the .tpr and .pbs files (MPI on a cluster of Xeon E5649 CPUs)
 were virtually identical. Here is my .mdp file:
 ;
 title            = Production Simulation
 cpp              = /lib/cpp

 ; RUN CONTROL PARAMETERS
 integrator       = md
 tinit            = 0       ; Starting time
 dt               = 0.002   ; 2 femtosecond time step for integration
 nsteps           = 5  ; 1000 ns = 0.002ps * 50,000,000

 ; OUTPUT CONTROL OPTIONS
 nstxout          = 25000   ; .trr full precision coor every 50ps
 nstvout          = 0       ; .trr velocities output
 nstfout          = 0       ; Not writing forces
 nstlog           = 25000   ; Writing to the log file every 50ps
 nstenergy        = 25000   ; Writing out energy information every 50ps
 energygrps       = dikpgdu Water_and_ions

 ; NEIGHBORSEARCHING PARAMETERS
 cutoff-scheme    = Verlet
 nstlist          = 20
 ns-type          = Grid
 pbc              = xyz     ; 3-D PBC
 rlist            = 1.0

 ; OPTIONS FOR ELECTROSTATICS AND VDW
 rcoulomb         = 1.0     ; short-range electrostatic cutoff (in nm)
 coulombtype      = PME     ; Particle Mesh Ewald for long-range electrostatics
 pme_order        = 4       ; interpolation
 fourierspacing   = 0.12    ; grid spacing for FFT
 vdw-type         = Cut-off
 rvdw