Re: [gmx-users] Different optimal pme grid ... coulomb cutoff values from identical input files
Note that your above (CPU) runs had a far from optimal PP-PME balance (pme mesh/force should be close to one). Performance instability can be caused by a busy network (how many nodes are you running on?) or even by incorrect affinity settings. If you post a/some log files, we may be able to tell more.

Cheers,
--
Szilárd

On Thu, Feb 6, 2014 at 8:35 AM, yunshi11 . yunsh...@gmail.com wrote:

> On Wed, Feb 5, 2014 at 9:43 AM, Mark Abraham mark.j.abra...@gmail.com wrote:
>
>> What's the network? If it's some kind of switched Infiniband shared with
>> other users' jobs, then getting hit by the traffic does happen. You can see
>
> It is indeed an InfiniBand 4X QDR (Quad Data Rate) 40 Gbit/s switched
> fabric, with a two-to-one blocking factor. And I tried running this again
> with the GPU version, which showed the same issue: every single run gets a
> different coulomb cutoff after automatic optimization.
>
> Since I am unlikely to get my own corner on a nation-wide supercomputer,
> are there any parameters that could prevent this from happening? Turning
> off load balancing sounds crazy.
>
>> that the individual timings of the things the load balancer tries differ
>> a lot between runs. So there must be an extrinsic factor (if the .tpr is
>> functionally the same). Organizing yourself a quiet corner of the network
>> is ideal, if you can do the required social engineering :-P
>>
>> Mark
>>
>> On Wed, Feb 5, 2014 at 6:22 PM, yunshi11 . yunsh...@gmail.com wrote:
>>
>>> Hello all,
>>>
>>> I am doing a production MD run of a protein-ligand complex in explicit
>>> water with GROMACS 4.6.5. However, I got different coulomb cutoff values,
>>> as shown in the output log files.
>>>
>>> 1st one:
>>> ...
NOTE: Turning on dynamic load balancing
step  60: timed with pme grid 112 112 112, coulomb cutoff 1.000: 235.9 M-cycles
step 100: timed with pme grid 100 100 100, coulomb cutoff 1.116: 228.8 M-cycles
step 100: the domain decomposition limits the PME load balancing to a coulomb cut-off of 1.162
step 140: timed with pme grid 112 112 112, coulomb cutoff 1.000: 223.9 M-cycles
step 180: timed with pme grid 108 108 108, coulomb cutoff 1.033: 219.2 M-cycles
step 220: timed with pme grid 104 104 104, coulomb cutoff 1.073: 210.9 M-cycles
step 260: timed with pme grid 100 100 100, coulomb cutoff 1.116: 229.0 M-cycles
step 300: timed with pme grid  96  96  96, coulomb cutoff 1.162: 267.8 M-cycles
step 340: timed with pme grid 112 112 112, coulomb cutoff 1.000: 241.4 M-cycles
step 380: timed with pme grid 108 108 108, coulomb cutoff 1.033: 424.1 M-cycles
step 420: timed with pme grid 104 104 104, coulomb cutoff 1.073: 215.1 M-cycles
step 460: timed with pme grid 100 100 100, coulomb cutoff 1.116: 226.4 M-cycles
optimal pme grid 104 104 104, coulomb cutoff 1.073
DD step 24999 vol min/aver 0.834 load imb.: force 2.3% pme mesh/force 0.687
...
2nd one:
...
NOTE: Turning on dynamic load balancing
step  60: timed with pme grid 112 112 112, coulomb cutoff 1.000: 187.1 M-cycles
step 100: timed with pme grid 100 100 100, coulomb cutoff 1.116: 218.3 M-cycles
step 140: timed with pme grid 112 112 112, coulomb cutoff 1.000: 172.4 M-cycles
step 180: timed with pme grid 108 108 108, coulomb cutoff 1.033: 188.3 M-cycles
step 220: timed with pme grid 104 104 104, coulomb cutoff 1.073: 203.1 M-cycles
step 260: timed with pme grid 112 112 112, coulomb cutoff 1.000: 174.3 M-cycles
step 300: timed with pme grid 108 108 108, coulomb cutoff 1.033: 184.4 M-cycles
step 340: timed with pme grid 104 104 104, coulomb cutoff 1.073: 205.4 M-cycles
step 380: timed with pme grid 112 112 112, coulomb cutoff 1.000: 172.1 M-cycles
step 420: timed with pme grid 108 108 108, coulomb cutoff 1.033: 188.8 M-cycles
optimal pme grid 112 112 112, coulomb cutoff 1.000
DD step 24999 vol min/aver 0.789 load imb.: force 4.7% pme mesh/force 0.766
...

The 2nd MD run turned out to be much faster (about five times), and the reason I submitted the 2nd is that the 1st was unexpectedly slow. I made sure the .tpr file and .pbs file (MPI for a cluster, which consists of Xeon E5649 CPUs) are virtually identical, and here is my .mdp file:
...
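As an aside, the per-step timing reports above are easy to compare across runs mechanically. A minimal sketch in Python (the regex and the winner-picking rule here are ad hoc illustrations, not how GROMACS itself decides):

```python
import re

# Matches lines like:
#   step 60: timed with pme grid 112 112 112, coulomb cutoff 1.000: 235.9 M-cycles
TIMING = re.compile(
    r"step\s+(\d+): timed with pme grid (\d+)\s+(\d+)\s+(\d+), "
    r"coulomb cutoff ([\d.]+): ([\d.]+) M-cycles"
)

def fastest_setting(log_text):
    """Group timings by (grid, cutoff) and return the candidate whose
    best observed timing is lowest (a crude stand-in for the tuner)."""
    timings = {}
    for m in TIMING.finditer(log_text):
        key = (tuple(map(int, m.group(2, 3, 4))), float(m.group(5)))
        timings.setdefault(key, []).append(float(m.group(6)))
    return min(timings, key=lambda k: min(timings[k]))

log = """\
step 60: timed with pme grid 112 112 112, coulomb cutoff 1.000: 235.9 M-cycles
step 220: timed with pme grid 104 104 104, coulomb cutoff 1.073: 210.9 M-cycles
step 340: timed with pme grid 112 112 112, coulomb cutoff 1.000: 241.4 M-cycles
"""
print(fastest_setting(log))  # ((104, 104, 104), 1.073)
```

With noisy timings like the 424.1 M-cycles outlier in the first log, a single measurement per setting is exactly the kind of thing that sends two identical runs to different "optimal" cutoffs.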
Re: [gmx-users] Different optimal pme grid ... coulomb cutoff values from identical input files
On Feb 6, 2014 8:42 AM, yunshi11 . yunsh...@gmail.com wrote:

> On Wed, Feb 5, 2014 at 9:43 AM, Mark Abraham mark.j.abra...@gmail.com wrote:
>
>> What's the network? If it's some kind of switched Infiniband shared with
>> other users' jobs, then getting hit by the traffic does happen.
>
> It is indeed an InfiniBand 4X QDR (Quad Data Rate) 40 Gbit/s switched
> fabric, with a two-to-one blocking factor. And I tried running this again
> with the GPU version, which showed the same issue: every single run gets a
> different coulomb cutoff after automatic optimization.

It is getting a different PME tuning each time, which is not surprising given the noise in the timings it measures. There is probably a right tuning for you, but you would have to run each testing phase long enough to average over the noise! The differences in the result don't matter for correctness, only for efficiency. If you decide on a setup you think is fastest on balance, describe it in the .mdp and use mdrun -notunepme. That way you won't get a stupid result from the tuner. It doesn't help that any single measurement could be slow because of noise, or because the setting is genuinely bad, and it is hard to tell the difference without repeats.

> Since I am unlikely to get my own corner on a nation-wide supercomputer,
> are there any parameters that could prevent this from happening?

Only whether there are other users ;-) The next best thing is to request that the scheduler give you nodes at the same lowest level of the switch hierarchy. This reduces your surface area, by making you your own neighbour more often. It will lead to longer queue times, of course, so weigh up efficiency vs. throughput. Naturally, your scheduler won't support this request, but if you don't ask for it, it never will! Likewise for a machine that can be partitioned, given sufficient need.

> Turning off load balancing sounds crazy.

Yes. PME tuning and dynamic load balancing are different things! Neither is a problem here, but both are affected by the runtime context.
Mark
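For concreteness, acting on the -notunepme suggestion would look like the fragment below: copy the winning values from the log of the run you liked (here the 112-point grid and 1.0 nm cutoff from the faster second run) into the .mdp, then disable run-time tuning. The option names are standard .mdp settings; treat the particular values as an example, not a recommendation:

```
; pin the PME setup the tuner chose in the fast run
rcoulomb    = 1.0
fourier-nx  = 112
fourier-ny  = 112
fourier-nz  = 112
```

Then run with mdrun -notunepme (or however your site names the MPI-enabled mdrun binary), so every run uses the same, known-good setup.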
Re: [gmx-users] Different optimal pme grid ... coulomb cutoff values from identical input files
What's the network? If it's some kind of switched Infiniband shared with other users' jobs, then getting hit by the traffic does happen. You can see that the individual timings of the things the load balancer tries differ a lot between runs. So there must be an extrinsic factor (if the .tpr is functionally the same). Organizing yourself a quiet corner of the network is ideal, if you can do the required social engineering :-P

Mark

On Wed, Feb 5, 2014 at 6:22 PM, yunshi11 . yunsh...@gmail.com wrote:

Hello all,

I am doing a production MD run of a protein-ligand complex in explicit water with GROMACS 4.6.5. However, I got different coulomb cutoff values, as shown in the output log files.

1st one:
...
2nd one:
...

The 2nd MD run turned out to be much faster (about five times), and the reason I submitted the 2nd is that the 1st was unexpectedly slow.
I made sure the .tpr file and .pbs file (MPI for a cluster, which consists of Xeon E5649 CPUs) are virtually identical, and here is my .mdp file:

; title = Production Simulation
cpp = /lib/cpp

; RUN CONTROL PARAMETERS
integrator = md
tinit      = 0     ; Starting time
dt         = 0.002 ; 2 femtosecond time step for integration
nsteps     = 5     ; 1000 ns = 0.002ps * 50,000,000

; OUTPUT CONTROL OPTIONS
nstxout   = 25000 ; .trr full precision coor every 50ps
nstvout   = 0     ; .trr velocities output
nstfout   = 0     ; Not writing forces
nstlog    = 25000 ; Writing to the log file every 50ps
nstenergy = 25000 ; Writing out energy information every 50ps
energygrps = dikpgdu Water_and_ions

; NEIGHBORSEARCHING PARAMETERS
cutoff-scheme = Verlet
nstlist       = 20
ns-type       = Grid
pbc           = xyz ; 3-D PBC
rlist         = 1.0

; OPTIONS FOR ELECTROSTATICS AND VDW
rcoulomb       = 1.0  ; short-range electrostatic cutoff (in nm)
coulombtype    = PME  ; Particle Mesh Ewald for long-range electrostatics
pme_order      = 4    ; interpolation
fourierspacing = 0.12 ; grid spacing for FFT
vdw-type       = Cut-off
rvdw
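For reference, with fourierspacing = 0.12 the 112-point grids in the logs above imply a box edge of roughly 13.4 nm: grompp picks the smallest FFT-friendly grid dimension that keeps the actual spacing at or below the requested value. A sketch of that rule (the 13.4 nm box edge is inferred, the allowed prime factors are an assumption based on what FFTW handles efficiently, and real GROMACS applies further constraints from the domain decomposition):

```python
import math

def fft_friendly(n):
    # assume grid sizes must factor entirely into small primes
    for p in (2, 3, 5, 7, 11, 13):
        while n % p == 0:
            n //= p
    return n == 1

def pme_grid_points(box_nm, spacing_nm):
    """Smallest FFT-friendly grid dimension with spacing <= spacing_nm."""
    n = math.ceil(box_nm / spacing_nm)
    while not fft_friendly(n):
        n += 1
    return n

print(pme_grid_points(13.4, 0.12))  # 112
```

All the grids the tuner tried above (96, 100, 104, 108, 112) pass this factor test, which is why they are the candidates it cycles through as it trades mesh work against a longer real-space cutoff.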