Can you set MV2_SHOW_CPU_BINDING equal to 1 when running your job? This should show whether affinity is causing your processes to be oversubscribed on a set of cores.
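A rough sketch of how that might look in a job script. The launch line is only a placeholder, and this assumes the launcher passes exported environment variables through to the MPI ranks:

```shell
#!/bin/bash
# Ask MVAPICH2 to report the CPU binding it applies to the ranks at startup.
export MV2_SHOW_CPU_BINDING=1

# Placeholder launch line (assumption): substitute the command that actually
# starts the job in your batch script.
# srun ./executable_file
```
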
If this is the case you can disable the affinity from the library by setting MV2_ENABLE_AFFINITY to 0.

On Fri, Feb 06, 2015 at 03:18:48PM -0800, Novosielski, Ryan wrote:
> I am running into a similar problem, with Slurm 14.11.3 and MVAPICH2 2.0. I
> am wondering if perhaps having CPU affinity configured in MVAPICH2 and Slurm
> at the same time isn't a bad idea (I've also since realized that it uses
> cgroups and that the 2.6.18 kernel in RHEL5 does not support it anyway -- but
> it didn't seem to be harming anything. Maybe it was?).
>
> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> || \\UTGERS      |---------------------*O*---------------------
> ||_// Biomedical | Ryan Novosielski - Senior Technologist
> || \\ and Health | [email protected] - 973/972.0922 (2x0922)
> ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>     `'
>
> On Feb 6, 2015, at 18:10, Ralph Castain <[email protected]> wrote:
>
> Glad to hear you found/fixed the problem!
>
> On Feb 6, 2015, at 2:44 PM, Peter A Ruprecht <[email protected]> wrote:
>
> Thanks to everyone who responded.
>
> It appears that the issue was not the version of Slurm, but rather that we
> had set TaskAffinity=yes in cgroups.conf at the same time we installed the
> new version.
>
> Applications that were using OpenMPI version 1.6 and prior were in many
> cases showing dramatically slower run times. I incorrectly wrote earlier
> that v1.8 was also affected; in fact it seems to have been OK.
>
> I don't have a good environment for testing this further at the moment,
> unfortunately, but since we backed out the change the users are happy
> again.
>
> Thanks again,
> Peter
>
> On 2/6/15, 6:49 AM, "Ralph Castain" <[email protected]> wrote:
>
> If you are launching via mpirun, then you won't be using either version
> of PMI - OMPI has its own internal daemons that handle the launch and
> wireup.
>
> It's odd that it happens across OMPI versions as there exist significant
> differences between them. Is the speed difference associated with non-MPI
> jobs as well? In other words, if you execute "mpirun hostname", does it
> also take an inordinate amount of time?
>
> If not, then the other possibility is that you are falling back on TCP
> instead of IB, or that something is preventing the use of shared memory
> as a transport for procs on the same node.
>
> On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht <[email protected]> wrote:
>
> Answering two questions at one time:
>
> I am pretty sure we are not using PMI2.
>
> Jobs are launched via "sbatch job_script" where the script contains
> "mpirun ./executable_file". There appear to be issues with at least
> OMPI 1.6.4 and 1.8.X.
>
> Thanks
> Peter
>
> On 2/5/15, 5:39 PM, "Ralph Castain" <[email protected]> wrote:
>
> And are you launching via mpirun or directly with srun <myapp>? What
> OMPI version are you using?
>
> On Feb 5, 2015, at 3:32 PM, Chris Samuel <[email protected]> wrote:
>
> On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
>
> I ask because some of our users have started reporting a 10x increase in
> run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3. It's
> possible there is some other problem going on in our cluster, but all of
> our hardware checks including Infiniband diagnostics look pretty clean.
>
> Are you using PMI2?
>
> cheers,
> Chris
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: [email protected] Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci

-- 
Jonathan Perkins
