Can you set MV2_SHOW_CPU_BINDING to 1 when running your job?  This
should show whether the affinity settings are causing your processes to
be oversubscribed on a set of cores.
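
As a library-independent cross-check, here is a minimal Python sketch that
prints the CPUs a process is allowed to run on and flags cores that show up
in more than one rank's mask. It is an illustration, not part of MVAPICH2:
the function names are made up for this example, reading the rank from
SLURM_PROCID assumes a launch via srun, and os.sched_getaffinity is
Linux-only.

```python
import os
from collections import defaultdict


def current_affinity():
    """Return the sorted CPU ids this process may run on, or None if unsupported."""
    if not hasattr(os, "sched_getaffinity"):  # only available on Linux
        return None
    return sorted(os.sched_getaffinity(0))


def find_shared_cores(bindings):
    """Return {core: [ranks]} for every core that appears in more than one rank's mask.

    bindings maps a rank to the iterable of core ids it is bound to.
    """
    ranks_by_core = defaultdict(list)
    for rank, cores in sorted(bindings.items()):
        for core in cores:
            ranks_by_core[core].append(rank)
    return {core: ranks
            for core, ranks in sorted(ranks_by_core.items())
            if len(ranks) > 1}


if __name__ == "__main__":
    # Assumption: SLURM_PROCID is present because the process was started by srun.
    rank = os.environ.get("SLURM_PROCID", "?")
    print(f"rank {rank} pid {os.getpid()} cpus {current_affinity()}")
```

When every rank is pinned to a single core, any entry returned by
find_shared_cores means two ranks are competing for that core. Wide masks
(no pinning at all) overlap harmlessly, so the check is only meaningful for
tightly bound ranks.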

If this is the case, you can disable the library's affinity handling by
setting MV2_ENABLE_AFFINITY to 0.
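
If it helps to script this, a small Python sketch for building the job's
environment with both variables set could look like the following. The
helper name mvapich2_env is hypothetical, and whether the variables reach
the MPI processes depends on how your launcher propagates the environment,
so treat this as a starting point only.

```python
import os


def mvapich2_env(base=None, show_binding=True, enable_affinity=True):
    """Return a copy of the environment with the two MVAPICH2 variables set."""
    env = dict(os.environ if base is None else base)
    env["MV2_SHOW_CPU_BINDING"] = "1" if show_binding else "0"
    env["MV2_ENABLE_AFFINITY"] = "1" if enable_affinity else "0"
    return env


# Hypothetical use; the launcher and program name are placeholders:
#   import subprocess
#   subprocess.run(["srun", "./executable_file"],
#                  env=mvapich2_env(enable_affinity=False), check=True)
```

Running once with the defaults shows the binding MVAPICH2 chose; running
again with enable_affinity=False leaves placement to Slurm alone.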

On Fri, Feb 06, 2015 at 03:18:48PM -0800, Novosielski, Ryan wrote:
> I am running into a similar problem, with Slurm 14.11.3 and MVAPICH2 2.0. I
> am wondering whether having CPU affinity configured in MVAPICH2 and Slurm
> at the same time is a bad idea (I've also since realized that it uses
> cgroups, and that the 2.6.18 kernel in RHEL5 does not support them anyway --
> but it didn't seem to be harming anything. Maybe it was?).
> 
> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> || \\UTGERS      |---------------------*O*---------------------
> ||_// Biomedical | Ryan Novosielski - Senior Technologist
> || \\ and Health | [email protected] - 973/972.0922 (2x0922)
> ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>     `'
> 
> On Feb 6, 2015, at 18:10, Ralph Castain <[email protected]> wrote:
> 
> 
> Glad to hear you found/fixed the problem!
> 
> On Feb 6, 2015, at 2:44 PM, Peter A Ruprecht <[email protected]> wrote:
> 
> 
> Thanks to everyone who responded.
> 
> It appears that the issue was not the version of Slurm, but rather that we
> had set TaskAffinity=yes in cgroup.conf at the same time we installed the
> new version.
> 
> Applications that were using OpenMPI version 1.6 and prior were in many
> cases showing dramatically slower run times.  I incorrectly wrote earlier
> that v1.8 was also affected; in fact it seems to have been OK.
> 
> I don't have a good environment for testing this further at the moment,
> unfortunately, but since we backed out the change the users are happy
> again.
> 
> Thanks again,
> Peter
> 
> On 2/6/15, 6:49 AM, "Ralph Castain" <[email protected]> wrote:
> 
> 
> If you are launching via mpirun, then you won't be using either version
> of PMI - OMPI has its own internal daemons that handle the launch and
> wireup.
> 
> It's odd that it happens across OMPI versions as there exist significant
> differences between them. Is the speed difference associated with non-MPI
> jobs as well? In other words, if you execute "mpirun hostname", does it
> also take an inordinate amount of time?
> 
> If not, then the other possibility is that you are falling back on TCP
> instead of IB, or that something is preventing the use of shared memory
> as a transport for procs on the same node.
> 
> 
> On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht <[email protected]> wrote:
> 
> 
> Answering two questions at one time:
> 
> I am pretty sure we are not using PMI2.
> 
> Jobs are launched via "sbatch job_script" where the script contains
> "mpirun ./executable_file".  There appear to be issues with at least OMPI
> 1.6.4 and 1.8.X.
> 
> Thanks
> Peter
> 
> On 2/5/15, 5:39 PM, "Ralph Castain" <[email protected]> wrote:
> 
> 
> And are you launching via mpirun or directly with srun <myapp>? What OMPI
> version are you using?
> 
> 
> On Feb 5, 2015, at 3:32 PM, Chris Samuel <[email protected]> wrote:
> 
> 
> On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
> 
> I ask because some of our users have started reporting a 10x increase in
> run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3. It's
> possible there is some other problem going on in our cluster, but all of
> our hardware checks including Infiniband diagnostics look pretty clean.
> 
> Are you using PMI2?
> 
> cheers,
> Chris
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: [email protected]  Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci

-- 
Jonathan Perkins
