I am running into a similar problem with Slurm 14.11.3 and MVAPICH2 2.0. I am
wondering whether having CPU affinity configured in MVAPICH2 and Slurm at the
same time is a bad idea. (I've also since realized that the Slurm setting uses
cgroups, and that the 2.6.18 kernel in RHEL5 does not support them anyway, but
it didn't seem to be harming anything. Maybe it was?)

____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | [email protected] - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
    `'

On Feb 6, 2015, at 18:10, Ralph Castain <[email protected]> wrote:


Glad to hear you found/fixed the problem!

On Feb 6, 2015, at 2:44 PM, Peter A Ruprecht <[email protected]> wrote:


Thanks to everyone who responded.

It appears that the issue was not the version of Slurm, but rather that we
had set TaskAffinity=yes in cgroup.conf at the same time we installed the
new version.
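
One way to see whether two affinity layers are fighting is to have each rank
report the CPUs it is actually allowed to run on. The following is a minimal
sketch, not something from this thread: it assumes Linux (os.sched_getaffinity
is not available elsewhere), and it assumes the launcher exports either
OMPI_COMM_WORLD_RANK (Open MPI's mpirun) or SLURM_PROCID (srun). The function
names are made up for illustration.

```python
import os
import socket


def format_cpu_list(cpus):
    """Collapse a set of CPU ids into a compact string, e.g. {0, 1, 2, 5} -> '0-2,5'."""
    ids = sorted(cpus)
    if not ids:
        return ""
    ranges = []
    start = prev = ids[0]
    for cpu in ids[1:]:
        if cpu == prev + 1:
            # Still inside a consecutive run of CPU ids.
            prev = cpu
            continue
        ranges.append((start, prev))
        start = prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def describe_binding():
    """Return one line with the host, the rank (if known) and the allowed CPUs."""
    # Assumption: the rank comes from whichever launcher variable is set.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK",
                          os.environ.get("SLURM_PROCID", "?"))
    cpus = os.sched_getaffinity(0)  # Linux only: CPUs this process may use
    return f"{socket.gethostname()} rank={rank} cpus={format_cpu_list(cpus)}"


if __name__ == "__main__":
    print(describe_binding())
```

Launched in place of the real executable from the same job script, this prints
one line per rank. If several ranks on one node report the same single CPU,
that would be consistent with the kind of slowdown described here, though it
does not by itself prove the cause.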

Applications that were using OpenMPI version 1.6 and prior were in many
cases showing dramatically slower run times.  I incorrectly wrote earlier
that v1.8 was also affected; in fact it seems to have been OK.

I don't have a good environment for testing this further at the moment,
unfortunately, but since we backed out the change the users are happy
again.

Thanks again,
Peter

On 2/6/15, 6:49 AM, "Ralph Castain" <[email protected]> wrote:


If you are launching via mpirun, then you won't be using either version
of PMI - OMPI has its own internal daemons that handle the launch and
wireup.

It's odd that it happens across OMPI versions as there exist significant
differences between them. Is the speed difference associated with non-MPI
jobs as well? In other words, if you execute "mpirun hostname", does it
also take an inordinate amount of time?

If not, then the other possibility is that you are falling back on TCP
instead of IB, or that something is preventing the use of shared memory
as a transport for procs on the same node.
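
The "mpirun hostname" check suggested above can be timed with a small wrapper
so that the launch cost is a number rather than an impression. This is only a
sketch under stated assumptions: it assumes mpirun is on PATH inside the job
allocation, and the helper names are hypothetical. It measures wall-clock time
for the whole command and nothing else.

```python
import shutil
import subprocess
import sys
import time

# Assumption: Open MPI's mpirun is on PATH; "hostname" is the trivial non-MPI job.
DEFAULT_COMMAND = ["mpirun", "hostname"]


def time_command(command):
    """Run the command, discard its output, and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(command,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return time.perf_counter() - start, result.returncode


def main(command):
    """Time the command if its executable can be found; return the elapsed time or None."""
    if shutil.which(command[0]) is None:
        print(f"{command[0]} not found on PATH; nothing timed")
        return None
    elapsed, code = time_command(command)
    print(f"{' '.join(command)}: {elapsed:.2f} s (exit code {code})")
    return elapsed


if __name__ == "__main__":
    # Any other command can be passed on the command line instead of the default.
    main(sys.argv[1:] or DEFAULT_COMMAND)
```

If the launch itself comes back in about the same time under both
configurations while the real application is much slower, the launcher is
probably not the problem and the transport or binding questions above are the
next thing to look at.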


On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht <[email protected]> wrote:


Answering two questions at one time:

I am pretty sure we are not using PMI2.

Jobs are launched via "sbatch job_script" where the script contains
"mpirun ./executable_file".  There appear to be issues with at least
OMPI 1.6.4 and 1.8.X.

Thanks
Peter

On 2/5/15, 5:39 PM, "Ralph Castain" <[email protected]> wrote:


And are you launching via mpirun or directly with srun <myapp>? What
OMPI version are you using?


On Feb 5, 2015, at 3:32 PM, Chris Samuel <[email protected]> wrote:


On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:

I ask because some of our users have started reporting a 10x increase in
run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3. It's
possible there is some other problem going on in our cluster, but all of
our hardware checks including Infiniband diagnostics look pretty clean.

Are you using PMI2?

cheers,
Chris
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
