I am running into a similar problem with Slurm 14.11.3 and MVAPICH2 2.0. I am wondering whether having CPU affinity configured in both MVAPICH2 and Slurm at the same time is a bad idea (I've also since realized that it uses cgroups, and that the 2.6.18 kernel in RHEL5 does not support them anyway -- but it didn't seem to be harming anything. Maybe it was?).
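One way to check whether Slurm and the MPI library are both binding tasks is to have every rank report the CPU set it is allowed to run on. The following is a minimal sketch, not a tested diagnostic: it assumes Linux and a Python 3 interpreter on the compute nodes, and it reads the rank from `OMPI_COMM_WORLD_RANK` (Open MPI), `MV2_COMM_WORLD_RANK` (MVAPICH2) or `SLURM_PROCID`, whichever is set. The helper names `current_affinity` and `oversubscribed` are hypothetical, introduced only for illustration.

```python
import os
import socket
from collections import defaultdict


def current_affinity():
    """Return the set of CPU ids this process may run on, or None if the
    platform does not expose sched_getaffinity (it is Linux-only)."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return None


def oversubscribed(records):
    """records: iterable of (host, rank, cpus) tuples.

    Group ranks by (host, CPU mask) and return the groups in which more
    ranks share a mask than the mask has CPUs, e.g. four ranks all bound
    to core 0. Unbound ranks (mask = every core) are not flagged.
    """
    groups = defaultdict(list)
    for host, rank, cpus in records:
        groups[(host, frozenset(cpus))].append(rank)
    return {
        key: sorted(ranks)
        for key, ranks in groups.items()
        if len(ranks) > len(key[1])
    }


if __name__ == "__main__":
    # Assumption: one of these variables is set by the launcher in use.
    rank = (
        os.environ.get("OMPI_COMM_WORLD_RANK")
        or os.environ.get("MV2_COMM_WORLD_RANK")
        or os.environ.get("SLURM_PROCID")
        or "?"
    )
    cpus = current_affinity()
    shown = sorted(cpus) if cpus is not None else "unavailable"
    print(f"{socket.gethostname()} rank {rank} cpus {shown}")
```

Launching this the same way as the real job (for example in place of `./executable_file` in the batch script) prints one line per rank. If several ranks on a node report the same single core, the two affinity settings are probably fighting each other; the collected lines can be passed to `oversubscribed` to spot that case automatically.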
 ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
 || \\UTGERS      |---------------------*O*---------------------
 ||_// Biomedical | Ryan Novosielski - Senior Technologist
 || \\ and Health | [email protected] - 973/972.0922 (2x0922)
 ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
      `'

On Feb 6, 2015, at 18:10, Ralph Castain <[email protected]> wrote:

Glad to hear you found/fixed the problem!

On Feb 6, 2015, at 2:44 PM, Peter A Ruprecht <[email protected]> wrote:

Thanks to everyone who responded.

It appears that the issue was not the version of Slurm, but rather that we had set TaskAffinity=yes in cgroups.conf at the same time we installed the new version. Applications that were using OpenMPI version 1.6 and prior were in many cases showing dramatically slower run times. I incorrectly wrote earlier that v1.8 was also affected; in fact it seems to have been OK.

I don't have a good environment for testing this further at the moment, unfortunately, but since we backed out the change the users are happy again.

Thanks again,
Peter

On 2/6/15, 6:49 AM, "Ralph Castain" <[email protected]> wrote:

If you are launching via mpirun, then you won't be using either version of PMI - OMPI has its own internal daemons that handle the launch and wireup.

It's odd that it happens across OMPI versions as there exist significant differences between them. Is the speed difference associated with non-MPI jobs as well? In other words, if you execute "mpirun hostname", does it also take an inordinate amount of time?

If not, then the other possibility is that you are falling back on TCP instead of IB, or that something is preventing the use of shared memory as a transport for procs on the same node.

On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht <[email protected]> wrote:

Answering two questions at one time:

I am pretty sure we are not using PMI2.

Jobs are launched via "sbatch job_script" where the script contains "mpirun ./executable_file". There appear to be issues with at least OMPI 1.6.4 and 1.8.X.

Thanks
Peter

On 2/5/15, 5:39 PM, "Ralph Castain" <[email protected]> wrote:

And are you launching via mpirun or directly with srun <myapp>? What OMPI version are you using?

On Feb 5, 2015, at 3:32 PM, Chris Samuel <[email protected]> wrote:

On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:

I ask because some of our users have started reporting a 10x increase in run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3. It's possible there is some other problem going on in our cluster, but all of our hardware checks including Infiniband diagnostics look pretty clean.

Are you using PMI2?

cheers,
Chris
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
