Thanks to everyone who responded. It appears that the issue was not the version of Slurm, but rather that we had set TaskAffinity=yes in cgroup.conf at the same time we installed the new version.
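For anyone wanting to check whether a TaskAffinity setting is confining their processes, one option is to print the set of CPUs each process is allowed to run on. This is only a minimal sketch, assuming Linux and Python 3; the function name allowed_cpus is made up for illustration and is not part of Slurm or Open MPI.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPU ids the given process may run on (0 = this process)."""
    # os.sched_getaffinity is only available on Linux.
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    cpus = allowed_cpus()
    # One line per process, so the output of several ranks can be compared.
    print(f"pid {os.getpid()} may run on {len(cpus)} CPU(s): {cpus}")
```

Launched from the job script with something like "mpirun python3 check_affinity.py" (a hypothetical file name), this prints one line per rank. If every rank on a node reports the same single CPU, the ranks are probably competing for one core.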
Applications that were using OpenMPI version 1.6 and prior were in many cases showing dramatically slower run times. I incorrectly wrote earlier that v1.8 was also affected; in fact it seems to have been OK.

I don't have a good environment for testing this further at the moment, unfortunately, but since we backed out the change the users are happy again.

Thanks again,
Peter

On 2/6/15, 6:49 AM, "Ralph Castain" <[email protected]> wrote:

>If you are launching via mpirun, then you won't be using either version
>of PMI - OMPI has its own internal daemons that handle the launch and
>wireup.
>
>It's odd that it happens across OMPI versions as there exist significant
>differences between them. Is the speed difference associated with non-MPI
>jobs as well? In other words, if you execute "mpirun hostname", does it
>also take an inordinate amount of time?
>
>If not, then the other possibility is that you are falling back on TCP
>instead of IB, or that something is preventing the use of shared memory
>as a transport for procs on the same node.
>
>
>> On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht
>> <[email protected]> wrote:
>>
>> Answering two questions at one time:
>>
>> I am pretty sure we are not using PMI2.
>>
>> Jobs are launched via "sbatch job_script" where the script contains
>> "mpirun ./executable_file". There appear to be issues with at least
>> OMPI 1.6.4 and 1.8.X.
>>
>> Thanks
>> Peter
>>
>> On 2/5/15, 5:39 PM, "Ralph Castain" <[email protected]> wrote:
>>
>>> And are you launching via mpirun or directly with srun <myapp>? What
>>> OMPI version are you using?
>>>
>>>> On Feb 5, 2015, at 3:32 PM, Chris Samuel <[email protected]> wrote:
>>>>
>>>> On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
>>>>
>>>>> I ask because some of our users have started reporting a 10x increase
>>>>> in run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3.
>>>>> It's possible there is some other problem going on in our cluster,
>>>>> but all of our hardware checks including Infiniband diagnostics look
>>>>> pretty clean.
>>>>
>>>> Are you using PMI2?
>>>>
>>>> cheers,
>>>> Chris
>>>> --
>>>> Christopher Samuel Senior Systems Administrator
>>>> VLSCI - Victorian Life Sciences Computation Initiative
>>>> Email: [email protected] Phone: +61 (0)3 903 55545
>>>> http://www.vlsci.org.au/ http://twitter.com/vlsci
