Glad to hear you found/fixed the problem!
> On Feb 6, 2015, at 2:44 PM, Peter A Ruprecht <[email protected]> wrote:
>
> Thanks to everyone who responded.
>
> It appears that the issue was not the version of Slurm, but rather that we
> had set TaskAffinity=yes in cgroups.conf at the same time we installed the
> new version.
>
> Applications that were using OpenMPI version 1.6 and prior were in many
> cases showing dramatically slower run times. I incorrectly wrote earlier
> that v1.8 was also affected; in fact it seems to have been OK.
>
> I don't have a good environment for testing this further at the moment,
> unfortunately, but since we backed out the change the users are happy
> again.
>
> Thanks again,
> Peter
>
> On 2/6/15, 6:49 AM, "Ralph Castain" <[email protected]> wrote:
>
>> If you are launching via mpirun, then you won't be using either version
>> of PMI - OMPI has its own internal daemons that handle the launch and
>> wireup.
>>
>> It's odd that it happens across OMPI versions as there exist significant
>> differences between them. Is the speed difference associated with non-MPI
>> jobs as well? In other words, if you execute "mpirun hostname", does it
>> also take an inordinate amount of time?
>>
>> If not, then the other possibility is that you are falling back on TCP
>> instead of IB, or that something is preventing the use of shared memory
>> as a transport for procs on the same node.
>>
>>> On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht
>>> <[email protected]> wrote:
>>>
>>> Answering two questions at one time:
>>>
>>> I am pretty sure we are not using PMI2.
>>>
>>> Jobs are launched via "sbatch job_script" where the script contains
>>> "mpirun ./executable_file". There appear to be issues with at least
>>> OMPI 1.6.4 and 1.8.X.
>>>
>>> Thanks
>>> Peter
>>>
>>> On 2/5/15, 5:39 PM, "Ralph Castain" <[email protected]> wrote:
>>>
>>>> And are you launching via mpirun or directly with srun <myapp>? What
>>>> OMPI version are you using?
>>>>
>>>>> On Feb 5, 2015, at 3:32 PM, Chris Samuel <[email protected]>
>>>>> wrote:
>>>>>
>>>>> On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
>>>>>
>>>>>> I ask because some of our users have started reporting a 10x increase
>>>>>> in run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3.
>>>>>> It's possible there is some other problem going on in our cluster,
>>>>>> but all of our hardware checks including Infiniband diagnostics look
>>>>>> pretty clean.
>>>>>
>>>>> Are you using PMI2?
>>>>>
>>>>> cheers,
>>>>> Chris
>>>>> --
>>>>> Christopher Samuel Senior Systems Administrator
>>>>> VLSCI - Victorian Life Sciences Computation Initiative
>>>>> Email: [email protected] Phone: +61 (0)3 903 55545
>>>>> http://www.vlsci.org.au/ http://twitter.com/vlsci
