Thanks to everyone who responded.

It appears that the issue was not the version of Slurm, but rather that we
had set TaskAffinity=yes in cgroup.conf at the same time we installed the
new version.

Applications that were using OpenMPI version 1.6 and prior were in many
cases showing dramatically slower run times.  I incorrectly wrote earlier
that v1.8 was also affected; in fact it seems to have been OK.

I don't have a good environment for testing this further at the moment,
unfortunately, but since we backed out the change the users are happy
again.
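
For anyone who finds this thread later, backing out the change amounts to
a cgroup.conf along the lines below. This is a minimal sketch, not a copy
of our actual file: only the TaskAffinity setting comes from the discussion
above, and it assumes the task/cgroup plugin is what reads this file on the
compute nodes.

```
# cgroup.conf (illustrative sketch; only TaskAffinity is taken from this thread)
# TaskAffinity=yes was the setting that coincided with the slow OpenMPI <= 1.6 runs.
# Setting it back to "no" reverts the change.
TaskAffinity=no
```

As far as I know, "no" is also the default, so deleting the line should have
the same effect, but please check the cgroup.conf man page for your version.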

Thanks again,
Peter

On 2/6/15, 6:49 AM, "Ralph Castain" <[email protected]> wrote:

>
>If you are launching via mpirun, then you won't be using either version
>of PMI - OMPI has its own internal daemons that handle the launch and
>wireup.
>
>It's odd that it happens across OMPI versions as there exist significant
>differences between them. Is the speed difference associated with non-MPI
>jobs as well? In other words, if you execute "mpirun hostname", does it
>also take an inordinate amount of time?
>
>If not, then the other possibility is that you are falling back on TCP
>instead of IB, or that something is preventing the use of shared memory
>as a transport for procs on the same node.
>
>
>> On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht <[email protected]> wrote:
>> 
>> 
>> Answering two questions at one time:
>> 
>> I am pretty sure we are not using PMI2.
>> 
>> Jobs are launched via "sbatch job_script" where the script contains
>> "mpirun ./executable_file".  There appear to be issues with at least
>> OMPI 1.6.4 and 1.8.X.
>> 
>> Thanks
>> Peter
>> 
>> On 2/5/15, 5:39 PM, "Ralph Castain" <[email protected]> wrote:
>> 
>>> 
>>> And are you launching via mpirun or directly with srun <myapp>? What
>>> OMPI version are you using?
>>> 
>>> 
>>>> On Feb 5, 2015, at 3:32 PM, Chris Samuel <[email protected]> wrote:
>>>> 
>>>> 
>>>> On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
>>>> 
>>>>> I ask because some of our users have started reporting a 10x increase
>>>>> in run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3.
>>>>> It's possible there is some other problem going on in our cluster, but
>>>>> all of our hardware checks including Infiniband diagnostics look pretty
>>>>> clean.
>>>> 
>>>> Are you using PMI2?
>>>> 
>>>> cheers,
>>>> Chris
>>>> -- 
>>>> Christopher Samuel        Senior Systems Administrator
>>>> VLSCI - Victorian Life Sciences Computation Initiative
>>>> Email: [email protected] Phone: +61 (0)3 903 55545
>>>> http://www.vlsci.org.au/      http://twitter.com/vlsci
