Hello John,

We tried a number of flag combinations; some work and some don't.
1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
2. salloc -n 9 srun ./mympiprog
(test cluster with 8 cores per node)

Case 1: works flawlessly (for every combination we tried)
Case 2: works sometimes; in other runs we get the warnings, and in some
cases (for example -n 10) a segmentation fault in
opal_memory_ptmalloc2_int_malloc. Note that with 8 cores per node, -n 9
and -n 10 both spill onto a second node, so case 2 also exercises
multi-node startup.

mpirun instead of srun works all the time.

We are going to look into openmpi 1.8.6 now. We would like -n X to
work, since that is what most of our users use anyway.
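
One thing we still plan to test (a sketch, assuming our openmpi build
was configured with slurm's PMI support) is selecting the PMI2 plugin
explicitly on the srun side:

salloc -n 9 srun --mpi=pmi2 ./mympiprog

srun --mpi=list should show which MPI plugins our slurm build provides.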

Best,
Paul

On 06/05/2015 08:19 AM, John Desantis wrote:
> 
> Paul,
> 
> How are you invoking srun with the application in question?
> 
> It seems strange that the messages only manifest when the job runs
> on more than one node.  Have you tried passing the flags "-N" and
> "--ntasks-per-node" for testing?  What about using "-w hostfile"?
> Those would be the options I'd try first to begin troubleshooting
> the issue.
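> 
> For example (node names and the program name here are placeholders):
> 
> srun -N 2 --ntasks-per-node=4 ./mympiprog
> srun -w node01,node02 -n 8 ./mympiprog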
> 
> John DeSantis
> 
> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <[email protected]>:
>>
>> All,
>>
>> We are preparing for a switch from our current job scheduler to slurm
>> and I am running into a strange issue. I compiled openmpi with slurm
>> support and when I start a job with sbatch and use mpirun everything
>> works fine. However, when I use srun instead of mpirun and the job does
>> not fit on a single node, I either receive the following openmpi warning
>> a number of times:
>> --------------------------------------------------------------------------
>> WARNING: Missing locality information required for sm initialization.
>> Continuing without shared memory support.
>> --------------------------------------------------------------------------
>> or a segmentation fault in an openmpi library (address not mapped) or
>> both.
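>>
>> (For reference, the working path is an sbatch script that launches
>> with mpirun; schematically, with placeholder sizes and program name:
>>
>> #!/bin/bash
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=8
>> mpirun ./mympiprog
>>
>> mpirun picks up the slurm allocation itself, so no host list is
>> needed.)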
>>
>> I only observe this with MPI programs compiled with openmpi and run by
>> srun when the job does not fit on a single node. The same program
>> started by openmpi's mpirun runs fine. The same source compiled with
>> mvapich2 works fine with srun.
>>
>> Some version info:
>> slurm 14.11.7
>> openmpi 1.8.5
>> hwloc 1.10.1 (used for both slurm and openmpi)
>> os: RHEL 7.1
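>>
>> (For context on "compiled with slurm support": a slurm-aware openmpi
>> build is typically configured with flags along these lines, where the
>> paths are placeholders:
>>
>> ./configure --with-slurm --with-pmi=/usr --with-hwloc=/usr/local
>>
>> The --with-pmi part is what srun-launched ranks rely on to wire up.)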
>>
>> Has anyone seen that warning before and what would be a good place to
>> start troubleshooting?
>>
>>
>> Thank you,
>> Paul
