Hello John,

We tried a number of combinations of flags; some work and some don't:

1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
2. salloc -n 9 srun ./mympiprog

(test cluster with 8 cores per node)
Case 1: works flawlessly (for every combination).
Case 2: works sometimes; warnings in some cases, segmentation faults in
others (for example, -n 10) in opal_memory_ptmalloc2_int_malloc. Using
mpirun instead of srun works all the time.

We are going to look into Open MPI 1.8.6 now. We would like to have -n X
work, since that is what most of our users use anyway.

Best,
Paul

On 06/05/2015 08:19 AM, John Desantis wrote:
>
> Paul,
>
> How are you invoking srun with the application in question?
>
> It seems strange that the messages would manifest when the job runs
> on more than one node. Have you tried passing the flags "-N" and
> "--ntasks-per-node" for testing? What about using "-w hostfile"?
> Those would be the options that I'd immediately try to begin
> troubleshooting the issue.
>
> John DeSantis
>
> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <[email protected]>:
>>
>> All,
>>
>> We are preparing for a switch from our current job scheduler to Slurm
>> and I am running into a strange issue. I compiled Open MPI with Slurm
>> support, and when I start a job with sbatch and use mpirun, everything
>> works fine. However, when I use srun instead of mpirun and the job does
>> not fit on a single node, I either receive the following Open MPI warning
>> a number of times:
>> --------------------------------------------------------------------------
>> WARNING: Missing locality information required for sm initialization.
>> Continuing without shared memory support.
>> --------------------------------------------------------------------------
>> or a segmentation fault in an Open MPI library (address not mapped), or
>> both.
>>
>> I only observe this with MPI programs compiled with Open MPI and run by
>> srun when the job does not fit on a single node. The same program
>> started by Open MPI's mpirun runs fine. The same source compiled with
>> MVAPICH2 works fine with srun.
>>
>> Some version info:
>> slurm 14.11.7
>> openmpi 1.8.5
>> hwloc 1.10.1 (used for both slurm and openmpi)
>> os: RHEL 7.1
>>
>> Has anyone seen that warning before, and what would be a good place to
>> start troubleshooting?
>>
>>
>> Thank you,
>> Paul
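[Editor's note: for anyone following along, the working launch style above can be sketched as an sbatch job script. This is only a sketch of what the thread describes; the ./mympiprog binary is the one named in the thread, and the script name and omitted partition/time options are assumptions, not something from the original messages.]

```shell
#!/bin/bash
# job1.sh -- style 1 from the thread: request an explicit node layout.
# This was the combination reported to work for every case.
#SBATCH -N 3                    # 3 nodes
#SBATCH --ntasks-per-node=3     # 3 MPI ranks per node (9 total)

srun ./mympiprog
```

The failing style replaces the two #SBATCH lines with a single `#SBATCH -n 9`, letting Slurm decide how the 9 ranks are spread across nodes; per the thread, that form intermittently produced the sm-locality warning or a segfault under srun, while `mpirun` worked with either style.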
