I upgraded to OpenMPI 1.8.6 last week. This changed how the problem 
presents but did not solve it: Open MPI now reports that it cannot use 
the openib BTL, so jobs run, but without using the IB.  I also tried 
building with the --without-scif switch, as suggested earlier last week, 
with no improvement.  I've not had time to dig in since then.
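For reference, this is roughly how we've been checking the BTL situation. `ompi_info` ships with Open MPI and lists the compiled-in components; the `--mca` options below force the openib BTL and turn up component-selection verbosity so the reason for the "cannot use openib" failure gets logged (`./mympiprog` stands in for any MPI test binary):

```shell
# List the BTL components this Open MPI build provides;
# "openib" should appear if InfiniBand support was compiled in.
ompi_info | grep btl

# Force the openib BTL (plus self/sm for local traffic) and raise
# verbosity so mpirun logs why each BTL is accepted or rejected.
mpirun --mca btl openib,self,sm \
       --mca btl_base_verbose 100 \
       -np 2 ./mympiprog
```

With the forced BTL list, the job should abort with an explicit error instead of silently falling back to TCP, which makes the verbose output easier to interpret.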

Still no luck on our end.

Paul.

> On Jun 22, 2015, at 11:45, Ralph Castain <[email protected]> wrote:
> 
> Good to hear! Thanks
> Ralph
> 
> 
> On Mon, Jun 22, 2015 at 7:50 AM, Paul van der Mark <[email protected]> 
> wrote:
> 
> Hello Ralph,
> 
> A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
> done the trick. Since we only recently started preparing for a switch
> to slurm, I can't confirm whether this already existed in 1.8.4. Our slurm
> version is 14.11.7.
> 
> Best,
> Paul
> 
> 
> On 06/18/2015 10:45 AM, Ralph Castain wrote:
> >
> > Please let us know - FWIW, we aren’t seeing any such reports on the OMPI 
> > mailing lists, and we run our test harness against Slurm (and other RMs) 
> > every night.
> >
> > Also, please tell us what version of Slurm you are using. We do sometimes 
> > see regressions against newer versions as they appear, and that may be the 
> > case here.
> >
> >
> >> On Jun 18, 2015, at 7:32 AM, Paul van der Mark <[email protected]> wrote:
> >>
> >>
> >> Hello John,
> >>
> >> We tried a number of combinations of flags; some work and some don't.
> >> 1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
> >> 2. salloc -n 9 srun ./mympiprog
> >> (test cluster with 8 cores per node)
> >>
> >> Case 1: works flawlessly (for every combination)
> >> Case 2: works sometimes; warnings in some cases, segmentation faults in
> >> others (for example -n 10) in opal_memory_ptmalloc2_int_malloc.
> >>
> >> mpirun instead of srun works all the time.
> >>
> >> We are going to look into openmpi 1.8.6 now. We would like to have -n X
> >> work, since that is what most of our users use anyway.
> >>
> >> Best,
> >> Paul
> >>
> >>
> >>
> >>
> >> On 06/05/2015 08:19 AM, John Desantis wrote:
> >>>
> >>> Paul,
> >>>
> >>> How are you invoking srun with the application in question?
> >>>
> >>> It seems strange that the messages would manifest only when the job runs
> >>> on more than one node.  Have you tried passing the flags "-N" and
> >>> "--ntasks-per-node" for testing?  What about using "-w hostfile"?
> >>> Those would be the options that I'd immediately try to begin
> >>> trouble-shooting the issue.
> >>>
> >>> John DeSantis
> >>>
> >>> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <[email protected]>:
> >>>>
> >>>> All,
> >>>>
> >>>> We are preparing for a switch from our current job scheduler to slurm
> >>>> and I am running into a strange issue. I compiled openmpi with slurm
> >>>> support and when I start a job with sbatch and use mpirun everything
> >>>> works fine. However, when I use srun instead of mpirun and the job does
> >>>> not fit on a single node, I either receive the following openmpi warning
> >>>> a number of times:
> >>>> --------------------------------------------------------------------------
> >>>> WARNING: Missing locality information required for sm initialization.
> >>>> Continuing without shared memory support.
> >>>> --------------------------------------------------------------------------
> >>>> or a segmentation fault in an openmpi library (address not mapped) or
> >>>> both.
> >>>>
> >>>> I only observe this with MPI programs compiled with openmpi and run by
> >>>> srun when the job does not fit on a single node. The same program
> >>>> started by openmpi's mpirun runs fine. The same source compiled with
> >>>> mvapich2 works fine with srun.
> >>>>
> >>>> Some version info:
> >>>> slurm 14.11.7
> >>>> openmpi 1.8.5
> >>>> hwloc 1.10.1 (used for both slurm and openmpi)
> >>>> os: RHEL 7.1
> >>>>
> >>>> Has anyone seen that warning before and what would be a good place to
> >>>> start troubleshooting?
> >>>>
> >>>>
> >>>> Thank you,
> >>>> Paul
> 
