Sounds odd - I suspect there is some issue with the IB network, then, as we
regularly test against IB and have seen no problems. I'd suggest switching
this thread to the OMPI user mailing list and providing the usual requested
info for these problems: your configure command line, config.log, and the
output from "ompi_info".

We'll get it figured out :-)
Ralph


On Mon, Jun 22, 2015 at 8:58 AM, Wiegand, Paul <[email protected]> wrote:

> I upgraded to OpenMPI 1.8.6 last week, and this did change how the problem
> presents itself but did not solve it.  Now MPI reports that it cannot use
> the openib BTL, so jobs run, but without using the IB.  I also tried
> building with the --without-scif switch, as suggested earlier last week,
> but that didn't help.  I've not had time to dig in since then.
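>
> (For what it's worth, the quick checks I've been using here are nothing
> exotic - roughly:
>
>   ompi_info | grep btl
>   mpirun --mca btl openib,self,sm -n 2 ./mympiprog
>
> i.e. confirm an openib component was actually built, and then restrict the
> run to openib/sm/self so a silent fallback to tcp can't mask the problem.)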
>
> Still no luck on our end.
>
> Paul.
>
> > On Jun 22, 2015, at 11:45, Ralph Castain <[email protected]> wrote:
> >
> > Good to hear! Thanks
> > Ralph
> >
> >
> > On Mon, Jun 22, 2015 at 7:50 AM, Paul van der Mark <[email protected]>
> > wrote:
> >
> > Hello Ralph,
> >
> > A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
> > done the trick. Since we only recently started to prepare for a switch
> > to slurm, I can't confirm whether the problem already existed in 1.8.4.
> > Our slurm version is 14.11.7.
> >
> > Best,
> > Paul
> >
> >
> > On 06/18/2015 10:45 AM, Ralph Castain wrote:
> > >
> > > Please let us know - FWIW, we aren’t seeing any such reports on the
> > > OMPI mailing lists, and we run our test harness against Slurm (and
> > > other RMs) every night.
> > >
> > > Also, please tell us what version of Slurm you are using. We do
> > > sometimes see regressions against newer versions as they appear, and
> > > that may be the case here.
> > >
> > >
> > >> On Jun 18, 2015, at 7:32 AM, Paul van der Mark <[email protected]>
> > >> wrote:
> > >>
> > >>
> > >> Hello John,
> > >>
> > >> We tried a number of combinations of flags; some work and some don't.
> > >> 1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
> > >> 2. salloc -n 9 srun ./mympiprog
> > >> (test cluster with 8 cores per node)
> > >>
> > >> Case 1: works flawlessly (for every combination)
> > >> Case 2: works sometimes, with warnings in some cases and segmentation
> > >> faults in some cases (for example -n 10) in
> > >> opal_memory_ptmalloc2_int_malloc.
> > >>
> > >> mpirun instead of srun works all the time.
> > >>
> > >> We are going to look into openmpi 1.8.6 now. We would like to have
> > >> -n X work, since that is what most of our users use anyway.
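> > >>
> > >> (One thing we may still try, given the opal_memory_ptmalloc2 frames -
> > >> untested on our side, so treat it as a guess - is disabling Open MPI's
> > >> Linux memory hooks, which as far as we can tell only takes effect when
> > >> set as an environment variable:
> > >>
> > >>   export OMPI_MCA_memory_linux_disable=1
> > >>   srun -n 10 ./mympiprog )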
> > >>
> > >> Best,
> > >> Paul
> > >>
> > >>
> > >>
> > >>
> > >> On 06/05/2015 08:19 AM, John Desantis wrote:
> > >>>
> > >>> Paul,
> > >>>
> > >>> How are you invoking srun with the application in question?
> > >>>
> > >>> It seems strange that the messages only manifest when the job runs
> > >>> on more than one node.  Have you tried passing the flags "-N" and
> > >>> "--ntasks-per-node" for testing?  What about using "-w hostfile"?
> > >>> Those would be the options I'd immediately try to begin
> > >>> troubleshooting the issue.
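> > >>>
> > >>> For instance (the node names below are placeholders for whatever
> > >>> your sinfo output shows):
> > >>>
> > >>>   srun -N 2 --ntasks-per-node=4 ./mympiprog
> > >>>   srun -w node01,node02 -n 9 ./mympiprog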
> > >>>
> > >>> John DeSantis
> > >>>
> > >>> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <[email protected]>:
> > >>>>
> > >>>> All,
> > >>>>
> > >>>> We are preparing for a switch from our current job scheduler to
> > >>>> slurm and I am running into a strange issue. I compiled openmpi with
> > >>>> slurm support and when I start a job with sbatch and use mpirun
> > >>>> everything works fine. However, when I use srun instead of mpirun
> > >>>> and the job does not fit on a single node, I either receive the
> > >>>> following openmpi warning a number of times:
> > >>>>
> > >>>> --------------------------------------------------------------------------
> > >>>> WARNING: Missing locality information required for sm initialization.
> > >>>> Continuing without shared memory support.
> > >>>> --------------------------------------------------------------------------
> > >>>>
> > >>>> or a segmentation fault in an openmpi library (address not mapped),
> > >>>> or both.
> > >>>>
> > >>>> I only observe this with mpi-programs compiled with openmpi and run
> > >>>> by srun when the job does not fit on a single node. The same program
> > >>>> started by openmpi's mpirun runs fine. The same source compiled with
> > >>>> mvapich2 works fine with srun.
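> > >>>>
> > >>>> (In case it helps narrow things down: since srun has no --mca
> > >>>> option, we would pass any MCA settings through the environment,
> > >>>> e.g.
> > >>>>
> > >>>>   export OMPI_MCA_btl=^sm
> > >>>>   srun -n 16 ./mympiprog
> > >>>>
> > >>>> which should take the sm BTL out of the picture entirely, at the
> > >>>> cost of shared-memory communication - we have not actually tried
> > >>>> that yet.)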
> > >>>>
> > >>>> Some version info:
> > >>>> slurm 14.11.7
> > >>>> openmpi 1.8.5
> > >>>> hwloc 1.10.1 (used for both slurm and openmpi)
> > >>>> os: RHEL 7.1
> > >>>>
> > >>>> Has anyone seen that warning before, and what would be a good
> > >>>> place to start troubleshooting?
> > >>>>
> > >>>>
> > >>>> Thank you,
> > >>>> Paul
> >
>
>
