Re: [OMPI users] Seg fault in opal_progress

2018-07-16 Thread Noam Bernstein
> On Jul 16, 2018, at 8:34 AM, Noam Bernstein wrote:
>
>> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users wrote:
>>
>> Please give master a try. This looks like another signature of running out of space for share

Re: [OMPI users] Seg fault in opal_progress

2018-07-16 Thread Noam Bernstein
> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users wrote:
>
> Please give master a try. This looks like another signature of running out of space for shared memory buffers.
Sorry, I wasn’t explicit on this point - I’m already using master, specifically openmpi-master-201807120327-34bc77

Re: [OMPI users] Seg fault in opal_progress

2018-07-13 Thread Nathan Hjelm via users
Please give master a try. This looks like another signature of running out of space for shared memory buffers.
-Nathan
> On Jul 13, 2018, at 6:41 PM, Noam Bernstein wrote:
>
> Just to summarize for the list. With Jeff’s prodding I got it generating core files with the debug (and mem-deb

Re: [OMPI users] Seg fault in opal_progress

2018-07-13 Thread Noam Bernstein
Just to summarize for the list. With Jeff’s prodding I got it generating core files with the debug (and mem-debug) version of openmpi, and below is the kind of stack trace I’m getting from gdb. It looks slightly different when I use a slightly different implementation that doesn’t use MPI_INPL

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Jeff Squyres (jsquyres) via users
Noam and I actually talked on the phone (whtt!?) and worked through this a bit more. Oddly, he can generate core files if he runs in /tmp, but not if he runs in an NFS-mounted directory (!). I haven't seen that before -- if someone knows why that would happen, I'd love to hear the explanat

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 11:58 AM, Jeff Squyres (jsquyres) wrote:
>
> (You may have already done this; I just want to make sure we're on the same sheet of music here…)
I’m not talking about the job script or shell startup files. The actual “executable” passed to mpirun on the command

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Jeff Squyres (jsquyres) via users
On Jul 12, 2018, at 11:45 AM, Noam Bernstein wrote:
>
>> E.g., if you "ulimit -c" in your interactive shell and see "unlimited", but if you "ulimit -c" in a launched job and see "0", then the job scheduler is doing that to your environment somewhere.
>
> I am using a scheduler (torque),
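The comparison suggested above can also be made from inside the launched process itself. What follows is a minimal Python sketch, not something posted in the thread; the helper names describe_limit and core_limits are hypothetical, and it relies only on the standard resource module.

```python
import resource


def describe_limit(value):
    """Render an rlimit value the way `ulimit` prints it."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def core_limits():
    """Return the (soft, hard) core-file size limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = core_limits()
    print(f"core file limit: soft={soft} hard={hard}")
```

Running it once from the interactive shell and once as the program started by the launcher should show whether the limit differs between the two environments.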

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 11:02 AM, Jeff Squyres (jsquyres) wrote:
>
> On Jul 12, 2018, at 10:59 AM, Noam Bernstein wrote:
>>
>>> Do you get core files?
>>>
>>> Loading up the core file in a debugger might give us more information.
>>
>> No, I don’t, despite setting "ulimit -c unlimited"

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Jeff Squyres (jsquyres) via users
On Jul 12, 2018, at 10:59 AM, Noam Bernstein wrote:
>
>> Do you get core files?
>>
>> Loading up the core file in a debugger might give us more information.
>
> No, I don’t, despite setting "ulimit -c unlimited". I’m not sure what’s going on with that (or the lack of line info in the sta

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 10:51 AM, Jeff Squyres (jsquyres) via users wrote:
>
> Do you get core files?
>
> Loading up the core file in a debugger might give us more information.
No, I don’t, despite setting "ulimit -c unlimited". I’m not sure what’s going on with that (or the lack of line i

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Jeff Squyres (jsquyres) via users
Do you get core files?
Loading up the core file in a debugger might give us more information.
> On Jul 12, 2018, at 9:35 AM, Noam Bernstein wrote:
>
>> On Jul 12, 2018, at 8:37 AM, Noam Bernstein wrote:
>>
>> I’m going to try the 3.1.x 20180710 nightly snapshot next.
>
> Same be

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
> On Jul 12, 2018, at 8:37 AM, Noam Bernstein wrote:
>
> I’m going to try the 3.1.x 20180710 nightly snapshot next.
Same behavior, exactly - segfault, no debugging info beyond the vasp routine that calls mpi_allreduce.

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Noam Bernstein
I’ve recompiled 3.1.1 with --enable-debug --enable-mem-debug, and I still get no detailed information from the mpi libraries, only VASP (as before): ldd (at runtime, so I’m fairly sure it’s referring to the right executable and LD_LIBRARY_PATH) info: vexec /usr/local/vasp/bin/5.4.4/0test/vasp.gamm

Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Åke Sandgren
Are you running with ulimit -s unlimited? If not, that looks like an out-of-stack crash, which VASP frequently causes. If you are running with unlimited stack, I could perhaps run that input case on our VASP build (which has a bunch of fixes for bad stack usage, among other things). On 07/11/2018 1
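One way to make sure the stack limit is raised for the process that actually runs is a small wrapper that lifts the soft limit before starting the real program. The Python sketch below is an illustration only and was not posted in the thread; the function name raise_soft_limit is hypothetical. It raises the soft stack limit to the hard limit, so the result is "unlimited" only when the hard limit is itself unlimited.

```python
import os
import resource
import sys


def raise_soft_limit(which):
    """Raise the soft limit for `which` to the hard limit, if the system allows it.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(which)
    if soft != hard:
        try:
            resource.setrlimit(which, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the existing limit if the request is refused
    return resource.getrlimit(which)


if __name__ == "__main__":
    raise_soft_limit(resource.RLIMIT_STACK)
    # Replace this process with the program named on the command line, if any;
    # the new program inherits the raised limit.
    if len(sys.argv) > 1:
        os.execvp(sys.argv[1], sys.argv[1:])
```

Used as the program handed to the launcher, with the real executable and its arguments following it, the wrapper applies the limit on every node rather than only in the interactive shell.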

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Ben Menadue
Here’s what happens using a debug build:
[raijin7:5] ompi_comm_peer_lookup: invalid peer index (2)
[raijin7:5:0:5] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mca/pml/

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Ben Menadue
Hi,
Perhaps related - we’re seeing this one with 3.1.1. I’ll see if I can get the application run against our --enable-debug build.
Cheers,
Ben
[raijin7:1943:0:1943] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x45)
/short/z00/bjm900/build/openmpi-mofed4.2/o

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Nathan Hjelm via users
It might also be worth testing a master snapshot to see if that fixes the issue. There are a couple of fixes being backported from master to v3.0.x and v3.1.x now.
-Nathan
On Jul 11, 2018, at 03:16 PM, Noam Bernstein wrote:
On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users wro

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Jeff Squyres (jsquyres) via users
$ ompi_info | grep -i debug
Configure command line: '--prefix=/home/jsquyres/bogus' '--with-usnic' '--with-libfabric=/home/jsquyres/libfabric-current/install' '--enable-mpirun-prefix-by-default' '--enable-debug' '--enable-mem-debug' '--enable-mem-profile' '--disable-mpi-fortran' '--enable-debu
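The same check can be scripted, for example to run it on each node of a job. This is a rough Python sketch of the case-insensitive filter shown above, not something from the thread; the function names debug_lines and ompi_info_debug_lines are hypothetical, and it assumes only that ompi_info is on PATH and prints its report to standard output.

```python
import shutil
import subprocess


def debug_lines(text):
    """Return the lines of `text` that mention "debug", ignoring case."""
    return [line for line in text.splitlines() if "debug" in line.lower()]


def ompi_info_debug_lines():
    """Run ompi_info if it is on PATH and return its debug-related lines.

    Returns None when ompi_info cannot be found.
    """
    exe = shutil.which("ompi_info")
    if exe is None:
        return None
    result = subprocess.run([exe], capture_output=True, text=True, check=False)
    return debug_lines(result.stdout)


if __name__ == "__main__":
    lines = ompi_info_debug_lines()
    if lines is None:
        print("ompi_info not found on PATH")
    else:
        print("\n".join(lines))
```

An empty result would suggest that the ompi_info found first on PATH does not belong to a debug build.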

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users wrote:
>>
>> After more extensive testing it’s clear that it still happens with 2.1.3, but much less frequently. I’m going to try to get more detailed info with version 3.1.1, where it’s easier to reproduce.
objdu

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users wrote:
>
> Ok, that would be great -- thanks.
>
> Recompiling Open MPI with --enable-debug will turn on several debugging/sanity checks inside Open MPI, and it will also enable debugging symbols. Hence, if you can get a

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Jeff Squyres (jsquyres) via users
Ok, that would be great -- thanks. Recompiling Open MPI with --enable-debug will turn on several debugging/sanity checks inside Open MPI, and it will also enable debugging symbols. Hence, if you can get a failure with a debug Open MPI build, it might give you a core file that can be used to ge

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 11, 2018, at 9:58 AM, Noam Bernstein wrote:
>
>> On Jul 10, 2018, at 5:15 PM, Noam Bernstein wrote:
>>
>> What are useful steps I can do to debug? Recompile with --enable-debug? Are there any other versions that are worth trying?

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Noam Bernstein
> On Jul 10, 2018, at 5:15 PM, Noam Bernstein wrote:
>
> What are useful steps I can do to debug? Recompile with --enable-debug? Are there any other versions that are worth trying? I don’t recall this error happening before we switched to 3.1.0.

[OMPI users] Seg fault in opal_progress

2018-07-10 Thread Noam Bernstein
Hi OpenMPI users - I’m trying to debug a non-deterministic crash, apparently in opal_progress, with OpenMPI 3.1.0. All of the crashes seem to involve mpi_allreduce, although in different calls from this code (VASP), and they seem more frequent for larger core/mpi task counts (128 happens