Re: [OMPI users] openib segfaults with Torque

2014-06-13 Thread Fischer, Greg A.
ryone! _ From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa Sent: Wednesday, June 11, 2014 7:13 PM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque If that could help Greg, on the compute nodes I normally add this to /etc/security/limits

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Gus Correa
> -Original Message- > From: Fischer, Greg A. > Sent: Tuesday, June 10, 2014 2:59 PM > To: Nathan Hjelm > Cc: Open MPI Users; Fischer, Greg A. > Subject: RE: [OMPI users] openib segfaults with Torque > > [binf316:fischega] $ ulim

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Martin Siegert
quyres) > >> > wrote: > >> > > >> > Mellanox -- > >> > > >> > What would cause a CQ to fail to be created? > >> > > >> > On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." >

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
ed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres) >> > wrote: >> > >> > Mellanox -- >> > >> > What would cause a CQ to fail to be created? >> > >> > On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." >> > wr

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Ralph Castain
g A." > > wrote: > > > > > Is there any other work around that I might try? Something that > > avoids UDCM? > > > > > > -Original Message- > > > From: Fischer, Greg A. > > > Sent: Tuesday, June 1

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
OMPI users] openib segfaults with Torque > > > > > > [binf316:fischega] $ ulimit -m > > > unlimited > > > > > > Greg > > > > > > -Original Message- > > > From: Nathan Hjelm [mailto:h

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Nathan Hjelm
PM, "Fischer, Greg A." > wrote: > > > Is there any other work around that I might try? Something that > avoids UDCM? > > > > -Original Message- > > From: Fischer, Greg A. > > Sent: Tue

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
mething that avoids > UDCM? > > > > -Original Message- > > From: Fischer, Greg A. > > Sent: Tuesday, June 10, 2014 2:59 PM > > To: Nathan Hjelm > > Cc: Open MPI Users; Fischer, Greg A. > > Subject: RE: [OMPI users] openib segfaults with Torqu

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
une 10, 2014 2:59 PM > To: Nathan Hjelm > Cc: Open MPI Users; Fischer, Greg A. > Subject: RE: [OMPI users] openib segfaults with Torque > > [binf316:fischega] $ ulimit -m > unlimited > > Greg > > -Original Message- > From: Nathan Hjelm [mailto:hje...@lanl

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Fischer, Greg A.
Is there any other workaround that I might try? Something that avoids UDCM? -----Original Message----- From: Fischer, Greg A. Sent: Tuesday, June 10, 2014 2:59 PM To: Nathan Hjelm Cc: Open MPI Users; Fischer, Greg A. Subject: RE: [OMPI users] openib segfaults with Torque [binf316:fischega

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
[binf316:fischega] $ ulimit -m unlimited Greg -----Original Message----- From: Nathan Hjelm [mailto:hje...@lanl.gov] Sent: Tuesday, June 10, 2014 2:58 PM To: Fischer, Greg A. Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Out of curiosity what is the mlock limit on
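
Note on the exchange above: the question concerns the locked-memory (mlock) limit, while the reply shows `ulimit -m`, which in bash reports the maximum resident set size; the locked-memory limit is reported by `ulimit -l`. As a minimal sketch, assuming a Linux node with Python 3 available, the same limit can be read from inside a process, which also allows comparing an interactive shell with a Torque job. The helper names `format_limit` and `memlock_limits` are hypothetical and not part of Open MPI or Torque.

```python
import resource


def format_limit(value):
    """Render an rlimit value the way `ulimit` does for the unlimited case."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process as strings.

    RLIMIT_MEMLOCK is expressed in bytes; bash's `ulimit -l` prints kilobytes.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"max locked memory (bytes): soft={soft} hard={hard}")
```

Running the script once in a login shell and once from inside a batch job would show whether the two environments see different limits; whether that difference explains the failure discussed in this thread is not established by the excerpts here.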

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
reg A. > Cc: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > > Well, that's interesting. The output shows that ibv_create_cq is failing. > Strange since an identical call had just succeeded (udcm creates two > completion queues). Some questions th

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Greg A. Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Well, that's interesting. The output shows that ibv_create_cq is failing. Strange since an identical call had just succeeded (udcm creates two completion queues). Some questions that might indicate where the failur

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
m: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres > (jsquyres) > Sent: Tuesday, June 10, 2014 10:31 AM > To: Nathan Hjelm > Cc: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > Greg: > > Can you run with "--mca btl

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
me know if I can provide anything else. Thanks for looking into this, Greg -----Original Message----- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres) Sent: Tuesday, June 10, 2014 10:31 AM To: Nathan Hjelm Cc: Open MPI Users Subject: Re: [OMPI users] openib segf

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Jeff Squyres (jsquyres)
Greg: Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly? On Jun 10, 2014, at 10:25 AM, Nathan Hjelm wrote: > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote: >> I s
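
Note on the request above: a minimal sketch of how the diagnostic run could be assembled, combining the verbosity option quoted in this message with the `-np 2 -mca btl openib,sm,self` invocation suggested elsewhere in this thread. The helper `build_mpirun_argv` is hypothetical; it only builds the argument list and does not launch anything.

```python
import shlex


def build_mpirun_argv(program, nprocs, btls, btl_verbose=None):
    """Build an mpirun argument list using only options quoted in this thread."""
    argv = ["mpirun", "-np", str(nprocs), "-mca", "btl", ",".join(btls)]
    if btl_verbose is not None:
        # Verbosity level requested for the debug build.
        argv += ["--mca", "btl_base_verbose", str(btl_verbose)]
    argv.append(program)
    return argv


if __name__ == "__main__":
    argv = build_mpirun_argv("ring_c", 2, ["openib", "sm", "self"], btl_verbose=100)
    # Print a copy-pasteable command line instead of executing it.
    print(shlex.join(argv))
```

Keeping the command as a list makes it straightforward to hand to `subprocess.run` later and to vary the BTL selection between runs.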

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote: > I seem to recall that you have an IB-based cluster, right? > > From a *very quick* glance at the code, it looks like this might be a simple > incorrect-finalization issue. That is: > > - you run the job on a single serve

Re: [OMPI users] openib segfaults with Torque

2014-06-09 Thread Jeff Squyres (jsquyres)
I seem to recall that you have an IB-based cluster, right? From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is: - you run the job on a single server - openib disqualifies itself because you're running on a single server - openib t

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
Process 0 exiting > Process 1 exiting > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Friday, June 06, 2014 10:34 AM > To: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > Huh - how strange. I can't imag

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
Ralph Castain Sent: Friday, June 06, 2014 10:34 AM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Huh - how strange. I can't imagine what it has to do with Torque vs rsh - this is failing when the openib BTL is trying to create the connection, which comes way afte

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
nf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36] > [binf316:21583] [17] ring_c[0x400889] > [binf316:21583] *** End of error message *** > -- > mpirun noticed that process rank 0 with PID 21583 on node 316 exited on > signal 6 (Abo

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Thursday, June 05, 2014 7:57 PM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is pro

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Ralph Castain
Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is protected against that scenario). If you run with -np 2 -mca btl openib,sm,self, is it happy? On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. wrote: > Here’s the command I’m invoking and the terminal

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in the backtrace.) [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm

[OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
OpenMPI Users, After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to