Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-10 Thread Ralph Castain
Artem is investigating with Timur
On Jun 10, 2014, at 12:34 PM, Mike Dubman wrote:
> btw, the output comes from ompi's libevent and not from slurm itself (sorry
> about confusion and thanks to Yossi for catching this)
>
>
> opal/mca/event/libevent2021/libevent/epoll.c:

Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-10 Thread Mike Dubman
btw, the output comes from ompi's libevent and not from slurm itself (sorry about the confusion, and thanks to Yossi for catching this)
opal/mca/event/libevent2021/libevent/epoll.c:
event_warn("Epoll %s(%d) on fd %d failed. Old events were %d; read change was %d (%s); write change was %d (%s)",
opal/
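For illustration only: the thread does not say why the ADD on fd 0 failed. One well-known way an epoll ADD can fail is when the descriptor refers to a regular file (for example, stdin redirected from a file), in which case Linux epoll_ctl returns EPERM. The sketch below, assuming a Linux host with Python available (Python is not part of the original thread), shows that behaviour; the helper name try_epoll_add is hypothetical.

```python
import errno
import os
import select
import tempfile


def try_epoll_add(fd):
    """Try to add fd to a fresh epoll set.

    Returns None on success, or the symbolic errno name (e.g. "EPERM")
    if the kernel rejects the registration.
    """
    ep = select.epoll()
    try:
        ep.register(fd, select.EPOLLIN)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        ep.close()


if __name__ == "__main__":
    # A pipe supports polling, so the ADD is expected to succeed.
    r, w = os.pipe()
    print("pipe:", try_epoll_add(r))
    os.close(r)
    os.close(w)

    # A regular file does not support epoll, so the ADD is rejected.
    with tempfile.TemporaryFile() as f:
        print("regular file:", try_epoll_add(f.fileno()))
```

Calling try_epoll_add(0) in the environment where the warning appears would show whether fd 0 there is rejected for the same reason, though that is an assumption to verify, not a diagnosis.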

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
[binf316:fischega] $ ulimit -m
unlimited
Greg
-----Original Message-----
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:58 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque
Out of curiosity what is the mlock limit on you

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
Out of curiosity what is the mlock limit on your system? If it is too low that can cause ibv_create_cq to fail. To check run ulimit -m.
-Nathan Hjelm
Application Readiness, HPC-5, LANL
On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> Yes, this fails on all nodes on the system,
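A minimal sketch of the locked-memory check, assuming a Linux node with Python available (Python is not part of the original thread). In bash, the locked-memory limit is the one reported by `ulimit -l`; `ulimit -m` reports the maximum resident set size instead. The helper names describe_limit and memlock_limits are hypothetical.

```python
import resource


def describe_limit(value):
    """Render an rlimit value in KiB, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"max locked memory: soft={soft}, hard={hard}")
```

Limits seen in an interactive login shell can differ from those inherited by processes started through a batch system, so it may be worth running this from inside the job rather than only on the head node.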

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Yes, this fails on all nodes on the system, except for the head node. The uptime of the system isn't significant. Maybe 1 week, and it's received basically no use.
-----Original Message-----
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:49 PM
To: Fischer, Greg A.
Cc:

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
Well, that's interesting. The output shows that ibv_create_cq is failing. Strange, since an identical call had just succeeded (udcm creates two completion queues). Some questions that might indicate where the failure is:
Does this fail on any other node in your system?
How long has the node

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Jeff/Nathan,
I ran the following with my debug build of OpenMPI 1.8.1 - after opening a terminal on a compute node with "qsub -l nodes 2 -I":
mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt
Output and backtrace are attached. Let me know if I can provide

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-10 Thread Fischer, Greg A.
Yes, it should be possible for me to get an upgraded Intel compiler on that system. However, as you suggest, I'm more focused on getting it working with GCC right now.
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
Sent: Monday, J

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Jeff Squyres (jsquyres)
Greg:
Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly?
On Jun 10, 2014, at 10:25 AM, Nathan Hjelm wrote:
> On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
>> I s

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> I seem to recall that you have an IB-based cluster, right?
>
> From a *very quick* glance at the code, it looks like this might be a simple
> incorrect-finalization issue. That is:
>
> - you run the job on a single serve