Re: [OMPI users] openib segfaults with Torque

2014-06-13 Thread Fischer, Greg A.
This sounds credible. When I login via Torque, I see the following:

[binf316:fischega] $ ulimit -l
64

but when I login via ssh, I see:

[binf316:fischega] $ ulimit -l
unlimited

I'll have my administrator make the changes and give that a shot.  Thanks, 
everyone!
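
Once the change is in place, I expect the check is just the same ulimit from
inside a Torque session -- a rough sketch, assuming interactive jobs are
allowed here:

    qsub -I -l nodes=1
    ulimit -l     # should now report "unlimited" instead of 64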

________________________________
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 11, 2014 7:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque


In case this helps, Greg:
on the compute nodes I normally add this to /etc/security/limits.conf:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  32768

and

ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited

to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which
should be sourced by the former).
Other values are possible, of course.

My recollection is that the boilerplate init scripts that
come with Torque don't change those limits.

I suppose this causes the pbs_mom child processes,
including the user job script and whatever processes it starts
(mpiexec, etc.), to inherit those limits.
Or not?

Gus Correa


On 06/11/2014 06:20 PM, Jeff Squyres (jsquyres) wrote:
> +1
>
> On Jun 11, 2014, at 6:01 PM, Ralph Castain 
> <r...@open-mpi.org<mailto:r...@open-mpi.org>>
>   wrote:
>
>> Yeah, I think we've seen that somewhere before too...
>>
>>
>> On Jun 11, 2014, at 2:59 PM, Joshua Ladd 
>> <jladd.m...@gmail.com<mailto:jladd.m...@gmail.com>> wrote:
>>
>>> Agreed. The problem is not with UDCM. I don't think something is wrong with 
>>> the system. I think his Torque is imposing major constraints on the maximum 
>>> size that can be locked into memory.
>>>
>>> Josh
>>>
>>>
>>> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm 
>>> <hje...@lanl.gov<mailto:hje...@lanl.gov>> wrote:
>>> Probably won't help to use RDMACM though as you will just see the
>>> resource failure somewhere else. UDCM is not the problem. Something is
>>> wrong with the system. Allocating a 512 entry CQ should not fail.
>>>
>>> -Nathan
>>>
>>> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>>>> I'm guessing it's a resource limitation issue coming from Torque.
>>>>
>>>> H...I found something interesting on the interwebs that looks 
>>>> awfully
>>>> similar:
>>>> 
>>>> http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
>>>>
>>>> Greg, if the suggestion from the Torque users doesn't resolve your 
>>>> issue (
>>>> "...adding the following line 'ulimit -l unlimited' to pbs_mom and
>>>> restarting pbs_mom." ) doesn't work, try using the RDMACM CPC (instead 
>>>> of
>>>> UDCM, which is a pretty recent addition to the openIB BTL.) by setting:
>>>>
>>>> -mca btl_openib_cpc_include rdmacm
>>>>
>>>> Josh
>>>>
>>>> On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
>>>> <jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
>>>>
>>>>   Mellanox --
>>>>
>>>>       What would cause a CQ to fail to be created?
>>>>
>>>>   On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
>>>>   <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:
>>>>
>>>>   > Is there any other work around that I might try?  Something that
>>>>   avoids UDCM?
>>>>   >
>>>>   > -Original Message-
>>>>   > From: Fischer, Greg A.
>>>>   > Sent: Tuesday, June 10, 2014 2:59 PM
>>>>   > To: Nathan Hjelm
>>>>   > Cc: Open MPI Users; Fischer, Greg A.
>>>>   > Subject: RE: [OMPI users] openib segfaults with Torque
>>>>   >
>>>>   > [binf316:fischega] $ ulimit -m
>>>>   > unlimited
>>>>   >
>>>>   > Greg
>>>>   >
>>>>   > -Original Message-
>>>>   > From: Nathan Hjelm [mailto:hje...@lanl.gov]
>>>>   > Sent: Tuesday, June 10, 2014 2:58 PM
>>>>   > To: Fischer, Greg A.
>>>>   > Cc: Open MPI Users
>>>>   > Subject: Re: [OMPI users] openib segfaults with Torque
>>>>   >
>>>>   > Out of curiosity what is the

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Gus Correa

In case this helps, Greg:
on the compute nodes I normally add this to /etc/security/limits.conf:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  32768

and

ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited

to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which
should be sourced by the former).
Other values are possible, of course.

My recollection is that the boilerplate init scripts that
come with Torque don't change those limits.

I suppose this causes the pbs_mom child processes,
including the user job script and whatever processes it starts
(mpiexec, etc.), to inherit those limits.
Or not?
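
A quick way to check, by the way (just a sketch -- the exact qsub options
depend on the site), is to submit a trivial job that prints its own limits:

    echo 'ulimit -l; ulimit -s; ulimit -n' | qsub
    # then look at the job's stdout file; the values should match
    # whatever was set up for pbs_mom

If they come back as the pbs_mom values rather than the ssh-login values,
the limits were indeed inherited.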

Gus Correa


On 06/11/2014 06:20 PM, Jeff Squyres (jsquyres) wrote:

+1

On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org>
  wrote:


Yeah, I think we've seen that somewhere before too...


On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:


Agreed. The problem is not with UDCM. I don't think something is wrong with the 
system. I think his Torque is imposing major constraints on the maximum size 
that can be locked into memory.

Josh


On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
Probably won't help to use RDMACM though as you will just see the
resource failure somewhere else. UDCM is not the problem. Something is
wrong with the system. Allocating a 512 entry CQ should not fail.

-Nathan

On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:

I'm guessing it's a resource limitation issue coming from Torque.

H...I found something interesting on the interwebs that looks awfully
similar:
http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html

Greg, if the suggestion from the Torque users doesn't resolve your issue (
"...adding the following line 'ulimit -l unlimited' to pbs_mom and
restarting pbs_mom." ) doesn't work, try using the RDMACM CPC (instead of
UDCM, which is a pretty recent addition to the openIB BTL.) by setting:

-mca btl_openib_cpc_include rdmacm

Josh

On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:

  Mellanox --

  What would cause a CQ to fail to be created?

  On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
  <fisch...@westinghouse.com> wrote:

  > Is there any other work around that I might try?  Something that
  avoids UDCM?
  >
  > -Original Message-
  > From: Fischer, Greg A.
  > Sent: Tuesday, June 10, 2014 2:59 PM
  > To: Nathan Hjelm
      > Cc: Open MPI Users; Fischer, Greg A.
  > Subject: RE: [OMPI users] openib segfaults with Torque
  >
  > [binf316:fischega] $ ulimit -m
  > unlimited
  >
  > Greg
  >
  > -Original Message-
  > From: Nathan Hjelm [mailto:hje...@lanl.gov]
  > Sent: Tuesday, June 10, 2014 2:58 PM
  > To: Fischer, Greg A.
  > Cc: Open MPI Users
  > Subject: Re: [OMPI users] openib segfaults with Torque
  >
  > Out of curiosity what is the mlock limit on your system? If it is too
  low that can cause ibv_create_cq to fail. To check run ulimit -m.
  >
  > -Nathan Hjelm
  > Application Readiness, HPC-5, LANL
  >
  > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
  >> Yes, this fails on all nodes on the system, except for the head node.
  >>
  >> The uptime of the system isn't significant. Maybe 1 week, and it's
  received basically no use.
  >>
  >> -Original Message-
  >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
  >> Sent: Tuesday, June 10, 2014 2:49 PM
  >> To: Fischer, Greg A.
  >> Cc: Open MPI Users
  >> Subject: Re: [OMPI users] openib segfaults with Torque
  >>
  >>
  >> Well, thats interesting. The output shows that ibv_create_cq is
  failing. Strange since an identical call had just succeeded (udcm
  creates two completion queues). Some questions that might indicate where
  the failure might be:
  >>
  >> Does this fail on any other node in your system?
  >>
  >> How long has the node been up?
  >>
  >> -Nathan Hjelm
  >> Application Readiness, HPC-5, LANL
  >>
  >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
  >>> Jeff/Nathan,
  >>>
  >>> I ran the following with my debug build of OpenMPI 1.8.1 - after
  opening a terminal on a compute node with "qsub -l nodes 2 -I":
  >>>
  >>>  mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
  >>> ring_c &> output.txt
  >>>

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Martin Siegert
It isn't really Torque that is imposing those constraints:
- the torque_mom initscript inherits from the OS whatever ulimits are
  in effect at that time;
- each job inherits the ulimits from the pbs_mom.

Thus, you need to override whatever ulimits are set at
startup time, e.g., in /etc/sysconfig/torque_mom:

ulimit -d unlimited
ulimit -s unlimited
ulimit -n 32768
ulimit -l 2097152

or whatever you consider to be reasonable.
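
Roughly, on each compute node (only a sketch -- the sysconfig and init
script names vary between Torque packagings, pbs_mom on some, torque_mom on
others):

    # append the ulimit lines above to /etc/sysconfig/torque_mom, then:
    /etc/init.d/pbs_mom restart
    # and confirm from inside a job:
    echo 'ulimit -l' | qsub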

Cheers,
Martin

-- 
Martin Siegert
WestGrid/ComputeCanada
Simon Fraser University
Burnaby, British Columbia

On Wed, Jun 11, 2014 at 10:20:08PM +, Jeff Squyres (jsquyres) wrote:
> +1
> 
> On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org>
>  wrote:
> 
> > Yeah, I think we've seen that somewhere before too...
> > 
> > 
> > On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> > 
> >> Agreed. The problem is not with UDCM. I don't think something is wrong 
> >> with the system. I think his Torque is imposing major constraints on the 
> >> maximum size that can be locked into memory.
> >> 
> >> Josh
> >> 
> >> 
> >> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >> Probably won't help to use RDMACM though as you will just see the
> >> resource failure somewhere else. UDCM is not the problem. Something is
> >> wrong with the system. Allocating a 512 entry CQ should not fail.
> >> 
> >> -Nathan
> >> 
> >> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
> >> >I'm guessing it's a resource limitation issue coming from Torque.
> >> >
> >> >H...I found something interesting on the interwebs that looks 
> >> > awfully
> >> >similar:
> >> >
> >> > http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
> >> >
> >> >Greg, if the suggestion from the Torque users doesn't resolve your 
> >> > issue (
> >> >"...adding the following line 'ulimit -l unlimited' to pbs_mom and
> >> >restarting pbs_mom." ) doesn't work, try using the RDMACM CPC 
> >> > (instead of
> >> >UDCM, which is a pretty recent addition to the openIB BTL.) by 
> >> > setting:
> >> >
> >> >-mca btl_openib_cpc_include rdmacm
> >> >
> >> >Josh
> >> >
> >> >On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
> >> ><jsquy...@cisco.com> wrote:
> >> >
> >> >  Mellanox --
> >> >
> >> >      What would cause a CQ to fail to be created?
> >> >
> >> >  On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
> >> >  <fisch...@westinghouse.com> wrote:
> >> >
> >> >  > Is there any other work around that I might try?  Something that
> >> >  avoids UDCM?
> >> >  >
> >> >  > -Original Message-----
> >> >  > From: Fischer, Greg A.
> >> >  > Sent: Tuesday, June 10, 2014 2:59 PM
> >> >  > To: Nathan Hjelm
> >> >  > Cc: Open MPI Users; Fischer, Greg A.
> >> >  > Subject: RE: [OMPI users] openib segfaults with Torque
> >> >  >
> >> >  > [binf316:fischega] $ ulimit -m
> >> >  > unlimited
> >> >  >
> >> >  > Greg
> >> >  >
> >> >  > -Original Message-
> >> >  > From: Nathan Hjelm [mailto:hje...@lanl.gov]
> >> >  > Sent: Tuesday, June 10, 2014 2:58 PM
> >> >  > To: Fischer, Greg A.
> >> >  > Cc: Open MPI Users
> >> >  > Subject: Re: [OMPI users] openib segfaults with Torque
> >> >  >
> >> >  > Out of curiosity what is the mlock limit on your system? If it is 
> >> > too
> >> >  low that can cause ibv_create_cq to fail. To check run ulimit -m.
> >> >  >
> >> >  > -Nathan Hjelm
> >> >  > Application Readiness, HPC-5, LANL
> >> >  >
> >> >  > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> >> >  >> Yes, this fails on all nodes on the system, except for the head 
> >> > node.
> >> >  >>
> >> >  >> The uptime of the system isn't significant. Maybe 1 week, and 

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
+1

On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org>
 wrote:

> Yeah, I think we've seen that somewhere before too...
> 
> 
> On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> 
>> Agreed. The problem is not with UDCM. I don't think something is wrong with 
>> the system. I think his Torque is imposing major constraints on the maximum 
>> size that can be locked into memory.
>> 
>> Josh
>> 
>> 
>> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>> Probably won't help to use RDMACM though as you will just see the
>> resource failure somewhere else. UDCM is not the problem. Something is
>> wrong with the system. Allocating a 512 entry CQ should not fail.
>> 
>> -Nathan
>> 
>> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>> >I'm guessing it's a resource limitation issue coming from Torque.
>> >
>> >H...I found something interesting on the interwebs that looks 
>> > awfully
>> >similar:
>> >
>> > http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
>> >
>> >Greg, if the suggestion from the Torque users doesn't resolve your 
>> > issue (
>> >"...adding the following line 'ulimit -l unlimited' to pbs_mom and
>> >restarting pbs_mom." ) doesn't work, try using the RDMACM CPC (instead 
>> > of
>> >UDCM, which is a pretty recent addition to the openIB BTL.) by setting:
>> >
>> >-mca btl_openib_cpc_include rdmacm
>> >
>> >Josh
>> >
>> >On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
>> ><jsquy...@cisco.com> wrote:
>> >
>> >  Mellanox --
>> >
>> >  What would cause a CQ to fail to be created?
>> >
>> >  On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
>> >      <fisch...@westinghouse.com> wrote:
>> >
>> >  > Is there any other work around that I might try?  Something that
>> >  avoids UDCM?
>> >  >
>> >  > -Original Message-
>> >  > From: Fischer, Greg A.
>> >  > Sent: Tuesday, June 10, 2014 2:59 PM
>> >      > To: Nathan Hjelm
>> >  > Cc: Open MPI Users; Fischer, Greg A.
>> >  > Subject: RE: [OMPI users] openib segfaults with Torque
>> >  >
>> >  > [binf316:fischega] $ ulimit -m
>> >  > unlimited
>> >  >
>> >  > Greg
>> >  >
>> >  > -Original Message-
>> >  > From: Nathan Hjelm [mailto:hje...@lanl.gov]
>> >  > Sent: Tuesday, June 10, 2014 2:58 PM
>> >  > To: Fischer, Greg A.
>> >  > Cc: Open MPI Users
>> >  > Subject: Re: [OMPI users] openib segfaults with Torque
>> >  >
>> >  > Out of curiosity what is the mlock limit on your system? If it is 
>> > too
>> >  low that can cause ibv_create_cq to fail. To check run ulimit -m.
>> >  >
>> >  > -Nathan Hjelm
>> >  > Application Readiness, HPC-5, LANL
>> >  >
>> >  > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
>> >  >> Yes, this fails on all nodes on the system, except for the head 
>> > node.
>> >  >>
>> >  >> The uptime of the system isn't significant. Maybe 1 week, and it's
>> >  received basically no use.
>> >  >>
>> >  >> -Original Message-
>> >  >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
>> >  >> Sent: Tuesday, June 10, 2014 2:49 PM
>> >  >> To: Fischer, Greg A.
>> >  >> Cc: Open MPI Users
>> >  >> Subject: Re: [OMPI users] openib segfaults with Torque
>> >  >>
>> >  >>
>> >  >> Well, thats interesting. The output shows that ibv_create_cq is
>> >  failing. Strange since an identical call had just succeeded (udcm
>> >  creates two completion queues). Some questions that might indicate 
>> > where
>> >  the failure might be:
>> >  >>
>> >  >> Does this fail on any other node in your system?
>> >  >>
>> >  >> How long has the node been up?
>> >  >>
>> >  >> -N

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Ralph Castain
Yeah, I think we've seen that somewhere before too...


On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:

> Agreed. The problem is not with UDCM. I don't think something is wrong with 
> the system. I think his Torque is imposing major constraints on the maximum 
> size that can be locked into memory.
> 
> Josh
> 
> 
> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> Probably won't help to use RDMACM though as you will just see the
> resource failure somewhere else. UDCM is not the problem. Something is
> wrong with the system. Allocating a 512 entry CQ should not fail.
> 
> -Nathan
> 
> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
> >I'm guessing it's a resource limitation issue coming from Torque.
> >
> >H...I found something interesting on the interwebs that looks awfully
> >similar:
> >
> > http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
> >
> >Greg, if the suggestion from the Torque users doesn't resolve your issue 
> > (
> >"...adding the following line 'ulimit -l unlimited' to pbs_mom and
> >restarting pbs_mom." ) doesn't work, try using the RDMACM CPC (instead of
> >UDCM, which is a pretty recent addition to the openIB BTL.) by setting:
> >
> >-mca btl_openib_cpc_include rdmacm
> >
> >Josh
> >
> >On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
> ><jsquy...@cisco.com> wrote:
> >
> >  Mellanox --
> >
> >  What would cause a CQ to fail to be created?
> >
> >  On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
> >  <fisch...@westinghouse.com> wrote:
> >
> >  > Is there any other work around that I might try?  Something that
> >      avoids UDCM?
> >  >
> >  > -Original Message-
> >  > From: Fischer, Greg A.
> >  > Sent: Tuesday, June 10, 2014 2:59 PM
> >  > To: Nathan Hjelm
> >  > Cc: Open MPI Users; Fischer, Greg A.
> >  > Subject: RE: [OMPI users] openib segfaults with Torque
> >  >
> >  > [binf316:fischega] $ ulimit -m
> >  > unlimited
> >  >
> >  > Greg
> >  >
> >  > -Original Message-
> >  > From: Nathan Hjelm [mailto:hje...@lanl.gov]
> >  > Sent: Tuesday, June 10, 2014 2:58 PM
> >  > To: Fischer, Greg A.
> >  > Cc: Open MPI Users
> >  > Subject: Re: [OMPI users] openib segfaults with Torque
> >  >
> >  > Out of curiosity what is the mlock limit on your system? If it is too
> >  low that can cause ibv_create_cq to fail. To check run ulimit -m.
> >  >
> >  > -Nathan Hjelm
> >  > Application Readiness, HPC-5, LANL
> >  >
> >  > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> >  >> Yes, this fails on all nodes on the system, except for the head 
> > node.
> >  >>
> >  >> The uptime of the system isn't significant. Maybe 1 week, and it's
> >  received basically no use.
> >  >>
> >  >> -Original Message-
> >  >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> >  >> Sent: Tuesday, June 10, 2014 2:49 PM
> >  >> To: Fischer, Greg A.
> >  >> Cc: Open MPI Users
> >  >> Subject: Re: [OMPI users] openib segfaults with Torque
> >  >>
> >  >>
> >  >> Well, thats interesting. The output shows that ibv_create_cq is
> >  failing. Strange since an identical call had just succeeded (udcm
> >  creates two completion queues). Some questions that might indicate 
> > where
> >  the failure might be:
> >  >>
> >  >> Does this fail on any other node in your system?
> >  >>
> >  >> How long has the node been up?
> >  >>
> >  >> -Nathan Hjelm
> >  >> Application Readiness, HPC-5, LANL
> >  >>
> >  >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> >      >>> Jeff/Nathan,
> >  >>>
> >  >>> I ran the following with my debug build of OpenMPI 1.8.1 - after
> >  opening a terminal on a compute node with "qsub -l nodes 2 -I":
> >  >>>
> >  >>>  mpirun -mca btl openib,self -mca btl_base_ver

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
Agreed. The problem is not with UDCM. I don't think something is wrong with
the system. I think his Torque is imposing major constraints on the maximum
size that can be locked into memory.

Josh


On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

> Probably won't help to use RDMACM though as you will just see the
> resource failure somewhere else. UDCM is not the problem. Something is
> wrong with the system. Allocating a 512 entry CQ should not fail.
>
> -Nathan
>
> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
> >I'm guessing it's a resource limitation issue coming from Torque.
> >
> >H...I found something interesting on the interwebs that looks
> awfully
> >similar:
> >
> http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
> >
> >Greg, if the suggestion from the Torque users doesn't resolve your
> issue (
> >"...adding the following line 'ulimit -l unlimited' to pbs_mom and
> >restarting pbs_mom." ) doesn't work, try using the RDMACM CPC
> (instead of
> >UDCM, which is a pretty recent addition to the openIB BTL.) by
> setting:
> >
> >-mca btl_openib_cpc_include rdmacm
> >
> >Josh
> >
> >On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
> ><jsquy...@cisco.com> wrote:
> >
> >  Mellanox --
> >
> >  What would cause a CQ to fail to be created?
> >
> >  On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
> >  <fisch...@westinghouse.com> wrote:
> >
> >  > Is there any other work around that I might try?  Something that
> >  avoids UDCM?
> >      >
> >  > -Original Message-
> >  > From: Fischer, Greg A.
> >  > Sent: Tuesday, June 10, 2014 2:59 PM
> >  > To: Nathan Hjelm
> >  > Cc: Open MPI Users; Fischer, Greg A.
> >  > Subject: RE: [OMPI users] openib segfaults with Torque
> >  >
> >  > [binf316:fischega] $ ulimit -m
> >  > unlimited
> >  >
> >  > Greg
> >  >
> >  > -Original Message-
> >  > From: Nathan Hjelm [mailto:hje...@lanl.gov]
> >  > Sent: Tuesday, June 10, 2014 2:58 PM
> >  > To: Fischer, Greg A.
> >  > Cc: Open MPI Users
> >  > Subject: Re: [OMPI users] openib segfaults with Torque
> >  >
> >  > Out of curiosity what is the mlock limit on your system? If it is
> too
> >  low that can cause ibv_create_cq to fail. To check run ulimit -m.
> >  >
> >  > -Nathan Hjelm
> >  > Application Readiness, HPC-5, LANL
> >  >
> >  > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> >  >> Yes, this fails on all nodes on the system, except for the head
> node.
> >  >>
> >  >> The uptime of the system isn't significant. Maybe 1 week, and
> it's
> >  received basically no use.
> >  >>
> >  >> -Original Message-
> >  >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> >  >> Sent: Tuesday, June 10, 2014 2:49 PM
> >  >> To: Fischer, Greg A.
> >  >> Cc: Open MPI Users
> >  >> Subject: Re: [OMPI users] openib segfaults with Torque
> >  >>
> >  >>
> >  >> Well, thats interesting. The output shows that ibv_create_cq is
> >  failing. Strange since an identical call had just succeeded (udcm
> >  creates two completion queues). Some questions that might indicate
> where
> >  the failure might be:
> >  >>
> >  >> Does this fail on any other node in your system?
> >  >>
> >  >> How long has the node been up?
> >  >>
> >  >> -Nathan Hjelm
> >  >> Application Readiness, HPC-5, LANL
> >  >>
> >  >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> >  >>> Jeff/Nathan,
> >  >>>
> >  >>> I ran the following with my debug build of OpenMPI 1.8.1 - after
> >  opening a terminal on a compute node with "qsub -l nodes 2 -I":
> >  >>>
> >  >>>  mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
> >  >>> ring_c &> output.txt
> >  >>>
> >  >>> Output and backtrace are attached. Let me know if 

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Nathan Hjelm
Probably won't help to use RDMACM though as you will just see the
resource failure somewhere else. UDCM is not the problem. Something is
wrong with the system. Allocating a 512 entry CQ should not fail.
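
If you want to sanity-check the verbs stack outside of Open MPI, the stock
libibverbs example utilities exercise the same kind of allocation (a sketch,
assuming they are installed on the node):

    ibv_devinfo       # is the HCA visible and the port active?
    ibv_rc_pingpong   # allocates a CQ on the local HCA; once setup succeeds
                      # it just sits waiting for a peer, which is fine

Running those from the same Torque-spawned environment (e.g. an interactive
qsub session) would show whether the limit follows the job.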

-Nathan

On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>I'm guessing it's a resource limitation issue coming from Torque.
> 
>H...I found something interesting on the interwebs that looks awfully
>similar:
>http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
> 
>Greg, if the suggestion from the Torque users doesn't resolve your issue (
>"...adding the following line 'ulimit -l unlimited' to pbs_mom and
>restarting pbs_mom." ) doesn't work, try using the RDMACM CPC (instead of
>UDCM, which is a pretty recent addition to the openIB BTL.) by setting:
> 
>-mca btl_openib_cpc_include rdmacm
> 
>Josh 
> 
>On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
><jsquy...@cisco.com> wrote:
> 
>  Mellanox --
> 
>  What would cause a CQ to fail to be created?
> 
>  On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
>  <fisch...@westinghouse.com> wrote:
> 
>  > Is there any other work around that I might try?  Something that
>  avoids UDCM?
>  >
>  > -Original Message-
>  > From: Fischer, Greg A.
>      > Sent: Tuesday, June 10, 2014 2:59 PM
>  > To: Nathan Hjelm
>  > Cc: Open MPI Users; Fischer, Greg A.
>  > Subject: RE: [OMPI users] openib segfaults with Torque
>  >
>  > [binf316:fischega] $ ulimit -m
>  > unlimited
>  >
>  > Greg
>  >
>  > -Original Message-
>  > From: Nathan Hjelm [mailto:hje...@lanl.gov]
>  > Sent: Tuesday, June 10, 2014 2:58 PM
>  > To: Fischer, Greg A.
>  > Cc: Open MPI Users
>  > Subject: Re: [OMPI users] openib segfaults with Torque
>  >
>  > Out of curiosity what is the mlock limit on your system? If it is too
>  low that can cause ibv_create_cq to fail. To check run ulimit -m.
>  >
>  > -Nathan Hjelm
>  > Application Readiness, HPC-5, LANL
>  >
>  > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
>  >> Yes, this fails on all nodes on the system, except for the head node.
>  >>
>  >> The uptime of the system isn't significant. Maybe 1 week, and it's
>  received basically no use.
>  >>
>  >> -Original Message-
>  >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
>  >> Sent: Tuesday, June 10, 2014 2:49 PM
>  >> To: Fischer, Greg A.
>  >> Cc: Open MPI Users
>  >> Subject: Re: [OMPI users] openib segfaults with Torque
>  >>
>  >>
>  >> Well, thats interesting. The output shows that ibv_create_cq is
>  failing. Strange since an identical call had just succeeded (udcm
>  creates two completion queues). Some questions that might indicate where
>  the failure might be:
>  >>
>  >> Does this fail on any other node in your system?
>  >>
>  >> How long has the node been up?
>  >>
>  >> -Nathan Hjelm
>  >> Application Readiness, HPC-5, LANL
>  >>
>  >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
>  >>> Jeff/Nathan,
>  >>>
>  >>> I ran the following with my debug build of OpenMPI 1.8.1 - after
>  opening a terminal on a compute node with "qsub -l nodes 2 -I":
>  >>>
>  >>>  mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
>  >>> ring_c &> output.txt
>  >>>
>  >>> Output and backtrace are attached. Let me know if I can provide
>  anything else.
>  >>>
>  >>> Thanks for looking into this,
>  >>> Greg
>  >>>
>  >>> -Original Message-
>  >>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>  >>> Squyres (jsquyres)
>  >>> Sent: Tuesday, June 10, 2014 10:31 AM
>  >>> To: Nathan Hjelm
>  >>> Cc: Open MPI Users
>  >>> Subject: Re: [OMPI users] openib segfaults with Torque
>  >>>
>  >>> Greg:
>  >>>
>  >>> Can you run with "--mca btl_base_verbose 100" on your debug build so
>  that we can ge

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
I'm guessing it's a resource limitation issue coming from Torque.

Hmm... I found something interesting on the interwebs that looks awfully
similar:

http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html


Greg, if the suggestion from the Torque users ("...adding the following
line 'ulimit -l unlimited' to pbs_mom and restarting pbs_mom.") doesn't
resolve your issue, try using the RDMACM CPC instead of UDCM (which is a
pretty recent addition to the openib BTL) by setting:

-mca btl_openib_cpc_include rdmacm
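
For example, reusing the ring_c test case from earlier in the thread (just a
sketch):

    mpirun -mca btl openib,self -mca btl_openib_cpc_include rdmacm -np 2 ring_c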



Josh


On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> Mellanox --
>
> What would cause a CQ to fail to be created?
>
>
> On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." <fisch...@westinghouse.com>
> wrote:
>
> > Is there any other work around that I might try?  Something that avoids
> UDCM?
> >
> > -Original Message-
> > From: Fischer, Greg A.
> > Sent: Tuesday, June 10, 2014 2:59 PM
> > To: Nathan Hjelm
> > Cc: Open MPI Users; Fischer, Greg A.
> > Subject: RE: [OMPI users] openib segfaults with Torque
> >
> > [binf316:fischega] $ ulimit -m
> > unlimited
> >
> > Greg
> >
> > -Original Message-
> > From: Nathan Hjelm [mailto:hje...@lanl.gov]
> > Sent: Tuesday, June 10, 2014 2:58 PM
> > To: Fischer, Greg A.
> > Cc: Open MPI Users
> > Subject: Re: [OMPI users] openib segfaults with Torque
> >
> > Out of curiosity what is the mlock limit on your system? If it is too
> low that can cause ibv_create_cq to fail. To check run ulimit -m.
> >
> > -Nathan Hjelm
> > Application Readiness, HPC-5, LANL
> >
> > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> >> Yes, this fails on all nodes on the system, except for the head node.
> >>
> >> The uptime of the system isn't significant. Maybe 1 week, and it's
> received basically no use.
> >>
> >> -Original Message-
> >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> >> Sent: Tuesday, June 10, 2014 2:49 PM
> >> To: Fischer, Greg A.
> >> Cc: Open MPI Users
> >> Subject: Re: [OMPI users] openib segfaults with Torque
> >>
> >>
> >> Well, thats interesting. The output shows that ibv_create_cq is
> failing. Strange since an identical call had just succeeded (udcm creates
> two completion queues). Some questions that might indicate where the
> failure might be:
> >>
> >> Does this fail on any other node in your system?
> >>
> >> How long has the node been up?
> >>
> >> -Nathan Hjelm
> >> Application Readiness, HPC-5, LANL
> >>
> >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> >>> Jeff/Nathan,
> >>>
> >>> I ran the following with my debug build of OpenMPI 1.8.1 - after
> opening a terminal on a compute node with "qsub -l nodes 2 -I":
> >>>
> >>>  mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
> >>> ring_c &> output.txt
> >>>
> >>> Output and backtrace are attached. Let me know if I can provide
> anything else.
> >>>
> >>> Thanks for looking into this,
> >>> Greg
> >>>
> >>> -Original Message-
> >>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> >>> Squyres (jsquyres)
> >>> Sent: Tuesday, June 10, 2014 10:31 AM
> >>> To: Nathan Hjelm
> >>> Cc: Open MPI Users
> >>> Subject: Re: [OMPI users] openib segfaults with Torque
> >>>
> >>> Greg:
> >>>
> >>> Can you run with "--mca btl_base_verbose 100" on your debug build so
> that we can get some additional output to see why UDCM is failing to setup
> properly?
> >>>
> >>>
> >>>
> >>> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >>>
> >>>> On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres)
> wrote:
> >>>>> I seem to recall that you have an IB-based cluster, right?
> >>>>>
> >>>>> From a *very quick* glance at the code, it looks like this might be
> a simple incorrect-finalization issue.  That is:
> >>>>>
> >>>>> - you run the job on a single server
> >>>>> - openib disqualifies itself because you're running on a single
> >>>>> server
> >>>>> - 

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
Mellanox --

What would cause a CQ to fail to be created?


On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." <fisch...@westinghouse.com> 
wrote:

> Is there any other work around that I might try?  Something that avoids UDCM?
> 
> -Original Message-
> From: Fischer, Greg A.
> Sent: Tuesday, June 10, 2014 2:59 PM
> To: Nathan Hjelm
> Cc: Open MPI Users; Fischer, Greg A.
> Subject: RE: [OMPI users] openib segfaults with Torque
> 
> [binf316:fischega] $ ulimit -m
> unlimited
> 
> Greg
> 
> -Original Message-
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Tuesday, June 10, 2014 2:58 PM
> To: Fischer, Greg A.
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
> 
> Out of curiosity what is the mlock limit on your system? If it is too low 
> that can cause ibv_create_cq to fail. To check run ulimit -m.
> 
> -Nathan Hjelm
> Application Readiness, HPC-5, LANL
> 
> On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
>> Yes, this fails on all nodes on the system, except for the head node.
>> 
>> The uptime of the system isn't significant. Maybe 1 week, and it's received 
>> basically no use.
>> 
>> -Original Message-
>> From: Nathan Hjelm [mailto:hje...@lanl.gov]
>> Sent: Tuesday, June 10, 2014 2:49 PM
>> To: Fischer, Greg A.
>> Cc: Open MPI Users
>> Subject: Re: [OMPI users] openib segfaults with Torque
>> 
>> 
>> Well, thats interesting. The output shows that ibv_create_cq is failing. 
>> Strange since an identical call had just succeeded (udcm creates two 
>> completion queues). Some questions that might indicate where the failure 
>> might be:
>> 
>> Does this fail on any other node in your system?
>> 
>> How long has the node been up?
>> 
>> -Nathan Hjelm
>> Application Readiness, HPC-5, LANL
>> 
>> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
>>> Jeff/Nathan,
>>> 
>>> I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
>>> terminal on a compute node with "qsub -l nodes 2 -I":
>>> 
>>>  mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
>>> ring_c &> output.txt
>>> 
>>> Output and backtrace are attached. Let me know if I can provide anything 
>>> else.
>>> 
>>> Thanks for looking into this,
>>> Greg
>>> 
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>> Squyres (jsquyres)
>>> Sent: Tuesday, June 10, 2014 10:31 AM
>>> To: Nathan Hjelm
>>> Cc: Open MPI Users
>>> Subject: Re: [OMPI users] openib segfaults with Torque
>>> 
>>> Greg:
>>> 
>>> Can you run with "--mca btl_base_verbose 100" on your debug build so that 
>>> we can get some additional output to see why UDCM is failing to setup 
>>> properly?
>>> 
>>> 
>>> 
>>> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>> 
>>>> On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
>>>>> I seem to recall that you have an IB-based cluster, right?
>>>>> 
>>>>> From a *very quick* glance at the code, it looks like this might be a 
>>>>> simple incorrect-finalization issue.  That is:
>>>>> 
>>>>> - you run the job on a single server
>>>>> - openib disqualifies itself because you're running on a single
>>>>> server
>>>>> - openib then goes to finalize/close itself
>>>>> - but openib didn't fully initialize itself (because it
>>>>> disqualified itself early in the initialization process), and
>>>>> something in the finalization process didn't take that into
>>>>> account
>>>>> 
>>>>> Nathan -- is that anywhere close to correct?
>>>> 
>>>> Nope. udcm_module_finalize is being called because there was an
>>>> error setting up the udcm state. See btl_openib_connect_udcm.c:476.
>>>> The opal_list_t destructor is getting an assert failure. Probably
>>>> because the constructor wasn't called. I can rearrange the
>>>> constructors to be called first but there appears to be a deeper
>>>> issue with the user's
>>>> system: udcm_module_init should not be failing! It creates a
>>>> couple of CQs, allocates a small number of registered bufferes and
>>>

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Fischer, Greg A.
Is there any other workaround that I might try? Something that avoids UDCM?

-Original Message-
From: Fischer, Greg A.
Sent: Tuesday, June 10, 2014 2:59 PM
To: Nathan Hjelm
Cc: Open MPI Users; Fischer, Greg A.
Subject: RE: [OMPI users] openib segfaults with Torque

[binf316:fischega] $ ulimit -m
unlimited

Greg

-Original Message-
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:58 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Out of curiosity what is the mlock limit on your system? If it is too low that 
can cause ibv_create_cq to fail. To check run ulimit -m.

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> Yes, this fails on all nodes on the system, except for the head node.
>
> The uptime of the system isn't significant. Maybe 1 week, and it's received 
> basically no use.
>
> -Original Message-
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Tuesday, June 10, 2014 2:49 PM
> To: Fischer, Greg A.
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>
>
> Well, thats interesting. The output shows that ibv_create_cq is failing. 
> Strange since an identical call had just succeeded (udcm creates two 
> completion queues). Some questions that might indicate where the failure 
> might be:
>
> Does this fail on any other node in your system?
>
> How long has the node been up?
>
> -Nathan Hjelm
> Application Readiness, HPC-5, LANL
>
> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> > Jeff/Nathan,
> >
> > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> > terminal on a compute node with "qsub -l nodes 2 -I":
> >
> >   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
> > ring_c &> output.txt
> >
> > Output and backtrace are attached. Let me know if I can provide anything 
> > else.
> >
> > Thanks for looking into this,
> > Greg
> >
> > -Original Message-
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres)
> > Sent: Tuesday, June 10, 2014 10:31 AM
> > To: Nathan Hjelm
> > Cc: Open MPI Users
> > Subject: Re: [OMPI users] openib segfaults with Torque
> >
> > Greg:
> >
> > Can you run with "--mca btl_base_verbose 100" on your debug build so that 
> > we can get some additional output to see why UDCM is failing to setup 
> > properly?
> >
> >
> >
> > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> > >> I seem to recall that you have an IB-based cluster, right?
> > >>
> > >> From a *very quick* glance at the code, it looks like this might be a 
> > >> simple incorrect-finalization issue.  That is:
> > >>
> > >> - you run the job on a single server
> > >> - openib disqualifies itself because you're running on a single
> > >> server
> > >> - openib then goes to finalize/close itself
> > >> - but openib didn't fully initialize itself (because it
> > >> disqualified itself early in the initialization process), and
> > >> something in the finalization process didn't take that into
> > >> account
> > >>
> > >> Nathan -- is that anywhere close to correct?
> > >
> > > Nope. udcm_module_finalize is being called because there was an
> > > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > > The opal_list_t destructor is getting an assert failure. Probably
> > > because the constructor wasn't called. I can rearrange the
> > > constructors to be called first but there appears to be a deeper
> > > issue with the user's
> > > system: udcm_module_init should not be failing! It creates a
> > > couple of CQs, allocates a small number of registered bufferes and
> > > starts monitoring the fd for the completion channel. All these
> > > things are also done in the setup of the openib btl itself. Keep
> > > in mind that the openib btl will not disqualify itself when running 
> > > single server.
> > > Openib may be used to communicate on node and is needed for the dynamics 
> > > case.
> > >
> > > The user might try adding -mca btl_base_verbose 100 to shed some
> > > light on what the real issue is.
> > >
> > > BTW, I no longer monitor th

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
[binf316:fischega] $ ulimit -m
unlimited

Greg

-Original Message-
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:58 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Out of curiosity what is the mlock limit on your system? If it is too low that 
can cause ibv_create_cq to fail. To check run ulimit -m.

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> Yes, this fails on all nodes on the system, except for the head node.
>
> The uptime of the system isn't significant. Maybe 1 week, and it's received 
> basically no use.
>
> -Original Message-
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Tuesday, June 10, 2014 2:49 PM
> To: Fischer, Greg A.
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>
>
> Well, thats interesting. The output shows that ibv_create_cq is failing. 
> Strange since an identical call had just succeeded (udcm creates two 
> completion queues). Some questions that might indicate where the failure 
> might be:
>
> Does this fail on any other node in your system?
>
> How long has the node been up?
>
> -Nathan Hjelm
> Application Readiness, HPC-5, LANL
>
> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> > Jeff/Nathan,
> >
> > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> > terminal on a compute node with "qsub -l nodes 2 -I":
> >
> >   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
> > ring_c &> output.txt
> >
> > Output and backtrace are attached. Let me know if I can provide anything 
> > else.
> >
> > Thanks for looking into this,
> > Greg
> >
> > -Original Message-
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres)
> > Sent: Tuesday, June 10, 2014 10:31 AM
> > To: Nathan Hjelm
> > Cc: Open MPI Users
> > Subject: Re: [OMPI users] openib segfaults with Torque
> >
> > Greg:
> >
> > Can you run with "--mca btl_base_verbose 100" on your debug build so that 
> > we can get some additional output to see why UDCM is failing to setup 
> > properly?
> >
> >
> >
> > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> > >> I seem to recall that you have an IB-based cluster, right?
> > >>
> > >> From a *very quick* glance at the code, it looks like this might be a 
> > >> simple incorrect-finalization issue.  That is:
> > >>
> > >> - you run the job on a single server
> > >> - openib disqualifies itself because you're running on a single
> > >> server
> > >> - openib then goes to finalize/close itself
> > >> - but openib didn't fully initialize itself (because it
> > >> disqualified itself early in the initialization process), and
> > >> something in the finalization process didn't take that into
> > >> account
> > >>
> > >> Nathan -- is that anywhere close to correct?
> > >
> > > Nope. udcm_module_finalize is being called because there was an
> > > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > > The opal_list_t destructor is getting an assert failure. Probably
> > > because the constructor wasn't called. I can rearrange the
> > > constructors to be called first but there appears to be a deeper
> > > issue with the user's
> > > system: udcm_module_init should not be failing! It creates a
> > > couple of CQs, allocates a small number of registered bufferes and
> > > starts monitoring the fd for the completion channel. All these
> > > things are also done in the setup of the openib btl itself. Keep
> > > in mind that the openib btl will not disqualify itself when running 
> > > single server.
> > > Openib may be used to communicate on node and is needed for the dynamics 
> > > case.
> > >
> > > The user might try adding -mca btl_base_verbose 100 to shed some
> > > light on what the real issue is.
> > >
> > > BTW, I no longer monitor the user mailing list. If something needs
> > > my attention forward it to me directly.
> > >
> > > -Nathan
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > h

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
Out of curiosity, what is the mlock limit on your system? If it is too
low, that can cause ibv_create_cq to fail. To check, run ulimit -m.
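
For reference, the related limits can be read in one go (a sketch; -l is the
locked-memory/memlock limit, which is what pinned, registered IB memory counts
against):

    ulimit -m    # max resident set size
    ulimit -l    # max locked memory (memlock)
    ulimit -a    # everything at once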

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> Yes, this fails on all nodes on the system, except for the head node.
> 
> The uptime of the system isn't significant. Maybe 1 week, and it's received 
> basically no use.
> 
> -Original Message-
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Tuesday, June 10, 2014 2:49 PM
> To: Fischer, Greg A.
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
> 
> 
> Well, thats interesting. The output shows that ibv_create_cq is failing. 
> Strange since an identical call had just succeeded (udcm creates two 
> completion queues). Some questions that might indicate where the failure 
> might be:
> 
> Does this fail on any other node in your system?
> 
> How long has the node been up?
> 
> -Nathan Hjelm
> Application Readiness, HPC-5, LANL
> 
> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> > Jeff/Nathan,
> >
> > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> > terminal on a compute node with "qsub -l nodes 2 -I":
> >
> >   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &>
> > output.txt
> >
> > Output and backtrace are attached. Let me know if I can provide anything 
> > else.
> >
> > Thanks for looking into this,
> > Greg
> >
> > -Original Message-
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres)
> > Sent: Tuesday, June 10, 2014 10:31 AM
> > To: Nathan Hjelm
> > Cc: Open MPI Users
> > Subject: Re: [OMPI users] openib segfaults with Torque
> >
> > Greg:
> >
> > Can you run with "--mca btl_base_verbose 100" on your debug build so that 
> > we can get some additional output to see why UDCM is failing to setup 
> > properly?
> >
> >
> >
> > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> > >> I seem to recall that you have an IB-based cluster, right?
> > >>
> > >> From a *very quick* glance at the code, it looks like this might be a 
> > >> simple incorrect-finalization issue.  That is:
> > >>
> > >> - you run the job on a single server
> > >> - openib disqualifies itself because you're running on a single
> > >> server
> > >> - openib then goes to finalize/close itself
> > >> - but openib didn't fully initialize itself (because it
> > >> disqualified itself early in the initialization process), and
> > >> something in the finalization process didn't take that into account
> > >>
> > >> Nathan -- is that anywhere close to correct?
> > >
> > > Nope. udcm_module_finalize is being called because there was an
> > > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > > The opal_list_t destructor is getting an assert failure. Probably
> > > because the constructor wasn't called. I can rearrange the
> > > constructors to be called first but there appears to be a deeper
> > > issue with the user's
> > > system: udcm_module_init should not be failing! It creates a couple
> > > of CQs, allocates a small number of registered bufferes and starts
> > > monitoring the fd for the completion channel. All these things are
> > > also done in the setup of the openib btl itself. Keep in mind that
> > > the openib btl will not disqualify itself when running single server.
> > > Openib may be used to communicate on node and is needed for the dynamics 
> > > case.
> > >
> > > The user might try adding -mca btl_base_verbose 100 to shed some
> > > light on what the real issue is.
> > >
> > > BTW, I no longer monitor the user mailing list. If something needs
> > > my attention forward it to me directly.
> > >
> > > -Nathan
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinf

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Yes, this fails on all nodes on the system, except for the head node.

The uptime of the system isn't significant. Maybe 1 week, and it's received 
basically no use.

-Original Message-
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:49 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque


Well, thats interesting. The output shows that ibv_create_cq is failing. 
Strange since an identical call had just succeeded (udcm creates two completion 
queues). Some questions that might indicate where the failure might be:

Does this fail on any other node in your system?

How long has the node been up?

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> Jeff/Nathan,
>
> I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> terminal on a compute node with "qsub -l nodes 2 -I":
>
>   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &>
> output.txt
>
> Output and backtrace are attached. Let me know if I can provide anything else.
>
> Thanks for looking into this,
> Greg
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Tuesday, June 10, 2014 10:31 AM
> To: Nathan Hjelm
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>
> Greg:
>
> Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
> can get some additional output to see why UDCM is failing to setup properly?
>
>
>
> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> >> I seem to recall that you have an IB-based cluster, right?
> >>
> >> From a *very quick* glance at the code, it looks like this might be a 
> >> simple incorrect-finalization issue.  That is:
> >>
> >> - you run the job on a single server
> >> - openib disqualifies itself because you're running on a single
> >> server
> >> - openib then goes to finalize/close itself
> >> - but openib didn't fully initialize itself (because it
> >> disqualified itself early in the initialization process), and
> >> something in the finalization process didn't take that into account
> >>
> >> Nathan -- is that anywhere close to correct?
> >
> > Nope. udcm_module_finalize is being called because there was an
> > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > The opal_list_t destructor is getting an assert failure. Probably
> > because the constructor wasn't called. I can rearrange the
> > constructors to be called first but there appears to be a deeper
> > issue with the user's
> > system: udcm_module_init should not be failing! It creates a couple
> > of CQs, allocates a small number of registered bufferes and starts
> > monitoring the fd for the completion channel. All these things are
> > also done in the setup of the openib btl itself. Keep in mind that
> > the openib btl will not disqualify itself when running single server.
> > Openib may be used to communicate on node and is needed for the dynamics 
> > case.
> >
> > The user might try adding -mca btl_base_verbose 100 to shed some
> > light on what the real issue is.
> >
> > BTW, I no longer monitor the user mailing list. If something needs
> > my attention forward it to me directly.
> >
> > -Nathan
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
> #0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
> #1  0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6
> #2  0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
> #3  0x7f8b664b684b in udcm_module_finalize (btl=0x717060,
> cpc=0x7190c0) at
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_co
> nnect_udcm.c:734
> #4  0x7f8b664b5474 in udcm_component_query (btl=0x717060,
> cpc=0x718a48) at
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_co
> nnect_udcm.c:476
> #5  0x7f8b664ae316 in
> ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at
> ..

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm

Well, that's interesting. The output shows that ibv_create_cq is
failing. Strange, since an identical call had just succeeded (udcm
creates two completion queues). A couple of questions that might indicate
where the failure lies:

Does this fail on any other node in your system?

How long has the node been up?

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> Jeff/Nathan,
> 
> I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> terminal on a compute node with "qsub -l nodes 2 -I":
> 
>   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> 
> output.txt
> 
> Output and backtrace are attached. Let me know if I can provide anything else.
> 
> Thanks for looking into this,
> Greg
> 
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
> (jsquyres)
> Sent: Tuesday, June 10, 2014 10:31 AM
> To: Nathan Hjelm
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
> 
> Greg: 
> 
> Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
> can get some additional output to see why UDCM is failing to setup properly?
> 
> 
> 
> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> 
> > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> >> I seem to recall that you have an IB-based cluster, right?
> >> 
> >> From a *very quick* glance at the code, it looks like this might be a 
> >> simple incorrect-finalization issue.  That is:
> >> 
> >> - you run the job on a single server
> >> - openib disqualifies itself because you're running on a single 
> >> server
> >> - openib then goes to finalize/close itself
> >> - but openib didn't fully initialize itself (because it disqualified 
> >> itself early in the initialization process), and something in the 
> >> finalization process didn't take that into account
> >> 
> >> Nathan -- is that anywhere close to correct?
> > 
> > Nope. udcm_module_finalize is being called because there was an error 
> > setting up the udcm state. See btl_openib_connect_udcm.c:476. The 
> > opal_list_t destructor is getting an assert failure. Probably because 
> > the constructor wasn't called. I can rearrange the constructors to be 
> > called first but there appears to be a deeper issue with the user's
> > system: udcm_module_init should not be failing! It creates a couple of 
> > CQs, allocates a small number of registered bufferes and starts 
> > monitoring the fd for the completion channel. All these things are 
> > also done in the setup of the openib btl itself. Keep in mind that the 
> > openib btl will not disqualify itself when running single server. 
> > Openib may be used to communicate on node and is needed for the dynamics 
> > case.
> > 
> > The user might try adding -mca btl_base_verbose 100 to shed some light 
> > on what the real issue is.
> > 
> > BTW, I no longer monitor the user mailing list. If something needs my 
> > attention forward it to me directly.
> > 
> > -Nathan
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 

> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
> #0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
> #1  0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6
> #2  0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
> #3  0x7f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
> #4  0x7f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
> #5  0x7f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port 
> (btl=0x717060) at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6  0x7f8b66497817 in btl_openib_component_init 
> (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, 
> enable_mpi_threads=false)
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_compone

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Jeff/Nathan,

I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
terminal on a compute node with "qsub -l nodes=2 -I":

mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> 
output.txt

Output and backtrace are attached. Let me know if I can provide anything else.

Thanks for looking into this,
Greg

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Tuesday, June 10, 2014 10:31 AM
To: Nathan Hjelm
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Greg: 

Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
can get some additional output to see why UDCM is failing to set up properly?



On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>> 
>> From a *very quick* glance at the code, it looks like this might be a simple 
>> incorrect-finalization issue.  That is:
>> 
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single 
>> server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified 
>> itself early in the initialization process), and something in the 
>> finalization process didn't take that into account
>> 
>> Nathan -- is that anywhere close to correct?
> 
> Nope. udcm_module_finalize is being called because there was an error 
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The 
> opal_list_t destructor is getting an assert failure. Probably because 
> the constructor wasn't called. I can rearrange the constructors to be 
> called first but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of 
> CQs, allocates a small number of registered buffers and starts 
> monitoring the fd for the completion channel. All these things are 
> also done in the setup of the openib btl itself. Keep in mind that the 
> openib btl will not disqualify itself when running single server. 
> Openib may be used to communicate on node and is needed for the dynamics case.
> 
> The user might try adding -mca btl_base_verbose 100 to shed some light 
> on what the real issue is.
> 
> BTW, I no longer monitor the user mailing list. If something needs my 
> attention forward it to me directly.
> 
> -Nathan


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
#0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
#1  0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6
#2  0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port 
(btl=0x717060) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f8b66497817 in btl_openib_component_init 
(num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, 
enable_mpi_threads=false)
at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, 
enable_progress_threads=false, enable_mpi_threads=false)
at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x7f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x7f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, 
enable_progress_threads=false, enable_mpi_threads=false)
at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x7f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_se

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Jeff Squyres (jsquyres)
Greg: 

Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
can get some additional output to see why UDCM is failing to set up properly?



On Jun 10, 2014, at 10:25 AM, Nathan Hjelm  wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>> 
>> From a *very quick* glance at the code, it looks like this might be a simple 
>> incorrect-finalization issue.  That is:
>> 
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified itself 
>> early in the initialization process), and something in the finalization 
>> process didn't take that into account
>> 
>> Nathan -- is that anywhere close to correct?
> 
> Nope. udcm_module_finalize is being called because there was an error
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The
> opal_list_t destructor is getting an assert failure. Probably because
> the constructor wasn't called. I can rearrange the constructors to be
> called first but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of
> CQs, allocates a small number of registered buffers and starts
> monitoring the fd for the completion channel. All these things are also
> done in the setup of the openib btl itself. Keep in mind that the openib
> btl will not disqualify itself when running single server. Openib may be
> used to communicate on node and is needed for the dynamics case.
> 
> The user might try adding -mca btl_base_verbose 100 to shed some
> light on what the real issue is.
> 
> BTW, I no longer monitor the user mailing list. If something needs my
> attention forward it to me directly.
> 
> -Nathan


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> I seem to recall that you have an IB-based cluster, right?
> 
> From a *very quick* glance at the code, it looks like this might be a simple 
> incorrect-finalization issue.  That is:
> 
> - you run the job on a single server
> - openib disqualifies itself because you're running on a single server
> - openib then goes to finalize/close itself
> - but openib didn't fully initialize itself (because it disqualified itself 
> early in the initialization process), and something in the finalization 
> process didn't take that into account
> 
> Nathan -- is that anywhere close to correct?

Nope. udcm_module_finalize is being called because there was an error
setting up the udcm state. See btl_openib_connect_udcm.c:476. The
opal_list_t destructor is getting an assert failure. Probably because
the constructor wasn't called. I can rearrange the constructors to be
called first but there appears to be a deeper issue with the user's
system: udcm_module_init should not be failing! It creates a couple of
CQs, allocates a small number of registered buffers and starts
monitoring the fd for the completion channel. All these things are also
done in the setup of the openib btl itself. Keep in mind that the openib
btl will not disqualify itself when running single server. Openib may be
used to communicate on node and is needed for the dynamics case.
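
As a rough illustration of those init steps, the sketch below (plain libibverbs, not Open MPI source; device 0, the CQ depth, and the buffer size are arbitrary choices) opens a device, creates a completion channel and a CQ, and registers one buffer. When the locked-memory limit is too small, ibv_create_cq() or ibv_reg_mr() is typically where such a program fails. Compile with something like "gcc check_verbs.c -libverbs" (the file name is made up).

/* Minimal verbs init sketch; not Open MPI code. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || 0 == num) {
        fprintf(stderr, "no verbs devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }

    /* completion channel: the fd that gets monitored for events */
    struct ibv_comp_channel *ch = ibv_create_comp_channel(ctx);
    if (!ch) { perror("ibv_create_comp_channel"); return 1; }

    /* a modest completion queue */
    struct ibv_cq *cq = ibv_create_cq(ctx, 512, NULL, ch, 0);
    if (!cq) { perror("ibv_create_cq (check 'ulimit -l')"); return 1; }

    /* registered memory also counts against the locked-memory limit */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    void *buf = calloc(1, 4096);
    struct ibv_mr *mr = pd ? ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE) : NULL;
    if (!mr) { perror("ibv_reg_mr (check 'ulimit -l')"); return 1; }

    printf("CQ and MR created successfully\n");

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_destroy_cq(cq);
    ibv_destroy_comp_channel(ch);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}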

The user might try adding -mca btl_base_verbose 100 to shed some
light on what the real issue is.

BTW, I no longer monitor the user mailing list. If something needs my
attention forward it to me directly.

-Nathan
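
The assertion in the backtraces comes from OPAL's debug object checks: the constructor (OBJ_CONSTRUCT) stamps an object with a magic id and the destructor (OBJ_DESTRUCT) verifies it in debug builds. The fragment below imitates that pattern in plain C (it is not the OPAL implementation) to show why destructing a never-constructed list aborts with exactly this kind of message:

/* Stand-in for the OPAL debug-object pattern; not OPAL source. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

typedef struct {
    uint64_t obj_magic_id;        /* set on construct, checked on destruct */
    /* ... list head/tail would live here ... */
} fake_list_t;

static void construct(fake_list_t *l) { l->obj_magic_id = MAGIC_ID; }

static void destruct(fake_list_t *l)
{
    /* debug builds verify the object was actually constructed */
    assert(MAGIC_ID == l->obj_magic_id);
    l->obj_magic_id = 0;
}

int main(void)
{
    fake_list_t ok, never_constructed;

    construct(&ok);
    destruct(&ok);                         /* fine */

    memset(&never_constructed, 0, sizeof(never_constructed));
    destruct(&never_constructed);          /* aborts, like udcm_module_finalize */
    return 0;
}

Constructing the lists up front, as described above, would make the error path safe to tear down; the failing resource allocation underneath is still the real problem.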


pgpx5f_ZZt8HD.pgp
Description: PGP signature


Re: [OMPI users] openib segfaults with Torque

2014-06-09 Thread Jeff Squyres (jsquyres)
I seem to recall that you have an IB-based cluster, right?

From a *very quick* glance at the code, it looks like this might be a simple 
incorrect-finalization issue.  That is:

- you run the job on a single server
- openib disqualifies itself because you're running on a single server
- openib then goes to finalize/close itself
- but openib didn't fully initialize itself (because it disqualified itself 
early in the initialization process), and something in the finalization process 
didn't take that into account

Nathan -- is that anywhere close to correct?



On Jun 5, 2014, at 5:10 PM, "Fischer, Greg A."  
wrote:

> OpenMPI Users,
>  
> After encountering difficulty with the Intel compilers (see the “intermittent 
> segfaults with openib on ring_c.c” thread), I installed GCC-4.8.3 and 
> recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib 
> BTL in a typical BASH environment. Everything appeared to work fine, so I 
> went on my merry way compiling the rest of my dependencies.
>  
> After getting my dependencies and applications compiled, I began observing 
> segfaults when submitting the applications through Torque. I recompiled 
> OpenMPI with debug options, ran “ring_c” over the openib BTL in an 
> interactive Torque session (“qsub -I”), and got the backtrace below. All 
> other system settings described in the previous thread are the same. Any 
> thoughts on how to resolve this issue?
>  
> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
> #1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
> #2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
> #3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
> #4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
> #5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port 
> (btl=0x716680) at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6  0x7f7f54885817 in btl_openib_component_init 
> (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, 
> enable_mpi_threads=false)
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
> #7  0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false) at 
> ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
> #8  0x7f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, 
> enable_progress_threads=false, enable_mpi_threads=false) at 
> ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
> #9  0x7f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, 
> enable_mpi_threads=false) at 
> ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
> #10 0x7f7f539ed739 in mca_pml_ob1_component_init 
> (priority=0x7fff906aa630, enable_progress_threads=false, 
> enable_mpi_threads=false)
> at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #11 0x7f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false) at 
> ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
> #12 0x7f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, 
> requested=0, provided=0x7fff906aa7d8) at 
> ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
> #13 0x7f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, 
> argv=0x7fff906aa820) at pinit.c:84
> #14 0x0040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19
>  
> Greg
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
Fascinating - I can only assume that Torque is setting something in the 
environment that is creating the confusion. Sadly, Nathan is at the MPI Forum 
this week, so we may have to wait until Mon to get his input on the problem as 
he wrote the udcm code.
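
One way to see what differs in the Torque-launched environment is to print the process resource limits from inside an interactive job and compare them with an ordinary ssh login on the same node; the locked-memory limit (RLIMIT_MEMLOCK) is the usual suspect when verbs resource creation fails. A small diagnostic along these lines (hypothetical helper, not part of Open MPI or Torque) could be:

/* Print a few soft resource limits; run under "qsub -I" and under ssh, then compare. */
#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0) {
        perror(name);
        return;
    }
    if (RLIM_INFINITY == rl.rlim_cur)
        printf("%-10s soft=unlimited\n", name);
    else
        printf("%-10s soft=%llu\n", name, (unsigned long long) rl.rlim_cur);
}

int main(void)
{
    show("memlock", RLIMIT_MEMLOCK);   /* bytes of memory that may be locked */
    show("stack",   RLIMIT_STACK);
    show("nofile",  RLIMIT_NOFILE);
    return 0;
}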


On Jun 6, 2014, at 8:51 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

> Yep, TCP works fine when launched via Torque/qsub:
>  
> [binf315:fischega] $ mpirun -np 2 -mca btl tcp,sm,self ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 1 exiting
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, June 06, 2014 10:34 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>  
> Huh - how strange. I can't imagine what it has to do with Torque vs rsh - 
> this is failing when the openib BTL is trying to create the connection, which 
> comes way after the launch is complete.
>  
> Are you able to run this with btl tcp,sm,self? If so, that would confirm that 
> everything else is correct, and the problem truly is limited to the udcm 
> itself...which shouldn't have anything to do with how the proc was launched.
>  
>  
> On Jun 6, 2014, at 6:47 AM, Fischer, Greg A. <fisch...@westinghouse.com> 
> wrote:
> 
> 
> Here are the results when logging in to the compute node via ssh and running 
> as you suggest:
>  
> [binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 1 exiting
>  
> Here are the results when executing over Torque (launch the shell with “qsub 
> -l nodes=2 -I”):
>  
> [binf316:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
> ring_c: 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
>  udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [binf316:21584] *** Process received signal ***
> [binf316:21584] Signal: Aborted (6)
> [binf316:21584] Signal code:  (-6)
> ring_c: 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
>  udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [binf316:21583] *** Process received signal ***
> [binf316:21583] Signal: Aborted (6)
> [binf316:21583] Signal code:  (-6)
> [binf316:21584] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fe33a2637c0]
> [binf316:21584] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fe339f0fb55]
> [binf316:21584] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fe339f11131]
> [binf316:21584] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fe339f08a10]
> [binf316:21584] [ 4] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fe3355a984b]
> [binf316:21584] [ 5] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fe3355a8474]
> [binf316:21584] [ 6] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fe3355a1316]
> [binf316:21584] [ 7] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fe33558a817]
> [binf316:21584] [ 8] [binf316:21583] [ 0] 
> /lib64/libpthread.so.0(+0xf7c0)[0x7f3b586697c0]
> [binf316:21583] [ 1] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fe33a532a5e]
> [binf316:21584] [ 9] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fe3357ccd42]
> [binf316:21584] [10] /lib64/libc.so.6(gsignal+0x35)[0x7f3b58315b55]
> [binf316:21583] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f3b58317131]
> [binf316:21583] [ 3] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/li

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
Yep, TCP works fine when launched via Torque/qsub:

[binf315:fischega] $ mpirun -np 2 -mca btl tcp,sm,self ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
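
For reference, ring_c is one of the examples shipped with Open MPI, and the output above is what a successful run looks like: rank 0 injects a counter, every rank forwards it around the ring, and rank 0 decrements it on each lap until it reaches zero. A condensed sketch of that pattern (not the shipped source, which differs in details) looks like this:

/* Condensed ring sketch; not the ring_c.c that ships with Open MPI. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    if (0 == rank) {
        message = 10;
        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
               message, next, tag, size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* rank 0 still has one message in flight; drain it */
    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}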

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, June 06, 2014 10:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Huh - how strange. I can't imagine what it has to do with Torque vs rsh - this 
is failing when the openib BTL is trying to create the connection, which comes 
way after the launch is complete.

Are you able to run this with btl tcp,sm,self? If so, that would confirm that 
everything else is correct, and the problem truly is limited to the udcm 
itself...which shouldn't have anything to do with how the proc was launched.


On Jun 6, 2014, at 6:47 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:


Here are the results when logging in to the compute node via ssh and running as 
you suggest:

[binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

Here are the results when executing over Torque (launch the shell with "qsub -l 
nodes=2 -I"):

[binf316:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:21584] *** Process received signal ***
[binf316:21584] Signal: Aborted (6)
[binf316:21584] Signal code:  (-6)
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:21583] *** Process received signal ***
[binf316:21583] Signal: Aborted (6)
[binf316:21583] Signal code:  (-6)
[binf316:21584] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fe33a2637c0]
[binf316:21584] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fe339f0fb55]
[binf316:21584] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fe339f11131]
[binf316:21584] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fe339f08a10]
[binf316:21584] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fe3355a984b]
[binf316:21584] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fe3355a8474]
[binf316:21584] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fe3355a1316]
[binf316:21584] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fe33558a817]
[binf316:21584] [ 8] [binf316:21583] [ 0] 
/lib64/libpthread.so.0(+0xf7c0)[0x7f3b586697c0]
[binf316:21583] [ 1] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fe33a532a5e]
[binf316:21584] [ 9] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fe3357ccd42]
[binf316:21584] [10] /lib64/libc.so.6(gsignal+0x35)[0x7f3b58315b55]
[binf316:21583] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f3b58317131]
[binf316:21583] [ 3] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fe33a531d1b]
[binf316:21584] [11] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fe3344e7739]
[binf316:21584] [12] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f3b5830ea10]
[binf316:21583] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f3b539af84b]
[binf316:21583] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f3b539ae474]
[binf316:21583] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f3b539a7316]
[binf316:21583] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_open

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
o(mca_bml_r2_component_init+0x20)[0x7f3b53bd2d42]
> [binf316:21583] [10] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7fe33a4c533c]
> [binf316:21584] [14] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f3b58937d1b]
> [binf316:21583] [11] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f3b528ed739]
> [binf316:21583] [12] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f3b5895e9b2]
> [binf316:21583] [13] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7fe33a4fa386]
> [binf316:21584] [15] ring_c[0x40096f]
> [binf316:21584] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fe339efbc36]
> [binf316:21584] [17] ring_c[0x400889]
> [binf316:21584] *** End of error message ***
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f3b588cb33c]
> [binf316:21583] [14] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f3b58900386]
> [binf316:21583] [15] ring_c[0x40096f]
> [binf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36]
> [binf316:21583] [17] ring_c[0x400889]
> [binf316:21583] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 21583 on node 316 exited on 
> signal 6 (Aborted).
> --
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Thursday, June 05, 2014 7:57 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>  
> Hmmm...I'm not sure how that is going to run with only one proc (I don't know 
> if the program is protected against that scenario). If you run with -np 2 
> -mca btl openib,sm,self, is it happy?
>  
>  
> On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. <fisch...@westinghouse.com> 
> wrote:
> 
> 
> Here’s the command I’m invoking and the terminal output.  (Some of this 
> information doesn’t appear to be captured in the backtrace.)
>  
> [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
> ring_c: 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
>  udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [binf316:04549] *** Process received signal ***
> [binf316:04549] Signal: Aborted (6)
> [binf316:04549] Signal code:  (-6)
> [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
> [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
> [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
> [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
> [binf316:04549] [ 4] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
> [binf316:04549] [ 5] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
> [binf316:04549] [ 6] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
> [binf316:04549] [ 7] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
> [binf316:04549] [ 8] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
> [binf316:04549] [ 9] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
> [binf316:04549] [10] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
> [binf316:04549] [11] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
> [binf316:04549] [12] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
> [binf316:04549] [13] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
> [binf316:04549] [14] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
> [binf316:04549] [15] ring_c[0x40096f]
> [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
> [binf316:04549] [17] ring_c[0x400889]
> [binf316:04549] *** End of error message ***
> -

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
**
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f3b588cb33c]
[binf316:21583] [14] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f3b58900386]
[binf316:21583] [15] ring_c[0x40096f]
[binf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36]
[binf316:21583] [17] ring_c[0x400889]
[binf316:21583] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 21583 on node 316 exited on 
signal 6 (Aborted).
--

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, June 05, 2014 7:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Hmmm...I'm not sure how that is going to run with only one proc (I don't know 
if the program is protected against that scenario). If you run with -np 2 -mca 
btl openib,sm,self, is it happy?


On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:


Here's the command I'm invoking and the terminal output.  (Some of this 
information doesn't appear to be captured in the backtrace.)

[binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:04549] *** Process received signal ***
[binf316:04549] Signal: Aborted (6)
[binf316:04549] Signal code:  (-6)
[binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
[binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
[binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
[binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
[binf316:04549] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
[binf316:04549] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
[binf316:04549] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
[binf316:04549] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
[binf316:04549] [ 8] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
[binf316:04549] [ 9] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
[binf316:04549] [10] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
[binf316:04549] [11] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
[binf316:04549] [12] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
[binf316:04549] [13] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
[binf316:04549] [14] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
[binf316:04549] [15] ring_c[0x40096f]
[binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
[binf316:04549] [17] ring_c[0x400889]
[binf316:04549] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 4549 on node 316 exited on 
signal 6 (Aborted).
--

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent 
segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and 
recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL 
in a typical BASH environment. Everything appeared to work fine, so I went on 
my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing 
segfaults when submitting the applications through Torque. I recompiled OpenMPI 
with debug options, ran "ring_c" over the openib BTL in an interactive Torque 
session ("qsub -I"), and got the backtrace below. All other system settings 
described in the previous thread are the same. Any thoughts on how to resolve 
this issue?

Core was generated by `ring_c'.
Program termin

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Ralph Castain
Hmmm...I'm not sure how that is going to run with only one proc (I don't know 
if the program is protected against that scenario). If you run with -np 2 -mca 
btl openib,sm,self, is it happy?


On Jun 5, 2014, at 2:16 PM, Fischer, Greg A.  wrote:

> Here’s the command I’m invoking and the terminal output.  (Some of this 
> information doesn’t appear to be captured in the backtrace.)
>  
> [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
> ring_c: 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
>  udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [binf316:04549] *** Process received signal ***
> [binf316:04549] Signal: Aborted (6)
> [binf316:04549] Signal code:  (-6)
> [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
> [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
> [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
> [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
> [binf316:04549] [ 4] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
> [binf316:04549] [ 5] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
> [binf316:04549] [ 6] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
> [binf316:04549] [ 7] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
> [binf316:04549] [ 8] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
> [binf316:04549] [ 9] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
> [binf316:04549] [10] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
> [binf316:04549] [11] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
> [binf316:04549] [12] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
> [binf316:04549] [13] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
> [binf316:04549] [14] 
> //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
> [binf316:04549] [15] ring_c[0x40096f]
> [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
> [binf316:04549] [17] ring_c[0x400889]
> [binf316:04549] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 4549 on node 316 exited on 
> signal 6 (Aborted).
> --
>  
> From: Fischer, Greg A. 
> Sent: Thursday, June 05, 2014 5:10 PM
> To: us...@open-mpi.org
> Cc: Fischer, Greg A.
> Subject: openib segfaults with Torque
>  
> OpenMPI Users,
>  
> After encountering difficulty with the Intel compilers (see the “intermittent 
> segfaults with openib on ring_c.c” thread), I installed GCC-4.8.3 and 
> recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib 
> BTL in a typical BASH environment. Everything appeared to work fine, so I 
> went on my merry way compiling the rest of my dependencies.
>  
> After getting my dependencies and applications compiled, I began observing 
> segfaults when submitting the applications through Torque. I recompiled 
> OpenMPI with debug options, ran “ring_c” over the openib BTL in an 
> interactive Torque session (“qsub -I”), and got the backtrace below. All 
> other system settings described in the previous thread are the same. Any 
> thoughts on how to resolve this issue?
>  
> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
> #1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
> #2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
> #3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
> #4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
> #5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port 
> (btl=0x716680) at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6  

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
Here's the command I'm invoking and the terminal output.  (Some of this 
information doesn't appear to be captured in the backtrace.)

[binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:04549] *** Process received signal ***
[binf316:04549] Signal: Aborted (6)
[binf316:04549] Signal code:  (-6)
[binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
[binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
[binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
[binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
[binf316:04549] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
[binf316:04549] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
[binf316:04549] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
[binf316:04549] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
[binf316:04549] [ 8] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
[binf316:04549] [ 9] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
[binf316:04549] [10] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
[binf316:04549] [11] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
[binf316:04549] [12] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
[binf316:04549] [13] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
[binf316:04549] [14] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
[binf316:04549] [15] ring_c[0x40096f]
[binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
[binf316:04549] [17] ring_c[0x400889]
[binf316:04549] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 4549 on node 316 exited on 
signal 6 (Aborted).
--

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: us...@open-mpi.org
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent 
segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and 
recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL 
in a typical BASH environment. Everything appeared to work fine, so I went on 
my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing 
segfaults when submitting the applications through Torque. I recompiled OpenMPI 
with debug options, ran "ring_c" over the openib BTL in an interactive Torque 
session ("qsub -I"), and got the backtrace below. All other system settings 
described in the previous thread are the same. Any thoughts on how to resolve 
this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
#1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
#2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port 
(btl=0x716680) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f7f54885817 in btl_openib_component_init 
(num_btl_modules=0x7fff906aa420, enable_progress_threads=false, 
enable_mpi_threads=false)
at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f7f54ac7d42 in mca_bml_r2_component_init 

[OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent 
segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and 
recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL 
in a typical BASH environment. Everything appeared to work fine, so I went on 
my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing 
segfaults when submitting the applications through Torque. I recompiled OpenMPI 
with debug options, ran "ring_c" over the openib BTL in an interactive Torque 
session ("qsub -I"), and got the backtrace below. All other system settings 
described in the previous thread are the same. Any thoughts on how to resolve 
this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
#1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
#2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port 
(btl=0x716680) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f7f54885817 in btl_openib_component_init 
(num_btl_modules=0x7fff906aa420, enable_progress_threads=false, 
enable_mpi_threads=false)
at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, 
enable_progress_threads=false, enable_mpi_threads=false) at 
../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x7f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x7f7f539ed739 in mca_pml_ob1_component_init (priority=0x7fff906aa630, 
enable_progress_threads=false, enable_mpi_threads=false)
at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x7f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
#12 0x7f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, 
requested=0, provided=0x7fff906aa7d8) at 
../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
#13 0x7f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, argv=0x7fff906aa820) 
at pinit.c:84
#14 0x0040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19

Greg