Re: [OMPI users] openib segfaults with Torque
This sounds credible. When I log in via Torque, I see:

[binf316:fischega] $ ulimit -l
64

but when I log in via ssh, I see:

[binf316:fischega] $ ulimit -l
unlimited

I'll have my administrator make the changes and give that a shot. Thanks, everyone!
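The gap Greg reports (64 kB inside a Torque job versus unlimited over ssh) is easy to screen for in a shell. Below is a small helper sketch; the 1 GB threshold is an arbitrary illustration, not a value taken from Open MPI:

```shell
# check_memlock: classify a memlock limit as reported by "ulimit -l"
# (a number in kB, or the string "unlimited"). The 1 GB threshold is
# an arbitrary illustration, not an Open MPI constant.
check_memlock() {
    lim="$1"
    if [ "$lim" = "unlimited" ]; then
        echo "ok: memlock unlimited"
    elif [ "$lim" -lt 1048576 ]; then
        echo "too small: ${lim} kB"
    else
        echo "ok: ${lim} kB"
    fi
}

check_memlock "$(ulimit -l)"   # whatever limit this shell inherited
check_memlock 64               # the value Greg sees inside a Torque job
```

Run once over ssh and once inside a `qsub -I` session; with Greg's numbers the second call flags the Torque environment.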
Re: [OMPI users] openib segfaults with Torque
If that could help, Greg, on the compute nodes I normally add this to /etc/security/limits.conf:

*  -  memlock  -1
*  -  stack    -1
*  -  nofile   32768

and

ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited

to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which should be sourced by the former). Other values are possible, of course. My recollection is that the boilerplate init scripts that come with Torque don't change those limits.

I suppose this causes the pbs_mom child processes, including the user job script and whatever processes it starts (mpiexec, etc.), to inherit those limits. Or not?

Gus Correa
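Whether the pbs_mom children really do inherit those limits, as Gus asks, can be checked empirically. A minimal sketch (function name is illustrative) that prints the limits the current process actually received:

```shell
# print_limits: show the limits this process inherited. Run it inside a
# Torque job (e.g. from a "qsub -I" session or a batch job script) and
# again over ssh on the same node, then compare the two reports.
print_limits() {
    echo "memlock (ulimit -l): $(ulimit -l)"
    echo "stack   (ulimit -s): $(ulimit -s)"
    echo "nofile  (ulimit -n): $(ulimit -n)"
}
print_limits
```

If the values differ between the two environments, the limits were changed (or left unchanged) somewhere in the pbs_mom startup path rather than by the login machinery.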
Re: [OMPI users] openib segfaults with Torque
It isn't really Torque that is imposing those constraints:

- the torque_mom init script inherits from the OS whatever ulimits are in effect at that time;
- each job inherits the ulimits from the pbs_mom.

Thus, you need to change the ulimits from whatever is set at startup time, e.g., in /etc/sysconfig/torque_mom:

ulimit -d unlimited
ulimit -s unlimited
ulimit -n 32768
ulimit -l 2097152

or whatever you consider to be reasonable.

Cheers,
Martin

--
Martin Siegert
WestGrid/ComputeCanada
Simon Fraser University
Burnaby, British Columbia
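The inheritance Martin describes can be demonstrated in any shell: a limit set in a parent is what every child sees, which is exactly how jobs pick up pbs_mom's limits. The example lowers the soft nofile limit, since lowering never needs privileges; the value 512 is arbitrary and assumes the hard limit is at least that high:

```shell
# A child process sees whatever ulimits its parent established - the same
# mechanism by which Torque jobs inherit limits from pbs_mom. Lowering a
# soft limit needs no privileges; raising one may.
( ulimit -S -n 512; sh -c 'echo "child sees nofile = $(ulimit -n)"' )
```

The subshell plays the role of pbs_mom here; the inner `sh -c` plays the role of the job script.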
Re: [OMPI users] openib segfaults with Torque
+1

On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Yeah, I think we've seen that somewhere before too...
Re: [OMPI users] openib segfaults with Torque
Yeah, I think we've seen that somewhere before too...
Re: [OMPI users] openib segfaults with Torque
Agreed. The problem is not with UDCM. I don't think something is wrong with the system. I think his Torque is imposing major constraints on the maximum size that can be locked into memory.

Josh
Re: [OMPI users] openib segfaults with Torque
Probably won't help to use RDMACM, though, as you will just see the resource failure somewhere else. UDCM is not the problem. Something is wrong with the system. Allocating a 512-entry CQ should not fail.

-Nathan
Re: [OMPI users] openib segfaults with Torque
I'm guessing it's a resource limitation issue coming from Torque.

Hmmm... I found something interesting on the interwebs that looks awfully similar:

http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html

Greg, if the suggestion from the Torque users ("...adding the following line 'ulimit -l unlimited' to pbs_mom and restarting pbs_mom.") doesn't resolve your issue, try using the RDMACM CPC instead of UDCM (which is a pretty recent addition to the openib BTL) by setting:

-mca btl_openib_cpc_include rdmacm

Josh
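The two candidate workarounds in this message can be combined in the job script itself. A sketch, assuming the hard memlock limit permits raising the soft one from within the job (if pbs_mom itself is capped, as the supercluster.org thread suggests, only fixing pbs_mom's startup environment helps); "ring_c" is the test program from this thread, and the mpirun step is skipped when mpirun is not on PATH:

```shell
#!/bin/sh
# Sketch of a job script applying both suggestions: raise the memlock
# limit if the hard limit allows it, then select the RDMACM CPC instead
# of UDCM. Not an official Open MPI recipe - just the thread's two
# workarounds in one place.
ulimit -l unlimited 2>/dev/null \
    || echo "note: could not raise memlock; fix limits for pbs_mom instead"
echo "memlock now: $(ulimit -l)"
if command -v mpirun >/dev/null 2>&1; then
    mpirun -mca btl openib,self -mca btl_openib_cpc_include rdmacm -np 2 ring_c
fi
```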
Re: [OMPI users] openib segfaults with Torque
Mellanox --

What would cause a CQ to fail to be created?

On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." <fisch...@westinghouse.com> wrote:

> Is there any other workaround that I might try? Something that avoids UDCM?
Re: [OMPI users] openib segfaults with Torque
Is there any other workaround that I might try? Something that avoids UDCM?
Re: [OMPI users] openib segfaults with Torque
[binf316:fischega] $ ulimit -m
unlimited

Greg
Re: [OMPI users] openib segfaults with Torque
Out of curiosity, what is the mlock limit on your system? If it is too low, that can cause ibv_create_cq to fail. To check, run "ulimit -m".

-Nathan Hjelm
Application Readiness, HPC-5, LANL
Re: [OMPI users] openib segfaults with Torque
Yes, this fails on all nodes on the system, except for the head node.

The uptime of the system isn't significant. Maybe 1 week, and it's received basically no use.
Re: [OMPI users] openib segfaults with Torque
Well, that's interesting. The output shows that ibv_create_cq is failing. Strange, since an identical call had just succeeded (udcm creates two completion queues). Some questions that might indicate where the failure lies:

Does this fail on any other node in your system?

How long has the node been up?

-Nathan Hjelm
Application Readiness, HPC-5, LANL
Re: [OMPI users] openib segfaults with Torque
Jeff/Nathan,

I ran the following with my debug build of OpenMPI 1.8.1 - after opening a terminal on a compute node with "qsub -l nodes=2 -I":

mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt

Output and backtrace are attached. Let me know if I can provide anything else.

Thanks for looking into this,
Greg

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
#1  0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6
#2  0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f8b66497817 in btl_openib_component_init (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x7f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x7f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x7f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_se
Re: [OMPI users] openib segfaults with Torque
Greg:

Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openib segfaults with Torque
On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
> Nathan -- is that anywhere close to correct?

Nope. udcm_module_finalize is being called because there was an error setting up the udcm state. See btl_openib_connect_udcm.c:476. The opal_list_t destructor is getting an assert failure, probably because the constructor wasn't called. I can rearrange the constructors to be called first, but there appears to be a deeper issue with the user's system: udcm_module_init should not be failing! It creates a couple of CQs, allocates a small number of registered buffers, and starts monitoring the fd for the completion channel. All these things are also done in the setup of the openib btl itself. Keep in mind that the openib btl will not disqualify itself when running single server. Openib may be used to communicate on node and is needed for the dynamics case.

The user might try adding -mca btl_base_verbose 100 to shed some light on what the real issue is.

BTW, I no longer monitor the user mailing list. If something needs my attention, forward it to me directly.

-Nathan
Re: [OMPI users] openib segfaults with Torque
I seem to recall that you have an IB-based cluster, right?

From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is:

- you run the job on a single server
- openib disqualifies itself because you're running on a single server
- openib then goes to finalize/close itself
- but openib didn't fully initialize itself (because it disqualified itself early in the initialization process), and something in the finalization process didn't take that into account

Nathan -- is that anywhere close to correct?

On Jun 5, 2014, at 5:10 PM, "Fischer, Greg A." wrote:

> OpenMPI Users,
>
> After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies.
>
> After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue?
>
> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 > (gdb) bt > #0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 > #1 0x7f7f5920c0c5 in abort () from /lib64/libc.so.6 > #2 0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6 > #3 0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) > at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734 > #4 0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) > at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476 > #5 0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port > (btl=0x716680) at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273 > #6 0x7f7f54885817 in btl_openib_component_init > (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, > enable_mpi_threads=false) > at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703 > #7 0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, > enable_mpi_threads=false) at > ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108 > #8 0x7f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, > enable_progress_threads=false, enable_mpi_threads=false) at > ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88 > #9 0x7f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, > enable_mpi_threads=false) at > ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69 > #10 0x7f7f539ed739 in mca_pml_ob1_component_init > (priority=0x7fff906aa630, enable_progress_threads=false, > enable_mpi_threads=false) > at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271 > #11 0x7f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, > enable_mpi_threads=false) at > ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128 > #12 0x7f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, > requested=0, 
provided=0x7fff906aa7d8) at > ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604 > #13 0x7f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, > argv=0x7fff906aa820) at pinit.c:84 > #14 0x0040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19 > > Greg > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openib segfaults with Torque
Fascinating - I can only assume that Torque is setting something in the environment that is creating the confusion. Sadly, Nathan is at the MPI Forum this week, so we may have to wait until Mon to get his input on the problem as he wrote the udcm code. On Jun 6, 2014, at 8:51 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote: > Yep, TCP works fine when launched via Torque/qsub: > > [binf315:fischega] $ mpirun -np 2 -mca btl tcp,sm,self ring_c > Process 0 sending 10 to 1, tag 201 (2 processes in ring) > Process 0 sent to 1 > Process 0 decremented value: 9 > Process 0 decremented value: 8 > Process 0 decremented value: 7 > Process 0 decremented value: 6 > Process 0 decremented value: 5 > Process 0 decremented value: 4 > Process 0 decremented value: 3 > Process 0 decremented value: 2 > Process 0 decremented value: 1 > Process 0 decremented value: 0 > Process 0 exiting > Process 1 exiting > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Friday, June 06, 2014 10:34 AM > To: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > Huh - how strange. I can't imagine what it has to do with Torque vs rsh - > this is failing when the openib BTL is trying to create the connection, which > comes way after the launch is complete. > > Are you able to run this with btl tcp,sm,self? If so, that would confirm that > everything else is correct, and the problem truly is limited to the udcm > itself...which shouldn't have anything to do with how the proc was launched. > > > On Jun 6, 2014, at 6:47 AM, Fischer, Greg A. 
<fisch...@westinghouse.com> > wrote: > > > Here are the results when logging in to the compute node via ssh and running > as you suggest: > > [binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c > Process 0 sending 10 to 1, tag 201 (2 processes in ring) > Process 0 sent to 1 > Process 0 decremented value: 9 > Process 0 decremented value: 8 > Process 0 decremented value: 7 > Process 0 decremented value: 6 > Process 0 decremented value: 5 > Process 0 decremented value: 4 > Process 0 decremented value: 3 > Process 0 decremented value: 2 > Process 0 decremented value: 1 > Process 0 decremented value: 0 > Process 0 exiting > Process 1 exiting > > Here are the results when executing over Torque (launch the shell with “qsub > -l nodes=2 –I”): > > [binf316:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c > ring_c: > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: > udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == > ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. > [binf316:21584] *** Process received signal *** > [binf316:21584] Signal: Aborted (6) > [binf316:21584] Signal code: (-6) > ring_c: > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: > udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == > ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. 
> [binf316:21583] *** Process received signal *** > [binf316:21583] Signal: Aborted (6) > [binf316:21583] Signal code: (-6) > [binf316:21584] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fe33a2637c0] > [binf316:21584] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fe339f0fb55] > [binf316:21584] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fe339f11131] > [binf316:21584] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fe339f08a10] > [binf316:21584] [ 4] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fe3355a984b] > [binf316:21584] [ 5] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fe3355a8474] > [binf316:21584] [ 6] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fe3355a1316] > [binf316:21584] [ 7] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fe33558a817] > [binf316:21584] [ 8] [binf316:21583] [ 0] > /lib64/libpthread.so.0(+0xf7c0)[0x7f3b586697c0] > [binf316:21583] [ 1] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fe33a532a5e] > [binf316:21584] [ 9] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fe3357ccd42] > [binf316:21584] [10] /lib64/libc.so.6(gsignal+0x35)[0x7f3b58315b55] > [binf316:21583] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f3b58317131] > [binf316:21583] [ 3] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/li
Re: [OMPI users] openib segfaults with Torque
Yep, TCP works fine when launched via Torque/qsub: [binf315:fischega] $ mpirun -np 2 -mca btl tcp,sm,self ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0 decremented value: 6 Process 0 decremented value: 5 Process 0 decremented value: 4 Process 0 decremented value: 3 Process 0 decremented value: 2 Process 0 decremented value: 1 Process 0 decremented value: 0 Process 0 exiting Process 1 exiting From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Friday, June 06, 2014 10:34 AM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Huh - how strange. I can't imagine what it has to do with Torque vs rsh - this is failing when the openib BTL is trying to create the connection, which comes way after the launch is complete. Are you able to run this with btl tcp,sm,self? If so, that would confirm that everything else is correct, and the problem truly is limited to the udcm itself...which shouldn't have anything to do with how the proc was launched. On Jun 6, 2014, at 6:47 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Here are the results when logging in to the compute node via ssh and running as you suggest: [binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0 decremented value: 6 Process 0 decremented value: 5 Process 0 decremented value: 4 Process 0 decremented value: 3 Process 0 decremented value: 2 Process 0 decremented value: 1 Process 0 decremented value: 0 Process 0 exiting Process 1 exiting Here are the results when executing over Torque (launch the shell with "qsub -l nodes=2 -I"): [binf316:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. [binf316:21584] *** Process received signal *** [binf316:21584] Signal: Aborted (6) [binf316:21584] Signal code: (-6) ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. 
[binf316:21583] *** Process received signal *** [binf316:21583] Signal: Aborted (6) [binf316:21583] Signal code: (-6) [binf316:21584] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fe33a2637c0] [binf316:21584] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fe339f0fb55] [binf316:21584] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fe339f11131] [binf316:21584] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fe339f08a10] [binf316:21584] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fe3355a984b] [binf316:21584] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fe3355a8474] [binf316:21584] [ 6] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fe3355a1316] [binf316:21584] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fe33558a817] [binf316:21584] [ 8] [binf316:21583] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f3b586697c0] [binf316:21583] [ 1] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fe33a532a5e] [binf316:21584] [ 9] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fe3357ccd42] [binf316:21584] [10] /lib64/libc.so.6(gsignal+0x35)[0x7f3b58315b55] [binf316:21583] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f3b58317131] [binf316:21583] [ 3] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fe33a531d1b] [binf316:21584] [11] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fe3344e7739] [binf316:21584] [12] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f3b5830ea10] [binf316:21583] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f3b539af84b] [binf316:21583] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f3b539ae474] [binf316:21583] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f3b539a7316] [binf316:21583] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_open
Re: [OMPI users] openib segfaults with Torque
o(mca_bml_r2_component_init+0x20)[0x7f3b53bd2d42] > [binf316:21583] [10] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7fe33a4c533c] > [binf316:21584] [14] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f3b58937d1b] > [binf316:21583] [11] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f3b528ed739] > [binf316:21583] [12] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f3b5895e9b2] > [binf316:21583] [13] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7fe33a4fa386] > [binf316:21584] [15] ring_c[0x40096f] > [binf316:21584] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fe339efbc36] > [binf316:21584] [17] ring_c[0x400889] > [binf316:21584] *** End of error message *** > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f3b588cb33c] > [binf316:21583] [14] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f3b58900386] > [binf316:21583] [15] ring_c[0x40096f] > [binf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36] > [binf316:21583] [17] ring_c[0x400889] > [binf316:21583] *** End of error message *** > -- > mpirun noticed that process rank 0 with PID 21583 on node 316 exited on > signal 6 (Aborted). > -- > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Thursday, June 05, 2014 7:57 PM > To: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > Hmmm...I'm not sure how that is going to run with only one proc (I don't know > if the program is protected against that scenario). If you run with -np 2 > -mca btl openib,sm,self, is it happy? > > > On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. <fisch...@westinghouse.com> > wrote: > > > Here’s the command I’m invoking and the terminal output. 
(Some of this > information doesn’t appear to be captured in the backtrace.) > > [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c > ring_c: > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: > udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == > ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. > [binf316:04549] *** Process received signal *** > [binf316:04549] Signal: Aborted (6) > [binf316:04549] Signal code: (-6) > [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0] > [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55] > [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131] > [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10] > [binf316:04549] [ 4] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b] > [binf316:04549] [ 5] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474] > [binf316:04549] [ 6] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316] > [binf316:04549] [ 7] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817] > [binf316:04549] [ 8] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e] > [binf316:04549] [ 9] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42] > [binf316:04549] [10] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b] > [binf316:04549] [11] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739] > [binf316:04549] [12] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2] > 
[binf316:04549] [13] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c] > [binf316:04549] [14] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386] > [binf316:04549] [15] ring_c[0x40096f] > [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36] > [binf316:04549] [17] ring_c[0x400889] > [binf316:04549] *** End of error message *** > -
Re: [OMPI users] openib segfaults with Torque
** //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f3b588cb33c] [binf316:21583] [14] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f3b58900386] [binf316:21583] [15] ring_c[0x40096f] [binf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36] [binf316:21583] [17] ring_c[0x400889] [binf316:21583] *** End of error message *** -- mpirun noticed that process rank 0 with PID 21583 on node 316 exited on signal 6 (Aborted). -- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Thursday, June 05, 2014 7:57 PM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is protected against that scenario). If you run with -np 2 -mca btl openib,sm,self, is it happy? On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in the backtrace.) [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. 
[binf316:04549] *** Process received signal *** [binf316:04549] Signal: Aborted (6) [binf316:04549] Signal code: (-6) [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0] [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55] [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131] [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10] [binf316:04549] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b] [binf316:04549] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474] [binf316:04549] [ 6] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316] [binf316:04549] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817] [binf316:04549] [ 8] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e] [binf316:04549] [ 9] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42] [binf316:04549] [10] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b] [binf316:04549] [11] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739] [binf316:04549] [12] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2] [binf316:04549] [13] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c] [binf316:04549] [14] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386] [binf316:04549] [15] ring_c[0x40096f] [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36] [binf316:04549] [17] ring_c[0x400889] [binf316:04549] *** End of error message *** -- mpirun 
noticed that process rank 0 with PID 4549 on node 316 exited on signal 6 (Aborted).
--

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
Re: [OMPI users] openib segfaults with Torque
Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is protected against that scenario). If you run with -np 2 -mca btl openib,sm,self, is it happy? On Jun 5, 2014, at 2:16 PM, Fischer, Greg A.wrote: > Here’s the command I’m invoking and the terminal output. (Some of this > information doesn’t appear to be captured in the backtrace.) > > [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c > ring_c: > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: > udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == > ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. > [binf316:04549] *** Process received signal *** > [binf316:04549] Signal: Aborted (6) > [binf316:04549] Signal code: (-6) > [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0] > [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55] > [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131] > [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10] > [binf316:04549] [ 4] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b] > [binf316:04549] [ 5] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474] > [binf316:04549] [ 6] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316] > [binf316:04549] [ 7] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817] > [binf316:04549] [ 8] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e] > [binf316:04549] [ 9] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42] > [binf316:04549] [10] > 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b] > [binf316:04549] [11] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739] > [binf316:04549] [12] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2] > [binf316:04549] [13] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c] > [binf316:04549] [14] > //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386] > [binf316:04549] [15] ring_c[0x40096f] > [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36] > [binf316:04549] [17] ring_c[0x400889] > [binf316:04549] *** End of error message *** > -- > mpirun noticed that process rank 0 with PID 4549 on node 316 exited on > signal 6 (Aborted). > -- > > From: Fischer, Greg A. > Sent: Thursday, June 05, 2014 5:10 PM > To: us...@open-mpi.org > Cc: Fischer, Greg A. > Subject: openib segfaults with Torque > > OpenMPI Users, > > After encountering difficulty with the Intel compilers (see the “intermittent > segfaults with openib on ring_c.c” thread), I installed GCC-4.8.3 and > recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib > BTL in a typical BASH environment. Everything appeared to work fine, so I > went on my merry way compiling the rest of my dependencies. > > After getting my dependencies and applications compiled, I began observing > segfaults when submitting the applications through Torque. I recompiled > OpenMPI with debug options, ran “ring_c” over the openib BTL in an > interactive Torque session (“qsub –I”), and got the backtrace below. All > other system settings described in the previous thread are the same. Any > thoughts on how to resolve this issue? > > Core was generated by `ring_c'. > Program terminated with signal 6, Aborted. 
> #0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 > (gdb) bt > #0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 > #1 0x7f7f5920c0c5 in abort () from /lib64/libc.so.6 > #2 0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6 > #3 0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) > at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734 > #4 0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) > at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476 > #5 0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port > (btl=0x716680) at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273 > #6
Re: [OMPI users] openib segfaults with Torque
Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in the backtrace.) [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. [binf316:04549] *** Process received signal *** [binf316:04549] Signal: Aborted (6) [binf316:04549] Signal code: (-6) [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0] [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55] [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131] [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10] [binf316:04549] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b] [binf316:04549] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474] [binf316:04549] [ 6] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316] [binf316:04549] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817] [binf316:04549] [ 8] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e] [binf316:04549] [ 9] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42] [binf316:04549] [10] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b] [binf316:04549] [11] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739] [binf316:04549] [12] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2] [binf316:04549] 
[13] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c] [binf316:04549] [14] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386] [binf316:04549] [15] ring_c[0x40096f] [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36] [binf316:04549] [17] ring_c[0x400889] [binf316:04549] *** End of error message *** -- mpirun noticed that process rank 0 with PID 4549 on node 316 exited on signal 6 (Aborted). -- From: Fischer, Greg A. Sent: Thursday, June 05, 2014 5:10 PM To: us...@open-mpi.org Cc: Fischer, Greg A. Subject: openib segfaults with Torque OpenMPI Users, After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies. After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue? Core was generated by `ring_c'. Program terminated with signal 6, Aborted. 
#0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 (gdb) bt #0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 #1 0x7f7f5920c0c5 in abort () from /lib64/libc.so.6 #2 0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6 #3 0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734 #4 0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476 #5 0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x716680) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273 #6 0x7f7f54885817 in btl_openib_component_init (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703 #7 0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108 #8 0x7f7f54ac7d42 in mca_bml_r2_component_init
[OMPI users] openib segfaults with Torque
OpenMPI Users, After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies. After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue? Core was generated by `ring_c'. Program terminated with signal 6, Aborted. #0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 (gdb) bt #0 0x7f7f5920ab55 in raise () from /lib64/libc.so.6 #1 0x7f7f5920c0c5 in abort () from /lib64/libc.so.6 #2 0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6 #3 0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734 #4 0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476 #5 0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x716680) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273 #6 0x7f7f54885817 in btl_openib_component_init (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703 #7 0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108 #8 0x7f7f54ac7d42 in 
mca_bml_r2_component_init (priority=0x7fff906aa4f4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88 #9 0x7f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69 #10 0x7f7f539ed739 in mca_pml_ob1_component_init (priority=0x7fff906aa630, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271 #11 0x7f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128 #12 0x7f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, requested=0, provided=0x7fff906aa7d8) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604 #13 0x7f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, argv=0x7fff906aa820) at pinit.c:84 #14 0x0040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19 Greg