Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-12 Thread Ralph Castain
Seems strange that it would have something to do with IB - it seems that alloc itself is failing, and at only 512 bytes, that doesn't seem like something IB would cause. If you write a little program that calls alloc (no MPI), does it also fail? On Aug 12, 2013, at 3:35 PM, Gus Correa
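A small no-MPI allocation test along the lines Ralph suggests could look like the sketch below. It is a Python stand-in for a little C program, calling the C library's malloc and free through ctypes. The function name try_malloc is made up for illustration, and the 512-byte default is taken from the message above. If the interpreter itself cannot start under the job's limits, that is already informative.

```python
import ctypes
import ctypes.util


def try_malloc(nbytes=512):
    """Call the C library's malloc directly; return True if it succeeded."""
    # Fall back to the process's own symbols if the C library is not found by name.
    libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.restype = None
    libc.free.argtypes = [ctypes.c_void_p]

    ptr = libc.malloc(nbytes)  # a NULL pointer comes back as None
    if not ptr:
        return False
    libc.free(ptr)
    return True


if __name__ == "__main__":
    print("malloc(512) succeeded" if try_malloc(512) else "malloc(512) failed")
```

Running it in the same batch environment as the failing job (not only in a login shell) is what makes the comparison meaningful.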

Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-12 Thread Gus Correa
Hi Ralph, Sorry if this is more of an IB than an OMPI problem, but from where I stand it shows up as OMPI jobs failing. Yes, indeed, I set memlock to unlimited in limits.conf and in the pbs_mom, restarted everything, and relaunched the job. The error message changes, but it still fails on
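One way to confirm that a memlock setting made in limits.conf and pbs_mom actually reaches the job is to print the limit from inside the job itself. The sketch below uses Python's standard resource module. The helper names describe_limit and memlock_limits are made up for illustration, and it assumes a Unix system where RLIMIT_MEMLOCK is defined.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits seen by this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"memlock soft limit: {soft}")
    print(f"memlock hard limit: {hard}")
```

Submitting this as a batch job and comparing its output with the same script run in an interactive shell shows whether the two environments see the same limit.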

Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-12 Thread Ralph Castain
No, this has nothing to do with the registration limit. For some reason, the system is refusing to create a thread - i.e., it is pthread_create that is failing. I have no idea what would be causing that to happen. Try setting it to unlimited and see if it allows the thread to start, I guess.
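To check thread creation on its own, outside of Open MPI, a minimal sketch such as the following can be run in the job environment. It is written in Python rather than C, so it exercises thread creation only indirectly through the interpreter. The function name try_start_threads and the default thread count are assumptions for illustration.

```python
import threading


def try_start_threads(count=4):
    """Start `count` trivial threads; return (number_started, error_message)."""
    started = []
    error = None
    for _ in range(count):
        thread = threading.Thread(target=lambda: None)
        try:
            thread.start()
        except RuntimeError as exc:
            # Python reports a failed thread creation as RuntimeError.
            error = str(exc)
            break
        started.append(thread)
    for thread in started:
        thread.join()
    return len(started), error


if __name__ == "__main__":
    n, err = try_start_threads()
    if err is None:
        print(f"started {n} threads without error")
    else:
        print(f"thread creation failed after {n} threads: {err}")
```

If this fails in the batch job but works in a login shell, the difference points at the job's environment rather than at Open MPI.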

Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-12 Thread Gus Correa
Hi Ralph, all, I include more information below, after turning on btl_openib_verbose 30. As you can see, OMPI tries, and fails, to load openib. Last week I reduced the memlock limit from unlimited to ~12GB, as part of a general attempt to rein in memory use/abuse by jobs sharing a node. No

Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-12 Thread Gus Correa
Thank you for the prompt help, Ralph! Yes, it is OMPI 1.4.3 built with openib support:
$ ompi_info | grep openib
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
There are three libraries in prefix/lib/openmpi, no mca_btl_openib library.
$ ls $PREFIX/lib/openmpi/

Re: [OMPI users] Fault Tolerant Features in OpenMPI

2013-08-12 Thread Edson Tavares de Camargo
Hi, George! I studied the ULFM document before beginning the tests with failure detection in Open MPI, and it seems to me a good choice. But I'm having trouble with the ULFM-enabled version of Open MPI (openmpi-1.7ft_b3.tar.gz). I followed the ULFM setup (at http://fault-tolerance.org/ulfm/ulfm-setup/).

Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-12 Thread Ralph Castain
Check ompi_info - was it built with openib support? Then check that the mca_btl_openib library is present in the prefix/lib/openmpi directory. Sounds like it isn't finding the openib plugin. On Aug 12, 2013, at 11:57 AM, Gus Correa wrote: > Dear Open MPI pros > > On
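The second check, whether an mca_btl_openib file exists under prefix/lib/openmpi, can be scripted as in the sketch below. The function name find_openib_plugin and the OMPI_PREFIX environment variable are assumptions for illustration, not part of Open MPI. An empty result only means that no such plugin file sits in that directory.

```python
import glob
import os


def find_openib_plugin(prefix):
    """List files named mca_btl_openib* under <prefix>/lib/openmpi."""
    pattern = os.path.join(prefix, "lib", "openmpi", "mca_btl_openib*")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # OMPI_PREFIX is a hypothetical variable holding the install prefix.
    prefix = os.environ.get("OMPI_PREFIX", "/usr/local")
    matches = find_openib_plugin(prefix)
    if matches:
        for path in matches:
            print("found:", path)
    else:
        print("no mca_btl_openib file under", os.path.join(prefix, "lib", "openmpi"))
```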

[OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled

2013-08-12 Thread Gus Correa
Dear Open MPI pros, On one of the clusters here, which has InfiniBand, I am getting this type of error from OpenMPI 1.4.3 (OK, I know it is old ...): * Tcl_InitNotifier: unable to start notifier thread Abort: Command not found.

Re: [OMPI users] Fault Tolerant Features in OpenMPI

2013-08-12 Thread George Bosilca
Edson, Based on your questions I would suggest you take a look at the ULFM-enabled version of Open MPI. You can find it at http://fault-tolerance.org/. George. On Aug 11, 2013, at 15:33 , Edson Tavares de Camargo wrote: > Thanks a lot for your reply, Ralph! > >