Hi Ralph
Thank you.
I switched back to memlock unlimited, rebooted the nodes,
and after that OpenMPI is working correctly with InfiniBand.
As for why the problem happened in the first place,
I can only think that somehow the InfiniBand kernel modules and
driver didn't like my reducing the memlock limit.
Seems strange that it would have something to do with IB - it seems that alloc
itself is failing, and at only 512 bytes, that doesn't seem like something IB
would cause.
If you write a little program that calls alloc (no MPI), does it also fail?
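A standalone check along those lines might look like the sketch below. It is written in Python and reaches the C library's malloc through ctypes, so no MPI is involved. The helper name try_malloc is made up for illustration, and the sketch assumes a Linux-like system where ctypes can locate libc.

```python
import ctypes
import ctypes.util


def try_malloc(nbytes=512):
    """Return True if libc malloc(nbytes) succeeds, False if it returns NULL."""
    # find_library may return None on minimal systems; CDLL(None) then falls
    # back to the symbols already loaded into the process, which include libc.
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.restype = None
    libc.free.argtypes = [ctypes.c_void_p]

    ptr = libc.malloc(nbytes)
    if not ptr:  # NULL pointer comes back as None
        return False
    libc.free(ptr)
    return True


if __name__ == "__main__":
    print("malloc(512) succeeded" if try_malloc(512) else "malloc(512) FAILED")
```

Running this under the same limits as the failing job (for example from inside a batch job) would show whether a plain 512-byte allocation fails outside of MPI.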
On Aug 12, 2013, at 3:35 PM, Gus Correa wrote:
Hi Ralph
Sorry if this is more of an IB than an OMPI problem,
but from where I stand it shows up as OMPI jobs failing.
Yes, indeed I set memlock to unlimited in limits.conf
and in the pbs_mom, restarted everything, and relaunched the job.
The error message changes, but it still fails on
No, this has nothing to do with the registration limit. For some reason, the
system is refusing to create a thread - i.e., it is pthread_create that is
failing. I have no idea what would be causing that to happen.
Try setting it to unlimited and see if it allows the thread to start, I guess.
Hi Ralph, all
I include more information below,
after turning on btl_openib_verbose 30.
As you can see, OMPI tries, and fails, to load openib.
Last week I reduced the memlock limit from unlimited
to ~12GB, as part of a general attempt to rein in memory
use/abuse by jobs sharing a node.
No
Thank you for the prompt help, Ralph!
Yes, it is OMPI 1.4.3 built with openib support:
$ ompi_info | grep openib
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
There are three libraries in prefix/lib/openmpi,
but no mca_btl_openib library.
$ ls $PREFIX/lib/openmpi/
Check ompi_info - was it built with openib support?
Then check that the mca_btl_openib library is present in the prefix/lib/openmpi
directory
Sounds like it isn't finding the openib plugin
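The second check can be scripted roughly as follows. This is only a sketch: find_openib_plugin is a hypothetical helper, and the installation prefix is read from a PREFIX environment variable as an assumption about how the install location is known.

```python
import os
from pathlib import Path


def find_openib_plugin(prefix):
    """List files named mca_btl_openib* under <prefix>/lib/openmpi."""
    plugin_dir = Path(prefix) / "lib" / "openmpi"
    if not plugin_dir.is_dir():
        return []
    return sorted(p.name for p in plugin_dir.glob("mca_btl_openib*"))


if __name__ == "__main__":
    # PREFIX is assumed to hold the Open MPI installation prefix.
    found = find_openib_plugin(os.environ.get("PREFIX", ""))
    print(found if found else "no mca_btl_openib plugin found")
```

An empty result only says that no separate plugin file with that name sits in the directory; it does not by itself show how the component was built.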
On Aug 12, 2013, at 11:57 AM, Gus Correa wrote:
Dear Open MPI pros
On one of the clusters here, which has InfiniBand,
I am getting this type of error from
OpenMPI 1.4.3 (OK, I know it is old ...):
*
Tcl_InitNotifier: unable to start notifier thread
Abort: Command not found.