On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
$ ompi_info | grep oob
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
Good!
Is there a chance that there's some dependent library of oob_rml
that is available on your head/build node, but not available on
your back-end nodes? (that would be pretty odd, though)
Very unlikely. Unless you don't install it at "make install" time,
it is there. Host and target are the same (identical).
Any particular library (set of libraries) to check?
Actually, the output below seems to indicate that the modules are
being *loaded* ok, but they're declining to run for some reason. So I
think we can rule out the dependent libraries issue.
$ mpirun --mca rml_base_debug 100 -np 2 skosfile
[asau.local:09060] mca: base: components_open: Looking for rml components
[asau.local:09060] mca: base: components_open: distilling rml components
[asau.local:09060] mca: base: components_open: accepting all rml components
[asau.local:09060] mca: base: components_open: opening rml components
[asau.local:09060] mca: base: components_open: found loaded component oob
[asau.local:09060] mca: base: components_open: component oob open function successful
[asau.local:09060] orte_rml_base_select: initializing rml component oob
[asau.local:09060] orte_rml_base_select: init returned failure
Ah ha -- this is progress. For some reason, your "oob" RML plugin is
declining to run. I see that its query/initialization function is
actually quite short:
if(mca_oob_base_init() != ORTE_SUCCESS)
    return NULL;
*priority = 1;
return &orte_rml_oob_module;
So it must be failing in the mca_oob_base_init() function -- this is what
initializes the underlying "OOB" (out of band) communications subsystem.
Of course, this doesn't fail often, so we don't have any run-time
switches to enable the debugging output. :-( Edit
orte/mca/oob/base/oob_base_open.c line 43 and change the value of
mca_oob_base_output from -1 to 0. Let's see that output -- I'm
particularly interested in the output from querying the tcp oob
component. I suspect that it's declining to run as well.
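
To make the "declining to run" idea concrete, here is a minimal sketch of
the general pattern that the query function quoted above follows: each
component's init hook either returns a module and sets a priority, or
returns NULL to decline, and selection fails when every component
declines. All of the names below (module_t, component_t,
select_component, init_declines, init_accepts, demo_module) are
hypothetical stand-ins for illustration; this is not the actual ORTE
selection code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a module; NOT the real ORTE type. */
typedef struct { const char *name; } module_t;

/* A component's init hook: returns a module and sets *priority,
 * or returns NULL to decline to run. */
typedef module_t *(*init_fn)(int *priority);

typedef struct {
    const char *name;
    init_fn init;
} component_t;

/* Ask every component to initialize and keep the highest-priority
 * one that did not decline. Returns NULL if all of them declined. */
static module_t *select_component(const component_t *comps, size_t n)
{
    module_t *best = NULL;
    int best_pri = -1;

    assert(comps != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        int pri = 0;
        module_t *m = comps[i].init(&pri);
        if (m == NULL)
            continue;               /* this component declined */
        if (best == NULL || pri > best_pri) {
            best = m;
            best_pri = pri;
        }
    }
    return best;
}

/* Two toy init hooks: one always declines, one always accepts. */
static module_t demo_module = { "demo" };

static module_t *init_declines(int *priority)
{
    (void)priority;
    return NULL;
}

static module_t *init_accepts(int *priority)
{
    *priority = 1;
    return &demo_module;
}
```

With only init_declines in the list, select_component() returns NULL,
which is the situation the "init returned failure" line above points to.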
I wonder if this is going to end up being an opal_if() issue -- where
we are traversing all the IP network interfaces from the kernel...
I'll bet even money that it is.
Specifically: I predict that the tcp oob component is declining to run
(which then causes the greater OOB init to fail, because no OOB
components will be able to be found, which then causes the RML OOB
init to fail, and therefore RML init fails because no RML components
can be found). My guess is that
orte/mca/oob/tcp/oob_tcp.c:oob_tcp_component_init() is failing to find
any valid/UP IP interfaces. It starts traversing the list of interfaces
at line 864
with the call to opal_ifbegin() ("OPAL" is our underlying portability
layer). If this was the first time opal_ifbegin() was invoked, it'll
scan the kernel for all the interfaces; otherwise it'll just traverse
the list that it already has. Either way, you might want to run this
section through a debugger and see if it's not finding anything.
Just an offhand question: do you have non-localhost IPv4 interfaces
enabled on your machines?
That's also odd. I don't see any problems in the source code in
this particular area. What is the output of this area of the
code when compiled with -E? It should show some obvious
problem.
I'll check this a bit later, if you don't object.
No problem.
--
Jeff Squyres
Cisco Systems