Jeff Squyres <jsquy...@cisco.com> writes:

> On Oct 11, 2008, at 6:48 AM, Aleksej Saushev wrote:
>
>> The actual message states:
>>
>> [asau.local:25752] [NO-NAME] ORTE_ERROR_LOG: Not found in file
>> runtime/orte_init_stage1.c at line 182
>> --------------------------------------------------------------------------
>
> Hmm.  Even with all your output, I still don't see what could be
> causing this -- the oob rml plugin was compiled and installed
> just fine.  Do you see an oob rml line in the output of
> ompi_info?

$ ompi_info | grep oob
[asau.local:00985] mca: base: components_open: Looking for ras components
[asau.local:00985] mca: base: components_open: distilling ras components
[asau.local:00985] mca: base: components_open: accepting all ras components
[asau.local:00985] mca: base: components_open: opening ras components
[asau.local:00985] mca: base: components_open: found loaded component dash_host
[asau.local:00985] mca: base: components_open: component dash_host open function successful
[asau.local:00985] mca: base: components_open: found loaded component gridengine
[asau.local:00985] mca: base: components_open: component gridengine open function successful
[asau.local:00985] mca: base: components_open: found loaded component localhost
[asau.local:00985] mca: base: components_open: component localhost open function successful
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)

> Is there a chance that there's some dependent library of oob_rml
> that is available on your head/build node, but not available on
> your back-end nodes?  (that would be pretty odd, though)

Very unlikely. Unless it somehow isn't installed at "make install" time,
it is there. Host and target are the same (identical) machine.
Any particular library (set of libraries) to check?
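
One way to look for a missing dependency is to ask the dynamic linker
directly. The following is only a sketch under assumptions that are not
stated in this thread: that the plugin is installed as mca_rml_oob.so
under $PREFIX/lib/openmpi (PREFIX is a placeholder for the real install
prefix), and that ldd is available on the node. The helper names
filter_missing and check_plugin are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: report shared-library dependencies ldd cannot resolve.

# Keep only the lines that ldd marks as unresolved.
filter_missing() {
    grep 'not found' || true
}

# Run ldd on one plugin file and show its unresolved dependencies, if any.
check_plugin() {
    ldd "$1" 2>&1 | filter_missing
}

# Assumed install location; adjust PREFIX to the real installation prefix.
PREFIX="${PREFIX:-/usr/local}"
PLUGIN="$PREFIX/lib/openmpi/mca_rml_oob.so"

if [ -f "$PLUGIN" ]; then
    check_plugin "$PLUGIN"
fi
```

No output means ldd resolved every dependency of that file. Running the
same check on a back-end node would show whether the two environments
differ.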

> Bummer -- it looks like we have a bug in the debugging output
> for when rml plugins are selected -- so I can't just give you
> an mpirun command line that will output some additional
> diagnostic information.  Do you mind getting your hands dirty
> in a little code?  If so, edit this file:
> orte/mca/rml/base/rml_base_select.c and change all instances of
>
>    opal_output_verbose(xxx, orte_rml_base.rml_output, ...)
> to
>    opal_output(orte_rml_base.rml_output, ...)
>
> And then compile/install that with (this is a shortcut; of
> course, you can do a top-level "make install" to install it,
> but it's a bit overkill for what we need for this bit):
>
>    cd orte/mca/rml
>    make
>    cd ../..
>    make install-am
>
> Then run with:
>
>    mpirun --mca rml_base_debug 100 ...
>
> And see what the output tells you.  When I do this with a
> successful run, my output looks like this:
>
> ----
> [5:38] svbu-mpi:~/mpi % mpirun -np 1 --mca rml_base_debug 100 hello
> [svbu-mpi.cisco.com:02087] orte_rml_base_select: initializing rml component oob
> [svbu-mpi030:10587] orte_rml_base_select: initializing rml component oob
> stdout: Hello, world!  I am 0 of 1 (svbu-mpi030)
> stderr: Hello, world!  I am 0 of 1 (svbu-mpi030)
> [5:39] svbu-mpi:~/mpi %
> -----
>
> (my "hello" program simply prints out the hello world message on
> both stdout/stderr)
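
The edit described above can also be done mechanically. This is a
minimal sketch, not something posted in the thread: it assumes that the
first argument of each opal_output_verbose call contains no comma and
sits on the same line as the call, and the helper name unverbose is
hypothetical. It is meant to be run from the top of the source tree.

```shell
#!/bin/sh
# Hypothetical helper: turn opal_output_verbose(level, stream, ...) calls
# into opal_output(stream, ...) by dropping the first argument.
# Assumes the level argument has no comma and is on the same line as the call.
unverbose() {
    sed 's/opal_output_verbose *([^,]*, */opal_output(/g' "$1"
}

SRC=orte/mca/rml/base/rml_base_select.c
if [ -f "$SRC" ]; then
    cp "$SRC" "$SRC.orig"            # keep a copy of the original
    unverbose "$SRC.orig" > "$SRC"   # write the patched text back
fi
```

Comparing $SRC.orig with $SRC using diff before rebuilding shows exactly
which calls were changed.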

$ mpirun --mca rml_base_debug 100 -np 2 skosfile
[asau.local:09060] mca: base: components_open: Looking for rml components
[asau.local:09060] mca: base: components_open: distilling rml components
[asau.local:09060] mca: base: components_open: accepting all rml components
[asau.local:09060] mca: base: components_open: opening rml components
[asau.local:09060] mca: base: components_open: found loaded component oob
[asau.local:09060] mca: base: components_open: component oob open function successful
[asau.local:09060] orte_rml_base_select: initializing rml component oob
[asau.local:09060] orte_rml_base_select: init returned failure
[asau.local:09060] orte_rml_base_select: module oob unloaded
[asau.local:09060] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[asau.local:09060] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[asau.local:09060] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

>> Additional information.
>>
>> The pkgsrc framework does work correctly here; it even catches or
>> overrides some incompatibilities. When building OpenMPI from the
>> same tarball without the pkgsrc framework, I get this:
>>
>> libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include
>> -I../../../../orte/include -I../../../../ompi/include -I../../../..
>> -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread
>> -MT backtrace_none_component.lo -MD -MP
>> -MF .deps/backtrace_none_component.Tpo -c backtrace_none_component.c
>> -fPIC -DPIC -o .libs/backtrace_none_component.o
>> backtrace_none_component.c:41: error: expected expression before ',' token
>> backtrace_none_component.c:51: warning: braces around scalar initializer
>> backtrace_none_component.c:51: warning: (near initialization for 'mca_backtrace_none_component.backtracec_version.mca_component_release_version')
>
> That's also odd.  I don't see any problems in the source code in
> this particular area.  What is the output of this area of the
> code when compiled with -E?  It should show some obvious
> problem.

I'll check this a bit later, if you don't object.
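
For reference, the -E check could look like the sketch below. It reuses
the -D and -I flags from the failing libtool line quoted above and
assumes it is run from the directory that contains
backtrace_none_component.c. The helper name show_expansion is
hypothetical, and the choice to print everything from the first mention
of the component symbol onward is only one convenient way to find the
initializer.

```shell
#!/bin/sh
# Hypothetical helper: preprocess one source file and print the expanded
# text from the first line mentioning a symbol through the end of the output.
# Extra arguments (-D and -I flags) are passed to the compiler unchanged.
show_expansion() {
    file=$1
    symbol=$2
    shift 2
    "${CC:-gcc}" -E "$@" "$file" | sed -n "/$symbol/,\$p"
}

# Flags copied from the failing compile line in this message.
if [ -f backtrace_none_component.c ]; then
    show_expansion backtrace_none_component.c mca_backtrace_none_component \
        -DHAVE_CONFIG_H -I. -I../../../../opal/include \
        -I../../../../orte/include -I../../../../ompi/include -I../../../..
fi
```

A macro that expands to nothing inside the component initializer would
show up in this output as two adjacent commas, which would match the
"expected expression before ',' token" error.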


-- 
HE CE3OH...
