A further clarification: I read a post on the SGE mailing list saying
that the --with-sge option is part of ompi 1.3, not 1.2.x.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Aleksej Saushev
Sent: Thursday, October 16, 2008 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI portability problems: debug info
isn't helpful

Jeff Squyres <jsquy...@cisco.com> writes:

> On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
>
>> $ ompi_info | grep oob
>>                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>>                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
>
> Good!
>
>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
>> [asau.local:09060] mca: base: components_open: Looking for rml
>> components
>> [asau.local:09060] mca: base: components_open: distilling rml
>> components
>> [asau.local:09060] mca: base: components_open: accepting all
>> rml  components
>> [asau.local:09060] mca: base: components_open: opening rml components
>> [asau.local:09060] mca: base: components_open: found loaded
>> component oob
>> [asau.local:09060] mca: base: components_open: component oob
>> open  function successful
>> [asau.local:09060] orte_rml_base_select: initializing rml
>> component  oob
>> [asau.local:09060] orte_rml_base_select: init returned failure
>
> Ah ha -- this is progress.  For some reason, your "oob" RML
> plugin is declining to run.  I see that its
> query/initialization function is actually quite short:
>
>     if(mca_oob_base_init() != ORTE_SUCCESS)
>         return NULL;
>     *priority = 1;
>     return &orte_rml_oob_module;
>
> So it must be failing the mca_oob_base_init() function -- this
> is what initializes the underlying "OOB" (out of band)
> communications subsystem.
>
> Of course, this doesn't fail often, so we don't have any
> run-time switches to enable the debugging output.  :-(  Edit
> orte/mca/oob/base/oob_base_open.c line 43 and change the value
> of mca_oob_base_output from -1 to 0.  Let's see that output --
> I'm particularly interested in the output from querying the tcp
> oob component.  I suspect that it's declining to run as well.
>
> I wonder if this is going to end up being an opal_if() issue --
> where we are traversing all the IP network interfaces from the
> kernel...  I'll bet even money that it is.

[asau.local:04648] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Why don't you use strerror(3) to print an explanation of the errno value?

From <sys/errno.h>:
#define ENXIO           6               /* Device not configured */
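
To illustrate the strerror(3) suggestion, here is a minimal sketch of a
message that carries both the numeric errno and its text. The helper
name format_errno_message is hypothetical and is not part of Open MPI;
the wording of the text comes from the platform's C library, so it may
differ between systems.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI function): build a message for a
 * failed call that includes the strerror(3) text and the numeric errno.
 * Returns the value of snprintf(), i.e. the length the message needs. */
int format_errno_message(char *buf, size_t len, const char *what, int err)
{
    return snprintf(buf, len, "%s failed: %s (errno=%d)",
                    what, strerror(err), err);
}
```

It could be called right after the failing ioctl(), for example as
format_errno_message(buf, sizeof buf, "ioctl(SIOCGIFFLAGS)", errno),
saving errno first if anything else runs in between.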

It seems that I have to debug the network interface probing.
How should I use the *_output subroutines so that they actually print?
I tried the following changes, but in vain:

--- opal/util/if.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c      2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;

+       opal_output(0, "opal_ifinit: checking netif %s", ifr->ifr_name);
+       /* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;
--- opal/util/if.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c      2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;

+       fprintf(stderr, "opal_ifinit: checking netif %s\n", ifr->ifr_name);
+       /* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;
--- opal/util/output.c.orig     2008-08-25 23:16:50.000000000 +0400
+++ opal/util/output.c  2008-10-16 19:58:49.000000000 +0400
@@ -41,7 +41,7 @@
 /*
  * Private data
  */
-static int verbose_stream = -1;
+static int verbose_stream = 0;
 static opal_output_stream_t verbose;
 static char *output_dir = NULL;
 static char *output_prefix = NULL;

This seems a bit tricky, and it is scarcely documented.
Have I overlooked something?

What makes it strange is that even fprintf(stderr,..) doesn't work.
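
One way to rule out stderr itself as the culprit is to append the trace
to a file instead. The sketch below is only an assumption about what
might help here: trace_to_file is a hypothetical helper, not an Open MPI
function, and it does not explain why the fprintf() above stays silent.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not an Open MPI function): append one formatted
 * line to a trace file and close it at once, so the text reaches disk
 * even if the process aborts soon afterwards.
 * Returns 0 on success, -1 on any error. */
int trace_to_file(const char *path, const char *fmt, ...)
{
    FILE *fp = fopen(path, "a");
    va_list ap;
    int rc;

    if (fp == NULL)
        return -1;

    va_start(ap, fmt);
    rc = vfprintf(fp, fmt, ap);
    va_end(ap);

    if (rc >= 0 && fputc('\n', fp) == EOF)
        rc = -1;
    if (fclose(fp) != 0)
        rc = -1;
    return rc < 0 ? -1 : 0;
}
```

If a call such as trace_to_file("/tmp/opal_if.log", "checking netif %s",
ifr->ifr_name) leaves no file behind either, the patched code is
probably not the code being executed. The path used here is only an
example.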

> Specifically: I predict that the tcp oob component is declining
> to run (which then causes the greater OOB init to fail, because
> no OOB components will be able to be found, which then causes
> the RML OOB init to fail, and therefore RML init fails because
> no RML components can be found).  My guess is that
> orte/mca/oob/tcp/oob_tcp.c:oob_tcp_component_init() is failing
> to find any valid/UP IP interfaces.  It starts traversing the
> list of interfaces at line 864 with the call to opal_ifbegin()
> ("OPAL" is our underlying portability layer).  If this was the
> first time opal_ifbegin() was invoked, it'll scan the kernel
> for all the interfaces; otherwise it'll just traverse the list
> that it already has.  Either way, you might want to run this
> section through a debugger and see if it's not finding anything.
>
> Just an offhand question: do you have non-localhost IPv4
> interfaces enabled on your machines?

Yes.

ifconfig -l ==> bce0 fwip0 rum0 lo0 pppoe0
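
For cross-checking what the kernel reports independently of Open MPI, a
small standalone probe can help. The sketch below uses getifaddrs(3)
rather than the SIOCGIFFLAGS ioctl that opal_ifinit uses, so it is a
different technique and may not reproduce the failure; the function name
count_ipv4_interfaces is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose IFF_* under strict -std modes */
#endif
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <ifaddrs.h>
#include <stdio.h>

/* Hypothetical probe (not Open MPI code): count interfaces that carry an
 * IPv4 address. If only_usable is nonzero, count only those that are UP
 * and not loopback. If log is non-NULL, print one line per IPv4 address.
 * Returns -1 if getifaddrs() fails. */
int count_ipv4_interfaces(int only_usable, FILE *log)
{
    struct ifaddrs *list = NULL;
    struct ifaddrs *ifa;
    int count = 0;

    if (getifaddrs(&list) != 0)
        return -1;

    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        int usable;

        /* Skip entries without an address and non-IPv4 entries. */
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;

        usable = (ifa->ifa_flags & IFF_UP) != 0 &&
                 (ifa->ifa_flags & IFF_LOOPBACK) == 0;
        if (log != NULL)
            fprintf(log, "%s: flags=0x%x%s\n", ifa->ifa_name,
                    (unsigned int)ifa->ifa_flags,
                    usable ? " (UP, non-loopback)" : "");
        if (!only_usable || usable)
            count++;
    }

    freeifaddrs(list);
    return count;
}
```

Calling count_ipv4_interfaces(1, stderr) from a tiny main() would list
each IPv4 address with its flags and return how many are UP and
non-loopback, which can be compared against the ifconfig output.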

>>> That's also odd.  I don't see any problems in the source code in
>>> this particular area.  What is the output of this area of the
>>> code when compiled with -E?  It should show some obvious
>>> problem.
>>
>> I'll check this a bit later, if you don't object.
>
> No problem.

I ran into some difficulties with this today, so I need more time to
investigate further, though I think this isn't needed now.

I'll probably be unavailable starting Saturday, and certainly from
Monday.


-- 
HE CE3OH...
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
