Some further clarification: I read a post on the SGE mailing list saying that --with-sge is part of ompi 1.3, not 1.2.x.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Aleksej Saushev
Sent: Thursday, October 16, 2008 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful

Jeff Squyres <jsquy...@cisco.com> writes:

> On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
>
>> $ ompi_info | grep oob
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
>
> Good!
>
>>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
>> [asau.local:09060] mca: base: components_open: Looking for rml components
>> [asau.local:09060] mca: base: components_open: distilling rml components
>> [asau.local:09060] mca: base: components_open: accepting all rml components
>> [asau.local:09060] mca: base: components_open: opening rml components
>> [asau.local:09060] mca: base: components_open: found loaded component oob
>> [asau.local:09060] mca: base: components_open: component oob open function successful
>> [asau.local:09060] orte_rml_base_select: initializing rml component oob
>> [asau.local:09060] orte_rml_base_select: init returned failure
>
> Ah ha -- this is progress. For some reason, your "oob" RML
> plugin is declining to run. I see that its
> query/initialization function is actually quite short:
>
>     if(mca_oob_base_init() != ORTE_SUCCESS)
>         return NULL;
>     *priority = 1;
>     return &orte_rml_oob_module;
>
> So it must be failing the mca_oob_base_init() function -- this
> is what initializes the underlying "OOB" (out of band)
> communications subsystem.
>
> Of course, this doesn't fail often, so we don't have any
> run-time switches to enable the debugging output. :-( Edit
> orte/mca/oob/base/oob_base_open.c line 43 and change the value
> of mca_oob_base_output from -1 to 0. Let's see that output --
> I'm particularly interested in the output from querying the tcp
> oob component. I suspect that it's declining to run as well.
>
> I wonder if this is going to end up being an opal_if() issue --
> where we are traversing all the IP network interfaces from the
> kernel... I'll bet even money that it is.

[asau.local:04648] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Why don't you use strerror(3) to print an explanation of the errno value?

From <sys/errno.h>:
#define ENXIO 6 /* Device not configured */

It seems that I have to debug network interface probing.
How should I use the *_output subroutines so that they actually print?
I tried these changes, but in vain:

--- opal/util/if.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c 2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;
 
+        opal_output(0, "opal_ifinit: checking netif %s", ifr->ifr_name);
+        /* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;

--- opal/util/if.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c 2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;
 
+        fprintf(stderr, "opal_ifinit: checking netif %s\n", ifr->ifr_name);
+        /* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;

--- opal/util/output.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/output.c 2008-10-16 19:58:49.000000000 +0400
@@ -41,7 +41,7 @@
 /*
  * Private data
  */
-static int verbose_stream = -1;
+static int verbose_stream = 0;
 static opal_output_stream_t verbose;
 static char *output_dir = NULL;
 static char *output_prefix = NULL;

It seems a bit tricky, and it is scarcely documented.
Have I overlooked it?
What makes it strange is that fprintf(stderr, ...) doesn't work either.

> Specifically: I predict that the tcp oob component is declining
> to run (which then causes the greater OOB init to fail, because
> no OOB components will be able to be found, which then causes
> the RML OOB init to fail, and therefore RML init fails because
> no RML components can be found). My guess is that
> orte/mca/oob/tcp/oob_tcp.c:oob_tcp_component_init() is failing
> to find any valid/UP IP interfaces. It starts traversing the
> list of interfaces at line 864 with the call to opal_ifbegin()
> ("OPAL" is our underlying portability layer). If this was the
> first time opal_ifbegin() was invoked, it'll scan the kernel
> for all the interfaces; otherwise it'll just traverse the list
> that it already has. Either way, you might want to run this
> section through a debugger and see if it's not finding anything.
>
> Just an offhand question: do you have non-localhost IPv4
> interfaces enabled on your machines?

Yes.
ifconfig -l ==> bce0 fwip0 rum0 lo0 pppoe0

>>>> That's also odd. I don't see any problems in the source code in
>>> this particular area. What is the output of this area of the
>>> code when compiled with -E? It should show some obvious
>>> problem.
>>
>> I'll check this a bit later, if you don't object.
>
> No problem.

I ran into some difficulties with this today. I need time for
further investigation, though I think this isn't needed now.

I'll be unavailable starting Saturday (probably), and from Monday for sure.

--
HE CE3OH...
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users