On Dec 18, 2007, at 11:12 AM, Marco Sbrighi wrote:

Presumably these statements are in a config file that is being
read by Open MPI, such as $HOME/.openmpi/mca-params.conf?

I've tried many combinations: only in $HOME/.openmpi/mca-params.conf,
only on the command line, and both; but none seems to work correctly.
Nevertheless, I would expect that if a value is specified in
$HOME/.openmpi/mca-params.conf and a different value is given on the
command line, the latter should take effect.

The only difference in putting values in these locations should be the order of precedence in which they are read. As you stated, values on the command line override everything else. See http://www.open-mpi.org/faq/?category=tuning#setting-mca-params .

Yes, it does. Specifying the same MCA param twice on the command line
results in undefined behavior -- it will only take one of them, and I
assume it'll take the first (but I'd have to check the code to be sure).

OK, I can obtain the same behaviour using only one statement:
--mca oob_tcp_include eth1,lo,eth0,ib0,ib1

FWIW, I traced the history of this code -- it looks like it dates all the way back to LAM/MPI, where if you specify "--mca foo bar --mca foo yow", then foo will get the value "bar,yow". So it *is* intended (albeit undocumented!) behavior. Who knew! :-)
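
To illustrate the "bar,yow" behavior described above, here is a minimal sketch in C. It is not the Open MPI or LAM/MPI implementation; append_param_value is a hypothetical helper that only shows how repeated values for one parameter can end up as a single comma-separated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Open MPI code: append `value` to the
 * comma-separated list held in `dst` (capacity `cap`, including the
 * terminating NUL). Returns 0 on success, -1 if it would not fit. */
int append_param_value(char *dst, size_t cap, const char *value)
{
    assert(dst != NULL && value != NULL);

    size_t used = strlen(dst);
    size_t vlen = strlen(value);
    size_t sep  = (used > 0) ? 1 : 0;   /* a comma is needed only after the first value */

    if (used + sep + vlen + 1 > cap) {
        return -1;                      /* leave dst unchanged on overflow */
    }
    if (sep) {
        dst[used++] = ',';
    }
    memcpy(dst + used, value, vlen + 1); /* copies the NUL as well */
    return 0;
}
```

Starting from an empty buffer, appending "bar" and then "yow" leaves "bar,yow" in the buffer, which mirrors what "--mca foo bar --mca foo yow" is reported to produce.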

Note that with --mca mpi_show_mca_params, what I'm seeing in the report
is the same in both cases (parameter given twice or given once):

.....
[node255:30188] oob_tcp_debug=0
[node255:30188] oob_tcp_include=eth1,lo,eth0,ib0,ib1
[node255:30188] oob_tcp_exclude=
.......

So far, this is all consistent and expected.

Could you try with 1.2.3 or 1.2.4 (1.2.4 is the most recent; 1.2.5 is
due out "soon" -- it *may* get out before the holiday break, but no
promises...)?

We have 1.2.3 on another cluster and it shows the same behaviour as
1.2.2 .... (BTW the other cluster has the same eth ifaces)

Crud.

If you can't upgrade, let me know and I can provide a debugging patch
that will give us a little more insight into what is happening on your
machines.  Thanks.

It is quite difficult for us to upgrade Open MPI now. We have the
official Cisco packages installed, and I know that 1.2.2-1 is the only
official Cisco Open MPI distribution today ....


Here's a patch to the OMPI 1.2.2 source that adds some printf-style opal_output() calls to the OOB TCP interface selection logic; they should show exactly which interfaces each process skips or adds. You should be able to run this with as few as 2 processes to see what the decision-making process is for each of them.

[11:24] svbu-mpi:/home/jsquyres/openmpi-1.2.2 % diff -u orte/mca/oob/tcp/oob_tcp.c.orig orte/mca/oob/tcp/oob_tcp.c
--- orte/mca/oob/tcp/oob_tcp.c.orig     2007-12-18 11:21:08.000000000 -0800
+++ orte/mca/oob/tcp/oob_tcp.c  2007-12-18 11:22:29.000000000 -0800
@@ -1344,11 +1344,15 @@
         char name[32];
         opal_ifindextoname(i, name, sizeof(name));
         if (mca_oob_tcp_component.tcp_include != NULL &&
-            strstr(mca_oob_tcp_component.tcp_include,name) == NULL)
+            strstr(mca_oob_tcp_component.tcp_include,name) == NULL) {
+            opal_output(0, "TCP OOB skipping %s because it's not in include (%s)\n", name, mca_oob_tcp_component.tcp_include);
             continue;
+        }
         if (mca_oob_tcp_component.tcp_exclude != NULL &&
-            strstr(mca_oob_tcp_component.tcp_exclude,name) != NULL)
+            strstr(mca_oob_tcp_component.tcp_exclude,name) != NULL) {
+            opal_output(0, "TCP OOB skipping %s because it's in exclude (%s)\n", name, mca_oob_tcp_component.tcp_exclude);
             continue;
+        }
         opal_ifindextoaddr(i, (struct sockaddr*)&addr, sizeof(addr));
         if(opal_ifcount() > 1 &&
            opal_ifislocalhost((struct sockaddr*) &addr))
@@ -1356,6 +1360,7 @@
         if(ptr != contact_info) {
             ptr += sprintf(ptr, ";");
         }
+        opal_output(0, "TCP OOB adding interface: %s\n", name);
         ptr += sprintf(ptr, "tcp://%s:%d", inet_ntoa(addr.sin_addr),
                     ntohs(mca_oob_tcp_component.tcp_listen_port));
     }

I attached the patch as well in case my mail client / the mailing list munges it.
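
For reference, the include/exclude test that the patch instruments can be sketched as a standalone C function. This is a simplified illustration based only on the lines visible in the diff above: oob_if_selected is a hypothetical name, its include and exclude arguments stand in for mca_oob_tcp_component.tcp_include and tcp_exclude, and the later localhost check is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the filter in the patched loop (hypothetical function name).
 * NULL for `include` or `exclude` means that list is not set.
 * Returns 1 if the interface would be kept, 0 if it would be skipped. */
int oob_if_selected(const char *name, const char *include, const char *exclude)
{
    assert(name != NULL);

    /* Skip when an include list exists and does not mention the name. */
    if (include != NULL && strstr(include, name) == NULL) {
        return 0;
    }
    /* Skip when an exclude list exists and mentions the name. */
    if (exclude != NULL && strstr(exclude, name) != NULL) {
        return 0;
    }
    return 1;
}
```

Note that strstr() does substring matching rather than comparing whole list entries, so in this sketch an interface named eth1 is also accepted by an include list that contains only eth10.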

--
Jeff Squyres
Cisco Systems

Attachment: ompi-1.2.2-oob-tcp-verbose.patch
Description: Binary data

