On Dec 18, 2007, at 11:12 AM, Marco Sbrighi wrote:
Assumedly this (these) statement(s) are in a config file that is being read by Open MPI, such as $HOME/.openmpi/mca-params.conf?
I've tried many combinations: only in $HOME/.openmpi/mca-params.conf, only on the command line, and both; but none seems to work correctly. Nevertheless, what I expect is that if something is specified in $HOME/.openmpi/mca-params.conf and then specified differently on the command line, the latter should be used, I think.
The only difference in putting values in these locations should be the order of precedence in which they are read. As you stated, values on the command line override everything else. See http://www.open-mpi.org/faq/?category=tuning#setting-mca-params .
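To make the precedence concrete, here is a minimal sketch in C. It is not Open MPI code: resolve_mca_param() is a made-up name, and the relative order of environment and file values is an assumption here (the FAQ linked above is the authority). The only part taken from this thread is that the command line wins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Open MPI. Each argument is the value
 * found in one location, or NULL if the param was not set there.
 * Assumed order: command line, then environment, then file, then default. */
const char *resolve_mca_param(const char *cmdline, const char *env,
                              const char *file, const char *dflt)
{
    assert(dflt != NULL);   /* this sketch assumes a default always exists */
    if (cmdline != NULL) {
        return cmdline;     /* command line overrides everything else */
    }
    if (env != NULL) {
        return env;
    }
    if (file != NULL) {
        return file;
    }
    return dflt;
}
```

So a value in mca-params.conf is only used when the same param is not also given on the command line.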
Yes, it does. Specifying the same MCA param twice on the command line results in undefined behavior -- it will only take one of them, and I assume it'll take the first (but I'd have to check the code to be sure).
OK, I can obtain the same behaviour using only one statement: --mca oob_tcp_include eth1,lo,eth0,ib0,ib1
FWIW, I traced the history of this code -- it looks like it dates all the way back to LAM/MPI, where if you specify "--mca foo bar --mca foo yow", then foo will get the value "bar,yow". So it *is* intended (albeit undocumented!) behavior. Who knew! :-)
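As an illustration of that accumulation, here is a minimal sketch in C. It is not the actual Open MPI (or LAM/MPI) command-line code; append_mca_value() is a hypothetical helper that only mimics the "bar,yow" result described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Open MPI code: append `value` to the
 * NUL-terminated string in `dst`, separating entries with a comma.
 * Returns 0 on success, -1 if the result would not fit in `dstsize`. */
int append_mca_value(char *dst, size_t dstsize, const char *value)
{
    assert(dst != NULL && value != NULL);

    size_t used = strlen(dst);
    size_t vlen = strlen(value);
    size_t need = vlen + (used > 0 ? 1 : 0);   /* +1 for the comma */

    if (used + need + 1 > dstsize) {           /* +1 for the NUL */
        return -1;
    }
    if (used > 0) {
        dst[used++] = ',';
    }
    memcpy(dst + used, value, vlen + 1);       /* copies the NUL too */
    return 0;
}
```

Calling it once with "bar" and once with "yow" on an empty buffer leaves "bar,yow" in the buffer, which is why the repeated and the single --mca forms look the same downstream.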
Note that using --mca mpi_show_mca_params, what I'm seeing in the report is the same for both statements (twice and single):
.....
[node255:30188] oob_tcp_debug=0
[node255:30188] oob_tcp_include=eth1,lo,eth0,ib0,ib1
[node255:30188] oob_tcp_exclude=
.......
So far, this is all consistent and expected.
Could you try with 1.2.3 or 1.2.4 (1.2.4 is the most recent; 1.2.5 is due out "soon" -- it *may* get out before the holiday break, but no promises...)?
We have 1.2.3 on another cluster and it shows the same behaviour as 1.2.2 .... (BTW, the other cluster has the same eth ifaces)
Crud.
If you can't upgrade, let me know and I can provide a debugging patch that will give us a little more insight into what is happening on your machines. Thanks.
It is quite difficult for us to upgrade Open MPI right now. We have the official Cisco packages installed, and I know that 1.2.2-1 is the only official Cisco Open MPI distribution today ....
Here's a patch to the OMPI 1.2.2 source that adds some printf's in the OOB TCP interface selection logic that should show exactly what each process decides. You should be able to run this with as few as 2 processes to see what the decision-making process is for each of them.
[11:24] svbu-mpi:/home/jsquyres/openmpi-1.2.2 % diff -u orte/mca/oob/tcp/oob_tcp.c.orig orte/mca/oob/tcp/oob_tcp.c
--- orte/mca/oob/tcp/oob_tcp.c.orig 2007-12-18 11:21:08.000000000 -0800
+++ orte/mca/oob/tcp/oob_tcp.c 2007-12-18 11:22:29.000000000 -0800
@@ -1344,11 +1344,15 @@
         char name[32];
         opal_ifindextoname(i, name, sizeof(name));
         if (mca_oob_tcp_component.tcp_include != NULL &&
-            strstr(mca_oob_tcp_component.tcp_include,name) == NULL)
+            strstr(mca_oob_tcp_component.tcp_include,name) == NULL) {
+            opal_output(0, "TCP OOB skipping %s because it's not in include (%s)\n", name, mca_oob_tcp_component.tcp_include);
             continue;
+        }
         if (mca_oob_tcp_component.tcp_exclude != NULL &&
-            strstr(mca_oob_tcp_component.tcp_exclude,name) != NULL)
+            strstr(mca_oob_tcp_component.tcp_exclude,name) != NULL) {
+            opal_output(0, "TCP OOB skipping %s because it's in exclude (%s)\n", name, mca_oob_tcp_component.tcp_exclude);
             continue;
+        }
         opal_ifindextoaddr(i, (struct sockaddr*)&addr, sizeof(addr));
         if(opal_ifcount() > 1 &&
            opal_ifislocalhost((struct sockaddr*) &addr))
@@ -1356,6 +1360,7 @@
         if(ptr != contact_info) {
             ptr += sprintf(ptr, ";");
         }
+        opal_output(0, "TCP OOB adding interface: %s\n", name);
         ptr += sprintf(ptr, "tcp://%s:%d", inet_ntoa(addr.sin_addr),
             ntohs(mca_oob_tcp_component.tcp_listen_port));
     }
I attached the patch as well in case my mail client / the mailing list munges it.
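If it helps when reading the output, the include/exclude test in that loop can be pulled out into a small standalone sketch. The function name and return convention below are made up for this example, and the localhost check that follows in the hunk is left out; only the two strstr() conditions are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the selection rule from the patched loop; hypothetical
 * function, not part of Open MPI. `include` and `exclude` are the
 * comma-separated param values, or NULL when the param is unset.
 * Returns 1 if `name` would be used, 0 if it would be skipped. */
int oob_iface_selected(const char *name, const char *include,
                       const char *exclude)
{
    assert(name != NULL);

    if (include != NULL && strstr(include, name) == NULL) {
        return 0;   /* "skipping ... because it's not in include" */
    }
    if (exclude != NULL && strstr(exclude, name) != NULL) {
        return 0;   /* "skipping ... because it's in exclude" */
    }
    return 1;       /* "adding interface" */
}
```

One thing to keep in mind when interpreting the results: strstr() is a plain substring search over the whole list, so an interface named eth1 also matches an include value of eth10.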
--
Jeff Squyres
Cisco Systems
ompi-1.2.2-oob-tcp-verbose.patch
Description: Binary data