On 18/12/2020 23:04, Josh Hursey wrote:
Vincent,
Thanks for the details on the bug. Indeed, this seems to have been a
problem for a little while now when using static ports with ORTE (the
-mca oob_tcp_static_ipv4_ports option). It must have crept in when we
refactored the internal regular expression mechanism for the v4
branches (and, now that I look, maybe as far back as v3.1). I just hit
this same issue in the past day or so while working with a different
user.
Though I do not have a workaround to suggest at this time (sorry), I
did file a GitHub issue and am looking into it. With the holiday I
don't know when I will have a fix, but you can watch the ticket for
updates.
https://github.com/open-mpi/ompi/issues/8304
In the meantime, you could try the v3.0 series release (which predates
this change) or the current Open MPI master branch (which approaches
this a little differently). The same command line should work in both.
Both can be downloaded from the links below:
https://www.open-mpi.org/software/ompi/v3.0/
https://www.open-mpi.org/nightly/master/
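For reference, building either download follows the usual Open MPI tarball steps; the version number and install prefix below are illustrative placeholders, not taken from the thread:

```shell
# Unpack a release tarball (version number is illustrative)
tar xzf openmpi-3.0.6.tar.gz
cd openmpi-3.0.6

# Install into a private prefix so it does not clash with any system MPI
./configure --prefix=$HOME/openmpi-3.0
make -j4 all
make install

# Put the new build first on PATH before re-running the failing command
export PATH=$HOME/openmpi-3.0/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-3.0/lib:$LD_LIBRARY_PATH
```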
Hello Josh
Thank you for considering the problem. I will certainly keep watching
the ticket. However, there is nothing really urgent (to me anyway).
Regarding your command line, it looks pretty good:
orterun --launch-agent /home/boubliki/openmpi/bin/orted -mca btl tcp
--mca btl_tcp_port_min_v4 6706 --mca btl_tcp_port_range_v4 10 --mca
oob_tcp_static_ipv4_ports 6705 -host node2:1 -np 1
/path/to/some/program arg1 .. argn
I would suggest, while you are debugging this, that you use a program
like /bin/hostname instead of a real MPI program. If /bin/hostname
launches properly, then move on to an MPI program. That will confirm
that the runtime wired up correctly (oob/tcp), and then we can focus
on the MPI side of the communication (btl/tcp). You will want to
change "-mca btl tcp" to at least "-mca btl tcp,self" (or better "-mca
btl tcp,vader,self" if you want shared memory). 'self' is the loopback
BTL in Open MPI, used by a process to send messages to itself.
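As a sketch, the two-step check looks like this, reusing the ports, host, and placeholder program path from the command line above:

```shell
# Step 1: launch a non-MPI program to verify the runtime (oob/tcp) wiring.
# If this prints the remote hostname, the daemons connected correctly.
orterun -mca btl tcp,self \
        -mca oob_tcp_static_ipv4_ports 6705 \
        -host node2:1 -np 1 /bin/hostname

# Step 2: only then run a real MPI program, exercising btl/tcp as well.
# 'self' handles a rank sending to itself; 'vader' adds shared memory on-node.
orterun -mca btl tcp,vader,self \
        -mca btl_tcp_port_min_v4 6706 -mca btl_tcp_port_range_v4 10 \
        -mca oob_tcp_static_ipv4_ports 6705 \
        -host node2:1 -np 1 /path/to/some/program arg1 .. argn
```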
Yes, this is actually what I did. I just wanted to keep the report
generic, without too much flourish. But it is a useful reminder for
new users, as it helps them understand the real purpose of each layer
in an MPI implementation.
Is there a reason that you are specifying the --launch-agent to the
orted? Is it installed in a different path on the remote nodes? If
Open MPI is installed in the same location on all nodes then you
shouldn't need that.
I recompiled the sources, activating --enable-orterun-prefix-by-default
when running ./configure. Of course, it helps :)
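For anyone hitting the same thing, that rebuild looks roughly like the following; the install prefix is taken from the --launch-agent path earlier in the thread and may differ on your system:

```shell
# Rebuild Open MPI so orterun passes its prefix to remote orted daemons
# automatically, removing the need for an explicit --launch-agent path.
./configure --prefix=/home/boubliki/openmpi --enable-orterun-prefix-by-default
make -j4 all && make install
```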
Again, thank you.
Kind regards
Vincent.
Thanks,
Josh