On 07/01/2021 19:51, Josh Hursey via users wrote:
I posted a fix for the static ports issue (currently on the v4.1.x branch):
https://github.com/open-mpi/ompi/pull/8339

If you have time do you want to give it a try and confirm that it fixes your issue?


Hello Josh

Definitely, yes! It no longer crashes, and I can see through ss/netstat that the orted process is connecting to the port I specified. Good work. Thank you.
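For readers who want to repeat this check, a minimal sketch follows. It assumes `ss` is run as `ss -tnp` (TCP sockets, numeric addresses, owning process) and that 6705 is the static port from the command line quoted later in the thread. The helper name `filter_orted_port` is hypothetical and is not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of `ss -tnp` output that
# mention the orted process and the given TCP port.
filter_orted_port() {
    # $1 = port number to look for, e.g. 6705
    grep orted | grep -E ":$1([[:space:]]|\$)"
}

# Usage on a live system (seeing other users' processes may need root):
#   ss -tnp | filter_orted_port 6705
```

If the filter prints a line while the job is running, the orted connection is using the requested static port.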

I wish you a happy 2021.

Regards

Vincent.



Thanks,
Josh


On Tue, Dec 22, 2020 at 2:44 AM Vincent <boubl...@yahoo.co.uk> wrote:

    On 18/12/2020 23:04, Josh Hursey wrote:
    Vincent,

    Thanks for the details on the bug. Indeed this is a case that
    seems to have been a problem for a little while now when you
    use static ports with ORTE (-mca oob_tcp_static_ipv4_ports
    option). It must have crept in when we refactored the internal
    regular expression mechanism for the v4 branches (and now that I
    look maybe as far back as v3.1). I just hit this same issue in
    the past day or so working with a different user.

    Though I do not have a suggestion for a workaround at this time
    (sorry), I did file a GitHub issue and am looking into it.
    With the holiday I don't know when I will have a fix, but you can
    watch the ticket for updates.
    https://github.com/open-mpi/ompi/issues/8304

    In the meantime, you could try the v3.0 series release (which
    predates this change) or the current Open MPI master branch
    (which approaches this a little differently). The same command
    line should work in both. Both can be downloaded from the links
    below:
    https://www.open-mpi.org/software/ompi/v3.0/
    https://www.open-mpi.org/nightly/master/

    Hello Josh

    Thank you for considering the problem. I will certainly keep
    watching the ticket. However, there is nothing really urgent (to
    me anyway).


    Regarding your command line, it looks pretty good:
      orterun --launch-agent /home/boubliki/openmpi/bin/orted \
          -mca btl tcp \
          --mca btl_tcp_port_min_v4 6706 \
          --mca btl_tcp_port_range_v4 10 \
          --mca oob_tcp_static_ipv4_ports 6705 \
          -host node2:1 -np 1 /path/to/some/program arg1 .. argn

    I would suggest, while you are debugging this, that you use a
    program like /bin/hostname instead of a real MPI program. If
    /bin/hostname launches properly then move on to an MPI program.
    That will assure you that the runtime wired up correctly
    (oob/tcp), and then we can focus on the MPI side of the
    communication (btl/tcp). You will want to change "-mca btl tcp"
    to at least "-mca btl tcp,self" (or better "-mca btl
    tcp,vader,self" if you want shared memory). 'self' is the
    loopback interface in Open MPI.

    Yes. This is actually what I did. I just wanted to be generic and
    report the problem without too much flourish.
    But it is important that you mention this as a reminder for new
    users; it helps them understand the real purpose of each layer in
    an MPI implementation.
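The staged check described above could be sketched as follows. This is a dry run that only prints the two command lines rather than launching anything. The ports, host, and MCA parameters come from the command quoted in the thread, with "btl tcp" changed to "btl tcp,self" as suggested. The helper name `build_cmd` is hypothetical, and `--launch-agent` is left out on the assumption that Open MPI is installed in the same location on all nodes.

```shell
#!/bin/sh
# Dry-run sketch: print the orterun command lines instead of running them.
# Values are taken from the thread; build_cmd is a hypothetical helper.
OOB_PORT=6705        # static port for the runtime wire-up (oob/tcp)
BTL_PORT_MIN=6706    # first port for MPI traffic (btl/tcp)
BTL_PORT_RANGE=10    # number of ports available to btl/tcp
HOST="node2:1"

build_cmd() {
    # "$@" = program to launch, followed by its arguments
    echo orterun \
        --mca btl tcp,self \
        --mca btl_tcp_port_min_v4 "$BTL_PORT_MIN" \
        --mca btl_tcp_port_range_v4 "$BTL_PORT_RANGE" \
        --mca oob_tcp_static_ipv4_ports "$OOB_PORT" \
        -host "$HOST" -np 1 "$@"
}

# Step 1: a non-MPI program, which exercises only the runtime (oob/tcp).
build_cmd /bin/hostname

# Step 2: the real MPI program, which also exercises btl/tcp.
build_cmd /path/to/some/program
```

Removing the `echo` inside `build_cmd` would run the commands for real. If step 1 fails, the problem is in the runtime wire-up rather than in MPI communication.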


    Is there a reason that you are specifying the --launch-agent to
    the orted? Is it installed in a different path on the remote
    nodes? If Open MPI is installed in the same location on all nodes
    then you shouldn't need that.

    I recompiled the sources, activating
    --enable-orterun-prefix-by-default when running ./configure. Of
    course, it helps :)

    Again, thank you.

    Kind regards

    Vincent.



    Thanks,
    Josh



--
Josh Hursey
IBM Spectrum MPI Developer
