Hmmm...then you have something else going on. By default, OMPI will ask the 
local OS for an available port and use it. You only need to specify ports when 
working through a firewall.
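
For illustration only, here is a minimal Python sketch of what "asking the local OS for an available port" usually looks like at the socket level: bind to port 0 and read back the port the kernel picked. That OMPI does exactly this internally is an assumption; the function name os_assigned_port is made up for this example.

```python
import socket


def os_assigned_port(host: str = "0.0.0.0") -> int:
    """Bind a TCP socket to port 0 and return the port the kernel chose."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 means "any free port"; the OS picks one from its ephemeral range.
        s.bind((host, 0))
        s.listen(1)
        return s.getsockname()[1]


if __name__ == "__main__":
    print("OS assigned port", os_assigned_port())
```

Ports chosen this way come from the OS ephemeral range, which would explain why they land outside any range configured by hand.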

Do you have firewalls on this cluster?


On Mar 18, 2021, at 8:55 AM, Sendu Bala <s...@sanger.ac.uk> wrote:

Yes, that’s the trick. I’m going to have to check port usage on all hosts and 
pick suitable ranges just-in-time - and hope I don’t hit a race condition with 
other users of the cluster.
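
A rough sketch of such a just-in-time check, in Python, assuming that "free" simply means a plain TCP bind succeeds on the local host. The helper names is_port_free and first_free_block are made up for this example. It only inspects the host it runs on, so it would have to be run on every node, and it does not remove the race condition: another process can grab a port between the check and the start of mpirun.

```python
import socket
from typing import Optional


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # In use, or not permitted (e.g. a privileged port): treat as busy.
            return False
    return True


def first_free_block(start: int, stop: int, size: int) -> Optional[int]:
    """Return the first p in [start, stop) where p..p+size-1 are all free."""
    p = start
    while p + size <= stop:
        busy = next((q for q in range(p, p + size) if not is_port_free(q)), None)
        if busy is None:
            return p
        # Skip past the busy port and try the next candidate block.
        p = busy + 1
    return None


if __name__ == "__main__":
    # Search an arbitrary example window for 64 consecutive free ports.
    print(first_free_block(46000, 47000, 64))
```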

Does mpiexec not have this kind of functionality built in? When I use it with 
no port options set (pure default), it just doesn’t function (I’m guessing 
because it chose “bad” or in-use ports).



On 18 Mar 2021, at 14:11, Ralph Castain via users <users@lists.open-mpi.org> wrote:

Hard to say - unless there is some reason not to, why not make the range large 
enough to not be an issue? You may have to experiment a bit as there is nothing 
to guarantee that other processes aren't occupying those ports.



On Mar 18, 2021, at 2:13 AM, Sendu Bala <s...@sanger.ac.uk> wrote:

Thanks, it made it work when I was running “true” as a test, but then my real 
MPI app failed with:

[node-5-8-2][[48139,1],0][btl_tcp_component.c:966:mca_btl_tcp_component_create_listen]
 bind() failed: no port available in the range [46107..46139]
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[48139,1],1]) is on host: node-12-6-2
  Process 2 ([[48139,1],0]) is on host: node-5-8-2
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.


This was when running with 16 cores, so I thought a 32-port range would be 
fine. Is this telling me I have to make it a 33-port range, have different 
ranges for oob and btl, or that some other unrelated software is using some 
ports in my range?


(I changed my range from my previous post, because using that range resulted in 
the issue I posted about here before, where mpirun just does nothing for 5 
minutes and then terminates itself, without any error messages.)


Cheers,
Sendu.


On 17 Mar 2021, at 13:25, Ralph Castain via users <users@lists.open-mpi.org> wrote:

What you are missing is that there are _two_ messaging layers in the system. 
You told the btl/tcp layer to use the specified ports, but left the oob/tcp one 
unspecified. You need to add

oob_tcp_dynamic_ipv4_ports = 46207-46239

or whatever range you want to specify

Note that if you want the btl/tcp layer to use those other settings (e.g., 
keepalive_time), then you'll need to set those as well. The names of the 
variables may not match between the layers - you'll need to use ompi_info to 
find the names and params available for each layer.
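
As a sketch of how the two layers could be configured together, the Python snippet below builds one mpirun command line that sets both the btl/tcp and the oob/tcp port parameters. The parameter names and values are the ones quoted in this thread; the helper mca_args is made up for this example, and it assumes the oob parameter can be passed with --mca on the command line in the same way as the btl ones.

```python
def mca_args(params):
    """Turn {name: value} into a flat list of '--mca name value' arguments."""
    args = []
    for name, value in params.items():
        args += ["--mca", name, str(value)]
    return args


# Values taken from the thread; adjust the ranges for your own cluster.
params = {
    "btl_tcp_port_min_v4": 46207,                 # btl/tcp layer
    "btl_tcp_port_range_v4": 32,                  # btl/tcp layer
    "oob_tcp_dynamic_ipv4_ports": "46207-46239",  # oob/tcp layer
}

command = ["mpirun", *mca_args(params), "true"]
print(" ".join(command))
```

The same name/value pairs could instead go into openmpi-mca-params.conf in the "name = value" form shown elsewhere in this thread.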


On Mar 16, 2021, at 2:43 AM, Vincent via users <users@lists.open-mpi.org> wrote:

On 09/03/2021 11:23, Sendu Bala via users wrote:
When using mpirun, how do you pick which ports are used?

I've tried:

mpirun --mca btl_tcp_port_min_v4 46207 --mca btl_tcp_port_range_v4 32 \
    --mca oob_tcp_keepalive_time 45 --mca oob_tcp_max_recon_attempts 20 \
    --mca oob_tcp_retry_delay 1 --mca oob_tcp_keepalive_probes 20 \
    --mca oob_tcp_keepalive_intvl 10 true

And also setting similar things in openmpi/etc/openmpi-mca-params.conf :

btl_tcp_port_min_v4 = 46207
btl_tcp_port_range_v4 = 32
oob_tcp_keepalive_time = 45
oob_tcp_max_recon_attempts = 20
oob_tcp_retry_delay = 1
oob_tcp_keepalive_probes = 20
oob_tcp_keepalive_intvl = 10

But when the process is running:

ss -l -p -n | grep "pid=57642,"
tcp  LISTEN 0      128      127.0.0.1:58439      0.0.0.0:*   users:(("mpirun",pid=57642,fd=14))
tcp  LISTEN 0      128        0.0.0.0:36253      0.0.0.0:*   users:(("mpirun",pid=57642,fd=17))

What am I doing wrong, and how do I get it to use my desired ports (and other 
settings above)?
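
A small Python sketch for automating this kind of check: it pulls the listening ports for one PID out of "ss -l -p -n" output and reports which of them fall outside a desired range. The function names are made up for this example, and it assumes the column layout shown above, with the local address in the fifth whitespace-separated field. The range 46207..46239 is used purely as an example.

```python
def listening_ports(ss_output, pid):
    """Extract local listening ports for `pid` from `ss -l -p -n` output."""
    ports = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header, short lines, and sockets owned by other processes.
        if len(fields) < 5 or f"pid={pid}," not in line:
            continue
        # Fifth field is assumed to be "local_address:port".
        _, sep, port = fields[4].rpartition(":")
        if sep and port.isdigit():
            ports.append(int(port))
    return ports


def outside_range(ports, low, high):
    """Return the ports that are not within [low, high]."""
    return [p for p in ports if not low <= p <= high]


SAMPLE = (
    'tcp  LISTEN 0 128 127.0.0.1:58439 0.0.0.0:* users:(("mpirun",pid=57642,fd=14))\n'
    'tcp  LISTEN 0 128 0.0.0.0:36253 0.0.0.0:* users:(("mpirun",pid=57642,fd=17))\n'
)

ports = listening_ports(SAMPLE, 57642)
print(outside_range(ports, 46207, 46239))  # prints [58439, 36253]
```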


Hello

Could this be related to a recently resolved bug?
Which version are you running?
Having a look at https://github.com/open-mpi/ompi/issues/8304 might be useful.


Regards

Vincent.

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a 
charity registered in England with number 1021457 and a company registered in 
England with number 2742969, whose registered office is 215 Euston Road, 
London, NW1 2BE.


