Hello all,
With our new branch date of May 14th, I will cut a PR on Monday morning, May 11th, in order to have any chance of merging these changes. That means I will have to set a cutoff for data collection, ideally by Friday, May 8th.
Please submit this data by then if you want to be
Assuming that I'm doing this correctly, setting the port to >1024 doesn't help:
mic:/amd/home/jdelsign/PMIx>env OMPI_MCA_btl_tcp_port_min_v4=5432 mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile2 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE:
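For reference, one way to double-check which ports the btl/tcp component actually bound after forcing a minimum (a diagnostic sketch, not from the thread; tx_basic_mpi is the test program named above):

```shell
# List listening TCP sockets on each node while the job runs. With -p, ss
# shows the owning process, so the ports bound by the tx_basic_mpi ranks are
# visible (process names require suitable permissions); fall back to the
# plain listener list otherwise.
ss -tlnp 2>/dev/null | grep tx_basic_mpi || ss -tln
```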
Hi Ralph,
It is starting to look like microway2 is the problem after all. Compared to the other two nodes, it has a complicated set of firewall rules that I don't understand at all (see below). The other two nodes have just the first three "-P" lines. I was under the impression that all
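A sketch of how to compare the firewall rules across the nodes (diagnostic commands only, shown as a fragment since they require root; the "-P" lines mentioned above are the chain default policies that iptables prints first):

```shell
# Dump the filter-table rules in iptables-save syntax; on a clean node only
# the three "-P" default-policy lines appear.
sudo iptables -S
# Show the INPUT chain with packet/byte counters, so a DROP or REJECT rule
# matching port 1024 stands out with a nonzero hit count.
sudo iptables -L INPUT -n -v --line-numbers
```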
My best guess is that port 1024 is being blocked in some fashion. Depending on
how you start it, OMPI may well pick a different port (it all depends on what
it gets assigned by the OS) that lets it make the connection. You could verify
this by setting "OMPI_MCA_btl_tcp_port_min_v4="
On May
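For reference, a sketch of pinning the btl/tcp port range via the environment (btl_tcp_port_min_v4 and btl_tcp_port_range_v4 are the relevant MCA parameters; the specific values and hostfile name here are illustrative):

```shell
# Force btl/tcp to bind ports in [5432, 5531] instead of whatever the OS
# would otherwise assign.
export OMPI_MCA_btl_tcp_port_min_v4=5432    # lowest port btl/tcp may bind
export OMPI_MCA_btl_tcp_port_range_v4=100   # number of ports above the minimum
mpirun -n 3 --map-by node --hostfile myhostfile2 ./tx_basic_mpi
```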
That seems to work (much to my surprise):
mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3+1
microway3 slots=16
microway1 slots=16
mic:/amd/home/jdelsign/PMIx>cat myhostfile1+3
microway1 slots=16
What happens if you run your "3 procs on two nodes" case using just microway1
and 3 (i.e., omit microway2)?
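A sketch of that experiment, using the hostfile conventions from the thread (the filename is made up):

```shell
# Hostfile with microway2 omitted; 16 slots per node, as in the other
# hostfiles shown in this thread.
cat > myhostfile_no2 <<'EOF'
microway1 slots=16
microway3 slots=16
EOF
mpirun -n 3 --map-by node --hostfile myhostfile_no2 ./tx_basic_mpi
```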
On May 4, 2020, at 9:05 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
Hi George,
10.71.2.58 is microway2 (which has been used in all of the configurations I've tried, so maybe that's why it appears to be the common denominator):
lid:/amd/home/jdelsign>host -l totalviewtech.com|grep microway
microway1.totalviewtech.com has address 10.71.2.52
John,
The common denominator across all these errors is a failure from connect()
while trying to reach 10.71.2.58 on port 1024. Who is 10.71.2.58? Is the
firewall open? Are connections to port 1024 allowed?
George.
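One way to answer those questions from the launching node (a hedged sketch; bash's /dev/tcp pseudo-device is used as the probe, and the host/port are the ones from the error messages):

```shell
host=10.71.2.58   # the address the connect() failures point at
port=1024
# bash opens a TCP connection for /dev/tcp/<host>/<port>; the timeout guards
# against a firewall rule that silently drops packets instead of rejecting.
if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
  echo "port $port on $host accepts connections"
else
  echo "port $port on $host is closed or filtered"
fi
```

An immediate failure usually means a REJECT rule or no listener; hanging for the full 3 seconds before failing suggests a DROP rule silently discarding the packets.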
On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <
Good to confirm - thanks! This does indeed look like an issue in the btl/tcp
component's reachability code.
On May 4, 2020, at 8:34 AM, John DelSignore <jdelsign...@perforce.com> wrote:
Inline below...
On 2020-05-04 11:09, Ralph Castain via devel wrote:
Staring at this some more, I do have the following questions:
* in your first case, it looks like "prte" was started from microway3 - correct?
Yes, "prte" was started from microway3.
* in the second case, that
Staring at this some more, I do have the following questions:
* in your first case, it looks like "prte" was started from microway3 - correct?
* in the second case, that worked, it looks like "mpirun" was executed from
microway1 - correct?
* in the third case, you state that "mpirun" was again
Hi folks,
I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and rebuilt. I'm running a very simple test code on three Centos 7.[56] nodes named microway[123] over TCP. I'm seeing a fatal error similar to the following:
[microway3.totalviewtech.com:227713]