Re: [OMPI devel] Running data collection for collectives tuning (slurm option included)

2020-05-04 Thread Zhang, William via devel
Hello all, with our new branch date of May 14th, in order to have any chance of merging these changes, I will cut a PR on Monday morning, May 11th. This means I will have to set a cutoff for data collection, ideally by Friday, May 8th. Please submit this data by then if you want to be

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread John DelSignore via devel
Assuming that I'm doing this correctly, setting the port to >1024 doesn't help: mic:/amd/home/jdelsign/PMIx> env OMPI_MCA_btl_tcp_port_min_v4=5432 mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile2 ./tx_basic_mpi tx_basic_mpi Hello from proc (0) MESSAGE:
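
For reference, a minimal sketch of the two equivalent ways to set that MCA parameter, assuming a shell on the launch node (the 5432 value just mirrors the test above):

    # via the environment, as in the post:
    export OMPI_MCA_btl_tcp_port_min_v4=5432
    mpirun -n 3 --map-by node --hostfile myhostfile2 ./tx_basic_mpi
    # or directly as an MCA flag on the command line:
    mpirun --mca btl_tcp_port_min_v4 5432 -n 3 --map-by node --hostfile myhostfile2 ./tx_basic_mpi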

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread John DelSignore via devel
Hi Ralph, It is starting to look like microway2 is the problem after all. Compared to the other two nodes, it has a complicated set of firewall rules that I don't understand at all (see below). The other two nodes have just the first three "-P" lines. I was under the impression that all
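
For context, the "-P" lines John refers to are iptables default-policy entries; a sketch of how the nodes could be compared, assuming root access on each (the policies shown are the common open defaults, not taken from the thread):

    # Dump the active ruleset; a node with no filtering shows only:
    iptables -S
    # -P INPUT ACCEPT
    # -P FORWARD ACCEPT
    # -P OUTPUT ACCEPT
    # Any lines beyond the three policies on microway2 are candidate culprits.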

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
My best guess is that port 1024 is being blocked in some fashion. Depending on how you start it, OMPI may well pick a different port (it all depends on what it gets assigned by the OS) that lets it make the connection. You could verify this by setting "OMPI_MCA_btl_tcp_port_min_v4=" On May
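
A sketch of the verification Ralph suggests, assuming a free high port (the 20000 value is illustrative, not from the thread):

    # Pin the TCP BTL to a port range known to be open, then rerun:
    export OMPI_MCA_btl_tcp_port_min_v4=20000
    mpirun -n 3 --map-by node --hostfile myhostfile2 ./tx_basic_mpi
    # For comparison, the ephemeral range the OS assigns from:
    cat /proc/sys/net/ipv4/ip_local_port_range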

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread John DelSignore via devel
That seems to work (much to my surprise): mic:/amd/home/jdelsign/PMIx> pterm pterm failed to initialize, likely due to no DVM being available mic:/amd/home/jdelsign/PMIx> cat myhostfile3+1 microway3 slots=16 microway1 slots=16 mic:/amd/home/jdelsign/PMIx> cat myhostfile1+3 microway1 slots=16
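
For readability, the flattened cat output above corresponds to hostfiles of roughly this shape (contents taken from the preview; the second file is cut off there):

    $ cat myhostfile3+1
    microway3 slots=16
    microway1 slots=16
    $ cat myhostfile1+3
    microway1 slots=16
    ...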

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
What happens if you run your "3 procs on two nodes" case using just microway1 and 3 (i.e., omit microway2)? On May 4, 2020, at 9:05 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote: Hi George, 10.71.2.58 is microway2 (which has been used in all of the configurations I've
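
A sketch of the two-node run Ralph is asking for, reusing the hostfile name from earlier in the thread (the flags are assumed from John's previous commands):

    mpirun -n 3 --map-by node --hostfile myhostfile1+3 ./tx_basic_mpi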

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread John DelSignore via devel
Hi George, 10.71.2.58 is microway2 (which has been used in all of the configurations I've tried, so maybe that's why it appears to be the common denominator): lid:/amd/home/jdelsign>host -l totalviewtech.com|grep microway microway1.totalviewtech.com has address 10.71.2.52
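
Lookups of this shape produce the name-to-address mapping John quotes (host -l performs a DNS zone transfer, which the server must permit; getent is a local fallback):

    host -l totalviewtech.com | grep microway
    getent hosts microway1 microway2 microway3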

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread George Bosilca via devel
John, The common denominator across all these errors is an error from connect() while trying to connect to 10.71.2.58 on port 1024. Who is 10.71.2.58? Is the firewall open? Is port 1024 allowed for connections? George. On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <
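
A quick way to answer George's questions from one of the other nodes, assuming netcat is installed (-v verbose, -z probe without sending data):

    nc -vz 10.71.2.58 1024     # is the port reachable through any firewall?
    getent hosts 10.71.2.58    # which host owns the address?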

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
Good to confirm - thanks! This does indeed look like an issue in the btl/tcp component's reachability code. On May 4, 2020, at 8:34 AM, John DelSignore <jdelsign...@perforce.com> wrote: Inline below... On 2020-05-04 11:09, Ralph Castain via devel wrote: Staring at this some more, I

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread John DelSignore via devel
Inline below... On 2020-05-04 11:09, Ralph Castain via devel wrote: Staring at this some more, I do have the following questions: * in your first case, it looks like "prte" was started from microway3 - correct? Yes, "prte" was started from microway3. * in the second case, that

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread Ralph Castain via devel
Staring at this some more, I do have the following questions: * in your first case, it looks like "prte" was started from microway3 - correct? * in the second case, that worked, it looks like "mpirun" was executed from microway1 - correct? * in the third case, you state that "mpirun" was again

[OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread John DelSignore via devel
Hi folks, I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and rebuilt. I'm running a very simple test code on three CentOS 7.[56] nodes named microway[123] over TCP. I'm seeing a fatal error similar to the following: [microway3.totalviewtech.com:227713]
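
A sketch of a comparable three-node run restricted to the TCP transport; the explicit btl selection is an assumption for isolating the failure, not a flag from the report:

    mpirun -n 3 --map-by node --hostfile myhostfile2 \
           --mca btl tcp,self ./tx_basic_mpi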