You might want to try a pure TCP benchmark across this problematic NIC (e.g., 
NetPIPE in TCP mode, or iperf).

That will take MPI out of the equation and show whether you can pass TCP 
traffic correctly at all.  Make sure to test message sizes both smaller and 
larger than your MTU.
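
For example, a rough sketch (substitute the remote host's actual address on 
the Broadcom interface for 10.1.10.11, which is just a placeholder here):

# on the remote host
iperf -s
# on the local host
iperf -c 10.1.10.11
# also check datagrams just under and over the MTU
# (1472 = 1500-byte MTU minus 28 bytes of IP/ICMP header)
ping -M do -s 1472 10.1.10.11
ping -s 2000 10.1.10.11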


> On May 14, 2016, at 1:25 AM, dpchoudh . <dpcho...@gmail.com> wrote:
> 
> No, I used IP addresses in all my tests. What I found is that if I used the IP 
> address of the Broadcom NIC in the hostfile and used that network exclusively 
> (via btl_tcp_if_include), the mpirun command hung silently. If I used the IP 
> address of another NIC in the hostfile (while still selecting the Broadcom 
> network exclusively), mpirun crashed, saying the remote process is unreachable. 
> If I used either of the other two networks exclusively (with any of their IP 
> addresses in the hostfile), it worked fine.
> 
> Since TCP itself does not care what the underlying NIC is, it is most likely 
> some kind of firewall issue, as you guessed (I did disable it, but there 
> could be other related issues). In any case, I believe it has nothing to do 
> with OMPI. One thing that is different between the Broadcom NIC and the rest 
> is that the Broadcom NIC is connected to the WAN side and thus gets its IP 
> via DHCP, whereas the rest have static IPs. I don't see why that would make 
> a difference, but it is possible that CentOS is enforcing some kind of 
> security policy that I am not aware of.
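> 
> (A quick way to check the zone-based policy on CentOS 7 is shown below, 
> assuming firewalld is in use; "public" is just a placeholder for whatever zone 
> the Broadcom interface actually lands in. A DHCP/WAN-facing interface can end 
> up in a more restrictive zone than the statically configured ones.)
> 
> firewall-cmd --get-active-zones
> firewall-cmd --zone=public --list-all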
> 
> Thank you for your feedback.
> 
> Durga
> 
> The surgeon general advises you to eat right, exercise regularly and quit 
> ageing.
> 
> On Sat, May 14, 2016 at 1:13 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> IIRC, OMPI internally uses networks (subnets) and not interface names.
> What did you use in your tests?
> Can you try with networks?
> 
> Cheers,
> 
> Gilles
> 
> On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
> Hello Gilles
> 
> Thanks for your prompt follow-up. It looks like this issue is somehow specific 
> to the Broadcom NIC. If I take it out, the rest of them work in any 
> combination. On further investigation, I found that the name that 'ifconfig' 
> shows for this interface is different from the name used in the internal 
> scripts. It could be a bug in CentOS, but at least it does not look like an 
> Open MPI issue.
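> 
> (One way to compare the two names, assuming the usual CentOS locations for the 
> network scripts:
> 
> ip -o link show
> grep -H '^DEVICE' /etc/sysconfig/network-scripts/ifcfg-*
> 
> If a DEVICE= value does not match anything 'ip link' reports, that ifcfg file 
> is not being applied to the interface you think it is.)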
> 
> Sorry for raising the false alarm.
> 
> Durga
> 
> The surgeon general advises you to eat right, exercise regularly and quit 
> ageing.
> 
> On Sat, May 14, 2016 at 12:02 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> First, I recommend you test 7 cases:
> - one network only (3 cases)
> - two networks only (3 cases)
> - three networks (1 case)
> 
> and see when things hang (a sketch of how to restrict the TCP BTL to one 
> network is shown below).
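> 
> for example (the subnets below are taken from your description, so adjust them 
> to match your actual setup):
> 
> mpirun --mca btl self,tcp --mca btl_tcp_if_include 10.1.10.0/24 ...
> mpirun --mca btl self,tcp --mca btl_tcp_if_include 10.10.10.0/24 ...
> mpirun --mca btl self,tcp --mca btl_tcp_if_include 10.1.10.0/24,10.10.10.0/24 ...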
> 
> you might also want to run
> mpirun --mca oob_tcp_if_include 10.1.10.0/24 ...
> to ensure no hang happens in the OOB layer.
> 
> As usual, double-check that no firewall is running and that your hosts can 
> ping each other.
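> 
> for example (assuming firewalld on CentOS, and substituting the peer's 
> address on each subnet for the placeholder below):
> 
> systemctl status firewalld
> ping -c 3 10.1.10.11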
> 
> Cheers,
> 
> Gilles
> 
> On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
> Dear developers
> 
> I have been observing this issue all along on the master branch, but have been 
> brushing it off as something to do with my installation.
> 
> Right now, I just downloaded a fresh checkout (via git pull), built and 
> installed it (after deleting /usr/local/lib/openmpi/) and I can reproduce the 
> hang 100% of the time.
> 
> Description of the setup:
> 
> 1. Two x86_64 boxes (dual Xeons, 6 cores each)
> 2. Four network interfaces, three of them running IP:
>     Broadcom GbE (IP 10.01.10.X/24), BW 1 Gbps
>     Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
>     QLogic InfiniBand (IP 10.01.11.X/24), BW 20 Gbps
>     LSI Logic Fibre Channel (not running IP; I don't think this matters)
> 
> All of the NICs have their link UP. All the NICs are in separate IP subnets, 
> connected back to back.
> 
> With this setup, the following command hangs. The hostfile is:
> 10.10.10.10 slots=1
> 10.10.10.11 slots=1
> 
> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca 
> pml ob1 ./mpitest
> 
> with the following output:
> 
> Hello world from processor smallMPI, rank 0 out of 2 processors
> Hello world from processor bigMPI, rank 1 out of 2 processors
> smallMPI sent haha!, rank 0
> bigMPI received haha!, rank 1
> 
> The stack trace at rank 0 is:
> 
> (gdb) bt
> #0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
> #1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540, tv=0x7ffc065d01b0) at 
> poll.c:165
> #2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop (base=0xddb540, 
> flags=2) at event.c:1630
> #3  0x00007f9cb7851e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40 
> <ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at 
> ../opal/threads/condition.h:76
> #5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2, 
> requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
> #6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, 
> source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>     at base/coll_base_barrier.c:63
> #7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs 
> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at 
> base/coll_base_barrier.c:313
> #8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed 
> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at 
> coll_tuned_decision_fixed.c:196
> #9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280 <ompi_mpi_comm_world>) 
> at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at mpitest.c:27
> 
> and at rank 1 is:
> 
> (gdb) bt
> #0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
> #1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540, tv=0x7ffd73013710) 
> at poll.c:165
> #2  0x00007f1101363180 in opal_libevent2022_event_base_loop (base=0x1d54540, 
> flags=2) at event.c:1630
> #3  0x00007f1101287e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40 
> <ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at 
> ../opal/threads/condition.h:76
> #5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2, 
> requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
> #6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, 
> source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>     at base/coll_base_barrier.c:63
> #7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs 
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at 
> base/coll_base_barrier.c:313
> #8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed 
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at 
> coll_tuned_decision_fixed.c:196
> #9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280 <ompi_mpi_comm_world>) 
> at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at mpitest.c:27
> 
> The code for the test program is:
> 
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
>     int world_size, world_rank, name_len;
>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(hostname, &name_len);
>     printf("Hello world from processor %s, rank %d out of %d processors\n", 
> hostname, world_rank, world_size);
>     if (world_rank == 1)
>     {
>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>     }
>     else
>     {
>         strcpy(buf, "haha!");
>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>         printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
> 
> I have a strong feeling that there is an issue with this kind of multi-NIC 
> configuration. I'll be more than happy to run further tests if someone asks me 
> to.
> 
> Thank you
> Durga
> 
> The surgeon general advises you to eat right, exercise regularly and quit 
> ageing.
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
