Hello Gilles,

Thanks again for your inputs. Since that code snippet works for you, I am now fairly certain that my 'instrumentation' has broken something; sorry for troubling the whole community while I climb the learning curve.

The netcat script that you mention does work correctly; that, and the fact that the issue happens even when I use the openib BTL, convinces me it is not a firewall issue.
Best regards,
Durga

We learn from history that we never learn from history.

On Sun, Apr 3, 2016 at 9:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> your program works fine on my environment.
>
> this is typical of a firewall running on your host(s), can you double
> check that?
>
> a simple way to do that is to
>
>     10.10.10.11# nc -l 1024
>
> and on the other node
>
>     echo ahah | nc 10.10.10.11 1024
>
> the first command should print "ahah" unless the host is unreachable
> and/or the tcp connection is denied by the firewall.
>
> Cheers,
>
> Gilles
>
> On 4/4/2016 9:44 AM, dpchoudh . wrote:
>
> Hello Gilles
>
> Thanks for your help.
>
> My question was more of a sanity check on myself. That little program I
> sent looked correct to me; do you see anything wrong with it?
>
> What I am running on my setup is an instrumented OMPI stack, taken from
> git HEAD, in an attempt to understand how some of the internals work. If
> you think the code is correct, it is quite possible that one of those
> 'instrumentations' is causing this.
>
> And BTW, adding -mca pml ob1 makes the code hang at MPI_Send() (as opposed
> to MPI_Recv()):
>
>     [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
>     [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
>     [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
>     [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
>     [smallMPI:51673] btl: tcp: attempting to connect() to [[51894,1],1] address 10.10.10.11 on port 1024   <--- Hangs here
>
> But 10.10.10.11 is pingable:
>
>     [durga@smallMPI ~]$ ping bigMPI
>     PING bigMPI (10.10.10.11) 56(84) bytes of data.
>     64 bytes from bigMPI (10.10.10.11): icmp_seq=1 ttl=64 time=0.247 ms
> On Sun, Apr 3, 2016 at 8:04 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Hi,
>>
>> per a previous message, can you give a try to
>>
>>     mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp --mca pml ob1 ./mpitest
>>
>> if it still hangs, the issue could be that Open MPI thinks some subnets are
>> reachable but they are not.
>>
>> for diagnostics:
>>
>>     mpirun --mca btl_base_verbose 100 ...
>>
>> you can explicitly include/exclude subnets with
>>
>>     --mca btl_tcp_if_include xxx
>> or
>>     --mca btl_tcp_if_exclude yyy
>>
>> for example,
>>
>>     mpirun --mca btl_tcp_if_include 192.168.0.0/24 -np 2 -hostfile ~/hostfile --mca btl self,tcp --mca pml ob1 ./mpitest
>>
>> should do the trick.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 4/4/2016 8:32 AM, dpchoudh . wrote:
>>
>> Hello all
>>
>> I don't mean to be competing for the 'silliest question of the year
>> award', but I can't figure this out on my own:
>>
>> My 'cluster' has 2 machines, bigMPI and smallMPI. They are connected via
>> several (types of) networks and the connectivity is OK.
>>
>> In this setup, the following program hangs after printing
>>
>>     Hello world from processor smallMPI, rank 0 out of 2 processors
>>     Hello world from processor bigMPI, rank 1 out of 2 processors
>>     smallMPI sent haha!
>>
>> Obviously it is hanging at MPI_Recv(). But why? My command line is as
>> follows, but this happens if I try the openib BTL (instead of TCP) as well.
>>
>>     mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp ./mpitest
>>
>> It must be something *really* trivial, but I am drawing a blank right now.
>>
>> Please help!
>>     #include <mpi.h>
>>     #include <stdio.h>
>>     #include <string.h>
>>
>>     int main(int argc, char** argv)
>>     {
>>         int world_size, world_rank, name_len;
>>         char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>
>>         MPI_Init(&argc, &argv);
>>         MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>         MPI_Get_processor_name(hostname, &name_len);
>>         printf("Hello world from processor %s, rank %d out of %d processors\n",
>>                hostname, world_rank, world_size);
>>         if (world_rank == 1)
>>         {
>>             MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>             printf("%s received %s\n", hostname, buf);
>>         }
>>         else
>>         {
>>             strcpy(buf, "haha!");
>>             MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>             printf("%s sent %s\n", hostname, buf);
>>         }
>>         MPI_Barrier(MPI_COMM_WORLD);
>>         MPI_Finalize();
>>         return 0;
>>     }
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/04/28876.php