Hello Gilles

Thanks again for your inputs. Since that code snippet works for you, I am
now fairly certain that my 'instrumentation' has broken something; sorry
for troubling the whole community while I climb the learning curve. The
netcat test that you mention does work correctly; that, and the fact that
the issue happens even when I use the openib BTL, convinces me that this is
not a firewall issue.

Best regards
Durga

We learn from history that we never learn from history.

On Sun, Apr 3, 2016 at 9:05 PM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:

> your program works fine in my environment.
>
> this is typical of a firewall running on your host(s); can you double-check
> that?
>
> a simple way to do that is to
> 10.10.10.11# nc -l 1024
>
> and on the other node
> echo ahah | nc 10.10.10.11 1024
>
> the first command should print "ahah" unless the host is unreachable
> and/or the tcp connection is denied by the firewall.
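>
> you can also check the firewall rules directly, for example (assuming your
> nodes run iptables and/or firewalld; adjust for whatever your distribution
> uses):
>
> # list the active netfilter rules
> iptables -L -n
> # with firewalld, list the ports allowed in the active zone
> firewall-cmd --list-ports
> # or temporarily stop the firewall for a quick test
> systemctl stop firewalld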
>
> Cheers,
>
> Gilles
>
>
>
> On 4/4/2016 9:44 AM, dpchoudh . wrote:
>
> Hello Gilles
>
> Thanks for your help.
>
> My question was more of a sanity check on myself. That little program I
> sent looked correct to me; do you see anything wrong with it?
>
> What I am running on my setup is an instrumented OMPI stack, taken from
> git HEAD, in an attempt to understand how some of the internals work. If
> you think the code is correct, it is quite possible that one of those
> 'instrumentations' is causing this.
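>
> To rule that out, I plan to stash my local changes and rebuild a pristine
> tree, roughly like this (assuming my usual build-from-git workflow; the
> exact steps may differ):
>
> # set aside the local instrumentation patches
> git stash
> # rebuild and reinstall the unmodified tree
> make clean && make install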
>
> And BTW, adding -mca pml ob1 makes the code hang at MPI_Send() (as opposed
> to MPI_Recv()):
>
> [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node
> 10.10.10.11
> [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node
> 10.10.10.11
> [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node
> 10.10.10.11
> [smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node
> 10.10.10.11
> [smallMPI:51673] btl: tcp: attempting to connect() to [[51894,1],1]
> address 10.10.10.11 on port 1024 <--- Hangs here
>
> But 10.10.10.11 is pingable:
> [durga@smallMPI ~]$ ping bigMPI
> PING bigMPI (10.10.10.11) 56(84) bytes of data.
> 64 bytes from bigMPI (10.10.10.11): icmp_seq=1 ttl=64 time=0.247 ms
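>
> Of course, ping only exercises ICMP. While the job is hung I can also try a
> TCP-level check of the exact port from the log (assuming it is still 1024),
> something like:
>
> # from smallMPI, attempt a TCP connection to the port bigMPI is listening
> # on; -v reports whether the connect() succeeds
> nc -v 10.10.10.11 1024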
>
>
> We learn from history that we never learn from history.
>
> On Sun, Apr 3, 2016 at 8:04 PM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>
>> Hi,
>>
>> per a previous message, can you give a try to
>> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp --mca pml ob1
>> ./mpitest
>>
>> if it still hangs, the issue could be that Open MPI thinks some subnets
>> are reachable when in fact they are not.
>>
>> for diagnostics:
>> mpirun --mca btl_base_verbose 100 ...
>>
>> you can explicitly include/exclude subnets with
>> --mca btl_tcp_if_include xxx
>> or
>> --mca btl_tcp_if_exclude yyy
>>
>> for example,
>> mpirun --mca btl_tcp_if_include 192.168.0.0/24 -np 2 -hostfile
>> ~/hostfile --mca btl self,tcp --mca pml ob1 ./mpitest
>> should do the trick
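>>
>> to see which subnets each node actually has (and hence what to include or
>> exclude), something like this should work, assuming an iproute2-based
>> Linux (ifconfig -a is the older equivalent):
>>
>> # list the IPv4 addresses and prefixes configured on this node
>> ip -4 addr show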
>>
>> Cheers,
>>
>> Gilles
>>
>>
>>
>>
>> On 4/4/2016 8:32 AM, dpchoudh . wrote:
>>
>> Hello all
>>
>> I don't mean to be competing for the 'silliest question of the year
>> award', but I can't figure this out on my own:
>>
>> My 'cluster' has 2 machines, bigMPI and smallMPI. They are connected via
>> several (types of) networks and the connectivity is OK.
>>
>> In this setup, the following program hangs after printing
>>
>> Hello world from processor smallMPI, rank 0 out of 2 processors
>> Hello world from processor bigMPI, rank 1 out of 2 processors
>> smallMPI sent haha!
>>
>>
>> Obviously it is hanging at MPI_Recv(). But why? My command line is as
>> follows, but this happens if I try openib BTL (instead of TCP) as well.
>>
>> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp ./mpitest
>>
>> It must be something *really* trivial, but I am drawing a blank right now.
>>
>> Please help!
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <string.h>
>>
>> int main(int argc, char** argv)
>> {
>>     int world_size, world_rank, name_len;
>>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>     MPI_Get_processor_name(hostname, &name_len);
>>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>>            hostname, world_rank, world_size);
>>     if (world_rank == 1)
>>     {
>>         /* rank 1 waits for the 6-byte message ("haha!" plus NUL) from rank 0 */
>>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>         printf("%s received %s\n", hostname, buf);
>>     }
>>     else
>>     {
>>         /* rank 0 sends the string, including its terminating NUL */
>>         strcpy(buf, "haha!");
>>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>         printf("%s sent %s\n", hostname, buf);
>>     }
>>     MPI_Barrier(MPI_COMM_WORLD);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>>
>>
>> We learn from history that we never learn from history.