We just discovered this ticket, which might describe the same problem that
we have:

https://svn.open-mpi.org/trac/ompi/ticket/1505

It seems unresolved... do you have a workaround for it? I've seen the "-mca
opal_net_private_ipv4" parameter, but I don't know exactly how to use it.
At least, my experiments with it had no effect.

I'll be very grateful for your help,
Krzysztof


2010/11/17 Grzegorz Maj <ma...@wp.pl>

> 2010/11/11 Jeff Squyres <jsquy...@cisco.com>:
> > On Nov 11, 2010, at 3:23 PM, Krzysztof Zarzycki wrote:
> >
> >> No, unfortunately specification of interfaces is a little more
> complicated...  eth0/1/2 is not common for both machines.
> >
> > Can you define "common"?  Do you mean that eth0 on one machine is on a
> different network than eth0 on the other machine?
> >
> > Is there any way that you can make them the same?  It would certainly
> make things easier.
>
> Yes, they are on different networks and unfortunately we are not
> allowed to play with this.
>
> >
> >> I've tried to play with (oob/btl)_tcp_if_include, but actually... I
> don't know exactly how.
> >
> > See my other mail:
> >
> >    http://www.open-mpi.org/community/lists/users/2010/11/14737.php
> >
> >> Anyway, do you have any ideas how to further debug the communication
> problem?
> >
> > The connect() is not getting through somehow.  Sadly, we don't have
> enough debug messages to show exactly what is going wrong when these kinds
> of things happen; I have a half-finished branch that has much better
> debug/error messages, but I've never had the time to finish it (indeed, I
> think there's a bug in that development branch right now, otherwise I'd
> recommend giving it a whirl).  :-\
>
> Analyzing the strace of both processes shows that on both sides the
> call to 'poll' after connect/accept succeeds. As I understand it, they
> even exchange some information, which is always 8 bytes, like
> D\227\0\1\0\0\0\0. One of them sends this information and the other
> receives it. But after receiving it, the receiver does:
>
> ----
> recv(8, "\5g\0\1\0\0\0\0", 8, 0)        = 8
> fcntl64(8, F_GETFL)                     = 0x2 (flags O_RDWR)
> fcntl64(8, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
> getpeername(8, {sa_family=AF_INET, sin_port=htons(57885),
> sin_addr=inet_addr("10.0.0.2")}, [16]) = 0
> close(8)
> ----
>
> In a working scenario (on other machines), after these bytes are
> received they are sent back, and then the proper communication
> proceeds (my 'hello' message is sent).
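
To look at those 8 bytes more closely, here is a small Python sketch that
turns the strace escapes into raw bytes and splits them into two 32-bit
words. Reading them as two big-endian 32-bit fields is only my assumption
(I have not checked the actual Open MPI wire format), and the helper names
are made up for this mail.

```python
import struct


def strace_to_bytes(escaped: str) -> bytes:
    """Convert a strace-style escaped string (octal escapes like \\227) to raw bytes."""
    # unicode_escape understands the same backslash/octal escapes strace prints;
    # latin-1 maps code points 0..255 back to single bytes unchanged.
    return escaped.encode("ascii").decode("unicode_escape").encode("latin-1")


def split_words(payload: bytes):
    """Split an 8-byte payload into two unsigned 32-bit words.

    ASSUMPTION: network (big-endian) byte order and a 4+4 layout; this is a
    guess for inspection purposes, not the documented Open MPI format.
    """
    if len(payload) != 8:
        raise ValueError("expected exactly 8 bytes, got %d" % len(payload))
    return struct.unpack("!II", payload)


# The two payloads quoted from the strace output in this thread.
for escaped in (r"D\227\0\1\0\0\0\0", r"\5g\0\1\0\0\0\0"):
    raw = strace_to_bytes(escaped)
    first, second = split_words(raw)
    print(raw.hex(), hex(first), second)
```

Under that assumption both payloads have the same shape (a non-zero first
word and a zero second word), so the failing side does seem to receive a
well-formed message before it closes the socket.
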
>
> The above address 10.0.0.2 is eth2 on the host machine, which indeed
> should be used in this communication.
>
> While playing with the network interfaces, it turned out that when we
> bring down one of the aliases (eth2:0), it starts working. How can we
> force mpirun not to use this alias while it is up? We tried
> (oob/btl)_tcp_if_exclude with eth2:0 specified, but it doesn't
> seem to help.
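
For the record, this is roughly the kind of command line we have been
trying, written as a small Python sketch so the two parameters are spelled
out. The program name "./hello" and the process count are placeholders, and
adding "lo" to the list reflects my understanding that setting if_exclude
replaces the default exclusion list; whether the alias name eth2:0 is
actually matched is exactly what we are unsure about.

```python
import shlex


def build_mpirun_args(program, np, exclude):
    """Build an mpirun argument list excluding interfaces for both TCP components.

    `program` and `np` are placeholders for illustration; `exclude` is the
    list of interface names passed to oob_tcp_if_exclude and btl_tcp_if_exclude.
    """
    ifaces = ",".join(exclude)
    args = ["mpirun", "-np", str(np)]
    for framework in ("oob", "btl"):
        # Same exclusion list for out-of-band and MPI point-to-point traffic.
        args += ["--mca", "%s_tcp_if_exclude" % framework, ifaces]
    args.append(program)
    return args


args = build_mpirun_args("./hello", 2, ["lo", "eth2:0"])
print(" ".join(shlex.quote(a) for a in args))
```
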
>
> Regards,
> Grzegorz
>
>
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>
