Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-12 Thread Matteo Guglielmi
Jeff Squyres wrote:
> On Feb 12, 2007, at 2:34 PM, Matteo Guglielmi wrote:
>
>   
>> Those "eth1" NICs are not connected at all... all the machines use
>> only the eth0 interface, which has a different IP on each PC.
>> 
>
> Gotcha.  But, FWIW, OMPI doesn't know that because they have valid IP  
> addresses.  So it thinks they're on the same subnet (on the same  
> host, actually), and therefore thinks that they should be routable.
>
>   
>> Anyway, you solved my problem by pointing me to those FAQ entries!!!
>>   --mca btl_tcp_if_exclude lo,eth1
>> That's the magic option which works for me!!!
>> 
>
> Excellent -- glad to help.
>
> Another solution might be to simply disable those NICs since they're  
> not hooked up to anything; then OMPI should work without any options.
>   
Yep that's even better!
> Good luck!
>
>   
Thanks again,

Until now I had been playing around with the firewall and couldn't get
anywhere... and now I know why: the problem wasn't there!!!

Oh my gosh... you helped me a lot!

Cheers,
MG.


Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-12 Thread Matteo Guglielmi
Jeff Squyres wrote:
> On Feb 12, 2007, at 12:54 PM, Matteo Guglielmi wrote:
>
>   
>> This is the ifconfig output from the machine I use to submit the
>> parallel job:
>> 
>
> It looks like both of your nodes share an IP address:
>
>   
>> [root@lcbcpc02 ~]# ifconfig
>> eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:C9
>>   inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
>> [root@lcbcpc04 ~]# ifconfig
>> eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:75
>>   inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
>> 
>
> This will be a problem for more than just OMPI if these two
> interfaces are on the same network.  The solution is to ensure that  
> all your nodes have unique IP addresses.
>
> If these NICs are on different networks, then it's a valid network
> configuration, but Open MPI (by default) will assume that these are
> routable to each other.  You can tell Open MPI to not use eth1 in
> this case -- see these FAQ entries for details:
>
>   http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>   http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>   http://www.open-mpi.org/faq/?category=tcp#tcp-routability
>
>   
Those "eth1" NICs are not connected at all... all the machines use only
the eth0 interface, which has a different IP on each PC.

Anyway, you solved my problem by pointing me to those FAQ entries!!!

  --mca btl_tcp_if_exclude lo,eth1

That's the magic option which works for me!!!

Thanks Jeff!!!

MG.


Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-12 Thread Jeff Squyres

On Feb 12, 2007, at 12:54 PM, Matteo Guglielmi wrote:

> This is the ifconfig output from the machine I use to submit the
> parallel job:

It looks like both of your nodes share an IP address:

> [root@lcbcpc02 ~]# ifconfig
> eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:C9
>   inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
>
> [root@lcbcpc04 ~]# ifconfig
> eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:75
>   inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0


This will be a problem for more than just OMPI if these two
interfaces are on the same network.  The solution is to ensure that  
all your nodes have unique IP addresses.
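
A quick way to spot this kind of clash across nodes is to group the reported addresses by value. The sketch below is only an illustration: `find_shared_addresses` is a hypothetical helper (not an Open MPI tool), and the input tuples are copied by hand from the ifconfig output quoted above.

```python
from collections import defaultdict


def find_shared_addresses(interfaces):
    """Map each IP address configured on more than one host to its users.

    `interfaces` is a list of (host, interface, address) tuples.
    """
    seen = defaultdict(list)
    for host, iface, addr in interfaces:
        seen[addr].append((host, iface))
    # Keep only addresses that appear on at least two distinct hosts.
    return {
        addr: users
        for addr, users in seen.items()
        if len({host for host, _ in users}) > 1
    }


if __name__ == "__main__":
    # Values copied from the ifconfig output quoted in this thread.
    reported = [
        ("lcbcpc02", "eth1", "192.168.0.1"),
        ("lcbcpc04", "eth1", "192.168.0.1"),
    ]
    for addr, users in find_shared_addresses(reported).items():
        print(addr, "is configured on", users)
```

An address that shows up on two different hosts is either a misconfiguration (same network) or a sign that the interface should be excluded from Open MPI's TCP traffic (separate or unconnected networks).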


If these NICs are on different networks, then it's a valid network
configuration, but Open MPI (by default) will assume that these are
routable to each other.  You can tell Open MPI to not use eth1 in
this case -- see these FAQ entries for details:


  http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
  http://www.open-mpi.org/faq/?category=tcp#tcp-selection
  http://www.open-mpi.org/faq/?category=tcp#tcp-routability
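
As a concrete illustration, the option reported as working in this thread is `--mca btl_tcp_if_exclude lo,eth1`. The sketch below only assembles the corresponding command line; `mpirun_command` is a hypothetical helper, the hostfile and program paths are the ones from the original post, and keeping `lo` in the exclude list alongside `eth1` follows what was reported to work here.

```python
def mpirun_command(hostfile, nprocs, program_args, exclude=("lo", "eth1")):
    """Build an mpirun argument list that excludes the given interfaces."""
    return [
        "mpirun",
        # Tell the TCP BTL to ignore these interfaces (comma-separated list).
        "--mca", "btl_tcp_if_exclude", ",".join(exclude),
        "--hostfile", hostfile,
        "-np", str(nprocs),
        *program_args,
    ]


if __name__ == "__main__":
    cmd = mpirun_command(
        "~matteo/hostfile",
        8,
        ["/home/matteo/Software/NWChem/5.0/bin/nwchem", "./nwchem.nw"],
    )
    # Print the command for copy-and-paste rather than executing it here;
    # the "~" in the hostfile path relies on the shell to expand it.
    print(" ".join(cmd))
```

The command is printed rather than run so that it can be checked before being pasted into a shell on the submitting node.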

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-12 Thread Jeff Squyres
I'm assuming that these are Linux hosts.  If so, errno 111 is
"connection refused", possibly meaning that there is still some
firewall active or the wrong interface is being used to establish  
connections between these machines.
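
The number-to-name mapping can be checked on the host itself. A minimal Python sketch, assuming it is run on one of the Linux nodes (errno numbers are platform-specific, so 111 may mean something else on other systems); `describe_errno` is a hypothetical helper name, not part of Open MPI:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "ECONNREFUSED";
    # values are platform-specific, so fall back to "UNKNOWN" if undefined.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 111 is the value reported by connect() in this thread.
    name, message = describe_errno(111)
    print(f"errno=111 -> {name}: {message}")
```

On Linux this reports ECONNREFUSED, i.e. the remote address was reachable but nothing accepted the connection on that port.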


Can you send the output of "ifconfig" (might be /sbin/ifconfig on  
your machine?) from both machines?



On Feb 11, 2007, at 3:45 PM, matteo.guglie...@epfl.ch wrote:

> Since I've installed openmpi I cannot submit any job that uses cpus from
> different machines.
>
> ### hostfile ###
> lcbcpc02.epfl.ch slots=4 max-slots=4
> lcbcpc04.epfl.ch slots=4 max-slots=4
>
> ### error message ###
> [matteo@lcbcpc02 TEST]$ mpirun --hostfile ~matteo/hostfile -np 8 \
>     /home/matteo/Software/NWChem/5.0/bin/nwchem ./nwchem.nw
> [0,1,5][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> [0,1,6][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=111
> 6: lcbcpc04.epfl.ch len=16
> [0,1,4][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=111
> 4: lcbcpc04.epfl.ch len=16
> [0,1,7][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=111
> 7: lcbcpc04.epfl.ch len=16
> connect() failed with errno=111
> 5: lcbcpc04.epfl.ch len=16
> #
>
> I did disable the firewall on both machines but I still get that
> error message.
>
> Thanks,
> MG.



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems