Jeff Squyres wrote:
I'm not sure what USE=-threads means, but I would discourage the use of threads in the v1.2 series; our thread support there is pretty much broken.
That's exactly what it means, hence the following BFW (big fat warning) I had originally inserted into the package to this effect:

       ewarn
       ewarn "WARNING: use of threads is still disabled by default in"
       ewarn "upstream builds."
       ewarn "You may stop now and set USE=-threads"
       ewarn
       epause 5

...ok, so it's maybe not that B and F but it's still there to be noticed and logged ;)


On Sep 10, 2008, at 7:52 PM, Eric Thibodeau wrote:

Prasanna, also make sure you try with USE=-threads (i.e., with thread support disabled) ...as the ebuild states, thread support is _experimental_ ;)

Keep your eye on: https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport

Eric

Prasanna Ranganathan wrote:

Hi,

I have upgraded my Open MPI installation to 1.2.6 (we run Gentoo, and emerge showed
1.2.6-r1 to be the latest stable version of Open MPI).

I do still get the following error messages when running my test helloWorld
program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Again, this error does not happen on every run of the test program; it
occurs only intermittently.
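
For reference, errno 113 on Linux is EHOSTUNREACH, the "No route to host" from
the thread's subject line. A quick way to confirm what a given node's C library
prints for that error number is a one-off check along the lines of the sketch
below (just standard C, nothing Open MPI specific):

    #include <stdio.h>
    #include <string.h>

    /* Print the C library's text for errno 113 (EHOSTUNREACH on Linux). */
    int main(void)
    {
        printf("errno 113: %s\n", strerror(113));
        return 0;
    }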

How do I take care of this?

Regards,

Prasanna.


On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" <users-requ...@open-mpi.org>
wrote:


Date: Mon, 8 Sep 2008 16:43:33 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
To: Open MPI Users <us...@open-mpi.org>

Are you able to upgrade to Open MPI v1.2.7?

There were *many* bug fixes and changes in the 1.2 series compared to
the 1.1 series; some, in particular, dealt with TCP socket timeouts
(which become important when running large numbers of MPI processes).



On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:


Hi,

I am trying to run a test mpiHelloWorld program that simply
initializes the MPI environment on all the nodes, prints the
hostname and rank of each process in the job, and exits.
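
For completeness, the test program is essentially the minimal MPI "hello world"
sketched below; this is an illustration of what the test does, not the exact
source of /main/mpiHelloWorld:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char hostname[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                  /* initialize the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes      */
        MPI_Get_processor_name(hostname, &len);  /* hostname of this node          */

        printf("Hello from %s, rank %d of %d\n", hostname, rank, size);

        MPI_Finalize();
        return 0;
    }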

I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
(each node has 2 dual-core CPUs).

I get the following error messages when I run my program as follows:
mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.....
.....
.....
[0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
.....
.....

The main thing is that I get these error messages on around 3-4 out of
10 attempts, with the rest all completing successfully. I have looked
through the FAQ in detail and also checked the TCP BTL settings, but I
am not able to figure it out.

All 499 nodes have only eth0 active, and I get the error even
when I run the following:
mpirun -np 997 -bynode -hostfile nodelist --mca btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info --all.

The following is the output of /sbin/ifconfig on the node where I
start the MPI processes (it is one of the 499 nodes):

eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
          inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
          TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
          Interrupt:22 Base address:0xc000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)


Kindly help.

Regards,

Prasanna.

<ompi_info.rtf>