I realize it is no longer in the history of replies for this message, but the reason I am trying to use TCP instead of Infiniband is the following:

We are using an in-house program called ScalIT that performs operations on very large sparse distributed matrices. ScalIT works on other clusters with comparable hardware and software, but not ours.
Other programs run just fine on our cluster using OpenMPI.
ScalIT runs to completion using OpenMPI *on a single 12-core node*.

It was suggested to me by another list member that I try forcing the use of TCP instead of Infiniband, so that's what I've been trying, just to see if it will work. I guess the TCP code path is expected to be more reliable? The MCA parameters used to produce the current error are: "--mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,ib0"
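
For reference, the full command line looks roughly like this (the process count, hostfile, and executable name are placeholders for what our job script actually uses):

    mpirun -np 240 --hostfile hosts \
        --mca btl self,sm,tcp \
        --mca btl_tcp_if_exclude lo,ib0 \
        ./scalit.x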

        The previous Infiniband error message is:
---
local QP operation err (QPN 7c1d43, WQE @ 00015005, CQN 7a009a, index 307512)
  [ 0] 007c1d43
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 026b2ed0
  [14] 00000000
  [18] 00015005
  [1c] ff100000
[[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 246f300 opcode 128 vendor error 107 qp_idx 0
---

It was also suggested that I disable eager RDMA. Doing this ("--mca btl_openib_use_eager_rdma 0") results in:
---
[[30430,1],234][btl_openib_component.c:3492:handle_wc] from compute-1-18.local to: compute-6-10 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 opcode 128 vendor error 244 qp_idx 0
---
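
For completeness, the openib run that produced this was invoked along these lines (I'm reconstructing the command from memory, so the btl list, process count, and executable name are approximate):

    mpirun -np 240 --hostfile hosts \
        --mca btl self,sm,openib \
        --mca btl_openib_use_eager_rdma 0 \
        ./scalit.x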

All the Infiniband errors occur at the same point relative to the program output and reference the same OpenMPI source line. (It is notoriously difficult to trace through this program to pin down exactly where the error occurs, as ScalIT is written in appalling FORTRAN.)

I had another problem with a completely different code, also in FORTRAN and also from the same research group, in which ScaLAPACK initialization segfaulted when the code was compiled with Intel Composer 13.1.3 and 11.1-080. Switching to MVAPICH2 solved that problem, but I wonder if maybe a convention of some sort is being violated in ScalIT such that the semantics do not behave as expected. I'm kind of grasping at straws here, and any leads are appreciated.

T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)

On 04/29/2014 11:00 AM, users-requ...@open-mpi.org wrote:
Date: Mon, 28 Apr 2014 22:07:08 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Connection timed out on TCP

In principle, there's nothing wrong with using ib0 interfaces for TCP MPI 
communication, but it does raise the question of why you're using TCP when you 
have InfiniBand available...?

Aside from that, can you send all the info listed here:

    http://www.open-mpi.org/community/help/



On Apr 28, 2014, at 11:08 AM, Vince Grimes <tom.gri...@ttu.edu> wrote:

After excluding the ib0 interfaces, I still get "Connection timed out" errors even over the Ethernet interfaces.

At the end of the output I now get the following message in addition to the one above:

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    client handshake fail
from the file:
    help-mpi-btl-tcp.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------

The Ethernet switches are managed. Is it likely there is something set wrong?

T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)

On 04/25/2014 04:22 PM, users-requ...@open-mpi.org wrote:

Date: Fri, 25 Apr 2014 14:56:47 -0500
From: Vince Grimes <tom.gri...@ttu.edu>
To: <us...@open-mpi.org>
Subject: [OMPI users] Connection timed out on TCP

There is no firewall on this subnet as it is the internal Ethernet for
the cluster.

However, I double-checked the offending IPs and discovered they are
Infiniband IPoIB addresses! I'm now trying to exclude the ib0 interface
as in https://www.open-mpi.org/faq/?category=tcp#tcp-selection
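
Per that FAQ entry, the two forms I'm experimenting with look like this (eth0 is just an example interface name; you can use either the exclude list or the include list, not both):

    # exclude loopback and the IPoIB interface from the TCP BTL
    mpirun --mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,ib0 ...

    # or whitelist only the Ethernet interface
    mpirun --mca btl self,sm,tcp --mca btl_tcp_if_include eth0 ...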

T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)

On 04/25/2014 11:00 AM, users-requ...@open-mpi.org wrote:
Date: Thu, 24 Apr 2014 19:49:26 -0700
From: Ralph Castain <r...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Connection timed out on TCP and notify question

Sounds like either a routing problem or a firewall. Are there multiple NICs
on these nodes? Looking at the quoted NIC in your error message, is that the
correct subnet we should be using? Have you checked to ensure no firewalls
exist on that subnet between the nodes?

On Apr 24, 2014, at 8:41 AM, Vince Grimes <tom.gri...@ttu.edu> wrote:
Dear all:

	In the ongoing investigation into why a particular in-house program is not
working in parallel over multiple nodes using OpenMPI, I have been running
into the following error when running with "--mca btl self,sm,tcp":

[compute-6-15.local][[8185,1],0 
[btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect] connect() to 
10.7.36.247 failed: Connection timed out (110)

I thought at first it was due to running out of file handles (sockets are 
considered files), but I have amended limits.d to allow 102400 files (up from 
the default of 1024), which should be more than enough.
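
For the record, the limit was raised with a drop-in file along these lines (the exact file name is just what we chose; the values are what we set):

    # /etc/security/limits.d/90-nofile.conf
    *    soft    nofile    102400
    *    hard    nofile    102400

ulimit -n in a fresh login shell on the compute nodes should confirm the new value.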

	What is going on? Trying to connect to 4 of the 20 nodes gave the error above.

	My second question involves the notify system for btl openib. What does
the syslog notifier require in order to work? I want to see if the errors
when running the same program with openib are due to dropped IB connections.

--
T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)