Dear all:
The subject heading is a little misleading because this is in response
to part of that original message. I tried the first two suggestions
below (disabling eager RDMA and using the tcp BTL), but to no avail. In
all cases I am running across 20 12-core nodes through SGE. In the first
case (eager RDMA disabled), I get the errors:
***
[[30430,1],234][btl_openib_component.c:3492:handle_wc] from
compute-1-18.local to: compute-6-10 error polling HP CQ with status WORK
REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 opcode 128
vendor error 244 qp_idx 0
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: compute-4-13.local
PID: 22356
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[compute-6-1.local:22658] 2 more processes have sent help message
help-odls-default.txt / odls-default:could-not-kill
[compute-6-1.local:22658] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
--------------------------------------------------------------------------
***
The first error occurs at the same place as before
([btl_openib_component.c:3492:handle_wc]), and the message is only
slightly different: the completion queue is now HP rather than LP.
For the second suggestion, using the tcp BTL, I got a whole load of these:
***
[compute-3-1.local][[20917,1],74][btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.7.36.244 failed: Connection timed out (110)
***
There are 1826 "Connection timed out" errors, at an earlier spot in the
code than in the case above. I checked iptables, and there is no rule
that should block these connections. Is it possible I'm out of file
descriptors (since sockets count as files)? `ulimit -n` yields 1024.
T. Vince Grimes, Ph.D.
CCC System Administrator
Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061
(806) 834-0813 (voice); (806) 742-1289 (fax)
On 03/22/2014 11:00 AM, users-requ...@open-mpi.org wrote:
----------------------------------------------------------------------
Message: 1
Date: Fri, 21 Mar 2014 20:16:31 +0000
From: Joshua Ladd <josh...@mellanox.com>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Call stack upon MPI routine error
Message-ID:
<8edebdde2c39d447a738659597bbb63a3ed12...@mtidag01.mtl.com>
Content-Type: text/plain; charset="us-ascii"
Hi, Vince
Couple of ideas off the top of my head:
1. Try disabling eager RDMA. Eager RDMA can consume significant
resources: "-mca btl_openib_use_eager_rdma 0"
2. Try using the TCP BTL - is the error still present?
3. Try the poor man's debugger - print the PID and hostname of the process,
then put a while(1) at btl_openib_component.c:3492 so that the process
hangs when it hits this error. Hop over to the node and attach to the hung
process. You can move up the call stack from there.
Best,
Josh
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Vince Grimes
Sent: Friday, March 21, 2014 3:52 PM
To: us...@open-mpi.org
Subject: [OMPI users] Call stack upon MPI routine error
OpenMPI folks:
I have mentioned before a problem with an in-house code (ScalIT) that
generates the error message
[[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local
to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR
status number 2 for wr_id 246f300 opcode 128 vendor error 107 qp_idx 0
at a specific, reproducible point. It was suggested that the error could be due
to memory problems, such as the amount of registered memory. I have already
corrected the amount of registered memory per the URLs that were given to me.
My question today is two-fold:
First, is it possible that ScalIT uses so much memory that there is no memory
to register for IB communications? ScalIT is very memory-intensive and has to
run distributed just to get a large matrix in memory (split between nodes).
Second, is there a way to trap that error so I can see the call stack, showing
the MPI function called and exactly where in the code the error was generated?