Hi, Vince

Couple of ideas off the top of my head:

1. Try disabling eager RDMA. Eager RDMA can consume significant resources:
"-mca btl_openib_use_eager_rdma 0"
(example mpirun command lines for this and for suggestion 2 are sketched right after 
this list)

2. Try using the TCP BTL - is the error still present?

3. Try the poor man's debugger - print the pid and hostname of the process, and then 
put a while(1) loop at btl_openib_component.c:3492 so that the process hangs when it 
hits this error. Hop over to that node and attach a debugger to the hung process; from 
there you can walk up the call stack. (A sketch of this hack also follows below.)
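
For suggestions 1 and 2, assuming you launch directly with mpirun (adjust for your 
scheduler, and substitute your real executable for the "./your_app" placeholder I'm 
using here), the command lines would look roughly like:

    mpirun -mca btl_openib_use_eager_rdma 0 ./your_app
    mpirun -mca btl tcp,self,sm ./your_app

The first just turns off eager RDMA; the second skips the openib BTL entirely and runs 
over TCP plus shared memory. Keep whatever other mpirun arguments you normally use - 
these lines only show where the MCA parameters go.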
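
For suggestion 3, here is a minimal sketch of that hack as I would paste it into 
handle_wc() just above the error message at btl_openib_component.c:3492. The variable 
names are mine, and I'm assuming <stdio.h> and <unistd.h> are already pulled in by 
that file (add them if not):

    /* Poor man's debugger: announce which process hit the error, then
     * spin until someone attaches a debugger and releases us. */
    {
        char hostname[256];
        volatile int keep_waiting = 1;   /* clear this from gdb to resume */

        gethostname(hostname, sizeof(hostname));
        printf("PID %d on %s hit the handle_wc error; waiting for a debugger\n",
               (int) getpid(), hostname);
        fflush(stdout);

        while (keep_waiting) {
            sleep(5);                    /* spin without burning the CPU */
        }
    }

Once the message appears, ssh to that node and run "gdb -p <pid>"; "bt" will show the 
full call stack, including the MPI routine that was active when the error fired. 
"set var keep_waiting = 0" followed by "continue" lets the process carry on if you 
want it to.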

Best,

Josh

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Vince Grimes
Sent: Friday, March 21, 2014 3:52 PM
To: us...@open-mpi.org
Subject: [OMPI users] Call stack upon MPI routine error

OpenMPI folks:

        I have mentioned before a problem with an in-house code (ScalIT) that 
generates the error message

[[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local 
to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR 
status number 2 for wr_id 246f300 opcode 128  vendor error 107 qp_idx 0

at a specific, reproducible point. It was suggested that the error could be due 
to memory problems, such as the amount of registered memory. I have already 
corrected the amount of registered memory per the URLs that were given to me. 
My question today is two-fold:

First, is it possible that ScalIT uses so much memory that there is no memory 
to register for IB communications? ScalIT is very memory-intensive and has to 
run distributed just to get a large matrix in memory (split between nodes).

Second, is there a way to trap that error so I can see the call stack, showing 
the MPI function called and exactly where in the code the error was generated?

--
T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A), Box 41061, Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
