It looks like we are touching a QP that was already released. Before closing a
QP we make sure to complete all outstanding messages on the endpoint. Once all
QPs (and other resources) are closed, we signal the async thread to remove this
HCA from its monitoring list. To me it looks like somehow we clos…
I can't tell whether this is a problem; if it is one at all, I suspect it's a
small one.
In mca_bml_r2_del_proc_btl(), a BTL is removed from the send list and
from the RDMA list.
If the BTL is removed from the send list, the endpoint's max send size
is recomputed to be the minimum…
I'd guess the same thing as George: a race condition in the shutdown of the
async thread...? I haven't looked at that code in a long time, so I don't
remember how it tried to defend against the race condition.
On Jan 3, 2011, at 2:31 PM, "Eugene Loh" wrote:
George Bosilca wrote:
Eugene,
This error indicates that somehow we're accessing the QP while it is in the
"down" state. As the asynchronous thread is the one that sees this error, I
wonder whether it is looking for information about a QP that has already been
destroyed by the main thread (as this on…
In addition, it would be really, really nice if someone would consolidate the
watching of these devices into other mechanisms.
The idea here is that the error can be noticed asynchronously, so it can't be
part of the main libevent fd-watching (which is only checked once in a while).
The async…
WHAT: convert orte to start by launching a virtual machine across all allocated
nodes
WHY: support topologically-aware mapping methods
WHEN: sometime over the next couple of months
***
Several of us (including Jeff, Terry, Josh, and Ralph) are…