Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Shamis, Pavel
It looks like we are touching a QP that was already released. Before closing the QP, we make sure to complete all outstanding messages on the endpoint. Once all QPs (and other resources) are closed, we signal the async thread to remove this HCA from the monitoring list. To me it looks like somehow we clos

[OMPI devel] mca_bml_r2_del_proc_btl()

2011-01-03 Thread Eugene Loh
I can't tell if this is a problem, though I suspect it's a small one even if it's a problem at all. In mca_bml_r2_del_proc_btl(), a BTL is removed from the send list and from the RDMA list. If the BTL is removed from the send list, the end-point's max send size is recomputed to be the minimu

Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Jeff Squyres (jsquyres)
I'd guess the same thing as George - a race condition in the shutdown of the async thread...? I haven't looked at that code in a long, long time, so I don't remember how it tried to defend against the race condition. Sent from my PDA. No type good. On Jan 3, 2011, at 2:31 PM, "Eugene Loh" wrote: > Geo

Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Eugene Loh
George Bosilca wrote: Eugene, This error indicates that somehow we're accessing the QP while the QP is in the "down" state. As the asynchronous thread is the one that sees this error, I wonder if it isn't looking for some information about a QP that has been destroyed by the main thread (as this on

Re: [OMPI devel] async thread in openib BTL

2011-01-03 Thread Jeff Squyres
In addition, it would be really, really nice if someone would consolidate the watching of these devices into other mechanisms. The idea here is that the error can be noticed asynchronously, so it can't be part of the main libevent fd-watching (which is only checked once in a while). The async

[OMPI devel] RFC: VM launch

2011-01-03 Thread Ralph Castain
WHAT: convert orte to start by launching a virtual machine across all allocated nodes WHY: support topologically-aware mapping methods WHEN: sometime over the next couple of months *** Several of us (including Jeff, Terry, Josh, and Ralph) are