Re: [OMPI devel] BTL add procs errors

2010-06-05 Thread Ralph Castain
Good point. I'll only add one more comment/question before crawling back under my rock. If you have a failover capability, why would you turn it "off" during a particular phase of the program? Imagine you are a user and your job has sat in the queue for an entire day. Just as it finally

Re: [OMPI devel] BTL add procs errors

2010-06-04 Thread Jeff Squyres
A clarification -- this specific issue is during add_procs(), which, for jobs that do not use the MPI-2 dynamics, is during MPI_INIT. The #3 error detection/abort is not during the dynamic/lazy MPI peer connection wireup. The potential for long timeout delays mentioned in #1 would be during

Re: [OMPI devel] BTL add procs errors

2010-06-04 Thread Ralph Castain
I think Rolf's reply makes a possibly bad assumption - i.e., that this problem is occurring just as the job is starting to run. Let me give you a real-life example where this wasn't true, and where aborting the job would make a very unhappy user: We start a long-running job (i.e., days) on a

Re: [OMPI devel] BTL add procs errors

2010-06-04 Thread Rolf vandeVaart
On 06/04/10 11:47, Jeff Squyres wrote: On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote: We did assume that at least the errors are symmetric, i.e. if A fails to connect to B then B will fail when trying to connect to A. However, if there are other BTL the connection is supposed

Re: [OMPI devel] BTL add procs errors

2010-06-04 Thread Jeff Squyres
On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote: > > We did assume that at least the errors are symmetric, i.e. if A fails to > > connect to B then B will fail when trying to connect to A. However, if > > there are other BTL the connection is supposed to smoothly move over some > >

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread George Bosilca
On Jun 2, 2010, at 12:18 , Jeff Squyres wrote: > On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote: > >>> Ah, this is the key. If I have one process (out of many) fail the >>> create_cq() function, I get a segv during finalize. I'll dig. >> >> Is there an assumption that if process A claims

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote: > > Ah, this is the key. If I have one process (out of many) fail the > > create_cq() function, I get a segv during finalize. I'll dig. > > Is there an assumption that if process A claims to be able to communicate > with process B that

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Ashley Pittman
On 2 Jun 2010, at 16:49, Jeff Squyres wrote: > On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote: > >> But it made me progress on why I'm crashing : in my case, only a subset of >> processes have their create_cq fail. > > Ah, this is the key. If I have one process (out of many) fail the >

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote: > But it made me progress on why I'm crashing : in my case, only a subset of > processes have their create_cq fail. Ah, this is the key. If I have one process (out of many) fail the create_cq() function, I get a segv during finalize. I'll

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
On Wed, 2 Jun 2010, Jeff Squyres wrote: Don't you mean return NULL? This function is supposed to return a (struct ibv_cq *). Oops. My bad. Yes, it should return NULL. And it seems that if I make ibv_create_cq always return NULL, the scenario described by George works smoothly : returned

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 5:08 AM, Sylvain Jeaugey wrote: > It must be because create_cq actually creates cqs. Try to apply this > patch which makes create_cq_compat() *not* create the cqs and return an > error instead: > > diff

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread George Bosilca
I don't have any IB nodes, but I'm interested to see how this happens. What I would like to understand here is how do we get back in the OpenIB code if the add_procs failed for the BTL ... george. On Jun 2, 2010, at 05:08 , Sylvain Jeaugey wrote: > On Tue, 1 Jun 2010, Jeff Squyres wrote: >

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
On Tue, 1 Jun 2010, Jeff Squyres wrote: On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote: In my case, the error happens in: mca_btl_openib_add_procs() -> mca_btl_openib_size_queues() -> adjust_cq() -> ibv_create_cq_compat() -> ibv_create_cq(). Can you nail this down

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
Couldn't explain it better. Thanks Jeff for the summary ! On Tue, 1 Jun 2010, Jeff Squyres wrote: On May 31, 2010, at 10:27 AM, Ralph Castain wrote: Just curious - your proposed fix sounds exactly like what was done in the OPAL SOS work. Are you therefore proposing to use SOS to provide a

Re: [OMPI devel] BTL add procs errors

2010-06-01 Thread Jeff Squyres
On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote: > In my case, the error happens in: mca_btl_openib_add_procs() -> mca_btl_openib_size_queues() -> adjust_cq() -> ibv_create_cq_compat() -> ibv_create_cq(). Can you nail this down any further? If I modify

Re: [OMPI devel] BTL add procs errors

2010-06-01 Thread Jeff Squyres
On May 31, 2010, at 10:27 AM, Ralph Castain wrote: > Just curious - your proposed fix sounds exactly like what was done in the > OPAL SOS work. Are you therefore proposing to use SOS to provide a more > informational status return? No, I think Sylvain's talking about slightly modifying the

Re: [OMPI devel] BTL add procs errors

2010-05-31 Thread Ralph Castain
Just curious - your proposed fix sounds exactly like what was done in the OPAL SOS work. Are you therefore proposing to use SOS to provide a more informational status return? If so, then it would seem the only real dispute here is: is there -any- condition whereby a given BTL should have the

Re: [OMPI devel] BTL add procs errors

2010-05-31 Thread Sylvain Jeaugey
In my case, the error happens in: mca_btl_openib_add_procs() -> mca_btl_openib_size_queues() -> adjust_cq() -> ibv_create_cq_compat() -> ibv_create_cq(). ibv_create_cq() returns an error which propagates up to mca_btl_openib_add_procs(). As George mentioned, the openib btl

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Jeff Squyres
To that point, where exactly in the openib BTL init / query sequence is it returning an error for you, Sylvain? Is it just a matter of tidying something up properly before returning the error? On May 28, 2010, at 2:21 PM, George Bosilca wrote: > On May 28, 2010, at 10:03 , Sylvain Jeaugey

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread George Bosilca
On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote: > On Fri, 28 May 2010, Jeff Squyres wrote: > >> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote: >> >>> Understood, and I agreed that the bug should be fixed. Patches would be >>> welcome. :-) > I sent a patch on the bml layer in my first

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: On May 28, 2010, at 9:32 AM, Jeff Squyres wrote: Understood, and I agreed that the bug should be fixed. Patches would be welcome. :-) I sent a patch on the bml layer in my first e-mail. We will apply it on our tree, but as always we're trying to

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Jeff Squyres
On May 28, 2010, at 7:19 AM, Sylvain Jeaugey wrote: > So please, fix the bug first, then if you want that "automatic failover to > TCP" feature, develop it. Put a parameter for an eventual notification, or > abort, or whatever you want. But it doesn't exist today. It just doesn't > work, with any

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: Herein lies the quandary: we don't/can't know the user or sysadmin intent. They may not care if the IB is borked -- they might just want the job to fall over to TCP and continue. But they may care a lot if IB is borked -- they might want the job to

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Jeff Squyres
On May 28, 2010, at 6:04 AM, Sylvain Jeaugey wrote: > Having errors on add_procs stop the application seems a good thing in all > cases, so why not do it? That would solve my original problem which led > to this discussion. > > Yes, the openib BTL may be suboptimal (stopping the application

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Thu, 27 May 2010, Jeff Squyres wrote: On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote: That's pretty much my first proposition : abort when an error arises, because if we don't, we'll crash soon afterwards. That's my original concern and this should really be fixed. Now, if you want

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Jeff Squyres
On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote: > That's pretty much my first proposition : abort when an error arises, > because if we don't, we'll crash soon afterwards. That's my original > concern and this should really be fixed. > > Now, if you want to fix the openib BTL so that an

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
27, 2010 1:47 AM To: Open MPI Developers Subject: Re: [OMPI devel] BTL add procs errors I don't think what the openib BTL is doing is that bad. It is returning an error because something really went bad in IB. So yes, it could blank the bitmask and return success, but would you really want

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Barrett, Brian W
ware Group Sandia National Laboratories From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of Sylvain Jeaugey [sylvain.jeau...@bull.net] Sent: Thursday, May 27, 2010 1:47 AM To: Open MPI Developers Subject: Re: [OMPI devel] BTL add procs

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Ralph Castain
On May 27, 2010, at 1:47 AM, Sylvain Jeaugey wrote: > I don't think what the openib BTL is doing is that bad. It is returning an > error because something really went bad in IB. So yes, it could blank the > bitmask and return success, but would you really want IB to fail and fallback > on TCP

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
I don't think what the openib BTL is doing is that bad. It is returning an error because something really went bad in IB. So yes, it could blank the bitmask and return success, but would you really want IB to fail and fall back on TCP once in a while without any notice? I wouldn't. So, as it

Re: [OMPI devel] BTL add procs errors

2010-05-26 Thread Barrett, Brian W
George - I'm not sure I agree - the return code should indicate a failure beyond "something prohibited me from talking to the remote side" - something occurred that resulted in it being highly unlikely the app can successfully run to completion (such as malloc failing). On the other hand, I

Re: [OMPI devel] BTL add procs errors

2010-05-25 Thread George Bosilca
The BTLs are allowed to fail adding procs without major consequences in the short term. As you noticed each BTL returns a bit mask array containing all procs reachable through this particular instance of the BTL. Later (in the same file line 395) we check for the complete coverage for all

[OMPI devel] BTL add procs errors

2010-05-25 Thread Sylvain Jeaugey
Hi, I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error during the "add procs" phase. The current bml/r2 code silently ignores btl->add_procs() error codes with the following comment : ompi/mca/bml/r2/bml_r2.c:208 /* This BTL has troubles