Good point. I'll only add one more comment/question before crawling back under
my rock.
If you have a failover capability, why would you turn it "off" during a
particular phase of the program?
Imagine you are a user and your job has sat in the queue for an entire day.
Just as it finally
A clarification -- this specific issue is during add_procs(), which, for jobs
that do not use the MPI-2 dynamics, is during MPI_INIT. The #3 error
detection/abort is not during the dynamic/lazy MPI peer connection wireup.
The potential for long timeout delays mentioned in #1 would be during
I think Rolf's reply makes a possibly bad assumption - i.e., that this problem
is occurring just as the job is starting to run. Let me give you a real-life
example where this wasn't true, and where aborting the job would make a very
unhappy user:
We start a long-running job (i.e., days) on a
On 06/04/10 11:47, Jeff Squyres wrote:
On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote:
We did assume that at least the errors are symmetric, i.e. if A fails to
connect to B then B will fail when trying to connect to A. However, if there
are other BTL the connection is supposed
On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote:
> > We did assume that at least the errors are symmetric, i.e. if A fails to
> > connect to B then B will fail when trying to connect to A. However, if
> > there are other BTL the connection is supposed to smoothly move over some
> >
On Jun 2, 2010, at 12:18 , Jeff Squyres wrote:
> On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
>
>>> Ah, this is the key. If I have one process (out of many) fail the
>>> create_cq() function, I get a segv during finalize. I'll dig.
>>
>> Is there an assumption that if process A claims
On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
> > Ah, this is the key. If I have one process (out of many) fail the
> > create_cq() function, I get a segv during finalize. I'll dig.
>
> Is there an assumption that if process A claims to be able to communicate
> with process B that
On 2 Jun 2010, at 16:49, Jeff Squyres wrote:
> On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:
>
>> But it made me progress on why I'm crashing : in my case, only a subset of
>> processes have their create_cq fail.
>
> Ah, this is the key. If I have one process (out of many) fail the
>
On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:
> But it made me progress on why I'm crashing : in my case, only a subset of
> processes have their create_cq fail.
Ah, this is the key. If I have one process (out of many) fail the create_cq()
function, I get a segv during finalize. I'll
On Wed, 2 Jun 2010, Jeff Squyres wrote:
Don't you mean return NULL? This function is supposed to return a (struct
ibv_cq *).
Oops. My bad. Yes, it should return NULL. And it seems that if I make
ibv_create_cq always return NULL, the scenario described by George works
smoothly : returned
On Jun 2, 2010, at 5:08 AM, Sylvain Jeaugey wrote:
> It must be because create_cq actually creates cqs. Try to apply this
> patch, which makes create_cq_compat() *not* create the cqs and return an
> error instead:
>
> diff
I don't have any IB nodes, but I'm interested to see how this happens. What I
would like to understand here is how we get back into the OpenIB code if
add_procs failed for the BTL ...
george.
On Jun 2, 2010, at 05:08 , Sylvain Jeaugey wrote:
> On Tue, 1 Jun 2010, Jeff Squyres wrote:
>
On Tue, 1 Jun 2010, Jeff Squyres wrote:
On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:
In my case, the error happens in :
mca_btl_openib_add_procs()
mca_btl_openib_size_queues()
adjust_cq()
ibv_create_cq_compat()
ibv_create_cq()
Can you nail this down
Couldn't explain it better. Thanks Jeff for the summary !
On Tue, 1 Jun 2010, Jeff Squyres wrote:
On May 31, 2010, at 10:27 AM, Ralph Castain wrote:
Just curious - your proposed fix sounds exactly like what was done in
the OPAL SOS work. Are you therefore proposing to use SOS to provide a
On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:
> In my case, the error happens in :
> mca_btl_openib_add_procs()
> mca_btl_openib_size_queues()
> adjust_cq()
> ibv_create_cq_compat()
> ibv_create_cq()
Can you nail this down any further? If I modify
On May 31, 2010, at 10:27 AM, Ralph Castain wrote:
> Just curious - your proposed fix sounds exactly like what was done in the
> OPAL SOS work. Are you therefore proposing to use SOS to provide a more
> informational status return?
No, I think Sylvain's talking about slightly modifying the
Just curious - your proposed fix sounds exactly like what was done in the OPAL
SOS work. Are you therefore proposing to use SOS to provide a more
informational status return?
If so, then it would seem the only real dispute here is: is there -any-
condition whereby a given BTL should have the
In my case, the error happens in :
mca_btl_openib_add_procs()
mca_btl_openib_size_queues()
adjust_cq()
ibv_create_cq_compat()
ibv_create_cq()
ibv_create_cq() returns an error which propagates up to
mca_btl_openib_add_procs(). As George mentioned, the openib btl
To that point, where exactly in the openib BTL init / query sequence is it
returning an error for you, Sylvain? Is it just a matter of tidying something
up properly before returning the error?
On May 28, 2010, at 2:21 PM, George Bosilca wrote:
> On May 28, 2010, at 10:03 , Sylvain Jeaugey
On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote:
> On Fri, 28 May 2010, Jeff Squyres wrote:
>
>> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:
>>
>>> Understood, and I agreed that the bug should be fixed. Patches would be
>>> welcome. :-)
> I sent a patch on the bml layer in my first
On Fri, 28 May 2010, Jeff Squyres wrote:
On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:
Understood, and I agreed that the bug should be fixed. Patches would
be welcome. :-)
I sent a patch on the bml layer in my first e-mail. We will apply it on
our tree, but as always we're trying to
On May 28, 2010, at 7:19 AM, Sylvain Jeaugey wrote:
> So please, fix the bug first, then if you want that "automatic failover to
> TCP" feature, develop it. Put a parameter for an eventual notification, or
> abort, or whatever you want. But it doesn't exist today. It just doesn't
> work, with any
On Fri, 28 May 2010, Jeff Squyres wrote:
Herein lies the quandary: we don't/can't know the user or sysadmin
intent. They may not care if the IB is borked -- they might just want
the job to fall over to TCP and continue. But they may care a lot if IB
is borked -- they might want the job to
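The quandary above (fall over to TCP and continue, or stop the job) could in principle be left to a user-selectable policy. The following is only a sketch under that assumption: the policy enum, the action enum, and `handle_add_procs_error()` are hypothetical names invented for illustration and are not part of Open MPI.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical policy a user or sysadmin might select (not an Open MPI parameter). */
typedef enum { ON_ERROR_IGNORE, ON_ERROR_WARN, ON_ERROR_ABORT } add_procs_policy_t;

/* What the caller should do after a BTL's add_procs step has returned. */
typedef enum { ACTION_CONTINUE, ACTION_ABORT } add_procs_action_t;

/* Decide how to react to the return code `rc` of one BTL's add_procs step.
 * rc == 0 means success; any other value is treated as an error. */
add_procs_action_t handle_add_procs_error(add_procs_policy_t policy,
                                          const char *btl_name, int rc)
{
    assert(btl_name != NULL);
    if (rc == 0) {
        return ACTION_CONTINUE;   /* nothing went wrong */
    }
    switch (policy) {
    case ON_ERROR_ABORT:
        fprintf(stderr, "BTL %s failed in add_procs (rc=%d): aborting\n",
                btl_name, rc);
        return ACTION_ABORT;
    case ON_ERROR_WARN:
        /* Keep going on the remaining BTLs, but never silently. */
        fprintf(stderr, "BTL %s failed in add_procs (rc=%d): continuing without it\n",
                btl_name, rc);
        return ACTION_CONTINUE;
    case ON_ERROR_IGNORE:
    default:
        return ACTION_CONTINUE;   /* silent fallback */
    }
}
```

With such a switch, a site that cares about broken IB would pick the abort policy, while a site that only wants jobs to finish could pick the warn or ignore policy.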
On May 28, 2010, at 6:04 AM, Sylvain Jeaugey wrote:
> Having errors on add_procs stop the application seems a good thing in all
> cases, so why not do it? That would solve my original problem, which led
> to this discussion.
>
> Yes, the openib BTL may be suboptimal (stopping the application
On Thu, 27 May 2010, Jeff Squyres wrote:
On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:
That's pretty much my first proposition : abort when an error arises,
because if we don't, we'll crash soon afterwards. That's my original
concern and this should really be fixed.
Now, if you want
On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:
> That's pretty much my first proposition : abort when an error arises,
> because if we don't, we'll crash soon afterwards. That's my original
> concern and this should really be fixed.
>
> Now, if you want to fix the openib BTL so that an
27, 2010 1:47 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] BTL add procs errors
I don't think what the openib BTL is doing is that bad. It is returning an
error because something really went bad in IB. So yes, it could blank the
bitmask and return success, but would you really want
ware Group
Sandia National Laboratories
From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of
Sylvain Jeaugey [sylvain.jeau...@bull.net]
Sent: Thursday, May 27, 2010 1:47 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] BTL add procs
On May 27, 2010, at 1:47 AM, Sylvain Jeaugey wrote:
> I don't think what the openib BTL is doing is that bad. It is returning an
> error because something really went bad in IB. So yes, it could blank the
> bitmask and return success, but would you really want IB to fail and fallback
> on TCP
I don't think what the openib BTL is doing is that bad. It is returning an
error because something really went bad in IB. So yes, it could blank the
bitmask and return success, but would you really want IB to fail and
fallback on TCP once in a while without any notice ? I wouldn't.
So, as it
George -
I'm not sure I agree - the return code should indicate a failure beyond
"something prohibited me from talking to the remote side" - something occurred
that resulted in it being highly unlikely the app can successfully run to
completion (such as malloc failing). On the other hand, I
The BTLs are allowed to fail adding procs without major consequences in the
short term. As you noticed each BTL returns a bit mask array containing all
procs reachable through this particular instance of the BTL. Later (in the same
file line 395) we check for the complete coverage for all
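A minimal sketch of the coverage idea described in this message, not the actual bml/r2 code: each BTL reports a bitmask of the procs it can reach, a BTL that returned an error is simply skipped, and the job only has a problem if some proc ends up reachable through no BTL at all. The type `btl_result_t`, the function `count_unreachable()`, and the 64-proc limit are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for what one BTL reports from its add_procs step. */
typedef struct {
    int      rc;         /* 0 on success, nonzero on error */
    uint64_t reachable;  /* bit i set => proc i is reachable through this BTL */
} btl_result_t;

/* Return the number of procs (out of nprocs, at most 64 in this sketch)
 * that no BTL can reach. A BTL that returned an error is skipped, i.e. it
 * is treated as reaching nobody, whatever its bitmask says. */
size_t count_unreachable(const btl_result_t *results, size_t nbtls, size_t nprocs)
{
    assert(nprocs <= 64);

    uint64_t covered = 0;
    for (size_t b = 0; b < nbtls; b++) {
        if (results[b].rc != 0) {
            continue;                    /* failed BTL contributes nothing */
        }
        covered |= results[b].reachable; /* union of all reachable procs */
    }

    size_t missing = 0;
    for (size_t p = 0; p < nprocs; p++) {
        if (!(covered & (UINT64_C(1) << p))) {
            missing++;                   /* no BTL covers proc p */
        }
    }
    return missing;
}
```

A nonzero result corresponds to the case where the remaining BTLs cannot take over for the failed one; a zero result corresponds to the silent fallback that the rest of the thread argues about.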
Hi,
I'm currently trying to have Open MPI exit more gracefully when a BTL
returns an error during the "add procs" phase.
The current bml/r2 code silently ignores btl->add_procs() error codes with
the following comment :
ompi/mca/bml/r2/bml_r2.c:208
/* This BTL has troubles