Re: [OMPI devel] Device failover on ob1

2009-08-06 Thread Jeff Squyres

Is it time to "svn rm ompi/mca/pml/dr"?


On Aug 4, 2009, at 6:50 AM, Ralph Castain wrote:


[...]

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Device failover on ob1

2009-08-04 Thread Graham, Richard L.
From my perspective, the assumption that the low-level is reliable is completely
consistent with the assumptions that went into the ob1 design, so I don't see
the changes you may propose as a problem in principle.

Thanks a lot for the clarification,
Rich


On 8/3/09 9:39 AM, "Mouhamed Gueye"  wrote:

[...]

Re: [OMPI devel] Device failover on ob1

2009-08-04 Thread Ralph Castain

Rolf/Mouhamed

Could you get together off-list to discuss the different approaches  
and see if/where there is common ground. It would be nice to see an  
integrated solution - personally, I would rather not see two  
orthogonal approaches unless they can be cleanly separated. Much  
better if they could support each other in an intelligent fashion.


On Aug 3, 2009, at 9:49 AM, Pavel Shamis (Pasha) wrote:




[...]




Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Pavel Shamis (Pasha)



> I have not, but there should be no difference.  The failover code only
> gets triggered when an error happens.  Otherwise, there are no
> differences in the code paths while everything is functioning normally.
Sounds good. I still did not have time to review the code. I will try to
do it during this week.


Pasha


[...]

Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Rolf Vandevaart
I have not, but there should be no difference.  The failover code only 
gets triggered when an error happens.  Otherwise, there are no 
differences in the code paths while everything is functioning normally.


Rolf

On 08/03/09 11:14, Pavel Shamis (Pasha) wrote:

> Rolf,
> Did you compare latency/bw for failover-enabled code VS trunk ?
>
> Pasha.

Rolf Vandevaart wrote:

[...]

--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Brian W. Barrett

On Sun, 2 Aug 2009, Ralph Castain wrote:

[...]


I think Ralph raises a good point - we need to think about how to allow 
better use of OB1's code base between consumers like checksum and 
failover.  The current situation is problematic to me, for the reasons 
Ralph cited.  However, since the ob1 structures and code have little use 
for PMLs such as CM, I'd rather not push the code into the base - in the 
end, it's very specific to a particular PML implementation and the code 
pushed into the base already made things much more interesting in 
implementing CM than I would have liked.  DR is different in this 
conversation, as it was almost entirely a separate implementation from ob1
by the end, due to the removal of many features and the addition of many 
others.


However, I think there's middle ground here which could greatly improve 
the current situation.  With the proper refactoring, there's no technical 
reason why we couldn't move the checksum functionality into ob1 and add 
the failover to ob1, with no impact on performance when the functionality 
isn't used and little impact on code readability.


So, in summary, refactor OB1 to support checksum / failover good, pushing 
ob1 code into base bad.


Brian


Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Pavel Shamis (Pasha)

Rolf,
Did you compare latency/bw for failover-enabled code VS trunk ?

Pasha.

Rolf Vandevaart wrote:

[...]


Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Rolf Vandevaart

Hi folks:

As some of you know, I have also been looking into implementing
failover. I took a different approach, as I am solving the problem
within the openib BTL itself. This of course means that it only works
for failing over from one openib BTL to another, but that was our area
of interest. It also means that we do not need to keep track of
fragments, as we get them back from the completion queue upon failure.
We then extract the relevant information and repost on the other working
endpoint.
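The mechanism Rolf describes can be caricatured in C as follows. This is a toy
sketch with invented types and names (frag_t, ib_endpoint_t, fail_over), not the
actual branch, which deals with real InfiniBand work completions and openib
endpoints: on an error completion the fragments come back from the failed
endpoint and are simply reposted on the surviving one.

```c
#include <assert.h>

/* Toy model of btl-level failover (invented names): fragments posted
 * on an endpoint are returned by the completion queue on error and
 * reposted on the alternate endpoint. */
typedef struct { int id; } frag_t;

typedef struct {
    int alive;
    frag_t *posted[16];   /* fragments currently posted */
    int nposted;
} ib_endpoint_t;

static int post_frag(ib_endpoint_t *ep, frag_t *f)
{
    if (!ep->alive || ep->nposted >= 16)
        return -1;
    ep->posted[ep->nposted++] = f;
    return 0;
}

/* Error completion on `bad`: mark it dead and repost every returned
 * fragment on the surviving endpoint `alt`. */
static int fail_over(ib_endpoint_t *bad, ib_endpoint_t *alt)
{
    int moved = 0;
    bad->alive = 0;
    for (int i = 0; i < bad->nposted; i++)
        if (post_frag(alt, bad->posted[i]) == 0)
            moved++;
    bad->nposted = 0;
    return moved;
}
```

The point of the design is that no pml-level bookkeeping is needed: the
completion queue itself hands back everything that was in flight on the failed
port.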


My work has been progressing at http://bitbucket.org/rolfv/ompi-failover.

This currently works only for send semantics, so you have to run with
-mca btl_openib_flags 1.


Rolf

On 07/31/09 05:49, Mouhamed Gueye wrote:

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than dr 
and we now have a working prototype on top of ob1. The approach is to 
store btl descriptors sent to peers and delete them when we receive 
proof of delivery. So far, we rely on completion callback functions, 
assuming that the message is delivered when the completion function is
called, which is the case for openib. When a btl module fails, it is
removed from the endpoint's btl list and the next one is used to
retransmit stored descriptors. No extra message is transmitted; the
mechanism only adds fields to the header. It has been mainly tested with
two IB modules, in both multi-rail (two separate networks) and
multi-path (a single large network).


You can grab and test the patch here (applies on top of the trunk) :
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just define --enable-device-failover 
at configure. You can then run a benchmark, disconnect a port and see 
the failover operate.


A little latency increase (~ 2%) is induced by the failover layer when 
no failover occurs. To accelerate the failover process on openib, you 
can try to lower the btl_openib_ib_timeout openib parameter to 15 for 
example instead of 20 (default value).


Mouhamed



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Mouhamed Gueye

Hi list,

I'll try to answer to the main concerns so far.

We chose to work on ob1 for two main reasons:
- we focused first on fixing dr, but were quite disappointed by its
performance in comparison with ob1. So we reoriented our work toward ob1
to provide failover while keeping good performance.
- secondly, we wanted to avoid forking ob1 as much as possible, to stay
up-to-date with the code base. Plus, the failover layer is so thin (in
comparison with the code base) that it would not make sense to fork the
base into a new pml.


But we were aware that ob1 wouldn't accept any change with a non-zero
impact, and that is why the added code is configured out by default.
Actually, we wanted to address long jobs that can afford a very small
performance loss but cannot tolerate aborting after several hours or
days of computation because of one port failure. The goal of this
prototype is to provide a proof of concept for discussion, as we know
there are other people working on this subject.


As stated in the previous mail, the idea is to store any sent btl
descriptor until it is marked as delivered. For that, we rely on
completion callbacks, and the assumption, clearly, is that a completion
function being called means the message was delivered to the remote
card. The underlying btl is the one that ensures message delivery. This
is currently the case for the openib btl, but any other btl may be able
to do so. With that assumption, we do not need any pml-level
acknowledgment protocol (no extra messages).
No timer is needed for retransmission, since retransmission is triggered
by btl failure. Today, only the error-callback scenario is implemented;
we should also handle btl send method return codes. To deal with message
duplication, the protocol maintains a message id that lets us track
received messages (hence the larger header), so any duplicated message
is not processed twice.
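As a rough illustration of the scheme just described (store each descriptor
until its completion callback fires, collect survivors for replay on the next
btl, filter duplicates by message id), here is a toy C sketch. All names and
data structures (pending_frag_t, endpoint_t, dedup_t) are invented for this
example; this is not the actual patch, which operates on real btl descriptors
and Open MPI's internal lists.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the pml-level bookkeeping (invented names). */
typedef struct pending_frag {
    struct pending_frag *next;
    unsigned long seq;        /* message id carried in the header */
    const void *desc;         /* stored descriptor */
} pending_frag_t;

typedef struct {
    pending_frag_t *head;     /* sent but not yet completed */
    unsigned long next_seq;
} endpoint_t;

/* Descriptor handed to the btl: keep a reference until completion. */
static unsigned long store_desc(endpoint_t *ep, const void *desc)
{
    pending_frag_t *f = malloc(sizeof(*f));
    f->seq = ep->next_seq++;
    f->desc = desc;
    f->next = ep->head;
    ep->head = f;
    return f->seq;
}

/* Completion callback fired: the message reached the remote card, so
 * the stored copy can be released. */
static void complete_desc(endpoint_t *ep, unsigned long seq)
{
    for (pending_frag_t **p = &ep->head; *p != NULL; p = &(*p)->next) {
        if ((*p)->seq == seq) {
            pending_frag_t *dead = *p;
            *p = dead->next;
            free(dead);
            return;
        }
    }
}

/* btl failed: collect every still-pending descriptor so it can be
 * reposted on the next btl in the endpoint's list. */
static int collect_pending(endpoint_t *ep, const void *out[], int max)
{
    int n = 0;
    for (pending_frag_t *f = ep->head; f != NULL && n < max; f = f->next)
        out[n++] = f->desc;
    return n;
}

/* Receiver side: drop retransmitted duplicates by message id. */
typedef struct { unsigned char seen[1024]; } dedup_t;

static int accept_frag(dedup_t *d, unsigned long seq)
{
    if (seq >= sizeof(d->seen) || d->seen[seq])
        return 0;             /* duplicate (or out of toy range): drop */
    d->seen[seq] = 1;
    return 1;
}
```

The no-failure fast path only pays for store_desc/complete_desc, which matches
the small (~2%) latency cost reported above.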


Concerning the openib btl, on a multi-port system, the connection scheme
is supposed to be (host 1-port 0) <==> (host 2-port 0) and (host 1-port
1) <==> (host 2-port 1), for example. This is set up at btl endpoint
initialization, but when establishing the connection at the first send
attempt, the port association information is not processed. This results
in a crossed connection scheme ((host 1-port 0) <==> (host 2-port 1) and
(host 1-port 1) <==> (host 2-port 0)). So, instead of having two
separate rings or paths, we have one big ring that does not allow
failover. We had to fix this to enable failover in both multi-path (same
network) and multi-rail (two separate networks) with openib.
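The fix amounts to pairing ports by index at connection time rather than
accepting whichever connection comes up first. Schematically (the helper and
its name are invented here purely for illustration):

```c
#include <assert.h>

/* Pick the remote port whose index matches the local one, so that
 * (host 1, port k) connects to (host 2, port k); fall back to the
 * first remote port when no match exists.  Invented helper for
 * illustration only. */
static int pick_remote_port(int local_port, const int *remote_ports, int n)
{
    for (int i = 0; i < n; i++)
        if (remote_ports[i] == local_port)
            return i;
    return n > 0 ? 0 : -1;
}
```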


Brian, so far, we are only able to switch from a failing btl to a safe
one. When no btl is left, we abort the job. The next step is to
re-establish the connection when the network comes back.


Mouhamed
Graham, Richard L. a écrit :

What is the impact on sm, which is by far the most sensitive to latency. This 
really belongs in a place other than ob1.  Ob1 is supposed to provide the 
lowest latency possible, and other pml's are supposed to be used for heavier 
weight protocols.

On the technical side, how do you distinguish between a lost acknowledgement and
an undelivered message ?  You really don't want to try and deliver data into
user space twice, as once a receive is complete, who knows what the user has
done with that buffer ?  A general treatment needs to be able to handle false
negatives, and attempts to deliver the data more than once.

How are you detecting missing acknowledgements ?  Are you using some sort of 
timer ?

Rich

On 7/31/09 5:49 AM, "Mouhamed Gueye"  wrote:

[...]

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Ralph Castain
Okay - here's a thought. Why not do what the original message asked?
Check out their changes and look at what they did.


Then we can have the discussion about how intrusive it is. Otherwise,  
all we're doing is debating what they -might- have done, or what  
someone thinks they -should- have done, etc.


Look at it first, and see how big or small a change is involved.  
That's all they asked us to do - certainly seemed a reasonable request.


Just my $0.002


On Aug 2, 2009, at 4:49 PM, Graham, Richard L. wrote:

The point here is very different, and is not being made because of
objections to fail-over support. Previous work took precisely this sort
of approach, and in that particular case the desire to support
reliability, while being able to compile this support out, still had a
negative performance impact.

This is why I am asking about precisely what assumptions are being made.
If the assumption is that ompi can handle the failover with local
information only, the impact on ompi is minimal, and the likelihood of
needing to make undesirable changes to ob1 is small. If ompi needs to
deal with remote delivery - e.g. a send completed locally, but an ack
did not arrive: is this because the remote side sent it and the
connection failure kept it from arriving, or is it because the remote
side did not send it at all, or maybe never even got the data in the
first place? - the logic becomes more complex, and one may end up
wanting to change the way ob1 handles data to accommodate this. Said
another way, there may not be as much commonality as was assumed.

Rich


On 8/2/09 6:19 PM, "Ralph Castain"  wrote:

The objections being cited are somewhat unfair - perhaps people do not
understand the proposal being made? The developers have gone out of
their way to ensure that all changes are configured out unless you
specifically select to use that functionality. This has been our
policy from day one - as long as the changes have zero impact unless
the user specifically requests that it be used, then no harm is done.

So I personally don't see any objection to bringing it into the code
base. Latency is not impacted one bit -unless- someone deliberately
configures the code to use this feature. In that case, they are
deliberately accepting any impact in order to gain the benefits.

Perhaps a bigger question needs to be addressed - namely, does the ob1
code need to be refactored?

Having been involved a little in the early discussion with bull when
we debated over where to put this, I know the primary concern was that
the code not suffer the same fate as the dr module. We have since run
into a similar issue with the checksum module, so I know where they
are coming from.

The problem is that the code base is adjusted to support changes in
ob1, which is still being debugged. On the order of 95% of the code in
ob1 is required to be common across all the pml modules, so the rest
of us have to (a) watch carefully all the commits to see if someone
touches ob1, and then (b) manually mirror the change in our modules.

This is not a supportable model over the long-term, which is why dr
has died, and checksum is considering integrating into ob1 using
configure #if's to avoid impacting non-checksum users. Likewise,
device failover has been treated similarly here - i.e., configure out
the added code unless someone wants it.

This -does- lead to messier source code with these #if's in it. If we
can refactor the ob1 code so the common functionality resides in the
base, then perhaps we can avoid this problem.

Is it possible?
Ralph

On Aug 2, 2009, at 3:25 PM, Graham, Richard L. wrote:





On 8/2/09 12:55 AM, "Brian Barrett"  wrote:

While I agree that performance impact (latency in this case) is
important, I disagree that this necessarily belongs somewhere other
than ob1.  For example, a zero-performance-impact solution would be to
provide two versions of all the interface functions, one with failover
turned on and one with it turned off, and select the appropriate
functions at initialization time.  There are others, including careful
placement of decision logic, which are likely to result in near-zero
impact.  I'm not attempting to prescribe a solution, but refuting the
claim that this can't be in ob1 - I think more data is needed before
such a claim is made.


Just another way to handle setting the function pointers.


Mouhamed - can the openib btl try to re-establish a connection between
two peers today (with your ob1 patches, obviously)?  Would this allow
us to adapt to changing routes due to switch failures (assuming that
there are other physical routes around the failed switch, of course)?


The big question is what are the assumptions that are being made
for this mode of failure recovery.  If the assumption is that local
completion implies remote delivery, the problem is simple to solve.
If not, heavier-weight protocols need to be used to cover the range
of ways failure may manifest itself.


Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Graham, Richard L.
The point here is very different, and is not being made because of
objections to fail-over support.  Previous work took precisely this
sort of approach, and in that particular case the desire to support
reliability, but be able to compile out this support, still had a
negative performance impact.

This is why I am asking about precisely what assumptions are being
made.  If the assumption is that ompi can handle the failover with
local information only, the impact on ompi is minimal, and the
likelihood of needing to make undesirable changes to ob1 is small.  If
ompi needs to deal with remote delivery - e.g. a send completed
locally, but an ack did not arrive: is this because the remote side
sent it and the connection failure kept it from arriving, or is it
because the remote side did not send it at all, or maybe did not even
get the data in the first place? - the logic becomes more complex, and
one may end up wanting to change the way ob1 handles data to
accommodate this.  Said another way, there may not be as much
commonality as was assumed.

Rich


On 8/2/09 6:19 PM, "Ralph Castain"  wrote:

The objections being cited are somewhat unfair - perhaps people do not
understand the proposal being made? The developers have gone out of
their way to ensure that all changes are configured out unless you
specifically select to use that functionality. This has been our
policy from day one - as long as the changes have zero impact unless
the user specifically requests that it be used, then no harm is done.

So I personally don't see any objection to bringing it into the code
base. Latency is not impacted one bit -unless- someone deliberately
configures the code to use this feature. In that case, they are
deliberately accepting any impact in order to gain the benefits.

Perhaps a bigger question needs to be addressed - namely, does the ob1
code need to be refactored?

Having been involved a little in the early discussion with bull when
we debated over where to put this, I know the primary concern was that
the code not suffer the same fate as the dr module. We have since run
into a similar issue with the checksum module, so I know where they
are coming from.

The problem is that the code base is adjusted to support changes in
ob1, which is still being debugged. On the order of 95% of the code in
ob1 is required to be common across all the pml modules, so the rest
of us have to (a) watch carefully all the commits to see if someone
touches ob1, and then (b) manually mirror the change in our modules.

This is not a supportable model over the long-term, which is why dr
has died, and checksum is considering integrating into ob1 using
configure #if's to avoid impacting non-checksum users. Likewise,
device failover has been treated similarly here - i.e., configure out
the added code unless someone wants it.

This -does- lead to messier source code with these #if's in it. If we
can refactor the ob1 code so the common functionality resides in the
base, then perhaps we can avoid this problem.

Is it possible?
Ralph

On Aug 2, 2009, at 3:25 PM, Graham, Richard L. wrote:

>
>
>
> On 8/2/09 12:55 AM, "Brian Barrett"  wrote:
>
> While I agree that performance impact (latency in this case) is
> important, I disagree that this necessarily belongs somewhere other
> than ob1.  For example, a zero-performance impact solution would be to
> provide two versions of all the interface functions, one with failover
> turned on and one with it turned off, and select the appropriate
> functions at initialization time.  There are others, including careful
> placement of decision logic, which are likely to result in near-zero
> impact.  I'm not attempting to prescribe a solution, but refuting the
> claim that this can't be in ob1 - I think more data is needed before
> such a claim is made.
>
>>> Just another way to handle setting the function pointers.
>
> Mouhamed - can the openib btl try to re-establish a connection between
> two peers today (with your ob1 patches, obviously)?  Would this allow
> us to adapt to changing routes due to switch failures (assuming that
> there are other physical routes around the failed switch, of course)?
>
>>> The big question is what are the assumptions that are being made
>>> for this mode of failure recovery.  If the assumption is that
>>> local completion
>>> implies remote delivery, the problem is simple to solve.  If not,
>>> heavier
>>> weight protocols need to be used to cover the range of ways failure
>>> may manifest itself.
>
> Rich
>
> Thanks,
>
> Brian
>
> On Aug 1, 2009, at 6:21 PM, Graham, Richard L. wrote:
>
>> What is the impact on sm, which is by far the most sensitive to
>> latency. This really belongs in a place other than ob1.  Ob1 is
>> supposed to provide the lowest latency possible, and other pml's are
>> supposed to be used for heavier weight protocols.
>>
> On the technical side, how do you distinguish between a lost
> acknowledgement and an undelivered message ?  You reall

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Ralph Castain
The objections being cited are somewhat unfair - perhaps people do not  
understand the proposal being made? The developers have gone out of  
their way to ensure that all changes are configured out unless you  
specifically select to use that functionality. This has been our  
policy from day one - as long as the changes have zero impact unless  
the user specifically requests that it be used, then no harm is done.


So I personally don't see any objection to bringing it into the code  
base. Latency is not impacted one bit -unless- someone deliberately  
configures the code to use this feature. In that case, they are  
deliberately accepting any impact in order to gain the benefits.


Perhaps a bigger question needs to be addressed - namely, does the ob1  
code need to be refactored?


Having been involved a little in the early discussion with bull when  
we debated over where to put this, I know the primary concern was that  
the code not suffer the same fate as the dr module. We have since run  
into a similar issue with the checksum module, so I know where they  
are coming from.


The problem is that the code base is adjusted to support changes in  
ob1, which is still being debugged. On the order of 95% of the code in  
ob1 is required to be common across all the pml modules, so the rest  
of us have to (a) watch carefully all the commits to see if someone  
touches ob1, and then (b) manually mirror the change in our modules.


This is not a supportable model over the long-term, which is why dr  
has died, and checksum is considering integrating into ob1 using  
configure #if's to avoid impacting non-checksum users. Likewise,  
device failover has been treated similarly here - i.e., configure out  
the added code unless someone wants it.


This -does- lead to messier source code with these #if's in it. If we  
can refactor the ob1 code so the common functionality resides in the  
base, then perhaps we can avoid this problem.


Is it possible?
Ralph

On Aug 2, 2009, at 3:25 PM, Graham, Richard L. wrote:





On 8/2/09 12:55 AM, "Brian Barrett"  wrote:

While I agree that performance impact (latency in this case) is
important, I disagree that this necessarily belongs somewhere other
than ob1.  For example, a zero-performance impact solution would be to
provide two versions of all the interface functions, one with failover
turned on and one with it turned off, and select the appropriate
functions at initialization time.  There are others, including careful
placement of decision logic, which are likely to result in near-zero
impact.  I'm not attempting to prescribe a solution, but refuting the
claim that this can't be in ob1 - I think more data is needed before
such a claim is made.


Just another way to handle setting the function pointers.


Mouhamed - can the openib btl try to re-establish a connection between
two peers today (with your ob1 patches, obviously)?  Would this allow
us to adapt to changing routes due to switch failures (assuming that
there are other physical routes around the failed switch, of course)?


The big question is what are the assumptions that are being made
for this mode of failure recovery.  If the assumption is that local
completion implies remote delivery, the problem is simple to solve.
If not, heavier-weight protocols need to be used to cover the range
of ways failure may manifest itself.


Rich

Thanks,

Brian

On Aug 1, 2009, at 6:21 PM, Graham, Richard L. wrote:


What is the impact on sm, which is by far the most sensitive to
latency. This really belongs in a place other than ob1.  Ob1 is
supposed to provide the lowest latency possible, and other pml's are
supposed to be used for heavier weight protocols.

On the technical side, how do you distinguish between a lost
acknowledgement and an undelivered message ?  You really don't want
to try and deliver data into user space twice, as once a receive is
complete, who knows what the user has done with that buffer ?  A
general treatment needs to be able to handle false negatives, and
attempts to deliver the data more than once.

How are you detecting missing acknowledgements ?  Are you using some
sort of timer ?

Rich

On 7/31/09 5:49 AM, "Mouhamed Gueye"  wrote:

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than
dr, and we now have a working prototype on top of ob1.  The approach
is to store btl descriptors sent to peers and delete them when we
receive proof of delivery.  So far, we rely on completion callback
functions, assuming that the message is delivered when the completion
function is called, which is the case for openib.  When a btl module
fails, it is removed from the endpoint's btl list and the next one is
used to retransmit stored descriptors.  No extra message is
transmitted; it only adds fields to the header.  It has been mainly
tested with two IB modules, in both multi-rail (two separate networks)
and multi-path (a big

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Graham, Richard L.



On 8/2/09 12:55 AM, "Brian Barrett"  wrote:

While I agree that performance impact (latency in this case) is
important, I disagree that this necessarily belongs somewhere other
than ob1.  For example, a zero-performance impact solution would be to
provide two versions of all the interface functions, one with failover
turned on and one with it turned off, and select the appropriate
functions at initialization time.  There are others, including careful
placement of decision logic, which are likely to result in near-zero
impact.  I'm not attempting to prescribe a solution, but refuting the
claim that this can't be in ob1 - I think more data is needed before
such a claim is made.

>> Just another way to handle setting the function pointers.

Mouhamed - can the openib btl try to re-establish a connection between
two peers today (with your ob1 patches, obviously)?  Would this allow
us to adapt to changing routes due to switch failures (assuming that
there are other physical routes around the failed switch, of course)?

>> The big question is what are the assumptions that are being made
>> for this mode of failure recovery.  If the assumption is that local 
>> completion
>> implies remote delivery, the problem is simple to solve.  If not, heavier
>> weight protocols need to be used to cover the range of ways failure
>> may manifest itself.

Rich

Thanks,

Brian

On Aug 1, 2009, at 6:21 PM, Graham, Richard L. wrote:

> What is the impact on sm, which is by far the most sensitive to
> latency. This really belongs in a place other than ob1.  Ob1 is
> supposed to provide the lowest latency possible, and other pml's are
> supposed to be used for heavier weight protocols.
>
> On the technical side, how do you distinguish between a lost
> acknowledgement and an undelivered message ?  You really don't want
> to try and deliver data into user space twice, as once a receive is
> complete, who knows what the user has done with that buffer ?  A
> general treatment needs to be able to handle false negatives, and
> attempts to deliver the data more than once.
>
> How are you detecting missing acknowledgements ?  Are you using some
> sort of timer ?
>
> Rich
>
> On 7/31/09 5:49 AM, "Mouhamed Gueye"  wrote:
>
> Hi list,
>
> Here is an update on our work concerning device failover.
>
> As many of you suggested, we reoriented our work on ob1 rather than
> dr, and we now have a working prototype on top of ob1. The approach
> is to store btl descriptors sent to peers and delete them when we
> receive proof of delivery. So far, we rely on completion callback
> functions, assuming that the message is delivered when the completion
> function is called, which is the case for openib. When a btl module
> fails, it is removed from the endpoint's btl list and the next one is
> used to retransmit stored descriptors. No extra message is
> transmitted; it only adds fields to the header. It has been mainly
> tested with two IB modules, in both multi-rail (two separate
> networks) and multi-path (one single large network).
>
> You can grab and test the patch here (applies on top of the trunk) :
> http://bitbucket.org/gueyem/ob1-failover/
>
> To compile with failover support, just define --enable-device-failover
> at configure. You can then run a benchmark, disconnect a port and see
> the failover operate.
>
> A little latency increase (~ 2%) is induced by the failover layer when
> no failover occurs. To accelerate the failover process on openib, you
> can try to lower the btl_openib_ib_timeout openib parameter to 15 for
> example instead of 20 (default value).
>
> Mouhamed
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

--
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Brian Barrett
While I agree that performance impact (latency in this case) is  
important, I disagree that this necessarily belongs somewhere other  
than ob1.  For example, a zero-performance impact solution would be to  
provide two versions of all the interface functions, one with failover  
turned on and one with it turned off, and select the appropriate  
functions at initialization time.  There are others, including careful  
placement of decision logic, which are likely to result in near-zero  
impact.  I'm not attempting to prescribe a solution, but refuting the  
claim that this can't be in ob1 - I think more data is needed before  
such a claim is made.


Mouhamed - can the openib btl try to re-establish a connection between  
two peers today (with your ob1 patches, obviously)?  Would this allow  
us to adapt to changing routes due to switch failures (assuming that  
there are other physical routes around the failed switch, of course)?


Thanks,

Brian

On Aug 1, 2009, at 6:21 PM, Graham, Richard L. wrote:

What is the impact on sm, which is by far the most sensitive to  
latency. This really belongs in a place other than ob1.  Ob1 is  
supposed to provide the lowest latency possible, and other pml's are  
supposed to be used for heavier weight protocols.


On the technical side, how do you distinguish between a lost
acknowledgement and an undelivered message ?  You really don't want
to try and deliver data into user space twice, as once a receive is
complete, who knows what the user has done with that buffer ?  A
general treatment needs to be able to handle false negatives, and
attempts to deliver the data more than once.


How are you detecting missing acknowledgements ?  Are you using some  
sort of timer ?


Rich

On 7/31/09 5:49 AM, "Mouhamed Gueye"  wrote:

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than
dr, and we now have a working prototype on top of ob1. The approach
is to store btl descriptors sent to peers and delete them when we
receive proof of delivery. So far, we rely on completion callback
functions, assuming that the message is delivered when the completion
function is called, which is the case for openib. When a btl module
fails, it is removed from the endpoint's btl list and the next one is
used to retransmit stored descriptors. No extra message is
transmitted; it only adds fields to the header. It has been mainly
tested with two IB modules, in both multi-rail (two separate networks)
and multi-path (one single large network).

You can grab and test the patch here (applies on top of the trunk) :
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just define --enable-device-failover
at configure. You can then run a benchmark, disconnect a port and see
the failover operate.

A little latency increase (~ 2%) is induced by the failover layer when
no failover occurs. To accelerate the failover process on openib, you
can try to lower the btl_openib_ib_timeout openib parameter to 15 for
example instead of 20 (default value).

Mouhamed
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI devel] Device failover on ob1

2009-08-01 Thread Graham, Richard L.
What is the impact on sm, which is by far the most sensitive to latency. This 
really belongs in a place other than ob1.  Ob1 is supposed to provide the 
lowest latency possible, and other pml's are supposed to be used for heavier 
weight protocols.

On the technical side, how do you distinguish between a lost acknowledgement
and an undelivered message ?  You really don't want to try and deliver data
into user space twice, as once a receive is complete, who knows what the user
has done with that buffer ?  A general treatment needs to be able to handle
false negatives, and attempts to deliver the data more than once.

How are you detecting missing acknowledgements ?  Are you using some sort of 
timer ?

Rich

On 7/31/09 5:49 AM, "Mouhamed Gueye"  wrote:

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than
dr, and we now have a working prototype on top of ob1. The approach
is to store btl descriptors sent to peers and delete them when we
receive proof of delivery. So far, we rely on completion callback
functions, assuming that the message is delivered when the completion
function is called, which is the case for openib. When a btl module
fails, it is removed from the endpoint's btl list and the next one is
used to retransmit stored descriptors. No extra message is
transmitted; it only adds fields to the header. It has been mainly
tested with two IB modules, in both multi-rail (two separate networks)
and multi-path (one single large network).

You can grab and test the patch here (applies on top of the trunk) :
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just define --enable-device-failover
at configure. You can then run a benchmark, disconnect a port and see
the failover operate.

A little latency increase (~ 2%) is induced by the failover layer when
no failover occurs. To accelerate the failover process on openib, you
can try to lower the btl_openib_ib_timeout openib parameter to 15 for
example instead of 20 (default value).

Mouhamed
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] Device failover on ob1

2009-07-31 Thread Mouhamed Gueye

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than
dr, and we now have a working prototype on top of ob1. The approach
is to store btl descriptors sent to peers and delete them when we
receive proof of delivery. So far, we rely on completion callback
functions, assuming that the message is delivered when the completion
function is called, which is the case for openib. When a btl module
fails, it is removed from the endpoint's btl list and the next one is
used to retransmit stored descriptors. No extra message is
transmitted; it only adds fields to the header. It has been mainly
tested with two IB modules, in both multi-rail (two separate networks)
and multi-path (one single large network).


You can grab and test the patch here (applies on top of the trunk) :
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just define --enable-device-failover 
at configure. You can then run a benchmark, disconnect a port and see 
the failover operate.


A little latency increase (~ 2%) is induced by the failover layer when 
no failover occurs. To accelerate the failover process on openib, you 
can try to lower the btl_openib_ib_timeout openib parameter to 15 for 
example instead of 20 (default value).


Mouhamed