Re: [OMPI devel] Device failover on ob1
Is it time to "svn rm ompi/mca/pml/dr"?

On Aug 4, 2009, at 6:50 AM, Ralph Castain wrote:

> Rolf/Mouhamed
>
> Could you get together off-list to discuss the different approaches and see if/where there is common ground? It would be nice to see an integrated solution - personally, I would rather not see two orthogonal approaches unless they can be cleanly separated. Much better if they could support each other in an intelligent fashion.

[...]

--
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Device failover on ob1
From my perspective, the assumption that the low-level transport is reliable is completely consistent with the assumptions that went into the ob1 design, so I don't see the changes you may propose as a problem in principle.

Thanks a lot for the clarification,
Rich

On 8/3/09 9:39 AM, "Mouhamed Gueye" wrote:

[...]
Re: [OMPI devel] Device failover on ob1
Rolf/Mouhamed

Could you get together off-list to discuss the different approaches and see if/where there is common ground? It would be nice to see an integrated solution - personally, I would rather not see two orthogonal approaches unless they can be cleanly separated. Much better if they could support each other in an intelligent fashion.

On Aug 3, 2009, at 9:49 AM, Pavel Shamis (Pasha) wrote:

[...]
Re: [OMPI devel] Device failover on ob1
> I have not, but there should be no difference. The failover code only gets triggered when an error happens. Otherwise, there are no differences in the code paths while everything is functioning normally.

Sounds good. I still did not have time to review the code. I will try to do it during this week.

Pasha

[...]
Re: [OMPI devel] Device failover on ob1
I have not, but there should be no difference. The failover code only gets triggered when an error happens. Otherwise, there are no differences in the code paths while everything is functioning normally.

Rolf

On 08/03/09 11:14, Pavel Shamis (Pasha) wrote:

> Rolf,
> Did you compare latency/bw for failover-enabled code VS trunk ?
>
> Pasha.

[...]

--
rolf.vandeva...@sun.com
781-442-3043
Re: [OMPI devel] Device failover on ob1
On Sun, 2 Aug 2009, Ralph Castain wrote:

> Perhaps a bigger question needs to be addressed - namely, does the ob1 code need to be refactored? Having been involved a little in the early discussion with Bull when we debated over where to put this, I know the primary concern was that the code not suffer the same fate as the dr module. We have since run into a similar issue with the checksum module, so I know where they are coming from.
>
> The problem is that the code base is adjusted to support changes in ob1, which is still being debugged. On the order of 95% of the code in ob1 is required to be common across all the pml modules, so the rest of us have to (a) watch carefully all the commits to see if someone touches ob1, and then (b) manually mirror the change in our modules. This is not a supportable model over the long term, which is why dr has died, and checksum is considering integrating into ob1 using configure #if's to avoid impacting non-checksum users. Likewise, device failover has been treated similarly here - i.e., configure out the added code unless someone wants it. This -does- lead to messier source code with these #if's in it.
>
> If we can refactor the ob1 code so the common functionality resides in the base, then perhaps we can avoid this problem. Is it possible?

I think Ralph raises a good point - we need to think about how to allow better use of ob1's code base between consumers like checksum and failover. The current situation is problematic to me, for the reasons Ralph cited. However, since the ob1 structures and code have little use for PMLs such as CM, I'd rather not push the code into the base - in the end, it's very specific to a particular PML implementation, and the code already pushed into the base made implementing CM much more interesting than I would have liked.

DR is different in this conversation, as it was almost entirely a separate implementation from ob1 by the end, due to the removal of many features and the addition of many others. However, I think there's middle ground here which could greatly improve the current situation. With the proper refactoring, there's no technical reason why we couldn't move the checksum functionality into ob1 and add the failover to ob1, with no impact on performance when the functionality isn't used and little impact on code readability.

So, in summary: refactoring ob1 to support checksum / failover is good; pushing ob1 code into the base is bad.

Brian
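The configure-time guard pattern under discussion (compile the failover bookkeeping out unless the user asked for it) might look like the following sketch. The macro name `OMPI_ENABLE_FAILOVER`, the `struct frag` layout, and `frag_send()` are all illustrative stand-ins, not the actual ob1 code:

```c
#include <stddef.h>

/* Illustrative configure-time switch; in the real tree this would be
 * an AC_DEFINE produced by something like --enable-device-failover. */
#define OMPI_ENABLE_FAILOVER 1

struct frag {
    size_t len;
#if OMPI_ENABLE_FAILOVER
    /* Bookkeeping compiled in only when failover is requested, so the
     * default build pays no extra memory or cache traffic. */
    unsigned long msg_id;
    int retransmit_count;
#endif
};

/* Stand-in for a send entry point: the failover branch vanishes
 * entirely from the default build. */
static int frag_send(struct frag *f)
{
#if OMPI_ENABLE_FAILOVER
    f->retransmit_count = 0;
#endif
    return (int)f->len;  /* pretend the whole fragment was sent */
}
```

The cost Ralph mentions is visible even in this toy: every consumer of `struct frag` has to be audited whenever the guarded fields change, which is exactly the maintenance burden refactoring would avoid.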
Re: [OMPI devel] Device failover on ob1
Rolf,

Did you compare latency/bw for failover-enabled code VS trunk ?

Pasha.

Rolf Vandevaart wrote:

[...]
Re: [OMPI devel] Device failover on ob1
Hi folks:

As some of you know, I have also been looking into implementing failover. I took a different approach, solving the problem within the openib BTL itself. This of course means it only works for failing over from one openib BTL to another, but that was our area of interest. It also means that we do not need to keep track of fragments: we get them back from the completion queue upon failure, extract the relevant information, and repost on the other working endpoint.

My work has been progressing at http://bitbucket.org/rolfv/ompi-failover. This currently works only for send semantics, so you have to run with -mca btl_openib_flags 1.

Rolf

On 07/31/09 05:49, Mouhamed Gueye wrote:

> Hi list,
>
> Here is an update on our work concerning device failover.
>
> As many of you suggested, we reoriented our work on ob1 rather than dr, and we now have a working prototype on top of ob1. The approach is to store btl descriptors sent to peers and delete them when we receive proof of delivery. So far, we rely on completion callback functions, assuming that the message is delivered when the completion function is called, which is the case for openib. When a btl module fails, it is removed from the endpoint's btl list and the next one is used to retransmit the stored descriptors. No extra message is transmitted; the protocol only adds fields to the header. It has been mainly tested with two IB modules, in both multi-rail (two separate networks) and multi-path (one big unique network) configurations.
>
> You can grab and test the patch here (applies on top of the trunk):
> http://bitbucket.org/gueyem/ob1-failover/
>
> To compile with failover support, just define --enable-device-failover at configure. You can then run a benchmark, disconnect a port and see the failover operate.
>
> A little latency increase (~ 2%) is induced by the failover layer when no failover occurs. To accelerate the failover process on openib, you can try to lower the btl_openib_ib_timeout parameter, for example to 15 instead of 20 (the default value).
>
> Mouhamed

--
rolf.vandeva...@sun.com
781-442-3043
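Rolf's recovery path - pull the failed endpoint's fragments back off the completion queue and repost them on the surviving endpoint - could look roughly like this sketch. The `wqe`/`endpoint` types and function names are invented stand-ins for illustration, not the actual openib BTL structures:

```c
#include <string.h>

#define MAX_FRAGS 8

/* Toy stand-in for a posted work-queue entry; in the real BTL these
 * are the fragments flushed back through the completion queue with an
 * error status when a port dies. */
struct wqe {
    int posted;
    char payload[32];
};

struct endpoint {
    struct wqe queue[MAX_FRAGS];
    int nposted;
};

/* Post a fragment on an endpoint (think ibv_post_send()). */
static int post_frag(struct endpoint *ep, const char *payload)
{
    if (ep->nposted >= MAX_FRAGS)
        return -1;
    struct wqe *w = &ep->queue[ep->nposted++];
    w->posted = 1;
    strncpy(w->payload, payload, sizeof w->payload - 1);
    w->payload[sizeof w->payload - 1] = '\0';
    return 0;
}

/* On failure every outstanding WQE comes back with an error; extract
 * each one and repost it on the surviving endpoint, as in Rolf's
 * description.  Returns the number of fragments moved. */
static int failover(struct endpoint *dead, struct endpoint *alive)
{
    int moved = 0;
    for (int i = 0; i < dead->nposted; i++) {
        if (dead->queue[i].posted &&
            post_frag(alive, dead->queue[i].payload) == 0) {
            dead->queue[i].posted = 0;
            moved++;
        }
    }
    dead->nposted = 0;
    return moved;
}
```

The attraction of this design is that no separate descriptor log is needed: the hardware's completion queue already holds everything that was in flight.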
Re: [OMPI devel] Device failover on ob1
Hi list,

I'll try to answer the main concerns so far.

We chose to work on ob1 for mainly two reasons:
- We focused first on fixing dr, but were quite disappointed by its performance in comparison with ob1. So we oriented our work toward ob1, to provide failover while keeping good performance.
- Secondly, we wanted to avoid forking ob1 as much as possible, to stay up to date with the code base. Plus, the failover layer is so thin (in comparison with the code base) that it would not make sense to fork the base into a new pml.

But we were aware that ob1 won't allow any non-zero-impact change, and that is why the added code is configured out by default. Actually, we wanted to address long jobs that can afford a very small performance loss but cannot afford aborting after several hours or days of computation because of one port failure. The goal of this prototype is to provide a proof of concept for discussion, as we know there are other people working on this subject.

As stated in the previous mail, the idea is to store any sent btl descriptor until it is marked as delivered. For that, we rely on completion callbacks, and the assumption, clearly, is that a called completion function means message delivery to the remote card. The underlying btl is the one that ensures message delivery. This is currently the case for the openib btl, but any other btl may be able to do so. With that assumption, we do not need any pml-level acknowledgment protocol (no extra messages). No timer is needed for retransmission, as it is triggered by btl failure. Today, only the error callback scenario is implemented. We should also treat the btl send method's return codes.

To deal with message duplication, the protocol maintains a message id allowing us to track received messages (hence the larger header). So any duplicated message will not be processed.

Concerning the openib btl: on a multi-port system, the connection scheme is supposed to be, for example, (host 1, port 0) <==> (host 2, port 0) and (host 1, port 1) <==> (host 2, port 1). This is set up at btl endpoint initialization, but when establishing the connection at the first send attempt, the port association information is not processed. This results in a crossed connection scheme: (host 1, port 0) <==> (host 2, port 1) and (host 1, port 1) <==> (host 2, port 0). So, instead of having two separate rings or paths, we have one big ring that does not allow failover. We had to fix this to enable failover in both multi-path (same network) and multi-rail (two separate networks) configurations with openib.

Brian: so far, we are only able to switch from a failing btl to a safe one. When there is no btl left, we abort the job. The next step is to be able to re-establish the connection when the network is back.

Mouhamed

Graham, Richard L. wrote:

> What is the impact on sm, which is by far the most sensitive to latency? This really belongs in a place other than ob1. Ob1 is supposed to provide the lowest latency possible, and other pml's are supposed to be used for heavier-weight protocols.
>
> On the technical side, how do you distinguish between a lost acknowledgement and an undelivered message? You really don't want to try to deliver data into user space twice: once a receive is complete, who knows what the user has done with that buffer? A general treatment needs to be able to handle false negatives and attempts to deliver the data more than once.
>
> How are you detecting missing acknowledgements? Are you using some sort of timer?
>
> Rich

[...]
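The pml-level bookkeeping described here - keep each sent descriptor until its completion callback fires, and drop duplicates on the receive side by message id - might be sketched as follows. All names are hypothetical, and the receive side assumes in-order delivery on each path (which RC-mode InfiniBand provides); the real protocol would track ids per peer:

```c
#include <stdbool.h>

#define PENDING_MAX 16

/* Sender side: a descriptor stays here from btl send until its
 * completion callback fires, so it can be replayed on a surviving
 * btl after a failure.  No pml-level ack is needed as long as the
 * btl (e.g. openib) guarantees delivery once completion fires. */
static unsigned pending[PENDING_MAX];
static int npending;

static void on_send(unsigned msg_id)
{
    pending[npending++] = msg_id;
}

/* Completion callback: delivery confirmed, forget the descriptor. */
static void on_completion(unsigned msg_id)
{
    for (int i = 0; i < npending; i++) {
        if (pending[i] == msg_id) {
            pending[i] = pending[--npending];
            return;
        }
    }
}

/* Receiver side: the enlarged header carries a message id; anything
 * at or below the highest id already seen is a retransmitted
 * duplicate and must not be delivered to the user again. */
static unsigned highest_seen;

static bool accept_msg(unsigned msg_id)
{
    if (msg_id <= highest_seen)
        return false;  /* duplicate: drop it */
    highest_seen = msg_id;
    return true;
}
```

This also shows why the receive-side filter answers Rich's buffer-reuse concern: a retransmitted message that was in fact already delivered is rejected by id before it can touch user memory a second time.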
Re: [OMPI devel] Device failover on ob1
Okay - here's a thought. Why not do what the original message asked? Checkout their changes and look at what they did. Then we can have the discussion about how intrusive it is. Otherwise, all we're doing is debating what they -might- have done, or what someone thinks they -should- have done, etc. Look at it first, and see how big or small a change is involved. That's all they asked us to do - certainly seemed a reasonable request. Just my $0.002 On Aug 2, 2009, at 4:49 PM, Graham, Richard L. wrote: The point here is very different, and is not being made because of objections for fail-over support. Previous work took precisely this sort of approach, and in that particular case the desire to support reliability, but be able to compile out this support still had a negative performance impact. This is why I am asking about precisely what assumptions are being made. If the assumption is that ompi can handle the failover with local information only, the impact on ompi is minimal, and the likelihood of needing to make undesirable changes to ob1 small. If ompi needs to deal with remote delivery - e.g. a send completed locally, but an ack did not arrive, is this because the remote side sent it and the connection failure kept it from arriving, or is it because the remote side did not send it at all, or maybe did not even get the data in the first plad - the logic becomes more complex, and one may end up wanting to change the way ob1 handles data to accommodate this Said another way, there may not be as much commonality as was assumed. Rich On 8/2/09 6:19 PM, "Ralph Castain" wrote: The objections being cited are somewhat unfair - perhaps people do not understand the proposal being made? The developers have gone out of their way to ensure that all changes are configured out unless you specifically select to use that functionality. 
This has been our policy from day one - as long as the changes have zero impact unless the user specifically requests that it be used, then no harm is done. So I personally don't see any objection to bringing it into the code base. Latency is not impacted one bit -unless- someone deliberately configures the code to use this feature. In that case, they are deliberately accepting any impact in order to gain the benefits. Perhaps a bigger question needs to be addressed - namely, does the ob1 code need to be refactored? Having been involved a little in the early discussion with bull when we debated over where to put this, I know the primary concern was that the code not suffer the same fate as the dr module. We have since run into a similar issue with the checksum module, so I know where they are coming from. The problem is that the code base is adjusted to support changes in ob1, which is still being debugged. On the order of 95% of the code in ob1 is required to be common across all the pml modules, so the rest of us have to (a) watch carefully all the commits to see if someone touches ob1, and then (b) manually mirror the change in our modules. This is not a supportable model over the long-term, which is why dr has died, and checksum is considering integrating into ob1 using configure #if's to avoid impacting non-checksum users. Likewise, device failover has been treated similarly here - i.e., configure out the added code unless someone wants it. This -does- lead to messier source code with these #if's in it. If we can refactor the ob1 code so the common functionality resides in the base, then perhaps we can avoid this problem. Is it possible? Ralph On Aug 2, 2009, at 3:25 PM, Graham, Richard L. wrote: On 8/2/09 12:55 AM, "Brian Barrett" wrote: While I agree that performance impact (latency in this case) is important, I disagree that this necessarily belongs somewhere other than ob1. 
For example, a zero-performance-impact solution would be to provide two versions of all the interface functions, one with failover turned on and one with it turned off, and select the appropriate functions at initialization time. There are others, including careful placement of decision logic, which are likely to result in near-zero impact. I'm not attempting to prescribe a solution, but refuting the claim that this can't be in ob1 - I think more data is needed before such a claim is made.

Just another way would be to handle setting the function pointers.

Mouhamed - can the openib btl try to re-establish a connection between two peers today (with your ob1 patches, obviously)? Would this allow us to adapt to changing routes due to switch failures (assuming that there are other physical routes around the failed switch, of course)?

The big question is what assumptions are being made for this mode of failure recovery. If the assumption is that local completion implies remote delivery, the problem is simple to solve. If not, heavier-weight protocols need to be used to cover the range of ways failure may manifest itself.
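Brian's suggestion of choosing between failover-enabled and plain entry points at initialization time can be sketched as follows. This is only an illustration of the dispatch pattern; the function and type names are hypothetical, not the actual ob1 symbols:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PML send entry points - one compiled with failover
 * bookkeeping, one without.  Neither name is a real ob1 symbol. */
typedef int (*pml_send_fn_t)(const void *buf, size_t count, int dst);

static int pml_send_plain(const void *buf, size_t count, int dst)
{
    (void)buf; (void)count; (void)dst;
    /* fast path: no descriptor tracking */
    return 0;
}

static int pml_send_failover(const void *buf, size_t count, int dst)
{
    (void)buf; (void)count; (void)dst;
    /* same as plain, plus: stash the descriptor until delivery
     * is confirmed, so it can be retransmitted on another BTL */
    return 0;
}

/* Chosen once at init; the hot path pays only an indirect call,
 * which is no worse than the usual component dispatch. */
static pml_send_fn_t pml_send;

static void pml_init(bool failover_requested)
{
    pml_send = failover_requested ? pml_send_failover : pml_send_plain;
}
```

The decision cost is paid once at startup, so a run without failover never touches the bookkeeping code at all.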
Re: [OMPI devel] Device failover on ob1
What is the impact on sm, which is by far the most sensitive to latency? This really belongs in a place other than ob1. Ob1 is supposed to provide the lowest latency possible, and other pml's are supposed to be used for heavier-weight protocols.

On the technical side, how do you distinguish between a lost acknowledgement and an undelivered message? You really don't want to try to deliver data into user space twice: once a receive is complete, who knows what the user has done with that buffer? A general treatment needs to be able to handle false negatives and attempts to deliver the data more than once.

How are you detecting missing acknowledgements? Are you using some sort of timer?

Rich
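Rich's concern about delivering data into user space twice is the classic duplicate-suppression problem: after a failover the sender may retransmit a fragment whose ack was lost, so the receiver can see it on both the dead path and the new one. A minimal receiver-side sketch, assuming per-peer in-order sequence numbers (the names are hypothetical, not actual ob1 structures, and sequence wraparound is ignored for brevity):

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-peer receive state: everything below next_expected_seq has
 * already been delivered to the user buffer. */
struct peer_state {
    uint32_t next_expected_seq;
};

/* Returns true if the fragment should be delivered to the user,
 * false if it is a duplicate that must only be re-acknowledged
 * (never delivered a second time). */
static bool accept_fragment(struct peer_state *peer, uint32_t seq)
{
    if (seq < peer->next_expected_seq) {
        return false;   /* duplicate: re-ack, do not touch user data */
    }
    peer->next_expected_seq = seq + 1;
    return true;
}
```

This answers the "false negative" case: a retransmission caused by a lost ack is recognized, re-acked, and dropped, so the user buffer is never written twice.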
[OMPI devel] Device failover on ob1
Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work toward ob1 rather than dr, and we now have a working prototype on top of ob1. The approach is to store btl descriptors sent to peers and delete them when we receive proof of delivery. So far, we rely on completion callback functions, assuming that the message is delivered when the completion function is called, which is the case for openib. When a btl module fails, it is removed from the endpoint's btl list and the next one is used to retransmit the stored descriptors. No extra message is transmitted; the mechanism consists only of additions to the header. It has mainly been tested with two IB modules, in both multi-rail (two separate networks) and multi-path (one big unified network) configurations.

You can grab and test the patch here (it applies on top of the trunk): http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just pass --enable-device-failover at configure time. You can then run a benchmark, disconnect a port, and watch the failover operate.

A small latency increase (~2%) is induced by the failover layer when no failover occurs. To accelerate the failover process on openib, you can try lowering the btl_openib_ib_timeout openib parameter to, for example, 15 instead of the default 20.

Mouhamed
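The store-until-proof-of-delivery scheme described above can be sketched roughly as follows. All names here are hypothetical illustrations, not the actual ob1/BTL symbols from the patch: descriptors go on a per-endpoint pending list at send time, are released by the completion callback (taken as proof of delivery, as with openib), and are replayed over the next BTL if the module dies.

```c
#include <stdlib.h>

/* One entry per outstanding send on this endpoint. */
struct pending_desc {
    struct pending_desc *next;
    void *btl_descriptor;   /* what would be handed back to btl_send */
};

struct endpoint {
    struct pending_desc *pending;   /* sent but not yet confirmed */
};

/* Called at send time: remember the descriptor until delivery. */
static void track_send(struct endpoint *ep, void *desc)
{
    struct pending_desc *p = malloc(sizeof *p);
    if (p == NULL) abort();         /* real code would recover */
    p->btl_descriptor = desc;
    p->next = ep->pending;
    ep->pending = p;
}

/* Completion callback: local completion == delivery on this path,
 * so the stored descriptor can be dropped. */
static void on_completion(struct endpoint *ep, void *desc)
{
    for (struct pending_desc **pp = &ep->pending; *pp; pp = &(*pp)->next) {
        if ((*pp)->btl_descriptor == desc) {
            struct pending_desc *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
    }
}

/* On BTL failure: repost everything still pending on the surviving
 * BTL.  Returns how many descriptors were replayed. */
static size_t replay_pending(struct endpoint *ep)
{
    size_t n = 0;
    for (struct pending_desc *p = ep->pending; p; p = p->next) {
        /* next_btl_send(p->btl_descriptor);  -- hypothetical repost */
        n++;
    }
    return n;
}
```

Since the list only accumulates between send and completion, the steady-state cost on the fast path is one list insert and one removal per message, which is consistent with the small (~2%) latency overhead reported above.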