Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi Gary,

Ack from me.

Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services

----- Original Message -----
Subject: Re: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Gary Lee"
Date: 9/3/18 6:41 pm
To: nagen...@hasolutions.in
Cc: "Hans Nordeback", minh.c...@dektech.com.au, opensaf-devel@lists.sourceforge.net

Hi

I think the most important point is we cannot trust any state returned from the payloads. Trying to reconcile what happened during the split seems futile. We are better off rebooting the node so we have a known starting point and reallocate assignments accordingly.

During the split, the PLs likely didn't have concurrent access to a shared resource. Now that the network is merged, we could have lots of issues if both PLs are modifying this resource assuming it has exclusivity.

Gary
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi

I think the most important point is we cannot trust any state returned from the payloads. Trying to reconcile what happened during the split seems futile. We are better off rebooting the node so we have a known starting point and reallocate assignments accordingly.

During the split, the PLs likely didn't have concurrent access to a shared resource. Now that the network is merged, we could have lots of issues if both PLs are modifying this resource assuming it has exclusivity.

Gary

> On 3 Sep 2018, at 11:00 pm, wrote:
>
> Hi Hans/Gary,
> Thanks for your opinion.
> I will presume that until the applications are declared healthy by Amf, they
> are good to go.
> I am just trying to find an alternate path like remove all assignments and
> terminate the applications of that SG and then unlock-in and unlock, to avoid
> impact on other applications because of reboot.
> In this case, we will be removing the assignments and not giving new
> assignments.
> If they go faulty, we can reboot if it goes to inst/term failure if
> saAmfNodeFailfastOnTerminationFailure and
> saAmfNodeFailfastOnInstantiationFailure are set anyway.
>
> I failed to understand application use case after cluster merge. We need to
> do fast deactivation of SUs, but when Cluster was separated, then both the
> applications were Active at the same time anyway for some time. Do you have
> any, please share.
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi Hans/Gary,

Thanks for your opinion.

I will presume that as long as the applications are declared healthy by Amf, they are good to go.

I am just trying to find an alternative path, such as removing all assignments, terminating the applications of that SG, and then doing unlock-in and unlock, to avoid the impact a reboot has on other applications. In this case, we will be removing the assignments and not giving new assignments. If the applications become faulty and end up in instantiation/termination failure, the node is rebooted anyway if saAmfNodeFailfastOnTerminationFailure and saAmfNodeFailfastOnInstantiationFailure are set.

I fail to understand the application use case after a cluster merge. We need to deactivate the SUs quickly, but while the cluster was separated both applications were already active at the same time for a while. If you have such a use case, please share it.

Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services

----- Original Message -----
Subject: Re: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Hans Nordeback"
Date: 9/3/18 4:27 pm
To: nagen...@hasolutions.in, "Gary Lee", minh.c...@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net

Hi,

I think AMF should avoid getting into this state. Resolving this state may be difficult.

AMF should not make any new assignments/failovers when the state of the failing node/component is not known, i.e. we should prefer consistency before availability.

/Thanks HansN

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi,

I think AMF should avoid getting into this state. Resolving this state may be difficult.

AMF should not make any new assignments/failovers when the state of the failing node/component is not known, i.e. we should prefer consistency before availability.

/Thanks HansN

On 09/03/2018 12:33 PM, nagen...@hasolutions.in wrote:
> Hi Gary,
> Thanks for your response.
>
> Susi delete will be little slower in resolving the conflicts, but advantage
> it has over reboot is, it doesn't impact other applications. The other
> advantage of susi delete is that the availability of SUs for workload
> assignments will be lesser in reboot than Susi delete as reboot will take its
> own time to come back and instantiate SUs. Also, I think susi delete of one
> SU will do.
>
> Going forward, we can intimate the applications that its assignments are
> being removed because of re-merge after split(either by CSI or by
> OsafCsiAttributeChangeCallbackT), it would help them taking their own actions
> like syncing of DB, etc.
>
> My take would be that we shouldn't use reboot in any case by Amf, we need to
> recover from our situations by our self. As a HA software, we need to adopt
> self healing approach.
>
> What other co-maintainers say?
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services
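The rule Hans states (no new assignment or failover while the state of the failing node is unknown) can be illustrated with a small decision function. This is only a sketch under stated assumptions: `PeerState`, `Action` and `decide` are hypothetical names that do not exist in amfd, and how a node comes to count as "confirmed down" (for example through fencing) is deliberately left out.

```cpp
#include <cassert>

// Hypothetical view the director has of the node hosting the active SU.
enum class PeerState { kUp, kConfirmedDown, kUnknown };

// Hypothetical outcomes of the failover decision.
enum class Action { kKeepAssignment, kFailOver, kHoldUntilKnown };

// Consistency before availability: fail over only when the old active
// host is known to be down. An unreachable node whose state is unknown
// may still be running the active assignment, so nothing new is assigned.
Action decide(PeerState active_host) {
  switch (active_host) {
    case PeerState::kUp:
      return Action::kKeepAssignment;
    case PeerState::kConfirmedDown:
      return Action::kFailOver;
    case PeerState::kUnknown:
      return Action::kHoldUntilKnown;
  }
  // Not reachable for valid enum values; stay on the safe side regardless.
  assert(false && "unhandled PeerState");
  return Action::kHoldUntilKnown;
}
```

With this policy a network split costs availability for the affected SG until the peer's state is known, but it cannot produce two active 2N assignments, which is the state the patch has to clean up after the merge.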
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Also we don't know how the application will respond to callbacks. If it doesn't handle the callback properly, do we wait for the normal course of recovery / escalations?

On 03/09/18 20:33, nagen...@hasolutions.in wrote:
> Hi Gary,
> Thanks for your response.
>
> Susi delete will be little slower in resolving the conflicts, but advantage
> it has over reboot is, it doesn't impact other applications. The other
> advantage of susi delete is that the availability of SUs for workload
> assignments will be lesser in reboot than Susi delete as reboot will take its
> own time to come back and instantiate SUs. Also, I think susi delete of one
> SU will do.
>
> Going forward, we can intimate the applications that its assignments are
> being removed because of re-merge after split(either by CSI or by
> OsafCsiAttributeChangeCallbackT), it would help them taking their own actions
> like syncing of DB, etc.
>
> My take would be that we shouldn't use reboot in any case by Amf, we need to
> recover from our situations by our self. As a HA software, we need to adopt
> self healing approach.
>
> What other co-maintainers say?
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi Nagendra

I think we must minimise the time the 2N SUs are active concurrently. IMO it's better for the nodes to be unavailable for a brief amount of time with a reboot than to have data inconsistency. The longer the 2N SUs are assigned active concurrently, the higher the risk.

We already know one of the nodes must have been split from the main network partition, so there is a chance other SGs on the node are affected, e.g. too many NwayActive assignments, or other duplicate 2N assignments.

Gary

On 03/09/18 20:33, nagen...@hasolutions.in wrote:
> Hi Gary,
> Thanks for your response.
>
> Susi delete will be little slower in resolving the conflicts, but advantage
> it has over reboot is, it doesn't impact other applications. The other
> advantage of susi delete is that the availability of SUs for workload
> assignments will be lesser in reboot than Susi delete as reboot will take its
> own time to come back and instantiate SUs. Also, I think susi delete of one
> SU will do.
>
> Going forward, we can intimate the applications that its assignments are
> being removed because of re-merge after split(either by CSI or by
> OsafCsiAttributeChangeCallbackT), it would help them taking their own actions
> like syncing of DB, etc.
>
> My take would be that we shouldn't use reboot in any case by Amf, we need to
> recover from our situations by our self. As a HA software, we need to adopt
> self healing approach.
>
> What other co-maintainers say?
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services
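Gary's remark that other SGs on a node may also carry duplicate assignments suggests scanning everything the payloads report after the merge, not just the one SG that tripped the check. The following is a minimal sketch of such a scan, assuming a flat list of reported assignments; `ReportedAssignment`, `HaState` and `find_duplicate_actives` are hypothetical names, not amfd data structures, and the sketch covers only the 2N rule of at most one active SU per SI.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class HaState { kActive, kStandby };

// Hypothetical record of one assignment as reported by a payload node.
struct ReportedAssignment {
  std::string si;  // service instance name
  std::string su;  // service unit holding the assignment
  HaState state;
};

// Returns, for every SI that is reported active on more than one SU,
// the set of SUs involved. An empty map means no conflict was found.
std::map<std::string, std::set<std::string>> find_duplicate_actives(
    const std::vector<ReportedAssignment> &reports) {
  std::map<std::string, std::set<std::string>> active_sus;
  for (const auto &r : reports) {
    assert(!r.su.empty());  // a report without an SU name is malformed
    if (r.state == HaState::kActive) active_sus[r.si].insert(r.su);
  }
  // Keep only the SIs that violate the "one active SU" rule.
  for (auto it = active_sus.begin(); it != active_sus.end();) {
    if (it->second.size() < 2) {
      it = active_sus.erase(it);
    } else {
      ++it;
    }
  }
  return active_sus;
}
```

A scan like this only detects conflicts that the reported state exposes. It does not address the concern raised in the thread that state returned from the payloads cannot be trusted after a split.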
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi Gary,

Thanks for your response.

Susi delete will be a little slower in resolving the conflicts, but its advantage over reboot is that it doesn't impact other applications. Another advantage is availability: after a reboot, the SUs are unavailable for workload assignments for longer than after a susi delete, because the node needs time to come back and instantiate its SUs. Also, I think a susi delete of one SU will do.

Going forward, we can inform the applications that their assignments are being removed because of a re-merge after a split (either by CSI or by OsafCsiAttributeChangeCallbackT), which would help them take their own actions such as syncing the DB.

My take is that Amf shouldn't use reboot in any case; we need to recover from such situations by ourselves. As HA software, we need to adopt a self-healing approach.

What do the other co-maintainers say?

Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services

----- Original Message -----
Subject: Re: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Gary Lee"
Date: 9/3/18 1:36 pm
To: nagen...@hasolutions.in, hans.nordeb...@ericsson.com, minh.c...@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net

Hi Nagendra

On 03/09/18 17:50, nagen...@hasolutions.in wrote:
> Hi Gary,
> I have few questions:
> 1. Do we really want to reboot both the nodes in case of conflicts?

That's a good question. A cluster reboot should also be considered? I
have proposed both nodes as it's somewhere in between. Keep in mind
other SG types could be affected also, but not picked up.

> 2. Even we want to send reboot to one node, which node we should send
> the reboot, the one, which was a part of smaller cluster?

I think we should keep it simple for this ticket, as it's really just a
stop gap. Something like #2918 should be considered.

> 3. If we could differentiate here that the conflicts happened because
> of re-merge, then will susi_delete message(here also, we need to
> decide which SU susi need to be deleted) will do rather than reboot?
> Rebooting will be little to harsh for other applications running on
> the nodes, it is just my understanding.
> 4. In general, what we assume if the partition is merged, applications
> for sure will be out of sync , so just deleting the susi will do or we
> need to reboot for sure. This is just for my understanding as I am not
> much aware of actual application level impact(in terms of Data base,
> its behavior, etc.).

I think we want to resolve the conflicting state as soon as possible.
Would deleting the susi be potentially slower than issuing a reboot?

Thanks
Gary
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi Nagendra

On 03/09/18 17:50, nagen...@hasolutions.in wrote:
> Hi Gary,
> I have few questions:
> 1. Do we really want to reboot both the nodes in case of conflicts?

That's a good question. A cluster reboot should also be considered? I
have proposed both nodes as it's somewhere in between. Keep in mind
other SG types could be affected also, but not picked up.

> 2. Even we want to send reboot to one node, which node we should send
> the reboot, the one, which was a part of smaller cluster?

I think we should keep it simple for this ticket, as it's really just a
stop gap. Something like #2918 should be considered.

> 3. If we could differentiate here that the conflicts happened because
> of re-merge, then will susi_delete message(here also, we need to
> decide which SU susi need to be deleted) will do rather than reboot?
> Rebooting will be little to harsh for other applications running on
> the nodes, it is just my understanding.
> 4. In general, what we assume if the partition is merged, applications
> for sure will be out of sync , so just deleting the susi will do or we
> need to reboot for sure. This is just for my understanding as I am not
> much aware of actual application level impact(in terms of Data base,
> its behavior, etc.).

I think we want to resolve the conflicting state as soon as possible.
Would deleting the susi be potentially slower than issuing a reboot?

Thanks
Gary
Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
Hi Gary,

I have a few questions:

1. Do we really want to reboot both nodes in case of conflicts?
2. Even if we want to send the reboot to only one node, which node should receive it: the one that was part of the smaller cluster?
3. If we could tell that the conflicts happened because of a re-merge, would a susi_delete message do rather than a reboot (here too, we need to decide which SU's susi to delete)? Rebooting would be a little too harsh for the other applications running on the nodes; that is just my understanding.
4. In general, what do we assume once the partition is merged? The applications will certainly be out of sync, so is just deleting the susi enough, or do we need to reboot for sure? This is just for my understanding, as I am not much aware of the actual application-level impact (in terms of database, its behavior, etc.).

Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services

----- Original Message -----
Subject: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Gary Lee"
Date: 8/31/18 12:17 pm
To: hans.nordeb...@ericsson.com, minh.c...@dektech.com.au, nagen...@hasolutions.in
Cc: opensaf-devel@lists.sourceforge.net, "Gary Lee"

After a split network event, both SCs can reboot endlessly, due to this assertion:

2018-08-29 18:05:34.689 SC-2 osafamfd[263]: src/amf/amfd/sg_2n_fsm.cc:596: avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
2018-08-29 18:05:34.695 SC-2 osafamfnd[273]: ER AMFD has unexpectedly crashed. Rebooting node

During the network split, a SC could assign another SU to be active, if the node hosting the old active 2N assignment is not reachable. The assert occurs after the network is merged. SC absence must be enabled.

For now, we can aid recovery of the cluster by rebooting both of the PLs in place of the assertion.
---
 src/amf/amfd/sg_2n_fsm.cc | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/src/amf/amfd/sg_2n_fsm.cc b/src/amf/amfd/sg_2n_fsm.cc
index c7d584473..3ba1dc6c8 100644
--- a/src/amf/amfd/sg_2n_fsm.cc
+++ b/src/amf/amfd/sg_2n_fsm.cc
@@ -593,8 +593,39 @@ static AVD_SU_SI_REL *avd_sg_2n_act_susi(AVD_CL_CB *cb, AVD_SG *sg,
       osafassert(a_susi_1->su == s_susi_2->su);
       osafassert(a_susi_2->su == s_susi_1->su);
     } else {
-      osafassert(a_susi_1->su == a_susi_2->su);
-      osafassert(s_susi_1->su == s_susi_2->su);
+      if (a_susi_1->su != a_susi_2->su) {
+        // Duplicate 2N active assignments found, probably after split brain
+        // Reboot both nodes hosting the SUs to recover
+
+        LOG_EM("Duplicate 2N active assignments in '%s' and '%s'",
+               a_susi_1->su->name.c_str(), a_susi_2->su->name.c_str());
+
+        LOG_EM("Sending node reboot order to '%s'",
+               a_susi_1->su->su_on_node->name.c_str());
+        avd_send_reboot_msg_directly(a_susi_1->su->su_on_node);
+
+        if (a_susi_1->su->su_on_node != a_susi_2->su->su_on_node) {
+          LOG_EM("Sending node reboot order to '%s'",
+                 a_susi_2->su->su_on_node->name.c_str());
+          avd_send_reboot_msg_directly(a_susi_2->su->su_on_node);
+        }
+      } else if (s_susi_1->su != s_susi_2->su) {
+        // Duplicate 2N standby assignments found
+        // Reboot both nodes hosting the SUs to recover
+
+        LOG_EM("Duplicate 2N standby assignments in '%s' and '%s'",
+               s_susi_1->su->name.c_str(), s_susi_2->su->name.c_str());
+
+        LOG_EM("Sending node reboot order to '%s'",
+               s_susi_1->su->su_on_node->name.c_str());
+        avd_send_reboot_msg_directly(s_susi_1->su->su_on_node);
+
+        if (s_susi_1->su->su_on_node != s_susi_2->su->su_on_node) {
+          LOG_EM("Sending node reboot order to '%s'",
+                 s_susi_2->su->su_on_node->name.c_str());
+          avd_send_reboot_msg_directly(s_susi_2->su->su_on_node);
+        }
+      }
     }
     a_susi = a_susi_1;
     s_susi = s_susi_1;
-- 
2.17.1
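The recovery rule the patch applies to each pair of SUs can be modelled outside amfd, which makes the single-node case easier to see. The sketch below is not the amfd code: `Node`, `Su` and `nodes_to_reboot` are simplified, hypothetical stand-ins, and only the field name `su_on_node` is taken from the patch. It returns the nodes that would receive a reboot order instead of sending anything.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the amfd types used in the patch.
struct Node {
  std::string name;
};

struct Su {
  std::string name;
  const Node *su_on_node;  // node hosting this SU
};

// Given the SUs behind two assignments that should belong to the same SU,
// returns the nodes that would be ordered to reboot.
// An empty result means the two assignments agree and nothing is done.
std::vector<const Node *> nodes_to_reboot(const Su *su_1, const Su *su_2) {
  assert(su_1 != nullptr && su_2 != nullptr);
  std::vector<const Node *> targets;
  if (su_1 == su_2) return targets;  // consistent view, no conflict

  targets.push_back(su_1->su_on_node);
  // Both SUs may be hosted on the same node; one reboot order is enough then.
  if (su_1->su_on_node != su_2->su_on_node) {
    targets.push_back(su_2->su_on_node);
  }
  return targets;
}
```

In the patch the same comparison is made first for the active pair and then, only if the active pair agrees, for the standby pair, and each target is passed to `avd_send_reboot_msg_directly`.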