Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-04 Thread nagendra
Hi Gary,
 Ack from me.
 
Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services
 
 
- Original Message -
Subject: Re: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Gary Lee"
Date: 9/3/18 6:41 pm
To: nagen...@hasolutions.in
Cc: "Hans Nordeback", minh.c...@dektech.com.au, opensaf-devel@lists.sourceforge.net

  
Hi
 
I think the most important point is we cannot trust any state returned from the 
payloads. Trying to reconcile what happened during the split seems futile.
We are better off rebooting the node so we have a known starting point and 
reallocate assignments accordingly.
 
During the split, the PLs likely didn't have concurrent access to a shared 
resource. Now that the network is merged, we could have lots of issues if both 
PLs are modifying this resource assuming it has exclusivity.
 
Gary


Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread Gary Lee
Hi

I think the most important point is we cannot trust any state returned from the 
payloads. Trying to reconcile what happened during the split seems futile.
We are better off rebooting the node so we have a known starting point and 
reallocate assignments accordingly.

During the split, the PLs likely didn't have concurrent access to a shared 
resource. Now that the network is merged, we could have lots of issues if both 
PLs are modifying this resource assuming it has exclusivity.

Gary

> On 3 Sep 2018, at 11:00 pm, Nagendra wrote:
>
> Hi Hans/Gary,
> Thanks for your opinion.
> I will presume that as long as the applications are declared healthy by AMF,
> they are good to go.
> I am just trying to find an alternative path, such as removing all
> assignments, terminating the applications of that SG, and then doing
> unlock-in and unlock, to avoid the impact a reboot has on other applications.
> In this case, we would be removing the assignments and not giving new
> assignments.
> If the applications go faulty and end up in instantiation/termination
> failure, we can reboot anyway, provided saAmfNodeFailfastOnTerminationFailure
> and saAmfNodeFailfastOnInstantiationFailure are set.
>
> I fail to understand the application use case after a cluster merge. We need
> fast deactivation of SUs, but while the cluster was separated, both
> applications were active at the same time for some time anyway. If you have
> a use case, please share it.
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services

Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread nagendra
Hi Hans/Gary,
Thanks for your opinion.
I will presume that as long as the applications are declared healthy by AMF,
they are good to go.
I am just trying to find an alternative path, such as removing all
assignments, terminating the applications of that SG, and then doing
unlock-in and unlock, to avoid the impact a reboot has on other applications.
In this case, we would be removing the assignments and not giving new
assignments.
If the applications go faulty and end up in instantiation/termination
failure, we can reboot anyway, provided saAmfNodeFailfastOnTerminationFailure
and saAmfNodeFailfastOnInstantiationFailure are set.

I fail to understand the application use case after a cluster merge. We need
fast deactivation of SUs, but while the cluster was separated, both
applications were active at the same time for some time anyway. If you have
a use case, please share it.
 
Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services
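
The alternative path described in this message (remove the assignments, terminate, then unlock-in and unlock) can be sketched as a sequence of administrative state transitions. This is only an illustration under assumptions: AdminState, AdminOp, Apply, RecoverySequence and RunRecovery are hypothetical names, not OpenSAF code, and the mapping of "remove all assignments" to a lock and "terminate" to a lock-instantiation is my reading of the proposal.

```cpp
#include <cassert>
#include <vector>

// Administrative states and operations, loosely named after the AMF admin
// states. Everything here is a hypothetical model, not OpenSAF code.
enum class AdminState { kUnlocked, kLocked, kLockedInstantiation };
enum class AdminOp { kLock, kLockInstantiation, kUnlockInstantiation, kUnlock };

// Applies op to state. Returns false and leaves state unchanged if the
// operation is not valid in the current state.
bool Apply(AdminState& state, AdminOp op) {
  switch (op) {
    case AdminOp::kLock:
      if (state != AdminState::kUnlocked) return false;
      state = AdminState::kLocked;  // assignments are removed
      return true;
    case AdminOp::kLockInstantiation:
      if (state != AdminState::kLocked) return false;
      state = AdminState::kLockedInstantiation;  // components are terminated
      return true;
    case AdminOp::kUnlockInstantiation:
      if (state != AdminState::kLockedInstantiation) return false;
      state = AdminState::kLocked;  // components may be instantiated again
      return true;
    case AdminOp::kUnlock:
      if (state != AdminState::kLocked) return false;
      state = AdminState::kUnlocked;  // assignments may be given again
      return true;
  }
  return false;
}

// The order of operations proposed in the message above.
std::vector<AdminOp> RecoverySequence() {
  return {AdminOp::kLock, AdminOp::kLockInstantiation,
          AdminOp::kUnlockInstantiation, AdminOp::kUnlock};
}

// Runs the whole sequence and returns the final state.
AdminState RunRecovery(AdminState state) {
  for (AdminOp op : RecoverySequence()) {
    const bool ok = Apply(state, op);
    assert(ok);
    (void)ok;  // silences the unused warning when NDEBUG is defined
  }
  return state;
}
```

Starting from an unlocked SG, the sequence ends in the unlocked state again, with no reboot of the hosting nodes along the way.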
 
- Original Message -
Subject: Re: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Hans Nordeback"
Date: 9/3/18 4:27 pm
To: nagen...@hasolutions.in, "Gary Lee", minh.c...@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net

 Hi,
 I think AMF should avoid getting into this state. Resolving this state may be
 difficult.
 AMF should not make any new assignments/failovers when the state of the
 failing node/component is not known, i.e. we should prefer consistency over
 availability.
 /Thanks HansN
 
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread Hans Nordeback

Hi,

I think AMF should avoid getting into this state. Resolving this state
may be difficult.

AMF should not make any new assignments/failovers when the state of the
failing node/component is not known, i.e. we should prefer consistency
over availability.

/Thanks HansN
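
The rule stated above (no new assignments or failovers while the state of the failing node is unknown) can be written down as a small decision function. This is a minimal sketch, not amfd code: NodeState, Decision and DecideFailover are hypothetical names, and how a node's state would become "confirmed down" (for example through fencing) is left open.

```cpp
#include <cassert>

// What the director knows about the node hosting the active assignment.
// Hypothetical model; none of these names come from OpenSAF.
enum class NodeState { kUp, kConfirmedDown, kUnknown };

enum class Decision { kNoAction, kFailover, kHoldAssignments };

// Prefers consistency over availability: a failover is only allowed once
// the old active node is known to be down.
Decision DecideFailover(NodeState active_node_state) {
  switch (active_node_state) {
    case NodeState::kUp:
      return Decision::kNoAction;
    case NodeState::kConfirmedDown:
      return Decision::kFailover;
    case NodeState::kUnknown:
      return Decision::kHoldAssignments;
  }
  assert(false && "unhandled NodeState");
  return Decision::kHoldAssignments;
}
```

With such a gate, a node that is merely unreachable during a split never causes a second active assignment, at the cost of the SI staying unassigned for longer.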


On 09/03/2018 12:33 PM, nagen...@hasolutions.in wrote:

> Hi Gary,
> Thanks for your response.
>
> Susi delete will be a little slower in resolving the conflicts, but its
> advantage over a reboot is that it doesn't impact other applications.
> The other advantage is availability: after a reboot, SUs are unavailable
> for workload assignments for longer than after a susi delete, because
> the node takes time to come back and instantiate its SUs. Also, I think
> a susi delete on one SU will do.
>
> Going forward, we can inform the applications that their assignments
> are being removed because of a re-merge after a split (either by CSI or
> by OsafCsiAttributeChangeCallbackT); that would help them take their
> own actions, such as syncing a DB.
>
> My take is that AMF shouldn't use a reboot in any case; we need to
> recover from such situations ourselves. As HA software, we need to
> adopt a self-healing approach.
>
> What do the other co-maintainers say?
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services


Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread Gary Lee
Also, we don't know how the application will respond to callbacks. If
it doesn't handle the callback properly, do we wait for the normal
course of recovery / escalations?



On 03/09/18 20:33, nagen...@hasolutions.in wrote:

> Hi Gary,
> Thanks for your response.
>
> Susi delete will be a little slower in resolving the conflicts, but its
> advantage over a reboot is that it doesn't impact other applications.
> The other advantage is availability: after a reboot, SUs are unavailable
> for workload assignments for longer than after a susi delete, because
> the node takes time to come back and instantiate its SUs. Also, I think
> a susi delete on one SU will do.
>
> Going forward, we can inform the applications that their assignments
> are being removed because of a re-merge after a split (either by CSI or
> by OsafCsiAttributeChangeCallbackT); that would help them take their
> own actions, such as syncing a DB.
>
> My take is that AMF shouldn't use a reboot in any case; we need to
> recover from such situations ourselves. As HA software, we need to
> adopt a self-healing approach.
>
> What do the other co-maintainers say?
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services


Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread Gary Lee

Hi Nagendra

I think we must minimise the time the 2N SUs are active concurrently.
IMO it's better for the nodes to be unavailable for a brief amount of
time with a reboot than to have data inconsistency. The longer the 2N
SUs are assigned active concurrently, the higher the risk. We already
know one of the nodes must have been split from the main network
partition, so there is a chance other SGs on the node are affected, e.g.
too many NwayActive assignments, or other duplicate 2N assignments.


Gary
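
To make the point about other affected SGs concrete, here is a minimal sketch of a check that finds every SI with more active assignments than it should have after a merge. It is an illustration only: ActiveAssignment, OverAssignedSis and the max_active table are hypothetical and do not correspond to amfd data structures.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One active assignment as reported by a node after the merge.
// Hypothetical type, not an amfd structure.
struct ActiveAssignment {
  std::string si;
  std::string su;
};

// Returns the SIs that have more active assignments than allowed, sorted
// by name. max_active is a hypothetical per-SI limit: 1 for a 2N SG, the
// configured number of active assignments for an N-way active SG.
// SIs without an entry default to a limit of 1.
std::vector<std::string> OverAssignedSis(
    const std::vector<ActiveAssignment>& reported,
    const std::map<std::string, unsigned>& max_active) {
  std::map<std::string, unsigned> count;
  for (const ActiveAssignment& a : reported) {
    assert(!a.si.empty());
    ++count[a.si];
  }
  std::vector<std::string> conflicts;
  for (const auto& entry : count) {
    const auto limit_it = max_active.find(entry.first);
    const unsigned limit =
        (limit_it != max_active.end()) ? limit_it->second : 1u;
    if (entry.second > limit) conflicts.push_back(entry.first);
  }
  return conflicts;
}
```

A check like this only detects the conflicts; it says nothing about which side holds the correct data, which is why a reboot to a known starting point is being proposed.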


On 03/09/18 20:33, nagen...@hasolutions.in wrote:

> Hi Gary,
> Thanks for your response.
>
> Susi delete will be a little slower in resolving the conflicts, but its
> advantage over a reboot is that it doesn't impact other applications.
> The other advantage is availability: after a reboot, SUs are unavailable
> for workload assignments for longer than after a susi delete, because
> the node takes time to come back and instantiate its SUs. Also, I think
> a susi delete on one SU will do.
>
> Going forward, we can inform the applications that their assignments
> are being removed because of a re-merge after a split (either by CSI or
> by OsafCsiAttributeChangeCallbackT); that would help them take their
> own actions, such as syncing a DB.
>
> My take is that AMF shouldn't use a reboot in any case; we need to
> recover from such situations ourselves. As HA software, we need to
> adopt a self-healing approach.
>
> What do the other co-maintainers say?
>
> Thanks,
> Nagendra, 91-9866424860
> High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
> - OpenSAF Support and Services


Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread nagendra
Hi Gary,
Thanks for your response.
 
Susi delete will be a little slower in resolving the conflicts, but its
advantage over a reboot is that it doesn't impact other applications.
The other advantage is availability: after a reboot, SUs are unavailable
for workload assignments for longer than after a susi delete, because
the node takes time to come back and instantiate its SUs. Also, I think
a susi delete on one SU will do.

Going forward, we can inform the applications that their assignments
are being removed because of a re-merge after a split (either by CSI or
by OsafCsiAttributeChangeCallbackT); that would help them take their
own actions, such as syncing a DB.

My take is that AMF shouldn't use a reboot in any case; we need to
recover from such situations ourselves. As HA software, we need to
adopt a self-healing approach.

What do the other co-maintainers say?
 
Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services
 
 - Original Message -
 Subject: Re: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active 
assignments [#2920]
From: "Gary Lee" 
Date: 9/3/18 1:36 pm
To: nagen...@hasolutions.in, hans.nordeb...@ericsson.com, 
minh.c...@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net

Hi Nagendra
 
 On 03/09/18 17:50, nagen...@hasolutions.in wrote:
 > Hi Gary,
 > I have a few questions:
 > 1. Do we really want to reboot both nodes in case of conflicts?

 That's a good question. Should a cluster reboot also be considered? I
 have proposed rebooting both nodes as it's somewhere in between. Keep in
 mind that other SG types could also be affected without being picked up
 by this check.

 > 2. Even if we want to send a reboot to only one node, which node should
 > receive it: the one that was part of the smaller cluster?

 I think we should keep it simple for this ticket, as it's really just a
 stopgap. Something like #2918 should be considered.

 > 3. If we could tell that the conflicts happened because of a re-merge,
 > would a susi_delete message do instead of a reboot (here too, we would
 > need to decide which SU's susi to delete)? Rebooting would be a little
 > too harsh on other applications running on the nodes; that is just my
 > understanding.

 > 4. In general, what do we assume when the partitions are merged? The
 > applications will certainly be out of sync, so will just deleting the
 > susi do, or do we definitely need to reboot? This is just for my
 > understanding, as I am not very familiar with the actual
 > application-level impact (in terms of database, behavior, etc.).

 I think we want to resolve the conflicting state as soon as possible.
 Would deleting the susi be potentially slower than issuing a reboot?
 
 Thanks
 Gary


Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread Gary Lee

Hi Nagendra

On 03/09/18 17:50, nagen...@hasolutions.in wrote:
> Hi Gary,
> I have a few questions:
> 1. Do we really want to reboot both nodes in case of conflicts?

That's a good question. Should a cluster reboot also be considered? I
have proposed rebooting both nodes as it's somewhere in between. Keep in
mind that other SG types could also be affected without being picked up
by this check.

> 2. Even if we want to send a reboot to only one node, which node should
> receive it: the one that was part of the smaller cluster?

I think we should keep it simple for this ticket, as it's really just a
stopgap. Something like #2918 should be considered.

> 3. If we could tell that the conflicts happened because of a re-merge,
> would a susi_delete message do instead of a reboot (here too, we would
> need to decide which SU's susi to delete)? Rebooting would be a little
> too harsh on other applications running on the nodes; that is just my
> understanding.

> 4. In general, what do we assume when the partitions are merged? The
> applications will certainly be out of sync, so will just deleting the
> susi do, or do we definitely need to reboot? This is just for my
> understanding, as I am not very familiar with the actual
> application-level impact (in terms of database, behavior, etc.).

I think we want to resolve the conflicting state as soon as possible.
Would deleting the susi be potentially slower than issuing a reboot?


Thanks
Gary





Re: [devel] [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]

2018-09-03 Thread nagendra
Hi Gary,
I have a few questions:
1. Do we really want to reboot both nodes in case of conflicts?
2. Even if we want to send a reboot to only one node, which node should
receive it: the one that was part of the smaller cluster?
3. If we could tell that the conflicts happened because of a re-merge,
would a susi_delete message do instead of a reboot (here too, we would
need to decide which SU's susi to delete)? Rebooting would be a little
too harsh on other applications running on the nodes; that is just my
understanding.
4. In general, what do we assume when the partitions are merged? The
applications will certainly be out of sync, so will just deleting the
susi do, or do we definitely need to reboot? This is just for my
understanding, as I am not very familiar with the actual
application-level impact (in terms of database, behavior, etc.).
 
Thanks,
Nagendra, 91-9866424860
High Availability Solutions Pvt. Ltd. (www.hasolutions.in)
- OpenSAF Support and Services
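
Question 2 can be made concrete with a small sketch of a "reboot only the node from the smaller partition" policy. This is purely illustrative and rests on assumptions: Candidate and PickRebootVictim are hypothetical names, and it assumes the director can learn how many nodes each side could reach during the split, which the patch below does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One side of a former split. Hypothetical type, not an amfd structure.
struct Candidate {
  std::string node;            // node hosting the conflicting active SU
  std::size_t partition_size;  // nodes that side could reach during the split
};

// Picks the node to reboot: the one that was in the smaller partition.
// On a tie it returns an empty string, meaning "no single victim", in
// which case rebooting both nodes remains the fallback.
std::string PickRebootVictim(const Candidate& a, const Candidate& b) {
  assert(a.node != b.node);
  if (a.partition_size < b.partition_size) return a.node;
  if (b.partition_size < a.partition_size) return b.node;
  return std::string();
}
```

For example, with PL-3 in a two-node partition and PL-4 in a five-node partition, the function returns "PL-3".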
 
 
- Original Message -
Subject: [PATCH 1/1] amfd: reboot nodes that report conflicting 2N active assignments [#2920]
From: "Gary Lee"
Date: 8/31/18 12:17 pm
To: hans.nordeb...@ericsson.com, minh.c...@dektech.com.au, nagen...@hasolutions.in
Cc: opensaf-devel@lists.sourceforge.net, "Gary Lee"

After a split network event, both SCs can reboot endlessly,
due to this assertion:

2018-08-29 18:05:34.689 SC-2 osafamfd[263]: src/amf/amfd/sg_2n_fsm.cc:596: avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
2018-08-29 18:05:34.695 SC-2 osafamfnd[273]: ER AMFD has unexpectedly crashed. Rebooting node

During the network split, a SC could assign another SU to be active,
if the node hosting the old active 2N assignment is not reachable.

The assert occurs after the network is merged. SC absence must be
enabled.

For now, we can aid recovery of the cluster by rebooting
both of the PLs in place of the assertion.
---
 src/amf/amfd/sg_2n_fsm.cc | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/src/amf/amfd/sg_2n_fsm.cc b/src/amf/amfd/sg_2n_fsm.cc
index c7d584473..3ba1dc6c8 100644
--- a/src/amf/amfd/sg_2n_fsm.cc
+++ b/src/amf/amfd/sg_2n_fsm.cc
@@ -593,8 +593,39 @@ static AVD_SU_SI_REL *avd_sg_2n_act_susi(AVD_CL_CB *cb, AVD_SG *sg,
       osafassert(a_susi_1->su == s_susi_2->su);
       osafassert(a_susi_2->su == s_susi_1->su);
     } else {
-      osafassert(a_susi_1->su == a_susi_2->su);
-      osafassert(s_susi_1->su == s_susi_2->su);
+      if (a_susi_1->su != a_susi_2->su) {
+        // Duplicate 2N active assignments found, probably after split brain
+        // Reboot both nodes hosting the SUs to recover
+
+        LOG_EM("Duplicate 2N active assignments in '%s' and '%s'",
+               a_susi_1->su->name.c_str(), a_susi_2->su->name.c_str());
+
+        LOG_EM("Sending node reboot order to '%s'",
+               a_susi_1->su->su_on_node->name.c_str());
+        avd_send_reboot_msg_directly(a_susi_1->su->su_on_node);
+
+        if (a_susi_1->su->su_on_node != a_susi_2->su->su_on_node) {
+          LOG_EM("Sending node reboot order to '%s'",
+                 a_susi_2->su->su_on_node->name.c_str());
+          avd_send_reboot_msg_directly(a_susi_2->su->su_on_node);
+        }
+      } else if (s_susi_1->su != s_susi_2->su) {
+        // Duplicate 2N standby assignments found
+        // Reboot both nodes hosting the SUs to recover
+
+        LOG_EM("Duplicate 2N standby assignments in '%s' and '%s'",
+               s_susi_1->su->name.c_str(), s_susi_2->su->name.c_str());
+
+        LOG_EM("Sending node reboot order to '%s'",
+               s_susi_1->su->su_on_node->name.c_str());
+        avd_send_reboot_msg_directly(s_susi_1->su->su_on_node);
+
+        if (s_susi_1->su->su_on_node != s_susi_2->su->su_on_node) {
+          LOG_EM("Sending node reboot order to '%s'",
+                 s_susi_2->su->su_on_node->name.c_str());
+          avd_send_reboot_msg_directly(s_susi_2->su->su_on_node);
+        }
+      }
     }
     a_susi = a_susi_1;
     s_susi = s_susi_1;
-- 
2.17.1
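
The control flow that replaces the two assertions can also be read in isolation. The following is a minimal standalone model, not the patch itself: Node, Su, SuSi and NodesToReboot are hypothetical stand-ins for the amfd types, and the function returns node names instead of calling avd_send_reboot_msg_directly.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, pared-down stand-ins for amfd's node, SU and SU-SI types.
// Only the fields the patch touches are modelled here.
struct Node {
  std::string name;
};
struct Su {
  std::string name;
  const Node* su_on_node;
};
struct SuSi {
  const Su* su;
};

// Returns the names of the nodes that would be sent a reboot order, in the
// order the patch sends them. An empty vector means no conflict was seen.
std::vector<std::string> NodesToReboot(const SuSi& a1, const SuSi& a2,
                                       const SuSi& s1, const SuSi& s2) {
  assert(a1.su && a2.su && s1.su && s2.su);
  std::vector<std::string> targets;
  // Reboot the first SU's node, then the second SU's node if it differs.
  auto add_pair = [&targets](const SuSi& x, const SuSi& y) {
    targets.push_back(x.su->su_on_node->name);
    if (x.su->su_on_node != y.su->su_on_node) {
      targets.push_back(y.su->su_on_node->name);
    }
  };
  if (a1.su != a2.su) {
    add_pair(a1, a2);  // duplicate 2N active assignments
  } else if (s1.su != s2.su) {
    add_pair(s1, s2);  // duplicate 2N standby assignments
  }
  return targets;
}
```

Note that, as in the patch, the standby branch is only reached when the active assignments agree, so a simultaneous active and standby conflict reboots only the nodes hosting the conflicting active SUs.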