Re: [devel] [PATCH 1 of 1] smfd: handle failed si-swap of middleware [#1605]

2016-10-12 Thread Rafael Odzakow
ACK


On 10/10/2016 01:46 PM, Neelakanta Reddy wrote:
> Hi Alex,
>
> Reviewed and tested the patch.
> Ack.
>
> /Neel.
>
> On 2016/10/07 11:10 PM, Alex Jones wrote:
>> osaf/services/saf/smfsv/smfd/SmfStepState.cc|  13 
>>   osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc |  68 
>> +---
>>   osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.hh |   6 +
>>   3 files changed, 76 insertions(+), 11 deletions(-)
>>
>>
>> Sep 27 00:34:14 q50-s1 osafsmfd[6667]: NO SA_AMF_ADMIN_SI_SWAP [rc=1] 
>> successfully initiated
>> Sep 27 00:34:15 q50-s1 osafimmnd[6571]: NO ERR_BAD_OPERATION: 
>> Mismatch on administrative owner '' != 'SMFSERVICE'
>> Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO Fail to invoke admin 
>> operation, rc=SA_AIS_ERR_BAD_OPERATION (20). 
>> dn=[safSi=SC-2N,safApp=OpenSAF], opId=[7]
>> Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO Admin op 
>> SA_AMF_ADMIN_SI_SWAP fail [rc = 20]
>> Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO CAMP: Procedure 
>> safSmfProc=RollingUpgrade returned FAILED
>> Sep 27 00:36:14 q50-s1 osafsmfd[6667]: NO Campaign thread does not 
>> disappear within 120 seconds after SA_AMF_ADMIN_SI_SWAP, the 
>> operation was assumed failed.
>> Sep 27 00:36:14 q50-s1 kernel: [14934029.531187] osafsmfd[32024]: 
>> segfault at 4 ip 004425b6 sp 7f67f7ffe1c0 error 4 in 
>> osafsmfd[40+9a000]
>> Sep 27 00:36:14 q50-s1 osafamfnd[6649]: NO 
>> 'safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 
>> 'avaDown' : Recovery is 'nodeFailfast'
>> Sep 27 00:36:14 q50-s1 osafamfnd[6649]: ER 
>> safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown 
>> Recovery is:nodeFailfast
>>
>> There are a few problems here. One is that the SmfSwapThread is 
>> pointing to a
>> deleted procedure when the original active controller is reassigned 
>> active. The
>> second problem is that a new SmfSwapThread is created when the 
>> original active
>> controller is reassigned active, so now there are two running. The 
>> first thread
>> tries to use its proc pointer (which has been deleted when the 
>> original active
>> goes to quiesced) and causes the segfault.
>>
>> The proposed solution is a little different from that proposed in the 
>> ticket
>> description. This solution proposes to use the existence of the 
>> SmfSwapThread as
>> a test. When the original active controller is reassigned active 
>> because the
>> si-swap failed, it will still remove the RestartIndicator as it does 
>> now. But,
>> if the SmfSwapThread is still running, it will not create a new one, 
>> but update
>> it with the recreated procedure pointer, and let it handle the 
>> si-swap timeout.
>> Then it will report the error. I believe this solution is backwards 
>> compatible
>> because no IMM changes are made like the ones proposed in the ticket.
>>
>> diff --git a/osaf/services/saf/smfsv/smfd/SmfStepState.cc 
>> b/osaf/services/saf/smfsv/smfd/SmfStepState.cc
>> --- a/osaf/services/saf/smfsv/smfd/SmfStepState.cc
>> +++ b/osaf/services/saf/smfsv/smfd/SmfStepState.cc
>> @@ -424,6 +424,19 @@ SmfStepStateUndone::execute(SmfUpgradeSt
>>   {
>>   TRACE_ENTER();
>>   +if (i_step->calculateStepType() != SA_AIS_OK) {
>> +LOG_ER("SmfStepStateUndone: Failed to calculate step 
>> type");
>> +changeState(i_step, SmfStepStateFailed::instance());
>> +TRACE_LEAVE();
>> +return SMF_STEP_FAILED;
>> +}
>> +
>> +if (i_step->getSwitchOver() == true) {
>> +TRACE("Switch over is needed in this step");
>> +TRACE_LEAVE();
>> +return SMF_STEP_SWITCHOVER;
>> +}
>> +
>>   i_step->setRetryCount(0); /* Reset the retry counter */
>>   changeState(i_step, SmfStepStateExecuting::instance());
>>   diff --git a/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc 
>> b/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc
>> --- a/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc
>> +++ b/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc
>> @@ -482,12 +482,17 @@ SmfUpgradeProcedure::switchOver()
>>   osafassert(0);
>>   }
>>   -TRACE("SmfUpgradeProcedure::switchOver: Create the restart 
>> indicator");
>> - 
>> SmfCampaignThread::instance()->campaign()->getUpgradeCampaign()->createSmfRestartIndicator();
>> -
>> -SmfSwapThread *swapThread = new SmfSwapThread(this);
>> -TRACE("SmfUpgradeProcedure::switchOver, Starting SI_SWAP thread");
>> -swapThread->start();
>> +if (!SmfSwapThread::running()) {
>> +TRACE("SmfUpgradeProcedure::switchOver: Create the restart 
>> indicator");
>> + 
>> SmfCampaignThread::instance()->campaign()->getUpgradeCampaign()->createSmfRestartIndicator();
>> +
>> +SmfSwapThread *swapThread = new SmfSwapThread(this);
>> +TRACE("SmfUpgradeProcedure::switchOver, Starting SI_SWAP 
>> thread");
>> +swapThread->start();
>> +} else {
>> +

[devel] [PATCH 1 of 1] smfd: handle failed si-swap of middleware [#1605]

2016-10-07 Thread Alex Jones
 osaf/services/saf/smfsv/smfd/SmfStepState.cc|  13 
 osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc |  68 +---
 osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.hh |   6 +
 3 files changed, 76 insertions(+), 11 deletions(-)


Sep 27 00:34:14 q50-s1 osafsmfd[6667]: NO SA_AMF_ADMIN_SI_SWAP [rc=1] 
successfully initiated
Sep 27 00:34:15 q50-s1 osafimmnd[6571]: NO ERR_BAD_OPERATION: Mismatch on 
administrative owner '' != 'SMFSERVICE'
Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO Fail to invoke admin operation, 
rc=SA_AIS_ERR_BAD_OPERATION (20). dn=[safSi=SC-2N,safApp=OpenSAF], opId=[7]
Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO Admin op SA_AMF_ADMIN_SI_SWAP fail 
[rc = 20]
Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO CAMP: Procedure 
safSmfProc=RollingUpgrade returned FAILED
Sep 27 00:36:14 q50-s1 osafsmfd[6667]: NO Campaign thread does not disappear 
within 120 seconds after SA_AMF_ADMIN_SI_SWAP, the operation was assumed failed.
Sep 27 00:36:14 q50-s1 kernel: [14934029.531187] osafsmfd[32024]: segfault at 4 
ip 004425b6 sp 7f67f7ffe1c0 error 4 in osafsmfd[40+9a000]
Sep 27 00:36:14 q50-s1 osafamfnd[6649]: NO 
'safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Sep 27 00:36:14 q50-s1 osafamfnd[6649]: ER 
safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast

There are a few problems here. One is that the SmfSwapThread is pointing to a
deleted procedure when the original active controller is reassigned active. The
second problem is that a new SmfSwapThread is created when the original active
controller is reassigned active, so now there are two running. The first thread
tries to use its proc pointer (which has been deleted when the original active
goes to quiesced) and causes the segfault.

The proposed solution is a little different from that proposed in the ticket
description. This solution proposes to use the existence of the SmfSwapThread as
a test. When the original active controller is reassigned active because the
si-swap failed, it will still remove the RestartIndicator as it does now. But,
if the SmfSwapThread is still running, it will not create a new one, but update
it with the recreated procedure pointer, and let it handle the si-swap timeout.
Then it will report the error. I believe this solution is backwards compatible
because no IMM changes are made like the ones proposed in the ticket.

diff --git a/osaf/services/saf/smfsv/smfd/SmfStepState.cc 
b/osaf/services/saf/smfsv/smfd/SmfStepState.cc
--- a/osaf/services/saf/smfsv/smfd/SmfStepState.cc
+++ b/osaf/services/saf/smfsv/smfd/SmfStepState.cc
@@ -424,6 +424,19 @@ SmfStepStateUndone::execute(SmfUpgradeSt
 {
TRACE_ENTER();
 
+if (i_step->calculateStepType() != SA_AIS_OK) {
+LOG_ER("SmfStepStateUndone: Failed to calculate step type");
+changeState(i_step, SmfStepStateFailed::instance());
+TRACE_LEAVE();
+return SMF_STEP_FAILED;
+}
+
+if (i_step->getSwitchOver() == true) {
+TRACE("Switch over is needed in this step");
+TRACE_LEAVE();
+return SMF_STEP_SWITCHOVER;
+}
+
 i_step->setRetryCount(0); /* Reset the retry counter */
changeState(i_step, SmfStepStateExecuting::instance());
 
diff --git a/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc 
b/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc
--- a/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc
+++ b/osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc
@@ -482,12 +482,17 @@ SmfUpgradeProcedure::switchOver()
 osafassert(0);
 }
 
-   TRACE("SmfUpgradeProcedure::switchOver: Create the restart indicator");
-   
SmfCampaignThread::instance()->campaign()->getUpgradeCampaign()->createSmfRestartIndicator();
-
-   SmfSwapThread *swapThread = new SmfSwapThread(this);
-   TRACE("SmfUpgradeProcedure::switchOver, Starting SI_SWAP thread");
-   swapThread->start();
+   if (!SmfSwapThread::running()) {
+   TRACE("SmfUpgradeProcedure::switchOver: Create the restart 
indicator");
+   
SmfCampaignThread::instance()->campaign()->getUpgradeCampaign()->createSmfRestartIndicator();
+
+   SmfSwapThread *swapThread = new SmfSwapThread(this);
+   TRACE("SmfUpgradeProcedure::switchOver, Starting SI_SWAP 
thread");
+   swapThread->start();
+   } else {
+   TRACE("SmfUpgradeProcedure::switchOver, SI_SWAP thread already 
running");
+   SmfSwapThread::setProc(this);
+   }
 
TRACE_LEAVE();
 }
@@ -4156,6 +4161,31 @@ SmfUpgradeProcedure::resetProcCounter()
 /*  Static methods*/
 /**/
 
+SmfSwapThread *SmfSwapThread::me(0);
+std::mutex SmfSwapThread::m_mutex;
+