Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-04-18 Thread Zoran Milinkovic
Hi Hoang,

Ack from me.

Thanks,
Zoran

-Original Message-
From: Vo Minh Hoang [mailto:hoang.m...@dektech.com.au] 
Sent: 14 April 2017 10:44
To: 'A V Mahesh' ; Zoran Milinkovic
Cc: opensaf-devel@lists.sourceforge.net; 'Ramesh Babu Betham'
Subject: RE: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

Dear Mahesh,

Thank you for your comments.
I have added two of my ideas inline; please look for the [Hoang] tags.

Dear Zoran,

Do you have any extra comment about this patch?
If not, I will request that it be pushed at the start of next week.

Sincerely,
Hoang

-Original Message-
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Thursday, April 13, 2017 5:47 PM
To: Vo Minh Hoang ; zoran.milinko...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; 'Ramesh Babu Betham'
Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

Hi Hoang,

ACK with the following comments (tested basic ND restarts):

- The errors below are not related to this patch; they are test-case related.

- It looks like there is an existing issue (not related to this patch): on CPND
  down, the STANDBY CPD is also starting
  `cpd_tmr_start(&cpnd_info->cpnd_ret_timer, ...);`. Please check that flow once
  (after the cpnd restart, wait a while on the active CPD and then do a switchover).
[Hoang]: I can also reproduce this behavior but could not find the error,
so I will continue checking it in a separate ticket.
It is a little weird that the standby cpd triggers anything; honestly, I think
the standby should only do data sync. That said, it is too soon to draw
conclusions about this case.
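If the eventual fix is a role check, the guard could look roughly like this
sketch (the ha_state field and the helper names are assumptions on my part,
not the actual cpd code):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical mirror of the HA role kept in the CPD control block. */
    typedef enum { HA_ACTIVE, HA_STANDBY } ha_state_t;

    typedef struct {
        ha_state_t ha_state;
        bool ret_timer_running;
    } cpd_cb_sketch;

    /* Only the active director arms the CPND retention timer; the standby
     * would learn the outcome through checkpointed state instead. */
    static void start_retention_timer_guarded(cpd_cb_sketch *cb)
    {
        if (cb->ha_state != HA_ACTIVE) {
            printf("standby: skipping cpnd_ret_timer start\n");
            return;
        }
        cb->ret_timer_running = true;
        printf("active: cpnd_ret_timer started\n");
    }

    int main(void)
    {
        cpd_cb_sketch standby = { HA_STANDBY, false };
        start_retention_timer_guarded(&standby); /* prints the skip message */
        return 0;
    }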

- You introduced `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);` in
  cpnd_down_process(), but cpnd_up_process() also calls
  `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`.
  Please check whether that call is redundant.
[Hoang]: I think we should keep this call even if it is redundant in this case.
We keep finding unexpected error cases in the system and cannot say for sure
whether it is redundant or not.
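A redundant stop stays harmless as long as the stop itself is idempotent; a
minimal sketch of such a wrapper (assuming the real cpd timer keeps an active
flag, which this thread does not show):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical timer wrapper with an explicit active flag. */
    typedef struct {
        bool is_active;
    } cpd_tmr_sketch;

    /* Stopping an already-stopped timer is a no-op, so calling this from
     * both the cpnd-down and cpnd-up paths does no harm. */
    static void tmr_stop_idempotent(cpd_tmr_sketch *tmr)
    {
        if (!tmr->is_active)
            return;             /* redundant call: nothing to do */
        tmr->is_active = false; /* the real code would also cancel the OS timer */
        printf("timer stopped\n");
    }

    int main(void)
    {
        cpd_tmr_sketch t = { true };
        tmr_stop_idempotent(&t); /* stops the timer */
        tmr_stop_idempotent(&t); /* redundant, but safe */
        return 0;
    }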

-AVM

On 4/12/2017 2:19 PM, A V Mahesh wrote:
> Hi Hoang,
>
> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>> If cpnd is only temporarily down, we do not need to clean up anything.
>> If cpnd is permanently down, the downside of this proposal is that the
>> replica is not cleaned up. But if cpnd is permanently down, we have to
>> reboot the node to recover, so I think this cleanup is not really
>> necessary.
>>
>> I have also checked this implementation against the possible test cases
>> and have not seen any side effects.
>> Please consider it.
> We are observing new node_user_info database mismatch errors while
> testing multiple CPND restarts with this patch. I will do more debugging
> and report the root cause.
>
> =======================================================================
>
>
> Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
> Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
> Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
> Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
> Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
> Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
> Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>
> =======================================================================
>
>
> -AVM
>
>
> On 4/12/2017 11:08 AM, A V Mahesh wrote:
>> Hi Hoang,
>>
>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>> Dear Mahesh,
>>>
>>> Based on what I saw, in this case the retention timer cannot detect
>>> that CPND was only temporarily down, because its pid changed.
>> I will check that; I have some test cases based on this retention timer
>> and am not sure how they were working.
>>
>> Can you please provide steps to reproduce? I did look at the ticket, but
>> it looks complex; if you have an application that reproduces the case,
>> please share it.
>>
>> -AVM
>>>
>>> If cpnd is only temporarily down, we do not need to clean up anything.
>>> If cpnd is permanently 

Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-04-14 Thread A V Mahesh
Hi Hoang,

ACK, you can push.

 >> So I will continue checking it in a separate ticket.

Please create a ticket for tracking.

-AVM


On 4/14/2017 2:14 PM, Vo Minh Hoang wrote:
> Dear Mahesh,
>
> Thank you for your comments.
> I have added two of my ideas inline; please look for the [Hoang] tags.
>
> Dear Zoran,
>
> Do you have any extra comment about this patch?
> If not, I will request that it be pushed at the start of next week.
>
> Sincerely,
> Hoang
>
> -Original Message-
> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
> Sent: Thursday, April 13, 2017 5:47 PM
> To: Vo Minh Hoang ; zoran.milinko...@ericsson.com
> Cc: opensaf-devel@lists.sourceforge.net; Ramesh Babu Betham
> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
>
> Hi Hoang,
>
> ACK with the following comments (tested basic ND restarts):
>
> - The errors below are not related to this patch; they are test-case related.
>
> - It looks like there is an existing issue (not related to this patch): on CPND
> down, the STANDBY CPD is also starting
> `cpd_tmr_start(&cpnd_info->cpnd_ret_timer, ...);`. Please check that flow once
> (after the cpnd restart, wait a while on the active CPD and then do a switchover).
> [Hoang]: I can also reproduce this behavior but could not find the error,
> so I will continue checking it in a separate ticket.
> It is a little weird that the standby cpd triggers anything; honestly, I think
> the standby should only do data sync. That said, it is too soon to draw
> conclusions about this case.
>
> - You introduced `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);` in
> cpnd_down_process(), but cpnd_up_process() also calls
> `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`.
> Please check whether that call is redundant.
> [Hoang]: I think we should keep this call even if it is redundant in this
> case. We keep finding unexpected error cases in the system and cannot say
> for sure whether it is redundant or not.
>
> -AVM
>
> On 4/12/2017 2:19 PM, A V Mahesh wrote:
>> Hi Hoang,
>>
>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>> If cpnd is only temporarily down, we do not need to clean up anything.
>>> If cpnd is permanently down, the downside of this proposal is that the
>>> replica is not cleaned up. But if cpnd is permanently down, we have to
>>> reboot the node to recover, so I think this cleanup is not really
>>> necessary.
>>>
>>> I have also checked this implementation against the possible test cases
>>> and have not seen any side effects.
>>> Please consider it.
>> We are observing new node_user_info database mismatch errors while
>> testing multiple CPND restarts with this patch. I will do more debugging
>> and report the root cause.
>>
>> =======================================================================
>>
>>
>> Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>> Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
>> Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
>> Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
>> Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
>> Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
>> Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
>> Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>>
>> =======================================================================
>>
>>
>> -AVM
>>
>>
>> On 4/12/2017 11:08 AM, A V Mahesh wrote:
>>> Hi Hoang,
>>>
>>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
 Dear Mahesh,

 Based on what I saw, in this case the retention timer cannot detect
 that CPND was only temporarily down, because its pid changed.
>>> I will check that; I have some test cases based on this retention timer
>>> and am not sure how they were working.
>>>
>>> Can you please provide steps to reproduce? I did look at the ticket, but
>>> it looks complex; if you have an application that reproduces the case,
>>> please share it.
>>>
>>> -AVM
 If cpnd is only temporarily down, we do not need to clean up anything.
 If cpnd is permanently down, the downside of this proposal is that the
 replica is not cleaned up. But if cpnd is permanently down, we have to

Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-04-14 Thread Vo Minh Hoang
Dear Mahesh,

Thank you for your comments.
I have added two of my ideas inline; please look for the [Hoang] tags.

Dear Zoran,

Do you have any extra comment about this patch?
If not, I will request that it be pushed at the start of next week.

Sincerely,
Hoang

-Original Message-
From: A V Mahesh [mailto:mahesh.va...@oracle.com] 
Sent: Thursday, April 13, 2017 5:47 PM
To: Vo Minh Hoang ; zoran.milinko...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; Ramesh Babu Betham
Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

Hi Hoang,

ACK with the following comments (tested basic ND restarts):

- The errors below are not related to this patch; they are test-case related.

- It looks like there is an existing issue (not related to this patch): on CPND
  down, the STANDBY CPD is also starting
  `cpd_tmr_start(&cpnd_info->cpnd_ret_timer, ...);`. Please check that flow once
  (after the cpnd restart, wait a while on the active CPD and then do a switchover).
[Hoang]: I can also reproduce this behavior but could not find the error,
so I will continue checking it in a separate ticket.
It is a little weird that the standby cpd triggers anything; honestly, I think
the standby should only do data sync. That said, it is too soon to draw
conclusions about this case.

- You introduced `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);` in
  cpnd_down_process(), but cpnd_up_process() also calls
  `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`.
  Please check whether that call is redundant.
[Hoang]: I think we should keep this call even if it is redundant in this case.
We keep finding unexpected error cases in the system and cannot say for sure
whether it is redundant or not.

-AVM

On 4/12/2017 2:19 PM, A V Mahesh wrote:
> Hi Hoang,
>
> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>> If cpnd is only temporarily down, we do not need to clean up anything.
>> If cpnd is permanently down, the downside of this proposal is that the
>> replica is not cleaned up. But if cpnd is permanently down, we have to
>> reboot the node to recover, so I think this cleanup is not really
>> necessary.
>>
>> I have also checked this implementation against the possible test cases
>> and have not seen any side effects.
>> Please consider it.
> We are observing new node_user_info database mismatch errors while
> testing multiple CPND restarts with this patch. I will do more debugging
> and report the root cause.
>
> =======================================================================
>
>
> Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
> Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
> Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
> Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
> Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
> Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
> Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>
> =======================================================================
>
>
> -AVM
>
>
> On 4/12/2017 11:08 AM, A V Mahesh wrote:
>> Hi Hoang,
>>
>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>> Dear Mahesh,
>>>
>>> Based on what I saw, in this case the retention timer cannot detect
>>> that CPND was only temporarily down, because its pid changed.
>> I will check that; I have some test cases based on this retention timer
>> and am not sure how they were working.
>>
>> Can you please provide steps to reproduce? I did look at the ticket, but
>> it looks complex; if you have an application that reproduces the case,
>> please share it.
>>
>> -AVM
>>>
>>> If cpnd is only temporarily down, we do not need to clean up anything.
>>> If cpnd is permanently down, the downside of this proposal is that the
>>> replica is not cleaned up. But if cpnd is permanently down, we have to
>>> reboot the node to recover, so I think this cleanup is not really
>>> necessary.
>>>
>>> I have also checked this implementation against the possible test cases
>>> and have not seen any side effects.
>>> Please consider it.
>>>
>>> Thank you and best regards,
>>> Hoang
>>>
>>> -Original Message-

Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-04-13 Thread A V Mahesh
Hi Hoang,

ACK with the following comments (tested basic ND restarts):

- The errors below are not related to this patch; they are test-case related.

- It looks like there is an existing issue (not related to this patch): on CPND
  down, the STANDBY CPD is also starting
  `cpd_tmr_start(&cpnd_info->cpnd_ret_timer, ...);`. Please check that flow once
  (after the cpnd restart, wait a while on the active CPD and then do a switchover).

- You introduced `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);` in
  cpnd_down_process(), but cpnd_up_process() also calls
  `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`.
  Please check whether that call is redundant.

-AVM

On 4/12/2017 2:19 PM, A V Mahesh wrote:
> Hi Hoang,
>
> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>> If cpnd is only temporarily down, we do not need to clean up anything.
>> If cpnd is permanently down, the downside of this proposal is that the
>> replica is not cleaned up. But if cpnd is permanently down, we have to
>> reboot the node to recover, so I think this cleanup is not really
>> necessary.
>>
>> I have also checked this implementation against the possible test cases
>> and have not seen any side effects.
>> Please consider it.
> We are observing new node_user_info database mismatch errors while
> testing multiple CPND restarts with this patch. I will do more debugging
> and report the root cause.
>
> =======================================================================
>
>
> Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
> Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
> Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
> Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
> Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
> Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
> Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
> Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
>
> =======================================================================
>
>
> -AVM
>
>
> On 4/12/2017 11:08 AM, A V Mahesh wrote:
>> Hi Hoang,
>>
>> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>>> Dear Mahesh,
>>>
>>> Based on what I saw, in this case the retention timer cannot detect
>>> that CPND was only temporarily down, because its pid changed.
>> I will check that; I have some test cases based on this retention timer
>> and am not sure how they were working.
>>
>> Can you please provide steps to reproduce? I did look at the ticket, but
>> it looks complex; if you have an application that reproduces the case,
>> please share it.
>>
>> -AVM
>>>
>>> If cpnd is only temporarily down, we do not need to clean up anything.
>>> If cpnd is permanently down, the downside of this proposal is that the
>>> replica is not cleaned up. But if cpnd is permanently down, we have to
>>> reboot the node to recover, so I think this cleanup is not really
>>> necessary.
>>>
>>> I have also checked this implementation against the possible test cases
>>> and have not seen any side effects.
>>> Please consider it.
>>>
>>> Thank you and best regards,
>>> Hoang
>>>
>>> -Original Message-
>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>> Sent: Friday, February 10, 2017 10:40 AM
>>> To: Hoang Vo ; zoran.milinko...@ericsson.com
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv
>>> [#1765] V5
>>>
>>> Hi Hoang,
>>>
>>> The CPD_CPND_DOWN_RETENTION timer is there to recognize whether CPND is
>>> temporarily or permanently down. It is started when a CPND goes down; on
>>> cpd_evt_proc_timer_expiry(), cpd concludes that the CPND is completely
>>> down and does the cleanup. Otherwise, if the cpnd rejoins within
>>> CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer is
>>> stopped.
>>>
>>> If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(),
>>> how does cpd recognize that the CPND is permanently down?
>>> cpd_process_cpnd_down() is called in multiple flows; can you please check
>>> all of them for whether stopping the CPD_CPND_DOWN_RETENTION timer has
>>> any impact?
>>>
>>> -AVM
>>>
>>> On 2/9/2017 1:35 PM, Hoang Vo wrote:

Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-04-12 Thread Vo Minh Hoang
Dear Mahesh,

Sorry it took some time to recall this long-lost information.

Below are the steps to reproduce on the newest source code (a minimal client
sketch for the last step follows the list):
- create some non-collocated checkpoints on SC-1
- trigger a failover with pkill -9 amfd
- repeat that 4 more times on the active SC
- check /run/shm and find that all replicas are gone
- create a checkpoint with the same name again; the open fails with SA_AIS_ERR_LIBRARY
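
For the last step, a minimal SAF Checkpoint client along these lines could
drive the open and check for SA_AIS_ERR_LIBRARY (a sketch only: the checkpoint
name and the attribute values are illustrative, not taken from the original
test):

    #include <stdio.h>
    #include <string.h>
    #include <saCkpt.h>

    int main(void)
    {
        SaCkptHandleT ckpt_handle;
        SaVersionT version = { 'B', 2, 2 };
        SaAisErrorT rc = saCkptInitialize(&ckpt_handle, NULL, &version);
        if (rc != SA_AIS_OK) {
            fprintf(stderr, "saCkptInitialize failed: %d\n", (int)rc);
            return 1;
        }

        SaNameT name;
        const char *dn = "safCkpt=repro1765"; /* illustrative name */
        name.length = (SaUint16T)strlen(dn);
        memcpy(name.value, dn, name.length);

        /* Non-collocated checkpoint: SA_CKPT_CHECKPOINT_COLLOCATED not set. */
        SaCkptCheckpointCreationAttributesT attrs = {
            .creationFlags = SA_CKPT_WR_ACTIVE_REPLICA,
            .checkpointSize = 1024,
            .retentionDuration = 60 * SA_TIME_ONE_SECOND,
            .maxSections = 1,
            .maxSectionSize = 1024,
            .maxSectionIdSize = 32,
        };

        SaCkptCheckpointHandleT cp;
        rc = saCkptCheckpointOpen(ckpt_handle, &name, &attrs,
                                  SA_CKPT_CHECKPOINT_CREATE |
                                  SA_CKPT_CHECKPOINT_WRITE,
                                  SA_TIME_MAX, &cp);
        if (rc == SA_AIS_ERR_LIBRARY)
            fprintf(stderr, "reproduced: open returned SA_AIS_ERR_LIBRARY\n");
        else
            printf("open returned %d\n", (int)rc);

        saCkptFinalize(ckpt_handle);
        return 0;
    }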

Sincerely,
Hoang

-Original Message-
From: A V Mahesh [mailto:mahesh.va...@oracle.com] 
Sent: Wednesday, April 12, 2017 3:50 PM
To: Vo Minh Hoang ; zoran.milinko...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; Ramesh Babu Betham
Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

Hi Hoang,

On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
> If cpnd is only temporarily down, we do not need to clean up anything.
> If cpnd is permanently down, the downside of this proposal is that the
> replica is not cleaned up. But if cpnd is permanently down, we have to
> reboot the node to recover, so I think this cleanup is not really
> necessary.
>
> I have also checked this implementation against the possible test cases
> and have not seen any side effects.
> Please consider it.
We are observing new node_user_info database mismatch errors while testing
multiple CPND restarts with this patch. I will do more debugging and report
the root cause.


===

Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0


===

-AVM


On 4/12/2017 11:08 AM, A V Mahesh wrote:
> Hi Hoang,
>
> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>> Dear Mahesh,
>>
>> Based on what I saw, in this case the retention timer cannot detect
>> that CPND was only temporarily down, because its pid changed.
> I will check that; I have some test cases based on this retention timer
> and am not sure how they were working.
>
> Can you please provide steps to reproduce? I did look at the ticket, but
> it looks complex; if you have an application that reproduces the case,
> please share it.
>
> -AVM
>>
>> If cpnd is only temporarily down, we do not need to clean up anything.
>> If cpnd is permanently down, the downside of this proposal is that the
>> replica is not cleaned up. But if cpnd is permanently down, we have to
>> reboot the node to recover, so I think this cleanup is not really
>> necessary.
>>
>> I have also checked this implementation against the possible test cases
>> and have not seen any side effects.
>> Please consider it.
>>
>> Thank you and best regards,
>> Hoang
>>
>> -Original Message-
>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>> Sent: Friday, February 10, 2017 10:40 AM
>> To: Hoang Vo ; 
>> zoran.milinko...@ericsson.com
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv 
>> [#1765] V5
>>
>> Hi Hoang,
>>
>> The CPD_CPND_DOWN_RETENTION timer is there to recognize whether CPND is
>> temporarily or permanently down. It is started when a CPND goes down; on
>> cpd_evt_proc_timer_expiry(), cpd concludes that the CPND is completely
>> down and does the cleanup. Otherwise, if the cpnd rejoins within
>> CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer is stopped.
>>
>> If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(),
>> how does cpd recognize that the CPND is permanently down?
>> cpd_process_cpnd_down() is called in multiple flows; can you please check
>> all of them for whether stopping the CPD_CPND_DOWN_RETENTION timer has any
>> impact?
>>
>> -AVM
>>
>> On 2/9/2017 1:35 PM, Hoang Vo wrote:
>>>src/ckpt/ckptd/cpd_proc.c |  11 ++-
>>>1 files changed, 10 

Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-04-12 Thread A V Mahesh
Hi Hoang,

On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
> If cpnd is only temporarily down, we do not need to clean up anything.
> If cpnd is permanently down, the downside of this proposal is that the
> replica is not cleaned up. But if cpnd is permanently down, we have to
> reboot the node to recover, so I think this cleanup is not really necessary.
>
> I have also checked this implementation against the possible test cases
> and have not seen any side effects.
> Please consider it.
We are observing new node_user_info database mismatch errors while testing
multiple CPND restarts with this patch. I will do more debugging and report
the root cause.

===

Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F
Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0

===

-AVM


On 4/12/2017 11:08 AM, A V Mahesh wrote:
> Hi Hoang,
>
> On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
>> Dear Mahesh,
>>
>> Based on what I saw, in this case the retention timer cannot detect
>> that CPND was only temporarily down, because its pid changed.
> I will check that; I have some test cases based on this retention timer
> and am not sure how they were working.
>
> Can you please provide steps to reproduce? I did look at the ticket, but
> it looks complex; if you have an application that reproduces the case,
> please share it.
>
> -AVM
>>
>> If cpnd is only temporarily down, we do not need to clean up anything.
>> If cpnd is permanently down, the downside of this proposal is that the
>> replica is not cleaned up. But if cpnd is permanently down, we have to
>> reboot the node to recover, so I think this cleanup is not really
>> necessary.
>>
>> I have also checked this implementation against the possible test cases
>> and have not seen any side effects.
>> Please consider it.
>>
>> Thank you and best regards,
>> Hoang
>>
>> -Original Message-
>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>> Sent: Friday, February 10, 2017 10:40 AM
>> To: Hoang Vo ; zoran.milinko...@ericsson.com
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv
>> [#1765] V5
>>
>> Hi Hoang,
>>
>> The CPD_CPND_DOWN_RETENTION timer is there to recognize whether CPND is
>> temporarily or permanently down. It is started when a CPND goes down; on
>> cpd_evt_proc_timer_expiry(), cpd concludes that the CPND is completely
>> down and does the cleanup. Otherwise, if the cpnd rejoins within
>> CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer is stopped.
>>
>> If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(),
>> how does cpd recognize that the CPND is permanently down?
>> cpd_process_cpnd_down() is called in multiple flows; can you please check
>> all of them for whether stopping the CPD_CPND_DOWN_RETENTION timer has any
>> impact?
>>
>> -AVM
>>
>> On 2/9/2017 1:35 PM, Hoang Vo wrote:
>>>src/ckpt/ckptd/cpd_proc.c |  11 ++-
>>>1 files changed, 10 insertions(+), 1 deletions(-)
>>>
>>>
>>> problem:
>>> When failover happens multiple times, cpnd is down for a moment, so no
>>> cpnd has the specific checkpoint open. This causes the retention timer
>>> to be triggered. When cpnd comes up again, it has a different pid, so
>>> the retention timer is not stopped. The replica is deleted on retention
>>> expiry while its information is still in the ckpt database. That causes
>>> the problem.
>>>
>>> Fix:
>>> - Stop the timer of the removed node.
>>> - Update the data in the patricia trees (for retention value consistency).
>>>
>>> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
>>> --- a/src/ckpt/ckptd/cpd_proc.c
>>> +++ b/src/ckpt/ckptd/cpd_proc.c
>>> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c

Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-04-11 Thread A V Mahesh
Hi Hoang,

On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
> Dear Mahesh,
>
> Based on what I saw, in this case the retention timer cannot detect
> that CPND was only temporarily down, because its pid changed.
I will check that; I have some test cases based on this retention timer
and am not sure how they were working.

Can you please provide steps to reproduce? I did look at the ticket, but
it looks complex; if you have an application that reproduces the case,
please share it.

-AVM
>
> If cpnd is only temporarily down, we do not need to clean up anything.
> If cpnd is permanently down, the downside of this proposal is that the
> replica is not cleaned up. But if cpnd is permanently down, we have to
> reboot the node to recover, so I think this cleanup is not really necessary.
>
> I have also checked this implementation against the possible test cases
> and have not seen any side effects.
> Please consider it.
>
> Thank you and best regards,
> Hoang
>
> -Original Message-
> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
> Sent: Friday, February 10, 2017 10:40 AM
> To: Hoang Vo ; zoran.milinko...@ericsson.com
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv
> [#1765] V5
>
> Hi Hoang,
>
> The CPD_CPND_DOWN_RETENTION timer is there to recognize whether CPND is
> temporarily or permanently down. It is started when a CPND goes down; on
> cpd_evt_proc_timer_expiry(), cpd concludes that the CPND is completely
> down and does the cleanup. Otherwise, if the cpnd rejoins within
> CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer is stopped.
>
> If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(),
> how does cpd recognize that the CPND is permanently down?
> cpd_process_cpnd_down() is called in multiple flows; can you please check
> all of them for whether stopping the CPD_CPND_DOWN_RETENTION timer has any
> impact?
>
> -AVM
>
> On 2/9/2017 1:35 PM, Hoang Vo wrote:
>>src/ckpt/ckptd/cpd_proc.c |  11 ++-
>>1 files changed, 10 insertions(+), 1 deletions(-)
>>
>>
>> problem:
>> When failover happens multiple times, cpnd is down for a moment, so no
>> cpnd has the specific checkpoint open. This causes the retention timer
>> to be triggered. When cpnd comes up again, it has a different pid, so
>> the retention timer is not stopped. The replica is deleted on retention
>> expiry while its information is still in the ckpt database. That causes
>> the problem.
>>
>> Fix:
>> - Stop the timer of the removed node.
>> - Update the data in the patricia trees (for retention value consistency).
>>
>> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
>> --- a/src/ckpt/ckptd/cpd_proc.c
>> +++ b/src/ckpt/ckptd/cpd_proc.c
>> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>>  cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>>  if (!cpnd_info)
>>  	return NCSCC_RC_SUCCESS;
>> -
>> +/* Stop timer before processing down */
>> +cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>>  cref_info = cpnd_info->ckpt_ref_list;
>>
>>  while (cref_info) {
>> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>>
>>  /* Update the retention Time */
>>  (*ckpt_node)->ret_time = reten_time;
>> +(*ckpt_node)->attributes.retentionDuration = reten_time;
>> +
>> +/* Update the related patricia tree */
>> +CPD_CKPT_MAP_INFO *map_info = NULL;
>> +cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
>> +if (map_info) {
>> +	map_info->attributes.retentionDuration = reten_time;
>> +}
>>  return rc;
>>  }
>>
>




Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-02-10 Thread Vo Minh Hoang
Dear Mahesh,

Based on what I saw, in this case the retention timer cannot detect that CPND
was only temporarily down, because its pid changed.
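
One reading of this (an assumption on my part: that the per-cpnd record is
keyed by an identity that embeds the pid, so it changes across restarts) is
sketched below:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical key: node id plus pid, which changes when cpnd restarts. */
    typedef struct {
        uint32_t node_id;
        uint32_t pid;
    } cpnd_key;

    static int key_equal(cpnd_key a, cpnd_key b)
    {
        return a.node_id == b.node_id && a.pid == b.pid;
    }

    int main(void)
    {
        cpnd_key before = { 0x2020f, 1111 }; /* record whose timer is running */
        cpnd_key after  = { 0x2020f, 2222 }; /* same node, restarted cpnd */

        /* The rejoin event does not match the old record, so the old
         * retention timer keeps running and later deletes the replica. */
        if (!key_equal(before, after))
            printf("restarted cpnd not matched; old timer keeps running\n");
        return 0;
    }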

If cpnd is only temporarily down, we do not need to clean up anything.
If cpnd is permanently down, the downside of this proposal is that the replica
is not cleaned up. But if cpnd is permanently down, we have to reboot the node
to recover, so I think this cleanup is not really necessary.

I have also checked this implementation against the possible test cases and
have not seen any side effects.
Please consider it.

Thank you and best regards,
Hoang

-Original Message-
From: A V Mahesh [mailto:mahesh.va...@oracle.com] 
Sent: Friday, February 10, 2017 10:40 AM
To: Hoang Vo ; zoran.milinko...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv
[#1765] V5

Hi Hoang,

The CPD_CPND_DOWN_RETENTION timer is there to recognize whether CPND is
temporarily or permanently down. It is started when a CPND goes down; on
cpd_evt_proc_timer_expiry(), cpd concludes that the CPND is completely down
and does the cleanup. Otherwise, if the cpnd rejoins within
CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer is stopped.

If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(), how
does cpd recognize that the CPND is permanently down? cpd_process_cpnd_down()
is called in multiple flows; can you please check all of them for whether
stopping the CPD_CPND_DOWN_RETENTION timer has any impact?

-AVM

On 2/9/2017 1:35 PM, Hoang Vo wrote:
>   src/ckpt/ckptd/cpd_proc.c |  11 ++-
>   1 files changed, 10 insertions(+), 1 deletions(-)
>
>
> problem:
> When failover happens multiple times, cpnd is down for a moment, so no
> cpnd has the specific checkpoint open. This causes the retention timer
> to be triggered. When cpnd comes up again, it has a different pid, so
> the retention timer is not stopped. The replica is deleted on retention
> expiry while its information is still in the ckpt database. That causes
> the problem.
>
> Fix:
> - Stop the timer of the removed node.
> - Update the data in the patricia trees (for retention value consistency).
>
> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
> --- a/src/ckpt/ckptd/cpd_proc.c
> +++ b/src/ckpt/ckptd/cpd_proc.c
> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>   cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>   if (!cpnd_info)
>   	return NCSCC_RC_SUCCESS;
> -
> + /* Stop timer before processing down */
> + cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>   cref_info = cpnd_info->ckpt_ref_list;
>   
>   while (cref_info) {
> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>   
>   /* Update the retention Time */
>   (*ckpt_node)->ret_time = reten_time;
> + (*ckpt_node)->attributes.retentionDuration = reten_time;
> +
> + /* Update the related patricia tree */
> + CPD_CKPT_MAP_INFO *map_info = NULL;
> + cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
> + if (map_info) {
> + map_info->attributes.retentionDuration = reten_time;
> + }
>   return rc;
>   }
>   





Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

2017-02-09 Thread A V Mahesh
Hi Hoang,

The CPD_CPND_DOWN_RETENTION timer is there to recognize whether CPND is
temporarily or permanently down. It is started when a CPND goes down; on
cpd_evt_proc_timer_expiry(), cpd concludes that the CPND is completely down
and does the cleanup. Otherwise, if the cpnd rejoins within
CPD_CPND_DOWN_RETENTION_TIME, the CPD_CPND_DOWN_RETENTION timer is stopped.
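
In outline, that lifecycle could be sketched like this (a sketch under the
assumptions stated in the comments, not the actual cpd implementation):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-CPND record with its retention timer state. */
    typedef struct {
        bool timer_active;
        bool cleaned_up;
    } cpnd_rec;

    /* CPND down: arm the retention timer instead of cleaning up at once. */
    static void on_cpnd_down(cpnd_rec *rec)  { rec->timer_active = true; }

    /* CPND rejoined within CPD_CPND_DOWN_RETENTION_TIME: stop the timer
     * and keep the record (the down was only temporary). */
    static void on_cpnd_up(cpnd_rec *rec)    { rec->timer_active = false; }

    /* Timer expiry: the CPND is treated as permanently down; clean up. */
    static void on_timer_expiry(cpnd_rec *rec)
    {
        if (rec->timer_active) {
            rec->cleaned_up = true;
            rec->timer_active = false;
        }
    }

    int main(void)
    {
        cpnd_rec rec = { false, false };
        on_cpnd_down(&rec);
        on_cpnd_up(&rec);      /* rejoined in time: timer stopped */
        on_timer_expiry(&rec); /* expiry after the stop is a no-op */
        printf("cleaned_up=%d\n", rec.cleaned_up); /* prints 0 */
        return 0;
    }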

If we stop the CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_down(), how
does cpd recognize that the CPND is permanently down? cpd_process_cpnd_down()
is called in multiple flows; can you please check all of them for whether
stopping the CPD_CPND_DOWN_RETENTION timer has any impact?

-AVM

On 2/9/2017 1:35 PM, Hoang Vo wrote:
>   src/ckpt/ckptd/cpd_proc.c |  11 ++-
>   1 files changed, 10 insertions(+), 1 deletions(-)
>
>
> problem:
> When failover happens multiple times, cpnd is down for a moment, so no
> cpnd has the specific checkpoint open. This causes the retention timer
> to be triggered. When cpnd comes up again, it has a different pid, so
> the retention timer is not stopped. The replica is deleted on retention
> expiry while its information is still in the ckpt database. That causes
> the problem.
>
> Fix:
> - Stop the timer of the removed node.
> - Update the data in the patricia trees (for retention value consistency).
>
> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
> --- a/src/ckpt/ckptd/cpd_proc.c
> +++ b/src/ckpt/ckptd/cpd_proc.c
> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>   cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>   if (!cpnd_info)
>   	return NCSCC_RC_SUCCESS;
> -
> + /* Stop timer before processing down */
> + cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>   cref_info = cpnd_info->ckpt_ref_list;
>   
>   while (cref_info) {
> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>   
>   /* Update the retention Time */
>   (*ckpt_node)->ret_time = reten_time;
> + (*ckpt_node)->attributes.retentionDuration = reten_time;
> +
> + /* Update the related patricia tree */
> + CPD_CKPT_MAP_INFO *map_info = NULL;
> + cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
> + if (map_info) {
> + map_info->attributes.retentionDuration = reten_time;
> + }
>   return rc;
>   }
>   

