Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Hi Hoang,

Ack from me.

Thanks,
Zoran

-----Original Message-----
From: Vo Minh Hoang [mailto:hoang.m...@dektech.com.au]
Sent: den 14 april 2017 10:44
To: 'A V Mahesh'; Zoran Milinkovic
Cc: opensaf-devel@lists.sourceforge.net; 'Ramesh Babu Betham'
Subject: RE: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Hi Hoang,

ACK, you can push.

>> So I will continue checking it in a separate ticket.

Please create a ticket for tracking.

-AVM
Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Dear Mahesh,

Thank you for your comments. I have added my two ideas inline; please look for the [Hoang] tags.

Dear Zoran,

Do you have any extra comments about this patch? If not, I will request to push it at the start of next week.

Sincerely,
Hoang

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Thursday, April 13, 2017 5:47 PM
To: Vo Minh Hoang; zoran.milinko...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; Ramesh Babu Betham
Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5

Hi Hoang,

ACK with the following (tested basic ND restarts):

- The reported node_user_info errors are not related to this patch; they are test-case related.

- There seems to be an existing issue (not related to this patch): on a CPND down, the STANDBY CPD also starts `cpd_tmr_start(&cpnd_info->cpnd_ret_timer, ...);`. Please check that flow once (after a cpnd restart, wait some time and then do a switchover of the active CPD).

[Hoang]: I can also reproduce this behavior but could not find the error, so I will continue checking it in a separate ticket. It is a little weird that the standby CPD triggers something; honestly, I think the standby should only do data sync. Anyway, it is too soon to talk about this case.

- You introduced cpd_tmr_stop(&cpnd_info->cpnd_ret_timer); in cpnd_down_process(), but cpnd_up_process() also calls `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`. Please check whether that call is redundant.

[Hoang]: I think we should keep this call even if it is redundant in this case. We are detecting more and more unexpected error cases in the system and cannot tell for sure whether it is redundant or not.
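On the redundant-stop question above: a second `cpd_tmr_stop()` is harmless as long as the stop routine is idempotent. The sketch below is a minimal illustration of that idea with demo types; it is not OpenSAF's actual cpd timer implementation, and all names are hypothetical. It assumes the timer carries an is-active flag that a stop on an already-stopped timer simply ignores:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for a retention timer (demo only). */
typedef struct {
	bool is_active;  /* armed by start, cleared by stop */
	int stop_count;  /* counts only real stops, for demonstration */
} demo_tmr_t;

void demo_tmr_start(demo_tmr_t *tmr)
{
	tmr->is_active = true;
}

void demo_tmr_stop(demo_tmr_t *tmr)
{
	if (!tmr->is_active)
		return; /* redundant stop: a no-op, not an error */
	tmr->is_active = false;
	tmr->stop_count++;
}
```

With a stop written this way, calling it from both the down-processing and the up-processing paths (as discussed above) costs nothing beyond the flag check.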
Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Hi Hoang,

ACK with the following (tested basic ND restarts):

- The errors reported earlier (the node_user_info mismatches on 4/12) are not related to this patch; they are test-case related.

- There seems to be an existing issue (not related to this patch): on a CPND down, the STANDBY CPD also starts `cpd_tmr_start(&cpnd_info->cpnd_ret_timer, ...);`. Please check that flow once (after a cpnd restart, wait some time and then do a switchover of the active CPD).

- You introduced cpd_tmr_stop(&cpnd_info->cpnd_ret_timer); in cpnd_down_process(), but cpnd_up_process() also calls `cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);`. Please check whether that call is redundant.

-AVM
Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Dear Mahesh,

Sorry that it took time to recall some long-lost information. Below are the reproduction steps on the newest source code:

- Create some non-collocated checkpoints on SC-1.
- Make a failover occur with pkill -9 amfd.
- Do that again four times with the active SC.
- Check /run/shm and find that all the replicas are gone.
- Create a checkpoint with the same name again and get SA_AIS_ERR_LIBRARY.

Sincerely,
Hoang

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Wednesday, April 12, 2017 3:50 PM
To: Vo Minh Hoang; zoran.milinko...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; Ramesh Babu Betham
Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Hi Hoang,

On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
> If cpnd is only temporarily down, we don't need to clean up anything.
> If cpnd is permanently down, the bad effect of this proposal is that the replica is not cleaned up. But if cpnd is permanently down, we have to reboot the node to recover, so I think this cleanup is not really necessary.
>
> I also checked this implementation with the possible test cases and have not seen any side effect.
> Please consider it.

We are observing new node_user_info database mismatch errors while testing multiple CPND restarts with this patch. I will do more debugging and update the root cause.

===============================================================

Apr 12 14:06:57 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0
*Apr 12 14:06:58 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F*
Apr 12 14:06:58 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0501750, arg=0x7f86f0501ef0
*Apr 12 14:06:59 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F*
Apr 12 14:06:59 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0503ab0, arg=0x7f86f0501ef0
Apr 12 14:07:00 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500c70, arg=0x7f86f0501ef0
Apr 12 14:07:01 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500930, arg=0x7f86f0501ef0
*Apr 12 14:07:03 SC-1 osafckptd[27594]: ER cpd_proc_decrease_node_user_info failed - no user on node id 0x2020F*
Apr 12 14:07:03 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f04fe3a0, arg=0x7f86f0501ef0
Apr 12 14:07:04 SC-1 osafckptd[27594]: NO cpnd_down_process:: Start CPND_RETENTION timer id = 0x7f86f0500cf0, arg=0x7f86f0501ef0

===============================================================

-AVM
Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Hi Hoang,

On 2/10/2017 3:09 PM, Vo Minh Hoang wrote:
> Dear Mahesh,
>
> Based on what I saw, in this case, the retention timer cannot detect that CPND was only temporarily down, because its pid changed.

I will check that. I have some test cases based on this retention time; I am not sure how they were working.

Can you please provide reproducible steps? I did look at the ticket, but it looks complex; if you have any application that reproduces the case, please share it.

-AVM

> On 2/9/2017 1:35 PM, Hoang Vo wrote:
>> src/ckpt/ckptd/cpd_proc.c | 11 ++-
>> 1 files changed, 10 insertions(+), 1 deletions(-)
>>
>> problem:
>> In case of multiple failovers, the cpnd is down for a moment, so there is no cpnd opening the specific checkpoint. This leads to the retention timer being triggered.
>> When cpnd comes up again it has a different pid, so the retention timer is not stopped.
>> The replica is deleted at retention expiry while its information is still in the ckpt database.
>> That causes the problem.
>>
>> Fix:
>> - Stop the timer of the removed node.
>> - Update the data in the patricia trees (for retention value consistency).
>>
>> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
>> --- a/src/ckpt/ckptd/cpd_proc.c
>> +++ b/src/ckpt/ckptd/cpd_proc.c
>> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>>  	cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>>  	if (!cpnd_info)
>>  		return NCSCC_RC_SUCCESS;
>> -
>> +	/* Stop timer before processing down */
>> +	cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>>  	cref_info = cpnd_info->ckpt_ref_list;
>>
>>  	while (cref_info) {
>> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>>
>>  	/* Update the retention Time */
>>  	(*ckpt_node)->ret_time = reten_time;
>> +	(*ckpt_node)->attributes.retentionDuration = reten_time;
>> +
>> +	/* Update the related patricia tree */
>> +	CPD_CKPT_MAP_INFO *map_info = NULL;
>> +	cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
>> +	if (map_info) {
>> +		map_info->attributes.retentionDuration = reten_time;
>> +	}
>>  	return rc;
>> }

--
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5
Dear Mahesh, Based on what I saw, in this case, retention time cannot detect CPND temporarily down because its pid changed. If cpnd is temporary down only, we don't need clean up anything. If cpnd is permanently down, the bad effect of this proposal is that replica is not clean up. But if cpnd permanently down, we have to reboot node for recovering so I think this cleanup is not really necessary. I also checked this implementation with possible test cases and have not seen any side effect. Please consider it. Thank you and best regards, Hoang -Original Message- From: A V Mahesh [mailto:mahesh.va...@oracle.com] Sent: Friday, February 10, 2017 10:40 AM To: Hoang Vo; zoran.milinko...@ericsson.com Cc: opensaf-devel@lists.sourceforge.net Subject: Re: [PATCH 1 of 1] cpd: to correct failover behavior of cpsv [#1765] V5 Hi Hoang, The CPD_CPND_DOWN_RETENTION is to recognize, ether CPND temporarily down or permanently down, this is started a CPND is down and based on cpd_evt_proc_timer_expiry(), cpd recognize that the CPND is complete down and do cleanup, else cpnd rejoined with in CPD_CPND_DOWN_RETENTION_TIME , the CPD_CPND_DOWN_RETENTION is stoped. If we stop CPD_CPND_DOWN_RETENTION timer in cpd_process_cpnd_dow(), do cpd recognize the CPD permanently down, the cpd_process_cpnd_dow() being called in multiple flows, can you please check all the flows, is stopping CPD_CPND_DOWN_RETENTION timer has any impact ? -AVM On 2/9/2017 1:35 PM, Hoang Vo wrote: > src/ckpt/ckptd/cpd_proc.c | 11 ++- > 1 files changed, 10 insertions(+), 1 deletions(-) > > > problem: > In case failover multiple times, the cpnd is down for a moment so > there is no cpnd opening specific checkpoint. This lead to retention timer is trigger. > When cpnd is up again but has different pid so retention timer is not stoped. > Repica is deleted at retention while its information still be in ckpt database. > That cause problem > > Fix: > - Stop timer of removed node. 
> - Update data in the patricia trees (for retention value consistency).
>
> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c
> --- a/src/ckpt/ckptd/cpd_proc.c
> +++ b/src/ckpt/ckptd/cpd_proc.c
> @@ -679,7 +679,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c
>  	cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, &add_flag);
>  	if (!cpnd_info)
>  		return NCSCC_RC_SUCCESS;
> -
> +	/* Stop timer before processing down */
> +	cpd_tmr_stop(&cpnd_info->cpnd_ret_timer);
>  	cref_info = cpnd_info->ckpt_ref_list;
>
>  	while (cref_info) {
> @@ -989,6 +990,14 @@ uint32_t cpd_proc_retention_set(CPD_CB *
>
>  	/* Update the retention Time */
>  	(*ckpt_node)->ret_time = reten_time;
> +	(*ckpt_node)->attributes.retentionDuration = reten_time;
> +
> +	/* Update the related patricia tree */
> +	CPD_CKPT_MAP_INFO *map_info = NULL;
> +	cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, &map_info);
> +	if (map_info) {
> +		map_info->attributes.retentionDuration = reten_time;
> +	}
>  	return rc;
>  }

--
Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel