Hi Francis,

Do you have the traces with 
log4j.logger.org.continuent.sequoia.controller.virtualdatabase set to DEBUG?
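If not, adding something like the following to log4j.properties should enable 
them (I am reusing the Console and Filetrace appenders from your config below; 
adjust the names if yours differ):

  log4j.logger.org.continuent.sequoia.controller.virtualdatabase=DEBUG, Console, Filetrace
  log4j.additivity.org.continuent.sequoia.controller.virtualdatabase=false
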
Could you also try with the latest version of Hedera?

Sorry for the lag in my responses; I have been swamped since I got back!
Emmanuel
> Hello Emmanuel,
>
> Yes, all were in debug. Here is the snippet:
>
> ######################################
> # Hedera group communication loggers #
> ######################################
> # Hedera channels test #
>   log4j.logger.test.org.continuent.hedera.channel=DEBUG, Console, Filetrace
>   log4j.additivity.test.org.continuent.hedera.channel=false
> # Hedera adapters #
>   log4j.logger.org.continuent.hedera.adapters=DEBUG, Console, Filetrace
>   log4j.additivity.org.continuent.hedera.adapters=false
> # Hedera factories #
>   log4j.logger.org.continuent.hedera.factory=DEBUG, Console, Filetrace
>   log4j.additivity.org.continuent.hedera.factory=false
> # Hedera channels #
>   log4j.logger.org.continuent.hedera.channel=DEBUG, Console, Filetrace
>   log4j.additivity.org.continuent.hedera.channel=false
> # Hedera Group Membership Service #
>   log4j.logger.org.continuent.hedera.gms=DEBUG, Console, Filetrace
>   log4j.additivity.org.continuent.hedera.gms=false
> # JGroups
>   log4j.logger.org.jgroups=DEBUG, Console, Filetrace
>   log4j.additivity.org.jgroups=false
> # JGroups protocols
>   log4j.logger.org.jgroups.protocols=DEBUG, Console, Filetrace
>   log4j.additivity.org.jgroups.protocols=false
> ######################################
>
> I have the distributed logs for the same time-frame. Let me know if you need 
> them.
>
> No, the Hedera jars were not updated.
>
> Thanks,
> Seby.
> -----Original Message-----
> From: sequoia-boun...@lists.forge.continuent.org 
> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel 
> Cecchet
> Sent: Tuesday, May 04, 2010 6:20 AM
> To: Sequoia general mailing list
> Cc: sequoiadb-disc...@lists.sourceforge.net
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>
> When JGroups reported the MERGE messages in the log, did you have Hedera 
> DEBUG logs enabled too? If that is the case, the message was never 
> handled by Hedera, which is a problem. The new view should have been 
> installed anyway by the view synchrony layer, and Hedera should at least 
> catch that.
> Can you confirm that the Hedera logs are enabled?
> Could you also set the Distributed Virtual Database logs to DEBUG?
> Did you try to update Hedera to a newer version?
>
> Thanks
> Emmanuel
>
>   
>> Hi Emmanuel,
>>
>> Do you need more logs on this. Please let me know.
>>
>> Thanks,
>> Seby.
>>
>> -----Original Message-----
>> From: sequoia-boun...@lists.forge.continuent.org 
>> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Francis, 
>> Seby
>> Sent: Monday, March 29, 2010 1:51 PM
>> To: Sequoia general mailing list
>> Cc: sequoiadb-disc...@lists.sourceforge.net
>> Subject: Re: [Sequoia] Failure detection
>>
>> Hi Emmanuel,
>>
>> I've tried different JGroups configurations and now I can see in the logs that 
>> the groups are merging. But for some reason, Sequoia never shows that they 
>> have merged, i.e. when I run 'show controllers' on the console I see only that 
>> particular host. Below is a snippet from one of the hosts; I see similar 
>> output on the other host showing the merge. Let me know if you would like 
>> to see the debug logs for that time-frame.
>>
>> 2010-03-29 06:59:45,683 DEBUG jgroups.protocols.VERIFY_SUSPECT diff=1507, 
>> mbr 10.0.0.33:35974 is dead (passing up SUSPECT event)
>> 2010-03-29 06:59:45,687 DEBUG continuent.hedera.gms JGroups reported 
>> suspected member:10.0.0.33:35974
>> 2010-03-29 06:59:45,688 DEBUG continuent.hedera.gms 
>> Member(address=/10.0.0.33:35974, uid=db2) leaves Group(gid=db2).
>>
>> 2010-03-29 06:59:45,868 INFO  controller.requestmanager.cleanup Waiting 
>> 30000ms for client of controller 562949953421312 to failover
>> 2010-03-29 07:00:15,875 INFO  controller.requestmanager.cleanup Cleanup for 
>> controller 562949953421312 failure is completed.
>>
>> -----
>> 2010-03-29 07:03:14,725 DEBUG protocols.pbcast.GMS I (10.0.0.23:49731) will 
>> be the leader. Starting the merge task for [10.0.0.33:35974, 10.0.0.23:49731]
>> 2010-03-29 07:03:14,726 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 running 
>> merge task, coordinators are [10.0.0.33:35974, 10.0.0.23:49731]
>> 2010-03-29 07:03:14,730 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.0.0.23:49731 sending MERGE_REQ to [10.0.0.33:35974, 10.0.0.23:49731]
>> 2010-03-29 07:03:14,746 DEBUG jgroups.protocols.UDP sending msg to 
>> 10.0.0.23:49731, src=10.0.0.23:49731, headers are GMS: GmsHeader[MERGE_RSP]: 
>> view=[10.0.0.23:49731|2] [10.0.0.23:49731], digest=10.0.0.23:49731: [44 : 47 
>> (47)], merge_rejected=false, merge_id=[10.0.0.23:49731|1269860594727], 
>> UNICAST: [UNICAST: DATA, seqno=4], UDP: [channel_name=db2]
>> 2010-03-29 07:03:14,748 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 responded 
>> to 10.0.0.23:49731, merge_id=[10.0.0.23:49731|1269860594727]
>> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.0.0.23:49731 expects 2 responses, so far got 2 responses
>> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.0.0.23:49731 collected 2 merge response(s) in 36 ms
>> 2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.0.0.23:49731 computed new merged view that will be 
>> MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], 
>> subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] 
>> [10.0.0.33:35974]]
>> 2010-03-29 07:03:14,773 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 is 
>> sending merge view [10.0.0.23:49731|3] to coordinators [10.0.0.33:35974, 
>> 10.0.0.23:49731
>>
>> Seby.
>>
>> -----Original Message-----
>> From: sequoia-boun...@lists.forge.continuent.org 
>> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel 
>> Cecchet
>> Sent: Wednesday, March 24, 2010 10:41 AM
>> To: Sequoia general mailing list
>> Cc: sequoiadb-disc...@lists.sourceforge.net
>> Subject: Re: [Sequoia] Failure detection
>>
>> Hi Seby,
>>
>> Sorry for the late reply, I have been very busy these past days.
>> This seems to be a JGroups issue that could probably be better answered 
>> by Bela Ban on the JGroups mailing list. I have seen emails on the list 
>> these past days from people having similar problems.
>> I would recommend that you post an email on the JGroups mailing list 
>> with your JGroups configuration and the messages you see regarding the 
>> MERGE failing.
>>
>> Keep me posted
>> Emmanuel
>>
>>> Also, here is the error which I see from the logs:
>>>
>>> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 
>>> 10.10.10.23:39729 expects 2 responses, so far got 1 responses
>>> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 
>>> 10.10.10.23:39729 waiting 382 msecs for merge responses
>>> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 
>>> cancelling merge due to timer timeout (5000 ms)
>>> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge 
>>> (merge_id=[10.10.10.23:39729|1269261071286])
>>> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
>>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 
>>> 10.10.10.23:39729 expects 2 responses, so far got 0 responses
>>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 
>>> 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
>>> 2010-03-22 08:31:16,318 WARN  protocols.pbcast.GMS Merge aborted. Merge 
>>> leader did not get MergeData from all subgroup coordinators 
>>> [10.10.10.33:38822, 10.10.10.23:39729]
>>>
>>> -----Original Message-----
>>> From: Francis, Seby 
>>> Sent: Monday, March 22, 2010 1:03 PM
>>> To: 'Sequoia general mailing list'
>>> Cc: sequoiadb-disc...@lists.sourceforge.net
>>> Subject: RE: [Sequoia] Failure detection
>>>
>>> Hi Emmanuel,
>>>
>>> I've updated my JGroups to the version you mentioned, but I still see the 
>>> issue with merging the groups. One of the controllers loses track after 
>>> the failure and won't merge. Can you please give me a hand figuring out 
>>> where it goes wrong? I have the debug logs; shall I send them as a zip 
>>> file?
>>>
>>> Thanks,
>>> Seby.
>>>
>>> -----Original Message-----
>>> From: sequoia-boun...@lists.forge.continuent.org 
>>> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel 
>>> Cecchet
>>> Sent: Thursday, March 18, 2010 10:22 PM
>>> To: Sequoia general mailing list
>>> Cc: sequoiadb-disc...@lists.sourceforge.net
>>> Subject: Re: [Sequoia] Failure detection
>>>
>>> Hi Seby,
>>>
>>> I looked into the mailing list archive and this version of JGroups has a 
>>> number of significant bugs. An issue was filed 
>>> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it 
>>> for Sequoia 4. Just using a drop-in replacement of the JGroups core jar for 
>>> Sequoia 2.10.10 might work. You might have to update the Hedera jars as 
>>> well, but it could work with the old ones too.
>>>
>>> Let me know if the upgrade does not work
>>> Emmanuel
>>>
>>>> Thanks for your support!!
>>>>
>>>> I'm using jgroups-core.jar version 2.4.2, which came with 
>>>> "sequoia-2.10.10". My Solaris test servers have only a single interface, 
>>>> and I'm using the same IP for both group and db/client communications. I 
>>>> ran a test again after removing "*STATE_TRANSFER*" and attached the logs. 
>>>> At around 13:36 I took the host1 interface down, and I brought it back up 
>>>> around 13:38. After I reopened the interface and ran 'show controllers' 
>>>> on the console, host1 showed both controllers while host2 showed only 
>>>> its own name in the member list.
>>>>
>>>> Regards,
>>>>
>>>> Seby.
>>>>
>>>> -----Original Message-----
>>>> Hi Seby,
>>>>
>>>> Welcome to the wonderful world of group communications!
>>>>
>>>>> I've tried various FD options and could not get it working when one 
>>>>> of the hosts fails. I can see the message 'A leaving group' on live 
>>>>> controller B when I shut down the interface of A. This is working as 
>>>>> expected and the virtual db is still accessible/writable as 
>>>>> controller B is alive. But when I open the interface on A, 
>>>>> controller A shows (show controllers) that the virtual-db is hosted by 
>>>>> controllers A & B, while controller B just shows B. And the data 
>>>>> inserted into the vdb hosted by controller B is NOT being replayed on A. 
>>>>> This will cause inconsistencies in the data between the virtual-dbs. 
>>>>> Is there a way we can disable the backend if the network goes down, 
>>>>> so that I can recover the db using the backup?
>>>>
>>>> There is a problem with your group communication configuration if 
>>>> controllers have different views of the group. That should not happen.
>>>>
>>>>> I've also noticed that in some cases, if I take one of the host 
>>>>> interface down, both of them think that the other controller failed. 
>>>>> This will also create issues. In my case, I only have two controllers 
>>>>> hosted. Is it possible to ping a network gateway? That way the 
>>>>> controller knows that it is the one that failed and can disable its 
>>>>> backend.
>>>>
>>>> The best solution is to use the same interface for group communication 
>>>> and client/database communications. If you use a dedicated network for 
>>>> group communications and this network fails, you will end up with a 
>>>> network partition and this is very bad. If all communications go 
>>>> through the same interface, when it goes down, all communications are 
>>>> down and the controller will not be able to serve stale data.
>>>>
>>>> You don't need STATE_TRANSFER as Sequoia has its own state transfer 
>>>> protocol when a new member joins a group. Which version of JGroups are 
>>>> you using? Could you send me the logs with the JGroups messages that you 
>>>> see on each controller, by activating them in log4j.properties? I would 
>>>> need the initial sequence when you start the cluster, and the messages 
>>>> you see when the failure is detected and when the failed controller 
>>>> joins back. There might be a problem with the timeout settings of the 
>>>> different components of the stack.
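>>>> Something like the following in log4j.properties should turn those traces 
>>>> on (the Console and Filetrace appender names are only an example; use the 
>>>> appenders defined in your own file):
>>>>
>>>>   log4j.logger.org.jgroups=DEBUG, Console, Filetrace
>>>>   log4j.additivity.org.jgroups=false
>>>>   log4j.logger.org.jgroups.protocols=DEBUG, Console, Filetrace
>>>>   log4j.additivity.org.jgroups.protocols=false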
>>>>
>>>> Keep me posted with your findings
>>>>
>>>> Emmanuel
>>>>
>>>> ------------------------------------------------------------------------


-- 
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet

_______________________________________________
Sequoia mailing list
Sequoia@lists.forge.continuent.org
http://forge.continuent.org/mailman/listinfo/sequoia
