Re: [Sequoia] Failure detection

Emmanuel Cecchet Fri, 30 Apr 2010 20:35:09 -0700

Hi Francis,
> Are you back?
>   
Not yet, I am in Sydney right now and I am flying back home (it should 
take about 24 hours). I should be back online on Monday.


Talk to you soon
Emmanuel
> Seby.
> -----Original Message-----
> From: Francis, Seby 
> Sent: Monday, April 05, 2010 11:26 PM
> To: Sequoia general mailing list
> Cc: [email protected]
> Subject: RE: [Sequoia] Failure detection
>
> Hi Emmanuel,
>
> Do you need more logs on this? Please let me know.
>
> Thanks,
> Seby.
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Francis, Seby
> Sent: Monday, March 29, 2010 1:51 PM
> To: Sequoia general mailing list
> Cc: [email protected]
> Subject: Re: [Sequoia] Failure detection
>
> Hi Emmanuel,
>
> I've tried different JGroups configurations, and now I can see in the logs 
> that the groups are merging. But for some reason, Sequoia never shows that 
> it has merged; i.e., when I run 'show controllers' on the console I see only 
> that particular host. Below is a snippet from one of the hosts. I see a 
> similar log on the other host showing the merge. Let me know if you would 
> like to see the debug logs for that time frame.
>
> 2010-03-29 06:59:45,683 DEBUG jgroups.protocols.VERIFY_SUSPECT diff=1507, mbr 
> 10.0.0.33:35974 is dead (passing up SUSPECT event)
> 2010-03-29 06:59:45,687 DEBUG continuent.hedera.gms JGroups reported 
> suspected member:10.0.0.33:35974
> 2010-03-29 06:59:45,688 DEBUG continuent.hedera.gms 
> Member(address=/10.0.0.33:35974, uid=db2) leaves Group(gid=db2).
>
> 2010-03-29 06:59:45,868 INFO  controller.requestmanager.cleanup Waiting 
> 30000ms for client of controller 562949953421312 to failover
> 2010-03-29 07:00:15,875 INFO  controller.requestmanager.cleanup Cleanup for 
> controller 562949953421312 failure is completed.
>
> -----
> 2010-03-29 07:03:14,725 DEBUG protocols.pbcast.GMS I (10.0.0.23:49731) will 
> be the leader. Starting the merge task for [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,726 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 running 
> merge task, coordinators are [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,730 DEBUG protocols.pbcast.GMS Merge leader 
> 10.0.0.23:49731 sending MERGE_REQ to [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,746 DEBUG jgroups.protocols.UDP sending msg to 
> 10.0.0.23:49731, src=10.0.0.23:49731, headers are GMS: GmsHeader[MERGE_RSP]: 
> view=[10.0.0.23:49731|2] [10.0.0.23:49731], digest=10.0.0.23:49731: [44 : 47 
> (47)], merge_rejected=false, merge_id=[10.0.0.23:49731|1269860594727], 
> UNICAST: [UNICAST: DATA, seqno=4], UDP: [channel_name=db2]
> 2010-03-29 07:03:14,748 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 responded 
> to 10.0.0.23:49731, merge_id=[10.0.0.23:49731|1269860594727]
> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 
> 10.0.0.23:49731 expects 2 responses, so far got 2 responses
> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 
> 10.0.0.23:49731 collected 2 merge response(s) in 36 ms
> 2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader 
> 10.0.0.23:49731 computed new merged view that will be 
> MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], 
> subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] 
> [10.0.0.33:35974]]
> 2010-03-29 07:03:14,773 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 is sending 
> merge view [10.0.0.23:49731|3] to coordinators [10.0.0.33:35974, 
> 10.0.0.23:49731
>
> Seby.
>
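A minimal sketch, assuming the wrapped log lines above have been joined back into single lines and the leading "> " quote marks removed: the merged view announced by the GMS merge leader can be extracted from the debug log with a regular expression. The name `parse_merge_view` and the pattern are illustrative only; they are not part of JGroups or Sequoia.

```python
import re

# Matches the view announced by the merge leader, for example:
# "MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974]"
MERGE_VIEW = re.compile(
    r"MergeView::\[(?P<creator>[^|\]]+)\|(?P<view_id>\d+)\]\s*\[(?P<members>[^\]]*)\]"
)

# Log line taken from the snippet above, unwrapped by hand.
SAMPLE = (
    "2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader "
    "10.0.0.23:49731 computed new merged view that will be "
    "MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], "
    "subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] "
    "[10.0.0.33:35974]]"
)


def parse_merge_view(line):
    """Return (creator, view_id, members), or None if the line has no merge view."""
    match = MERGE_VIEW.search(line)
    if match is None:
        return None
    members = [m.strip() for m in match.group("members").split(",") if m.strip()]
    return match.group("creator"), int(match.group("view_id")), members


if __name__ == "__main__":
    print(parse_merge_view(SAMPLE))
```

If the member list extracted this way contains both addresses while 'show controllers' still lists a single host, that would suggest the merge completed at the JGroups level and the discrepancy lies in a layer above it.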
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Emmanuel 
> Cecchet
> Sent: Wednesday, March 24, 2010 10:41 AM
> To: Sequoia general mailing list
> Cc: [email protected]
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>
> Sorry for the late reply, I have been very busy these past days.
> This seems to be a JGroups issue that could probably be better answered 
> by Bela Ban on the JGroups mailing list. Over the past few days I have 
> seen emails on the list from people having similar problems.
> I would recommend that you post an email on the JGroups mailing list 
> with your JGroups configuration and the messages you see regarding MERGE 
> failing.
>
> Keep me posted
> Emmanuel
>
>   
>> Also, here is the error which I see from the logs:
>>
>> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.10.10.23:39729 expects 2 responses, so far got 1 responses
>> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.10.10.23:39729 waiting 382 msecs for merge responses
>> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 
>> cancelling merge due to timer timeout (5000 ms)
>> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge 
>> (merge_id=[10.10.10.23:39729|1269261071286])
>> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.10.10.23:39729 expects 2 responses, so far got 0 responses
>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 
>> 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
>> 2010-03-22 08:31:16,318 WARN  protocols.pbcast.GMS Merge aborted. Merge 
>> leader did not get MergeData from all subgroup coordinators 
>> [10.10.10.33:38822, 10.10.10.23:39729]
>>
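As a rough sketch, assuming the log lines above have been unwrapped and stripped of their ">> " quote marks, a failed merge attempt can be summarized by counting the responses the merge leader reports and checking for the "Merge aborted" warning. The function name `summarize_merge_attempt` is hypothetical; only the message texts come from the log.

```python
import re

# "Merge leader ... expects 2 responses, so far got 1 responses"
RESPONSES = re.compile(r"expects (\d+) responses, so far got (\d+) responses")
ABORTED = "Merge aborted"

# Lines taken from the log above, unwrapped by hand.
SAMPLE_LINES = [
    "2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader "
    "10.10.10.23:39729 expects 2 responses, so far got 1 responses",
    "2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader "
    "10.10.10.23:39729 expects 2 responses, so far got 0 responses",
    "2010-03-22 08:31:16,318 WARN  protocols.pbcast.GMS Merge aborted. Merge "
    "leader did not get MergeData from all subgroup coordinators "
    "[10.10.10.33:38822, 10.10.10.23:39729]",
]


def summarize_merge_attempt(lines):
    """Return (expected, most_received, aborted) for one merge attempt."""
    expected = 0
    most_received = 0
    aborted = False
    for line in lines:
        match = RESPONSES.search(line)
        if match:
            expected = max(expected, int(match.group(1)))
            # Keep the highest count seen; the count is reported as 0 again
            # once the merge has been cancelled.
            most_received = max(most_received, int(match.group(2)))
        if ABORTED in line:
            aborted = True
    return expected, most_received, aborted


if __name__ == "__main__":
    print(summarize_merge_attempt(SAMPLE_LINES))
```

Here the leader expected two responses and collected at most one before the 5000 ms timer cancelled the merge, which matches the warning that it did not get MergeData from all subgroup coordinators.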
>> -----Original Message-----
>> From: Francis, Seby 
>> Sent: Monday, March 22, 2010 1:03 PM
>> To: 'Sequoia general mailing list'
>> Cc: [email protected]
>> Subject: RE: [Sequoia] Failure detection
>>
>> Hi Emmanuel,
>>
>> I've updated my JGroups to the version you mentioned, but I still see the 
>> issue with merging the groups. One of the controllers lost track after the 
>> failure and won't merge. Can you please give me a hand figuring out where 
>> it goes wrong? I have the debug logs. Shall I send them as a zip file?
>>
>> Thanks,
>> Seby.
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Emmanuel 
>> Cecchet
>> Sent: Thursday, March 18, 2010 10:22 PM
>> To: Sequoia general mailing list
>> Cc: [email protected]
>> Subject: Re: [Sequoia] Failure detection
>>
>> Hi Seby,
>>
>> I looked into the mailing list archive and this version of JGroups has a 
>> number of significant bugs. An issue was filed 
>> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it 
>> for Sequoia 4. Just using a drop-in replacement for JGroups core in 
>> Sequoia 2.10.10 might work. You might have to update the Hedera jars as 
>> well, but it could work with the old ones too.
>>
>> Let me know if the upgrade does not work
>> Emmanuel
>>
>>   
>>     
>>> Thanks for your support!!
>>>
>>> I'm using jgroups-core.jar version 2.4.2, which came with 
>>> "sequoia-2.10.10". My Solaris test servers have only a single interface 
>>> and I'm using the same IP for both group and db/client communications. I 
>>> ran a test again removing "*STATE_TRANSFER*" and attached the logs. At 
>>> around 13:36, I took the host1 interface down and brought it back up 
>>> around 13:38. After that, when I ran 'show controllers' on the console, 
>>> host1 showed both controllers while host2 showed only its own name in 
>>> the member list.
>>>
>>> Regards,
>>>
>>> Seby.
>>>
>>> -----Original Message-----
>>> Hi Seby,
>>>
>>> Welcome to the wonderful world of group communications!
>>>
>>>> I've tried various FD options and could not get it working when one of
>>>> the hosts fails. I can see the message 'A leaving group' on live
>>>> controller B when I shut down the interface of A. This is working as
>>>> expected, and the virtual db is still accessible/writable as controller
>>>> B is alive. But when I open the interface on A, controller A shows
>>>> (show controllers) that the virtual db is hosted by controllers A & B,
>>>> while controller B just shows B. And the data inserted into the vdb
>>>> hosted by controller B is NOT being played on A. This will cause
>>>> inconsistencies in the data between the virtual dbs. Is there a way we
>>>> can disable the backend if the network goes down, so that I can recover
>>>> the db using the backup?
>>>
>>> There is a problem with your group communication configuration if 
>>> controllers have different views of the group. That should not happen.
>>>
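To make the "different views" symptom concrete, here is a small sketch that compares the member lists each controller reports. It assumes the lists were collected by hand, for example by running 'show controllers' on each console; `find_view_disagreements` is a hypothetical helper, not a Sequoia or JGroups API.

```python
def find_view_disagreements(views):
    """Map each controller with an incomplete view to the members it is missing.

    `views` maps a controller name to the set of members that controller
    reports. A controller appears in the result only if its view differs
    from the union of all reported members.
    """
    all_members = set().union(*views.values())
    return {
        name: all_members - members
        for name, members in views.items()
        if members != all_members
    }


if __name__ == "__main__":
    # The situation described in this thread: A sees A and B, B sees only B.
    reported = {"A": {"A", "B"}, "B": {"B"}}
    print(find_view_disagreements(reported))
```

An empty result means every controller reports the same membership. Any non-empty result points to the kind of group communication problem described above.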
>>>> I've also noticed that in some cases, if I take one of the host
>>>> interfaces down, both controllers think that the other one failed.
>>>> This will also create issues. In my case, I only have two controllers
>>>> hosted. Is it possible to ping a network gateway? That way the
>>>> controller would know that it is the one that failed and could disable
>>>> the backend.
>>>
>>> The best solution is to use the same interface for group communication 
>>> and client/database communications. If you use a dedicated network for 
>>> group communications and this network fails, you will end up with a 
>>> network partition and this is very bad. If all communications go 
>>> through the same interface, when it goes down, all communications are 
>>> down and the controller will not be able to serve stale data.
>>>
>>> You don't need STATE_TRANSFER, as Sequoia has its own state transfer 
>>> protocol for when a new member joins a group. Which version of JGroups 
>>> are you using? Could you activate JGroups messages in log4j.properties 
>>> and send me the log you see on each controller? I would need the initial 
>>> sequence when you start the cluster and the messages you see when the 
>>> failure is detected and when the failed controller joins back. There 
>>> might be a problem with the timeout settings of the different components 
>>> of the stack.
>>>
>>> Keep me posted with your findings
>>>
>>> Emmanuel
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Sequoia mailing list
>>> [email protected]
>>> http://forge.continuent.org/mailman/listinfo/sequoia


-- 
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: [email protected]
Skype: emmanuel_cecchet

