Re: [Sequoia] Failure detection

Emmanuel Cecchet Wed, 24 Mar 2010 07:41:26 -0700

Hi Seby,

Sorry for the late reply; I have been very busy these past few days.
This looks like a JGroups issue that Bela Ban could probably answer 
better on the JGroups mailing list. Over the past few days I have seen 
emails on that list from people having similar problems.
I would recommend that you post to the JGroups mailing list with your 
JGroups configuration and the messages you see about MERGE failing.
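
Before posting, it can help to pull just the merge-related GMS records
out of the controller log. The following is only a minimal sketch in
Java, assuming each record sits on a single line in the format of the
log excerpt quoted below. The class name MergeLogFilter and its method
are made up for illustration and are not part of JGroups or Sequoia.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only GMS log records that mention a merge.
class MergeLogFilter {
    // Assumes the logger category "protocols.pbcast.GMS" appears in each record,
    // as in the excerpt quoted in this thread. The match is case-insensitive so
    // that both "Merge leader ..." and "cancelling merge ..." are kept.
    private static final Pattern MERGE_RECORD =
        Pattern.compile("protocols\\.pbcast\\.GMS\\b.*merge", Pattern.CASE_INSENSITIVE);

    static List<String> mergeLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (MERGE_RECORD.matcher(line).find()) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java MergeLogFilter <controller-log-file>");
            return;
        }
        // Read the whole log and print only the merge-related GMS records.
        for (String line : mergeLines(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(line);
        }
    }
}
```

Run it once per controller and include the output, together with the
JGroups configuration, in the email.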


Keep me posted
Emmanuel

> Also, here is the error which I see from the logs:
>
> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 
> 10.10.10.23:39729 expects 2 responses, so far got 1 responses
> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 
> 10.10.10.23:39729 waiting 382 msecs for merge responses
> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 
> cancelling merge due to timer timeout (5000 ms)
> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge 
> (merge_id=[10.10.10.23:39729|1269261071286])
> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 
> 10.10.10.23:39729 expects 2 responses, so far got 0 responses
> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 
> 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
> 2010-03-22 08:31:16,318 WARN  protocols.pbcast.GMS Merge aborted. Merge 
> leader did not get MergeData from all subgroup coordinators 
> [10.10.10.33:38822, 10.10.10.23:39729]
>
> -----Original Message-----
> From: Francis, Seby 
> Sent: Monday, March 22, 2010 1:03 PM
> To: 'Sequoia general mailing list'
> Cc: sequoiadb-disc...@lists.sourceforge.net
> Subject: RE: [Sequoia] Failure detection
>
> Hi Emmanuel,
>
> I've updated my JGroups to the version you mentioned, but I still see 
> the issue with merging the groups. One of the controllers lost track after 
> the failure and won't merge. Can you please give me a hand figuring out 
> where it goes wrong? I have the debug logs. Shall I send them as a zip 
> file?
>
> Thanks,
> Seby.
>
> -----Original Message-----
> From: sequoia-boun...@lists.forge.continuent.org 
> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel 
> Cecchet
> Sent: Thursday, March 18, 2010 10:22 PM
> To: Sequoia general mailing list
> Cc: sequoiadb-disc...@lists.sourceforge.net
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>
> I looked into the mailing list archive, and this version of JGroups has a 
> number of significant bugs. An issue was filed 
> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it 
> for Sequoia 4. Simply using a drop-in replacement for the JGroups core 
> jar in Sequoia 2.10.10 might work. You might have to update the Hedera 
> jars as well, but the old ones could work too.
>
> Let me know if the upgrade does not work
> Emmanuel
>
>   
>> Thanks for your support!!
>>
>> I'm using jgroups-core.jar version 2.4.2, which came with 
>> "sequoia-2.10.10". My Solaris test servers have only a single interface, 
>> and I'm using the same IP for both group and db/client communications. I 
>> ran the test again after removing "*STATE_TRANSFER*" and attached the logs. 
>> At around 13:36, I took the host1 interface down, and I brought it back up 
>> around 13:38. After I brought the interface up, I ran "show controllers" 
>> on the console: host1 showed both controllers, while host2 showed only 
>> its own name in the member list.
>>
>> Regards,
>>
>> Seby.
>>
>> -----Original Message-----
>> Hi Seby,
>>
>> Welcome to the wonderful world of group communications!
>>
>>     
>>> I've tried various FD options and could not get it working when one 
>>> of the hosts fails. I can see the message 'A leaving group' on live 
>>> controller B when I shut down the interface of A. This is working as 
>>> expected, and the virtual db is still accessible/writable as 
>>> controller B is alive. But when I bring the interface on A back up, 
>>> controller A shows (show controllers) that the virtual db is hosted by 
>>> controllers A & B, while controller B just shows B. And the data 
>>> inserted into the vdb hosted by controller B is NOT being played on A. 
>>> This will cause inconsistencies in the data between the virtual dbs. 
>>> Is there a way we can disable the backend if the network goes down, 
>>> so that I can recover the db using the backup?
>>
>>     
>> There is a problem with your group communication configuration if 
>> controllers have different views of the group. That should not happen.
>>
>>     
>>> I've also noticed that in some cases, if I take one of the host 
>>> interfaces down, both of them think that the other controller failed. 
>>> This will also create issues. In my case, I only have two controllers 
>>> hosted. Is it possible to ping a network gateway? That way the 
>>> controller would know that it is the one which failed and could 
>>> disable the backend.
>>
>>     
>> The best solution is to use the same interface for group communication 
>> and client/database communications. If you use a dedicated network for 
>> group communications and this network fails, you will end up with a 
>> network partition and this is very bad. If all communications go 
>> through the same interface, when it goes down, all communications are 
>> down and the controller will not be able to serve stale data.
>>
>> You don't need STATE_TRANSFER, as Sequoia has its own state transfer 
>> protocol for when a new member joins a group. Which version of JGroups 
>> are you using? Could you send me the log with the JGroups messages that 
>> you see on each controller, after activating them in log4j.properties? I 
>> would need the initial sequence when you start the cluster and the 
>> messages you see when the failure is detected and when the failed 
>> controller joins back. There might be a problem with the timeout 
>> settings of the different components of the stack.
>>
>> Keep me posted with your findings
>>
>> Emmanuel
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Sequoia mailing list
>> Sequoia@lists.forge.continuent.org
>> http://forge.continuent.org/mailman/listinfo/sequoia
>>     
>
>
>   


-- 
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet