Hi Emmanuel,

        I've tried various FD options and could not get failover working when 
one of the hosts fails. I can see the message 'A leaving group' on the live 
controller B when I shut down the interface on A. This works as expected, and 
the virtual db is still accessible/writable because controller B is alive. But 
when I bring the interface on A back up, controller A reports (via 'show 
controllers') that the virtual db is hosted by controllers A & B, while 
controller B shows only B. Also, the data inserted into the vdb hosted by 
controller B is NOT replayed on A. This will cause inconsistencies between the 
virtual dbs. Is there a way to disable the backend when the network goes down, 
so that I can recover the db from the backup?

        I've also noticed that in some cases, if I take one host's interface 
down, both controllers think that the other controller has failed. This will 
also create issues. In my case I only have two controllers. Would it be 
possible to ping a network gateway? That way a controller would know that it 
is the one that failed and could disable its own backend.
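
A watchdog along those lines can be scripted outside of Sequoia. The sketch 
below is purely hypothetical: the backend-disable step is just a placeholder 
comment, and the ping flags shown are GNU-style (Solaris ping takes 
'ping host timeout' instead). The idea is simply that if the gateway is 
unreachable, this host assumes it is the partitioned one:

```shell
#!/bin/sh
# Hypothetical watchdog sketch (not part of Sequoia): if this host cannot
# reach its network gateway, assume *this* controller is the partitioned one
# and disable the local backend instead of suspecting the peer.

check_gateway() {
    # $1 = gateway address; succeeds if one ping answers within 2 seconds.
    # Flags are GNU ping; adjust for Solaris ("ping $1 2").
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

watchdog() {
    if check_gateway "$1"; then
        echo "gateway reachable: stay in group"
    else
        echo "gateway unreachable: disable local backend"
        # placeholder: invoke your controller admin script here; the exact
        # Sequoia console command depends on your version
    fi
}
```

Run it from cron (or a loop) on each controller host; only the node that 
actually lost its network takes the disable branch.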

I've attached the config XML file I'm using for the failure test above. It 
would be nice if the failed controller could rejoin the group automatically, 
or at least disable its backend on its own.

Thanks,
Seby.
----------------------------------------------------------------------------------------------------------------
Start: sequencer.xml file from hostA. The only diff in hostB is the bind_addr.
----------------------------------------------------------------------------------------------------------------
<config>
    <UDP bind_addr="A"
         mcast_port="45566" 
         mcast_addr="228.8.8.9"
         tos="16"
         ucast_recv_buf_size="20000000"
         ucast_send_buf_size="640000"
         mcast_recv_buf_size="25000000" 
         mcast_send_buf_size="640000" 
         loopback="false"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         use_incoming_packet_handler="true" 
         use_outgoing_packet_handler="false" 
         ip_ttl="2" 
         down_thread="false" up_thread="false"
         enable_bundling="true"/>
    <PING timeout="2000"
          down_thread="false" up_thread="false" num_initial_members="3"/>
    <MERGE2 max_interval="10000"
            down_thread="false" up_thread="false" min_interval="5000"/>
    <FD_SOCK down_thread="false" up_thread="false"/>
    <FD timeout="2500" max_tries="5" down_thread="false" up_thread="false"
        shun="true"/>
    <VERIFY_SUSPECT timeout="1500" down_thread="false"/>
    <pbcast.NAKACK max_xmit_size="60000"
                   use_mcast_xmit="false" gc_lag="0"
                   retransmit_timeout="100,200,300,600,1200,2400,4800"
                   down_thread="false" up_thread="false"
                   discard_delivered_msgs="true"/>
    <UNICAST timeout="300,600,1200,2400,3600"
             down_thread="false" up_thread="false"/>
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" 
                   down_thread="false" up_thread="false"
                   max_bytes="400000"/>
    <VIEW_SYNC avg_send_interval="60000" down_thread="false" up_thread="false"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                down_thread="false" up_thread="false"
                join_retry_timeout="2000" shun="true"
                handle_concurrent_startup="true"/>
    <SEQUENCER down_thread="false" up_thread="false" />
    <FC max_credits="2000000" down_thread="false" up_thread="false"
        min_threshold="0.10"/>
    <pbcast.STATE_TRANSFER down_thread="false" up_thread="false"/>
</config>
----------------------------------------------------------------------------------------------------------------
End: sequencer.xml
----------------------------------------------------------------------------------------------------------------
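
For what it's worth, the protocols already present in the file above are the 
relevant ones: FD_SOCK catches a crashed peer, FD (backed by VERIFY_SUSPECT) 
catches a hung one, and MERGE2 is what should heal the asymmetric A&B-vs-B 
view once the interface comes back. An annotated fragment, with illustrative 
values only (not tuning recommendations):

```xml
<!-- discovery of other members -->
<PING timeout="2000" num_initial_members="3"
      down_thread="false" up_thread="false"/>
<!-- re-merges subgroups after a partition heals; without a successful
     merge the two controllers keep their diverging views -->
<MERGE2 min_interval="5000" max_interval="10000"
        down_thread="false" up_thread="false"/>
<!-- TCP-socket-based detection: fires as soon as the peer's socket dies -->
<FD_SOCK down_thread="false" up_thread="false"/>
<!-- heartbeat-based detection: suspects a peer after timeout*max_tries ms -->
<FD timeout="2500" max_tries="5" shun="true"
    down_thread="false" up_thread="false"/>
<!-- double-checks a suspicion before excluding the member -->
<VERIFY_SUSPECT timeout="1500" down_thread="false"/>
```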


-----Original Message-----
From: sequoia-boun...@lists.forge.continuent.org 
[mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel 
Cecchet
Sent: Tuesday, March 16, 2010 9:21 AM
To: Sequoia general mailing list
Cc: sequoiadb-disc...@lists.sourceforge.net
Subject: Re: [Sequoia] Failure detection

Hi Francis,

When a group communication network failure happens, if no write was 
pending at the time the failure was detected, the recovery is 
automatic. If a write was pending during the failure, there is no way 
to know whether the other controller actually performed that write, so 
it is considered failed. You then have to run the controller 
recovery sequence (the script can be automated). The procedure is 
described in section '8.1 Recover from a controller node failure' of the 
Sequoia 2.10 management guide (the PDF can be found in the doc directory).
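
If it helps, that recovery sequence can be wrapped in a script. The skeleton 
below is only a placeholder outline: none of the echo lines are real Sequoia 
console commands, so substitute the actual steps from section 8.1 of the 
management guide:

```shell
#!/bin/sh
# Hypothetical outline of an automated controller recovery; every echo below
# is a stand-in for the corresponding console command from section 8.1.
recover_controller() {
    vdb="$1"; backend="$2"
    echo "step 1: disable failed backend '$backend' of '$vdb'"
    echo "step 2: restore '$backend' from the most recent dump"
    echo "step 3: re-enable '$backend' so missed writes are replayed"
}
```

It would be invoked per backend, e.g. `recover_controller mydb postgres_A`, 
once the failed controller is back up.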

Hope this helps
Emmanuel

> Thank you Emmanuel! I've enabled FD, FD_SOCK & VERIFY_SUSPECT in 
> sequencer.xml and I can see that the insert statements are now working. 
>
> I do have another question. After the interface came back up, I saw that the 
> backend rejoined the group, but it never replayed the insert statements that 
> occurred while it was down. Is that supposed to happen automatically? If 
> not, is there a way to automate it? 
>
> Thanks,
> Seby.
>
> -----Original Message-----
> From: sequoia-boun...@lists.forge.continuent.org 
> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel 
> Cecchet
> Sent: Saturday, March 13, 2010 12:51 PM
> To: Sequoia general mailing list
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>   
>>         I set up Sequoia in my lab with two controllers on two Solaris 
>> hosts. Each controller has one Postgres backend attached. These dbs are on 
>> the Solaris servers themselves. The controllers use group communication 
>> (JGroups) to sync updates/writes. 
>>  
>>         For a failure test, I shut down the interface on one of the hosts, 
>> but the other controller/host never detected this and my INSERT statements 
>> started failing. There were no errors in the controller log. As soon as I 
>> opened the interface again, I could see the requests being replayed and the 
>> data getting inserted into the db.
>>  
>>         Could you please let me know how JGroups detects network failures?
>>     
> This depends on your JGroups configuration.
> If you are using a TCP-based failure detector, the detection will depend 
> on your operating system's TCP settings. Otherwise you should be able to 
> set up the timeout in your gossip server or UDP-based failure detector.
>
> Hope this helps
> Emmanuel
>
>   


-- 
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet

_______________________________________________
Sequoia mailing list
Sequoia@lists.forge.continuent.org
http://forge.continuent.org/mailman/listinfo/sequoia