Hi Emmanuel,

I've tried various FD options and could not get it working when one of the hosts fails. I can see the message 'A leaving group' on the live controller B when I shut down the interface of A. This is working as expected, and the virtual-db is still accessible/writable as controller B is alive. But when I bring the interface on A back up, controller A shows (show controllers) that the virtual-db is hosted by controllers A & B, while controller B shows only B. And the data inserted into the vdb hosted by controller B is NOT being replayed on A. This will cause inconsistencies in the data between the virtual-dbs. Is there a way to disable the backend if the network goes down, so that I can recover the db using the backup?

I've also noticed that in some cases, if I take one of the host interfaces down, both controllers think that the other controller failed. This will also create issues. In my case, I only have two controllers hosted. Is it possible to ping a network gateway? That way the controller would know that it is the one which failed and could disable the backend. I've attached the config xml file which I'm using for the above failure test.

It would be nice if the failed controller could rejoin the group automatically or disable the backend on its own.

Thanks,
Seby.

----------------------------------------------------------------------------------------------------------------
Start: sequencer.xml file from hostA. The only diff in hostB is the bind_addr.
----------------------------------------------------------------------------------------------------------------
<config>
  <UDP bind_addr="A" mcast_port="45566" mcast_addr="228.8.8.9" tos="16"
       ucast_recv_buf_size="20000000" ucast_send_buf_size="640000"
       mcast_recv_buf_size="25000000" mcast_send_buf_size="640000"
       loopback="false" discard_incompatible_packets="true"
       max_bundle_size="64000" max_bundle_timeout="30"
       use_incoming_packet_handler="true" use_outgoing_packet_handler="false"
       ip_ttl="2" down_thread="false" up_thread="false" enable_bundling="true"/>
  <PING timeout="2000" down_thread="false" up_thread="false" num_initial_members="3"/>
  <MERGE2 max_interval="10000" down_thread="false" up_thread="false" min_interval="5000"/>
  <FD_SOCK down_thread="false" up_thread="false"/>
  <FD timeout="2500" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
  <VERIFY_SUSPECT timeout="1500" down_thread="false"/>
  <pbcast.NAKACK max_xmit_size="60000" use_mcast_xmit="false" gc_lag="0"
                 retransmit_timeout="100,200,300,600,1200,2400,4800"
                 down_thread="false" up_thread="false" discard_delivered_msgs="true"/>
  <UNICAST timeout="300,600,1200,2400,3600" down_thread="false" up_thread="false"/>
  <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                 down_thread="false" up_thread="false" max_bytes="400000"/>
  <VIEW_SYNC avg_send_interval="60000" down_thread="false" up_thread="false" />
  <pbcast.GMS print_local_addr="true" join_timeout="3000" down_thread="false" up_thread="false"
              join_retry_timeout="2000" shun="true" handle_concurrent_startup="true" />
  <SEQUENCER down_thread="false" up_thread="false" />
  <FC max_credits="2000000" down_thread="false" up_thread="false" min_threshold="0.10"/>
  <pbcast.STATE_TRANSFER down_thread="false" up_thread="false"/>
</config>
----------------------------------------------------------------------------------------------------------------
End: sequencer.xml
----------------------------------------------------------------------------------------------------------------

-----Original Message-----
From: sequoia-boun...@lists.forge.continuent.org [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel Cecchet
Sent: Tuesday, March 16, 2010 9:21 AM
To: Sequoia general mailing list
Cc: sequoiadb-disc...@lists.sourceforge.net
Subject: Re: [Sequoia] Failure detection

Hi Francis,

When a group communication network failure happens, if no write was pending at the time the failure was detected, the recovery is going to be automatic.
If a write was pending during the failure, there is no way to know if the other controller really performed that write properly, and it is considered as failed. You then have to start the controller recovery sequence (you can automate the script).
The procedure is described in section '8.1 Recover from a controller node failure' of the Sequoia 2.10 management guide (pdf can be found in the doc directory).

Hope this helps
Emmanuel

> Thank you Emmanuel! I've enabled the FD, FD_SOCK & VERIFY_SUSPECT in
> sequencer.xml and I could see the insert statements are now working.
>
> I do have another question.
> After the interface came back up, I see that the
> backend re-joins the group but it never replays the insert statements
> that occurred while it was down. Is it supposed to do that automatically? Is there a
> way to make it automated?
>
> Thanks,
> Seby.
>
> -----Original Message-----
> From: sequoia-boun...@lists.forge.continuent.org
> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel
> Cecchet
> Sent: Saturday, March 13, 2010 12:51 PM
> To: Sequoia general mailing list
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>
>> I set up sequoia in my lab with two controllers on two Solaris
>> hosts. Each controller has one postgres backend attached. These dbs are on
>> the Solaris servers themselves. The controllers use group communication
>> (jgroup) to sync updates/writes.
>>
>> For a failure test, I shut down the interface on one of the hosts, but
>> the other controller/host never detected this and my INSERT statements started
>> failing. There were no errors I could see in the controller log. As soon as I
>> bring the interface back up, I can see the requests being played and the data
>> getting inserted into the db.
>>
>> Could you please let me know how the jgroup detects the network
>> failures?
>>
> This depends on your JGroups configuration.
> If you are using a TCP based failure detector, the detection will depend
> on your operating system TCP settings. Otherwise you should be able to
> set up the timeout in your gossip server or udp-based failure detector.
>
> Hope this helps
> Emmanuel
>
>

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet

_______________________________________________
Sequoia mailing list
Sequoia@lists.forge.continuent.org
http://forge.continuent.org/mailman/listinfo/sequoia
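
The gateway check Seby asks about is not something this thread confirms Sequoia or JGroups provides. As a rough sketch only, an external watchdog running next to each controller could combine two probes to tell "my own link is down" from "the peer is down". Everything named here is an assumption for illustration: the class `GatewayCheck`, the `Verdict` enum, and the host names passed in by the caller. `InetAddress.isReachable` is standard Java, but it may fall back to a TCP echo probe when ICMP is not permitted, so results depend on the OS and on privileges.

```java
import java.io.IOException;
import java.net.InetAddress;

/** Hypothetical watchdog helper; not part of Sequoia or JGroups. */
class GatewayCheck {

    /** Conclusions a controller host could draw when it loses contact with its peer. */
    enum Verdict { HEALTHY, PEER_FAILED, SELF_ISOLATED }

    /**
     * Pure decision rule. If the peer answers, nothing is wrong. If the peer is
     * silent but the gateway answers, the peer is the likely failure. If both
     * are silent, this host is most likely the one cut off from the network.
     */
    static Verdict decide(boolean peerReachable, boolean gatewayReachable) {
        if (peerReachable) {
            return Verdict.HEALTHY;
        }
        return gatewayReachable ? Verdict.PEER_FAILED : Verdict.SELF_ISOLATED;
    }

    /** Probes one host; any I/O error (including an unknown host) counts as unreachable. */
    static boolean isReachable(String host, int timeoutMs) {
        try {
            return InetAddress.getByName(host).isReachable(timeoutMs);
        } catch (IOException e) {
            return false;
        }
    }

    /** Runs both probes and applies the decision rule. */
    static Verdict check(String peerHost, String gatewayHost, int timeoutMs) {
        return decide(isReachable(peerHost, timeoutMs),
                      isReachable(gatewayHost, timeoutMs));
    }
}
```

A script could call `GatewayCheck.check(peer, gateway, 2000)` periodically and, on `SELF_ISOLATED`, disable the local backend through whatever admin mechanism is in use before starting the recovery procedure Emmanuel points to. That wiring is left out because the thread does not describe it.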