Re: [Sequoia] controller hangs on broken network

Emmanuel Cecchet Sun, 02 Sep 2007 05:46:53 -0700

Hi Don,

What you are describing are just TCP timeouts. When you unplug your cable, all processes have to wait for the kernel to time out on TCP connections. You might want to tune those TCP timeout settings (I don't know how to do that in Windows, but there are probably many resources on the web for that).

Note that the group communication uses a UDP-based heartbeat and therefore does not suffer from the TCP timeout problem.
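
As a rough illustration of the TCP timeout point (this is not Sequoia code, and it says nothing about how the controller or the driver is implemented): a program that sets its own timeout on a socket gets control back when that timeout expires, instead of waiting on the kernel's TCP timers. The sketch below uses only the Python standard library. It assumes a local server that accepts a connection and then stays silent, as a stand-in for a peer that has become unreachable. The names start_silent_server and read_with_timeout are made up for this example.

```python
import socket
import threading
import time


def start_silent_server():
    """Listen on an ephemeral localhost port, accept one client, send nothing."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    held = []  # keeps the accepted connection open for the whole demo

    def accept_and_hold():
        conn, _ = srv.accept()
        held.append(conn)

    threading.Thread(target=accept_and_hold, daemon=True).start()
    return srv, held


def read_with_timeout(host, port, timeout_s):
    """Return the bytes received, or None if the peer is silent for timeout_s."""
    # The timeout passed here applies to the connect and to later recv calls.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None


srv, _held = start_silent_server()
port = srv.getsockname()[1]

started = time.monotonic()
result = read_with_timeout("127.0.0.1", port, 0.5)
elapsed = time.monotonic() - started

print("received:", result)
print("gave up after %.2f s" % elapsed)
```

Without the explicit timeout, the recv call would block for as long as the operating system keeps the connection alive, which is the kind of wait described above.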


Hope this helps,
Emmanuel

Sequoia appears not to handle network failure gracefully. My configuration:
- two MS Windows servers, A (10.0.0.61) and B (10.0.0.60).
- JBoss 4.0.5
- Sequoia 2.10.9 using Appia default configuration.
- MySql 5.0.41
- Server A running JBoss, Sequoia controller, MySql backend.
- Server B running Sequoia controller, MySql backend.
- Controller A and B are (the only) members of a cluster called "mySequoia", as confirmed on each machine using "show controllers".
- JBoss is configured to use only controller A, via "<connection-url>jdbc:sequoia://A/mySequoia</connection-url>".
- B's backend is disabled.
Everything works fine under load, with JBoss happily hitting controller A, which in turn updates the database backend on A. Then I unplug the ethernet cable on server B, and everything hangs. JBoss stops, controller A stops, logging nothing. Controller B logs a warning that controller A has left the cluster. I wait for five minutes, nothing happens except a transaction timeout on the JBoss server. After ten minutes I plug the ethernet back in, and controller A logs this:
14:21:05,390 INFO continuent.hedera.gms Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) failed in Group(gid=mySequoia)
14:21:05,390 WARN controller.virtualdatabase.mySequoia Controller Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) has left the cluster.
14:21:05,390 INFO controller.virtualdatabase.mySequoia 1 requests were waiting responses from Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573)
14:21:05,390 WARN controller.RequestManager.mySequoia 1 controller(s) died during execution of request 844424930133025
14:21:05,390 WARN controller.RequestManager.mySequoia Controller Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) is suspected of failure.
14:21:06,906 INFO controller.requestmanager.cleanup Waiting 120000ms for client of controller 281474976710656 to failover
14:23:06,906 INFO controller.requestmanager.cleanup Cleanup for controller 281474976710656 failure is completed.
and comes back to life, as does JBoss. (However, the cluster remains broken -- neither controller sees the other any more.)
Originally I saw this problem with both controllers active and enabled, with JBoss configured to round-robin them. I suspected a cluster communications bug, so I simplified the deployment to this single active controller to see what would happen.

What's going on here? Is it a controller bug, an Appia bug, maybe a misconfiguration, or what?


