Right you are, Emmanuel, it was a TCP timeout thing. However, it's not a Windows issue, it's a JVM issue. The TCP timeout is a JVM setting and the default is infinity. When I set it to ten seconds (by adding -Dsun.net.client.defaultConnectTimeout=10000 -Dsun.net.client.defaultReadTimeout=10000 to bin/controller.bat) the problem was fixed -- after ten seconds Sequoia recovered and resumed transactions.
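
For anyone who cannot edit the launch script, the same two properties can probably be set from code instead. This is only a minimal sketch, not anything taken from Sequoia: the class name `JvmNetworkTimeouts` and the method `applyDefaults` are made up for illustration. As far as I know, the JDK documents these properties for the protocol handlers behind java.net.URLConnection, and they are read when the networking classes initialise, so the call would have to run very early in startup. Values passed with -D on the command line are left alone.

```java
// Hypothetical helper, not part of Sequoia or the JDK.
class JvmNetworkTimeouts {
    static final String CONNECT_KEY = "sun.net.client.defaultConnectTimeout";
    static final String READ_KEY = "sun.net.client.defaultReadTimeout";

    /** Sets both timeouts (in milliseconds) unless they were already supplied, e.g. via -D flags. */
    static void applyDefaults(int millis) {
        String value = Integer.toString(millis);
        if (System.getProperty(CONNECT_KEY) == null) {
            System.setProperty(CONNECT_KEY, value);
        }
        if (System.getProperty(READ_KEY) == null) {
            System.setProperty(READ_KEY, value);
        }
    }

    public static void main(String[] args) {
        // Same ten-second value as the -D flags quoted above.
        applyDefaults(10000);
        System.out.println(CONNECT_KEY + "=" + System.getProperty(CONNECT_KEY));
        System.out.println(READ_KEY + "=" + System.getProperty(READ_KEY));
    }
}
```

The -D flags in bin/controller.bat remain the simpler route, since they are guaranteed to be in place before any class loads.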

It would be helpful if bin/controller.bat set these timeouts by default -- the current default behaviour (to hang Sequoia forever) is not good.

One question though... if group communication uses UDP and is not susceptible to the TCP timeout problem, then why did controller A hang when I disconnected controller B? Controller A's backend database is local.

Emmanuel Cecchet wrote:
Hi Don,

What you are describing are just TCP timeouts. When you unplug your cable, all processes have to wait for the kernel to time out on TCP connections. You might want to tune those TCP timeout settings (I don't know how to do that in Windows but there are probably many resources on the web for that). Note that the group communication uses UDP-based heartbeat and therefore does not suffer the TCP timeout problem.
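
In Java terms, the waiting described above looks like a blocking socket read: with no timeout configured the read waits indefinitely, while a socket-level SO_TIMEOUT bounds it. The following is a rough sketch only, using a loopback server that never sends anything to stand in for an unreachable peer. `ReadTimeoutDemo` and `readTimesOut` are hypothetical names, and nothing here is Sequoia code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of Sequoia.
class ReadTimeoutDemo {
    /**
     * Connects to a local server that never sends data and attempts one read.
     * Returns true if the read gave up after timeoutMillis.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {
            // A value of 0 (the default) would make the read below block forever.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or end-of-stream arrived, so no timeout
            } catch (SocketTimeoutException expected) {
                return true;  // the read was abandoned after timeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The timeout here is set per socket by the application; kernel-level TCP tuning is a separate mechanism and is not shown.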

Hope this helps,
Emmanuel

Sequoia appears not to handle network failure gracefully. My configuration:
- two MS Windows servers, A (10.0.0.61) and B (10.0.0.60).
- JBoss 4.0.5
- Sequoia 2.10.9 using Appia default configuration.
- MySql 5.0.41
- Server A running JBoss, Sequoia controller, MySql backend.
- Server B running Sequoia controller, MySql backend.
- Controllers A and B are (the only) members of a cluster called "mySequoia", as confirmed on each machine using "show controllers".
- JBoss is configured to use only controller A, via "<connection-url>jdbc:sequoia://A/mySequoia</connection-url>".
- B's backend is disabled.

Everything works fine under load, with JBoss happily hitting controller A, which in turn updates the database backend on A. Then I unplug the ethernet cable on server B, and everything hangs. JBoss stops, controller A stops, logging nothing. Controller B logs a warning that controller A has left the cluster. I wait for five minutes, nothing happens except a transaction timeout on the JBoss server. After ten minutes I plug the ethernet back in, and controller A logs this:

14:21:05,390 INFO continuent.hedera.gms Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) failed in Group(gid=mySequoia)
14:21:05,390 WARN controller.virtualdatabase.mySequoia Controller Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) has left the cluster.
14:21:05,390 INFO controller.virtualdatabase.mySequoia 1 requests were waiting responses from Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573)
14:21:05,390 WARN controller.RequestManager.mySequoia 1 controller(s) died during execution of request 844424930133025
14:21:05,390 WARN controller.RequestManager.mySequoia Controller Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) is suspected of failure.
14:21:06,906 INFO controller.requestmanager.cleanup Waiting 120000ms for client of controller 281474976710656 to failover
14:23:06,906 INFO controller.requestmanager.cleanup Cleanup for controller 281474976710656 failure is completed.

and comes back to life, as does JBoss. (However, the cluster remains broken -- neither controller sees the other any more.)

Originally I saw this problem with both controllers active and enabled, with JBoss configured to round-robin them. I suspected a cluster communications bug, so I simplified the deployment to this single active controller to see what would happen. What's going on here? Is it a controller bug, an Appia bug, maybe a misconfiguration, or what?

_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia
