Hi Don,
What you are describing are just TCP timeouts. When you unplug your
cable, every process with an open TCP connection to the unreachable
host has to wait for the kernel to time that connection out. You might
want to tune those TCP timeout settings (I don't know how to do that on
Windows, but there are probably many resources on the web for that).
Note that the group communication uses a UDP-based heartbeat and
therefore does not suffer from the TCP timeout problem.
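To illustrate the general mechanism, here is a minimal Java sketch. It is not Sequoia or Appia code; the class and method names are made up for the example. A client connects to a local peer that never sends anything, which stands in for a host whose cable was pulled. With a read timeout set on the socket (SO_TIMEOUT), the blocked read is abandoned after the chosen delay; without one, the read would stay blocked until the kernel itself gives up on the connection, which can take a long time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of Sequoia.
public class ReadTimeoutSketch {

    /**
     * Connects to a local peer that accepts the connection at the TCP level
     * but never sends data, then attempts a read with the given SO_TIMEOUT.
     * Returns true if the read was abandoned because the timeout expired.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Port 0 lets the OS pick a free port; this peer never writes anything.
        try (ServerSocket silentPeer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                 new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            // A value of 0 would mean "block indefinitely" at the Java level.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks: no data will ever arrive
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the application-level timeout fired
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        boolean timedOut = readTimesOut(500);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("timed out: " + timedOut + " after ~" + elapsedMs + " ms");
    }
}
```

This only shows why a blocked TCP read depends on a timeout while a periodic UDP heartbeat does not; whether and where Sequoia exposes such a setting is a separate question.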
Hope this helps,
Emmanuel
Sequoia appears not to handle network failure gracefully. My
configuration:
- two MS Windows servers, A (10.0.0.61) and B (10.0.0.60).
- JBoss 4.0.5
- Sequoia 2.10.9 using Appia default configuration.
- MySQL 5.0.41
- Server A running JBoss, Sequoia controller, MySQL backend.
- Server B running Sequoia controller, MySQL backend.
- Controller A and B are (the only) members of a cluster called
"mySequoia", as confirmed on each machine using "show controllers".
- JBoss is configured to use only controller A, via
"<connection-url>jdbc:sequoia://A/mySequoia</connection-url>".
- B's backend is disabled.
Everything works fine under load, with JBoss happily hitting
controller A, which in turn updates the database backend on A. Then I
unplug the ethernet cable on server B, and everything hangs. JBoss
stops, controller A stops, logging nothing. Controller B logs a
warning that controller A has left the cluster. I wait for five
minutes, nothing happens except a transaction timeout on the JBoss
server. After ten minutes I plug the ethernet back in, and controller
A logs this:
14:21:05,390 INFO continuent.hedera.gms
Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) failed in
Group(gid=mySequoia)
14:21:05,390 WARN controller.virtualdatabase.mySequoia Controller
Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) has left the
cluster.
14:21:05,390 INFO controller.virtualdatabase.mySequoia 1 requests
were waiting responses from Member(address=/10.0.0.60:49573,
uid=10.0.0.60:49573)
14:21:05,390 WARN controller.RequestManager.mySequoia 1 controller(s)
died during execution of request 844424930133025
14:21:05,390 WARN controller.RequestManager.mySequoia Controller
Member(address=/10.0.0.60:49573, uid=10.0.0.60:49573) is suspected of
failure.
14:21:06,906 INFO controller.requestmanager.cleanup Waiting 120000ms
for client of controller 281474976710656 to failover
14:23:06,906 INFO controller.requestmanager.cleanup Cleanup for
controller 281474976710656 failure is completed.
and comes back to life, as does JBoss. (However, the cluster remains
broken -- neither controller sees the other any more.)
Originally I saw this problem with both controllers active and
enabled, with JBoss configured to round-robin them. I suspected a
cluster communications bug, so I simplified the deployment to this
single active controller to see what would happen.
What's going on here? Is it a controller bug, an Appia bug, maybe a
misconfiguration, or what?
_______________________________________________
Sequoia mailing list
https://forge.continuent.org/mailman/listinfo/sequoia