adding a separate thread to detect network timeouts faster

Jeremy Stribling Tue, 10 Sep 2013 13:30:45 -0700

Hi all,

Let's assume that you wanted to deploy ZK in a virtualized environment, despite all of the known drawbacks. Assume we could deploy it such that the ZK servers were all using independent CPUs and storage (though not dedicated disks). Obviously, the shared disks (shared with other, non-ZK VMs on the same hypervisor) will cause ZK to hit the default session timeout occasionally, so you would need to raise the existing session timeout to something like 30 seconds.

I'm curious if there would be any technical drawbacks to adding an additional heartbeat mechanism between the clients and the servers, which would have the goal of detecting network-only failures faster than the existing heartbeat mechanism. The idea is that there would be a new thread dedicated to processing these heartbeats, which would not get blocked on I/O. Then the clients could configure a second, smaller timeout value, and it would be assumed that any such timeout indicated a real problem. The existing mechanism would still be in place to catch I/O-related errors.
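
To make the two-timeout idea concrete, here is a minimal client-side sketch of the bookkeeping, not a patch against ZooKeeper. Everything in it is hypothetical: the class DualTimeoutDetector, its Status values, and the onNetworkPong/onSessionPong callbacks are names made up for illustration and are not part of any ZooKeeper API. It assumes the new I/O-free heartbeat thread answers "network" pings, while the existing heartbeat path answers "session" pings. Timestamps are passed in by the caller so the logic can be exercised without real threads or sockets.

```java
/**
 * Hypothetical sketch: tracks two independent liveness deadlines for one
 * client-to-server connection. Not part of ZooKeeper.
 */
public class DualTimeoutDetector {
    public enum Status { HEALTHY, NETWORK_TIMEOUT, SESSION_TIMEOUT }

    private final long networkTimeoutMs; // short timeout for the I/O-free heartbeat
    private final long sessionTimeoutMs; // long timeout for the existing heartbeat
    private long lastNetworkPongMs;
    private long lastSessionPongMs;

    public DualTimeoutDetector(long networkTimeoutMs, long sessionTimeoutMs, long nowMs) {
        if (networkTimeoutMs <= 0 || networkTimeoutMs >= sessionTimeoutMs) {
            throw new IllegalArgumentException(
                "network timeout must be positive and smaller than session timeout");
        }
        this.networkTimeoutMs = networkTimeoutMs;
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.lastNetworkPongMs = nowMs;
        this.lastSessionPongMs = nowMs;
    }

    /** Reply from the (assumed) dedicated heartbeat thread, which never touches disk. */
    public void onNetworkPong(long nowMs) {
        lastNetworkPongMs = nowMs;
    }

    /** Reply from the existing heartbeat path; any reply also proves the network works. */
    public void onSessionPong(long nowMs) {
        lastSessionPongMs = nowMs;
        lastNetworkPongMs = nowMs;
    }

    /** Reports which deadline, if any, has passed; the session deadline takes precedence. */
    public Status check(long nowMs) {
        if (nowMs - lastSessionPongMs >= sessionTimeoutMs) {
            return Status.SESSION_TIMEOUT;
        }
        if (nowMs - lastNetworkPongMs >= networkTimeoutMs) {
            return Status.NETWORK_TIMEOUT;
        }
        return Status.HEALTHY;
    }
}
```

With a 30-second session timeout and an assumed 2-second network timeout, a client would treat NETWORK_TIMEOUT as the signal to fail over to another server right away, while a server that is merely stalled on disk keeps answering network pings and is only caught by the longer SESSION_TIMEOUT.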

I understand the philosophy that there should be some heartbeat mechanism that takes the disk into account, but I'm having trouble coming up with technical reasons not to add a second mechanism. Obviously, the advantage would be that the clients could detect network failures and system crashes more quickly in an environment with slow disks, and fail over to other servers more quickly. The only disadvantages I can come up with are:


1) More code complexity, and slightly more heartbeat traffic on the wire

2) I think the servers have to log session expirations to disk, so if the sessions expire at a faster rate than the disk can handle, it might lead to a large backlog.

Are there other drawbacks I am missing? Would a patch that added something like this be considered, or is it dead from the start? Thanks,


Jeremy
