5x seems like a lot but what is the functional difference between 5 and 25 ms?
I think there is probably some problem you could solve a different way using the guarantees that zk already makes.

-m

On Sep 10, 2013, at 3:34 PM, Jeremy Stribling <[email protected]> wrote:

> I mostly agree, but let's assume that a ~5x speedup in detecting those types
> of failures is considered significant for some people. Are there technical
> reasons that would prevent this idea from working?
>
> On 09/10/2013 01:31 PM, Ted Dunning wrote:
>> I don't see the strong value here. A few failures would be detected more
>> quickly, but I am not convinced that this would actually improve
>> functionality significantly.
>>
>>
>> On Tue, Sep 10, 2013 at 1:01 PM, Jeremy Stribling <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Let's assume that you wanted to deploy ZK in a virtualized environment,
>>> despite all of the known drawbacks. Assume we could deploy it such that
>>> the ZK servers were all using independent CPUs and storage (though not
>>> dedicated disks). Obviously, the shared disks (shared with other, non-ZK
>>> VMs on the same hypervisor) will cause ZK to hit the default session
>>> timeout occasionally, so you would need to raise the existing session
>>> timeout to something like 30 seconds.
>>>
>>> I'm curious if there would be any technical drawbacks to adding an
>>> additional heartbeat mechanism between the clients and the servers, which
>>> would have the goal of detecting network-only failures faster than the
>>> existing heartbeat mechanism. The idea is that there would be a new thread
>>> dedicated to processing these heartbeats, which would not get blocked on
>>> I/O. Then the clients could configure a second, smaller timeout value, and
>>> it would be assumed that any such timeout indicated a real problem. The
>>> existing mechanism would still be in place to catch I/O-related errors.
>>>
>>> I understand the philosophy that there should be some heartbeat mechanism
>>> that takes the disk into account, but I'm having trouble coming up with
>>> technical reasons not to add a second mechanism. Obviously, the advantage
>>> would be that the clients could detect network failures and system crashes
>>> more quickly in an environment with slow disks, and fail over to other
>>> servers more quickly. The only disadvantages I can come up with are:
>>>
>>> 1) More code complexity, and slightly more heartbeat traffic on the wire
>>> 2) I think the servers have to log session expirations to disk, so if the
>>> sessions expire at a faster rate than the disk can handle, it might lead to
>>> a large backlog.
>>>
>>> Are there other drawbacks I am missing? Would a patch that added
>>> something like this be considered, or is it dead from the start? Thanks,
>>>
>>> Jeremy
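
To make the two-timeout idea in the original message concrete, here is a minimal client-side sketch in Python. It is not ZooKeeper code and uses no ZooKeeper API: the class name, method names, and status strings are all hypothetical, and the 6 s / 30 s values are only assumptions chosen to mirror the ~5x ratio discussed in the thread. It tracks the last acknowledgement from a fast, I/O-free heartbeat separately from the last acknowledgement of the existing disk-aware session ping.

```python
from dataclasses import dataclass


@dataclass
class DualTimeoutDetector:
    """Hypothetical client-side detector with two independent timeouts."""

    fast_timeout: float           # assumed short timeout for the I/O-free heartbeat
    session_timeout: float        # existing-style session timeout (e.g. 30 seconds)
    last_fast_ack: float = 0.0    # time of the last fast-heartbeat acknowledgement
    last_session_ack: float = 0.0 # time of the last regular session-ping acknowledgement

    def on_fast_ack(self, now: float) -> None:
        # The server's dedicated heartbeat thread answered (no disk I/O involved).
        self.last_fast_ack = now

    def on_session_ack(self, now: float) -> None:
        # The regular request path answered (may have waited on the disk).
        self.last_session_ack = now

    def status(self, now: float) -> str:
        # The long timeout still catches I/O-related stalls.
        if now - self.last_session_ack > self.session_timeout:
            return "session_expired"
        # The short timeout flags network failures or crashes sooner.
        if now - self.last_fast_ack > self.fast_timeout:
            return "network_suspect"
        return "ok"


if __name__ == "__main__":
    d = DualTimeoutDetector(fast_timeout=6.0, session_timeout=30.0)
    d.on_fast_ack(0.0)
    d.on_session_ack(0.0)
    print(d.status(5.0))
    print(d.status(10.0))
```

In this sketch a client would treat "network_suspect" as the signal to fail over to another server, while a slow disk that delays only the regular pings would not trip the short timeout as long as the fast heartbeats keep arriving.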
