Camille Fournier commented on ZOOKEEPER-922:
I'm interested in hearing the problems that you believe it would lead to in
more detail. To me, this feels like a reasonable compromise solution to a tough
problem. If the problem you foresee is a client and server getting disconnected
from each other while both stay alive, so that the client's session expires
when it reconnects to another server, that is fine for my particular scenario.
I have a wrapped ZK client that is highly
tolerant to all sorts of failures and has no problem resetting its state. I
realize that may not be acceptable for other users, and I would not propose
this solution without either community agreement that this risk, if
well-documented, is OK, or a fix for that problem. But I don't know what other
problems you are seeing, and while I might be able to solve them if you help me
see what they are, I can't act on vague suppositions of problematic
circumstances. Don't get me wrong, I'm not married to this solution, but I am
interested in some solution if possible.

It seems to me that not allowing clients to reconnect to other servers causes a
host of other problems and is a worse solution for people who would not want
this fast expiration forced on them. In what scenarios can a client not
reconnect to another server? All? Obviously that won't fly because even I would
not want to have all of my sessions expire in the case of an ensemble member
dying and clients failing over. If we only want to do this where my code is
doing the "touchAndClose" (i.e., when the server the client was connected to
sees a failure-based disconnect), then we see exactly the same potential problem
outlined above, where the client could still be alive but a switch goes down
and disconnects it from the server. Now it tries to fail over and its session is
always dead. I'm not convinced off the bat that that is any better than letting
it try to fail over and risking a potential session timeout race, which I think
could possibly be fixed by associating the client session with the server
currently maintaining it (already done but not passed through on ticks).
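
One possible reading of the "associate the session with the server currently
maintaining it" idea, as a rough sketch only: the tracker remembers which server
owns each session and honors a fast-expire request only from that owner, so a
late report from the old server cannot shorten a session that has already
failed over. The class and method names here (OwnedSessionSketch, attach,
fastExpire) are hypothetical and are not taken from the ZooKeeper code base or
the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not ZooKeeper's actual session tracking code. */
class OwnedSessionSketch {
    private static final class Session {
        long ownerServerId;   // server currently maintaining the session
        long timeoutMs;       // negotiated session timeout
        long expiresAtMs;     // absolute expiry time
    }

    private final Map<Long, Session> sessions = new HashMap<>();

    /** Create the session, or move it to a new owner and restore the full timeout. */
    void attach(long sessionId, long serverId, long timeoutMs, long nowMs) {
        Session s = sessions.computeIfAbsent(sessionId, id -> new Session());
        s.ownerServerId = serverId;
        s.timeoutMs = timeoutMs;
        s.expiresAtMs = nowMs + timeoutMs;
    }

    /**
     * Shorten the expiry after a failure-based disconnect, but only if the
     * report comes from the server that still owns the session.
     */
    boolean fastExpire(long sessionId, long reportingServerId,
                       long minSessionTimeoutMs, long nowMs) {
        Session s = sessions.get(sessionId);
        if (s == null || s.ownerServerId != reportingServerId) {
            return false; // stale report: the client already moved elsewhere
        }
        s.expiresAtMs = Math.min(s.expiresAtMs, nowMs + minSessionTimeoutMs);
        return true;
    }

    /** Absolute expiry time, or -1 if the session is unknown. */
    long expiresAt(long sessionId) {
        Session s = sessions.get(sessionId);
        return s == null ? -1L : s.expiresAtMs;
    }
}
```

Whether the owner information can actually be carried on ticks this way is
exactly the open question above; the sketch only shows the ownership check
itself.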

What did you mean in the earlier comment about this causing leadership election
issues? Does this actually interact with that at all? This is the kind of thing
I could use guidance on. Or we can let this whole idea drop, but it does seem
that more people than me are interested, so it might be worth hashing it out.

> enable faster timeout of sessions in case of unexpected socket disconnect
> Key: ZOOKEEPER-922
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-922
> Project: Zookeeper
> Issue Type: Improvement
> Components: server
> Reporter: Camille Fournier
> Assignee: Camille Fournier
> Fix For: 3.4.0
> Attachments: ZOOKEEPER-922.patch
> In the case when a client connection is closed due to socket error instead of
> the client calling close explicitly, it would be nice to enable the session
> associated with that client to time out faster than the negotiated session
> timeout. This would enable a ZooKeeper ensemble that is acting as a dynamic
> discovery provider to remove ephemeral nodes for crashed clients quickly,
> while allowing for a longer heartbeat-based timeout for Java clients that
> need to do long stop-the-world GC.
> I propose doing this by setting the timeout associated with the crashed
> session to "minSessionTimeout".
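
The proposal in the description can be illustrated with a minimal sketch,
assuming a simple in-memory session table. FastExpirySketch and its methods are
hypothetical names made up for this illustration; they are not the classes in
ZOOKEEPER-922.patch or in the ZooKeeper server. On a socket error the expiry is
pulled in to "now + minSessionTimeout"; a normal heartbeat or reconnect
restores the full negotiated timeout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical session table; not ZooKeeper's actual SessionTracker. */
class FastExpirySketch {
    private static final class Session {
        final long negotiatedTimeoutMs;
        long expiresAtMs;

        Session(long negotiatedTimeoutMs, long expiresAtMs) {
            this.negotiatedTimeoutMs = negotiatedTimeoutMs;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final long minSessionTimeoutMs;
    private final Map<Long, Session> sessions = new HashMap<>();

    FastExpirySketch(long minSessionTimeoutMs) {
        this.minSessionTimeoutMs = minSessionTimeoutMs;
    }

    void open(long sessionId, long negotiatedTimeoutMs, long nowMs) {
        sessions.put(sessionId,
                new Session(negotiatedTimeoutMs, nowMs + negotiatedTimeoutMs));
    }

    /** Heartbeat or successful reconnect: restore the full negotiated timeout. */
    boolean touch(long sessionId, long nowMs) {
        Session s = sessions.get(sessionId);
        if (s == null) {
            return false; // already expired
        }
        s.expiresAtMs = nowMs + s.negotiatedTimeoutMs;
        return true;
    }

    /** Connection closed by socket error: never extend, only shorten the expiry. */
    void onSocketError(long sessionId, long nowMs) {
        Session s = sessions.get(sessionId);
        if (s != null) {
            s.expiresAtMs = Math.min(s.expiresAtMs, nowMs + minSessionTimeoutMs);
        }
    }

    /** Remove and return every session whose expiry time has passed. */
    List<Long> expire(long nowMs) {
        List<Long> dead = new ArrayList<>();
        sessions.entrySet().removeIf(e -> {
            if (e.getValue().expiresAtMs <= nowMs) {
                dead.add(e.getKey());
                return true;
            }
            return false;
        });
        return dead;
    }
}
```

A client that crashed is then cleaned up after minSessionTimeout, while a
client that is merely paused in GC (no socket error) keeps its full timeout.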