| Sampath, do you think something along the lines of what Ted describes would work for you?
-Flavio
On Aug 18, 2011, at 7:13 PM, Ted Dunning wrote:
The thought is that a server would not complain about connection refused or inability to form a quorum during the first (say) twenty seconds of operation.
The thesis is that warnings from these causes during that time are spurious.
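A minimal sketch of the heuristic described above, assuming a simple time-based grace period. The class name, method names, and the 20-second default are hypothetical illustrations, not ZooKeeper code or configuration. Note that it only looks at elapsed time since startup; it cannot tell a crashed peer from one that has not started yet.

```python
import time


class StartupGraceWarnings:
    """Hypothetical helper: hold back connection warnings during a startup grace period."""

    def __init__(self, grace_seconds=20.0, clock=time.monotonic):
        # The clock is injectable so the behavior can be checked without sleeping.
        self._clock = clock
        self._started_at = clock()
        self._grace_seconds = grace_seconds
        self.suppressed = 0  # count of warnings held back, for later inspection

    def should_warn(self):
        """Return True if a connection-refused or no-quorum warning should be logged now."""
        elapsed = self._clock() - self._started_at
        if elapsed < self._grace_seconds:
            # Still inside the grace period: treat the warning as expected.
            self.suppressed += 1
            return False
        return True
```

A caller would check `should_warn()` before logging a connection failure; once the grace period has elapsed, every warning is logged as before.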
As I mentioned, I don't see this as urgent or even necessarily a good idea. I completely reboot a ZK cluster once every year or three. When I am doing a rolling upgrade, I *want* to see alerts when I bounce a machine. If I don't want to see those alerts, my monitoring system allows me to put a machine into maintenance mode for a short period of time to temporarily suppress the warnings.
All I was doing was translating and elaborating the original poster's suggestion, not so much endorsing it.
On Thu, Aug 18, 2011 at 8:54 AM, Flavio Junqueira <[email protected]> wrote:
Hi Ted,
I don't see how one can automate the distinction between a machine that is down because it crashed and a machine that is down because it hasn't started yet. Assuming that we are logging the machine unavailability as we are doing currently, one can always look at the timestamp of the warning and remember that this is the time the machines were bootstrapping. Consequently, I don't really see the point of reducing the number of warnings, unless the warnings are really polluting the logs. I typically don't see so many that they prevent me from reading the rest, but you may have a different perception. Also, recall that we back off, so the warnings become less frequent over time.
I'm open to ideas, though. If you see anything wrong in my rationale or if you have an idea of how to do it differently, then I'd be happy to hear it. However, if the idea is simply to add a parameter that configures the time for leader election to start, then I'm currently not in favor.
-Flavio
On Aug 18, 2011, at 5:39 PM, Ted Dunning wrote:
Flavio,
What you say is correct, but the original poster does have a point that many of these warnings are to be expected and there is a heuristic that might assist in distinguishing some of these cases so that false alarms in the logs could be decreased. That doesn't seem like a big deal to me, but different people have different itches. In my experience, restarting a ZK cluster from zero almost never happens.
On Thu, Aug 18, 2011 at 8:36 AM, Ted Dunning <[email protected]> wrote:
On Thu, Aug 18, 2011 at 12:15 AM, Sampath Perera <[email protected]> wrote:
Hmm, I think this is a bit different, isn't it? Here we know that the first server to come up will fail to connect to the others, as they are not yet up. Anyway, our real issue is the warning.
We know that.
But how does the server know that it is the first server? That is the whole point of the leader election. You might just have a server rejoining a cluster. Or you might have a cluster that has been turned off. Or a cluster with 2 out of 5 machines off and we tried to touch the other down machine before the others.
Would you like to suggest a patch?
Of course I do. I will prepare a patch and attach it.
Great!
flavio junqueira
research scientist
[email protected]
direct +34 93-183-8828
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300 fax (408) 349 3301
|