s/one of customer/one of our customer/ (sorry for the typo).
On Thu, Aug 18, 2011 at 10:24 PM, Sampath Perera <[email protected]> wrote:

> Hi Flavio,
>
> On Thu, Aug 18, 2011 at 9:24 PM, Flavio Junqueira <[email protected]> wrote:
>
>> Hi Ted, I don't see how one can automate the distinction between a machine
>> that is down because it crashed and a machine that is down because it hasn't
>> started yet. Assuming that we are logging the machine unavailability as we
>> are doing currently, one can always look at the timestamp of the warning and
>> remember that this is the time the machines were bootstrapping.
>> Consequently, I don't really see the point of reducing the number of
>> warnings, unless the warnings are really polluting the logs. I typically
>> don't see so many that they prevent me from reading the rest, but you may
>> have a different perception. Also, recall that we back off, so the warnings
>> become less frequent over time.
>>
>
> True, but one of customer deployments has a log-analyzing tool that sends
> notifications for the errors in the log. As you said previously, we cannot
> get an optimal value for this timeout, but we can come up with a suboptimal
> value to get rid of this warning.
>
>>
>> I'm open to ideas, though. If you see anything wrong in my rationale or if
>> you have an idea of how to do it differently, then I'd be happy to hear.
>> However, if the idea is simply to add a parameter that configures the time
>> for leader election to start, then I'm currently not in favor.
>>
>
> Well, what I was originally looking for was to delay the leader election,
> but, as Ted pointed out, I was going to provide a patch for printing this
> warning. (If you look carefully at Ted's comment and my response, I was
> thinking of a timeout after which the warning would be considered a warning
> worth printing in the log... at least that is what I got from Ted's first
> comment.) What do you think about that?
>
>
>>
>> -Flavio
>>
>> On Aug 18, 2011, at 5:39 PM, Ted Dunning wrote:
>>
>> Flavio,
>>
>> What you say is correct, but the original poster does have a point that
>> many of these warnings are to be expected and there is a heuristic that
>> might assist in distinguishing some of these cases so that false alarms in
>> the logs could be decreased.
>>
>> That doesn't seem like a big deal to me, but different people have
>> different itches. In my experience, restarting a ZK cluster from zero
>> almost never happens.
>>
>> On Thu, Aug 18, 2011 at 8:36 AM, Ted Dunning <[email protected]> wrote:
>>
>> > On Thu, Aug 18, 2011 at 12:15 AM, Sampath Perera <[email protected]> wrote:
>> >
>> > > Hhmmm, I think this is a bit different, isn't it? Here we know that the
>> > > first server to come up will fail to connect to the others, as they are
>> > > not yet up. Anyway, our real issue is the warning.
>> >
>> > We know that.
>> >
>> > But how does the server know that it is the first server? That is the
>> > whole point of the leader election. You might just have a server
>> > rejoining a cluster. Or you might have a cluster that has been turned
>> > off. Or a cluster with 2 out of 5 machines off and we tried to touch the
>> > other down machine before the others.
>> >
>> > > > Would you like to suggest a patch?
>> > >
>> > > Of course I do. I will prepare a patch and attach it.
>> >
>> > Great!
>>
>> *flavio*
>> *junqueira*
>>
>> research scientist
>>
>> [email protected]
>> direct +34 93-183-8828
>>
>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>> phone (408) 349 3300 fax (408) 349 3301
>>
>
> --
> Thanks,
> Sampath
> http://adroitlogic.org
>

--
Thanks,
Sampath
http://adroitlogic.org
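For illustration, here is a minimal Java sketch of the heuristic discussed in this thread: retries to an unreachable peer back off exponentially (the behavior Flavio mentions), and failures seen within a startup grace period are logged at a lower level instead of as warnings (the timeout Sampath proposes). Everything here is hypothetical: the class `PeerConnectLogPolicy`, its parameters, the log message, and the example values (200 ms initial delay, 60 s cap, 60 s grace period) are assumptions for the sketch, not ZooKeeper's actual code or configuration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical sketch, not ZooKeeper code: decides how loudly to log a failed
 * connection attempt to a peer, and how long to wait before retrying.
 */
final class PeerConnectLogPolicy {
    private static final Logger LOG = Logger.getLogger(PeerConnectLogPolicy.class.getName());

    private final long startMillis;   // local time at which this server started
    private final long graceMillis;   // assumed startup grace period
    private final long maxDelayMillis; // cap on the retry backoff
    private long nextDelayMillis;     // delay returned by the next nextDelay() call

    PeerConnectLogPolicy(long startMillis, long graceMillis,
                         long initialDelayMillis, long maxDelayMillis) {
        if (graceMillis < 0 || initialDelayMillis <= 0 || maxDelayMillis < initialDelayMillis) {
            throw new IllegalArgumentException("invalid timing parameters");
        }
        this.startMillis = startMillis;
        this.graceMillis = graceMillis;
        this.maxDelayMillis = maxDelayMillis;
        this.nextDelayMillis = initialDelayMillis;
    }

    /** FINE while still inside the grace period after startup, WARNING afterwards. */
    Level levelFor(long nowMillis) {
        return (nowMillis - startMillis < graceMillis) ? Level.FINE : Level.WARNING;
    }

    /** Returns the current retry delay, then doubles it for next time, up to the cap. */
    long nextDelay() {
        long delay = nextDelayMillis;
        nextDelayMillis = Math.min(nextDelayMillis * 2, maxDelayMillis);
        return delay;
    }

    /** Logs a failed attempt at the level chosen by the grace-period heuristic. */
    void onConnectFailure(String peer, long nowMillis) {
        LOG.log(levelFor(nowMillis), "Could not connect to peer " + peer);
    }

    public static void main(String[] args) {
        PeerConnectLogPolicy policy = new PeerConnectLogPolicy(0L, 60_000L, 200L, 60_000L);
        System.out.println(policy.levelFor(5_000L));  // prints "FINE"
        System.out.println(policy.levelFor(90_000L)); // prints "WARNING"
        for (int i = 0; i < 4; i++) {
            System.out.println(policy.nextDelay());   // prints 200, 400, 800, 1600
        }
        policy.onConnectFailure("server2", 90_000L);  // emitted as a WARNING
    }
}
```

As Ted points out, local uptime cannot tell a cluster starting from zero apart from a single server rejoining a running cluster, so a window like this would also demote genuine failures that a restarted server sees during its first minute; that trade-off is what Flavio's objection is about.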
