Rather than jumping on the "Me too" bandwagon (mine looks to have been
a little shy of 200 ms)...

> If this gets bad enough such that all servers get kicked out of the
> pool and there's nobody left, then there's a problem to be fixed.

This got me thinking... As I understand the way the monitor works, if
whatever happened this morning had lasted longer, the scenario you
mentioned could happen. Even the best providers (like Internap)
experience random problems from time to time, which, if they lasted
long enough, could bring everyone's score down below the threshold,
leaving the monitor to merrily purge the pool of all its members. Do
safeguards exist right now to prevent that?

Using multiple monitoring pools and taking the lowest offset from each
could solve this problem. (An average would still allow it if one host
was having problems that put its offset higher than
threshold * number_of_hosts.) A less-involved way, though, might be to
simply have a little code watching total trends.
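
To illustrate why I'd take the lowest offset rather than the average,
here is a rough sketch. The names and the 100 ms threshold are made up
for the example and have nothing to do with the real monitor code:

```python
from statistics import mean

THRESHOLD_MS = 100.0  # hypothetical cutoff, not the pool's real value


def combined_offset_min(offsets_ms):
    """Lowest absolute offset seen by any of the monitors."""
    return min(abs(o) for o in offsets_ms)


def combined_offset_mean(offsets_ms):
    """Average absolute offset across all monitors."""
    return mean(abs(o) for o in offsets_ms)


# Three monitors measure the same server; the third one sits behind a
# broken network path and reports more than THRESHOLD_MS * 3.
readings = [3.0, 4.0, 500.0]

print(combined_offset_min(readings) > THRESHOLD_MS)   # minimum stays under
print(combined_offset_mean(readings) > THRESHOLD_MS)  # average goes over
```

With the minimum, one misbehaving monitor cannot push a healthy server
over the threshold; with the average, it can.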

You could calculate the average offset of all hosts per monitoring
cycle (which would be an interesting graph anyway), which, with the
number of hosts involved, ought to be fairly low and, more
importantly, fairly consistent. An even lazier solution might be to
monitor the number of hosts in the pool and watch for drops in that.
In either case (a spike in average offsets, or a precipitous drop in
pool members), the system should suspend purging of pool members until
someone can intervene and straighten out whatever's happened.
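
Something along these lines is what I have in mind. This is only a
sketch: the function name, both limits, and the idea of passing in the
previous and current pool sizes are my own assumptions, not anything
taken from the actual monitor:

```python
from statistics import mean

# Hypothetical limits; real values would have to come from watching
# the pool's normal behaviour for a while.
MAX_AVG_OFFSET_MS = 50.0
MAX_POOL_DROP_FRACTION = 0.25


def purging_allowed(offsets_ms, previous_pool_size, current_pool_size):
    """Return False when this cycle looks like a monitor-side failure.

    offsets_ms maps host name -> measured offset in milliseconds.
    """
    if not offsets_ms:
        # No measurements at all: do not purge anything on that basis.
        return False

    # Check 1: a spike in the average offset across all hosts.
    avg = mean(abs(o) for o in offsets_ms.values())
    if avg > MAX_AVG_OFFSET_MS:
        return False

    # Check 2: a precipitous drop in the number of pool members.
    if previous_pool_size > 0:
        drop = (previous_pool_size - current_pool_size) / previous_pool_size
        if drop > MAX_POOL_DROP_FRACTION:
            return False

    return True
```

When this returns False the monitor would keep scoring hosts but skip
the purge step until someone has had a look.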

In theory the average offset could be subtracted from individual
offsets when computing score, but that might cause problems of its
own... I just have in mind a quick little sanity check to prevent
freak network failures from emptying the pool.
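
For what it's worth, the subtraction idea would look roughly like this
(a hypothetical helper, not how scores are computed today), and the
second example shows one way it could go wrong:

```python
from statistics import mean


def relative_offsets(offsets_ms):
    """Subtract the cycle's average offset from each host's offset."""
    avg = mean(offsets_ms.values())
    return {host: o - avg for host, o in offsets_ms.items()}


# Everyone looks about 190 ms off because of a monitor-side problem;
# after subtracting the average they all look fine again.
shared_problem = relative_offsets({"a": 190.0, "b": 195.0, "c": 185.0})

# One truly bad host drags the average up, so the two good hosts now
# appear to be 300 ms off in the other direction.
one_bad_host = relative_offsets({"a": 0.0, "b": 0.0, "c": 900.0})
```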
_______________________________________________
timekeepers mailing list
[email protected]
https://fortytwo.ch/mailman/cgi-bin/listinfo/timekeepers
