Re: [time] what's wrong with the monitor this morning?

Chris Kuethe Tue, 04 Mar 2008 16:39:10 -0800

On Tue, Mar 4, 2008 at 4:05 PM, Ryan Malayter <[EMAIL PROTECTED]> wrote:
>  I think using some high-school-level statistics would be more
>  reliable... compute the standard deviation of offsets, and anything
>  outside of N standard deviations gets kicked out of the pool. Or
>  something like that. You have to presume that the crowd is more
>  knowledgeable than the monitoring server about what true time is.
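
For reference, the N-standard-deviations idea quoted above could look roughly like this. It is only a sketch: the function name, the dict of offsets (in seconds) and the default N of 2 are assumptions made for illustration, not anything the pool monitor actually does. Note that one very large outlier inflates the standard deviation itself, so a small N is needed when there are few samples.

```python
import statistics

def filter_outliers(offsets, n_sigma=2.0):
    """Split servers into (kept, rejected) by distance from the mean offset.

    offsets: dict mapping server name -> measured offset in seconds.
    n_sigma: hypothetical cut-off, in standard deviations.
    """
    values = list(offsets.values())
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    kept, rejected = {}, {}
    for server, offset in offsets.items():
        # With stdev == 0 every server agrees, so nothing is rejected.
        if stdev > 0 and abs(offset - mean) > n_sigma * stdev:
            rejected[server] = offset
        else:
            kept[server] = offset
    return kept, rejected

# Made-up offsets: six servers within a few ms, one half a second off.
offsets = {"a": 0.001, "b": -0.002, "c": 0.000, "d": 0.003,
           "e": -0.001, "f": 0.002, "g": 0.500}
kept, rejected = filter_outliers(offsets)
print(sorted(rejected))
```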


For less than $100 you should be able to hang a GPS receiver off the
back of a monitoring server. Then it too will have a good estimate of
what the true time is.

>  I imagine the CPU load on the monitoring server isn't an issue, but
>  rather the network is the bottleneck.

I dunno about that, though I had considered the same thing. Even with
nearly 1600 servers in the pool that's less than 1 per second to poll
them all in under half an hour. But I suppose we want to detect
misbehaviour sooner than that. 10 servers/sec would probably be easily
doable... 3 minutes per scan. Get 5 pollers, break up the server list
into shares, and scan. poller[n] starts scanning at share[n] and wraps
around. Then you can get the entire pool probed 5 times in 3 minutes.
That would let you vote on the reliability of each server. The 5
pollers could each health-check each other too.
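
A rough sketch of the share/wrap-around scheme and the arithmetic above. All the names here (make_shares, scan_order, scan_seconds, majority_healthy) are made up for illustration, and the simple majority vote is just one assumed way of combining the 5 results.

```python
def make_shares(servers, n_pollers):
    """Split the server list into n_pollers contiguous, near-equal shares."""
    size, extra = divmod(len(servers), n_pollers)
    shares, start = [], 0
    for i in range(n_pollers):
        end = start + size + (1 if i < extra else 0)
        shares.append(servers[start:end])
        start = end
    return shares

def scan_order(shares, n):
    """Order in which poller n visits servers: share[n] first, then wrap."""
    rotated = shares[n:] + shares[:n]
    return [server for share in rotated for server in share]

def scan_seconds(n_servers, rate_per_sec):
    """Time for one poller to walk the whole list at a fixed rate."""
    return n_servers / rate_per_sec

def majority_healthy(votes):
    """votes: one boolean per poller; True means 'looked healthy'."""
    return sum(votes) > len(votes) / 2

servers = [f"s{i}" for i in range(1600)]    # placeholder names
shares = make_shares(servers, 5)
print(scan_order(shares, 2)[0])             # first server poller 2 probes
print(scan_seconds(1600, 10) / 60)          # minutes per full scan
```

Since every poller walks the full list, just starting at a different share, each server collects 5 independent results per scan.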

If a server is marked as bad, it can be pulled out of the pool very
quickly and added to a list of unhealthy servers. These can be
retested in 15 minutes and re-added if they're healthy again.
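
Something like the following would do for the bookkeeping, as a sketch only. The Pool class and the is_healthy callback are hypothetical; timestamps are plain seconds so the example does not depend on a real clock.

```python
RETEST_AFTER = 15 * 60  # seconds before an unhealthy server is retested

class Pool:
    def __init__(self, servers):
        self.active = set(servers)
        self.unhealthy = {}  # server -> time it was last marked bad

    def mark_bad(self, server, now):
        """Pull a server out of the pool immediately."""
        self.active.discard(server)
        self.unhealthy[server] = now

    def due_for_retest(self, now):
        """Servers that have been on the unhealthy list long enough."""
        return sorted(s for s, t in self.unhealthy.items()
                      if now - t >= RETEST_AFTER)

    def retest(self, now, is_healthy):
        """Re-add servers that pass; restart the timer for those that fail."""
        for server in self.due_for_retest(now):
            if is_healthy(server):
                del self.unhealthy[server]
                self.active.add(server)
            else:
                self.unhealthy[server] = now

pool = Pool(["a", "b", "c"])
pool.mark_bad("b", now=0)
pool.retest(now=900, is_healthy=lambda s: True)
print(sorted(pool.active))
```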

>  The neatest thing to do, IMHO, would be to have each pool member
>  monitor say 10 randomly selected others with noselect, then have the
>  monitoring server use ntpq or some other means to pull in that data
>  and crunch it to determine a score. That is a much bigger job, though.

The reports could also be used to catch a misbehaving reporter. If
each server is polled by 10 other servers and "joe" returns times
that are, say, 75 ms higher on all of his reports than any other
machine does, then we can flag "joe" as being broken. It may even be
possible to detect network problems between two nets ("argh - peering
between shaw and sonic is down!").
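
One way the "joe" check might be done, sketched under some assumptions: reports arrive as (reporter, target, offset in seconds) tuples, the per-target median stands in for the consensus, and 50 ms is an arbitrary threshold. None of these choices come from the existing monitor.

```python
from collections import defaultdict
from statistics import median

def biased_reporters(reports, threshold=0.050):
    """Return {reporter: bias} for reporters consistently off from consensus.

    reports: iterable of (reporter, target, offset_seconds) tuples.
    threshold: hypothetical limit on a reporter's median deviation.
    """
    reports = list(reports)

    # Consensus offset per target: median over everyone who measured it.
    by_target = defaultdict(list)
    for reporter, target, offset in reports:
        by_target[target].append(offset)
    consensus = {t: median(v) for t, v in by_target.items()}

    # How far each reporter sits from the consensus, per measurement.
    residuals = defaultdict(list)
    for reporter, target, offset in reports:
        residuals[reporter].append(offset - consensus[target])

    bias = {r: median(v) for r, v in residuals.items()}
    return {r: b for r, b in bias.items() if abs(b) > threshold}

# Made-up data: three sane reporters, and "joe" reading 75 ms high.
reports = []
for target in ("t1", "t2", "t3"):
    reports += [("a", target, 0.001), ("b", target, 0.000),
                ("c", target, -0.001), ("joe", target, 0.075)]
print(biased_reporters(reports))
```

Flipping the grouping around (residuals per pair of networks rather than per reporter) would be the starting point for spotting a path problem instead of a broken machine.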

CK

-- 
GDB has a 'break' feature; why doesn't it have 'fix' too?
