On 27 Dec, 2012, at 15:13 , Magnus Danielson <[email protected]> wrote:

> On GE, a full-length packet is about 12 us, so a single packets head-of-line
> blocking can be anything up to that amount, multiple packets... well, it
> keeps adding. Knowing how switches works doesn't really help as packets
> arrive in a myriad of rates, they interact and cross-modulate and create
> strange patterns and dance in interesting ways that is ever changing in
> unpredictable fashion.
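The quoted 12 us is just the serialization time of one maximum-length Ethernet frame at gigabit rates. A quick sketch to check the arithmetic (the 1538-byte on-wire size, counting preamble, header, FCS and inter-frame gap, is my assumption for "full-length"):

```python
# Head-of-line blocking from one maximum-length Ethernet frame on a
# gigabit link.  1538 bytes = 1500 payload + 18 header/FCS + 8 preamble
# + 12 inter-frame gap (assumed on-wire frame size).
GE_RATE = 1_000_000_000      # bits per second
FRAME_BYTES = 1538

delay_us = FRAME_BYTES * 8 / GE_RATE * 1e6
print(f"{delay_us:.1f} us")  # ~12.3 us, matching the figure quoted above
```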
I wanted to address this bit because it seems like most people base their expectations for NTP on this complexity, as does the argument being made above, but the holiday intervened. While I suspect many people are thoroughly bored of this topic by now, I can't resist completing the thought.

Yes, the delay of a sample packet through an output queue will be proportional to the number of untransmitted bits in the queue ahead of it; yes, the magnitude of that delay can be very large and highly variable; and, even, yes, the statistics governing that delay may often be unpredictable and non-gaussian, exhibiting dangerously heavy tails. The thing is, though, that this doesn't necessarily have to matter so much. A better approach might avoid relying on the things you can't know.

To see how, consider a different question: what is the probability that any two samples sent through that queue will experience precisely the same delay (i.e. find precisely the same number of bits queued in front of them when they get there)? I think it is fairly conservative to predict that the probability that two samples will arrive at a non-empty output queue with exactly the same number of bits in front of them will be fairly small; the number of bits in the queue will be continuously changing, so the delay through a non-empty queue should have a near-continuous (and unpredictable) probability distribution, as you point out, and if the sampling is uncorrelated with the competing traffic it is unlikely that any pair of samples will find exactly the same point on that distribution. The exception to this, of course, is a queue length of precisely 0 bits (which is precisely why the behaviour of a switch with no competing traffic is interesting). The vast majority of queues in the vast majority of network devices in real networks are nowhere near continuously occupied for long periods.
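A toy simulation makes this concrete. The sketch below (illustrative only; every parameter is an assumption) drives a gigabit output queue with Poisson-arriving 1500-byte packets at 50% load and probes it at uncorrelated random times: the probes that find the queue empty all see the same (zero) queueing delay, while essentially no two of the remaining probes see the same delay.

```python
import random

random.seed(1)

LINK_BPS = 1e9                  # gigabit output link
PKT_BITS = 1500 * 8             # competing traffic: 1500-byte packets
LOAD = 0.5                      # average load on the output circuit
PROBE_RATE = 1e4                # probes per second, uncorrelated

service = PKT_BITS / LINK_BPS            # 12 us to transmit one packet
pkt_rate = LOAD * LINK_BPS / PKT_BITS    # competing packets per second

work = 0.0      # seconds of untransmitted traffic sitting in the queue
last = 0.0
next_pkt = random.expovariate(pkt_rate)
next_probe = random.expovariate(PROBE_RATE)
delays = []

while len(delays) < 20000:
    if next_pkt < next_probe:
        # the queue drains at line rate between events, then a packet joins
        work = max(0.0, work - (next_pkt - last)) + service
        last = next_pkt
        next_pkt += random.expovariate(pkt_rate)
    else:
        work = max(0.0, work - (next_probe - last))
        delays.append(work)     # head-of-line delay this probe would see
        last = next_probe
        next_probe += random.expovariate(PROBE_RATE)

empty = sum(d == 0.0 for d in delays) / len(delays)
nonzero = [d for d in delays if d > 0.0]
print(f"probes finding the queue empty: {empty:.1%}")   # close to 1 - LOAD
print(f"distinct non-zero delays: {len(set(nonzero))} of {len(nonzero)}")
```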
The time-averaged fractional load on the circuit a queue is feeding is also the probability of finding the queue non-empty. If the average load on the output circuit is less than 100% then multiple samples are probably going to find that queue precisely empty; if the average load on the output circuit is 50% (an unusually high number in a LAN, though maybe less unusual in other contexts) then 50% of the samples that pass through that queue are going to find it empty. Since samples that found the queue empty will have experienced pretty much identical delays, the "results" (for some value of "result") from those samples will cluster closely together. The results from samples which experienced a queueing delay will differ from that cluster but, as discussed above, will also differ from each other and generally won't form a cluster somewhere else. The cluster marks the good spot: if we can find it we have a result which does not depend on understanding the precise (and precisely unknowable) statistics governing the samples outside the cluster.

Given this it is also worthwhile to consider "jitter", which intuition based on an assumption of normally distributed errors might suggest should be predictive of the quality of the result derived from a collection of samples. In the situation above, however, the dominant contributors to "jitter", however measured, are going to be the samples outside the cluster, since they are the ones doing the "jittering" (that is the very property we rely on to define the cluster). If jitter mostly measures the samples the estimate doesn't rely on, it tells you little about the samples the estimate does rely on, and hence can provide no prediction of the quality of an estimate derived from those samples alone.
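To illustrate finding the cluster, here is a toy estimator (a sketch under invented numbers, not anyone's production algorithm): each sample is a hypothetical true offset, plus a heavy-tailed queueing delay for the half that found the queue busy, and the estimate is taken from the tightest group of sorted samples rather than from the mean.

```python
import random
import statistics

random.seed(7)

TRUE_OFFSET = 0.250    # ms: a made-up clock offset we are trying to recover

def sample():
    # Half the probes find the queue empty (50% load); the other half see
    # a heavy-tailed (Pareto) queueing delay.  A little gaussian noise
    # stands in for timestamping error.  All numbers are invented.
    queued = 0.0 if random.random() < 0.5 else 0.05 * random.paretovariate(1.5)
    return TRUE_OFFSET + queued + random.gauss(0.0, 0.001)

samples = [sample() for _ in range(64)]

# Naive estimate: the mean, dragged off by the heavy tail.
mean_est = statistics.mean(samples)

# Cluster estimate: the tightest window of k consecutive sorted samples,
# which is where the empty-queue samples pile up.
k = 16
s = sorted(samples)
start = min(range(len(s) - k + 1), key=lambda i: s[i + k - 1] - s[i])
cluster_est = statistics.mean(s[start:start + k])

print(f"mean estimate:    {mean_est:.4f}")
print(f"cluster estimate: {cluster_est:.4f}  (true offset {TRUE_OFFSET})")
```

The mean lands well above the true offset because the outliers are all one-sided; the cluster estimate barely notices them.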
In fact, in a true perversion of normal intuition, high jitter and heavy-tailed probability distributions might even make it easier to get a good result, by making it easier to identify the cluster. Saying "I see a lot of jitter" doesn't necessarily tell you anything about what is possible.

While the argument gets a lot more complex in a hurry, and is too much to attempt here (the above is too much already), I believe this general approach can scale to a whole large network of devices with queues (though even the single-switch case has real-life relevance). That is, I think it is possible to find a sample "result" for which there is a strong tendency for "good" samples to cluster together while "bad" samples are unlikely to do so, with the quality of the result depending on the population and variability of the cluster but hardly at all on the outliers, and with the lack of a measurable cluster telling you when you might be better off relying on your local clock rather than the network. The approach relies on the things we do know about networks and networking equipment while avoiding reliance on things we can't know: in particular, it mostly avoids making gaussian statistical assumptions about distributions that may not be gaussian. The field of robust statistics provides tools addressing this which might be of use.

I guess it is worth completing this by mentioning what it says about ntpd. First, ntpd knows all of the above, probably much, much better than I do, though it might not put it in quite the same terms. If you assume that the stochastic delays experienced by samples are evenly distributed between the outbound and inbound paths (not a good match for the real world, by the way, but there are constraints...) then round-trip delay becomes a stand-in measure of "cluster", and ntpd does what it can with this. The fundamental constraint that limits what ntpd can do, in a couple of ways, is the fact that the final stage of its filter is a PLL.
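A simplified sketch of that flavour of filtering (the offset/delay arithmetic is the standard NTP on-wire calculation; the eight-sample minimum-delay filter is in the spirit of ntpd's clock filter, not its actual code):

```python
from collections import deque

def offset_and_delay(t1, t2, t3, t4):
    # Standard NTP on-wire calculation: t1/t4 are client send/receive
    # times, t2/t3 are server receive/send times.  The offset is exact
    # only if the outbound and inbound delays happen to be equal.
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay

class ClockFilter:
    # Keep the last 8 (offset, delay) samples and trust the one with the
    # smallest round-trip delay, on the theory that it was the least
    # likely to have sat in a queue in either direction.
    def __init__(self, size=8):
        self.samples = deque(maxlen=size)

    def add(self, offset, delay):
        self.samples.append((offset, delay))

    def best_offset(self):
        return min(self.samples, key=lambda s: s[1])[0]

f = ClockFilter()
for off, rtt in [(0.31, 0.9), (0.25, 0.2), (0.40, 1.7), (0.26, 0.3)]:
    f.add(off, rtt)
print(f.best_offset())   # 0.25: the offset from the lowest-delay exchange
```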
The integrator in a PLL assumes that the errors in the samples it is being fed are zero-mean and normally distributed, and will fail to arrive at a correct answer if this is not the case, so if you want to filter out samples for which this is unlikely to hold you need to do it before they reach the PLL. The problem with doing this well, however, is that a PLL is also destabilised by added delay in its feedback path, causing errors of a different nature, so anything done before the PLL is severely limited in the time it can spend, and hence in the number of samples it can look at. Doing better probably requires replacing the PLL; the "replace it with what?" question is truly interesting.

I suspect I've gone well off topic for this list, however, and for that I apologize. I just wanted to make sure it was understood that there is an argument for the view that we do not yet know of any fundamental limits on the precision that NTP, or a network time protocol like NTP, might achieve, so any effort to build NTP servers and clients which can make their measurements more precisely is not a waste of time. It is instead what is required to make progress in understanding how to do this better.

Dennis Ferguson

_______________________________________________
time-nuts mailing list -- [email protected]
To unsubscribe, go to https://www.febo.com/cgi-bin/mailman/listinfo/time-nuts
and follow the instructions there.
