Re: [TICTOC] [ntpwg] comments on draft-stenn-ntp-suggest-refid-00

Harlan Stenn Tue, 05 Apr 2016 14:17:49 -0700

Danny Mayer writes:
> On 4/5/2016 3:21 PM, Sharon Goldberg wrote:
> > On Tue, Apr 5, 2016 at 3:00 PM, Danny Mayer <[email protected]> wrote:
> >
> >     On 4/5/2016 2:28 PM, Sharon Goldberg wrote:
> >     > Dear WG,
> >     >
> >     > To follow up on my comments on draft-stenn-ntp-suggest-refid-00 at the
> >     > IETF'95 WG meeting just now. The current draft requires the use of an
> >     > extension field.  I believe the goals of the draft can be accomplished
> >     > without using an extension field, in a backwards compatible fashion.
> >     >
> >     > The goal of the draft is to limit the information exposed by the REFID
> >     > while still preserving robustness to "length-1" timing loops where
> >     > system A takes time from system B, but system B takes time from
> >     > system A.  This proposal allows system A to limit the info it leaks in
> >     > its refID, without harming any of its legacy clients.
> >     >
> >     > Suppose system A is taking time from system B. Then there are two cases:
> >     > 1) If A gets a time query from system B, A puts the IP of B in the refID
> >     > of its response. This way, even a legacy B can tell it cannot take time
> >     > from A because this would cause a timing loop.
> >     >
> >     > 2) If A gets a time query from system C, A puts a "nonsense" value in
> >     > its refID.  Even a legacy C can see that its IP is not in the refID,
> >     > and so it is allowed to take time from A.
> >     >
> >     > One question is what this "nonsense" value should be.  I think it
> >     > should be a fixed value. For example 0.0.0.0.  We would not want a
> >     > randomly-chosen value since this might collide with actual IP addresses
> >     > on the network.
> >     >
> >     > Thanks,
> >     > Sharon
> >
> >     The problem that I was alluding to in the jabber room is this. I'll keep
> >     the problem simple and to one example.
> >
> >     A takes its time from B over an IPv4 address. A then gets a time query
> >     from C over IPv6. How does it know if B and C are the same system or
> >     different system? A has no information about that.
> >
> >     Is this clearer?
> >
> > Yes, it's clear. How does the current ntpd implementation deal with this
> > problem?
>
> Currently it doesn't. One of my proposals is for each system to generate
> a RefID and use that for all interfaces. I'm not sure if that works for
> all cases or causes other issues.


It *hasn't* been an issue.  It would be easy enough to address it if
somebody felt there was a need to, however.

The point is that with your proposal, Sharon, I'm not seeing that it is
possible to address this case.  That is separate from whether or not
it's a problem that needs to be addressed, now or in the "near" future.
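
To make the case concrete, here is a minimal Python sketch of the
reply-side refid choice as described in the quoted proposal.  The function
name and the fixed 0.0.0.0 placeholder are illustrative assumptions, not
ntpd code, and the sketch assumes A reaches its upstream B over IPv4.

```python
import ipaddress

# Hypothetical fixed "nonsense" refid (0.0.0.0), per the quoted suggestion.
NONSENSE_REFID = bytes(4)

def refid_in_reply(upstream_addr: str, client_addr: str) -> bytes:
    """Pick the refid A puts in its reply to a given client.

    Assumes upstream_addr is an IPv4 address, so its packed form is the
    4-octet refid.  Only an exact address match reveals the upstream.
    """
    upstream = ipaddress.ip_address(upstream_addr)
    client = ipaddress.ip_address(client_addr)
    if client == upstream:
        return upstream.packed   # case 1: the query came from B's known address
    return NONSENSE_REFID        # case 2: anyone else sees the placeholder

# B queries from the address A syncs to: the loop is visible to B.
print(refid_in_reply("192.0.2.1", "192.0.2.1"))
# Unrelated client C: placeholder.
print(refid_in_reply("192.0.2.1", "198.51.100.5"))
# B queries over IPv6 instead: A cannot tell it is B, so placeholder again.
print(refid_in_reply("192.0.2.1", "2001:db8::1"))
```

The third call is the IPv4/IPv6 case Danny raised: the comparison is by
address, so the same system arriving from a different address looks like
a stranger.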

It's clear that if A has multiple IPs on it (very common now, not common
20 years ago), A could check all of its IPs when it saw B's refid
and decide if any of them meant B was sync'd to A.
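
A rough sketch of that check, in Python.  It assumes the refid
conventions of RFC 5905 (the four octets of an IPv4 address, or the first
four octets of the MD5 hash of an IPv6 address).  The function names are
hypothetical, and this is not how ntpd is written.

```python
import hashlib
import ipaddress

def refid_for(addr: str) -> bytes:
    """Return the 4-octet refid a peer would show if synced to addr."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return ip.packed                        # IPv4: the address itself
    return hashlib.md5(ip.packed).digest()[:4]  # IPv6: first 4 octets of MD5

def peer_synced_to_us(peer_refid: bytes, local_addrs) -> bool:
    """True if the peer's refid matches any of our own addresses."""
    return any(refid_for(a) == peer_refid for a in local_addrs)

# Example addresses are documentation prefixes, chosen for illustration.
local_addrs = ["192.0.2.10", "2001:db8::10"]
print(peer_synced_to_us(bytes([192, 0, 2, 10]), local_addrs))
print(peer_synced_to_us(refid_for("2001:db8::10"), local_addrs))
print(peer_synced_to_us(bytes([198, 51, 100, 7]), local_addrs))
```

Since an IPv6 refid is a truncated hash, a match is only a strong hint
rather than proof; a collision with an unrelated address is possible.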

In reality, we have only seen that if A sends a packet to B, B responds
to that same address on that same IP.  We have had cases where that
response has come over a different interface.

There are other possibilities, though.

Here are more scenarios.

A VM host maintains the primary clock, and "guests" use that clock.  In
this case, if A and B are on the same VM host, their underlying system
clock is actually the same one.

And that assumes one (common) scenario - the VMs are likely set up so
that all the VMs have tightly-synchronized time.

There's another use case I've been looking for - a VM setup where I
can configure each VM to have a clock that behaves independently, so I
can test wider behaviors more easily.

Is it sufficient to check for loops of degree 1?  Exactly what problems
are we solving here?

What about the situation where a single refclock is directly feeding
multiple servers?  If A and B are getting their time from the same
refclock, should that mean A and B should not fall back on each other if
one of them loses signal to their shared refclock?

If A and B are talking to the same type of refclock is that something
that should be addressed?  There have been cases where broken firmware
has caused a family of refclocks to all take simultaneous leaps to the
same spot in the weeds.

If A and B are both using GPS (for example) and there is a system glitch
where GPS time is affected, is that also a timing loop that we'd want to
avoid?  This happened in Jan 2016.

What about a group of systems that share a PPS signal?  The PPS signal
might become untrustworthy, or the "source of numbered seconds" might
break.

Many of these issues are currently(!) insignificant in a LAN setting
where we're looking at timing/frequency sync to around the millisecond
level (roughly).  I don't *know*, but I'd bet they are significant when
trying to get to the microsecond level, let alone the nanosecond level.
Hello PTP.  Hello RADclock and the work PHK is doing with the algorithms
he's using on Ntimed.  Once we learn more about these, we need to make
sure these different choices will "play nice" with each other.

Similarly, chrony uses algorithms that are slightly different from the
ones NTP uses.  We haven't seen studies of any potential loop
instabilities between these two approaches, either.

I mention this because 20 years ago we thought we were doing well
getting clocks to sync to within a tenth of a second.  Now, getting
sub-millisecond sync on a LAN is common.  I've spoken with folks who
told me that with some care and support, they are getting sync levels
with NTP that are very close to what they get with PTP - down to the
range of nanoseconds.  Where will we be 10 years from now?

I recall PHK saying that with PTP (or PTP-like stuff), he was able to
get offset and frequency sync on cold start in less than a second.  He
was doing this with a poll interval of about -9, or about 500
polls/second.
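
As a quick check of that arithmetic: the poll value is a base-2 exponent
in seconds, so a poll of -9 means 2^-9 s, a bit under 2 ms between polls.
A small Python sketch (the helper names are made up for illustration):

```python
def poll_interval_seconds(poll_exponent: int) -> float:
    """Interval between polls, in seconds, for a log2 poll exponent."""
    return 2.0 ** poll_exponent

def polls_per_second(poll_exponent: int) -> float:
    """Polling rate implied by a log2 poll exponent."""
    return 1.0 / poll_interval_seconds(poll_exponent)

print(poll_interval_seconds(-9))  # 0.001953125 s
print(polls_per_second(-9))       # 512.0, i.e. "about 500"
print(poll_interval_seconds(6))   # 64.0 s, for comparison
```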

I know DLM was solving a real and evidenced problem with the refid and
loop detection.

It's looking like we need to take a stronger look at this to decide if
those problems are still significant, and if different problems are
becoming significant.
-- 
Harlan Stenn <[email protected]>
http://networktimefoundation.org - be a member!

_______________________________________________
TICTOC mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/tictoc
