Re: [chrony-users] Possible bug in PPS support

2017-10-25 Thread Rob Janssen

Miroslav Lichvar wrote:
> On Tue, Oct 24, 2017 at 11:14:21PM +0200, Rob Janssen wrote:
> > I am now monitoring the Root dispersion and this appears to work OK, after
> > some tweaking of the threshold value.  The reference time unfortunately is
> > in a format that is not easy to check for "being recent" in a simple script,
> > it would be nice if there was a "seconds since epoch" field as well (as
> > there is in ntpd/ntpq).
>
> With the -c option, which is available in newer chrony versions, the
> reference timestamp is printed in "seconds since epoch".
>
> $ chronyc -c tracking | awk -F , '{ print $4 }'
> 1508912704.491908798


Thanks!  I have updated to 3.2 but had not re-read the manpage.
This format is much easier to parse in our monitoring plugin, so I'll rework it
to use this feature.

Rob

--
To unsubscribe email chrony-users-requ...@chrony.tuxfamily.org 
with "unsubscribe" in the subject.
For help email chrony-users-requ...@chrony.tuxfamily.org 
with "help" in the subject.

Trouble?  Email listmas...@chrony.tuxfamily.org.



Re: [chrony-users] Possible bug in PPS support

2017-10-25 Thread Miroslav Lichvar
On Tue, Oct 24, 2017 at 11:14:21PM +0200, Rob Janssen wrote:
> I am now monitoring the Root dispersion and this appears to work OK, after
> some tweaking of the threshold value.  The reference time unfortunately is
> in a format that is not easy to check for "being recent" in a simple script,
> it would be nice if there was a "seconds since epoch" field as well (as
> there is in ntpd/ntpq).

With the -c option, which is available in newer chrony versions, the
reference timestamp is printed in "seconds since epoch".

$ chronyc -c tracking | awk -F , '{ print $4 }'
1508912704.491908798
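
The epoch value turns the "being recent" check into a single subtraction. A minimal Python sketch, assuming only what the example above shows (the fourth comma-separated field of `chronyc -c tracking` is the reference time); the function names and the 60-second limit are illustrative assumptions, not part of chrony:

```python
import subprocess


def read_tracking_csv():
    """Run `chronyc -c tracking` and return its single CSV line.

    Needs chronyc in PATH and a running chronyd; it is not called at
    import time, so the helpers below can be tested without chrony.
    """
    result = subprocess.run(["chronyc", "-c", "tracking"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def reference_age(csv_line, now):
    """Seconds elapsed between the reference time and `now`.

    The reference time is the fourth comma-separated field, in seconds
    since the epoch, as in the awk example above.
    """
    ref_time = float(csv_line.strip().split(",")[3])
    return now - ref_time


def is_recent(csv_line, now, max_age=60.0):
    """True if the last clock update is at most max_age seconds old.

    max_age=60.0 is an arbitrary illustrative limit.
    """
    return reference_age(csv_line, now) <= max_age
```

In a monitoring plugin, something like `is_recent(read_tracking_csv(), time.time())` could drive the alert; a sensible `max_age` depends on the polling interval of the reference clock.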

-- 
Miroslav Lichvar




Re: [chrony-users] Possible bug in PPS support

2017-10-24 Thread Rob Janssen

Miroslav Lichvar wrote:
> I think the best approach for checking the accuracy of the clock is to
> monitor the root delay+dispersion. That's the estimated maximum error
> of the clock. If you really wanted to make sure an update of the clock
> was made in the last X seconds, you can check the reference time.
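
The suggestion above can be sketched as a small check on the human-readable `chronyc tracking` report. This is a minimal Python sketch, assuming the report contains "Root delay" and "Root dispersion" lines of the form "Label : value seconds"; the function names and the 20 us limit (taken from the accuracy requirement discussed in this thread) are illustrative assumptions:

```python
def parse_tracking(text):
    """Map each 'Label : value' line of a tracking report to its value string."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so time-of-day values stay intact.
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def max_error_estimate(text):
    """Root delay plus root dispersion in seconds, as suggested above."""
    fields = parse_tracking(text)
    delay = float(fields["Root delay"].split()[0])
    dispersion = float(fields["Root dispersion"].split()[0])
    return delay + dispersion


def error_within(text, limit=20e-6):
    """True while the estimated maximum error stays within `limit` seconds."""
    return max_error_estimate(text) <= limit
```

Because the dispersion keeps growing while no new samples arrive, a check like this also fires when the reference goes silent, which a check on stratum or last offset does not.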



I am now monitoring the Root dispersion and this appears to work OK, after
some tweaking of the threshold value.  The reference time unfortunately is in
a format that is not easy to check for "being recent" in a simple script, it
would be nice if there was a "seconds since epoch" field as well (as there is
in ntpd/ntpq).  But well, it looks like the dispersion increases rapidly when
there is no PPS reference and this is much like what I require.
(after all, the same is happening to the uncertainty of the time for our
application)

Thanks for the hint!

Rob





Re: [chrony-users] Possible bug in PPS support

2017-10-24 Thread Bill Unruh


On Mon, 23 Oct 2017, Bill Unruh wrote:

> If that is all you want, then you could look at the "refclock" log and see

Sorry. That's refclocks.log

> when the last successful input came in. If it is more than say 15 min ago
> then the reach would be down to 0 and the refclock would have stopped.
>
> Or you could run chronyc from cron, use the sources command, look at the
> reach, and raise an error flag if it is 0.


William G. Unruh __| Canadian Institute for| Tel: +1(604)822-3273
Physics _|___ Advanced Research _| Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology | un...@physics.ubc.ca
Canada V6T 1Z1 | and Gravity __|_ www.theory.physics.ubc.ca/










Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh

If that is all you want, then you could look at the "refclock" log and see
when the last successful input came in. If it is more than say 15 min ago then
the reach would be down to 0 and the refclock would have stopped.

Or you could run chronyc from cron, use the sources command, look at the
reach, and raise an error flag if it is 0.
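
A cron-driven check along these lines could look as follows. This is a minimal Python sketch, assuming the `chronyc sources` layout quoted elsewhere in this thread (a two-character mode/state marker, then name, stratum, poll, and the reach register printed in octal); the function names are illustrative:

```python
def parse_reach(sources_text):
    """Return a list of (source name, reach) pairs from `chronyc sources` output.

    Data rows start with a mode character: '^' (server), '=' (peer) or
    '#' (local reference clock). The reach register is the fifth column
    and is printed in octal, so 377 means the last eight polls succeeded.
    """
    rows = []
    for line in sources_text.splitlines():
        parts = line.split()
        # Skip the header, the separator line and any status line.
        if len(parts) < 6 or parts[0][0] not in "#^=":
            continue
        rows.append((parts[1], int(parts[4], 8)))
    return rows


def unreachable_sources(sources_text):
    """Names of all sources whose reach register has dropped to zero."""
    return [name for name, reach in parse_reach(sources_text) if reach == 0]
```

A cron job or monitoring plugin would feed the captured output of `chronyc sources` into `unreachable_sources()` and raise the error flag when the PPS source appears in the result.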


William G. Unruh __| Canadian Institute for| Tel: +1(604)822-3273
Physics _|___ Advanced Research _| Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology | un...@physics.ubc.ca
Canada V6T 1Z1 | and Gravity __|_ www.theory.physics.ubc.ca/








Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Bill Unruh wrote:
> On Mon, 23 Oct 2017, Rob Janssen wrote:
> > Bill Unruh wrote:
> > > If you really need 20usec, then relying on one gps is certainly a bad
> > > decision. You should have two or three machines all with independent gps
> > > sources so you could catch one of them going rogue, or quitting.
> >
> > The GPSDOs we are using are 2-3 orders of magnitude better than that.
> > These are not your typical $50 modules, but professional GPSDO with OCXO
> > or better oscillator.
>
> It is not the accuracy of the individual gps but the fallback in case one
> of them goes mad (as happened to you). You do not want them on the same
> machine unless they have hardware timestamping, since the interrupt latency is
> far larger than 1us for servicing each interrupt.

Again you are wandering away from the topic Bill!
The discussion is about detection of a possible problem, not about availability.
I did not specify availability of the system, it may well be down when there is
a component failure, but we only want to know about it.

> > Monitoring of their accuracy is done by their owners, we only get the signal
> > via distribution amplifiers.  That is why we would prefer to have some
> > additional validation, like the PPS signal completely missing.
> > (which could also be caused by a mistakenly unplugged or cut cable, which
> > would never be detected by the GPSDO monitoring)
>
> As I said, you could do that with a cron job checking every 5 min.

We already have a comprehensive monitoring system based on Nagios, which for
this service uses "chronyc -h host tracking" to regularly retrieve the status
of chrony and alerts responsible people when something is wrong.

The issue is that it monitors "stratum" and "last offset" and it failed to
trigger when the PPS signal went away, even after 13 hours.  It would have
triggered when stratum went above 1 or last offset above 20us, but it didn't.
Both of these values remain frozen when there is no PPS.

That is the issue I want to rectify, but that won't happen when I discuss with
you.
Fortunately there is Miroslav who gave me useful hints.

Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh

On Mon, 23 Oct 2017, Rob Janssen wrote:
> Bill Unruh wrote:
> > If you really need 20usec, then relying on one gps is certainly a bad
> > decision. You should have two or three machines all with independent gps
> > sources so you could catch one of them going rogue, or quitting.
>
> The GPSDOs we are using are 2-3 orders of magnitude better than that.
> These are not your typical $50 modules, but professional GPSDO with OCXO
> or better oscillator.

It is not the accuracy of the individual gps but the fallback in case one
of them goes mad (as happened to you). You do not want them on the same
machine unless they have hardware timestamping, since the interrupt latency is
far larger than 1us for servicing each interrupt.

> Monitoring of their accuracy is done by their owners, we only get the signal
> via distribution amplifiers.  That is why we would prefer to have some
> additional validation, like the PPS signal completely missing.
> (which could also be caused by a mistakenly unplugged or cut cable, which
> would never be detected by the GPSDO monitoring)

As I said, you could do that with a cron job checking every 5 min.

> > You seem to be saying that having no time source whatsoever is better than
> > having one which may be off by 20us? I think you need to set out the real
> > conditions that you need in detail ("We need accuracy to 20us" could be
> > because it was a number that some administrator with absolutely no idea of
> > time came up with, or it could be a legal requirement, or it could be "we
> > should be able to do that" kind of requirement)
>
> The time is used for a single-channel simulcast transmitter system. That is,
> the same signal is transmitted from multiple locations on the same frequency
> at the same time.  When this is not done within 20us at the same time, it
> will cause severe distortion of the signal.  When we don't know we are within
> 20us, we prefer to not transmit at all, so disable that particular
> transmitter.

OK then you should have redundancy on each transmitter, and monitoring eg via
that cron job.

> I think I know better what is involved and what the limitations are than you
> do.

Of course. But that is not what is at issue here.

> Also, I prefer to discuss with Miroslav, who concentrates on the problem
> under discussion rather than casting doubt on everything.  Thank you for your
> input until now.

???
You are making claims. I ask for what your evidence is for those claims, and
you have never given the evidence. Operating on false evidence is a sure way
of making bad decisions.
I am not casting doubt on everything. I am trying to explain how chrony works
and why it does what it does.

> Rob







Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Bill Unruh wrote:
> If you really need 20usec, then relying on one gps is certainly a bad
> decision. You should have two or three machines all with independent gps
> sources so you could catch one of them going rogue, or quitting.

The GPSDOs we are using are 2-3 orders of magnitude better than that.
These are not your typical $50 modules, but professional GPSDO with OCXO
or better oscillator.
Monitoring of their accuracy is done by their owners, we only get the signal
via distribution amplifiers.  That is why we would prefer to have some
additional validation, like the PPS signal completely missing.
(which could also be caused by a mistakenly unplugged or cut cable, which
would never be detected by the GPSDO monitoring)

> You seem to be saying that having no time source whatsoever is better than
> having one which may be off by 20us? I think you need to set out the real
> conditions that you need in detail ("We need accuracy to 20us" could be
> because it was a number that some administrator with absolutely no idea of
> time came up with, or it could be a legal requirement, or it could be "we
> should be able to do that" kind of requirement)

The time is used for a single-channel simulcast transmitter system. That is,
the same signal is transmitted from multiple locations on the same frequency
at the same time.  When this is not done within 20us at the same time, it will
cause severe distortion of the signal.  When we don't know we are within 20us,
we prefer to not transmit at all, so disable that particular transmitter.

I think I know better what is involved and what the limitations are than you do.
Also, I prefer to discuss with Miroslav, who concentrates on the problem under
discussion rather than casting doubt on everything.  Thank you for your input
until now.

Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Bill Unruh wrote:
> On Mon, 23 Oct 2017, Miroslav Lichvar wrote:
> > On Mon, Oct 23, 2017 at 10:24:54AM -0700, Bill Unruh wrote:
> > > On Mon, 23 Oct 2017, Rob Janssen wrote:
> > > > You don't support my calculation that if the clock apparently wandered
> > > > away 3400us
> > >
> > > Again, no evidence of that 3400 us.
> >
> > If I understand it correctly, 3.4ms was the offset of the NTP source
>
> My question is how he determined that the offset was 3.4 ms after 13 hours.
> Simply looking at the offset from one of the ntp servers does not cut it.
> That is only 2 std dev from the mean.

I pasted the output for a single server but the other 2 were within a very
short offset of that:

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4     0   13h   -279ns[ -401ns] +/-   79ns
^- xx..xxx                       1  10   377   17m  +3476us[+3476us] +/- 9930us
^- xx..xxx                       1  10   377   250  +3462us[+3462us] +/-   10ms
^- xxx.xx..xxx                   1  10   377   299  +3459us[+3459us] +/-   10ms

I am confident that those offsets were correct, but as I mentioned I forgot to
subtract the offset that was already there when the PPS sync was present (due
to network delay asymmetry).

> So that part of his concern is certainly valid. On the other hand, being
> worried about the loss of connectivity on the 15 min time scale probably is
> not, unless he has evidence.
> But the evidence is all there in the measurement logs when the pps is running.
> He could use that to estimate what the skew is over a variety of time periods.

Again, I am not interested in the performance when the clock is free-running as
I do not believe that it is good enough for our application anyway.
I am interested in monitoring/detecting that the clock is not synced to
(recent) PPS input.

Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh

If you really need 20usec, then relying on one gps is certainly a bad
decision. You should have two or three machines all with independent gps
sources so you could catch one of them going rogue, or quitting.

You seem to be saying that having no time source whatsoever is better than
having one which may be off by 20us? I think you need to set out the real
conditions that you need in detail ("We need accuracy to 20us" could be
because it was a number that some administrator with absolutely no idea of
time came up with, or it could be a legal requirement, or it could be "we
should be able to do that" kind of requirement). Then implement a system which
can with some confidence deliver that. There are of course no guarantees. A
nuke over the building would severely degrade the accuracy of the clocks in a
way that was totally unpredictable beforehand. Or a power failure. etc.



William G. Unruh __| Canadian Institute for| Tel: +1(604)822-3273
Physics _|___ Advanced Research _| Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology | un...@physics.ubc.ca
Canada V6T 1Z1 | and Gravity __|_ www.theory.physics.ubc.ca/









Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh


On Mon, 23 Oct 2017, Miroslav Lichvar wrote:
> On Mon, Oct 23, 2017 at 10:24:54AM -0700, Bill Unruh wrote:
> > On Mon, 23 Oct 2017, Rob Janssen wrote:
> > > You don't support my calculation that if the clock apparently wandered
> > > away 3400us
> >
> > Again, no evidence of that 3400 us.
>
> If I understand it correctly, 3.4ms was the offset of the NTP source

My question is how he determined that the offset was 3.4 ms after 13 hours.
Simply looking at the offset from one of the ntp servers does not cut it.
That is only 2 std dev from the mean.

> 13 hours after the PPS stopped working. The stddev of the NTP source
> from sourcestats is ~50 microseconds, so if the offset was originally
> better than 1.6ms (there are 3 different sources in the original
> report with -0.1ms, 1.5ms, 1.5ms offsets), it drifted at least by
> ~1.8ms in that time.
>
> If there was a significant change in the temperature, the error gained
> in 13 hours could be much larger. On one of my servers with PPS I see
> that the frequency offset can change by 0.5 ppm in just a few seconds.

I agree, and chrony PPS does a bad job of measuring that. Perhaps chrony
should keep track of the drift over a much longer period than the measurement
period (max 64 samples at 16 sec per sample is only about 15 min), so keeping
a list of the drift rate over say a day would give a much better feeling for
the drift wander due to temp differences, etc. It is certainly true that the
drift fluctuations are not gaussian so an estimate derived from 15 min really
gives a very poor estimate of the fluctuations on the time scale of hours or
days.

So that part of his concern is certainly valid. On the other hand, being
worried about the loss of connectivity on the 15 min time scale probably is
not, unless he has evidence.
But the evidence is all there in the measurement logs when the pps is running.
He could use that to estimate what the skew is over a variety of time periods.

> --
> Miroslav Lichvar







Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Miroslav Lichvar wrote:
> On Mon, Oct 23, 2017 at 10:24:54AM -0700, Bill Unruh wrote:
> > On Mon, 23 Oct 2017, Rob Janssen wrote:
> > > You don't support my calculation that if the clock apparently wandered
> > > away 3400us
> >
> > Again, no evidence of that 3400 us.
>
> If I understand it correctly, 3.4ms was the offset of the NTP source
> 13 hours after the PPS stopped working. The stddev of the NTP source
> from sourcestats is ~50 microseconds, so if the offset was originally
> better than 1.6ms (there are 3 different sources in the original
> report with -0.1ms, 1.5ms, 1.5ms offsets), it drifted at least by
> ~1.8ms in that time.



Yes, I forgot that there was a systematic 1.4 ms offset at the time the PPS
sync was active so after it ran unsynced for 13h and had a 3.4 ms offset the
drift was more like 2ms instead of 3.4ms.
However, that still is 2 orders of magnitude more than we can allow.  So we
certainly need to alert on this condition, we cannot just freewheel for 13
hours and assume the time is still accurate enough.
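
The arithmetic behind these figures can be written out. This is a small Python sketch that assumes, purely for illustration, that the drift was linear over the 13 hours; the function names are made up for this example:

```python
HOUR = 3600.0  # seconds


def drift_rate(offset_now, offset_before, elapsed_seconds):
    """Average drift in seconds per second over the unsynchronised interval."""
    return (offset_now - offset_before) / elapsed_seconds


def time_to_exceed(limit, rate):
    """Seconds until a constant drift `rate` accumulates `limit` seconds of error."""
    return limit / abs(rate)


# 3.4 ms observed after 13 h, minus the 1.4 ms systematic offset that was
# already present while the PPS was still working.
rate = drift_rate(3.4e-3, 1.4e-3, 13 * HOUR)
print(f"average drift: {rate * 1e6:.4f} ppm")
print(f"20 us exceeded after about {time_to_exceed(20e-6, rate) / 60:.1f} min")
```

With the corrected 2 ms figure this gives an average drift of roughly 0.04 ppm and about 8 minutes to accumulate 20 us, against roughly 5 minutes for the uncorrected 3.4 ms. Real drift is not linear, so these are only order-of-magnitude figures.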

I am now testing with the root delay/dispersion.   A couple of minutes after
the PPS has been removed, the root delay remains at 0.1 seconds but the root
dispersion now has increased to 0.000662125 seconds.  That certainly is a value
that is immediately affected by the lack of sync, however I need to determine a
threshold value for the monitoring alert.
The tracking also shows "System time : 0.9 seconds fast of NTP time"
but I cannot believe the time is still that accurate.

I understand now that the 10ms value shown in "chronyc sources" is based on the
20ms round-trip time of the network towards the NTP source.  This time is quite
constant as indicated by the low Std Dev but the fixed RTT apparently makes
chrony believe the network is dodgy (as Bill expresses it).  The only thing
dodgy about it is that for this particular site there is a systematic offset in
the propagation time from/towards the site of 1.4 ms resulting in the 1.4ms
offset observed when PPS is available, probably caused by asymmetric routing.
Other than that, it is quite stable. It is a network designed for distribution
of audio and video to transmitter sites, well dimensioned with guaranteed
bandwidth and not overloaded at any time.
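
How a fixed path asymmetry shows up as a constant offset can be illustrated with the standard NTP on-wire calculation. In this Python sketch the `exchange` helper and the 11.4 ms / 8.6 ms split of the 20 ms round trip are hypothetical, chosen only to reproduce the figures above; chronyc may display the offset with the opposite sign convention:

```python
def ntp_offset(t1, t2, t3, t4):
    """Standard NTP offset estimate from the four on-wire timestamps.

    t1: client transmit, t2: server receive,
    t3: server transmit, t4: client receive.
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def ntp_delay(t1, t2, t3, t4):
    """Round-trip delay with the server's processing time removed."""
    return (t4 - t1) - (t3 - t2)


def exchange(true_offset, delay_out, delay_back, t1=0.0, processing=0.0):
    """Simulate one exchange; the server clock reads client time + true_offset."""
    t2 = t1 + delay_out + true_offset                  # server clock
    t3 = t2 + processing                               # server clock
    t4 = t1 + delay_out + processing + delay_back      # client clock
    return t1, t2, t3, t4


# Both clocks agree exactly, but the two directions take different times.
stamps = exchange(true_offset=0.0, delay_out=11.4e-3, delay_back=8.6e-3)
print(f"apparent offset: {ntp_offset(*stamps) * 1e3:.2f} ms")
print(f"round-trip delay: {ntp_delay(*stamps) * 1e3:.2f} ms")
```

The apparent offset is half the difference between the two one-way delays, so it can never exceed half the round-trip delay. That is consistent with a +/- figure of about 10 ms for a 20 ms round trip, however stable the path is.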

Rob





Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Miroslav Lichvar
On Mon, Oct 23, 2017 at 10:24:54AM -0700, Bill Unruh wrote:
> On Mon, 23 Oct 2017, Rob Janssen wrote:
> > You don't support my calculation that if the clock apparently wandered
> > away 3400us
> 
> Again, no evidence of that 3400 us.

If I understand it correctly, 3.4ms was the offset of the NTP source
13 hours after the PPS stopped working. The stddev of the NTP source
from sourcestats is ~50 microseconds, so if the offset was originally
better than 1.6ms (there are 3 different sources in the original
report with -0.1ms, 1.5ms, 1.5ms offsets), it drifted at least by
~1.8ms in that time.

If there was a significant change in the temperature, the error gained
in 13 hours could be much larger. On one of my servers with PPS I see
that the frequency offset can change by 0.5 ppm in just a few seconds.

-- 
Miroslav Lichvar




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh


On Mon, 23 Oct 2017, Rob Janssen wrote:


Bill Unruh wrote:


Ok but rather than "only a few hours" I would like to see "only a few 
minutes".


But that would be totally rediculous. The offset of the local clock from 
UTC
after even a few hours is still far far better than that from the network, 
and
far better even than 20us. Remember what you want to know is how far the 
local
clock is from UTC, not whether or not the local clock has not heard from 
PPS

in the past few minutes.
You don't support my calculation that if the clock apparently wandered away 
3400us


Again, no evidence of that 3400 us.

From the evidence that chrony has, pps does NOT wander that badly in 13 hrs. 

Remember chrony constantly measures both the standard deviation in the offset
AND in the rate. So it has a good estimate of how far the offset will wander
in that time.  And it is NOT 3400us. So you need to tell us how you measure
that 3400us.



after 13 hours, it would take about 5 minutes to wander 20us?


Not it would not. chrony has measured it, and it is not that much.


I would think it is a best-case calculation as it assumes a linear drift in 
one direction.
I practice it will probably wobble, and take less than 5 minutes to wander 
20us.




Please note we are talking MICROseconds here.  Not MILLIseconds.  I don't 
think
many standard systems will remain within 20us for several hours if left 
without sync.

(it would likely require some TCXO clock option)


Sure they could. If the temp is constant, as you claim, that is main cause of
changes in drift rate.






The Span indicated by sourcestats is 79 for the PPS source now, and 103m 
for

the network sources.
Would that mean it drops the PPS after 79 seconds?  That would be fine.


No. You really need to think through what you want and what the time on your
server machine delivers. After all if the computer clock in your local machine
was an exact track of UTC always to attoseconds, and you used the GPS only
to make the initial offset determination then it would be silly to
throw away that source just because the pps had not been heard from.


We are not interested in "time that is likely a good estimation". We require accurate time and if


I am sorry, but nothing will give you "accurate time", not even GPS. What it
can give you is an estimate of the time and the accuracy of that estimate.


we do not have it, or do not have certainty about it, we need to shutdown our application.
So we require some monitoring.  Of course I can add monitoring of "sources" or "sourcestats"
to the monitoring of "tracking" that we currently do, and alert when "Reach" of the PPS
clock is zero.  That is probably our quickest solution.  However, I would have expected this
error condition (missing PPS pulses) to be somehow reflected in the "tracking" output.


Why?




Rob






Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh



William G. Unruh __| Canadian Institute for| Tel: +1(604)822-3273
Physics _|___ Advanced Research _| Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology | un...@physics.ubc.ca
Canada V6T 1Z1 | and Gravity __|_ www.theory.physics.ubc.ca/

On Mon, 23 Oct 2017, Rob Janssen wrote:


Bill Unruh wrote:

210 Number of sources = 4
MS Name/IP address Stratum Poll Reach LastRx Last sample
===
#* PPS   0   4   377    24   +218ns[ +278ns] +/- 124ns
^- xx..xxx   1  10   377   877   -147us[ -122us] +/- 11ms
^- xx..xxx   1  10   377    14 +1480us[+1480us] +/- 10ms
^- xxx.xx..xxx   1  10   377   345 +1446us[+1447us] +/- 10ms


However, recently at one site the PPS signal was lost, but chrony keeps "locked" to it:


MS Name/IP address Stratum Poll Reach LastRx Last sample
===
#* PPS   0   4 0   13h   -279ns[ -401ns] +/- 79ns
^- xx..xxx   1  10   377   250 +3462us[+3462us] +/- 10ms


As can be seen, it has been lost for 13 hours but it still has the * sign in the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still indicated stratum 1 referenced to PPS.

I would have expected it to drop back to using those network time servers after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2.  When it would operate that way, we would have
received an alert.

Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms.  So it really is not considering those network time sources anymore.


Not sure what the above paragraph means. How do you know it has drifted by
3.5ms or 1 ms? I do not believe those figures, unless you meant 3.5us and
1usec, or unless by remote monitoring you mean really really remote with a
dodgy network in between.

Look in the above stats: it usually is at about 1.5ms (14xx us) from the network time sources,
and when the error condition occurred, it was at 3462us offset.
There is a network between the source and the system, but it isn't dodgy.


Yes, it is. Note that it is saying that the standard deviation is 10ms. That
one particular measurement was only off by 1.5ms does not tell one anything. 
The standard deviation tells much more.


And if it is off by 1.5 ms, that is still 1 times worse than the PPS.





Was this a test by the way where you unplugged the gps from the machine?
Otherwise figuring out why gps pps was lost for that period of time is
probably the first thing to do.

We know what happened: the GPSDO went defective so there were no PPS pulses anymore.
(and also no 10 MHz reference, which we need in another part of the system)


That is of course a different issue. And seeing no 10MHz reference is surely
something you can test for elsewhere.



What I would like to see is handling of the error condition.  Of course it is


The purpose of chrony is to discipline the local clock, not to test GPS
receivers.
You could run a cron job which looks at the PPS reach every 5 min and if it
finds it has dropped to 0, it can do something like let you know your gps has
problems. But why should that be chrony's job? It is giving you the best
estimate of UTC it can given the data. I certainly would not want it giving me
worse estimates.
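
A minimal sketch of the kind of periodic check suggested above, written in Python. It assumes the plain-text "chronyc sources" layout quoted in this thread (flags, name, stratum, poll, reach in octal, LastRx, ...). The function name unreachable_sources and the sample text are illustrative only; a real cron job would feed it the text it gets from running "chronyc sources" itself.

```python
def unreachable_sources(sources_text):
    """Return the names of sources whose Reach register is 0.

    Assumption: each source line looks like the `chronyc sources` output
    quoted in this thread: flags, name, stratum, poll, reach (octal), ...
    """
    dead = []
    for line in sources_text.splitlines():
        fields = line.split()
        # Source lines start with a mode character:
        # '#' (refclock), '^' (server) or '=' (peer).
        if len(fields) < 5 or fields[0][0] not in "#^=":
            continue
        name, reach = fields[1], fields[4]
        try:
            if int(reach, 8) == 0:
                dead.append(name)
        except ValueError:
            # Not a reach value; skip lines we do not understand.
            continue
    return dead


# Sample text copied from the failure case reported in this thread.
sample = """\
MS Name/IP address Stratum Poll Reach LastRx Last sample
#* PPS   0   4 0   13h   -279ns[ -401ns] +/- 79ns
^- xx..xxx   1  10   377   250 +3462us[+3462us] +/- 10ms
"""
print(unreachable_sources(sample))
```

A non-empty result would be the point at which the job sends its alert; what the alert does is left out here.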



understandable that
there is no time syncing when there are no PPS pulses, but the condition


Sure there is. You can still use the past info from PPS to sync the current
clock.


should be visible.
(e.g. by the stratum increasing and/or the source changing)

Miroslav is better placed to figure out what is happening within chrony when
it loses pps input. Given the uncertainty in the rate as estimated from the
PPS it, 13 hrs ago, is still probably a better estimate of the current time
than is the network time from the other systems.

It isn't!  Network time from the other systems would be about 1500us out, time was now 3400us out.


No idea what you mean. As I said I have seen no evidence about how you
determined those figures.



However, that is not the main point.
Remember that they are at poll 10 which is 1000 seconds or so (about 15 min)
so the network time sources have not had that many "measurements" in that time
interval and those are pretty crappy (10ms std dev which is really huge).
The PPS std dev is in the ns range-- about  1 times better.

I don't think the shown output in the last column of "chronyc sources" is the stddev.
Right now that column still indicates 10ms, but when I use "chronyc sourcestats" the last
column actually has a header Std Dev and the values are around 40-60us.


So the PPS is
still, even 13 hrs later, a better 

Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Miroslav Lichvar wrote:
One point I forgot to make is that even if chronyd reselected
immediately after the reach value of the PPS refclock got to 0, like
ntpd does, checking stratum or selected source wouldn't be a reliable
way to monitor the accuracy, because the reselection wouldn't happen
if the NTP source was down too.


Ok I will experiment with watching the root delay and -dispersion and see how 
they behave when removing PPS on my test system.
At the moment (after being locked for 8 hours or so) it shows:

Root delay  : 0.1 seconds
Root dispersion : 0.10389 seconds
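
A rough sketch of how such a root dispersion check could look in Python. It assumes the "Root dispersion : <value> seconds" line format shown above. The sample value and the 20us limit are made up for illustration, and the names root_dispersion_seconds and dispersion_ok are hypothetical.

```python
import re


def root_dispersion_seconds(tracking_text):
    """Extract the 'Root dispersion' value (in seconds) from tracking text."""
    match = re.search(
        r"^Root dispersion\s*:\s*([0-9.eE+-]+)\s+seconds",
        tracking_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Root dispersion' line found")
    return float(match.group(1))


def dispersion_ok(tracking_text, limit_seconds):
    """True while the reported root dispersion stays within the limit."""
    return root_dispersion_seconds(tracking_text) <= limit_seconds


# Hypothetical sample, not a value taken from this thread.
sample = "Root delay      : 0.000001 seconds\nRoot dispersion : 0.000012 seconds\n"
print(dispersion_ok(sample, 20e-6))
```

The limit needs tuning on a real system by watching how the value behaves with and without PPS.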

Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Miroslav Lichvar
On Mon, Oct 23, 2017 at 07:01:30PM +0200, Rob Janssen wrote:
> Miroslav Lichvar wrote:
> > 
> > The last skew was 14 ppb, so it would take about 8 days to accumulate
> > 10 milliseconds worth of dispersion.
> 
> Can you explain where the 10ms comes from?  I know it is displayed in the 
> "sources" output,
> but how is it calculated?  It is way above the StdDev indicated in the 
> "sourcestats".
> And of course it is also way above our usual accuracy.

It includes the root delay and distance. Check "chronyc ntpdata". Most
of that is probably the round-trip time to the server.

One point I forgot to make is that even if chronyd reselected
immediately after the reach value of the PPS refclock got to 0, like
ntpd does, checking stratum or selected source wouldn't be a reliable
way to monitor the accuracy, because the reselection wouldn't happen
if the NTP source was down too.

-- 
Miroslav Lichvar




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Miroslav Lichvar wrote:


The last skew was 14 ppb, so it would take about 8 days to accumulate
10 milliseconds worth of dispersion.
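
The "about 8 days" figure can be checked with a few lines of Python. This is only the arithmetic from the statement above (a constant 14 ppb rate error accumulating to 10 ms), not chrony's actual dispersion computation; the function name is made up.

```python
def seconds_to_accumulate(error_seconds, rate_ppb):
    """Time for a constant rate error (in ppb) to add up to error_seconds."""
    return error_seconds / (rate_ppb * 1e-9)


seconds = seconds_to_accumulate(10e-3, 14.0)
days = seconds / 86400
print(f"{seconds:.0f} s = {days:.1f} days")
```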


Can you explain where the 10ms comes from?  I know it is displayed in the 
"sources" output,
but how is it calculated?  It is way above the StdDev indicated in the 
"sourcestats".
And of course it is also way above our usual accuracy.

Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Bill Unruh wrote:



Ok but rather than "only a few hours" I would like to see "only a few minutes".


But that would be totally ridiculous. The offset of the local clock from UTC
after even a few hours is still far far better than that from the network, and
far better even than 20us. Remember what you want to know is how far the local
clock is from UTC, not whether or not the local clock has heard from PPS
in the past few minutes.

You don't support my calculation that if the clock apparently wandered away 3400us
after 13 hours, it would take about 5 minutes to wander 20us?
I would think it is a best-case calculation as it assumes a linear drift in one direction.
In practice it will probably wobble, and take less than 5 minutes to wander 20us.

Please note we are talking MICROseconds here.  Not MILLIseconds.  I don't think
many standard systems will remain within 20us for several hours if left without 
sync.
(it would likely require some TCXO clock option)
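
For reference, the 5 minute estimate above works out as follows in Python. It only reproduces the linear-drift arithmetic and takes the disputed 3400us in 13 hours as given; it says nothing about whether that offset was real. The variable names are illustrative.

```python
observed_offset_us = 3400.0   # offset claimed after the outage
outage_seconds = 13 * 3600    # 13 hours without PPS
budget_us = 20.0              # required accuracy

# Assumed constant drift rate, in microseconds per second.
drift_us_per_s = observed_offset_us / outage_seconds

# Time for that constant drift to use up the 20us budget.
seconds_to_budget = budget_us / drift_us_per_s
minutes_to_budget = seconds_to_budget / 60
print(f"{drift_us_per_s * 1000:.1f} ppb, {minutes_to_budget:.1f} minutes")
```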





The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds?  That would be fine.


No. You really need to think through what you want and what the time on your
server machine delivers. After all if the computer clock in your local machine
was an exact track of UTC always to attoseconds, and you used the GPS only
to make the initial offset determination then it would be silly to
throw away that source just because the pps had not been heard from.


We are not interested in "time that is likely a good estimation". We require 
accurate time and if
we do not have it, or do not have certainty about it, we need to shutdown our 
application.
So we require some monitoring.  Of course I can add monitoring of "sources" or 
"sourcestats"
to the monitoring of "tracking" that we currently do, and alert when "Reach" of 
the PPS
clock is zero.  That is probably our quickest solution.  However, I would have 
expected this
error condition (missing PPS pulses) to be somehow reflected in the "tracking" 
output.

Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh

...



Is it to be considered a bug, or is this just a design feature?

It's a feature, but there is apparently a bug which may make the
switch take much longer than it should.


However, we use this form of time synchronization because we need the clock 
to be within about 20us
of real time.  When the PPS sync is lost and only network sync is achieved, 
that is not really attainable.

So we need some indication whenever there is no PPS sync.
Would it not be reasonable to indicate loss of PPS sync when the Reach value 
becomes zero?
Ok, it could be that freewheeling keeps a more accurate time than syncing to 
another source, but

at least the error condition should be monitored.




How could we work around that in this case?

Decreasing the maximum number of samples of the NTP source with the
maxsamples option should reduce the maximum span (as reported in
sourcestats) and also the time it will switch from unreachable
sources.

Increasing the maxclockerror would do that too if it was included in
the source selection. Even with the default value it would take only a few
hours to switch in your case.



Ok but rather than "only a few hours" I would like to see "only a few 
minutes".


But that would be totally ridiculous. The offset of the local clock from UTC
after even a few hours is still far far better than that from the network, and
far better even than 20us. Remember what you want to know is how far the local
clock is from UTC, not whether or not the local clock has heard from PPS
in the past few minutes.



The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds?  That would be fine.


No. You really need to think through what you want and what the time on your
server machine delivers. After all if the computer clock in your local machine
was an exact track of UTC always to attoseconds, and you used the GPS only
to make the initial offset determination then it would be silly to
throw away that source just because the pps had not been heard from.




Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Bill Unruh wrote:

210 Number of sources = 4
MS Name/IP address Stratum Poll Reach LastRx Last sample
===
#* PPS   0   4   377    24   +218ns[ +278ns] +/- 124ns
^- xx..xxx   1  10   377   877   -147us[ -122us] +/- 11ms
^- xx..xxx   1  10   377    14 +1480us[+1480us] +/- 10ms
^- xxx.xx..xxx   1  10   377   345 +1446us[+1447us] +/- 10ms

However, recently at one site the PPS signal was lost, but chrony keeps 
"locked" to it:

MS Name/IP address Stratum Poll Reach LastRx Last sample
===
#* PPS   0   4 0   13h   -279ns[ -401ns] +/- 79ns
^- xx..xxx   1  10   377   250 +3462us[+3462us] +/- 10ms

As can be seen, it has been lost for 13 hours but it still has the * sign in 
the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still 
indicated stratum 1 referenced to PPS.

I would have expected it to drop back to using those network time servers after 
some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2.  When it would 
operate that way, we would have
received an alert.

Furthermore, the clock had drifted by 3.5ms by the time the above status was 
noticed, while when synchronized
to network time it usually is within 1 to 1.5ms.  So it really is not 
considering those network time sources anymore.


Not sure what the above paragraph means. How do you know it has drifted by
3.5ms or 1 ms? I do not believe those figures, unless you meant 3.5us and
1usec, or unless by remote monitoring you mean really really remote with a
dodgy network in between.

Look in the above stats: it usually is at about 1.5ms (14xx us) from the 
network time sources,
and when the error condition occurred, it was at 3462us offset.
There is a network between the source and the system, but it isn't dodgy.


Was this a test by the way where you unplugged the gps from the machine?
Otherwise figuring out why gps pps was lost for that period of time is
probably the first thing to do.

We know what happened: the GPSDO went defective so there were no PPS pulses 
anymore.
(and also no 10 MHz reference, which we need in another part of the system)

What I would like to see is handling of the error condition.  Of course it is 
understandable that
there is no time syncing when there are no PPS pulses, but the condition should 
be visible.
(e.g. by the stratum increasing and/or the source changing)


Miroslav is better placed to figure out what is happening within chrony when
it loses pps input. Given the uncertainty in the rate as estimated from the
PPS it, 13 hrs ago, is still probably a better estimate of the current time
than is the network time from the other systems. 

It isn't!  Network time from the other systems would be about 1500us out, time 
was now 3400us out.
However, that is not the main point.

Remember that they are at poll 10 which is 1000 seconds or so (about 15 min)
so the network time sources have not had that many "measurements" in that time
interval and those are pretty crappy (10ms std dev which is really huge).
The PPS std dev is in the ns range-- about  1 times better.

I don't think the shown output in the last column of "chronyc sources" is the 
stddev.
Right now that column still indicates 10ms, but when I use "chronyc 
sourcestats" the last
column actually has a header Std Dev and the values are around 40-60us.


So the PPS is
still, even 13 hrs later, a better estimate of the true time than are those
crappy network sources.

The network sources aren't crappy.  There is a systematic offset but the 
variation is low.
I have no idea what the figure in the last column of sources means, it has no 
header.






The above situation occurred with chrony 2.1
However, I have reproduced it with an installation updated to version 3.2 although with 
an "outage" time of 15 minutes.
It had Reach 0 but still was indicating lock to PPS after 869 seconds.


The star means that the PPS is the best indicator of what the true time now
is. 

Even when it has not provided information for 13 hours?



Is it to be considered a bug, or is this just a design feature?


It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care-- that is how it is often
taken to mean).

Of course it could be that the design has a different objective.
We need the time to be very accurate (preferably within 2us but certainly 
within 20us)
and it looks like chrony is normally able to achieve that, but a design feature 
could
be that it is freewheeling on loss of sync rather than indicating an error.
I don't mind that it is freewheeling but I need an indication of that - because 
I need to
turn off our application as I know it does not take long 

Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Bill Unruh

210 Number of sources = 4
MS Name/IP address Stratum Poll Reach LastRx Last sample
===
#* PPS   0   4   377    24   +218ns[ +278ns] +/- 124ns
^- xx..xxx   1  10   377   877   -147us[ -122us] +/- 11ms
^- xx..xxx   1  10   377    14  +1480us[+1480us] +/- 10ms
^- xxx.xx..xxx   1  10   377   345  +1446us[+1447us] +/- 10ms


However, recently at one site the PPS signal was lost, but chrony keeps 
"locked" to it:


MS Name/IP address Stratum Poll Reach LastRx Last sample
===
#* PPS   0   4 0   13h   -279ns[ -401ns] +/- 79ns
^- xx..xxx   1  10   377   250  +3462us[+3462us] +/- 10ms


As can be seen, it has been lost for 13 hours but it still has the * sign in 
the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still 
indicated stratum 1 referenced to PPS.


I would have expected it to drop back to using those network time servers 
after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2.  When it would 
operate that way, we would have

received an alert.

Furthermore, the clock had drifted by 3.5ms by the time the above status was 
noticed, while when synchronized
to network time it usually is within 1 to 1.5ms.  So it really is not 
considering those network time sources anymore.


Not sure what the above paragraph means. How do you know it has drifted by
3.5ms or 1 ms? I do not believe those figures, unless you meant 3.5us and
1usec, or unless by remote monitoring you mean really really remote with a
dodgy network in between.

Was this a test by the way where you unplugged the gps from the machine?
Otherwise figuring out why gps pps was lost for that period of time is
probably the first thing to do.

Miroslav is better placed to figure out what is happening within chrony when
it loses pps input. Given the uncertainty in the rate as estimated from the
PPS it, 13 hrs ago, is still probably a better estimate of the current time
than is the network time from the other systems.
Remember that they are at poll 10 which is 1000 seconds or so (about 15 min)
so the network time sources have not had that many "measurements" in that time
interval and those are pretty crappy (10ms std dev which is really huge).
The PPS std dev is in the ns range-- about  1 times better. So the PPS is
still, even 13 hrs later, a better estimate of the true time than are those
crappy network sources.




The above situation occurred with chrony 2.1
However, I have reproduced it with an installation updated to version 3.2 
although with an "outage" time of 15 minutes.

It had Reach 0 but still was indicating lock to PPS after 869 seconds.


The star means that the PPS is the best indicator of what the true time now
is. 


Is it to be considered a bug, or is this just a design feature?


It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care-- that is how it is often
taken to mean). Here it indicates that the PPS is still, 13 hrs later, the
best indication of the offset from UTC. Now, this assumption that it is the
best could be off itself. For example if the time span used by the PPS was
overnight when the machine was cool inside, and during the day the machine is
used a lot and heats up, then the estimate from the PPS rate could well be
off because those kinds of jump in the rate would not enter into the estimate
of the skew for the PPS. (if the PPS had accumulated 64 samples at 16 sec per
sample, that is only 15 min, so the time span over which the pps is measuring
the rate  and the changes in the rate is quite short and would not capture
large rate deviations which occur with non-gaussian distribution-- like the
heating up every morning)
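
To put numbers on the spans mentioned above: NTP poll values are base-2 logarithms of the interval in seconds, so poll 4 is 16 s and poll 10 is 1024 s. A short Python sketch follows; the 64-sample figure is the one used in the text, and the names are illustrative.

```python
def poll_interval_seconds(poll):
    """Poll values are log2 of the polling interval in seconds."""
    return 2 ** poll


pps_interval = poll_interval_seconds(4)    # PPS refclock at poll 4
ntp_interval = poll_interval_seconds(10)   # network sources at poll 10

# Span covered by 64 PPS samples, as in the example above.
pps_span_seconds = 64 * pps_interval

print(pps_interval, ntp_interval, round(pps_span_seconds / 60, 1))
```

So 64 samples at poll 4 cover roughly 17 minutes, about the same as a single poll 10 interval.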


How could we work around that in this case?


It is not clear what it is you want to work around. From all the data, the PPS
13 hrs ago is still the best estimate of the UTC. Why would you want chrony to
use a measurably much worse source just because the PPS has not been heard
from for 13 hrs? Eventually the PPS from the remote past is no longer as good
as the relatively really crappy time from the network, but that could take
days.




Rob




Re: [chrony-users] Possible bug in PPS support

2017-10-23 Thread Rob Janssen

Miroslav Lichvar wrote:

On Mon, Oct 23, 2017 at 10:54:52AM +0200, Rob Janssen wrote:

However, recently at one site the PPS signal was lost, but chrony keeps 
"locked" to it:

MS Name/IP address Stratum Poll Reach LastRx Last sample
===
#* PPS   0   4 0   13h   -279ns[ -401ns] +/-   79ns
^- xx..xxx   1  10   377   250  +3462us[+3462us] +/-   10ms

As can be seen, it has been lost for 13 hours but it still has the * sign in 
the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still 
indicated stratum 1 referenced to PPS.

I would have expected it to drop back to using those network time servers after 
some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2.  When it would 
operate that way, we would have
received an alert.

Furthermore, the clock had drifted by 3.5ms by the time the above status was 
noticed, while when synchronized
to network time it usually is within 1 to 1.5ms.  So it really is not 
considering those network time sources anymore.

It would have switched eventually when the estimated error of the
refclock was larger than the error of the NTP source (10
milliseconds).

That does not seem reasonable... should it not refer to the estimated error of 
the source itself rather
than to the network source?



Have you saved the tracking or sourcestats output? From the skew we
could estimate how long it would take.


Ok here is the tracking.log, the last few lines before it failed:

2017-10-21 22:18:30 PPS  1 -12.275  0.048 -6.697e-07 N  1  4.525e-07  1.504e-07
2017-10-21 22:18:46 PPS  1 -12.279  0.030 -1.661e-07 N  1  3.788e-07  3.638e-11
2017-10-21 22:19:02 PPS  1 -12.284  0.029 -7.386e-07 N  1  4.446e-07  1.177e-07
2017-10-21 22:19:18 PPS  1 -12.286  0.020 -6.956e-08 N  1  3.629e-07  4.908e-11
2017-10-21 22:19:34 PPS  1 -12.290  0.022 -7.190e-07 N  1  4.091e-07  6.094e-08
2017-10-21 22:19:50 PPS  1 -12.292  0.018 -1.540e-07 N  1  3.709e-07  4.822e-11
2017-10-21 22:20:06 PPS  1 -12.295  0.017 -4.841e-07 N  1  4.030e-07  1.114e-07
2017-10-21 22:20:22 PPS  1 -12.297  0.014 -1.363e-07 N  1  3.626e-07  8.935e-09

After this, nothing was logged until I restarted chronyd 13 hours later and it 
synced to the network sources.




Is it to be considered a bug, or is this just a design feature?

It's a feature, but there is apparently a bug which may make the
switch take much longer than it should.


However, we use this form of time synchronization because we need the clock to 
be within about 20us
of real time.  When the PPS sync is lost and only network sync is achieved, 
that is not really attainable.
So we need some indication whenever there is no PPS sync.
Would it not be reasonable to indicate loss of PPS sync when the Reach value 
becomes zero?
Ok, it could be that freewheeling keeps a more accurate time than syncing to 
another source, but
at least the error condition should be monitored.




How could we work around that in this case?

Decreasing the maximum number of samples of the NTP source with the
maxsamples option should reduce the maximum span (as reported in
sourcestats) and also the time it will switch from unreachable
sources.

Increasing the maxclockerror would do that too if it was included in
the source selection. Even with the default value it would take only a few
hours to switch in your case.
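
A back-of-the-envelope check of the "few hours" estimate, in Python. Two assumptions are made here that the thread does not spell out: that the default maxclockerror is 1 ppm, and that the switch would happen once maxclockerror times the elapsed time exceeds the 10 ms error of the NTP source. This is a simplified model, not chrony's selection code, and the function name is hypothetical.

```python
def hours_until_switch(maxclockerror_ppm, ntp_error_seconds):
    """Hours until a growing error bound reaches the NTP source's error.

    Simplified model: the bound grows linearly at maxclockerror_ppm.
    """
    seconds = ntp_error_seconds / (maxclockerror_ppm * 1e-6)
    return seconds / 3600


# Assumed default of 1 ppm against the 10 ms shown for the NTP sources.
hours = hours_until_switch(1.0, 10e-3)
print(f"{hours:.1f} hours")
```

Under the same model a larger maxclockerror shortens the time proportionally, e.g. 10 ppm would give about 17 minutes.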



Ok but rather than "only a few hours" I would like to see "only a few minutes".
The Span indicated by sourcestats is 79 for the PPS source now, and 103m for
the network sources.
Would that mean it drops the PPS after 79 seconds?  That would be fine.

Rob
