Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-10 Thread Miroslav Lichvar
On Tue, Dec 10, 2013 at 11:20:51AM +0100, Miroslav Lichvar wrote:
> What does the following line from your patch mean?
> 
> tick_error -= tk->xtime_interval;

Ok, I think I understand how it should work. There are two loops, the
bigadjust one is correcting only for ntp tick length and the other for
the cumulative error. I think it might work better if they were both
active at the same time instead of switching between them according to
the current ntp error.

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-10 Thread Miroslav Lichvar
On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
> Being that the bigadjust code, and specifically this lookahead bit, has
> always been the most opaque logic to me, I figured I'd spend some time
> looking at alternatives, and came up with one approach that tries to
> mimic your patch, but tries to be more in line with the existing logic.
> 
> It seems to do fairly well in the simulator:
> n: 30, slope: 1.00 (1.00 GHz), dev: 3.2 ns, max: 3.6 ns, freq: -99.95677 ppm

Hm, this shows a 0.043ppm error in the frequency. It doesn't seem to go
away even when I use a long sampling interval or give it more time to
settle down. Is that an expected side effect of the patch?

> Basically in the big-error case, we calculate the adjustment from the
> current tick error (and the assumption is that is where the majority of
> the large error is coming from), leaving the normal +1/-1 adjustments to
> the cumulative error.

The normal +1/-1 adjustment doesn't seem to be active in the
simulation, at least in the default settings with 100ppm offset. When
I print the error variable in timekeeping_adjust() I can see its
absolute value stays above interval*2, so timekeeping_bigadjust() is
called on every update. The bigadjust correction seems too weak to
bring the error down to activate the normal +1/-1 adjustment, the
error keeps increasing and the frequency is slighly off.

What does the following line from your patch mean?

tick_error -= tk->xtime_interval;

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-10 Thread Miroslav Lichvar
On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
 Being that the bigadjust code, and specifically this lookahead bit, has
 always been the most opaque logic to me, I figured I'd spend some time
 looking at alternatives, and came up with one approach that tries to
 mimic your patch, but tries to be more in line with the existing logic.
 
 It seems to do fairly well in the simulator:
 n: 30, slope: 1.00 (1.00 GHz), dev: 3.2 ns, max: 3.6 ns, freq: -99.95677 ppm

Hm, this shows a 0.043ppm error in the frequency. It doesn't seem to go
away even when I use a long sampling interval or give it more time to
settle down. Is that an expected side effect of the patch?

 Basically in the big-error case, we calculate the adjustment from the
 current tick error (and the assumption is that is where the majority of
 the large error is coming from), leaving the normal +1/-1 adjustments to
 the cumulative error.

The normal +1/-1 adjustment doesn't seem to be active in the
simulation, at least in the default settings with 100ppm offset. When
I print the error variable in timekeeping_adjust() I can see its
absolute value stays above interval*2, so timekeeping_bigadjust() is
called on every update. The bigadjust correction seems too weak to
bring the error down to activate the normal +1/-1 adjustment, the
error keeps increasing and the frequency is slighly off.

What does the following line from your patch mean?

tick_error -= tk-xtime_interval;

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-10 Thread Miroslav Lichvar
On Tue, Dec 10, 2013 at 11:20:51AM +0100, Miroslav Lichvar wrote:
 What does the following line from your patch mean?
 
 tick_error -= tk-xtime_interval;

Ok, I think I understand how it should work. There are two loops, the
bigadjust one is correcting only for ntp tick length and the other for
the cumulative error. I think it might work better if they were both
active at the same time instead of switching between them according to
the current ntp error.

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-07 Thread John Stultz
On 12/07/2013 09:56 AM, Richard Cochran wrote:
> On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
>> Anyway, let me know what you think and I'll run some tests on it this
>> weekend.
>>
>> thanks
>> -john
>>
>>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 3abf534..bfb36fd 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1056,42 +1056,24 @@ static __always_inline int
>> timekeeping_bigadjust(struct timekeeper *tk,
>>   s64 *offset)
> John,
>
> Any chance of posting this against a normal kernel?  I am preparing
> "real" tests comparing the three different patches in this thread on
> v3.12.3, but this one does not apply.

Sorry about that. :(  About half the time I try pasting in a patch,
thunderbird seems to decide to ignore the preformat setting and corrupts
whitespace in the patch.

Anyway, patch is attached. Many thanks for the additional testing and
review!

thanks
-john
>From 3fbbd9ade38419245af22902a98ea221e7b36c94 Mon Sep 17 00:00:00 2001
From: John Stultz 
Date: Fri, 6 Dec 2013 17:25:21 -0800
Subject: [PATCH] my first attempt at reworking bigadjust for nohz

Takes a similar approach as Miroslav's but using the bigadjust method.

Signed-off-by: John Stultz 
---
 kernel/time/timekeeping.c | 80 +--
 1 file changed, 22 insertions(+), 58 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 947ba25..46f4bd2 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1056,42 +1056,24 @@ static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
 		 s64 *offset)
 {
 	s64 tick_error, i;
-	u32 look_ahead, adj;
-	s32 error2, mult;
+	u32 adj;
+	s32 mult = 1;
 
-	/*
-	 * Use the current error value to determine how much to look ahead.
-	 * The larger the error the slower we adjust for it to avoid problems
-	 * with losing too many ticks, otherwise we would overadjust and
-	 * produce an even larger error.  The smaller the adjustment the
-	 * faster we try to adjust for it, as lost ticks can do less harm
-	 * here.  This is tuned so that an error of about 1 msec is adjusted
-	 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
-	 */
-	error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
-	error2 = abs(error2);
-	for (look_ahead = 0; error2 > 0; look_ahead++)
-		error2 >>= 2;
+	/* Calculate current tick error */
+	tick_error = ntp_tick_length() >> (tk->ntp_error_shift );
+	tick_error -= tk->xtime_interval;
 
-	/*
-	 * Now calculate the error in (1 << look_ahead) ticks, but first
-	 * remove the single look ahead already included in the error.
-	 */
-	tick_error = ntp_tick_length() >> (tk->ntp_error_shift + 1);
-	tick_error -= tk->xtime_interval >> 1;
-	error = ((error - tick_error) >> look_ahead) + tick_error;
-
-	/* Finally calculate the adjustment shift value.  */
-	i = *interval;
-	mult = 1;
-	if (error < 0) {
-		error = -error;
+	if (tick_error < 0) {
 		*interval = -*interval;
 		*offset = -*offset;
-		mult = -1;
+		mult = -mult;
 	}
-	for (adj = 0; error > i; adj++)
-		error >>= 1;
+
+	/* Sort out the magnitude of the correction */
+	tick_error = abs(tick_error);
+	i = abs(*interval);
+	for (adj = 0; tick_error > i; adj++)
+		tick_error >>= 1;
 
 	*interval <<= adj;
 	*offset <<= adj;
@@ -1114,41 +1096,23 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 	 *
 	 * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs.
 	 *
-	 * Note we subtract one in the shift, so that error is really error*2.
-	 * This "saves" dividing(shifting) interval twice, but keeps the
-	 * (error > interval) comparison as still measuring if error is
-	 * larger than half an interval.
-	 *
-	 * Note: It does not "save" on aggravation when reading the code.
+	 * Then we meausre if error is larger than half an interval.
 	 */
-	error = tk->ntp_error >> (tk->ntp_error_shift - 1);
-	if (error > interval) {
-		/*
-		 * We now divide error by 4(via shift), which checks if
-		 * the error is greater than twice the interval.
-		 * If it is greater, we need a bigadjust, if its smaller,
-		 * we can adjust by 1.
-		 */
-		error >>= 2;
+	error = tk->ntp_error >> (tk->ntp_error_shift);
+	if (error > interval/2) {
 		/*
-		 * XXX - In update_wall_time, we round up to the next
-		 * nanosecond, and store the amount rounded up into
-		 * the error. This causes the likely below to be unlikely.
-		 *
-		 * The proper fix is to avoid rounding up by using
-		 * the high precision tk->xtime_nsec instead of
-		 * xtime.tv_nsec everywhere. Fixing this will take some
-		 * time.
+		 * We now checks if the error is greater than twice the
+		 * interval. If it is greater, we need a bigadjust, if its
+		 * smaller, we can adjust by 1.
 		 */
-		if (likely(error <= interval))
+		if (likely(error <= interval*2))
 			adj = 1;
 		else
 			adj = timekeeping_bigadjust(tk, error, , );
 	} else {
-		if 

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-07 Thread Richard Cochran
On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
> Anyway, let me know what you think and I'll run some tests on it this
> weekend.
> 
> thanks
> -john
> 
> 
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 3abf534..bfb36fd 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1056,42 +1056,24 @@ static __always_inline int
> timekeeping_bigadjust(struct timekeeper *tk,
>   s64 *offset)

John,

Any chance of posting this against a normal kernel?  I am preparing
"real" tests comparing the three different patches in this thread on
v3.12.3, but this one does not apply.

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-07 Thread Richard Cochran
On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
 Anyway, let me know what you think and I'll run some tests on it this
 weekend.
 
 thanks
 -john
 
 
 diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
 index 3abf534..bfb36fd 100644
 --- a/kernel/time/timekeeping.c
 +++ b/kernel/time/timekeeping.c
 @@ -1056,42 +1056,24 @@ static __always_inline int
 timekeeping_bigadjust(struct timekeeper *tk,
   s64 *offset)

John,

Any chance of posting this against a normal kernel?  I am preparing
real tests comparing the three different patches in this thread on
v3.12.3, but this one does not apply.

Thanks,
Richard
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-07 Thread John Stultz
On 12/07/2013 09:56 AM, Richard Cochran wrote:
 On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
 Anyway, let me know what you think and I'll run some tests on it this
 weekend.

 thanks
 -john


 diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
 index 3abf534..bfb36fd 100644
 --- a/kernel/time/timekeeping.c
 +++ b/kernel/time/timekeeping.c
 @@ -1056,42 +1056,24 @@ static __always_inline int
 timekeeping_bigadjust(struct timekeeper *tk,
   s64 *offset)
 John,

 Any chance of posting this against a normal kernel?  I am preparing
 real tests comparing the three different patches in this thread on
 v3.12.3, but this one does not apply.

Sorry about that. :(  About half the time I try pasting in a patch,
thunderbird seems to decide to ignore the preformat setting and corrupts
whitespace in the patch.

Anyway, patch is attached. Many thanks for the additional testing and
review!

thanks
-john
From 3fbbd9ade38419245af22902a98ea221e7b36c94 Mon Sep 17 00:00:00 2001
From: John Stultz john.stu...@linaro.org
Date: Fri, 6 Dec 2013 17:25:21 -0800
Subject: [PATCH] my first attempt at reworking bigadjust for nohz

Takes a similar approach as Miroslav's but using the bigadjust method.

Signed-off-by: John Stultz john.stu...@linaro.org
---
 kernel/time/timekeeping.c | 80 +--
 1 file changed, 22 insertions(+), 58 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 947ba25..46f4bd2 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1056,42 +1056,24 @@ static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
 		 s64 *offset)
 {
 	s64 tick_error, i;
-	u32 look_ahead, adj;
-	s32 error2, mult;
+	u32 adj;
+	s32 mult = 1;
 
-	/*
-	 * Use the current error value to determine how much to look ahead.
-	 * The larger the error the slower we adjust for it to avoid problems
-	 * with losing too many ticks, otherwise we would overadjust and
-	 * produce an even larger error.  The smaller the adjustment the
-	 * faster we try to adjust for it, as lost ticks can do less harm
-	 * here.  This is tuned so that an error of about 1 msec is adjusted
-	 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
-	 */
-	error2 = tk-ntp_error  (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
-	error2 = abs(error2);
-	for (look_ahead = 0; error2  0; look_ahead++)
-		error2 = 2;
+	/* Calculate current tick error */
+	tick_error = ntp_tick_length()  (tk-ntp_error_shift );
+	tick_error -= tk-xtime_interval;
 
-	/*
-	 * Now calculate the error in (1  look_ahead) ticks, but first
-	 * remove the single look ahead already included in the error.
-	 */
-	tick_error = ntp_tick_length()  (tk-ntp_error_shift + 1);
-	tick_error -= tk-xtime_interval  1;
-	error = ((error - tick_error)  look_ahead) + tick_error;
-
-	/* Finally calculate the adjustment shift value.  */
-	i = *interval;
-	mult = 1;
-	if (error  0) {
-		error = -error;
+	if (tick_error  0) {
 		*interval = -*interval;
 		*offset = -*offset;
-		mult = -1;
+		mult = -mult;
 	}
-	for (adj = 0; error  i; adj++)
-		error = 1;
+
+	/* Sort out the magnitude of the correction */
+	tick_error = abs(tick_error);
+	i = abs(*interval);
+	for (adj = 0; tick_error  i; adj++)
+		tick_error = 1;
 
 	*interval = adj;
 	*offset = adj;
@@ -1114,41 +1096,23 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 	 *
 	 * First we shift it down from NTP_SHIFT to clocksource-shifted nsecs.
 	 *
-	 * Note we subtract one in the shift, so that error is really error*2.
-	 * This saves dividing(shifting) interval twice, but keeps the
-	 * (error  interval) comparison as still measuring if error is
-	 * larger than half an interval.
-	 *
-	 * Note: It does not save on aggravation when reading the code.
+	 * Then we meausre if error is larger than half an interval.
 	 */
-	error = tk-ntp_error  (tk-ntp_error_shift - 1);
-	if (error  interval) {
-		/*
-		 * We now divide error by 4(via shift), which checks if
-		 * the error is greater than twice the interval.
-		 * If it is greater, we need a bigadjust, if its smaller,
-		 * we can adjust by 1.
-		 */
-		error = 2;
+	error = tk-ntp_error  (tk-ntp_error_shift);
+	if (error  interval/2) {
 		/*
-		 * XXX - In update_wall_time, we round up to the next
-		 * nanosecond, and store the amount rounded up into
-		 * the error. This causes the likely below to be unlikely.
-		 *
-		 * The proper fix is to avoid rounding up by using
-		 * the high precision tk-xtime_nsec instead of
-		 * xtime.tv_nsec everywhere. Fixing this will take some
-		 * time.
+		 * We now checks if the error is greater than twice the
+		 * interval. If it is greater, we need a bigadjust, if its
+		 * smaller, we can adjust by 1.
 		 */
-		if (likely(error = interval))
+		if (likely(error = interval*2))
 			adj = 1;
 		else
 			adj = timekeeping_bigadjust(tk, error, interval, offset);
 	} else {
-		if (error  -interval) {
+		if (error  

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread John Stultz
On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
> This graph shows the value of tk->mult as it changes with clock
> updates:
> http://mlichvar.fedorapeople.org/tmp/tk_test1.png
>
> When the TSC frequency is set to 100 MHz, it becomes more pronounced:
> http://mlichvar.fedorapeople.org/tmp/tk_test2.png
>
> I'm worried about the artifacts in the response, is that a bug?

So now being able to reproduce this, I can see the jaggy
artifacts/discontinuities in the mult curve are basically where the
look_ahead logic is being adjusted.

My understanding is the lookahead bit is trying to evaluate the order of
magnitude in error and use that to dampen the adjustment. This means
when we have large errors, we adjust more slowly. When we have small
errors we can adjust more quickly.

So each discontinuity, you're basically seeing the dampening effect
decreased, causing the adjustment slope to increase.


Being that the bigadjust code, and specifically this lookahead bit, has
always been the most opaque logic to me, I figured I'd spend some time
looking at alternatives, and came up with one approach that tries to
mimic your patch, but tries to be more in line with the existing logic.

It seems to do fairly well in the simulator:
n: 30, slope: 1.00 (1.00 GHz), dev: 3.2 ns, max: 3.6 ns, freq: -99.95677 ppm

Basically in the big-error case, we calculate the adjustment from the
current tick error (and the assumption is that is where the majority of
the large error is coming from), leaving the normal +1/-1 adjustments to
the cumulative error.

But I'm a little worried its maybe too reactive, focusing really only on
the per-tick error magnitude, so I'm not sure how it will handle real
world ntp fluctuation or things like granularity error.

I also went through and removed some of the over-optimization that the
compiler should handle, so the code is a bit more readable.

Anyway, let me know what you think and I'll run some tests on it this
weekend.

thanks
-john


diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..bfb36fd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1056,42 +1056,24 @@ static __always_inline int
timekeeping_bigadjust(struct timekeeper *tk,
  s64 *offset)
 {
 s64 tick_error, i;
-u32 look_ahead, adj;
-s32 error2, mult;
+u32 adj;
+s32 mult = 1;
 
-/*
- * Use the current error value to determine how much to look ahead.
- * The larger the error the slower we adjust for it to avoid problems
- * with losing too many ticks, otherwise we would overadjust and
- * produce an even larger error.  The smaller the adjustment the
- * faster we try to adjust for it, as lost ticks can do less harm
- * here.  This is tuned so that an error of about 1 msec is adjusted
- * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
- */
-error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
-error2 = abs(error2);
-for (look_ahead = 0; error2 > 0; look_ahead++)
-error2 >>= 2;
+/* Calculate current tick error */
+tick_error = ntp_tick_length() >> (tk->ntp_error_shift );
+tick_error -= tk->xtime_interval;
 
-/*
- * Now calculate the error in (1 << look_ahead) ticks, but first
- * remove the single look ahead already included in the error.
- */
-tick_error = ntp_tick_length() >> (tk->ntp_error_shift + 1);
-tick_error -= tk->xtime_interval >> 1;
-error = ((error - tick_error) >> look_ahead) + tick_error;
-
-/* Finally calculate the adjustment shift value.  */
-i = *interval;
-mult = 1;
-if (error < 0) {
-error = -error;
+if (tick_error < 0) {
 *interval = -*interval;
 *offset = -*offset;
-mult = -1;
+mult = -mult;
 }
-for (adj = 0; error > i; adj++)
-error >>= 1;
+
+/* Sort out the magnitude of the correction */
+tick_error = abs(tick_error);
+i = abs(*interval);
+for (adj = 0; tick_error > i; adj++)
+tick_error >>= 1;
 
 *interval <<= adj;
 *offset <<= adj;
@@ -1114,41 +1096,23 @@ static void timekeeping_adjust(struct timekeeper
*tk, s64 offset)
  *
  * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs.
  *
- * Note we subtract one in the shift, so that error is really error*2.
- * This "saves" dividing(shifting) interval twice, but keeps the
- * (error > interval) comparison as still measuring if error is
- * larger than half an interval.
- *
- * Note: It does not "save" on aggravation when reading the code.
+ * Then we meausre if error is larger than half an interval.
  */
-error = tk->ntp_error >> (tk->ntp_error_shift - 1);
-if (error > interval) {
-/*
- * We now divide error by 4(via shift), which checks if
- * the error is greater than twice the interval.
- * If it is greater, we need a bigadjust, if its smaller,
-   

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread Miroslav Lichvar
On Fri, Dec 06, 2013 at 10:09:03AM -0800, John Stultz wrote:
> On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
> > It seems to fix the problem with stability, that's good. But the
> > response seems to be very slow now. In the simulated test with 10Hz
> > clock update it takes about 1000 updates (100 seconds!) for the loop
> > to converge to the correct frequency.
> 
> Yea. That was my concern that it over dampens the correction. In my
> tests on actual systems it doesn't seem to cause much change in the
> overall convergence picture with ntp, so I'll have to look more closely.

In a few quick tests with phc2sys I didn't see any changes either.
But I couldn't get the wakeup rate on my test system below 20 per
second even after playing with powertop and killing everything, so I'm
not sure if that means anything.

What is the wakeup rate on your test system?

My feeling is that the internal kernel loop should be faster than the
application's loop.

> Just to be clear, when you say 10Hz clock update, what exactly are you
> changing, as that doesn't quite match to the terminology in the tktest
> simulator (ie: are you changing the ticks count?).

Yes, the second argument of advance_ticks(), 100 for 10 Hz (the
current value) and 1000 for 1 Hz.

> > When the sampling interval is changed to 100*50 ticks:
> > n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: 
> > -100.07928 ppm

> I get the first and the last numbers, but the middle are different for
> me. Are you just setting:

> -   advance_ticks(freq, 100, 1, 20);
> +   advance_ticks(freq, 100, 1, 50);
>  
> for (i = 0; i < samples; i++) {
> getnstimeofday();

No, I'm setting the one in the loop:

ts_x[i] = simtsc;
ts_y[i] = ts.tv_sec * 10ULL + ts.tv_nsec;
 
-   advance_ticks(freq, 100, 1, 1);
+   advance_ticks(freq, 100, 1, 50);

> > When the TSC frequency is set to 100 MHz, it becomes more pronounced:
> > http://mlichvar.fedorapeople.org/tmp/tk_test2.png
> >
> > I'm worried about the artifacts in the response, is that a bug?
> 
> It does look strange. And again so I can reproduce this, how are you
> generating the charts?

I changed TSC_FREQ, added "printk("mult: %d\n", tk->mult);" to
update_wall_time() in timekeeping.c, grepped the output for "^mult"
and used this command in gnuplot on display the data:

plot [] [167755330:167755390] "log.mult" using 2

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread John Stultz
On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
> On Mon, Dec 02, 2013 at 08:03:17PM -0800, John Stultz wrote:
>> On 12/02/2013 04:53 PM, John Stultz wrote:
>> Finally found a config to get it working (disabling kernel debugging
>> seems to work), and am currently trying to fixup the missing symbols
>> (although I'm getting segfaults from various inline cli's :)
> Patches are welcome :).
>
>> Very cool simulator, by the way. Do you plan to have a git repo at some
>> point for it?
> It's now at https://github.com/mlichvar/linux-tktest
>
> I'm considering to include it in https://github.com/mlichvar/clknetsim
> as an optional replacement of the somewhat idealized clock which is
> currently implemented there. This would allow us to see the whole
> picture with applications controlling the clock.

Interesting!  I don't think I've seen clknetsim before. I'll have to
look at it more closely!


>> See the patch below. I'm doing some actual testing with it to see if its
>> maybe too dampened.
> It seems to fix the problem with stability, that's good. But the
> response seems to be very slow now. In the simulated test with 10Hz
> clock update it takes about 1000 updates (100 seconds!) for the loop
> to converge to the correct frequency.

Yea. That was my concern that it over dampens the correction. In my
tests on actual systems it doesn't seem to cause much change in the
overall convergence picture with ntp, so I'll have to look more closely.

Just to be clear, when you say 10Hz clock update, what exactly are you
changing, as that doesn't quite match to the terminology in the tktest
simulator (ie: are you changing the ticks count?).


> With the current tktest code from git:
> n: 30, slope: 1.00 (1.00 GHz), dev: 3.1 ns, max: 3.6 ns, freq: -100.43404 ppm
>
> You can see here the frequency is off by 0.43 ppm, that's after the 20
> skipped updates.
>
> When the sampling interval is changed to 100*50 ticks:
> n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: 
> -100.07928 ppm
>
> Only when the warmup period is extended to 100*1000 ticks, it produces
> a nice fit:
> n: 30, slope: 1.00 (1.00 GHz), dev: 7.3 ns, max: 12.2 ns, freq: -100.4 ppm


I get the first and the last numbers, but the middle are different for
me. Are you just setting:

diff --git a/tk_test.c b/tk_test.c
index e44a488..680f315 100644
--- a/tk_test.c
+++ b/tk_test.c
@@ -82,7 +82,7 @@ void tk_test(uint64_t *ts_x, uint64_t *ts_y, int samples, int 
 
advance_ticks(freq, 1, 1, 200);
ntp_freq -= 10;
-   advance_ticks(freq, 100, 1, 20);
+   advance_ticks(freq, 100, 1, 50);
 
for (i = 0; i < samples; i++) {
getnstimeofday();

?

> This graph shows the value of tk->mult as it changes with clock
> updates:
> http://mlichvar.fedorapeople.org/tmp/tk_test1.png
>
> When the TSC frequency is set to 100 MHz, it becomes more pronounced:
> http://mlichvar.fedorapeople.org/tmp/tk_test2.png
>
> I'm worried about the artifacts in the response, is that a bug?

It does look strange. And again so I can reproduce this, how are you
generating the charts?


thanks
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread Miroslav Lichvar
On Mon, Dec 02, 2013 at 08:03:17PM -0800, John Stultz wrote:
> On 12/02/2013 04:53 PM, John Stultz wrote:
> Finally found a config to get it working (disabling kernel debugging
> seems to work), and am currently trying to fixup the missing symbols
> (although I'm getting segfaults from various inline cli's :)

Patches are welcome :).

> Very cool simulator, by the way. Do you plan to have a git repo at some
> point for it?

It's now at https://github.com/mlichvar/linux-tktest

I'm considering to include it in https://github.com/mlichvar/clknetsim
as an optional replacement of the somewhat idealized clock which is
currently implemented there. This would allow us to see the whole
picture with applications controlling the clock.

> See the patch below. I'm doing some actual testing with it to see if its
> maybe too dampened.

It seems to fix the problem with stability, that's good. But the
response seems to be very slow now. In the simulated test with 10Hz
clock update it takes about 1000 updates (100 seconds!) for the loop
to converge to the correct frequency.

With the current tktest code from git:
n: 30, slope: 1.00 (1.00 GHz), dev: 3.1 ns, max: 3.6 ns, freq: -100.43404 ppm

You can see here the frequency is off by 0.43 ppm, that's after the 20
skipped updates.

When the sampling interval is changed to 100*50 ticks:
n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: -100.07928 
ppm

Only when the warmup period is extended to 100*1000 ticks, it produces
a nice fit:
n: 30, slope: 1.00 (1.00 GHz), dev: 7.3 ns, max: 12.2 ns, freq: -100.4 ppm

This graph shows the value of tk->mult as it changes with clock
updates:
http://mlichvar.fedorapeople.org/tmp/tk_test1.png

When the TSC frequency is set to 100 MHz, it becomes more pronounced:
http://mlichvar.fedorapeople.org/tmp/tk_test2.png

I'm worried about the artifacts in the response, is that a bug?

> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1068,7 +1068,7 @@ static __always_inline int timekeeping_bigadjust(struct 
> timekeeper *tk,
>* here.  This is tuned so that an error of about 1 msec is adjusted
>* within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
>*/
> - error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
> + error2 = tk->ntp_error >> (NTP_SCALE_SHIFT/2);
>   error2 = abs(error2);
>   for (look_ahead = 0; error2 > 0; look_ahead++)
>   error2 >>= 2;
> 

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread Miroslav Lichvar
On Mon, Dec 02, 2013 at 08:03:17PM -0800, John Stultz wrote:
 On 12/02/2013 04:53 PM, John Stultz wrote:
 Finally found a config to get it working (disabling kernel debugging
 seems to work), and am currently trying to fixup the missing symbols
 (although I'm getting segfaults from various inline cli's :)

Patches are welcome :).

 Very cool simulator, by the way. Do you plan to have a git repo at some
 point for it?

It's now at https://github.com/mlichvar/linux-tktest

I'm considering to include it in https://github.com/mlichvar/clknetsim
as an optional replacement of the somewhat idealized clock which is
currently implemented there. This would allow us to see the whole
picture with applications controlling the clock.

 See the patch below. I'm doing some actual testing with it to see if its
 maybe too dampened.

It seems to fix the problem with stability, that's good. But the
response seems to be very slow now. In the simulated test with 10Hz
clock update it takes about 1000 updates (100 seconds!) for the loop
to converge to the correct frequency.

With the current tktest code from git:
n: 30, slope: 1.00 (1.00 GHz), dev: 3.1 ns, max: 3.6 ns, freq: -100.43404 ppm

You can see here the frequency is off by 0.43 ppm, that's after the 20
skipped updates.

When the sampling interval is changed to 100*50 ticks:
n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: -100.07928 
ppm

Only when the warmup period is extended to 100*1000 ticks, it produces
a nice fit:
n: 30, slope: 1.00 (1.00 GHz), dev: 7.3 ns, max: 12.2 ns, freq: -100.4 ppm

This graph shows the value of tk-mult as it changes with clock
updates:
http://mlichvar.fedorapeople.org/tmp/tk_test1.png

When the TSC frequency is set to 100 MHz, it becomes more pronounced:
http://mlichvar.fedorapeople.org/tmp/tk_test2.png

I'm worried about the artifacts in the response, is that a bug?

 --- a/kernel/time/timekeeping.c
 +++ b/kernel/time/timekeeping.c
 @@ -1068,7 +1068,7 @@ static __always_inline int timekeeping_bigadjust(struct 
 timekeeper *tk,
* here.  This is tuned so that an error of about 1 msec is adjusted
* within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
*/
 - error2 = tk-ntp_error  (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
 + error2 = tk-ntp_error  (NTP_SCALE_SHIFT/2);
   error2 = abs(error2);
   for (look_ahead = 0; error2  0; look_ahead++)
   error2 = 2;
 

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread John Stultz
On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
 On Mon, Dec 02, 2013 at 08:03:17PM -0800, John Stultz wrote:
 On 12/02/2013 04:53 PM, John Stultz wrote:
 Finally found a config to get it working (disabling kernel debugging
 seems to work), and am currently trying to fixup the missing symbols
 (although I'm getting segfaults from various inline cli's :)
 Patches are welcome :).

 Very cool simulator, by the way. Do you plan to have a git repo at some
 point for it?
 It's now at https://github.com/mlichvar/linux-tktest

 I'm considering to include it in https://github.com/mlichvar/clknetsim
 as an optional replacement of the somewhat idealized clock which is
 currently implemented there. This would allow us to see the whole
 picture with applications controlling the clock.

Interesting!  I don't think I've seen clknetsim before. I'll have to
look at it more closely!


 See the patch below. I'm doing some actual testing with it to see if its
 maybe too dampened.
 It seems to fix the problem with stability, that's good. But the
 response seems to be very slow now. In the simulated test with 10Hz
 clock update it takes about 1000 updates (100 seconds!) for the loop
 to converge to the correct frequency.

Yea. That was my concern that it over dampens the correction. In my
tests on actual systems it doesn't seem to cause much change in the
overall convergence picture with ntp, so I'll have to look more closely.

Just to be clear, when you say 10Hz clock update, what exactly are you
changing, as that doesn't quite match to the terminology in the tktest
simulator (ie: are you changing the ticks count?).


 With the current tktest code from git:
 n: 30, slope: 1.00 (1.00 GHz), dev: 3.1 ns, max: 3.6 ns, freq: -100.43404 ppm

 You can see here the frequency is off by 0.43 ppm, that's after the 20
 skipped updates.

 When the sampling interval is changed to 100*50 ticks:
 n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: 
 -100.07928 ppm

 Only when the warmup period is extended to 100*1000 ticks, it produces
 a nice fit:
 n: 30, slope: 1.00 (1.00 GHz), dev: 7.3 ns, max: 12.2 ns, freq: -100.4 ppm


I get the first and the last numbers, but the middle are different for
me. Are you just setting:

diff --git a/tk_test.c b/tk_test.c
index e44a488..680f315 100644
--- a/tk_test.c
+++ b/tk_test.c
@@ -82,7 +82,7 @@ void tk_test(uint64_t *ts_x, uint64_t *ts_y, int samples, int 
 
advance_ticks(freq, 1, 1, 200);
ntp_freq -= 10;
-   advance_ticks(freq, 100, 1, 20);
+   advance_ticks(freq, 100, 1, 50);
 
for (i = 0; i  samples; i++) {
getnstimeofday(ts);

?

 This graph shows the value of tk-mult as it changes with clock
 updates:
 http://mlichvar.fedorapeople.org/tmp/tk_test1.png

 When the TSC frequency is set to 100 MHz, it becomes more pronounced:
 http://mlichvar.fedorapeople.org/tmp/tk_test2.png

 I'm worried about the artifacts in the response, is that a bug?

It does look strange. And again so I can reproduce this, how are you
generating the charts?


thanks
-john

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread Miroslav Lichvar
On Fri, Dec 06, 2013 at 10:09:03AM -0800, John Stultz wrote:
 On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
  It seems to fix the problem with stability, that's good. But the
  response seems to be very slow now. In the simulated test with 10Hz
  clock update it takes about 1000 updates (100 seconds!) for the loop
  to converge to the correct frequency.
 
 Yea. That was my concern that it over dampens the correction. In my
 tests on actual systems it doesn't seem to cause much change in the
 overall convergence picture with ntp, so I'll have to look more closely.

In a few quick tests with phc2sys I didn't see any changes either.
But I couldn't get the wakeup rate on my test system below 20 per
second even after playing with powertop and killing everything, so I'm
not sure if that means anything.

What is the wakeup rate on your test system?

My feeling is that the internal kernel loop should be faster than the
application's loop.

 Just to be clear, when you say 10Hz clock update, what exactly are you
 changing, as that doesn't quite match to the terminology in the tktest
 simulator (ie: are you changing the ticks count?).

Yes, the second argument of advance_ticks(), 100 for 10 Hz (the
current value) and 1000 for 1 Hz.

  When the sampling interval is changed to 100*50 ticks:
  n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: 
  -100.07928 ppm

 I get the first and the last numbers, but the middle are different for
 me. Are you just setting:

 -   advance_ticks(freq, 100, 1, 20);
 +   advance_ticks(freq, 100, 1, 50);
  
 for (i = 0; i  samples; i++) {
 getnstimeofday(ts);

No, I'm setting the one in the loop:

ts_x[i] = simtsc;
ts_y[i] = ts.tv_sec * 10ULL + ts.tv_nsec;
 
-   advance_ticks(freq, 100, 1, 1);
+   advance_ticks(freq, 100, 1, 50);

  When the TSC frequency is set to 100 MHz, it becomes more pronounced:
  http://mlichvar.fedorapeople.org/tmp/tk_test2.png
 
  I'm worried about the artifacts in the response, is that a bug?
 
 It does look strange. And again so I can reproduce this, how are you
 generating the charts?

I changed TSC_FREQ, added printk(mult: %d\n, tk-mult); to
update_wall_time() in timekeeping.c, grepped the output for ^mult
and used this command in gnuplot on display the data:

plot [] [167755330:167755390] log.mult using 2

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-06 Thread John Stultz
On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
 This graph shows the value of tk-mult as it changes with clock
 updates:
 http://mlichvar.fedorapeople.org/tmp/tk_test1.png

 When the TSC frequency is set to 100 MHz, it becomes more pronounced:
 http://mlichvar.fedorapeople.org/tmp/tk_test2.png

 I'm worried about the artifacts in the response, is that a bug?

So now being able to reproduce this, I can see the jaggy
artifacts/discontinuities in the mult curve are basically where the
look_ahead logic is being adjusted.

My understanding is the lookahead bit is trying to evaluate the order of
magnitude in error and use that to dampen the adjustment. This means
when we have large errors, we adjust more slowly. When we have small
errors we can adjust more quickly.

So each discontinuity, you're basically seeing the dampening effect
decreased, causing the adjustment slope to increase.


Being that the bigadjust code, and specifically this lookahead bit, has
always been the most opaque logic to me, I figured I'd spend some time
looking at alternatives, and came up with one approach that tries to
mimic your patch, but tries to be more in line with the existing logic.

It seems to do fairly well in the simulator:
n: 30, slope: 1.00 (1.00 GHz), dev: 3.2 ns, max: 3.6 ns, freq: -99.95677 ppm

Basically in the big-error case, we calculate the adjustment from the
current tick error (and the assumption is that is where the majority of
the large error is coming from), leaving the normal +1/-1 adjustments to
the cumulative error.

But I'm a little worried its maybe too reactive, focusing really only on
the per-tick error magnitude, so I'm not sure how it will handle real
world ntp fluctuation or things like granularity error.

I also went through and removed some of the over-optimization that the
compiler should handle, so the code is a bit more readable.

Anyway, let me know what you think and I'll run some tests on it this
weekend.

thanks
-john


diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..bfb36fd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1056,42 +1056,24 @@ static __always_inline int
timekeeping_bigadjust(struct timekeeper *tk,
  s64 *offset)
 {
 s64 tick_error, i;
-u32 look_ahead, adj;
-s32 error2, mult;
+u32 adj;
+s32 mult = 1;
 
-/*
- * Use the current error value to determine how much to look ahead.
- * The larger the error the slower we adjust for it to avoid problems
- * with losing too many ticks, otherwise we would overadjust and
- * produce an even larger error.  The smaller the adjustment the
- * faster we try to adjust for it, as lost ticks can do less harm
- * here.  This is tuned so that an error of about 1 msec is adjusted
- * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
- */
-error2 = tk-ntp_error  (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
-error2 = abs(error2);
-for (look_ahead = 0; error2  0; look_ahead++)
-error2 = 2;
+/* Calculate current tick error */
+tick_error = ntp_tick_length()  (tk-ntp_error_shift );
+tick_error -= tk-xtime_interval;
 
-/*
- * Now calculate the error in (1  look_ahead) ticks, but first
- * remove the single look ahead already included in the error.
- */
-tick_error = ntp_tick_length()  (tk-ntp_error_shift + 1);
-tick_error -= tk-xtime_interval  1;
-error = ((error - tick_error)  look_ahead) + tick_error;
-
-/* Finally calculate the adjustment shift value.  */
-i = *interval;
-mult = 1;
-if (error  0) {
-error = -error;
+if (tick_error  0) {
 *interval = -*interval;
 *offset = -*offset;
-mult = -1;
+mult = -mult;
 }
-for (adj = 0; error  i; adj++)
-error = 1;
+
+/* Sort out the magnitude of the correction */
+tick_error = abs(tick_error);
+i = abs(*interval);
+for (adj = 0; tick_error  i; adj++)
+tick_error = 1;
 
 *interval = adj;
 *offset = adj;
@@ -1114,41 +1096,23 @@ static void timekeeping_adjust(struct timekeeper
*tk, s64 offset)
  *
  * First we shift it down from NTP_SHIFT to clocksource-shifted nsecs.
  *
- * Note we subtract one in the shift, so that error is really error*2.
- * This saves dividing(shifting) interval twice, but keeps the
- * (error  interval) comparison as still measuring if error is
- * larger than half an interval.
- *
- * Note: It does not save on aggravation when reading the code.
+ * Then we meausre if error is larger than half an interval.
  */
-error = tk-ntp_error  (tk-ntp_error_shift - 1);
-if (error  interval) {
-/*
- * We now divide error by 4(via shift), which checks if
- * the error is greater than twice the interval.
- * If it is greater, we need a bigadjust, if its smaller,
- * we can adjust by 1.
- */
-error 

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-02 Thread John Stultz
On 12/02/2013 04:53 PM, John Stultz wrote:
> On 11/20/2013 10:39 AM, Miroslav Lichvar wrote:
>> On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
 In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
 error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
 update the maximum error went down from 480 microseconds to 55
 nanoseconds.
>>> Curious what sort of a environment you're using for simulation (as well
>>> as the real test below)?
>> I compile the kernel timekeeping.c file into a test program where a
>> fake TSC is provided to the kernel code and I can easily control the
>> tick length and timekeeper updates. The program collects pairs of
>> TSC/getnstimeofday() values and then it runs linear regression through
>> the points to see the frequency offset and how stable the clock was.
>>
>> http://mlichvar.fedorapeople.org/tmp/tk_test.tar.gz
> So.. this doesn't build for me.
>
> timekeeping.o: In function `change_clocksource':
> timekeeping.c:(.text+0x535): undefined reference to `lock_acquire'
> timekeeping.c:(.text+0x588): undefined reference to `lock_release'
> timekeeping.o: In function `__getnstimeofday':
> timekeeping.c:(.text+0x675): undefined reference to `trace_hardirqs_off'
> timekeeping.c:(.text+0x69b): undefined reference to `lock_acquire'
> timekeeping.c:(.text+0x6b0): undefined reference to `lock_release'
> timekeeping.c:(.text+0x6c2): undefined reference to `trace_hardirqs_on'
> timekeeping.c:(.text+0x77c): undefined reference to `trace_hardirqs_off'
>
>
> Do you need a special .config in your kernel source? Changing lockdep
> and irq tracer configs don't seem to fix things for me.

Finally found a config to get it working (disabling kernel debugging
seems to work), and am currently trying to fixup the missing symbols
(although I'm getting segfaults from various inline cli's :)

Very cool simulator, by the way. Do you plan to have a git repo at some
point for it?


>
>>> But this is all very interesting! Thanks again for digging into this
>>> issue and sending the patch!
>> Thanks for looking into it. I agree this is a rather radical change
>> and it needs to be done very carefully. At this point, I'd like to
>> know if you think there are no fundamental problems in the design and
>> whether it would be an acceptable replacement.
>>
>> A different approach to fix this problem would be controlling the
>> maximum idle interval from the tick adjusting code so that the NTP
>> error corrections can be accurate. That would further increase the
>> complexity of the code and increase the interrupt rate.
> So I'm still trying to break apart the larger change you've made into
> smaller functional steps.
>
> The main two differences I see are:
>
> 1) Calculating the mult value directly from the ntp_tick_length() value
> (via the division) each update cycle.
>
> 2) Drastically simplifying the ntp_error correction to a 0/1 multiplier
> adjustment (assuming the calculated mult will be slightly slow, due to
> truncating the remainder in the division) based on the sign of the error.
>
> (and I'm ignoring the previously mentioned accounting changes that
> appear to be improperly dropped ;)
>
> This makes me suspect the main issue is we're over-correcting in the
> timekeeping_bigadjust() logic with nohz and I'm curious if instead we
> could either limit timekeeping_bigadjust() adjustment to achieve the
> same stability? In particular, bigadjust()'s exponential adjustment of
> mult seems problematic.

Tuning the look_ahead seems to solve most of the issues, at least using
your simulator. But this may not be yet ideal for all cases. But
overall, I'm glad, since that particular code was always the most
opaque, so this gives some good reason for us to sort it out and get it
documented better.

See the patch below. I'm doing some actual testing with it to see if its
maybe too dampened.

thanks
-john


diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..872c9c5 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1068,7 +1068,7 @@ static __always_inline int timekeeping_bigadjust(struct 
timekeeper *tk,
 * here.  This is tuned so that an error of about 1 msec is adjusted
 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
 */
-   error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
+   error2 = tk->ntp_error >> (NTP_SCALE_SHIFT/2);
error2 = abs(error2);
for (look_ahead = 0; error2 > 0; look_ahead++)
error2 >>= 2;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-02 Thread John Stultz
On 11/20/2013 10:39 AM, Miroslav Lichvar wrote:
> On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
>>> In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
>>> error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
>>> update the maximum error went down from 480 microseconds to 55
>>> nanoseconds.
>> Curious what sort of a environment you're using for simulation (as well
>> as the real test below)?
> I compile the kernel timekeeping.c file into a test program where a
> fake TSC is provided to the kernel code and I can easily control the
> tick length and timekeeper updates. The program collects pairs of
> TSC/getnstimeofday() values and then it runs linear regression through
> the points to see the frequency offset and how stable the clock was.
>
> http://mlichvar.fedorapeople.org/tmp/tk_test.tar.gz

So.. this doesn't build for me.

timekeeping.o: In function `change_clocksource':
timekeeping.c:(.text+0x535): undefined reference to `lock_acquire'
timekeeping.c:(.text+0x588): undefined reference to `lock_release'
timekeeping.o: In function `__getnstimeofday':
timekeeping.c:(.text+0x675): undefined reference to `trace_hardirqs_off'
timekeeping.c:(.text+0x69b): undefined reference to `lock_acquire'
timekeeping.c:(.text+0x6b0): undefined reference to `lock_release'
timekeeping.c:(.text+0x6c2): undefined reference to `trace_hardirqs_on'
timekeeping.c:(.text+0x77c): undefined reference to `trace_hardirqs_off'


Do you need a special .config in your kernel source? Changing lockdep
and irq tracer configs don't seem to fix things for me.



>> But this is all very interesting! Thanks again for digging into this
>> issue and sending the patch!
> Thanks for looking into it. I agree this is a rather radical change
> and it needs to be done very carefully. At this point, I'd like to
> know if you think there are no fundamental problems in the design and
> whether it would be an acceptable replacement.
>
> A different approach to fix this problem would be controlling the
> maximum idle interval from the tick adjusting code so that the NTP
> error corrections can be accurate. That would further increase the
> complexity of the code and increase the interrupt rate.

So I'm still trying to break apart the larger change you've made into
smaller functional steps.

The main two differences I see are:

1) Calculating the mult value directly from the ntp_tick_length() value
(via the division) each update cycle.

2) Drastically simplifying the ntp_error correction to a 0/1 multiplier
adjustment (assuming the calculated mult will be slightly slow, due to
truncating the remainder in the division) based on the sign of the error.

(and I'm ignoring the previously mentioned accounting changes that
appear to be improperly dropped ;)

This makes me suspect the main issue is we're over-correcting in the
timekeeping_bigadjust() logic with nohz and I'm curious if instead we
could either limit timekeeping_bigadjust() adjustment to achieve the
same stability? In particular, bigadjust()'s exponential adjustment of
mult seems problematic.

Am I missing or glossing over anything else that is key to your changes?

thanks
-john




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-02 Thread John Stultz
On 11/20/2013 10:39 AM, Miroslav Lichvar wrote:
 On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
 In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
 error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
 update the maximum error went down from 480 microseconds to 55
 nanoseconds.
 Curious what sort of a environment you're using for simulation (as well
 as the real test below)?
 I compile the kernel timekeeping.c file into a test program where a
 fake TSC is provided to the kernel code and I can easily control the
 tick length and timekeeper updates. The program collects pairs of
 TSC/getnstimeofday() values and then it runs linear regression through
 the points to see the frequency offset and how stable the clock was.

 http://mlichvar.fedorapeople.org/tmp/tk_test.tar.gz

So.. this doesn't build for me.

timekeeping.o: In function `change_clocksource':
timekeeping.c:(.text+0x535): undefined reference to `lock_acquire'
timekeeping.c:(.text+0x588): undefined reference to `lock_release'
timekeeping.o: In function `__getnstimeofday':
timekeeping.c:(.text+0x675): undefined reference to `trace_hardirqs_off'
timekeeping.c:(.text+0x69b): undefined reference to `lock_acquire'
timekeeping.c:(.text+0x6b0): undefined reference to `lock_release'
timekeeping.c:(.text+0x6c2): undefined reference to `trace_hardirqs_on'
timekeeping.c:(.text+0x77c): undefined reference to `trace_hardirqs_off'


Do you need a special .config in your kernel source? Changing lockdep
and irq tracer configs don't seem to fix things for me.



 But this is all very interesting! Thanks again for digging into this
 issue and sending the patch!
 Thanks for looking into it. I agree this is a rather radical change
 and it needs to be done very carefully. At this point, I'd like to
 know if you think there are no fundamental problems in the design and
 whether it would be an acceptable replacement.

 A different approach to fix this problem would be controlling the
 maximum idle interval from the tick adjusting code so that the NTP
 error corrections can be accurate. That would further increase the
 complexity of the code and increase the interrupt rate.

So I'm still trying to break apart the larger change you've made into
smaller functional steps.

The main two differences I see are:

1) Calculating the mult value directly from the ntp_tick_length() value
(via the division) each update cycle.

2) Drastically simplifying the ntp_error correction to a 0/1 multiplier
adjustment (assuming the calculated mult will be slightly slow, due to
truncating the remainder in the division) based on the sign of the error.

(and I'm ignoring the previously mentioned accounting changes that
appear to be improperly dropped ;)

This makes me suspect the main issue is we're over-correcting in the
timekeeping_bigadjust() logic with nohz and I'm curious if instead we
could either limit timekeeping_bigadjust() adjustment to achieve the
same stability? In particular, bigadjust()'s exponential adjustment of
mult seems problematic.

Am I missing or glossing over anything else that is key to your changes?

thanks
-john




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-12-02 Thread John Stultz
On 12/02/2013 04:53 PM, John Stultz wrote:
 On 11/20/2013 10:39 AM, Miroslav Lichvar wrote:
 On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
 In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
 error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
 update the maximum error went down from 480 microseconds to 55
 nanoseconds.
 Curious what sort of a environment you're using for simulation (as well
 as the real test below)?
 I compile the kernel timekeeping.c file into a test program where a
 fake TSC is provided to the kernel code and I can easily control the
 tick length and timekeeper updates. The program collects pairs of
 TSC/getnstimeofday() values and then it runs linear regression through
 the points to see the frequency offset and how stable the clock was.

 http://mlichvar.fedorapeople.org/tmp/tk_test.tar.gz
 So.. this doesn't build for me.

 timekeeping.o: In function `change_clocksource':
 timekeeping.c:(.text+0x535): undefined reference to `lock_acquire'
 timekeeping.c:(.text+0x588): undefined reference to `lock_release'
 timekeeping.o: In function `__getnstimeofday':
 timekeeping.c:(.text+0x675): undefined reference to `trace_hardirqs_off'
 timekeeping.c:(.text+0x69b): undefined reference to `lock_acquire'
 timekeeping.c:(.text+0x6b0): undefined reference to `lock_release'
 timekeeping.c:(.text+0x6c2): undefined reference to `trace_hardirqs_on'
 timekeeping.c:(.text+0x77c): undefined reference to `trace_hardirqs_off'


 Do you need a special .config in your kernel source? Changing lockdep
 and irq tracer configs don't seem to fix things for me.

Finally found a config to get it working (disabling kernel debugging
seems to work), and am currently trying to fixup the missing symbols
(although I'm getting segfaults from various inline cli's :)

Very cool simulator, by the way. Do you plan to have a git repo at some
point for it?



 But this is all very interesting! Thanks again for digging into this
 issue and sending the patch!
 Thanks for looking into it. I agree this is a rather radical change
 and it needs to be done very carefully. At this point, I'd like to
 know if you think there are no fundamental problems in the design and
 whether it would be an acceptable replacement.

 A different approach to fix this problem would be controlling the
 maximum idle interval from the tick adjusting code so that the NTP
 error corrections can be accurate. That would further increase the
 complexity of the code and increase the interrupt rate.
 So I'm still trying to break apart the larger change you've made into
 smaller functional steps.

 The main two differences I see are:

 1) Calculating the mult value directly from the ntp_tick_length() value
 (via the division) each update cycle.

 2) Drastically simplifying the ntp_error correction to a 0/1 multiplier
 adjustment (assuming the calculated mult will be slightly slow, due to
 truncating the remainder in the division) based on the sign of the error.

 (and I'm ignoring the previously mentioned accounting changes that
 appear to be improperly dropped ;)

 This makes me suspect the main issue is we're over-correcting in the
 timekeeping_bigadjust() logic with nohz and I'm curious if instead we
 could either limit timekeeping_bigadjust() adjustment to achieve the
 same stability? In particular, bigadjust()'s exponential adjustment of
 mult seems problematic.

Tuning the look_ahead seems to solve most of the issues, at least using
your simulator. But this may not be yet ideal for all cases. But
overall, I'm glad, since that particular code was always the most
opaque, so this gives some good reason for us to sort it out and get it
documented better.

See the patch below. I'm doing some actual testing with it to see if its
maybe too dampened.

thanks
-john


diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..872c9c5 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1068,7 +1068,7 @@ static __always_inline int timekeeping_bigadjust(struct 
timekeeper *tk,
 * here.  This is tuned so that an error of about 1 msec is adjusted
 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
 */
-   error2 = tk-ntp_error  (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
+   error2 = tk-ntp_error  (NTP_SCALE_SHIFT/2);
error2 = abs(error2);
for (look_ahead = 0; error2  0; look_ahead++)
error2 = 2;

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-27 Thread Richard Cochran
On Tue, Nov 19, 2013 at 03:13:19PM +0100, Richard Cochran wrote:
> 
> In this test, the update rate is once per second. When using longer
> intervals, the problem becomes worse.

Here is another pair of example runs on an idle system, this time with
a 32 second update interval.

* Periodic Case

  CONFIG_HZ_PERIODIC=y
  CONFIG_NO_HZ=y
  CONFIG_HZ_1000=y

  The peak to peak time error is about 600 nanoseconds.

sudo ./phc2sys -s eth3 -m -q -l7 -O0 -P0 -I0 -R.03125
phc2sys[118.045]: PI servo: sync interval 32.000 kp 0.022 ki 0.009375
phc2sys[150.045]: sys offset894205 s0 freq  +0 delay   5172
phc2sys[182.045]: sys offset   1416429 s1 freq  +16319 delay   5176
phc2sys[214.045]: sys offset 7 s2 freq  +16320 delay   5172
phc2sys[246.045]: sys offset   140 s2 freq  +16324 delay   5179
phc2sys[278.045]: sys offset   346 s2 freq  +16332 delay   5171
phc2sys[310.045]: sys offset   494 s2 freq  +16339 delay   5184
phc2sys[342.046]: sys offset   485 s2 freq  +16344 delay   5191
phc2sys[374.046]: sys offset   251 s2 freq  +16341 delay   5172
phc2sys[406.046]: sys offset   153 s2 freq  +16340 delay   5174
phc2sys[438.046]: sys offset   156 s2 freq  +16342 delay   5193
phc2sys[470.046]: sys offset39 s2 freq  +16340 delay   5179
phc2sys[502.046]: sys offset   173 s2 freq  +16344 delay   5076
phc2sys[534.046]: sys offset   286 s2 freq  +16349 delay   5171
phc2sys[566.047]: sys offset   130 s2 freq  +16347 delay   5159
phc2sys[598.047]: sys offset   -64 s2 freq  +16342 delay   5191
phc2sys[630.047]: sys offset   176 s2 freq  +16349 delay   5184
phc2sys[662.047]: sys offset   248 s2 freq  +16353 delay   5167
phc2sys[694.047]: sys offset   166 s2 freq  +16353 delay   5187
phc2sys[726.047]: sys offset   227 s2 freq  +16356 delay   5179
phc2sys[758.047]: sys offset91 s2 freq  +16354 delay   5177
phc2sys[790.048]: sys offset   -12 s2 freq  +16352 delay   5179
phc2sys[822.048]: sys offset   -44 s2 freq  +16351 delay   5189
phc2sys[854.048]: sys offset   -99 s2 freq  +16349 delay   5159
phc2sys[886.048]: sys offset53 s2 freq  +16352 delay   5184
phc2sys[918.048]: sys offset   241 s2 freq  +16359 delay   5172
phc2sys[950.048]: sys offset   293 s2 freq  +16363 delay   5191
phc2sys[982.048]: sys offset   -92 s2 freq  +16353 delay   5179
phc2sys[1014.048]: sys offset   -61 s2 freq  +16354 delay   5172
phc2sys[1046.049]: sys offset   -42 s2 freq  +16354 delay   5186

* NO_HZ 

  CONFIG_NO_HZ_COMMON=y
  CONFIG_NO_HZ_IDLE=y
  CONFIG_NO_HZ=y
  CONFIG_RCU_FAST_NO_HZ=y
  CONFIG_HZ_250=y

  The peak to peak time error reaches 9.5 microseconds.

sudo ./phc2sys -s eth3 -m -q -l7 -O0 -P0 -I0 -R.03125
phc2sys[2036.649]: PI servo: sync interval 32.000 kp 0.022 ki 0.009375
phc2sys[2068.650]: sys offset   1322846 s0 freq  +0 delay   5184
phc2sys[2100.650]: sys offset   1845801 s1 freq  +16342 delay   5172
phc2sys[2132.650]: sys offset  -643 s2 freq  +16322 delay   5197
phc2sys[2164.650]: sys offset   111 s2 freq  +16340 delay   5192
phc2sys[2196.650]: sys offset -2110 s2 freq  +16271 delay   5174
phc2sys[2228.650]: sys offset  4545 s2 freq  +16460 delay   5172
phc2sys[2260.650]: sys offset -3666 s2 freq  +16246 delay   5160
phc2sys[2292.651]: sys offset  4454 s2 freq  +16465 delay   5183
phc2sys[2324.651]: sys offset -1017 s2 freq  +16336 delay   5166
phc2sys[2356.651]: sys offset -1248 s2 freq  +16319 delay   5159
phc2sys[2388.651]: sys offset  2824 s2 freq  +16435 delay   5174
phc2sys[2420.651]: sys offset -2253 s2 freq  +16302 delay   5174
phc2sys[2452.651]: sys offset  -431 s2 freq  +16338 delay   5186
phc2sys[2484.651]: sys offset   374 s2 freq  +16359 delay   5191
phc2sys[2516.652]: sys offset  1198 s2 freq  +16389 delay   5176
phc2sys[2548.652]: sys offset -1803 s2 freq  +16306 delay   5184
phc2sys[2580.652]: sys offset  -295 s2 freq  +16336 delay   5179
phc2sys[2612.652]: sys offset   753 s2 freq  +16366 delay   5188
phc2sys[2644.652]: sys offset  -907 s2 freq  +16321 delay   5184
phc2sys[2676.652]: sys offset  -128 s2 freq  +16337 delay   5111
phc2sys[2708.652]: sys offset -2000 s2 freq  +16277 delay   5179
phc2sys[2740.653]: sys offset  2270 s2 freq  +16392 delay   5174
phc2sys[2772.653]: sys offset   718 s2 freq  +16365 delay   5191
phc2sys[2804.653]: sys offset -1004 s2 freq  +16318 delay   5172
phc2sys[2836.653]: sys offset  -331 s2 freq  +16329 delay   5184
phc2sys[2868.653]: sys offset  1667 s2 freq  +16389 delay   5081
phc2sys[2900.653]: sys offset -1154 s2 freq  +16316 delay   5192
phc2sys[2932.653]: sys offset  5693 s2 freq  +16519 delay   5184
phc2sys[2964.654]: sys offset -3833 s2 freq  +16275 delay   5171
 
Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-27 Thread Richard Cochran
On Tue, Nov 19, 2013 at 03:13:19PM +0100, Richard Cochran wrote:
 
 In this test, the update rate is once per second. When using longer
 intervals, the problem becomes worse.

Here is another pair of example runs on an idle system, this time with
a 32 second update interval.

* Periodic Case

  CONFIG_HZ_PERIODIC=y
  CONFIG_NO_HZ=y
  CONFIG_HZ_1000=y

  The peak to peak time error is about 600 nanoseconds.

sudo ./phc2sys -s eth3 -m -q -l7 -O0 -P0 -I0 -R.03125
phc2sys[118.045]: PI servo: sync interval 32.000 kp 0.022 ki 0.009375
phc2sys[150.045]: sys offset894205 s0 freq  +0 delay   5172
phc2sys[182.045]: sys offset   1416429 s1 freq  +16319 delay   5176
phc2sys[214.045]: sys offset 7 s2 freq  +16320 delay   5172
phc2sys[246.045]: sys offset   140 s2 freq  +16324 delay   5179
phc2sys[278.045]: sys offset   346 s2 freq  +16332 delay   5171
phc2sys[310.045]: sys offset   494 s2 freq  +16339 delay   5184
phc2sys[342.046]: sys offset   485 s2 freq  +16344 delay   5191
phc2sys[374.046]: sys offset   251 s2 freq  +16341 delay   5172
phc2sys[406.046]: sys offset   153 s2 freq  +16340 delay   5174
phc2sys[438.046]: sys offset   156 s2 freq  +16342 delay   5193
phc2sys[470.046]: sys offset39 s2 freq  +16340 delay   5179
phc2sys[502.046]: sys offset   173 s2 freq  +16344 delay   5076
phc2sys[534.046]: sys offset   286 s2 freq  +16349 delay   5171
phc2sys[566.047]: sys offset   130 s2 freq  +16347 delay   5159
phc2sys[598.047]: sys offset   -64 s2 freq  +16342 delay   5191
phc2sys[630.047]: sys offset   176 s2 freq  +16349 delay   5184
phc2sys[662.047]: sys offset   248 s2 freq  +16353 delay   5167
phc2sys[694.047]: sys offset   166 s2 freq  +16353 delay   5187
phc2sys[726.047]: sys offset   227 s2 freq  +16356 delay   5179
phc2sys[758.047]: sys offset91 s2 freq  +16354 delay   5177
phc2sys[790.048]: sys offset   -12 s2 freq  +16352 delay   5179
phc2sys[822.048]: sys offset   -44 s2 freq  +16351 delay   5189
phc2sys[854.048]: sys offset   -99 s2 freq  +16349 delay   5159
phc2sys[886.048]: sys offset53 s2 freq  +16352 delay   5184
phc2sys[918.048]: sys offset   241 s2 freq  +16359 delay   5172
phc2sys[950.048]: sys offset   293 s2 freq  +16363 delay   5191
phc2sys[982.048]: sys offset   -92 s2 freq  +16353 delay   5179
phc2sys[1014.048]: sys offset   -61 s2 freq  +16354 delay   5172
phc2sys[1046.049]: sys offset   -42 s2 freq  +16354 delay   5186

* NO_HZ 

  CONFIG_NO_HZ_COMMON=y
  CONFIG_NO_HZ_IDLE=y
  CONFIG_NO_HZ=y
  CONFIG_RCU_FAST_NO_HZ=y
  CONFIG_HZ_250=y

  The peak to peak time error reaches 9.5 microseconds.

sudo ./phc2sys -s eth3 -m -q -l7 -O0 -P0 -I0 -R.03125
phc2sys[2036.649]: PI servo: sync interval 32.000 kp 0.022 ki 0.009375
phc2sys[2068.650]: sys offset   1322846 s0 freq  +0 delay   5184
phc2sys[2100.650]: sys offset   1845801 s1 freq  +16342 delay   5172
phc2sys[2132.650]: sys offset  -643 s2 freq  +16322 delay   5197
phc2sys[2164.650]: sys offset   111 s2 freq  +16340 delay   5192
phc2sys[2196.650]: sys offset -2110 s2 freq  +16271 delay   5174
phc2sys[2228.650]: sys offset  4545 s2 freq  +16460 delay   5172
phc2sys[2260.650]: sys offset -3666 s2 freq  +16246 delay   5160
phc2sys[2292.651]: sys offset  4454 s2 freq  +16465 delay   5183
phc2sys[2324.651]: sys offset -1017 s2 freq  +16336 delay   5166
phc2sys[2356.651]: sys offset -1248 s2 freq  +16319 delay   5159
phc2sys[2388.651]: sys offset  2824 s2 freq  +16435 delay   5174
phc2sys[2420.651]: sys offset -2253 s2 freq  +16302 delay   5174
phc2sys[2452.651]: sys offset  -431 s2 freq  +16338 delay   5186
phc2sys[2484.651]: sys offset   374 s2 freq  +16359 delay   5191
phc2sys[2516.652]: sys offset  1198 s2 freq  +16389 delay   5176
phc2sys[2548.652]: sys offset -1803 s2 freq  +16306 delay   5184
phc2sys[2580.652]: sys offset  -295 s2 freq  +16336 delay   5179
phc2sys[2612.652]: sys offset   753 s2 freq  +16366 delay   5188
phc2sys[2644.652]: sys offset  -907 s2 freq  +16321 delay   5184
phc2sys[2676.652]: sys offset  -128 s2 freq  +16337 delay   5111
phc2sys[2708.652]: sys offset -2000 s2 freq  +16277 delay   5179
phc2sys[2740.653]: sys offset  2270 s2 freq  +16392 delay   5174
phc2sys[2772.653]: sys offset   718 s2 freq  +16365 delay   5191
phc2sys[2804.653]: sys offset -1004 s2 freq  +16318 delay   5172
phc2sys[2836.653]: sys offset  -331 s2 freq  +16329 delay   5184
phc2sys[2868.653]: sys offset  1667 s2 freq  +16389 delay   5081
phc2sys[2900.653]: sys offset -1154 s2 freq  +16316 delay   5192
phc2sys[2932.653]: sys offset  5693 s2 freq  +16519 delay   5184
phc2sys[2964.654]: sys offset -3833 s2 freq  +16275 delay   5171
 
Thanks,
Richard
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-21 Thread Miroslav Lichvar
On Mon, Nov 18, 2013 at 01:28:07PM -0800, John Stultz wrote:
> Also is this brokenness something that has been around for awhile for
> you or more recently cropped up?  I'm wondering as nohz idle has been
> around for quite a few years and I've not seen too many complaints. So
> I'm wondering if I'm looking in the right places, or if something has
> regressed recently, or if its just that accuracy expectations have gone up?

I think the problem was there right from the beginning when the
internal synchronization loop was introduced, but before PTP hardware
clocks there might not had been a reference which could be used to
evaluate the stability of the system clock at this level. The quantum
mechanical property of the problem that it disappears when you
increase sampling doesn't help much either :).

IIRC the first time I noticed something wasn't quite alright was in
2009 when I wrote a PPS refclock driver for chrony. With a GPS
receiver connected to serial port the synchronization seemed to work
better when the system was loaded. My explanation at the time was that
it's a HW limitation, the interrupt latency/jitter is worse when the
CPU is idle. I had no idea there could be a problem in the kernel
timekeeping.

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-21 Thread Miroslav Lichvar
On Mon, Nov 18, 2013 at 01:28:07PM -0800, John Stultz wrote:
 Also is this brokenness something that has been around for awhile for
 you or more recently cropped up?  I'm wondering as nohz idle has been
 around for quite a few years and I've not seen too many complaints. So
 I'm wondering if I'm looking in the right places, or if something has
 regressed recently, or if its just that accuracy expectations have gone up?

I think the problem was there right from the beginning when the
internal synchronization loop was introduced, but before PTP hardware
clocks there might not had been a reference which could be used to
evaluate the stability of the system clock at this level. The quantum
mechanical property of the problem that it disappears when you
increase sampling doesn't help much either :).

IIRC the first time I noticed something wasn't quite alright was in
2009 when I wrote a PPS refclock driver for chrony. With a GPS
receiver connected to serial port the synchronization seemed to work
better when the system was loaded. My explanation at the time was that
it's a HW limitation, the interrupt latency/jitter is worse when the
CPU is idle. I had no idea there could be a problem in the kernel
timekeeping.

-- 
Miroslav Lichvar
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-20 Thread Miroslav Lichvar
On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
> Hmm. Reading this, but not having studied your patch in depth, is
> interesting. It originally was that we only applied any NTP adjustment
> to future changes. Also, since at that time the tick length was only
> changed on the second overflow, we ran into some problems, since there
> were some applications using the ntp interface to make very small
> adjustments for only a few ms. Thus we changed (see fdcedf7b75808 -
> time: apply NTP frequency/tick changes immediately) it to adjust the
> tick length immediately.
> 
> If this is the core issue, then maybe would revisiting that change alone
> improve things?

No, this doesn't seem to be related to the problem with stability.
Applying changes in tick length immediately is definitely a useful
feature. One item in my todo list for this patch is to make such phase
adjustments more accurate by forcing the timekeeper update immediately
after adjtimex().

> Also I'm not sure if I quite follow the bit about "a change in the tick
> was applied immediately to all ticks since the last update", since at
> least with classic nohz, if someone is making a change to the ntp tick
> length, we're not idle, so we should still be getting regular ticks. So
> at most, it should be the modified ntp tick length is immediately
> applied to the current  tick in progress, as well as following ticks.

I meant the tick length used in the NTP error calculation
in the logarithmic accumulation function (tk->ntp_error +=
ntp_tick_length() << shift). The code calls second_overflow() before
the NTP error is updated, which means the NTP error will jump by the
difference in the tick length multiplied by number of accumulated
ticks in that step and that's extra work for the loop to correct. This
must not happen in the design I'm proposing.

Probably a stupid question, but can adjtimex() or clock_gettime() be
called before the clock is updated after an idle period? I'm wondering
if this assumption would allow the timekeeper to correct the NTP error
simply by stepping the clock.

> Now with nohz_full this is definitely more complicated, but I want to
> make sure I really understand what you're seeing. So are the issues
> you're seeing w/ classic idle nohz or nohz full?

I was using only kernels with classic idle nohz.

> One of the issues I've recently started to worry about with the existing
> code, is that the ntp_error value can really only hold about 2 seconds
> worth of error. Now, that's the internal loop adjustment error (ie: the
> error from what ntpd specified the kernel to be and where the kernel is
> which should be very small), not the actual drift from the ntp server,
> so it really shouldn't be problematic, but as we do get into *very*
> large nohz idle values, it seems possible that we could accumulate more
> then 2 seconds and see overflows of ntp_error, which could cause
> instability.  So I'm curious if you were seeing anything like this, or
> at least ask if you've already eliminated this as a potential source of
> the problem you were seeing.

That doesn't seem to be related to the problem I'm seeing. With 1Hz
updates I think ntp_error was always in microseconds or milliseconds
at most.

> > The only source of the accounted NTP error is now lack of resolution in
> > the clock multiplier. The error is corrected by adding 0 or 1 to the
> > calculated multiplier. With a constant tick length, the multiplier will
> > be switching between the two values. The loop is greatly simplified and
> > there is no problem with stability. The maximum error is limited by the
> > update interval.
> 
> Hrmm.. So I wonder how well this works with fairly coarse grained
> clocksources? Quite a bit of the previous design (which is still looks
> mostly in place with a superficial review of your patch) is focused on
> keeping our long-term accuracy correct via the internally managed
> feedback loop, even if we had a very coarse clocksource shift value. 
> While improving the nohz performance is desired, I'm not sure we want to
> also throw away that managed precision.

There shouldn't be any regressions in the long-term frequency
accuracy. Short-term frequency accuracy should be better as the bug is
fixed :). 

However, the PLL/FLL and singleshot phase adjustments will be possibly
even less accurate as the whole accumulated interval will use only one
tick length. In the current code, the adjustments could be made
accurate by splitting the accumulated interval at start of second and
updating NTP error for each part with the right tick length. In the
design I'm proposing that will no longer be possible.

> Also, if the ntp_mult_correction is always 0 or 1, (which handles if
> we're too slow) how does it handle if we're running too fast?

If tick length is not divisible by cycles, with zero
ntp_mult_correction the clock will be running slower. If it is
divisible, the clock will be running at the exact same rate and stay
slightly ahead. How 

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-20 Thread Miroslav Lichvar
On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
 Hmm. Reading this, but not having studied your patch in depth, is
 interesting. It originally was that we only applied any NTP adjustment
 to future changes. Also, since at that time the tick length was only
 changed on the second overflow, we ran into some problems, since there
 were some applications using the ntp interface to make very small
 adjustments for only a few ms. Thus we changed (see fdcedf7b75808 -
 time: apply NTP frequency/tick changes immediately) it to adjust the
 tick length immediately.
 
 If this is the core issue, then maybe would revisiting that change alone
 improve things?

No, this doesn't seem to be related to the problem with stability.
Applying changes in tick length immediately is definitely a useful
feature. One item in my todo list for this patch is to make such phase
adjustments more accurate by forcing the timekeeper update immediately
after adjtimex().

 Also I'm not sure if I quite follow the bit about a change in the tick
 was applied immediately to all ticks since the last update, since at
 least with classic nohz, if someone is making a change to the ntp tick
 length, we're not idle, so we should still be getting regular ticks. So
 at most, it should be the modified ntp tick length is immediately
 applied to the current  tick in progress, as well as following ticks.

I meant the tick length used in the NTP error calculation
in the logarithmic accumulation function (tk-ntp_error +=
ntp_tick_length()  shift). The code calls second_overflow() before
the NTP error is updated, which means the NTP error will jump by the
difference in the tick length multiplied by number of accumulated
ticks in that step and that's extra work for the loop to correct. This
must not happen in the design I'm proposing.

Probably a stupid question, but can adjtimex() or clock_gettime() be
called before the clock is updated after an idle period? I'm wondering
if this assumption would allow the timekeeper to correct the NTP error
simply by stepping the clock.

 Now with nohz_full this is definitely more complicated, but I want to
 make sure I really understand what you're seeing. So are the issues
 you're seeing w/ classic idle nohz or nohz full?

I was using only kernels with classic idle nohz.

 One of the issues I've recently started to worry about with the existing
 code, is that the ntp_error value can really only hold about 2 seconds
 worth of error. Now, that's the internal loop adjustment error (ie: the
 error from what ntpd specified the kernel to be and where the kernel is
 which should be very small), not the actual drift from the ntp server,
 so it really shouldn't be problematic, but as we do get into *very*
 large nohz idle values, it seems possible that we could accumulate more
 then 2 seconds and see overflows of ntp_error, which could cause
 instability.  So I'm curious if you were seeing anything like this, or
 at least ask if you've already eliminated this as a potential source of
 the problem you were seeing.

That doesn't seem to be related to the problem I'm seeing. With 1Hz
updates I think ntp_error was always in microseconds or milliseconds
at most.

  The only source of the accounted NTP error is now lack of resolution in
  the clock multiplier. The error is corrected by adding 0 or 1 to the
  calculated multiplier. With a constant tick length, the multiplier will
  be switching between the two values. The loop is greatly simplified and
  there is no problem with stability. The maximum error is limited by the
  update interval.
 
 Hrmm.. So I wonder how well this works with fairly coarse grained
 clocksources? Quite a bit of the previous design (which is still looks
 mostly in place with a superficial review of your patch) is focused on
 keeping our long-term accuracy correct via the internally managed
 feedback loop, even if we had a very coarse clocksource shift value. 
 While improving the nohz performance is desired, I'm not sure we want to
 also throw away that managed precision.

There shouldn't be any regressions in the long-term frequency
accuracy. Short-term frequency accuracy should be better as the bug is
fixed :). 

However, the PLL/FLL and singleshot phase adjustments will be possibly
even less accurate as the whole accumulated interval will use only one
tick length. In the current code, the adjustments could be made
accurate by splitting the accumulated interval at start of second and
updating NTP error for each part with the right tick length. In the
design I'm proposing that will no longer be possible.

 Also, if the ntp_mult_correction is always 0 or 1, (which handles if
 we're too slow) how does it handle if we're running too fast?

If tick length is not divisible by cycles, with zero
ntp_mult_correction the clock will be running slower. If it is
divisible, the clock will be running at the exact same rate and stay
slightly ahead. How much it will be ahead depends on how long was the
last 

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-19 Thread Richard Cochran
On Mon, Nov 18, 2013 at 01:28:07PM -0800, John Stultz wrote:
 
> Also, just for clarity's sake, when you're seeing things "broken",
> curious how/what you are measuring?

Here is the background for this issue. The linuxptp stack has a
program called phc2sys whose purpose is to synchronize the Linux
system clock with a PTP hardware clock typically residing on a PCIe
card. This program uses the PTP_SYS_OFFSET ioctl to read the clocks.

One user on the list was complaining that the reported time and
frequency errors were unacceptably large. After vainly trying
different PI weights, we found out that the poor performance is
due to the nohz setting in the kernel.

See below for example logs.
 
> Also is this brokenness something that has been around for awhile for
> you or more recently cropped up?  I'm wondering as nohz idle has been
> around for quite a few years and I've not seen too many complaints. So
> I'm wondering if I'm looking in the right places, or if something has
> regressed recently, or if its just that accuracy expectations have gone up?

I have always avoided nohz like the plague. IIRC, it would not even
compile on ARM for the longest time. In any case, I really don't know
when this issue appeared. At some point, nohz became the Kconfig
default, and I did not notice, and so now I had it enabled by
accident.

Here are two sample runs of phc2sys. The PHC is an Intel i210 PCIe
card. I simply set the PHC time to the Linux system time (using
testptp -s), and then let PHC clock run free. During the test, the
phc2sys program is the only active program, and the system is
otherwise idle.

In this test, the update rate is once per second. When using longer
intervals, the problem becomes worse.

In the listings, "sys offset" is measured in nanosecond using the
PTP_SYS_OFFSET ioctl, "freq" is the PI servo output in ppb, and
"delay" is the time it took to read the two clocks in nanoseconds.

* Periodic Case

  CONFIG_HZ_PERIODIC=y
  CONFIG_NO_HZ=y
  CONFIG_HZ_1000=y

  Here we observe a time error of about 100 nanoseconds peak to peak
  and a frequency error within about .2 PPM.

sudo ./phc2sys -s eth3 -m -q -l7 -O0
phc2sys[5050.678]: PI servo: sync interval 1.000 kp 0.700 ki 0.30
phc2sys[5051.678]: sys offset287239 s0 freq  +0 delay   5164
phc2sys[5052.678]: sys offset303841 s1 freq  +16600 delay   5161
phc2sys[5053.679]: sys offset11 s2 freq  +16611 delay   5184
phc2sys[5054.679]: sys offset-6 s2 freq  +16597 delay   5191
phc2sys[5055.679]: sys offset-6 s2 freq  +16595 delay   5121
phc2sys[5056.679]: sys offset   -32 s2 freq  +16567 delay   5184
phc2sys[5057.679]: sys offset-6 s2 freq  +16584 delay   5197
phc2sys[5058.679]: sys offset10 s2 freq  +16598 delay   5186
phc2sys[5059.679]: sys offset14 s2 freq  +16605 delay   5162
phc2sys[5060.680]: sys offset   -12 s2 freq  +16583 delay   5212
phc2sys[5061.680]: sys offset   -15 s2 freq  +16576 delay   5196
phc2sys[5062.680]: sys offset-4 s2 freq  +16583 delay   5186
phc2sys[5063.680]: sys offset21 s2 freq  +16607 delay   5187
phc2sys[5064.680]: sys offset 5 s2 freq  +16597 delay   5196
phc2sys[5065.680]: sys offset-1 s2 freq  +16593 delay   5192
phc2sys[5066.680]: sys offset13 s2 freq  +16606 delay   5177
phc2sys[5067.680]: sys offset 5 s2 freq  +16602 delay   5212
phc2sys[5068.681]: sys offset11 s2 freq  +16610 delay   5191
phc2sys[5069.681]: sys offset   -19 s2 freq  +16583 delay   5203
phc2sys[5070.681]: sys offset-3 s2 freq  +16593 delay   5185
phc2sys[5071.681]: sys offset 9 s2 freq  +16604 delay   5197
phc2sys[5072.681]: sys offset 4 s2 freq  +16602 delay   5176
phc2sys[5073.681]: sys offset 7 s2 freq  +16606 delay   5196
phc2sys[5074.681]: sys offset-6 s2 freq  +16595 delay   5186
phc2sys[5075.681]: sys offset   -28 s2 freq  +16572 delay   5192
phc2sys[5076.682]: sys offset48 s2 freq  +16639 delay   5214

* NO_HZ 

  CONFIG_NO_HZ_COMMON=y
  CONFIG_NO_HZ_IDLE=y
  CONFIG_NO_HZ=y
  CONFIG_RCU_FAST_NO_HZ=y
  CONFIG_HZ_250=y

  Here we observe a time error of about 3 microseconds peak to peak
  and a frequency error within about 3 PPM.

cori@cher1293:~/git/linuxptp$ sudo ./phc2sys -s eth3 -m -q -l7 -O0
phc2sys[105.837]: PI servo: sync interval 1.000 kp 0.700 ki 0.30
phc2sys[106.837]: sys offset135052 s0 freq  +0 delay   5189
phc2sys[107.837]: sys offset151449 s1 freq  +16394 delay   5174
phc2sys[108.837]: sys offset   246 s2 freq  +16640 delay   5179
phc2sys[109.838]: sys offset  -185 s2 freq  +16283 delay   5171
phc2sys[110.838]: sys offset  -552 s2 freq  +15860 delay   5192
phc2sys[111.838]: sys offset  1383 s2 freq  +17630 delay   5179
phc2sys[112.838]: sys offset  -737 s2 freq  +15925 delay   5158
phc2sys[113.838]: sys offset  1147 s2 freq  +17587 delay   5171
phc2sys[114.839]: sys offset -1593 s2 freq  +15192 delay   5186

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-19 Thread Richard Cochran
On Mon, Nov 18, 2013 at 01:28:07PM -0800, John Stultz wrote:
 
 Also, just for clarity's sake, when you're seeing things broken,
 curious how/what you are measuring?

Here is the background for this issue. The linuxptp stack has a
program called phc2sys whose purpose is to synchronize the Linux
system clock with a PTP hardware clock typically residing on a PCIe
card. This program uses the PTP_SYS_OFFSET ioctl to read the clocks.

One user on the list was complaining that the reported time and
frequency errors were unacceptably large. After vainly trying
different PI weights, we found out that the poor performance is
due to the nohz setting in the kernel.

See below for example logs.
 
 Also is this brokenness something that has been around for awhile for
 you or more recently cropped up?  I'm wondering as nohz idle has been
 around for quite a few years and I've not seen too many complaints. So
 I'm wondering if I'm looking in the right places, or if something has
 regressed recently, or if its just that accuracy expectations have gone up?

I have always avoided nohz like the plague. IIRC, it would not even
compile on ARM for the longest time. In any case, I really don't know
when this issue appeared. At some point, nohz became the Kconfig
default, and I did not notice, and so now I had it enabled by
accident.

Here are two sample runs of phc2sys. The PHC is an Intel i210 PCIe
card. I simply set the PHC time to the Linux system time (using
testptp -s), and then let PHC clock run free. During the test, the
phc2sys program is the only active program, and the system is
otherwise idle.

In this test, the update rate is once per second. When using longer
intervals, the problem becomes worse.

In the listings, sys offset is measured in nanosecond using the
PTP_SYS_OFFSET ioctl, freq is the PI servo output in ppb, and
delay is the time it took to read the two clocks in nanoseconds.

* Periodic Case

  CONFIG_HZ_PERIODIC=y
  CONFIG_NO_HZ=y
  CONFIG_HZ_1000=y

  Here we observe a time error of about 100 nanoseconds peak to peak
  and a frequency error within about .2 PPM.

sudo ./phc2sys -s eth3 -m -q -l7 -O0
phc2sys[5050.678]: PI servo: sync interval 1.000 kp 0.700 ki 0.30
phc2sys[5051.678]: sys offset287239 s0 freq  +0 delay   5164
phc2sys[5052.678]: sys offset303841 s1 freq  +16600 delay   5161
phc2sys[5053.679]: sys offset11 s2 freq  +16611 delay   5184
phc2sys[5054.679]: sys offset-6 s2 freq  +16597 delay   5191
phc2sys[5055.679]: sys offset-6 s2 freq  +16595 delay   5121
phc2sys[5056.679]: sys offset   -32 s2 freq  +16567 delay   5184
phc2sys[5057.679]: sys offset-6 s2 freq  +16584 delay   5197
phc2sys[5058.679]: sys offset10 s2 freq  +16598 delay   5186
phc2sys[5059.679]: sys offset14 s2 freq  +16605 delay   5162
phc2sys[5060.680]: sys offset   -12 s2 freq  +16583 delay   5212
phc2sys[5061.680]: sys offset   -15 s2 freq  +16576 delay   5196
phc2sys[5062.680]: sys offset-4 s2 freq  +16583 delay   5186
phc2sys[5063.680]: sys offset21 s2 freq  +16607 delay   5187
phc2sys[5064.680]: sys offset 5 s2 freq  +16597 delay   5196
phc2sys[5065.680]: sys offset-1 s2 freq  +16593 delay   5192
phc2sys[5066.680]: sys offset13 s2 freq  +16606 delay   5177
phc2sys[5067.680]: sys offset 5 s2 freq  +16602 delay   5212
phc2sys[5068.681]: sys offset11 s2 freq  +16610 delay   5191
phc2sys[5069.681]: sys offset   -19 s2 freq  +16583 delay   5203
phc2sys[5070.681]: sys offset-3 s2 freq  +16593 delay   5185
phc2sys[5071.681]: sys offset 9 s2 freq  +16604 delay   5197
phc2sys[5072.681]: sys offset 4 s2 freq  +16602 delay   5176
phc2sys[5073.681]: sys offset 7 s2 freq  +16606 delay   5196
phc2sys[5074.681]: sys offset-6 s2 freq  +16595 delay   5186
phc2sys[5075.681]: sys offset   -28 s2 freq  +16572 delay   5192
phc2sys[5076.682]: sys offset48 s2 freq  +16639 delay   5214

* NO_HZ 

  CONFIG_NO_HZ_COMMON=y
  CONFIG_NO_HZ_IDLE=y
  CONFIG_NO_HZ=y
  CONFIG_RCU_FAST_NO_HZ=y
  CONFIG_HZ_250=y

  Here we observe a time error of about 3 microseconds peak to peak
  and a frequency error within about 3 PPM.

cori@cher1293:~/git/linuxptp$ sudo ./phc2sys -s eth3 -m -q -l7 -O0
phc2sys[105.837]: PI servo: sync interval 1.000 kp 0.700 ki 0.30
phc2sys[106.837]: sys offset135052 s0 freq  +0 delay   5189
phc2sys[107.837]: sys offset151449 s1 freq  +16394 delay   5174
phc2sys[108.837]: sys offset   246 s2 freq  +16640 delay   5179
phc2sys[109.838]: sys offset  -185 s2 freq  +16283 delay   5171
phc2sys[110.838]: sys offset  -552 s2 freq  +15860 delay   5192
phc2sys[111.838]: sys offset  1383 s2 freq  +17630 delay   5179
phc2sys[112.838]: sys offset  -737 s2 freq  +15925 delay   5158
phc2sys[113.838]: sys offset  1147 s2 freq  +17587 delay   5171
phc2sys[114.839]: sys offset -1593 s2 freq  +15192 delay   5186

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-18 Thread John Stultz
On 11/15/2013 11:03 PM, Richard Cochran wrote:
> On Thu, Nov 14, 2013 at 03:50:40PM +0100, Miroslav Lichvar wrote:
>
>>  include/linux/timekeeper_internal.h |   4 +
>>  kernel/time/timekeeping.c   | 209 
>> +---
>>  2 files changed, 31 insertions(+), 182 deletions(-)
> This looks like an impressive simplification...

Indeed!

>
>> - * So the following can be confusing.
> Yep.
>
> So I really have no idea how the deleted code worked (or didn't work
> for nohz), but I can confirm that nohz time keeping is broken under
> light system load. Running a high load (like recompiling the kernel on
> all cores for CONFIG_HZ_PERIODIC=y ;) hides the issue, but that is
> obviously not the right solution.
Thanks for confirming the issue, and yes, forcing periodic hz isn't the
right solution.

That said there is a bit of a tension between very long nohz periods and
very accurate ntp syncing. Its a bit like sleeping at the wheel: you can
either get your rest, or stay on the road. :)  

It may be that we need to feed some of the ntp error state into the
tick-scheduling logic, so we don't try to sleep very long when we know
we're on a curvy bend.

Also, just for clarity's sake, when you're seeing things "broken",
curious how/what you are measuring?

Also is this brokenness something that has been around for awhile for
you or more recently cropped up?  I'm wondering as nohz idle has been
around for quite a few years and I've not seen too many complaints. So
I'm wondering if I'm looking in the right places, or if something has
regressed recently, or if its just that accuracy expectations have gone up?


> Out of my ignorance, two questions spring to mind.
>
> 1. Considering the simplicity of Miroslav's patch, what was the
>benefit of the much more complicated code in the first place?

The much more complicated code was designed by Roman quite awhile back
(2006-ish). He was extremely bright, but produced very opaque code. It
made it very difficult to review, but once you sat down for awhile and
worked out the math, it ended up being correct and very efficient.

His concern was mostly about making the timer interrupt very fast on
very slow hardware (m68k was the architecture he co-maintained at the
time). So the code is very optimized for steady state, where the
adjustment value is 1,0, or -1. And we avoid doing any expensive math
operations (no multiplies, no divides), even in the non-steady state path.

Now it was designed before nohz was common (and back when nohz for a
full second was considered a aggresive length), so I could very well
believe that it has been stretched past its limit.  So I think your
point above about understanding exactly how it doesn't work for nohz is key.

Now, there is some parts of his look-ahead logic that I never quite
understood the rational for (see the top part of timekeeping_bigadjust).
At the time it was designed, I preferred a more PID-like approach, but
it was more computationally expensive and I couldn't measure any benefit
to my approach over his, so that part stuck around.


> 2. Does this patch work in the CONFIG_HZ_PERIODIC case just as well as
>the deleted code?

As I mentioned in my other email, I'm a little concerned about some of
the accounting that is removed. At a blackbox level, its not handling
all the same values, so I suspect there's some issues here, but they may
not be very significant or difficult to fix.

So while I'm very cautious to throwing out the complex code which has
worked relatively well for quite awhile, I am very interested in coming
up with a solution that is easier to understand and validate as correct.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-18 Thread John Stultz
On 11/14/2013 06:50 AM, Miroslav Lichvar wrote:
> Since commit 5eb6d205 the system clock is kept separately from NTP time
> and it is synchronized by adjusting its multiplier in a feedback loop.
> This works well when the updates are done regularly. With nohz and idle
> system, however, the loop becomes unstable at a certain update interval.
> The loop overcorrects and the frequency doesn't settle down. The clock
> has a large error, which seems to grow quadratically with update
> interval.
>
> If the constants used in the loop were modified for a maximum update
> interval, it would be stable, but too slow to keep up with NTP. Without
> knowing when will be the next update it's difficult to implement a loop
> that is both fast and stable.

So yes, trying to sort out how much of an adjustment to make when you
don't know how long the next interval will be is a difficult problem to
solve.

I'm definitely interested in ways to improve this. That said, there are
some major subtleties in the code you're removing. I'm never been a huge
fan of the code in question, but I did manage to sit down at one point
with some paper and see the design was right, its designed to be quite
efficient and has been mostly left alone for a number of years, so its
reasonably time tested (at least for the requirements it was designed
for). That said, its terribly opaque code, so I usually have to to
re-establish that proof to myself every time I look at it in depth. :)

So quite a bit of care and caution will be needed.

Also, I know you've had to have spent quite a bit of time researching
this issue and coming up with the solution, so I want to be clear I
really appreciate that! But if we are to replace this core logic, I'll
want to make sure I throughly understand your solution, so forgive me if
I have to ask quite a number of stupid sounding (or just stupid)
questions to clarify things and I'll likely propose alternative
solutions that may seem/be bone headed or overly cautious.  So please
bear with me through this. :)

> This patch fixes the problem by postponing update of NTP tick length in
> the clock and setting the multiplier directly without feedback loop by
> dividing the tick length with clock cycles. Previously, a change in tick
> length was applied immediately to all ticks since last update, which
> caused a jump in the NTP error proportional to the change and the update
> interval and which had to be corrected later by the loop.

Hmm. Reading this, but not having studied your patch in depth, is
interesting. It originally was that we only applied any NTP adjustment
to future changes. Also, since at that time the tick length was only
changed on the second overflow, we ran into some problems, since there
were some applications using the ntp interface to make very small
adjustments for only a few ms. Thus we changed (see fdcedf7b75808 -
time: apply NTP frequency/tick changes immediately) it to adjust the
tick length immediately.

If this is the core issue, then maybe would revisiting that change alone
improve things?

Also I'm not sure if I quite follow the bit about "a change in the tick
was applied immediately to all ticks since the last update", since at
least with classic nohz, if someone is making a change to the ntp tick
length, we're not idle, so we should still be getting regular ticks. So
at most, it should be the modified ntp tick length is immediately
applied to the current  tick in progress, as well as following ticks.
Now with nohz_full this is definitely more complicated, but I want to
make sure I really understand what you're seeing. So are the issues
you're seeing w/ classic idle nohz or nohz full?


One of the issues I've recently started to worry about with the existing
code, is that the ntp_error value can really only hold about 2 seconds
worth of error. Now, that's the internal loop adjustment error (ie: the
error from what ntpd specified the kernel to be and where the kernel is
which should be very small), not the actual drift from the ntp server,
so it really shouldn't be problematic, but as we do get into *very*
large nohz idle values, it seems possible that we could accumulate more
then 2 seconds and see overflows of ntp_error, which could cause
instability.  So I'm curious if you were seeing anything like this, or
at least ask if you've already eliminated this as a potential source of
the problem you were seeing.

> The only source of the accounted NTP error is now lack of resolution in
> the clock multiplier. The error is corrected by adding 0 or 1 to the
> calculated multiplier. With a constant tick length, the multiplier will
> be switching between the two values. The loop is greatly simplified and
> there is no problem with stability. The maximum error is limited by the
> update interval.

Hrmm.. So I wonder how well this works with fairly coarse grained
clocksources? Quite a bit of the previous design (which is still looks
mostly in place with a superficial review of your patch) is focused on

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-18 Thread John Stultz
On 11/14/2013 06:50 AM, Miroslav Lichvar wrote:
 Since commit 5eb6d205 the system clock is kept separately from NTP time
 and it is synchronized by adjusting its multiplier in a feedback loop.
 This works well when the updates are done regularly. With nohz and idle
 system, however, the loop becomes unstable at a certain update interval.
 The loop overcorrects and the frequency doesn't settle down. The clock
 has a large error, which seems to grow quadratically with update
 interval.

 If the constants used in the loop were modified for a maximum update
 interval, it would be stable, but too slow to keep up with NTP. Without
 knowing when will be the next update it's difficult to implement a loop
 that is both fast and stable.

So yes, trying to sort out how much of an adjustment to make when you
don't know how long the next interval will be is a difficult problem to
solve.

I'm definitely interested in ways to improve this. That said, there are
some major subtleties in the code you're removing. I'm never been a huge
fan of the code in question, but I did manage to sit down at one point
with some paper and see the design was right, its designed to be quite
efficient and has been mostly left alone for a number of years, so its
reasonably time tested (at least for the requirements it was designed
for). That said, its terribly opaque code, so I usually have to to
re-establish that proof to myself every time I look at it in depth. :)

So quite a bit of care and caution will be needed.

Also, I know you've had to have spent quite a bit of time researching
this issue and coming up with the solution, so I want to be clear I
really appreciate that! But if we are to replace this core logic, I'll
want to make sure I throughly understand your solution, so forgive me if
I have to ask quite a number of stupid sounding (or just stupid)
questions to clarify things and I'll likely propose alternative
solutions that may seem/be bone headed or overly cautious.  So please
bear with me through this. :)

 This patch fixes the problem by postponing update of NTP tick length in
 the clock and setting the multiplier directly without feedback loop by
 dividing the tick length with clock cycles. Previously, a change in tick
 length was applied immediately to all ticks since last update, which
 caused a jump in the NTP error proportional to the change and the update
 interval and which had to be corrected later by the loop.

Hmm. Reading this, but not having studied your patch in depth, is
interesting. It originally was that we only applied any NTP adjustment
to future changes. Also, since at that time the tick length was only
changed on the second overflow, we ran into some problems, since there
were some applications using the ntp interface to make very small
adjustments for only a few ms. Thus we changed (see fdcedf7b75808 -
time: apply NTP frequency/tick changes immediately) it to adjust the
tick length immediately.

If this is the core issue, then maybe would revisiting that change alone
improve things?

Also I'm not sure if I quite follow the bit about a change in the tick
was applied immediately to all ticks since the last update, since at
least with classic nohz, if someone is making a change to the ntp tick
length, we're not idle, so we should still be getting regular ticks. So
at most, it should be the modified ntp tick length is immediately
applied to the current  tick in progress, as well as following ticks.
Now with nohz_full this is definitely more complicated, but I want to
make sure I really understand what you're seeing. So are the issues
you're seeing w/ classic idle nohz or nohz full?


One of the issues I've recently started to worry about with the existing
code, is that the ntp_error value can really only hold about 2 seconds
worth of error. Now, that's the internal loop adjustment error (ie: the
error from what ntpd specified the kernel to be and where the kernel is
which should be very small), not the actual drift from the ntp server,
so it really shouldn't be problematic, but as we do get into *very*
large nohz idle values, it seems possible that we could accumulate more
then 2 seconds and see overflows of ntp_error, which could cause
instability.  So I'm curious if you were seeing anything like this, or
at least ask if you've already eliminated this as a potential source of
the problem you were seeing.

 The only source of the accounted NTP error is now lack of resolution in
 the clock multiplier. The error is corrected by adding 0 or 1 to the
 calculated multiplier. With a constant tick length, the multiplier will
 be switching between the two values. The loop is greatly simplified and
 there is no problem with stability. The maximum error is limited by the
 update interval.

Hrmm.. So I wonder how well this works with fairly coarse grained
clocksources? Quite a bit of the previous design (which is still looks
mostly in place with a superficial review of your patch) is focused on
keeping our long-term 

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-18 Thread John Stultz
On 11/15/2013 11:03 PM, Richard Cochran wrote:
 On Thu, Nov 14, 2013 at 03:50:40PM +0100, Miroslav Lichvar wrote:

  include/linux/timekeeper_internal.h |   4 +
  kernel/time/timekeeping.c   | 209 
 +---
  2 files changed, 31 insertions(+), 182 deletions(-)
 This looks like an impressive simplification...

Indeed!


 - * So the following can be confusing.
 Yep.

 So I really have no idea how the deleted code worked (or didn't work
 for nohz), but I can confirm that nohz time keeping is broken under
 light system load. Running a high load (like recompiling the kernel on
 all cores for CONFIG_HZ_PERIODIC=y ;) hides the issue, but that is
 obviously not the right solution.
Thanks for confirming the issue, and yes, forcing periodic hz isn't the
right solution.

That said there is a bit of a tension between very long nohz periods and
very accurate ntp syncing. Its a bit like sleeping at the wheel: you can
either get your rest, or stay on the road. :)  

It may be that we need to feed some of the ntp error state into the
tick-scheduling logic, so we don't try to sleep very long when we know
we're on a curvy bend.

Also, just for clarity's sake, when you're seeing things broken,
curious how/what you are measuring?

Also is this brokenness something that has been around for awhile for
you or more recently cropped up?  I'm wondering as nohz idle has been
around for quite a few years and I've not seen too many complaints. So
I'm wondering if I'm looking in the right places, or if something has
regressed recently, or if its just that accuracy expectations have gone up?


 Out of my ignorance, two questions spring to mind.

 1. Considering the simplicity of Miroslav's patch, what was the
benefit of the much more complicated code in the first place?

The much more complicated code was designed by Roman quite awhile back
(2006-ish). He was extremely bright, but produced very opaque code. It
made it very difficult to review, but once you sat down for awhile and
worked out the math, it ended up being correct and very efficient.

His concern was mostly about making the timer interrupt very fast on
very slow hardware (m68k was the architecture he co-maintained at the
time). So the code is very optimized for steady state, where the
adjustment value is 1,0, or -1. And we avoid doing any expensive math
operations (no multiplies, no divides), even in the non-steady state path.

Now it was designed before nohz was common (and back when nohz for a
full second was considered a aggresive length), so I could very well
believe that it has been stretched past its limit.  So I think your
point above about understanding exactly how it doesn't work for nohz is key.

Now, there is some parts of his look-ahead logic that I never quite
understood the rational for (see the top part of timekeeping_bigadjust).
At the time it was designed, I preferred a more PID-like approach, but
it was more computationally expensive and I couldn't measure any benefit
to my approach over his, so that part stuck around.


 2. Does this patch work in the CONFIG_HZ_PERIODIC case just as well as
the deleted code?

As I mentioned in my other email, I'm a little concerned about some of
the accounting that is removed. At a blackbox level, its not handling
all the same values, so I suspect there's some issues here, but they may
not be very significant or difficult to fix.

So while I'm very cautious to throwing out the complex code which has
worked relatively well for quite awhile, I am very interested in coming
up with a solution that is easier to understand and validate as correct.

thanks
-john
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-15 Thread Richard Cochran
On Thu, Nov 14, 2013 at 03:50:40PM +0100, Miroslav Lichvar wrote:

>  include/linux/timekeeper_internal.h |   4 +
>  kernel/time/timekeeping.c   | 209 
> +---
>  2 files changed, 31 insertions(+), 182 deletions(-)

This looks like an impressive simplification...

> -  * So the following can be confusing.

Yep.

So I really have no idea how the deleted code worked (or didn't work
for nohz), but I can confirm that nohz time keeping is broken under
light system load. Running a high load (like recompiling the kernel on
all cores for CONFIG_HZ_PERIODIC=y ;) hides the issue, but that is
obviously not the right solution.

Out of my ignorance, two questions spring to mind.

1. Considering the simplicity of Miroslav's patch, what was the
   benefit of the much more complicated code in the first place?

2. Does this patch work in the CONFIG_HZ_PERIODIC case just as well as
   the deleted code?

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-15 Thread Richard Cochran
On Thu, Nov 14, 2013 at 03:50:40PM +0100, Miroslav Lichvar wrote:

  include/linux/timekeeper_internal.h |   4 +
  kernel/time/timekeeping.c   | 209 
 +---
  2 files changed, 31 insertions(+), 182 deletions(-)

This looks like an impressive simplification...

 -  * So the following can be confusing.

Yep.

So I really have no idea how the deleted code worked (or didn't work
for nohz), but I can confirm that nohz time keeping is broken under
light system load. Running a high load (like recompiling the kernel on
all cores for CONFIG_HZ_PERIODIC=y ;) hides the issue, but that is
obviously not the right solution.

Out of my ignorance, two questions spring to mind.

1. Considering the simplicity of Miroslav's patch, what was the
   benefit of the much more complicated code in the first place?

2. Does this patch work in the CONFIG_HZ_PERIODIC case just as well as
   the deleted code?

Thanks,
Richard
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-14 Thread Rik van Riel

On 11/14/2013 09:50 AM, Miroslav Lichvar wrote:

Since commit 5eb6d205 the system clock is kept separately from NTP time
and it is synchronized by adjusting its multiplier in a feedback loop.
This works well when the updates are done regularly. With nohz and idle
system, however, the loop becomes unstable at a certain update interval.
The loop overcorrects and the frequency doesn't settle down. The clock
has a large error, which seems to grow quadratically with update
interval.



In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
update the maximum error went down from 480 microseconds to 55
nanoseconds.

In a real test on idle machine comparing raw TSC and clock_gettime()
time stamps, the maximum error went down from microseconds to tens of
nanoseconds. A test with clock synchronized to a PTP hardware clock by
phc2sys from linuxptp now shows no difference when running with nohz
enabled and disabled, the clock seems to be stable to few tens of
nanoseconds.


Looks like a big improvement to me.

Also very useful for virtual machines, which have no
good control over when the timekeeping routines will
run, but which can see what time it is when they do
run...


Signed-off-by: Miroslav Lichvar 


Acked-by: Rik van Riel 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-14 Thread Miroslav Lichvar
Since commit 5eb6d205 the system clock is kept separately from NTP time
and it is synchronized by adjusting its multiplier in a feedback loop.
This works well when the updates are done regularly. With nohz and idle
system, however, the loop becomes unstable at a certain update interval.
The loop overcorrects and the frequency doesn't settle down. The clock
has a large error, which seems to grow quadratically with update
interval.

If the constants used in the loop were modified for a maximum update
interval, it would be stable, but too slow to keep up with NTP. Without
knowing when will be the next update it's difficult to implement a loop
that is both fast and stable.

This patch fixes the problem by postponing update of NTP tick length in
the clock and setting the multiplier directly without feedback loop by
dividing the tick length with clock cycles. Previously, a change in tick
length was applied immediately to all ticks since last update, which
caused a jump in the NTP error proportional to the change and the update
interval and which had to be corrected later by the loop.

The only source of the accounted NTP error is now lack of resolution in
the clock multiplier. The error is corrected by adding 0 or 1 to the
calculated multiplier. With a constant tick length, the multiplier will
be switching between the two values. The loop is greatly simplified and
there is no problem with stability. The maximum error is limited by the
update interval.

In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
update the maximum error went down from 480 microseconds to 55
nanoseconds.

In a real test on idle machine comparing raw TSC and clock_gettime()
time stamps, the maximum error went down from microseconds to tens of
nanoseconds. A test with clock synchronized to a PTP hardware clock by
phc2sys from linuxptp now shows no difference when running with nohz
enabled and disabled, the clock seems to be stable to few tens of
nanoseconds.

TODO:
- add forced update_wall_time() and call it after adjtimex() to improve
  accuracy of phase adjustments done by quickly changing NTP frequency
- check if there are any issues with suspend

Signed-off-by: Miroslav Lichvar 
---
 include/linux/timekeeper_internal.h |   4 +
 kernel/time/timekeeping.c   | 209 +---
 2 files changed, 31 insertions(+), 182 deletions(-)

diff --git a/include/linux/timekeeper_internal.h 
b/include/linux/timekeeper_internal.h
index c1825eb..b91ad4b 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -34,12 +34,16 @@ struct timekeeper {
/* Clock shifted nano seconds */
u64 xtime_nsec;
 
+   /* Tick used for calculation of NTP error. */
+   u64 ntp_tick;
/* Difference between accumulated time and NTP time in ntp
 * shifted nano seconds. */
s64 ntp_error;
/* Shift conversion between clock shifted nano seconds and
 * ntp shifted nano seconds. */
u32 ntp_error_shift;
+   /* Correction applied to mult to reduce the error. */
+   u32 mult_ntp_correction;
 
/*
 * wall_to_monotonic is what we need to add to xtime (or xtime corrected
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..6ee57f7 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -146,6 +146,9 @@ static void tk_setup_internals(struct timekeeper *tk, 
struct clocksource *clock)
 * to counteract clock drifting.
 */
tk->mult = clock->mult;
+   /* zero frequency offset */
+   tk->ntp_tick = tk->xtime_interval << tk->ntp_error_shift;
+   tk->mult_ntp_correction = 0;
 }
 
 /* Timekeeper helper functions. */
@@ -1048,200 +1051,42 @@ static int __init timekeeping_init_ops(void)
 device_initcall(timekeeping_init_ops);
 
 /*
- * If the error is already larger, we look ahead even further
- * to compensate for late or lost adjustments.
+ * Adjust the multiplier according to current NTP tick.
  */
-static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
-s64 error, s64 *interval,
-s64 *offset)
+static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
-   s64 tick_error, i;
-   u32 look_ahead, adj;
-   s32 error2, mult;
+   u32 new_mult;
 
-   /*
-* Use the current error value to determine how much to look ahead.
-* The larger the error the slower we adjust for it to avoid problems
-* with losing too many ticks, otherwise we would overadjust and
-* produce an even larger error.  The smaller the adjustment the
-* faster we try to adjust for it, as lost ticks can do less harm
-* here.  

[PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-14 Thread Miroslav Lichvar
Since commit 5eb6d205 the system clock is kept separately from NTP time
and it is synchronized by adjusting its multiplier in a feedback loop.
This works well when the updates are done regularly. With nohz and idle
system, however, the loop becomes unstable at a certain update interval.
The loop overcorrects and the frequency doesn't settle down. The clock
has a large error, which seems to grow quadratically with update
interval.

If the constants used in the loop were modified for a maximum update
interval, it would be stable, but too slow to keep up with NTP. Without
knowing when will be the next update it's difficult to implement a loop
that is both fast and stable.

This patch fixes the problem by postponing update of NTP tick length in
the clock and setting the multiplier directly without feedback loop by
dividing the tick length with clock cycles. Previously, a change in tick
length was applied immediately to all ticks since last update, which
caused a jump in the NTP error proportional to the change and the update
interval and which had to be corrected later by the loop.

The only source of the accounted NTP error is now lack of resolution in
the clock multiplier. The error is corrected by adding 0 or 1 to the
calculated multiplier. With a constant tick length, the multiplier will
be switching between the two values. The loop is greatly simplified and
there is no problem with stability. The maximum error is limited by the
update interval.

In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
update the maximum error went down from 480 microseconds to 55
nanoseconds.

In a real test on idle machine comparing raw TSC and clock_gettime()
time stamps, the maximum error went down from microseconds to tens of
nanoseconds. A test with clock synchronized to a PTP hardware clock by
phc2sys from linuxptp now shows no difference when running with nohz
enabled and disabled, the clock seems to be stable to few tens of
nanoseconds.

TODO:
- add forced update_wall_time() and call it after adjtimex() to improve
  accuracy of phase adjustments done by quickly changing NTP frequency
- check if there are any issues with suspend

Signed-off-by: Miroslav Lichvar mlich...@redhat.com
---
 include/linux/timekeeper_internal.h |   4 +
 kernel/time/timekeeping.c   | 209 +---
 2 files changed, 31 insertions(+), 182 deletions(-)

diff --git a/include/linux/timekeeper_internal.h 
b/include/linux/timekeeper_internal.h
index c1825eb..b91ad4b 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -34,12 +34,16 @@ struct timekeeper {
/* Clock shifted nano seconds */
u64 xtime_nsec;
 
+   /* Tick used for calculation of NTP error. */
+   u64 ntp_tick;
/* Difference between accumulated time and NTP time in ntp
 * shifted nano seconds. */
s64 ntp_error;
/* Shift conversion between clock shifted nano seconds and
 * ntp shifted nano seconds. */
u32 ntp_error_shift;
+   /* Correction applied to mult to reduce the error. */
+   u32 mult_ntp_correction;
 
/*
 * wall_to_monotonic is what we need to add to xtime (or xtime corrected
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..6ee57f7 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -146,6 +146,9 @@ static void tk_setup_internals(struct timekeeper *tk, 
struct clocksource *clock)
 * to counteract clock drifting.
 */
tk-mult = clock-mult;
+   /* zero frequency offset */
+   tk-ntp_tick = tk-xtime_interval  tk-ntp_error_shift;
+   tk-mult_ntp_correction = 0;
 }
 
 /* Timekeeper helper functions. */
@@ -1048,200 +1051,42 @@ static int __init timekeeping_init_ops(void)
 device_initcall(timekeeping_init_ops);
 
 /*
- * If the error is already larger, we look ahead even further
- * to compensate for late or lost adjustments.
+ * Adjust the multiplier according to current NTP tick.
  */
-static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
-s64 error, s64 *interval,
-s64 *offset)
+static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
-   s64 tick_error, i;
-   u32 look_ahead, adj;
-   s32 error2, mult;
+   u32 new_mult;
 
-   /*
-* Use the current error value to determine how much to look ahead.
-* The larger the error the slower we adjust for it to avoid problems
-* with losing too many ticks, otherwise we would overadjust and
-* produce an even larger error.  The smaller the adjustment the
-* faster we try to adjust for it, as lost ticks can do less harm
-

Re: [PATCH RFC] timekeeping: Fix clock stability with nohz

2013-11-14 Thread Rik van Riel

On 11/14/2013 09:50 AM, Miroslav Lichvar wrote:

Since commit 5eb6d205 the system clock is kept separately from NTP time
and it is synchronized by adjusting its multiplier in a feedback loop.
This works well when the updates are done regularly. With nohz and idle
system, however, the loop becomes unstable at a certain update interval.
The loop overcorrects and the frequency doesn't settle down. The clock
has a large error, which seems to grow quadratically with update
interval.



In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
update the maximum error went down from 480 microseconds to 55
nanoseconds.

In a real test on idle machine comparing raw TSC and clock_gettime()
time stamps, the maximum error went down from microseconds to tens of
nanoseconds. A test with clock synchronized to a PTP hardware clock by
phc2sys from linuxptp now shows no difference when running with nohz
enabled and disabled, the clock seems to be stable to few tens of
nanoseconds.


Looks like a big improvement to me.

Also very useful for virtual machines, which have no
good control over when the timekeeping routines will
run, but which can see what time it is when they do
run...


Signed-off-by: Miroslav Lichvar mlich...@redhat.com


Acked-by: Rik van Riel r...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/