On 01/24/2013 10:08 AM, John Stultz wrote:
On 01/24/2013 07:00 AM, Romain Francoise wrote:
Willy Tarreau <[email protected]> writes:

What is your workload?
There are two separate workloads that are affected by this issue:
- one is a pretty typical whole-distribution build, which gets 1.5x to 2x
   slower with v2.6.32.60
- the other is a code simulator, which is basically a bunch of processes
   communicating together using POSIX message queues (mq_timedsend() and
   mq_timedreceive()). It runs approximately 10x slower, but doesn't seem
   to use more CPU, it just sits there doing pretty much nothing.

I managed to narrow the second workload down to a single test case, which runs in approximately 0.7s in v2.6.32.59 and 8s (!) in v2.6.32.60. Having a simple test case allowed me to do a full bisection in a test VM, and it
ends on commit 61b76840ddee647c0c223365378c3f394355b7d7 ("time: Avoid
making adjustments if we haven't accumulated anything").

And indeed, if I revert this commit from v2.6.32.60 then everything is
back to normal.

Huh. That's curious.
Especially since "time: Avoid making adjustments if we haven't accumulated anything" should really be shortcutting unnecessary work.

Do you see any similar regression behavior between 3.5 and 3.6 (when the same fix landed upstream)?

If not, I suspect there's some extra detail w/ xtime_cache and the non-logarithmic accumulation that we did prior to 2.6.33, that I missed.

Yep, that's the issue. As soon as I hit send it made sense.

Basically, with kernels from 2.6.20-something through 2.6.32, we accumulate time in half-second chunks, rather than every timer-tick. This was added because when NOHZ landed, if you were idle for a few seconds, you had to spin for every tick we skipped in the accumulation loop, which created some bad latencies.

However, this required that we create xtime_cache, which is still updated each tick, so that filesystem timestamps, etc. continue to see time increment normally.

Of course, the xtime_cache is updated at the bottom of update_wall_time(). So the early return on (offset < timekeeper.cycle_interval) causes the xtime_cache to never be updated. Then in hrtimer_get_softirq_time(), we pull current_kernel_time(), which uses the stale xtime_cache, and that likely causes timers to fire with half-second granularity.

So I think reverting 61b76840ddee647c0c223365378c3f394355b7d7 is the right call, since it isn't really appropriate for 2.6.32.

Although the issue it addresses (inconsistencies in the _COARSE clockids) would return, those had been present for quite some time without complaint. So another possible solution would be the following (fair warning, totally untested). Mind giving it a try to see if it resolves the issue?

If it looks good to you I'll put it through some of my own tests as well.

thanks
-john


diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3d35af3..f65a0fb 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -809,7 +809,7 @@ void update_wall_time(void)
 #endif
        /* Check if there's really nothing to do */
        if (offset < timekeeper.cycle_interval)
-               return;
+               goto out;
 
        timekeeper.xtime_nsec = (s64)xtime.tv_nsec << timekeeper.shift;
@@ -881,6 +881,7 @@ void update_wall_time(void)
        timekeeper.ntp_error += timekeeper.xtime_nsec <<
                                timekeeper.ntp_error_shift;
 
+out:
        nsecs = clocksource_cyc2ns(offset, timekeeper.mult, timekeeper.shift);
        update_xtime_cache(nsecs);

--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
