Just earlier this morning, an issue was raised against this patch on the 
related thread on Ubuntu Launchpad.
The complaint is that reported load averages are now too high under conditions 
of "high"-frequency entry into and exit from CPU idle, where the CPU is 
very lightly loaded.
Most of the testing I did was with medium to heavy load on the CPU and 
relatively short idle periods; this scenario is the opposite.
For a quick test, I hacked up my test program and was able to reproduce the 
issue.
I am still attempting to understand this better, and also to determine the 
lower bound of "high" frequency (I think it is 25 Hertz, scaling down from 
there, but I have no proof yet). I'll also go back and test this scenario 
without the patch. I'll let this list know the results, but it might be a 
few days.
My quick test results are attached.

Doug Smythies

References (even though I have been told not to include links in e-mails to 
this list):

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/838811 <<< starting from 
post #52

version:
doug@s15:~/c$ uname -a
Linux s15 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 
x86_64 x86_64 GNU/Linux
doug@s15:~/c$ cat /proc/version
Linux version 3.2.0-24-generic (buildd@yellow) (gcc version 4.6.3 
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012

-----Original Message-----
From: Greg KH [mailto:[email protected]] 
Sent: May-04-2012 13:43
To: [email protected]; [email protected]
Cc: [email protected]; [email protected]; 
[email protected]; Doug Smythies; Lesław Kopeć; Aman Gupta; Peter 
Zijlstra; Ingo Molnar; Kerin Millar
Subject: [ 56/75] sched: Fix nohz load accounting -- again!

3.3-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Peter Zijlstra <[email protected]>

commit c308b56b5398779cd3da0f62ab26b0453494c3d4 upstream.

Various people reported nohz load tracking still being wrecked, but Doug 
spotted the actual problem. We fold the nohz remainder in too soon, causing us 
to lose samples and under-account.

So instead of playing catch-up up-front, always do a single load-fold with 
whatever state we encounter and only then fold the nohz remainder and play 
catch-up.

Reported-by: Doug Smythies <[email protected]>
Reported-by: Lesław Kopeć <[email protected]>
Reported-by: Aman Gupta <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Kerin Millar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
 kernel/sched/core.c |   53 +++++++++++++++++++++++++---------------------------
 1 file changed, 26 insertions(+), 27 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2266,13 +2266,10 @@ calc_load_n(unsigned long load, unsigned
  * Once we've updated the global active value, we need to apply the exponential
  * weights adjusted to the number of cycles missed.
  */
-static void calc_global_nohz(unsigned long ticks)
+static void calc_global_nohz(void)
 {
        long delta, active, n;
 
-       if (time_before(jiffies, calc_load_update))
-               return;
-
        /*
         * If we crossed a calc_load_update boundary, make sure to fold
         * any pending idle changes, the respective CPUs might have
@@ -2284,31 +2281,25 @@ static void calc_global_nohz(unsigned lo
                atomic_long_add(delta, &calc_load_tasks);
 
        /*
-        * If we were idle for multiple load cycles, apply them.
+        * It could be the one fold was all it took, we done!
         */
-       if (ticks >= LOAD_FREQ) {
-               n = ticks / LOAD_FREQ;
+       if (time_before(jiffies, calc_load_update + 10))
+               return;
 
-               active = atomic_long_read(&calc_load_tasks);
-               active = active > 0 ? active * FIXED_1 : 0;
+       /*
+        * Catch-up, fold however many we are behind still
+        */
+       delta = jiffies - calc_load_update - 10;
+       n = 1 + (delta / LOAD_FREQ);
 
-               avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
-               avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
-               avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
+       active = atomic_long_read(&calc_load_tasks);
+       active = active > 0 ? active * FIXED_1 : 0;
 
-               calc_load_update += n * LOAD_FREQ;
-       }
+       avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+       avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+       avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
 
-       /*
-        * Its possible the remainder of the above division also crosses
-        * a LOAD_FREQ period, the regular check in calc_global_load()
-        * which comes after this will take care of that.
-        *
-        * Consider us being 11 ticks before a cycle completion, and us
-        * sleeping for 4*LOAD_FREQ + 22 ticks, then the above code will
-        * age us 4 cycles, and the test in calc_global_load() will
-        * pick up the final one.
-        */
+       calc_load_update += n * LOAD_FREQ;
 }
 #else
 void calc_load_account_idle(struct rq *this_rq)
@@ -2320,7 +2311,7 @@ static inline long calc_load_fold_idle(v
        return 0;
 }
 
-static void calc_global_nohz(unsigned long ticks)
+static void calc_global_nohz(void)
 {
 }
 #endif
@@ -2348,8 +2339,6 @@ void calc_global_load(unsigned long tick
 {
        long active;
 
-       calc_global_nohz(ticks);
-
        if (time_before(jiffies, calc_load_update + 10))
                return;
 
@@ -2361,6 +2350,16 @@ void calc_global_load(unsigned long tick
        avenrun[2] = calc_load(avenrun[2], EXP_15, active);
 
        calc_load_update += LOAD_FREQ;
+
+       /*
+        * Account one period with whatever state we found before
+        * folding in the nohz state and ageing the entire idle period.
+        *
+        * This avoids loosing a sample when we go idle between
+        * calc_load_account_active() (10 ticks ago) and now and thus
+        * under-accounting.
+        */
+       calc_global_nohz();
 }
 
 /*


Doug Smythies 2012.05.04: CPUs enter and exit idle at high frequency, but 
with very, very low actual utilization.
The majority of previous testing was at very high utilization.

Excerpt from "Top"

top - 11:57:04 up 1 day, 15:28,  3 users,  load average: 0.91, 0.87, 0.76
Tasks: 144 total,   2 running, 142 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.9%us,  8.8%sy,  0.0%ni, 90.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.0%us,  4.8%sy,  0.0%ni, 94.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.3%us,  2.6%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   7956400k total,  1591044k used,  6365356k free,   292288k buffers
Swap:  8294396k total,        0k used,  8294396k free,   882488k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15970 doug      20   0  4156   88    0 S    2  0.0   0:23.23 waiter
15972 doug      20   0  4156   88    0 S    2  0.0   0:23.24 waiter
15968 doug      20   0  4156   88    0 S    2  0.0   0:23.30 waiter
15971 doug      20   0  4156   88    0 S    2  0.0   0:23.36 waiter
15969 doug      20   0  4156   88    0 S    1  0.0   0:23.51 waiter
15986 doug      20   0 17332 1332  960 R    1  0.0   0:01.24 top
    1 root      20   0 24724 2616 1344 S    0  0.0   0:01.42 init

Similar excerpt from the test program output. Note: TOP resource time = user 
CPU time + sys CPU time.

Child  0 lp  706 of 9999: Elapsed:1411.97 s. Delta:   2.00 s. user cpu:   5.03 
s. sys cpu:  18.46 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.
Child  1 lp  706 of 9999: Elapsed:1411.98 s. Delta:   2.00 s. user cpu:   5.38 
s. sys cpu:  18.31 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.004 Hz.
Child  2 lp  706 of 9999: Elapsed:1411.97 s. Delta:   2.00 s. user cpu:   5.02 
s. sys cpu:  18.39 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.
Child  3 lp  706 of 9999: Elapsed:1411.97 s. Delta:   2.00 s. user cpu:   5.23 
s. sys cpu:  18.35 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.
Child  4 lp  706 of 9999: Elapsed:1411.97 s. Delta:   2.00 s. user cpu:   5.25 
s. sys cpu:  18.22 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.
Child  0 lp  707 of 9999: Elapsed:1413.97 s. Delta:   2.00 s. user cpu:   5.04 
s. sys cpu:  18.47 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.
Child  1 lp  707 of 9999: Elapsed:1413.98 s. Delta:   2.00 s. user cpu:   5.39 
s. sys cpu:  18.34 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.004 Hz.
Child  2 lp  707 of 9999: Elapsed:1413.97 s. Delta:   2.00 s. user cpu:   5.02 
s. sys cpu:  18.41 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.
Child  3 lp  707 of 9999: Elapsed:1413.97 s. Delta:   2.00 s. user cpu:   5.24 
s. sys cpu:  18.38 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.
Child  4 lp  707 of 9999: Elapsed:1413.97 s. Delta:   2.00 s. user cpu:   5.26 
s. sys cpu:  18.24 s. last sleep freq.: 250.000 Hz. average sleep freq.: 
250.005 Hz.

Resource math: ~23.5 seconds / 1414 seconds * 5 occurrences ~~= 0.08 real load 
average. 

Similar output from vmstat (though it was captured at a different time than 
the above). While I have a relatively high context-switch rate and interrupt 
rate in this example, the issue exists over a wide range.

doug@s15:~/c$ vmstat -n 10 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 6366016 292320 882488    0    0     1     1    7   12  0  0 100  0
 0  0      0 6366008 292320 882488    0    0     0     0 1319 2693  1  2 98  0
 0  0      0 6366008 292320 882488    0    0     0     0 1270 2541  0  0 100  0
 0  0      0 6366008 292324 882488    0    0     0     1 1268 2542  0  0 100  0
 0  0      0 6366008 292324 882488    0    0     0     0 1313 2678  1  1 98  0
 0  0      0 6366008 292324 882488    0    0     0     0 1306 2677  0  1 98  0
 0  0      0 6366008 292324 882488    0    0     0     0 1339 2773  0  3 97  0
 0  0      0 6366008 292324 882488    0    0     0     0 1347 2828  0  3 96  0
 0  0      0 6366008 292324 882488    0    0     0     0 1324 2721  0  2 98  0
 0  0      0 6366008 292328 882488    0    0     0     1 1341 2811  0  3 97  0
