Re: [patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING
On Tue, Jul 12, 2016 at 06:33:56PM +0200, Thomas Gleixner wrote: > Subject: sched/migration: Correct off by one in load migration > From: Thomas Gleixner > > The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into > account that the function is now called from a thread running on the outgoing > CPU. As a result a cpu unplug leakes a load of 1 into the global load > accounting mechanism. > > Fix it by adjusting for the currently running thread which calls > calc_load_migrate(). > > Fixes: e9cd8fa4fcfd: "sched/migration: Move calc_load_migrate() into > CPU_DYING" > Reported-by: Anton Blanchard > Signed-off-by: Thomas Gleixner > +++ b/kernel/sched/loadavg.c > @@ -78,11 +78,11 @@ void get_avenrun(unsigned long *loads, unsigned long > offset, int shift) > loads[2] = (avenrun[2] + offset) << shift; > } > > -long calc_load_fold_active(struct rq *this_rq) > +long calc_load_fold_active(struct rq *this_rq, long adjust) > { > long nr_active, delta = 0; > > - nr_active = this_rq->nr_running; > + nr_active = this_rq->nr_running - adjust; > nr_active += (long)this_rq->nr_uninterruptible; > > if (nr_active != this_rq->calc_load_active) { Yeah, I think this is the only sensible approach. How do you want to route this?
Re: [patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING
Hi, On 07/12/2016 10:03 PM, Thomas Gleixner wrote: > Anton, > > On Tue, 12 Jul 2016, Anton Blanchard wrote: >>> It really does not matter when we fold the load for the outgoing cpu. >>> It's almost dead anyway, so there is no harm if we fail to fold the >>> few microseconds which are required for going fully away. >> >> We are seeing the load average shoot up when hot unplugging CPUs (+1 >> for every CPU we offline) on ppc64. This reproduces on bare metal as >> well as inside a KVM guest. A bisect points at this commit. >> >> As an example, a completely idle box with 128 CPUS and 112 hot >> unplugged: >> >> # uptime >> 04:35:30 up 1:23, 2 users, load average: 112.43, 122.94, 125.54 > > Yes, it's an off by one as we now call that from the task which is tearing > down the cpu. Does the patch below fix it? Tested your patch to see that on an idle box on offlinig the cpus I dont see increase in loadaverage. # uptime 01:27:44 up 10 min, 1 user, load average: 0.00, 0.18, 0.18 # lscpu | grep -Ei "on-line|off-line" On-line CPU(s) list: 0-127 # ppc64_cpu --cores-on=2 # lscpu | grep -Ei "on-line|off-line" On-line CPU(s) list: 0-15 Off-line CPU(s) list: 16-127 # sleep 60 # uptime 01:28:52 up 11 min, 1 user, load average: 0.11, 0.19, 0.18 Thanks and Regards, Shilpa > > Thanks, > > tglx > > 8<-- > > Subject: sched/migration: Correct off by one in load migration > From: Thomas Gleixner > > The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into > account that the function is now called from a thread running on the outgoing > CPU. As a result a cpu unplug leakes a load of 1 into the global load > accounting mechanism. > > Fix it by adjusting for the currently running thread which calls > calc_load_migrate(). > > Fixes: e9cd8fa4fcfd: "sched/migration: Move calc_load_migrate() into > CPU_DYING" > Reported-by: Anton Blanchard > Signed-off-by: Thomas Gleixner > --- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 51d7105f529a..97ee9ac7e97c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5394,13 +5394,15 @@ void idle_task_exit(void) > /* > * Since this CPU is going 'away' for a while, fold any nr_active delta > * we might have. Assumes we're called after migrate_tasks() so that the > - * nr_active count is stable. > + * nr_active count is stable. We need to take the teardown thread which > + * is calling this into account, so we hand in adjust = 1 to the load > + * calculation. > * > * Also see the comment "Global load-average calculations". > */ > static void calc_load_migrate(struct rq *rq) > { > - long delta = calc_load_fold_active(rq); > + long delta = calc_load_fold_active(rq, 1); > if (delta) > atomic_long_add(delta, &calc_load_tasks); > } > diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c > index b0b93fd33af9..a2d6eb71f06b 100644 > --- a/kernel/sched/loadavg.c > +++ b/kernel/sched/loadavg.c > @@ -78,11 +78,11 @@ void get_avenrun(unsigned long *loads, unsigned long > offset, int shift) > loads[2] = (avenrun[2] + offset) << shift; > } > > -long calc_load_fold_active(struct rq *this_rq) > +long calc_load_fold_active(struct rq *this_rq, long adjust) > { > long nr_active, delta = 0; > > - nr_active = this_rq->nr_running; > + nr_active = this_rq->nr_running - adjust; > nr_active += (long)this_rq->nr_uninterruptible; > > if (nr_active != this_rq->calc_load_active) { > @@ -188,7 +188,7 @@ void calc_load_enter_idle(void) >* We're going into NOHZ mode, if there's any pending delta, fold it >* into the pending idle delta. >*/ > - delta = calc_load_fold_active(this_rq); > + delta = calc_load_fold_active(this_rq, 0); > if (delta) { > int idx = calc_load_write_idx(); > > @@ -389,7 +389,7 @@ void calc_global_load_tick(struct rq *this_rq) > if (time_before(jiffies, this_rq->calc_load_update)) > return; > > - delta = calc_load_fold_active(this_rq); > + delta = calc_load_fold_active(this_rq, 0); > if (delta) > atomic_long_add(delta, &calc_load_tasks); > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 7cbeb92a1cb9..898c0d2f18fe 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -28,7 +28,7 @@ extern unsigned long calc_load_update; > extern atomic_long_t calc_load_tasks; > > extern void calc_global_load_tick(struct rq *this_rq); > -extern long calc_load_fold_active(struct rq *this_rq); > +extern long calc_load_fold_active(struct rq *this_rq, long adjust); > > #ifdef CONFIG_SMP > extern void cpu_load_update_active(struct rq *this_rq); > > >
Re: [patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING
* Thomas Gleixner [2016-07-12 18:33:56]: > Anton, > > On Tue, 12 Jul 2016, Anton Blanchard wrote: > > > It really does not matter when we fold the load for the outgoing cpu. > > > It's almost dead anyway, so there is no harm if we fail to fold the > > > few microseconds which are required for going fully away. > > > > We are seeing the load average shoot up when hot unplugging CPUs (+1 > > for every CPU we offline) on ppc64. This reproduces on bare metal as > > well as inside a KVM guest. A bisect points at this commit. > > > > As an example, a completely idle box with 128 CPUS and 112 hot > > unplugged: > > > > # uptime > > 04:35:30 up 1:23, 2 users, load average: 112.43, 122.94, 125.54 > > Yes, it's an off by one as we now call that from the task which is tearing > down the cpu. Does the patch below fix it? Hi Thomas, Yes this patch fixes the issue. I was able to recreate the problem and also verify with this patch on 4.7.0-rc7. > > Thanks, > > tglx > > 8<-- > > Subject: sched/migration: Correct off by one in load migration > From: Thomas Gleixner > > The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into > account that the function is now called from a thread running on the outgoing > CPU. As a result a cpu unplug leakes a load of 1 into the global load > accounting mechanism. > > Fix it by adjusting for the currently running thread which calls > calc_load_migrate(). > > Fixes: e9cd8fa4fcfd: "sched/migration: Move calc_load_migrate() into > CPU_DYING" > Reported-by: Anton Blanchard Tested-by: Vaidyanathan Srinivasan > Signed-off-by: Thomas Gleixner > > --- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 51d7105f529a..97ee9ac7e97c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5394,13 +5394,15 @@ void idle_task_exit(void) > /* > * Since this CPU is going 'away' for a while, fold any nr_active delta > * we might have. Assumes we're called after migrate_tasks() so that the > - * nr_active count is stable. > + * nr_active count is stable. We need to take the teardown thread which > + * is calling this into account, so we hand in adjust = 1 to the load > + * calculation. > * > * Also see the comment "Global load-average calculations". > */ > static void calc_load_migrate(struct rq *rq) > { > - long delta = calc_load_fold_active(rq); > + long delta = calc_load_fold_active(rq, 1); > if (delta) > atomic_long_add(delta, &calc_load_tasks); > } > diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c > index b0b93fd33af9..a2d6eb71f06b 100644 > --- a/kernel/sched/loadavg.c > +++ b/kernel/sched/loadavg.c > @@ -78,11 +78,11 @@ void get_avenrun(unsigned long *loads, unsigned long > offset, int shift) > loads[2] = (avenrun[2] + offset) << shift; > } > > -long calc_load_fold_active(struct rq *this_rq) > +long calc_load_fold_active(struct rq *this_rq, long adjust) > { > long nr_active, delta = 0; > > - nr_active = this_rq->nr_running; > + nr_active = this_rq->nr_running - adjust; > nr_active += (long)this_rq->nr_uninterruptible; if (nr_active != this_rq->calc_load_active) { delta = nr_active - this_rq->calc_load_active; this_rq->calc_load_active = nr_active; } return delta; Does the above calculation hold good even if we send adjust=1 and bump down nr_active? Tested ok though :) --Vaidy
Re: [patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING
Anton, On Tue, 12 Jul 2016, Anton Blanchard wrote: > > It really does not matter when we fold the load for the outgoing cpu. > > It's almost dead anyway, so there is no harm if we fail to fold the > > few microseconds which are required for going fully away. > > We are seeing the load average shoot up when hot unplugging CPUs (+1 > for every CPU we offline) on ppc64. This reproduces on bare metal as > well as inside a KVM guest. A bisect points at this commit. > > As an example, a completely idle box with 128 CPUS and 112 hot > unplugged: > > # uptime > 04:35:30 up 1:23, 2 users, load average: 112.43, 122.94, 125.54 Yes, it's an off by one as we now call that from the task which is tearing down the cpu. Does the patch below fix it? Thanks, tglx 8<-- Subject: sched/migration: Correct off by one in load migration From: Thomas Gleixner The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into account that the function is now called from a thread running on the outgoing CPU. As a result a cpu unplug leakes a load of 1 into the global load accounting mechanism. Fix it by adjusting for the currently running thread which calls calc_load_migrate(). Fixes: e9cd8fa4fcfd: "sched/migration: Move calc_load_migrate() into CPU_DYING" Reported-by: Anton Blanchard Signed-off-by: Thomas Gleixner --- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 51d7105f529a..97ee9ac7e97c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5394,13 +5394,15 @@ void idle_task_exit(void) /* * Since this CPU is going 'away' for a while, fold any nr_active delta * we might have. Assumes we're called after migrate_tasks() so that the - * nr_active count is stable. + * nr_active count is stable. We need to take the teardown thread which + * is calling this into account, so we hand in adjust = 1 to the load + * calculation. * * Also see the comment "Global load-average calculations". */ static void calc_load_migrate(struct rq *rq) { - long delta = calc_load_fold_active(rq); + long delta = calc_load_fold_active(rq, 1); if (delta) atomic_long_add(delta, &calc_load_tasks); } diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index b0b93fd33af9..a2d6eb71f06b 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -78,11 +78,11 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift) loads[2] = (avenrun[2] + offset) << shift; } -long calc_load_fold_active(struct rq *this_rq) +long calc_load_fold_active(struct rq *this_rq, long adjust) { long nr_active, delta = 0; - nr_active = this_rq->nr_running; + nr_active = this_rq->nr_running - adjust; nr_active += (long)this_rq->nr_uninterruptible; if (nr_active != this_rq->calc_load_active) { @@ -188,7 +188,7 @@ void calc_load_enter_idle(void) * We're going into NOHZ mode, if there's any pending delta, fold it * into the pending idle delta. */ - delta = calc_load_fold_active(this_rq); + delta = calc_load_fold_active(this_rq, 0); if (delta) { int idx = calc_load_write_idx(); @@ -389,7 +389,7 @@ void calc_global_load_tick(struct rq *this_rq) if (time_before(jiffies, this_rq->calc_load_update)) return; - delta = calc_load_fold_active(this_rq); + delta = calc_load_fold_active(this_rq, 0); if (delta) atomic_long_add(delta, &calc_load_tasks); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7cbeb92a1cb9..898c0d2f18fe 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -28,7 +28,7 @@ extern unsigned long calc_load_update; extern atomic_long_t calc_load_tasks; extern void calc_global_load_tick(struct rq *this_rq); -extern long calc_load_fold_active(struct rq *this_rq); +extern long calc_load_fold_active(struct rq *this_rq, long adjust); #ifdef CONFIG_SMP extern void cpu_load_update_active(struct rq *this_rq);
Re: [patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING
Hi Thomas, > It really does not matter when we fold the load for the outgoing cpu. > It's almost dead anyway, so there is no harm if we fail to fold the > few microseconds which are required for going fully away. We are seeing the load average shoot up when hot unplugging CPUs (+1 for every CPU we offline) on ppc64. This reproduces on bare metal as well as inside a KVM guest. A bisect points at this commit. As an example, a completely idle box with 128 CPUS and 112 hot unplugged: # uptime 04:35:30 up 1:23, 2 users, load average: 112.43, 122.94, 125.54 Anton
[patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING
It really does not matter when we fold the load for the outgoing cpu. It's almost dead anyway, so there is no harm if we fail to fold the few microseconds which are required for going fully away. Signed-off-by: Thomas Gleixner --- kernel/sched/core.c |3 --- 1 file changed, 3 deletions(-) Index: b/kernel/sched/core.c === --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5685,9 +5685,6 @@ migration_call(struct notifier_block *nf migrate_tasks(rq); BUG_ON(rq->nr_running != 1); /* the migration thread */ raw_spin_unlock_irqrestore(&rq->lock, flags); - break; - - case CPU_DEAD: calc_load_migrate(rq); break; #endif