Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
Hi, On 17/08/18 11:27, Matt Fleming wrote: > On Thu, 05 Jul, at 05:54:02PM, Valentin Schneider wrote: >> On 05/07/18 14:27, Matt Fleming wrote: >>> On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote: Hi, On 04/07/18 15:24, Matt Fleming wrote: > It's possible that the CPU doing nohz idle balance hasn't had its own > load updated for many seconds. This can lead to huge deltas between > rq->avg_stamp and rq->clock when rebalancing, and has been seen to > cause the following crash: > > divide error: [#1] SMP > Call Trace: > [] update_sd_lb_stats+0xe8/0x560 >> >> My confusion comes from not seeing where that crash happens. Would you mind >> sharing the associated line number? I can feel the "how did I not see this" >> from there but it can't be helped :( > > The divide by zero comes from scale_rt_capacity() where 'total' is a > u64 but gets truncated when passed to div_u64() since the divisor > parameter is u32. > Ah, nasty one. Interestingly enough that bit has been changed quite recently, so I don't think you can get a div by 0 in there anymore - see 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") and subsequent cleanups. > Sure, you could use div64_u64() instead, but the real issue is that > the load hasn't been updated for a very long time and that we're > trying to balance the domains with stale data. > Yeah I agree with that. However, the problem is with cpu_load - blocked load on nohz CPUs will be periodically updated until entirely decayed. And if we end up getting rid of cpu_load (depends on how [1] goes), then there's nothing left to do. But we're not there yet... [1]: https://lore.kernel.org/lkml/20180809135753.21077-1-dietmar.eggem...@arm.com/
Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
On Thu, 05 Jul, at 05:54:02PM, Valentin Schneider wrote: > On 05/07/18 14:27, Matt Fleming wrote: > > On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote: > >> Hi, > >> > >> On 04/07/18 15:24, Matt Fleming wrote: > >>> It's possible that the CPU doing nohz idle balance hasn't had its own > >>> load updated for many seconds. This can lead to huge deltas between > >>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to > >>> cause the following crash: > >>> > >>> divide error: [#1] SMP > >>> Call Trace: > >>> [] update_sd_lb_stats+0xe8/0x560 > > My confusion comes from not seeing where that crash happens. Would you mind > sharing the associated line number? I can feel the "how did I not see this" > from there but it can't be helped :( The divide by zero comes from scale_rt_capacity() where 'total' is a u64 but gets truncated when passed to div_u64() since the divisor parameter is u32. Sure, you could use div64_u64() instead, but the real issue is that the load hasn't been updated for a very long time and that we're trying to balance the domains with stale data.
Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
On 05/07/18 14:27, Matt Fleming wrote: > On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote: >> Hi, >> >> On 04/07/18 15:24, Matt Fleming wrote: >>> It's possible that the CPU doing nohz idle balance hasn't had its own >>> load updated for many seconds. This can lead to huge deltas between >>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to >>> cause the following crash: >>> >>> divide error: [#1] SMP >>> Call Trace: >>> [] update_sd_lb_stats+0xe8/0x560 My confusion comes from not seeing where that crash happens. Would you mind sharing the associated line number? I can feel the "how did I not see this" from there but it can't be helped :( >>> [] find_busiest_group+0x2d/0x4b0 >>> [] load_balance+0x170/0x950 >>> [] rebalance_domains+0x13f/0x290 >>> [] __do_softirq+0xec/0x300 >>> [] irq_exit+0xfa/0x110 >>> [] reschedule_interrupt+0xc9/0xd0 >>> >> >> Do you have some sort of reproducer for that crash? If not I guess I can cook >> something up with a quiet userspace & rt-app, though I've never seen that one >> on arm64. > > Unfortunately no, I don't have a reproduction case. Would love to have > one to verify the patch though. > >>> Make sure we update the rq clock and load before balancing. >>> >>> Cc: Ingo Molnar >>> Cc: Mike Galbraith >>> Cc: Peter Zijlstra >>> Signed-off-by: Matt Fleming >>> --- >>> kernel/sched/fair.c | 10 ++ >>> 1 file changed, 10 insertions(+) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 2f0a0be4d344..2c81662c858a 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, >>> unsigned int flags, >>> */ >>> smp_mb(); >>> >>> + /* >>> +* Ensure this_rq's clock and load are up-to-date before we >>> +* rebalance since it's possible that they haven't been >>> +* updated for multiple schedule periods, i.e. many seconds. >>> +*/ >>> + raw_spin_lock_irq(&this_rq->lock); >>> + update_rq_clock(this_rq); >>> + cpu_load_update_idle(this_rq); >>> + raw_spin_unlock_irq(&this_rq->lock); >>> + >> >> I'm failing to understand why the updates further down below are seemingly >> not enough. After we've potentially done >> >> update_rq_clock(rq); >> cpu_load_update_idle(rq); >> >> for all nohz cpus != this_cpu, we still end up doing: >> >> if (idle != CPU_NEWLY_IDLE) { >> update_blocked_averages(this_cpu); >> has_blocked_load |= this_rq->has_blocked_load; >> } >> >> which should properly update this_rq's clock and load before we attempt to do >> any balancing on it. > > But cpu_load_update_idle() and update_blocked_averages() are not the same > thing. > Right, we don't do any rq->cpu_load[] update for this_rq in the current nohz code (i.e. by using update_blocked_averages()), which I think we do want to do. I'm just miserably failing to find how not doing this leads to a div by 0.
Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote: > Hi, > > On 04/07/18 15:24, Matt Fleming wrote: > > It's possible that the CPU doing nohz idle balance hasn't had its own > > load updated for many seconds. This can lead to huge deltas between > > rq->avg_stamp and rq->clock when rebalancing, and has been seen to > > cause the following crash: > > > > divide error: [#1] SMP > > Call Trace: > > [] update_sd_lb_stats+0xe8/0x560 > > [] find_busiest_group+0x2d/0x4b0 > > [] load_balance+0x170/0x950 > > [] rebalance_domains+0x13f/0x290 > > [] __do_softirq+0xec/0x300 > > [] irq_exit+0xfa/0x110 > > [] reschedule_interrupt+0xc9/0xd0 > > > > Do you have some sort of reproducer for that crash? If not I guess I can cook > something up with a quiet userspace & rt-app, though I've never seen that one > on arm64. Unfortunately no, I don't have a reproduction case. Would love to have one to verify the patch though. > > Make sure we update the rq clock and load before balancing. > > > > Cc: Ingo Molnar > > Cc: Mike Galbraith > > Cc: Peter Zijlstra > > Signed-off-by: Matt Fleming > > --- > > kernel/sched/fair.c | 10 ++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 2f0a0be4d344..2c81662c858a 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, > > unsigned int flags, > > */ > > smp_mb(); > > > > + /* > > +* Ensure this_rq's clock and load are up-to-date before we > > +* rebalance since it's possible that they haven't been > > +* updated for multiple schedule periods, i.e. many seconds. > > +*/ > > + raw_spin_lock_irq(&this_rq->lock); > > + update_rq_clock(this_rq); > > + cpu_load_update_idle(this_rq); > > + raw_spin_unlock_irq(&this_rq->lock); > > + > > I'm failing to understand why the updates further down below are seemingly > not enough. After we've potentially done > > update_rq_clock(rq); > cpu_load_update_idle(rq); > > for all nohz cpus != this_cpu, we still end up doing: > > if (idle != CPU_NEWLY_IDLE) { > update_blocked_averages(this_cpu); > has_blocked_load |= this_rq->has_blocked_load; > } > > which should properly update this_rq's clock and load before we attempt to do > any balancing on it. But cpu_load_update_idle() and update_blocked_averages() are not the same thing.
Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
Hi, On 04/07/18 15:24, Matt Fleming wrote: > It's possible that the CPU doing nohz idle balance hasn't had its own > load updated for many seconds. This can lead to huge deltas between > rq->avg_stamp and rq->clock when rebalancing, and has been seen to > cause the following crash: > > divide error: [#1] SMP > Call Trace: > [] update_sd_lb_stats+0xe8/0x560 > [] find_busiest_group+0x2d/0x4b0 > [] load_balance+0x170/0x950 > [] rebalance_domains+0x13f/0x290 > [] __do_softirq+0xec/0x300 > [] irq_exit+0xfa/0x110 > [] reschedule_interrupt+0xc9/0xd0 > Do you have some sort of reproducer for that crash? If not I guess I can cook something up with a quiet userspace & rt-app, though I've never seen that one on arm64. > Make sure we update the rq clock and load before balancing. > > Cc: Ingo Molnar > Cc: Mike Galbraith > Cc: Peter Zijlstra > Signed-off-by: Matt Fleming > --- > kernel/sched/fair.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 2f0a0be4d344..2c81662c858a 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, > unsigned int flags, >*/ > smp_mb(); > > + /* > + * Ensure this_rq's clock and load are up-to-date before we > + * rebalance since it's possible that they haven't been > + * updated for multiple schedule periods, i.e. many seconds. > + */ > + raw_spin_lock_irq(&this_rq->lock); > + update_rq_clock(this_rq); > + cpu_load_update_idle(this_rq); > + raw_spin_unlock_irq(&this_rq->lock); > + I'm failing to understand why the updates further down below are seemingly not enough. After we've potentially done update_rq_clock(rq); cpu_load_update_idle(rq); for all nohz cpus != this_cpu, we still end up doing: if (idle != CPU_NEWLY_IDLE) { update_blocked_averages(this_cpu); has_blocked_load |= this_rq->has_blocked_load; } which should properly update this_rq's clock and load before we attempt to do any balancing on it. > for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { > if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) > continue; >
[PATCH] sched/fair: Avoid divide by zero when rebalancing domains
It's possible that the CPU doing nohz idle balance hasn't had its own load updated for many seconds. This can lead to huge deltas between rq->avg_stamp and rq->clock when rebalancing, and has been seen to cause the following crash: divide error: [#1] SMP Call Trace: [] update_sd_lb_stats+0xe8/0x560 [] find_busiest_group+0x2d/0x4b0 [] load_balance+0x170/0x950 [] rebalance_domains+0x13f/0x290 [] __do_softirq+0xec/0x300 [] irq_exit+0xfa/0x110 [] reschedule_interrupt+0xc9/0xd0 Make sure we update the rq clock and load before balancing. Cc: Ingo Molnar Cc: Mike Galbraith Cc: Peter Zijlstra Signed-off-by: Matt Fleming --- kernel/sched/fair.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2f0a0be4d344..2c81662c858a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, */ smp_mb(); + /* +* Ensure this_rq's clock and load are up-to-date before we +* rebalance since it's possible that they haven't been +* updated for multiple schedule periods, i.e. many seconds. +*/ + raw_spin_lock_irq(&this_rq->lock); + update_rq_clock(this_rq); + cpu_load_update_idle(this_rq); + raw_spin_unlock_irq(&this_rq->lock); + for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) continue; -- 2.13.6