Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 19/8/27 1:38 AM, bseg...@google.com wrote:
> Valentin Schneider writes:
>> On 23/08/2019 21:00, bseg...@google.com wrote:
>> [...]
>>> Could you mention in the message that a throttled cfs_rq can have
>>> account_cfs_rq_runtime() called on it because it is throttled before
>>> idle_balance(), and idle_balance() calls update_rq_clock() to add time
>>> that is accounted to the task.
>>
>> Mayhaps even a comment for the extra condition.
>>
>>> I think this solution is less risky than unthrottling in this area, so
>>> other than that:
>>>
>>> Reviewed-by: Ben Segall
>>
>> If you don't mind squashing this in:
>>
>> -8<-
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b1d9cec9b1ed..b47b0bcf56bc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>>  		if (!cfs_rq_throttled(cfs_rq))
>>  			goto next;
>>
>> +		/* By the above check, this should never be true */
>> +		WARN_ON(cfs_rq->runtime_remaining > 0);
>> +
>> +		/* Pick the minimum amount to return to a positive quota state */
>>  		runtime = -cfs_rq->runtime_remaining + 1;
>>  		if (runtime > remaining)
>>  			runtime = remaining;
>> ->8-
>>
>> I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.
>>
>> @Ben, do you reckon we want to strap
>>
>> Cc:
>> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
>>
>> to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you
>> described should still be possible on that commit.
>
> I'm not sure about stable policy in general, but it seems reasonable.
> The WARN_ON might want to be WARN_ON_ONCE, and it seems fine to have it
> or not.

Thanks Ben and Valentin for all of the comments. Per Xunlei's suggestion,
I used SCHED_WARN_ON instead in v3. Regarding whether to Cc stable, I'm
also not sure.

>> Other than that,
>>
>> Reviewed-by: Valentin Schneider
>>
>> [...]
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
Valentin Schneider writes:

> On 23/08/2019 21:00, bseg...@google.com wrote:
> [...]
>> Could you mention in the message that a throttled cfs_rq can have
>> account_cfs_rq_runtime() called on it because it is throttled before
>> idle_balance(), and idle_balance() calls update_rq_clock() to add time
>> that is accounted to the task.
>
> Mayhaps even a comment for the extra condition.
>
>> I think this solution is less risky than unthrottling in this area, so
>> other than that:
>>
>> Reviewed-by: Ben Segall
>
> If you don't mind squashing this in:
>
> -8<-
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b1d9cec9b1ed..b47b0bcf56bc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  		if (!cfs_rq_throttled(cfs_rq))
>  			goto next;
>
> +		/* By the above check, this should never be true */
> +		WARN_ON(cfs_rq->runtime_remaining > 0);
> +
> +		/* Pick the minimum amount to return to a positive quota state */
>  		runtime = -cfs_rq->runtime_remaining + 1;
>  		if (runtime > remaining)
>  			runtime = remaining;
> ->8-
>
> I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.
>
> @Ben, do you reckon we want to strap
>
> Cc:
> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
>
> to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you
> described should still be possible on that commit.

I'm not sure about stable policy in general, but it seems reasonable.
The WARN_ON might want to be WARN_ON_ONCE, and it seems fine to have it
or not.

> Other than that,
>
> Reviewed-by: Valentin Schneider
>
> [...]
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 23/08/2019 21:00, bseg...@google.com wrote:
[...]
> Could you mention in the message that a throttled cfs_rq can have
> account_cfs_rq_runtime() called on it because it is throttled before
> idle_balance(), and idle_balance() calls update_rq_clock() to add time
> that is accounted to the task.

Mayhaps even a comment for the extra condition.

> I think this solution is less risky than unthrottling in this area, so
> other than that:
>
> Reviewed-by: Ben Segall

If you don't mind squashing this in:

-8<-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1d9cec9b1ed..b47b0bcf56bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
+		/* Pick the minimum amount to return to a positive quota state */
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
->8-

I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.

@Ben, do you reckon we want to strap

Cc:
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")

to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you
described should still be possible on that commit.

Other than that,

Reviewed-by: Valentin Schneider

[...]
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
Liangyan writes:

> do_sched_cfs_period_timer() will refill cfs_b runtime and call
> distribute_cfs_runtime() to unthrottle cfs_rqs, but sometimes
> cfs_b->runtime will incorrectly allocate all of its quota to one cfs_rq.
> This causes the other cfs_rqs to get no runtime and to stay throttled.
> We found that a throttled cfs_rq can have a non-negative
> cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to
> u64 in this snippet of distribute_cfs_runtime():
>
> 	runtime = -cfs_rq->runtime_remaining + 1;
>
> The cast makes runtime a huge number, and cfs_b->runtime ends up being
> decremented all the way to zero.
>
> This commit prevents a throttled cfs_rq from being assigned new runtime,
> avoiding the incorrect type cast above.
>
> Signed-off-by: Liangyan

Could you mention in the message that a throttled cfs_rq can have
account_cfs_rq_runtime() called on it because it is throttled before
idle_balance(), and idle_balance() calls update_rq_clock() to add time
that is accounted to the task.

I think this solution is less risky than unthrottling in this area, so
other than that:

Reviewed-by: Ben Segall

> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fd8a2a605b..b14d67d28138 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
>
> +	if (cfs_rq->throttled)
> +		return;
>  	/*
>  	 * if we're unable to extend our runtime we resched so that the active
>  	 * hierarchy can be throttled
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 16/08/2019 18:19, Valentin Schneider wrote:
[...]
> Yeah, it's probably pretty stupid. IIRC a throttled cfs_rq means a frozen
> rq_clock, so any subsequent call to update_curr() on a throttled cfs_rq
> should lead to an early bailout anyway due to delta_exec <= 0.

Did some more tracing; it seems the issue is that we can make
->runtime_remaining positive in assign_cfs_rq_runtime() without marking
the cfs_rq as unthrottled. So AFAICT we'd need something like this:

-8<-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..ffbb4dfc4b81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->throttled;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 	cfs_rq->runtime_remaining += amount;
 
+	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
+		unthrottle_cfs_rq(cfs_rq);
+
 	return cfs_rq->runtime_remaining > 0;
 }
 
@@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
-static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
-{
-	return cfs_bandwidth_used() && cfs_rq->throttled;
-}
-
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
->8-

Does that make sense? If so, we *may* want to add some ->runtime_remaining
wrappers (e.g. {add/remove}_runtime()) and have the check in there to make
sure it's not forgotten.
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 16/08/2019 16:39, Liangyan wrote:
> Thanks for the feedback.
> I added some debug prints and got the log below. It seems that
> pick_next_task_fair() throttles the cfs_rq first, then calls
> put_prev_entity() to assign runtime to this cfs_rq.
>
> [...]
>
> Regarding the suggested change, I'm not sure whether it is OK to skip
> the runtime accounting for the curr task.

Yeah, it's probably pretty stupid. IIRC a throttled cfs_rq means a frozen
rq_clock, so any subsequent call to update_curr() on a throttled cfs_rq
should lead to an early bailout anyway due to delta_exec <= 0.
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 16/08/2019 15:02, Valentin Schneider wrote:
> On 16/08/2019 08:08, Liangyan wrote:
>> Please check the dmesg log below with "WARN_ON(cfs_rq->runtime_remaining > 0)".
>> If I apply my patch, the warning is gone. The reproducing case is
>> appended at the end.
>
> [...]
>
> Huh, thanks for the log & the reproducer. I'm still struggling to
> understand how we could hit the condition you're adding, since
> account_cfs_rq_runtime() shouldn't be called for throttled cfs_rqs (which
> I guess is the bug). Also, if the cfs_rq is throttled, shouldn't we
> prevent any further decrement of its ->runtime_remaining?
>
> I had a look at the callers of account_cfs_rq_runtime():
>
> - update_curr(). Seems safe, but has a cfs_rq->curr check at the top. This
>   won't catch throttled cfs_rqs because AFAICT their curr pointer isn't
>   NULL'd on throttle.
>
> - check_enqueue_throttle(). Already has a cfs_rq_throttled() check.
>
> - set_next_task_fair(). Peter shuffled the whole set/put task thing
>   recently, but last I looked it seemed all sane.
>
> I'll try to make sense of it, but I've also Cc'd Paul since, unlike me,
> he actually knows this stuff.

Hah, seems like we get update_curr() calls on throttled rqs via
put_prev_entity():

[  151.538560]  put_prev_entity+0x8d/0x100
[  151.538562]  put_prev_task_fair+0x22/0x40
[  151.538564]  pick_next_task_fair+0x140/0x390
[  151.538566]  __schedule+0x122/0x6c0
[  151.538568]  schedule+0x2d/0x90
[  151.538570]  exit_to_usermode_loop+0x61/0x100
[  151.538572]  prepare_exit_to_usermode+0x91/0xa0
[  151.538573]  retint_user+0x8/0x8

Debug warns:

-8<-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..41e0e78de4fe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -828,6 +828,8 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 }
 #endif /* CONFIG_SMP */
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -840,6 +842,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (unlikely(!curr))
 		return;
 
+	WARN_ON(cfs_rq_throttled(cfs_rq));
+
 	delta_exec = now - curr->exec_start;
 	if (unlikely((s64)delta_exec <= 0))
 		return;
@@ -10169,6 +10173,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		set_next_entity(cfs_rq, se);
+		WARN_ON(cfs_rq_throttled(cfs_rq));
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
 	}
->8-

So I guess what we'd want there is something like

-8<-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..b2c40f994aa9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -828,6 +828,8 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 }
 #endif /* CONFIG_SMP */
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -840,6 +842,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (unlikely(!curr))
 		return;
 
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
 	delta_exec = now - curr->exec_start;
 	if (unlikely((s64)delta_exec <= 0))
 		return;
->8-

but I still don't comprehend how we can get there in the first place.
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 16/08/2019 08:08, Liangyan wrote:
> Please check the dmesg log below with "WARN_ON(cfs_rq->runtime_remaining > 0)".
> If I apply my patch, the warning is gone. The reproducing case is
> appended at the end.

[...]

Huh, thanks for the log & the reproducer. I'm still struggling to
understand how we could hit the condition you're adding, since
account_cfs_rq_runtime() shouldn't be called for throttled cfs_rqs (which
I guess is the bug). Also, if the cfs_rq is throttled, shouldn't we
prevent any further decrement of its ->runtime_remaining?

I had a look at the callers of account_cfs_rq_runtime():

- update_curr(). Seems safe, but has a cfs_rq->curr check at the top. This
  won't catch throttled cfs_rqs because AFAICT their curr pointer isn't
  NULL'd on throttle.

- check_enqueue_throttle(). Already has a cfs_rq_throttled() check.

- set_next_task_fair(). Peter shuffled the whole set/put task thing
  recently, but last I looked it seemed all sane.

I'll try to make sense of it, but I've also Cc'd Paul since, unlike me,
he actually knows this stuff.
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 14/08/2019 19:00, Liangyan wrote:
> do_sched_cfs_period_timer() will refill cfs_b runtime and call
> distribute_cfs_runtime() to unthrottle cfs_rqs, but sometimes
> cfs_b->runtime will incorrectly allocate all of its quota to one cfs_rq.
> This causes the other cfs_rqs to get no runtime and to stay throttled.
> We found that a throttled cfs_rq can have a non-negative
> cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to
> u64 in this snippet of distribute_cfs_runtime():
>
> 	runtime = -cfs_rq->runtime_remaining + 1;
>
> The cast makes runtime a huge number, and cfs_b->runtime ends up being
> decremented all the way to zero.

I'm a complete CFS bandwidth noob, but let me give this a try...

-Wconversion does pick this up (turning this thing on made me understand
why it's not on by default):

kernel/sched/fair.c: In function ‘distribute_cfs_runtime’:
kernel/sched/fair.c:4633:13: warning: conversion to ‘u64’ {aka ‘long long unsigned int’} from ‘s64’ {aka ‘long long int’} may change the sign of the result [-Wsign-conversion]
   runtime = -cfs_rq->runtime_remaining + 1;
             ^
kernel/sched/fair.c:4638:29: warning: conversion to ‘long long unsigned int’ from ‘s64’ {aka ‘long long int’} may change the sign of the result [-Wsign-conversion]
   cfs_rq->runtime_remaining += runtime;
                             ^~

The thing is, we have a !cfs_rq_throttled() check just before the snippet
you're calling out, so AFAICT cfs_rq->runtime_remaining has to be <= 0
there (otherwise this cfs_rq wouldn't be throttled). I doubt you can get
this to fire, but just to be sure...

-8<-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bbd90adabe2a..836948a3ae23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4630,6 +4630,8 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
->8-

Other than those signed/unsigned shenanigans, I only see one other
scenario that leads to a cfs_rq getting allocated all the remaining
runtime: its .runtime_remaining just has to be greater than or equal (in
absolute value) to the remaining runtime. If that's happening
consistently, I suppose that could be due to long delays between
update_curr_fair() calls, but I can't think right now why that would
happen.

> This commit prevents a throttled cfs_rq from being assigned new runtime,
> avoiding the incorrect type cast above.
>
> Signed-off-by: Liangyan
> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fd8a2a605b..b14d67d28138 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
>
> +	if (cfs_rq->throttled)
> +		return;
>  	/*
>  	 * if we're unable to extend our runtime we resched so that the active
>  	 * hierarchy can be throttled
>
[PATCH] sched/fair: don't assign runtime for throttled cfs_rq
do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime() to unthrottle cfs_rqs, but sometimes
cfs_b->runtime will incorrectly allocate all of its quota to one cfs_rq.
This causes the other cfs_rqs to get no runtime and to stay throttled.

We found that a throttled cfs_rq can have a non-negative
cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to
u64 in this snippet of distribute_cfs_runtime():

	runtime = -cfs_rq->runtime_remaining + 1;

The cast makes runtime a huge number, and cfs_b->runtime ends up being
decremented all the way to zero.

This commit prevents a throttled cfs_rq from being assigned new runtime,
avoiding the incorrect type cast above.

Signed-off-by: Liangyan
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fd8a2a605b..b14d67d28138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
-- 
2.14.4.44.g2045bb6