Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates
On 08-Mar 10:48, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:53PM +, Patrick Bellasi wrote:
> > +#define UTIL_EST_NEED_UPDATE_FLAG 0x1
> >
> > @@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> >  	if (!task_sleep)
> >  		return;
> >
> > +	/*
> > +	 * Skip update of task's estimated utilization if the PELT signal has
> > +	 * never been updated (at least once) since last enqueue time.
> > +	 */
> > +	ue = READ_ONCE(p->se.avg.util_est);
> > +	if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
> > +		return;
>
> The name and function seem inverted: if the flag is set, we do _NOT_
> update util_est.

My reading was along the lines of "when the flag is set we need an
update of util_avg to collect a new util_est sample"...

... but I agree that's confusing... and unnecessarily long.

> How about something like UTIL_EST_UNCHANGED? That would give:

I would prefer UTIL_AVG_UNCHANGED, since the flag is reset when we have
a change in util_avg, thus enabling util_est updates.

> 	/*
> 	 * If the PELT values haven't changed since enqueue time,
> 	 * skip the util_est update.
> 	 */
> 	if (ue.enqueued & UTIL_EST_UNCHANGED)
> 		return;

-- 
#include <best/regards.h>

Patrick Bellasi
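The rename under discussion can be sketched in plain C. This is a simplified, user-space stand-in for the kernel's struct util_est and the dequeue-time check — the struct layout and helper name here are ours, only the flag semantics follow the thread (using Patrick's preferred UTIL_AVG_UNCHANGED spelling):

```c
#include <assert.h>

/* Stand-in for the kernel's struct util_est: the LSB of 'enqueued'
 * flags "util_avg unchanged since this task was enqueued". */
#define UTIL_AVG_UNCHANGED 0x1

struct util_est {
	unsigned int enqueued;
	unsigned int ewma;
};

/* Return non-zero when the util_est update should be skipped: the
 * flag is still set, so util_avg has not changed since enqueue and
 * there is no new sample worth folding into the estimate. */
static inline int util_est_skip_update(const struct util_est *ue)
{
	return ue->enqueued & UTIL_AVG_UNCHANGED;
}
```

With this naming, the dequeue path reads naturally: "if util_avg is unchanged, skip", which is the inversion Peter objected to in the original UTIL_EST_NEED_UPDATE_FLAG name.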
Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates
On Thu, Feb 22, 2018 at 05:01:53PM +, Patrick Bellasi wrote:
> +#define UTIL_EST_NEED_UPDATE_FLAG 0x1

> @@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
>  	if (!task_sleep)
>  		return;
>
> +	/*
> +	 * Skip update of task's estimated utilization if the PELT signal has
> +	 * never been updated (at least once) since last enqueue time.
> +	 */
> +	ue = READ_ONCE(p->se.avg.util_est);
> +	if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
> +		return;

The name and function seem inverted: if the flag is set, we do _NOT_
update util_est.

How about something like UTIL_EST_UNCHANGED? That would give:

	/*
	 * If the PELT values haven't changed since enqueue time,
	 * skip the util_est update.
	 */
	if (ue.enqueued & UTIL_EST_UNCHANGED)
		return;
Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates
On Wed, Mar 07, 2018 at 11:38:52AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:53PM +, Patrick Bellasi wrote:
> > @@ -5218,7 +5242,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> >
> >  	/* Update root cfs_rq's estimated utilization */
> >  	enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > -	enqueued += _task_util_est(p);
> > +	enqueued += (_task_util_est(p) | 0x1);
>
> UTIL_EST_NEED_UPDATE_FLAG, although I do agree that 0x1 is much easier
> to type ;-)
>
> But you set it for the cfs_rq value?! That doesn't seem right.
>
> >  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> >  }
> >
> > @@ -5310,7 +5334,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> >  	if (cfs_rq->nr_running) {
> >  		ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> >  		ue.enqueued -= min_t(unsigned int, ue.enqueued,
> > -				     _task_util_est(p));
> > +				     (_task_util_est(p) | UTIL_EST_NEED_UPDATE_FLAG));
> >  	}
> >  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
>
> OK, so you unconditionally set that bit here to make the add/sub match.

Clearly I wasn't having a good day yesterday.
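The add/sub match Peter concedes here can be modeled in a few lines of user-space C. These are toy stand-ins (rq_enqueued models cfs_rq->avg.util_est.enqueued; the real kernel helpers differ): because the task's contribution is ORed with the flag bit on both the enqueue and dequeue sides, the root cfs_rq sum nets out exactly.

```c
#include <assert.h>

#define UTIL_EST_NEED_UPDATE_FLAG 0x1

/* Toy model of cfs_rq->avg.util_est.enqueued. */
static unsigned int rq_enqueued;

/* Add the task's contribution with the flag bit set, as the patch
 * does at enqueue time. */
static void toy_util_est_enqueue(unsigned int task_util_est)
{
	rq_enqueued += (task_util_est | UTIL_EST_NEED_UPDATE_FLAG);
}

/* Subtract the same ORed value at dequeue time, clamped so the sum
 * never underflows (mirrors the min_t() in the patch). */
static void toy_util_est_dequeue(unsigned int task_util_est)
{
	unsigned int sub = task_util_est | UTIL_EST_NEED_UPDATE_FLAG;

	rq_enqueued -= (sub < rq_enqueued ? sub : rq_enqueued);
}
```

If only one side ORed the flag in, each enqueue/dequeue pair would leak (or strip) one unit from the aggregate, which is the asymmetry Peter initially flagged as a bug.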
Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates
On Thu, Feb 22, 2018 at 05:01:53PM +, Patrick Bellasi wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8364771f7301..1bf9a86ebc39 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3047,6 +3047,29 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  	}
>  }
>
> +/*
> + * When a task is dequeued, its estimated utilization should not be updated if
> + * its util_avg has not been updated at least once.
> + * This flag is used to synchronize util_avg updates with util_est updates.
> + * We map this information into the LSB bit of the utilization saved at
> + * dequeue time (i.e. util_est.dequeued).
> + */
> +#define UTIL_EST_NEED_UPDATE_FLAG 0x1
> +
> +static inline void cfs_se_util_change(struct sched_avg *avg)
> +{

	if (!sched_feat(UTIL_EST))
		return;

> +	if (sched_feat(UTIL_EST)) {
> +		struct util_est ue = READ_ONCE(avg->util_est);
> +
> +		if (!(ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG))
> +			return;
> +
> +		/* Reset flag to report util_avg has been updated */
> +		ue.enqueued &= ~UTIL_EST_NEED_UPDATE_FLAG;
> +		WRITE_ONCE(avg->util_est, ue);
> +	}

and lose the indent. Also, since we only update the enqueued value, we
don't need to load/store the entire util_est thing here.

> +}
> +
>  #ifdef CONFIG_SMP
>  /*
>   * Approximate:
> @@ -3308,6 +3331,7 @@ __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entit
>  			cfs_rq->curr == se)) {
>
>  		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
> +		cfs_se_util_change(&se->avg);
>  		return 1;
>  	}

So we only clear the bit for @se updates.

> @@ -5218,7 +5242,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
>
>  	/* Update root cfs_rq's estimated utilization */
>  	enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> -	enqueued += _task_util_est(p);
> +	enqueued += (_task_util_est(p) | 0x1);

UTIL_EST_NEED_UPDATE_FLAG, although I do agree that 0x1 is much easier
to type ;-)

But you set it for the cfs_rq value?! That doesn't seem right.

>  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
>  }
>
> @@ -5310,7 +5334,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
>  	if (cfs_rq->nr_running) {
>  		ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
>  		ue.enqueued -= min_t(unsigned int, ue.enqueued,
> -				     _task_util_est(p));
> +				     (_task_util_est(p) | UTIL_EST_NEED_UPDATE_FLAG));
>  	}
>  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

Would it be really horrible if you separate the value and flag using a
bitfield/shifts?

> @@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
>  	if (!task_sleep)
>  		return;
>
> +	/*
> +	 * Skip update of task's estimated utilization if the PELT signal has
> +	 * never been updated (at least once) since last enqueue time.
> +	 */
> +	ue = READ_ONCE(p->se.avg.util_est);
> +	if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
> +		return;
> +
>  	/*
>  	 * Skip update of task's estimated utilization when its EWMA is
>  	 * already ~1% close to its last activation value.
>  	 */
> +	ue.enqueued = (task_util(p) | UTIL_EST_NEED_UPDATE_FLAG);
>  	last_ewma_diff = ue.enqueued - ue.ewma;
>  	if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
>  		return;

I see what you do, but Yuck! that's really nasty. Then again, I've not
actually got a better suggestion.
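One possible reading of the bitfield/shifts suggestion, as a user-space sketch (the helper names and layout are illustrative, not what the patch actually does): keep the utilization in the upper bits and the flag in bit 0 via explicit shifts, so the stored value keeps full precision and never needs the flag masked out before arithmetic.

```c
#include <assert.h>

#define UTIL_EST_UNCHANGED_BIT 0x1

/* Pack a utilization value and the "unchanged" flag into one word:
 * value in bits 1..N, flag in bit 0. */
static inline unsigned int util_est_pack(unsigned int util, int unchanged)
{
	return (util << 1) | (unchanged ? UTIL_EST_UNCHANGED_BIT : 0);
}

/* Recover the utilization value, untouched by the flag bit. */
static inline unsigned int util_est_value(unsigned int packed)
{
	return packed >> 1;
}

/* Test the flag in isolation. */
static inline int util_est_unchanged(unsigned int packed)
{
	return packed & UTIL_EST_UNCHANGED_BIT;
}
```

The trade-off versus the patch's OR-into-LSB scheme: shifting preserves odd utilization values exactly, but every read site must unpack, whereas the patch keeps reads cheap at the cost of making util_est an even-values-only metric.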
Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates
The changelog is missing the below CCs. :(

Since that's a new patch in this series, I expect some feedback and
thus I'll add them on the next respin.

On 22-Feb 17:01, Patrick Bellasi wrote:
> The estimated utilization of a task is currently updated every time the
> task is dequeued. However, to keep overheads under control, PELT signals
> are effectively updated at most once every 1ms.
>
> Thus, for really short-running tasks, it can happen that their util_avg
> value has not been updated since their last enqueue. If such tasks are
> also frequently running tasks (e.g. the kind of workload generated by
> hackbench) it can also happen that their util_avg is updated only every
> few activations.
>
> This means that updating util_est at every dequeue potentially introduces
> unnecessary overheads, and it's also conceptually wrong if the util_avg
> signal has never been updated during a task activation.
>
> Let's introduce a throttling mechanism on task's util_est updates
> to sync them with util_avg updates. To make the solution memory
> efficient, both in terms of space and load/store operations, we encode a
> synchronization flag into the LSB of util_est.enqueued.
> This makes util_est an even-values-only metric, which is still
> considered good enough for its purpose.
> The synchronization bit is (re)set by __update_load_avg_se() once the
> PELT signal of a task has been updated during its last activation.
>
> Such a throttling mechanism allows us to keep util_est overheads under
> control in the wakeup hot path, thus making it a suitable mechanism
> which can be enabled also on high-intensity workload systems.
> Thus, this now switches on by default the estimated utilization
> scheduler feature.
>
> Suggested-by: Chris Redpath
> Signed-off-by: Patrick Bellasi

Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Vincent Guittot
Cc: Morten Rasmussen
Cc: Dietmar Eggemann
Cc: linux-kernel@vger.kernel.org

[...]

-- 
#include <best/regards.h>

Patrick Bellasi
[PATCH v5 4/4] sched/fair: update util_est only on util_avg updates
The estimated utilization of a task is currently updated every time the
task is dequeued. However, to keep overheads under control, PELT signals
are effectively updated at most once every 1ms.

Thus, for really short-running tasks, it can happen that their util_avg
value has not been updated since their last enqueue. If such tasks are
also frequently running tasks (e.g. the kind of workload generated by
hackbench) it can also happen that their util_avg is updated only every
few activations.

This means that updating util_est at every dequeue potentially introduces
unnecessary overheads, and it's also conceptually wrong if the util_avg
signal has never been updated during a task activation.

Let's introduce a throttling mechanism on task's util_est updates
to sync them with util_avg updates. To make the solution memory
efficient, both in terms of space and load/store operations, we encode a
synchronization flag into the LSB of util_est.enqueued.
This makes util_est an even-values-only metric, which is still
considered good enough for its purpose.
The synchronization bit is (re)set by __update_load_avg_se() once the
PELT signal of a task has been updated during its last activation.

Such a throttling mechanism allows us to keep util_est overheads under
control in the wakeup hot path, thus making it a suitable mechanism
which can be enabled also on high-intensity workload systems.
Thus, this now switches on by default the estimated utilization
scheduler feature.
Suggested-by: Chris Redpath
Signed-off-by: Patrick Bellasi

---
Changes in v5:
 - set SCHED_FEAT(UTIL_EST, true) as default (Peter)
---
 kernel/sched/fair.c     | 39 +++
 kernel/sched/features.h |  2 +-
 2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8364771f7301..1bf9a86ebc39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3047,6 +3047,29 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
+/*
+ * When a task is dequeued, its estimated utilization should not be updated if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_EST_NEED_UPDATE_FLAG 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+	if (sched_feat(UTIL_EST)) {
+		struct util_est ue = READ_ONCE(avg->util_est);
+
+		if (!(ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG))
+			return;
+
+		/* Reset flag to report util_avg has been updated */
+		ue.enqueued &= ~UTIL_EST_NEED_UPDATE_FLAG;
+		WRITE_ONCE(avg->util_est, ue);
+	}
+}
+
 #ifdef CONFIG_SMP
 /*
  * Approximate:
@@ -3308,6 +3331,7 @@ __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entit
 			cfs_rq->curr == se)) {
 
 		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		cfs_se_util_change(&se->avg);
 		return 1;
 	}
 
@@ -5218,7 +5242,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
 
 	/* Update root cfs_rq's estimated utilization */
 	enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
-	enqueued += _task_util_est(p);
+	enqueued += (_task_util_est(p) | 0x1);
 	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
 }
 
@@ -5310,7 +5334,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
 	if (cfs_rq->nr_running) {
 		ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
 		ue.enqueued -= min_t(unsigned int, ue.enqueued,
-				     _task_util_est(p));
+				     (_task_util_est(p) | UTIL_EST_NEED_UPDATE_FLAG));
 	}
 	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
 
@@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
 	if (!task_sleep)
 		return;
 
+	/*
+	 * Skip update of task's estimated utilization if the PELT signal has
+	 * never been updated (at least once) since last enqueue time.
+	 */
+	ue = READ_ONCE(p->se.avg.util_est);
+	if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
+		return;
+
 	/*
 	 * Skip update of task's estimated utilization when its EWMA is
 	 * already ~1% close to its last activation value.
 	 */
-	ue = READ_ONCE(p->se.avg.util_est);
-	ue.enqueued = task_util(p);
+	ue.enqueued = (task_util(p) | UTIL_EST_NEED_UPDATE_FLAG);
 	last_ewma_diff = ue.enqueued - ue.ewma;
 	if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
 		return;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index
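As a minimal user-space illustration of the LSB encoding used above (the helper names are ours, not the kernel's): setting the flag makes the stored value odd, and clearing it rounds any odd utilization down by one, which is why the changelog describes util_est as an even-values-only metric.

```c
#include <assert.h>

#define UTIL_AVG_UNCHANGED 0x1

/* Mark the stored utilization: util_avg has not been updated since
 * enqueue, so the next dequeue should skip the util_est update. */
static inline unsigned int util_est_set_flag(unsigned int util)
{
	return util | UTIL_AVG_UNCHANGED;
}

/* Clear the mark once util_avg has been updated; for odd inputs this
 * also drops one unit of utilization, the precision cost of stealing
 * the LSB. */
static inline unsigned int util_est_clear_flag(unsigned int enqueued)
{
	return enqueued & ~UTIL_AVG_UNCHANGED;
}
```

The at-most-one-unit error is considered acceptable because utilization values range up to SCHED_CAPACITY_SCALE (1024), so the LSB is well below the noise floor of the signal.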