Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
On Tue, Sep 02, 2014 at 12:41:24AM -0700, Jason Low wrote:
> On Mon, 2014-09-01 at 14:55 +0200, Peter Zijlstra wrote:
> > But yes, unbounded errors here are a problem, sure relaxing the updates
> > makes things go fast, they also make things go skew.
>
> Okay. In that case, would you like to take our original patch which
> avoids unnecessary updates?

This seems a safe patch; yes I'll queue it while we think of smarter
things :-)
Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
On Mon, 2014-09-01 at 14:55 +0200, Peter Zijlstra wrote:
> But yes, unbounded errors here are a problem, sure relaxing the updates
> makes things go fast, they also make things go skew.

Okay. In that case, would you like to take our original patch which
avoids unnecessary updates?

-
Subject: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load

When running workloads on 2+ socket systems, based on perf profiles, the
update_cfs_rq_blocked_load function often shows up as taking up a
noticeable % of run time.

Much of the contention is in __update_cfs_rq_tg_load_contrib when we
update the tg load contribution stats. However, it turns out that in many
cases, they don't need to be updated and "tg_contrib" is 0.

This patch adds a check in __update_cfs_rq_tg_load_contrib to skip
updating tg load contribution stats when nothing needs to be updated.
This reduces the cacheline contention that would be unnecessary.

Cc: Yuyang Du
Cc: Aswin Chandramouleeswaran
Cc: Chegu Vinod
Cc: Scott J Norton
Reviewed-by: Ben Segall
Reviewed-by: Waiman Long
Signed-off-by: Jason Low
---
 kernel/sched/fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3427a8..45e346c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2382,6 +2382,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
+	if (!tg_contrib)
+		return;
+
 	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 		atomic_long_add(tg_contrib, &tg->load_avg);
 		cfs_rq->tg_load_contrib += tg_contrib;
-- 
1.7.1
Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
On Thu, Aug 28, 2014 at 12:46:36PM -0700, Jason Low wrote:
> On Wed, 2014-08-27 at 16:32 -0700, Tim Chen wrote:
> > If there are multiple non-forced updates, option 1's error seems to
> > accumulate and non-bounded as we do not actually update?
> > Is this a concern?
>
> It should be fine. Once the delta is large enough, we will end up doing
> the update anyway.

Well, the thing is you can have nr_cpus * 12.5% of outstanding delta;
that might be a lot, esp on the large machines.

Now there's two problems with all this; the first is the relative
threshold. Typically such per-cpu things have a fixed update threshold,
which makes it much easier to qualify the actual error.

Secondly, there is indeed the nr_cpus in the error bound. Some things,
like the proportion code, scale the threshold by log2(nr_cpus) in an
attempt to do something sensible there.

But yes, unbounded errors here are a problem; sure, relaxing the updates
makes things go fast, but they also make things go skew.
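For illustration, one possible reading of the log2(nr_cpus)-scaled threshold
mentioned above could look roughly like the sketch below. This is a sketch,
not code posted in this thread; the helper name and its placement are
hypothetical.

/*
 * Illustrative sketch only -- not from this thread. Shrink the per-cfs_rq
 * update threshold by log2 of the number of online CPUs so that the total
 * outstanding delta grows more slowly with machine size.
 */
static inline bool tg_contrib_over_threshold(struct cfs_rq *cfs_rq,
                                             long tg_contrib)
{
	long threshold = cfs_rq->tg_load_contrib / 8;
	int shift = ilog2(num_online_cpus());

	/* Tighten the threshold on larger machines. */
	if (shift > 1)
		threshold /= shift;

	return abs(tg_contrib) > threshold;
}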
Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
On Wed, 2014-08-27 at 16:32 -0700, Tim Chen wrote:
> On Wed, 2014-08-27 at 10:34 -0700, Jason Low wrote:
> > On Tue, 2014-08-26 at 16:24 -0700, Paul Turner wrote:
> > > On Tue, Aug 26, 2014 at 4:11 PM, Jason Low wrote:
> > > > Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> > > > shows up as taking up a noticeable % of system run time. This is especially
> > > > apparent on larger numa systems.
> > > >
> > > > Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> > > > updating the tg load contribution stats. However, it was noticed that the
> > > > values often don't get modified by much. In fact, much of the time, they
> > > > don't get modified at all. However, the update can always get attempted due
> > > > to force_update.
> > > >
> > > > In this patch, we remove the force_update in only the
> > > > __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> > > > modified only if the delta is large enough (in the current code, they get
> > > > updated when the delta is larger than 12.5%). This is a way to rate-limit
> > > > the updates while largely keeping the values accurate.
> > > >
> > > > When testing this change with AIM7 workloads, we found that it was able to
> > > > reduce the overhead of the function by up to a factor of 20x.
> > >
> > > Looks reasonable.
> > >
> > > >
> > > > Cc: Yuyang Du
> > > > Cc: Waiman Long
> > > > Cc: Mel Gorman
> > > > Cc: Mike Galbraith
> > > > Cc: Rik van Riel
> > > > Cc: Aswin Chandramouleeswaran
> > > > Cc: Chegu Vinod
> > > > Cc: Scott J Norton
> > > > Signed-off-by: Jason Low
> > > > ---
> > > >  kernel/sched/fair.c |   10 --
> > > >  1 files changed, 4 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index fea7d33..7a6e18b 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> > > >  }
> > > >
> > > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > > > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > > > -                                                   int force_update)
> > > > +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
> > > >  {
> > > >         struct task_group *tg = cfs_rq->tg;
> > > >         long tg_contrib;
> > > > @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > > >         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> > > >         tg_contrib -= cfs_rq->tg_load_contrib;
> > > >
> > > > -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> > >
> > > Another option with slightly higher accuracy would be to increase the
> > > sensitivity here when force_update == 1.
> > >
> > > E.g.:
> > >   abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...
> > >
> > > Alternatively we could bound total inaccuracy:
> > >    int divisor = force_update ? NR_CPUS : 8;
> > >    if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...
> > >
> > > [ And probably rename force_update to want_update ]
> >
> > Out of the 2 additional options, I think the first one is better. The
> > other option of using NR_CPUS looks like we're increasing the update
> > rate as the system gets larger, and its the larger systems that are
> > typically more affected by the contention.
>
> Probably num_present_cpus() will be better than NR_CPUS, which can
> be much larger than the actual cpus present.

Yeah, num_present_cpus(), though the same issue would still be there.

> >
> > Do you prefer either of the 2 other options over this v2 patch? If so, I
> > can test and send out a new patch, otherwise, I'll keep this current v2
> > patch.
>
> If there are multiple non-forced updates, option 1's error seems to
> accumulate and non-bounded as we do not actually update?
> Is this a concern?

It should be fine. Once the delta is large enough, we will end up doing
the update anyway.

Thanks.
Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
On Wed, 2014-08-27 at 10:34 -0700, Jason Low wrote:
> On Tue, 2014-08-26 at 16:24 -0700, Paul Turner wrote:
> > On Tue, Aug 26, 2014 at 4:11 PM, Jason Low wrote:
> > > Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> > > shows up as taking up a noticeable % of system run time. This is especially
> > > apparent on larger numa systems.
> > >
> > > Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> > > updating the tg load contribution stats. However, it was noticed that the
> > > values often don't get modified by much. In fact, much of the time, they
> > > don't get modified at all. However, the update can always get attempted due
> > > to force_update.
> > >
> > > In this patch, we remove the force_update in only the
> > > __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> > > modified only if the delta is large enough (in the current code, they get
> > > updated when the delta is larger than 12.5%). This is a way to rate-limit
> > > the updates while largely keeping the values accurate.
> > >
> > > When testing this change with AIM7 workloads, we found that it was able to
> > > reduce the overhead of the function by up to a factor of 20x.
> >
> > Looks reasonable.
> >
> > >
> > > Cc: Yuyang Du
> > > Cc: Waiman Long
> > > Cc: Mel Gorman
> > > Cc: Mike Galbraith
> > > Cc: Rik van Riel
> > > Cc: Aswin Chandramouleeswaran
> > > Cc: Chegu Vinod
> > > Cc: Scott J Norton
> > > Signed-off-by: Jason Low
> > > ---
> > >  kernel/sched/fair.c |   10 --
> > >  1 files changed, 4 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index fea7d33..7a6e18b 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> > >  }
> > >
> > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > > -                                                   int force_update)
> > > +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
> > >  {
> > >         struct task_group *tg = cfs_rq->tg;
> > >         long tg_contrib;
> > > @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > >         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> > >         tg_contrib -= cfs_rq->tg_load_contrib;
> > >
> > > -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> >
> > Another option with slightly higher accuracy would be to increase the
> > sensitivity here when force_update == 1.
> >
> > E.g.:
> >   abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...
> >
> > Alternatively we could bound total inaccuracy:
> >    int divisor = force_update ? NR_CPUS : 8;
> >    if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...
> >
> > [ And probably rename force_update to want_update ]
>
> Out of the 2 additional options, I think the first one is better. The
> other option of using NR_CPUS looks like we're increasing the update
> rate as the system gets larger, and its the larger systems that are
> typically more affected by the contention.

Probably num_present_cpus() will be better than NR_CPUS, which can
be much larger than the actual cpus present.

>
> Do you prefer either of the 2 other options over this v2 patch? If so, I
> can test and send out a new patch, otherwise, I'll keep this current v2
> patch.

If there are multiple non-forced updates, option 1's error seems to
accumulate and non-bounded as we do not actually update?
Is this a concern?

Thanks.

Tim
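Concretely, Paul's second option with the num_present_cpus() substitution
suggested above would amount to something like the following sketch. It is
illustrative only; this exact hunk was not posted in the thread.

	/*
	 * Sketch of the "bound total inaccuracy" option, using
	 * num_present_cpus() rather than NR_CPUS. Illustrative only.
	 */
	int divisor = force_update ? num_present_cpus() : 8;

	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) {
		atomic_long_add(tg_contrib, &tg->load_avg);
		cfs_rq->tg_load_contrib += tg_contrib;
	}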
Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
On Tue, 2014-08-26 at 16:24 -0700, Paul Turner wrote:
> On Tue, Aug 26, 2014 at 4:11 PM, Jason Low wrote:
> > Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> > shows up as taking up a noticeable % of system run time. This is especially
> > apparent on larger numa systems.
> >
> > Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> > updating the tg load contribution stats. However, it was noticed that the
> > values often don't get modified by much. In fact, much of the time, they
> > don't get modified at all. However, the update can always get attempted due
> > to force_update.
> >
> > In this patch, we remove the force_update in only the
> > __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> > modified only if the delta is large enough (in the current code, they get
> > updated when the delta is larger than 12.5%). This is a way to rate-limit
> > the updates while largely keeping the values accurate.
> >
> > When testing this change with AIM7 workloads, we found that it was able to
> > reduce the overhead of the function by up to a factor of 20x.
>
> Looks reasonable.
>
> >
> > Cc: Yuyang Du
> > Cc: Waiman Long
> > Cc: Mel Gorman
> > Cc: Mike Galbraith
> > Cc: Rik van Riel
> > Cc: Aswin Chandramouleeswaran
> > Cc: Chegu Vinod
> > Cc: Scott J Norton
> > Signed-off-by: Jason Low
> > ---
> >  kernel/sched/fair.c |   10 --
> >  1 files changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fea7d33..7a6e18b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> >  }
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > -                                                   int force_update)
> > +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
> >  {
> >         struct task_group *tg = cfs_rq->tg;
> >         long tg_contrib;
> > @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> >         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> >         tg_contrib -= cfs_rq->tg_load_contrib;
> >
> > -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>
> Another option with slightly higher accuracy would be to increase the
> sensitivity here when force_update == 1.
>
> E.g.:
>   abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...
>
> Alternatively we could bound total inaccuracy:
>    int divisor = force_update ? NR_CPUS : 8;
>    if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...
>
> [ And probably rename force_update to want_update ]

Out of the 2 additional options, I think the first one is better. The
other option of using NR_CPUS looks like we're increasing the update
rate as the system gets larger, and its the larger systems that are
typically more affected by the contention.

Do you prefer either of the 2 other options over this v2 patch? If so, I
can test and send out a new patch, otherwise, I'll keep this current v2
patch.

Thanks.
Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
On Tue, Aug 26, 2014 at 4:11 PM, Jason Low wrote:
> Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> shows up as taking up a noticeable % of system run time. This is especially
> apparent on larger numa systems.
>
> Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> updating the tg load contribution stats. However, it was noticed that the
> values often don't get modified by much. In fact, much of the time, they
> don't get modified at all. However, the update can always get attempted due
> to force_update.
>
> In this patch, we remove the force_update in only the
> __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> modified only if the delta is large enough (in the current code, they get
> updated when the delta is larger than 12.5%). This is a way to rate-limit
> the updates while largely keeping the values accurate.
>
> When testing this change with AIM7 workloads, we found that it was able to
> reduce the overhead of the function by up to a factor of 20x.

Looks reasonable.

>
> Cc: Yuyang Du
> Cc: Waiman Long
> Cc: Mel Gorman
> Cc: Mike Galbraith
> Cc: Rik van Riel
> Cc: Aswin Chandramouleeswaran
> Cc: Chegu Vinod
> Cc: Scott J Norton
> Signed-off-by: Jason Low
> ---
>  kernel/sched/fair.c |   10 --
>  1 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..7a6e18b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> -                                                   int force_update)
> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
>  {
>         struct task_group *tg = cfs_rq->tg;
>         long tg_contrib;
> @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
>         tg_contrib -= cfs_rq->tg_load_contrib;
>
> -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {

Another option with slightly higher accuracy would be to increase the
sensitivity here when force_update == 1.

E.g.:
  abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...

Alternatively we could bound total inaccuracy:
   int divisor = force_update ? NR_CPUS : 8;
   if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...

[ And probably rename force_update to want_update ]

> +       if (abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>                 atomic_long_add(tg_contrib, &tg->load_avg);
>                 cfs_rq->tg_load_contrib += tg_contrib;
>         }
> @@ -2436,8 +2435,7 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>         __update_tg_runnable_avg(&rq->avg, &rq->cfs);
>  }
>  #else /* CONFIG_FAIR_GROUP_SCHED */
> -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> -                                                   int force_update) {}
> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq) {}
>  static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>                                             struct cfs_rq *cfs_rq) {}
>  static inline void __update_group_entity_contrib(struct sched_entity *se) {}
> @@ -2537,7 +2535,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>                 cfs_rq->last_decay = now;
>         }
>
> -       __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
> +       __update_cfs_rq_tg_load_contrib(cfs_rq);
>  }
>
>  /* Add the load generated by se into cfs_rq's child load-average */
> --
> 1.7.1
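Spelled out, the first of those options would amount to roughly the
following sketch. It is illustrative only; this exact hunk was not posted
in the thread.

	/*
	 * Sketch of the first option: halve the update threshold when a
	 * forced update is requested, instead of updating unconditionally.
	 * Illustrative only.
	 */
	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) {
		atomic_long_add(tg_contrib, &tg->load_avg);
		cfs_rq->tg_load_contrib += tg_contrib;
	}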
[PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
Based on perf profiles, the update_cfs_rq_blocked_load function constantly
shows up as taking up a noticeable % of system run time. This is especially
apparent on larger numa systems.

Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
updating the tg load contribution stats. However, it was noticed that the
values often don't get modified by much. In fact, much of the time, they
don't get modified at all. However, the update can always get attempted due
to force_update.

In this patch, we remove the force_update in only the
__update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
modified only if the delta is large enough (in the current code, they get
updated when the delta is larger than 12.5%). This is a way to rate-limit
the updates while largely keeping the values accurate.

When testing this change with AIM7 workloads, we found that it was able to
reduce the overhead of the function by up to a factor of 20x.

Cc: Yuyang Du
Cc: Waiman Long
Cc: Mel Gorman
Cc: Mike Galbraith
Cc: Rik van Riel
Cc: Aswin Chandramouleeswaran
Cc: Chegu Vinod
Cc: Scott J Norton
Signed-off-by: Jason Low
---
 kernel/sched/fair.c |   10 --
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..7a6e18b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-                                                   int force_update)
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	long tg_contrib;
@@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
-	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 		atomic_long_add(tg_contrib, &tg->load_avg);
 		cfs_rq->tg_load_contrib += tg_contrib;
 	}
@@ -2436,8 +2435,7 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-                                                   int force_update) {}
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 					    struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
@@ -2537,7 +2535,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		cfs_rq->last_decay = now;
 	}
 
-	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+	__update_cfs_rq_tg_load_contrib(cfs_rq);
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
-- 
1.7.1