Re: SCHED_DEADLINE cpudeadline.{h,c} fixup
On Sun, Aug 14, 2016 at 04:27:05PM +0200, Tommaso Cucinotta wrote: > Hi, > > this is a rework of the cpudeadline bugfix and speed-up patch-set, that > integrates all comments received so far from Luca, Juri and Peter. > > Compared with the previous post, here: > -) I'm keeping out the minimally invasive bugfix, as it's already been >merged in tip/sched/core > -) I moved some little code refactory around change_key_dl() out of the > (now) 2nd patch, to the 1st one. Now the 2nd (speed-up) patch just > changes the heapify_up/down() functions > -) I rebased on top of commit f0b22e39 > -) I repeated an extensive set of tests through the framework published >separately at: https://github.com/tomcucinotta/cpudl-bench >repeating new no-behavior-change tests, new heap-consistency tests, >and new a/b benchmarks (I'm working on a new i5 laptop now), results at: > https://github.com/tomcucinotta/cpudl-bench/blob/master/cpudl-10.pdf >highlighting up to a 14% speed-up when averaging over 100K ops. See the >enclosed README in that repo for more info. Thanks!
SCHED_DEADLINE cpudeadline.{h,c} fixup
Hi, this is a rework of the cpudeadline bugfix and speed-up patch-set, that integrates all comments received so far from Luca, Juri and Peter. Compared with the previous post, here: -) I'm keeping out the minimally invasive bugfix, as it's already been merged in tip/sched/core -) I moved some little code refactory around change_key_dl() out of the (now) 2nd patch, to the 1st one. Now the 2nd (speed-up) patch just changes the heapify_up/down() functions -) I rebased on top of commit f0b22e39 -) I repeated an extensive set of tests through the framework published separately at: https://github.com/tomcucinotta/cpudl-bench repeating new no-behavior-change tests, new heap-consistency tests, and new a/b benchmarks (I'm working on a new i5 laptop now), results at: https://github.com/tomcucinotta/cpudl-bench/blob/master/cpudl-10.pdf highlighting up to a 14% speed-up when averaging over 100K ops. See the enclosed README in that repo for more info. I'm leaving below the original description of all 4 patches. -- The first patch is a minimally invasive (1-line) fix for the deadline wrap-around bug. This leaves some weirdness in how cpudl_change_key() is called. Therefore, the second patch does a minimum of refactory to make things more explicit and clear. The 3rd patch contains now the actual performance enhancement (avoiding unneeded swaps during heapify operations), which has been measured to achieve up to 14% of speed-up for cpudl_set() calls. This has been measured with a randomly generated workload of 1K,10K,100K random heap insertions and deletions (75% cpudl_set() calls with is_valid=1 and 25% with is_valid=0), and randomly generated cpu IDs, with up to 256 CPUs. Benchmarking code at: https://github.com/tomcucinotta/cpudl-bench Finally, the 4th patch is another clear-up patch touching cpudeadline.{h,c} and deadline.c. Now you call cpudl_clear(cp, cpu) and cpudl_set(cp, cpu, dl) instead of cpudl_set(cp, cpu, 0 /* dl */, 0 /* is_valid */) and cpudl_set(cp, cpu, dl, 1 /* is_valid */). Any further comment is welcome, thanks! Tommaso --
SCHED_DEADLINE cpudeadline.{h,c} fixup
Hi, this is a rework of the cpudeadline bugfix and speed-up patch-set, that integrates all comments received so far from Luca and Juri. The first patch is a minimally invasive (1-line) fix for the deadline wrap-around bug. This leaves some weirdness in how cpudl_change_key() is called. Therefore, the second patch does a minimum of refactory to make things more explicit and clear. The 3rd patch contains now the actual performance enhancement (avoiding unneeded swaps during heapify operations), which has been measured to achieve up to 10% , of speed-up for cpudl_set() calls. This has been measured with a andomly generated workload of 1K,10K,100K random heap insertions and deletions (75% cpudl_set() calls with is_valid=1 and 25% with is_valid=0), and randomly generated cpu IDs, with up to 256 CPUs. Benchmarking code is available at: https://github.com/tomcucinotta/cpudl-bench Obtained speed-up plot: https://github.com/tomcucinotta/cpudl-bench/blob/master/cpudl.pdf Finally, the 4th patch is another clear-up patch touching cpudeadline.{h,c} and deadline.c. Now you call cpudl_clear(cp, cpu) and cpudl_set(cp, cpu, dl) instead of cpudl_set(cp, cpu, 0 /* dl */, 0 /* is_valid */) and cpudl_set(cp, cpu, dl, 1 /* is_valid */). Please, share your comments, thanks! Tommaso
SCHED_DEADLINE cpudeadline.{h,c} fixup
Hi all, I took Luca's advice to isolate the deadline wrap-around bugfix with a first minimally invasive patch (1-line). This leaves some weirdness in how cpudl_change_key() is called. Therefore, the second patch does a minimum of refactory to make things more explicit and clear. The 3rd patch contains now the actual performance enhancement (avoiding unneeded swaps during heapify operations), which, as said in the previous post, achieves up to 6% of speed-up for cpudl_set() calls. Finally, the 4th patch is another clear-up patch touching cpudeadline.{h,c} and deadline.c. Now you call cpudl_clear(cp, cpu) and cpudl_set(cp, cpu, dl) instead of cpudl_set(cp, cpu, 0 /* dl */, 0 /* is_valid */) and cpudl_set(cp, cpu, dl, 1 /* is_valid */). Please, let me know how this looks like now. Thanks, Tommaso
Re: SCHED_DEADLINE cpudeadline.{h,c} fixup
Hi Tommaso, On 18/05/16 00:43, Tommaso Cucinotta wrote: > On 17/05/2016 13:46, luca abeni wrote: > >Maybe the ... change can be split in a separate > >patch, which is a bugfix (and IMHO uncontroversial)? > > Ok, the bugfix alone might look like the attached. Couldn't avoid > the little refactoring of the multiple occurrences of the same loop > up the heap into the heapify_up(), mirroring the heapify() that was > already there (renamed heapify_down() for clarity). > > I'll rebase the speed-up patch on top of this, if it's a better approach. > > Anyone with further comments? > Couldn't spend any time on this yet, apologies. But, for the next posting, could you please do it without attaching the patches? I usually use git send-mail for posting. It would make the review easier, I think. Best, - Juri > Thanks again! > > T. > -- > Tommaso Cucinotta, Computer Engineering PhD > Associate Professor at the Real-Time Systems Laboratory (ReTiS) > Scuola Superiore Sant'Anna, Pisa, Italy > http://retis.sssup.it/people/tommaso > From cfaa75eb77843f7da875a54c7e6631b271bf0663 Mon Sep 17 00:00:00 2001 > From: Tommaso Cucinotta > Date: Tue, 17 May 2016 15:54:11 +0200 > Subject: [PATCH] Deadline wrap-around bugfix for the SCHED_DEADLINE cpu heap. > > --- > kernel/sched/cpudeadline.c | 38 +++--- > 1 file changed, 19 insertions(+), 19 deletions(-) > > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c > index 5be5882..3c42702 100644 > --- a/kernel/sched/cpudeadline.c > +++ b/kernel/sched/cpudeadline.c > @@ -41,7 +41,7 @@ static void cpudl_exchange(struct cpudl *cp, int a, int b) > swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx); > } > > -static void cpudl_heapify(struct cpudl *cp, int idx) > +static void cpudl_heapify_down(struct cpudl *cp, int idx) > { > int l, r, largest; > > @@ -66,20 +66,25 @@ static void cpudl_heapify(struct cpudl *cp, int idx) > } > } > > +static void cpudl_heapify_up(struct cpudl *cp, int idx) > +{ > + while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl, > + cp->elements[idx].dl)) { > + cpudl_exchange(cp, idx, parent(idx)); > + idx = parent(idx); > + } > +} > + > static void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl) > { > WARN_ON(idx == IDX_INVALID || !cpu_present(idx)); > > if (dl_time_before(new_dl, cp->elements[idx].dl)) { > cp->elements[idx].dl = new_dl; > - cpudl_heapify(cp, idx); > + cpudl_heapify_down(cp, idx); > } else { > cp->elements[idx].dl = new_dl; > - while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl, > - cp->elements[idx].dl)) { > - cpudl_exchange(cp, idx, parent(idx)); > - idx = parent(idx); > - } > + cpudl_heapify_up(cp, idx); > } > } > > @@ -154,24 +159,19 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int > is_valid) > cp->size--; > cp->elements[new_cpu].idx = old_idx; > cp->elements[cpu].idx = IDX_INVALID; > - while (old_idx > 0 && dl_time_before( > - cp->elements[parent(old_idx)].dl, > - cp->elements[old_idx].dl)) { > - cpudl_exchange(cp, old_idx, parent(old_idx)); > - old_idx = parent(old_idx); > - } > + cpudl_heapify_up(cp, old_idx); > cpumask_set_cpu(cpu, cp->free_cpus); > -cpudl_heapify(cp, old_idx); > +cpudl_heapify_down(cp, old_idx); > > goto out; > } > > if (old_idx == IDX_INVALID) { > - cp->size++; > - cp->elements[cp->size - 1].dl = 0; > - cp->elements[cp->size - 1].cpu = cpu; > - cp->elements[cpu].idx = cp->size - 1; > - cpudl_change_key(cp, cp->size - 1, dl); > + int size1 = cp->size++; > + cp->elements[size1].dl = dl; > + cp->elements[size1].cpu = cpu; > + cp->elements[cpu].idx = size1; > + cpudl_heapify_up(cp, size1); > cpumask_clear_cpu(cpu, cp->free_cpus); > } else { > cpudl_change_key(cp, old_idx, dl); > -- > 2.7.4 >
Re: SCHED_DEADLINE cpudeadline.{h,c} fixup
On 17/05/2016 13:46, luca abeni wrote: Maybe the ... change can be split in a separate patch, which is a bugfix (and IMHO uncontroversial)? Ok, the bugfix alone might look like the attached. Couldn't avoid the little refactoring of the multiple occurrences of the same loop up the heap into the heapify_up(), mirroring the heapify() that was already there (renamed heapify_down() for clarity). I'll rebase the speed-up patch on top of this, if it's a better approach. Anyone with further comments? Thanks again! T. -- Tommaso Cucinotta, Computer Engineering PhD Associate Professor at the Real-Time Systems Laboratory (ReTiS) Scuola Superiore Sant'Anna, Pisa, Italy http://retis.sssup.it/people/tommaso >From cfaa75eb77843f7da875a54c7e6631b271bf0663 Mon Sep 17 00:00:00 2001 From: Tommaso Cucinotta Date: Tue, 17 May 2016 15:54:11 +0200 Subject: [PATCH] Deadline wrap-around bugfix for the SCHED_DEADLINE cpu heap. --- kernel/sched/cpudeadline.c | 38 +++--- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c index 5be5882..3c42702 100644 --- a/kernel/sched/cpudeadline.c +++ b/kernel/sched/cpudeadline.c @@ -41,7 +41,7 @@ static void cpudl_exchange(struct cpudl *cp, int a, int b) swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx); } -static void cpudl_heapify(struct cpudl *cp, int idx) +static void cpudl_heapify_down(struct cpudl *cp, int idx) { int l, r, largest; @@ -66,20 +66,25 @@ static void cpudl_heapify(struct cpudl *cp, int idx) } } +static void cpudl_heapify_up(struct cpudl *cp, int idx) +{ + while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl, + cp->elements[idx].dl)) { + cpudl_exchange(cp, idx, parent(idx)); + idx = parent(idx); + } +} + static void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl) { WARN_ON(idx == IDX_INVALID || !cpu_present(idx)); if (dl_time_before(new_dl, cp->elements[idx].dl)) { cp->elements[idx].dl = new_dl; - cpudl_heapify(cp, idx); + cpudl_heapify_down(cp, idx); } else { cp->elements[idx].dl = new_dl; - while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl, - cp->elements[idx].dl)) { - cpudl_exchange(cp, idx, parent(idx)); - idx = parent(idx); - } + cpudl_heapify_up(cp, idx); } } @@ -154,24 +159,19 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid) cp->size--; cp->elements[new_cpu].idx = old_idx; cp->elements[cpu].idx = IDX_INVALID; - while (old_idx > 0 && dl_time_before( -cp->elements[parent(old_idx)].dl, -cp->elements[old_idx].dl)) { - cpudl_exchange(cp, old_idx, parent(old_idx)); - old_idx = parent(old_idx); - } + cpudl_heapify_up(cp, old_idx); cpumask_set_cpu(cpu, cp->free_cpus); -cpudl_heapify(cp, old_idx); +cpudl_heapify_down(cp, old_idx); goto out; } if (old_idx == IDX_INVALID) { - cp->size++; - cp->elements[cp->size - 1].dl = 0; - cp->elements[cp->size - 1].cpu = cpu; - cp->elements[cpu].idx = cp->size - 1; - cpudl_change_key(cp, cp->size - 1, dl); + int size1 = cp->size++; + cp->elements[size1].dl = dl; + cp->elements[size1].cpu = cpu; + cp->elements[cpu].idx = size1; + cpudl_heapify_up(cp, size1); cpumask_clear_cpu(cpu, cp->free_cpus); } else { cpudl_change_key(cp, old_idx, dl); -- 2.7.4
Re: SCHED_DEADLINE cpudeadline.{h,c} fixup
Hi all, On Mon, 16 May 2016 18:00:04 +0200 Tommaso Cucinotta wrote: > Hi, > > looking at the SCHED_DEADLINE code, I spotted an opportunity to > make cpudeadline.c faster, in that we can skip real swaps during > re-heapify()ication of items after addition/removal. As such ops > are done under a domain spinlock, it sounded like an interesting > try. [...] I do not know the cpudeadline code too much, but I think every "dl = 0" looks like a bug... So, I think this hunk actually fixes a real bug: [...] - cp->elements[cp->size - 1].dl = 0; - cp->elements[cp->size - 1].cpu = cpu; - cp->elements[cpu].idx = cp->size - 1; - cpudl_change_key(cp, cp->size - 1, dl); - cpumask_clear_cpu(cpu, cp->free_cpus); + cpumask_set_cpu(cpu, cp->free_cpus); } else { - cpudl_change_key(cp, old_idx, dl); + if (old_idx == IDX_INVALID) { + int sz1 = cp->size++; + cp->elements[sz1].dl = dl; [...] Maybe the "cp->elements[cp->size - 1].dl = 0" -> "cp->elements[cp->size - 1].dl = 0" change can be split in a separate patch, which is a bugfix (and IMHO uncontroversial)? Thanks, Luca > > Indeed, I've got a speed-up of up to ~6% for the cpudl_set() calls > on a randomly generated workload of 1K,10K,100K random insertions > and deletions (75% cpudl_set() calls with is_valid=1 and 25% with > is_valid=0), and randomly generated cpu IDs with 2, 4, ..., 256 CPUs. > Details in the attached plot. > > The attached patch does this, along with a minimum of rework of > cpudeadline.c internals, and a final clean-up of the cpudeadline.h > interface (second patch). > > The measurements have been made on an Intel Core2 Duo with the CPU > frequency fixed at max, by letting cpudeadline.c be initialized with > various numbers of CPUs, then making many calls sequentially, taking > the rdtsc among calls, then dumping all numbers through printk(), > and I'm plotting the average of clock ticks between consecutive calls. > [ I can share the benchmarking code as well if needed ] > > Also, this fixes what seems to me a bug I noticed comparing the whole > heap contents as handledbut the modified code vs the original one, > insertion by insertion. The problem is in this code: > > cp->elements[cp->size - 1].dl = 0; > cp->elements[cp->size - 1].cpu = cpu; > cp->elements[cpu].idx = cp->size - 1; > mycpudl_change_key(cp, cp->size - 1, dl); > > when fed by an absolute deadline that is so large to have a negative > value as a s64. In such a case, as from dl_time_before(), the kernel > should handle correctly the abs deadline wrap-around, however the > current code in cpudeadline.c goes mad, and doesn't re-heapify > correctly the just inserted element... that said, if these are ns, > such a bug should be hit after a ~292 years of uptime :-D... > > I'd be happy to hear comments from others. I can provide additional > info / make additional experiments as needed. > > Please, reply-all to this e-mail, I'm not subscribed to linux-kernel@. > > Thanks, > > Tommaso
SCHED_DEADLINE cpudeadline.{h,c} fixup
Hi, looking at the SCHED_DEADLINE code, I spotted an opportunity to make cpudeadline.c faster, in that we can skip real swaps during re-heapify()ication of items after addition/removal. As such ops are done under a domain spinlock, it sounded like an interesting try. Indeed, I've got a speed-up of up to ~6% for the cpudl_set() calls on a randomly generated workload of 1K,10K,100K random insertions and deletions (75% cpudl_set() calls with is_valid=1 and 25% with is_valid=0), and randomly generated cpu IDs with 2, 4, ..., 256 CPUs. Details in the attached plot. The attached patch does this, along with a minimum of rework of cpudeadline.c internals, and a final clean-up of the cpudeadline.h interface (second patch). The measurements have been made on an Intel Core2 Duo with the CPU frequency fixed at max, by letting cpudeadline.c be initialized with various numbers of CPUs, then making many calls sequentially, taking the rdtsc among calls, then dumping all numbers through printk(), and I'm plotting the average of clock ticks between consecutive calls. [ I can share the benchmarking code as well if needed ] Also, this fixes what seems to me a bug I noticed comparing the whole heap contents as handledbut the modified code vs the original one, insertion by insertion. The problem is in this code: cp->elements[cp->size - 1].dl = 0; cp->elements[cp->size - 1].cpu = cpu; cp->elements[cpu].idx = cp->size - 1; mycpudl_change_key(cp, cp->size - 1, dl); when fed by an absolute deadline that is so large to have a negative value as a s64. In such a case, as from dl_time_before(), the kernel should handle correctly the abs deadline wrap-around, however the current code in cpudeadline.c goes mad, and doesn't re-heapify correctly the just inserted element... that said, if these are ns, such a bug should be hit after a ~292 years of uptime :-D... I'd be happy to hear comments from others. I can provide additional info / make additional experiments as needed. Please, reply-all to this e-mail, I'm not subscribed to linux-kernel@. Thanks, Tommaso -- Tommaso Cucinotta, Computer Engineering PhD Associate Professor at the Real-Time Systems Laboratory (ReTiS) Scuola Superiore Sant'Anna, Pisa, Italy http://retis.sssup.it/people/tommaso >From ee54c2849f1d9d7f7f8faeb474a61074cae868b9 Mon Sep 17 00:00:00 2001 From: Tommaso Cucinotta Date: Thu, 12 May 2016 19:06:37 +0200 Subject: [PATCH 1/2] Make deadline max-heap faster and fix deadline wrap-around bug. --- kernel/sched/cpudeadline.c | 122 - 1 file changed, 77 insertions(+), 45 deletions(-) diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c index 5a75b08..245d929 100644 --- a/kernel/sched/cpudeadline.c +++ b/kernel/sched/cpudeadline.c @@ -31,55 +31,91 @@ static inline int right_child(int i) return (i << 1) + 2; } -static void cpudl_exchange(struct cpudl *cp, int a, int b) -{ - int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu; - - swap(cp->elements[a].cpu, cp->elements[b].cpu); - swap(cp->elements[a].dl , cp->elements[b].dl ); - - swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx); -} - -static void cpudl_heapify(struct cpudl *cp, int idx) +static void cpudl_heapify_down(struct cpudl *cp, int idx) { int l, r, largest; + int orig_cpu = cp->elements[idx].cpu; + u64 orig_dl = cp->elements[idx].dl; + /* adapted from lib/prio_heap.c */ while(1) { + u64 largest_dl; l = left_child(idx); r = right_child(idx); largest = idx; + largest_dl = orig_dl; - if ((l < cp->size) && dl_time_before(cp->elements[idx].dl, - cp->elements[l].dl)) + if ((l < cp->size) && dl_time_before(orig_dl, cp->elements[l].dl)) { largest = l; - if ((r < cp->size) && dl_time_before(cp->elements[largest].dl, - cp->elements[r].dl)) + largest_dl = cp->elements[l].dl; + } + if ((r < cp->size) && dl_time_before(largest_dl, cp->elements[r].dl)) largest = r; + if (largest == idx) break; - /* Push idx down the heap one level and bump one up */ - cpudl_exchange(cp, largest, idx); + /* pull largest child onto idx */ + cp->elements[idx].cpu = cp->elements[largest].cpu; + cp->elements[idx].dl = cp->elements[largest].dl; + cp->elements[cp->elements[idx].cpu].idx = idx; idx = largest; } + /* actual push down of saved original values orig_* */ + cp->elements[idx].cpu = orig_cpu; + cp->elements[idx].dl = orig_dl; + cp->elements[cp->elements[idx].cpu].idx = idx; +} + +static void cpudl_heapify_up(struct cpudl *cp, int idx) +{ + int p; + + int orig_cpu = cp->elements[idx].cpu; + u64 orig_dl = cp->elements[idx].dl; + + while (idx != 0) { + p = parent(idx); + if (dl_time_before(cp->elements[idx].dl, cp->elements[p].dl)) + break; + /* pull parent onto idx */ + cp->elements[idx].cpu = cp->elements[p].cpu; + cp->elements[idx].dl = cp->elements[p].dl; + cp->elements[cp->elements[idx].cpu].idx = idx; +