Re: SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-08-19 Thread Peter Zijlstra
On Sun, Aug 14, 2016 at 04:27:05PM +0200, Tommaso Cucinotta wrote:
> Hi,
> 
> this is a rework of the cpudeadline bugfix and speed-up patch-set, that
> integrates all comments received so far from Luca, Juri and Peter.
> 
> Compared with the previous post, here:
> -) I'm keeping out the minimally invasive bugfix, as it's already been
>    merged in tip/sched/core
> -) I moved a little code refactoring around change_key_dl() out of the
>    (now) 2nd patch into the 1st one. Now the 2nd (speed-up) patch just
>    changes the heapify_up/down() functions
> -) I rebased on top of commit f0b22e39
> -) I repeated an extensive set of tests through the framework published
>    separately at: https://github.com/tomcucinotta/cpudl-bench
>    running new no-behavior-change tests, new heap-consistency tests,
>    and new a/b benchmarks (I'm working on a new i5 laptop now); results at:
>      https://github.com/tomcucinotta/cpudl-bench/blob/master/cpudl-10.pdf
>    highlighting up to a 14% speed-up when averaging over 100K ops. See the
>    enclosed README in that repo for more info.

Thanks!


SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-08-14 Thread Tommaso Cucinotta
Hi,

this is a rework of the cpudeadline bugfix and speed-up patch-set, that
integrates all comments received so far from Luca, Juri and Peter.

Compared with the previous post, here:
-) I'm keeping out the minimally invasive bugfix, as it's already been
   merged in tip/sched/core
-) I moved a little code refactoring around change_key_dl() out of the
   (now) 2nd patch into the 1st one. Now the 2nd (speed-up) patch just
   changes the heapify_up/down() functions
-) I rebased on top of commit f0b22e39
-) I repeated an extensive set of tests through the framework published
   separately at: https://github.com/tomcucinotta/cpudl-bench
   running new no-behavior-change tests, new heap-consistency tests,
   and new a/b benchmarks (I'm working on a new i5 laptop now); results at:
     https://github.com/tomcucinotta/cpudl-bench/blob/master/cpudl-10.pdf
   highlighting up to a 14% speed-up when averaging over 100K ops. See the
   enclosed README in that repo for more info.

I'm leaving below the original description of all 4 patches.

--
The first patch is a minimally invasive (1-line) fix for the deadline
wrap-around bug. This leaves some weirdness in how cpudl_change_key() is
called. Therefore, the second patch does a minimum of refactoring to make
things more explicit and clear.

The 3rd patch now contains the actual performance enhancement (avoiding
unneeded swaps during heapify operations), which has been measured to
achieve up to a 14% speed-up for cpudl_set() calls.
This has been measured with a randomly generated workload of 1K, 10K and
100K random heap insertions and deletions (75% cpudl_set() calls with
is_valid=1 and 25% with is_valid=0), and randomly generated cpu IDs, with
up to 256 CPUs. Benchmarking code at: https://github.com/tomcucinotta/cpudl-bench

Finally, the 4th patch is another clean-up patch touching cpudeadline.{h,c}
and deadline.c. Now you call cpudl_clear(cp, cpu) and cpudl_set(cp, cpu, dl)
instead of cpudl_set(cp, cpu, 0 /* dl */, 0  /* is_valid */) and
cpudl_set(cp, cpu, dl, 1 /* is_valid */).

Any further comment is welcome, thanks!

  Tommaso
--


SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-07-19 Thread Tommaso Cucinotta
Hi,

this is a rework of the cpudeadline bugfix and speed-up patch-set, that
integrates all comments received so far from Luca and Juri.

The first patch is a minimally invasive (1-line) fix for the deadline
wrap-around bug. This leaves some weirdness in how cpudl_change_key() is
called. Therefore, the second patch does a minimum of refactoring to make
things more explicit and clear.
The 3rd patch now contains the actual performance enhancement (avoiding
unneeded swaps during heapify operations), which has been measured to
achieve up to a 10% speed-up for cpudl_set() calls.
This has been measured with a randomly generated workload of 1K, 10K and
100K random heap insertions and deletions (75% cpudl_set() calls with
is_valid=1 and 25% with is_valid=0), and randomly generated cpu IDs, with
up to 256 CPUs.

Benchmarking code is available at:
  https://github.com/tomcucinotta/cpudl-bench
Obtained speed-up plot:
  https://github.com/tomcucinotta/cpudl-bench/blob/master/cpudl.pdf

Finally, the 4th patch is another clean-up patch touching cpudeadline.{h,c}
and deadline.c. Now you call cpudl_clear(cp, cpu) and cpudl_set(cp, cpu, dl)
instead of cpudl_set(cp, cpu, 0 /* dl */, 0  /* is_valid */) and
cpudl_set(cp, cpu, dl, 1 /* is_valid */).

Please share your comments, thanks!

  Tommaso



SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-05-19 Thread Tommaso Cucinotta
Hi all,

I took Luca's advice to isolate the deadline wrap-around bugfix with a
first minimally invasive patch (1-line). This leaves some weirdness in
how cpudl_change_key() is called.
Therefore, the second patch does a minimum of refactoring to make things
more explicit and clear.
The 3rd patch now contains the actual performance enhancement (avoiding
unneeded swaps during heapify operations), which, as said in the previous
post, achieves up to a 6% speed-up for cpudl_set() calls.
Finally, the 4th patch is another clean-up patch touching cpudeadline.{h,c}
and deadline.c. Now you call cpudl_clear(cp, cpu) and cpudl_set(cp, cpu, dl)
instead of cpudl_set(cp, cpu, 0 /* dl */, 0  /* is_valid */) and
cpudl_set(cp, cpu, dl, 1 /* is_valid */).

Please let me know how this looks now.

Thanks,

  Tommaso



Re: SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-05-18 Thread Juri Lelli
Hi Tommaso,

On 18/05/16 00:43, Tommaso Cucinotta wrote:
> On 17/05/2016 13:46, luca abeni wrote:
> >Maybe the ... change can be split in a separate
> >patch, which is a bugfix (and IMHO uncontroversial)?
> 
> Ok, the bugfix alone might look like the attached. Couldn't avoid
> the little refactoring of the multiple occurrences of the same loop
> up the heap into the heapify_up(), mirroring the heapify() that was
> already there (renamed heapify_down() for clarity).
> 
> I'll rebase the speed-up patch on top of this, if it's a better approach.
> 
> Anyone with further comments?
> 

Couldn't spend any time on this yet, apologies. But, for the next
posting, could you please do it without attaching the patches? I usually
use git send-email for posting. It would make the review easier, I think.

Best,

- Juri

> Thanks again!
> 
>   T.
> -- 
> Tommaso Cucinotta, Computer Engineering PhD
> Associate Professor at the Real-Time Systems Laboratory (ReTiS)
> Scuola Superiore Sant'Anna, Pisa, Italy
> http://retis.sssup.it/people/tommaso

> From cfaa75eb77843f7da875a54c7e6631b271bf0663 Mon Sep 17 00:00:00 2001
> From: Tommaso Cucinotta 
> Date: Tue, 17 May 2016 15:54:11 +0200
> Subject: [PATCH] Deadline wrap-around bugfix for the SCHED_DEADLINE cpu heap.
> 
> ---
>  kernel/sched/cpudeadline.c | 38 +++---
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 5be5882..3c42702 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -41,7 +41,7 @@ static void cpudl_exchange(struct cpudl *cp, int a, int b)
>   swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
>  }
>  
> -static void cpudl_heapify(struct cpudl *cp, int idx)
> +static void cpudl_heapify_down(struct cpudl *cp, int idx)
>  {
>   int l, r, largest;
>  
> @@ -66,20 +66,25 @@ static void cpudl_heapify(struct cpudl *cp, int idx)
>   }
>  }
>  
> +static void cpudl_heapify_up(struct cpudl *cp, int idx)
> +{
> + while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
> + cp->elements[idx].dl)) {
> + cpudl_exchange(cp, idx, parent(idx));
> + idx = parent(idx);
> + }
> +}
> +
>  static void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl)
>  {
>   WARN_ON(idx == IDX_INVALID || !cpu_present(idx));
>  
>   if (dl_time_before(new_dl, cp->elements[idx].dl)) {
>   cp->elements[idx].dl = new_dl;
> - cpudl_heapify(cp, idx);
> + cpudl_heapify_down(cp, idx);
>   } else {
>   cp->elements[idx].dl = new_dl;
> - while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
> - cp->elements[idx].dl)) {
> - cpudl_exchange(cp, idx, parent(idx));
> - idx = parent(idx);
> - }
> + cpudl_heapify_up(cp, idx);
>   }
>  }
>  
> @@ -154,24 +159,19 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int 
> is_valid)
>   cp->size--;
>   cp->elements[new_cpu].idx = old_idx;
>   cp->elements[cpu].idx = IDX_INVALID;
> - while (old_idx > 0 && dl_time_before(
> - cp->elements[parent(old_idx)].dl,
> - cp->elements[old_idx].dl)) {
> - cpudl_exchange(cp, old_idx, parent(old_idx));
> - old_idx = parent(old_idx);
> - }
> + cpudl_heapify_up(cp, old_idx);
>   cpumask_set_cpu(cpu, cp->free_cpus);
> -cpudl_heapify(cp, old_idx);
> +cpudl_heapify_down(cp, old_idx);
>  
>   goto out;
>   }
>  
>   if (old_idx == IDX_INVALID) {
> - cp->size++;
> - cp->elements[cp->size - 1].dl = 0;
> - cp->elements[cp->size - 1].cpu = cpu;
> - cp->elements[cpu].idx = cp->size - 1;
> - cpudl_change_key(cp, cp->size - 1, dl);
> + int size1 = cp->size++;
> + cp->elements[size1].dl = dl;
> + cp->elements[size1].cpu = cpu;
> + cp->elements[cpu].idx = size1;
> + cpudl_heapify_up(cp, size1);
>   cpumask_clear_cpu(cpu, cp->free_cpus);
>   } else {
>   cpudl_change_key(cp, old_idx, dl);
> -- 
> 2.7.4
> 



Re: SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-05-17 Thread Tommaso Cucinotta

On 17/05/2016 13:46, luca abeni wrote:

Maybe the ... change can be split in a separate
patch, which is a bugfix (and IMHO uncontroversial)?


Ok, the bugfix alone might look like the attached. Couldn't avoid
the little refactoring of the multiple occurrences of the same loop
up the heap into the heapify_up(), mirroring the heapify() that was
already there (renamed heapify_down() for clarity).

I'll rebase the speed-up patch on top of this, if it's a better approach.

Anyone with further comments?

Thanks again!

T.
--
Tommaso Cucinotta, Computer Engineering PhD
Associate Professor at the Real-Time Systems Laboratory (ReTiS)
Scuola Superiore Sant'Anna, Pisa, Italy
http://retis.sssup.it/people/tommaso
From cfaa75eb77843f7da875a54c7e6631b271bf0663 Mon Sep 17 00:00:00 2001
From: Tommaso Cucinotta 
Date: Tue, 17 May 2016 15:54:11 +0200
Subject: [PATCH] Deadline wrap-around bugfix for the SCHED_DEADLINE cpu heap.

---
 kernel/sched/cpudeadline.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 5be5882..3c42702 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -41,7 +41,7 @@ static void cpudl_exchange(struct cpudl *cp, int a, int b)
 	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
 }
 
-static void cpudl_heapify(struct cpudl *cp, int idx)
+static void cpudl_heapify_down(struct cpudl *cp, int idx)
 {
 	int l, r, largest;
 
@@ -66,20 +66,25 @@ static void cpudl_heapify(struct cpudl *cp, int idx)
 	}
 }
 
+static void cpudl_heapify_up(struct cpudl *cp, int idx)
+{
+	while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
+			cp->elements[idx].dl)) {
+		cpudl_exchange(cp, idx, parent(idx));
+		idx = parent(idx);
+	}
+}
+
 static void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl)
 {
 	WARN_ON(idx == IDX_INVALID || !cpu_present(idx));
 
 	if (dl_time_before(new_dl, cp->elements[idx].dl)) {
 		cp->elements[idx].dl = new_dl;
-		cpudl_heapify(cp, idx);
+		cpudl_heapify_down(cp, idx);
 	} else {
 		cp->elements[idx].dl = new_dl;
-		while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
-			cp->elements[idx].dl)) {
-			cpudl_exchange(cp, idx, parent(idx));
-			idx = parent(idx);
-		}
+		cpudl_heapify_up(cp, idx);
 	}
 }
 
@@ -154,24 +159,19 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 		cp->size--;
 		cp->elements[new_cpu].idx = old_idx;
 		cp->elements[cpu].idx = IDX_INVALID;
-		while (old_idx > 0 && dl_time_before(
-				cp->elements[parent(old_idx)].dl,
-				cp->elements[old_idx].dl)) {
-			cpudl_exchange(cp, old_idx, parent(old_idx));
-			old_idx = parent(old_idx);
-		}
+		cpudl_heapify_up(cp, old_idx);
 		cpumask_set_cpu(cpu, cp->free_cpus);
-		cpudl_heapify(cp, old_idx);
+		cpudl_heapify_down(cp, old_idx);
 
 		goto out;
 	}
 
 	if (old_idx == IDX_INVALID) {
-		cp->size++;
-		cp->elements[cp->size - 1].dl = 0;
-		cp->elements[cp->size - 1].cpu = cpu;
-		cp->elements[cpu].idx = cp->size - 1;
-		cpudl_change_key(cp, cp->size - 1, dl);
+		int size1 = cp->size++;
+		cp->elements[size1].dl = dl;
+		cp->elements[size1].cpu = cpu;
+		cp->elements[cpu].idx = size1;
+		cpudl_heapify_up(cp, size1);
 		cpumask_clear_cpu(cpu, cp->free_cpus);
 	} else {
 		cpudl_change_key(cp, old_idx, dl);
-- 
2.7.4



Re: SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-05-17 Thread luca abeni
Hi all,

On Mon, 16 May 2016 18:00:04 +0200
Tommaso Cucinotta  wrote:

> Hi,
> 
> looking at the SCHED_DEADLINE code, I spotted an opportunity to
> make cpudeadline.c faster, in that we can skip real swaps during
> re-heapify()ication of items after addition/removal. As such ops
> are done under a domain spinlock, it sounded like an interesting
> try.
[...]

I do not know the cpudeadline code too much, but I think every "dl = 0"
looks like a bug... So, I think this hunk actually fixes a real bug:
[...]
-   cp->elements[cp->size - 1].dl = 0;
-   cp->elements[cp->size - 1].cpu = cpu;
-   cp->elements[cpu].idx = cp->size - 1;
-   cpudl_change_key(cp, cp->size - 1, dl);
-   cpumask_clear_cpu(cpu, cp->free_cpus);
+   cpumask_set_cpu(cpu, cp->free_cpus);
} else {
-   cpudl_change_key(cp, old_idx, dl);
+   if (old_idx == IDX_INVALID) {
+   int sz1 = cp->size++;
+   cp->elements[sz1].dl = dl;
[...]

Maybe the "cp->elements[cp->size - 1].dl = 0"  ->
"cp->elements[cp->size - 1].dl = dl" change can be split in a separate
patch, which is a bugfix (and IMHO uncontroversial)?


Thanks,
Luca

> 
> Indeed, I've got a speed-up of up to ~6% for the cpudl_set() calls
> on a randomly generated workload of 1K,10K,100K random insertions
> and deletions (75% cpudl_set() calls with is_valid=1 and 25% with
> is_valid=0), and randomly generated cpu IDs with 2, 4, ..., 256 CPUs.
> Details in the attached plot.
> 
> The attached patch does this, along with a minimum of rework of
> cpudeadline.c internals, and a final clean-up of the cpudeadline.h
> interface (second patch).
> 
> The measurements have been made on an Intel Core2 Duo with the CPU
> frequency fixed at max, by letting cpudeadline.c be initialized with
> various numbers of CPUs, then making many calls sequentially, reading
> the rdtsc between calls, then dumping all numbers through printk(),
> and I'm plotting the average of clock ticks between consecutive calls.
> [ I can share the benchmarking code as well if needed ]
> 
> Also, this fixes what seems to me a bug I noticed comparing the whole
> heap contents as handled by the modified code vs the original one,
> insertion by insertion. The problem is in this code:
> 
>   cp->elements[cp->size - 1].dl = 0;
>   cp->elements[cp->size - 1].cpu = cpu;
>   cp->elements[cpu].idx = cp->size - 1;
>   mycpudl_change_key(cp, cp->size - 1, dl);
> 
> when fed by an absolute deadline that is so large as to have a negative
> value as an s64. In such a case, as per dl_time_before(), the kernel
> should handle the abs deadline wrap-around correctly, however the
> current code in cpudeadline.c goes mad, and doesn't re-heapify
> correctly the just-inserted element... that said, if these are ns,
> such a bug should be hit after ~292 years of uptime :-D...
> 
> I'd be happy to hear comments from others. I can provide additional
> info / make additional experiments as needed.
> 
> Please reply-all to this e-mail; I'm not subscribed to linux-kernel@.
> 
> Thanks,
> 
>   Tommaso



SCHED_DEADLINE cpudeadline.{h,c} fixup

2016-05-16 Thread Tommaso Cucinotta

Hi,

looking at the SCHED_DEADLINE code, I spotted an opportunity to
make cpudeadline.c faster, in that we can skip real swaps during
re-heapify()ication of items after addition/removal. As such ops
are done under a domain spinlock, it sounded like an interesting
try.

Indeed, I've got a speed-up of up to ~6% for the cpudl_set() calls
on a randomly generated workload of 1K,10K,100K random insertions
and deletions (75% cpudl_set() calls with is_valid=1 and 25% with
is_valid=0), and randomly generated cpu IDs with 2, 4, ..., 256 CPUs.
Details in the attached plot.

The attached patch does this, along with a minimum of rework of
cpudeadline.c internals, and a final clean-up of the cpudeadline.h
interface (second patch).

The measurements have been made on an Intel Core2 Duo with the CPU
frequency fixed at max, by letting cpudeadline.c be initialized with
various numbers of CPUs, then making many calls sequentially, reading
the rdtsc between calls, then dumping all numbers through printk(),
and I'm plotting the average of clock ticks between consecutive calls.
[ I can share the benchmarking code as well if needed ]

Also, this fixes what seems to me a bug I noticed comparing the whole
heap contents as handled by the modified code vs the original one,
insertion by insertion. The problem is in this code:

cp->elements[cp->size - 1].dl = 0;
cp->elements[cp->size - 1].cpu = cpu;
cp->elements[cpu].idx = cp->size - 1;
mycpudl_change_key(cp, cp->size - 1, dl);

when fed by an absolute deadline that is so large as to have a negative
value as an s64. In such a case, as per dl_time_before(), the kernel
should handle the abs deadline wrap-around correctly, however the
current code in cpudeadline.c goes mad, and doesn't re-heapify
correctly the just-inserted element... that said, if these are ns,
such a bug should be hit after ~292 years of uptime :-D...

I'd be happy to hear comments from others. I can provide additional
info / make additional experiments as needed.

Please reply-all to this e-mail; I'm not subscribed to linux-kernel@.

Thanks,

Tommaso
--
Tommaso Cucinotta, Computer Engineering PhD
Associate Professor at the Real-Time Systems Laboratory (ReTiS)
Scuola Superiore Sant'Anna, Pisa, Italy
http://retis.sssup.it/people/tommaso
From ee54c2849f1d9d7f7f8faeb474a61074cae868b9 Mon Sep 17 00:00:00 2001
From: Tommaso Cucinotta 
Date: Thu, 12 May 2016 19:06:37 +0200
Subject: [PATCH 1/2] Make deadline max-heap faster and fix deadline
 wrap-around bug.

---
 kernel/sched/cpudeadline.c | 122 -
 1 file changed, 77 insertions(+), 45 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 5a75b08..245d929 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -31,55 +31,91 @@ static inline int right_child(int i)
 	return (i << 1) + 2;
 }
 
-static void cpudl_exchange(struct cpudl *cp, int a, int b)
-{
-	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
-
-	swap(cp->elements[a].cpu, cp->elements[b].cpu);
-	swap(cp->elements[a].dl , cp->elements[b].dl );
-
-	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
-}
-
-static void cpudl_heapify(struct cpudl *cp, int idx)
+static void cpudl_heapify_down(struct cpudl *cp, int idx)
 {
 	int l, r, largest;
 
+	int orig_cpu = cp->elements[idx].cpu;
+	u64 orig_dl = cp->elements[idx].dl;
+
 	/* adapted from lib/prio_heap.c */
 	while(1) {
+		u64 largest_dl;
 		l = left_child(idx);
 		r = right_child(idx);
 		largest = idx;
+		largest_dl = orig_dl;
 
-		if ((l < cp->size) && dl_time_before(cp->elements[idx].dl,
-			cp->elements[l].dl))
+		if ((l < cp->size) && dl_time_before(orig_dl, cp->elements[l].dl)) {
 			largest = l;
-		if ((r < cp->size) && dl_time_before(cp->elements[largest].dl,
-			cp->elements[r].dl))
+			largest_dl = cp->elements[l].dl;
+		}
+		if ((r < cp->size) && dl_time_before(largest_dl, cp->elements[r].dl))
 			largest = r;
+
 		if (largest == idx)
 			break;
 
-		/* Push idx down the heap one level and bump one up */
-		cpudl_exchange(cp, largest, idx);
+		/* pull largest child onto idx */
+		cp->elements[idx].cpu = cp->elements[largest].cpu;
+		cp->elements[idx].dl = cp->elements[largest].dl;
+		cp->elements[cp->elements[idx].cpu].idx = idx;
 		idx = largest;
 	}
+	/* actual push down of saved original values orig_* */
+	cp->elements[idx].cpu = orig_cpu;
+	cp->elements[idx].dl = orig_dl;
+	cp->elements[cp->elements[idx].cpu].idx = idx;
+}
+
+static void cpudl_heapify_up(struct cpudl *cp, int idx)
+{
+	int p;
+
+	int orig_cpu = cp->elements[idx].cpu;
+	u64 orig_dl = cp->elements[idx].dl;
+
+	while (idx != 0) {
+		p = parent(idx);
+		if (dl_time_before(cp->elements[idx].dl, cp->elements[p].dl))
+			break;
+		/* pull parent onto idx */
+		cp->elements[idx].cpu = cp->elements[p].cpu;
+		cp->elements[idx].dl = cp->elements[p].dl;
+		cp->elements[cp->elements[idx].cpu].idx = idx;
+