Re: [patch] sched: align rq to cacheline boundary

2007-04-10 Thread Siddha, Suresh B
On Tue, Apr 10, 2007 at 08:36:56AM +0200, Ingo Molnar wrote:
> the runqueue is really supposed to be cacheline-isolated at _both_ ends 
> - at its beginning and at its end as well.

Then either we need to define the first element in the struct as
cacheline aligned, or move it into a section where all the elements are
cacheline aligned. My current patch will not align it at the end.

> > Remember also that the linesize on VSMP is 4k.
> 
> that sucks ...
> 
> maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED 
> area that collects per-cpu fields that also have significant cross-CPU 
> use and need cacheline isolation? Such cacheline-aligned variables, if 
> collected separately, would pack up more tightly and would cause only 
> half of the wasted space.

I will post a patch (once Jeremy's page-align percpu patch goes into -mm).

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: align rq to cacheline boundary

2007-04-10 Thread Siddha, Suresh B
On Tue, Apr 10, 2007 at 08:24:22AM +0200, Ingo Molnar wrote:
> 
> * Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
> 
> > Align the per cpu runqueue to the cacheline boundary. This will 
> > minimize the number of cachelines touched during remote wakeup.
> 
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
> 
> ouch!! Now how did _that_ slip through. The runqueues had been 
> cacheline-aligned for ages. Or at least, they were supposed to be.

perhaps the per_cpu definition gave the impression of being cacheline aligned, too.

> 
> could you see any improvement in profiles or workloads with this patch 
> applied? (just curious - it's an obviously right fix)

We have seen a 0.5% performance improvement on a database workload on a
two-node setup. 0.5% is a very good improvement for this workload.

thanks,
suresh


Re: [patch] sched: align rq to cacheline boundary

2007-04-10 Thread Andi Kleen

> >  
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
> 
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, 

On x86 it now depends on just the real possible map -- that tends to be
much smaller.

There might be some other architectures that still allocate per-cpu data
for all of NR_CPUS (or always set the possible map to that), but those
should just be fixed.

> which is 
> rather a lot.

We should have solved the problem of limited per-cpu space in .22 at least,
with some patches by Jeremy. I also plan a few other changes that will
use more per-CPU memory again.

> Remember also that the linesize on VSMP is 4k.
> 
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

If he avoids false sharing on remote wakeup, it should be more cache-friendly.

> Need more convincing, please.

Was this based on some benchmark where it showed?

-Andi


Re: [patch] sched: align rq to cacheline boundary

2007-04-09 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
> 
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, 
> which is rather a lot.

yes - but one (special) issue here is that there are other 'hot' but 
truly per-CPU structures nearby:

 8067e800 D per_cpu__current_kprobe
 8067e820 D per_cpu__kprobe_ctlblk
 8067e960 D per_cpu__mmu_gathers
 8067f960 d per_cpu__runqueues
 80680c60 d per_cpu__cpu_domains
 80680df0 d per_cpu__sched_group_cpus

cpu_domains is being dirtied too (sd->nr_balance_failed, 
sd->last_balance, etc.) and mmu_gathers too. So while both mmu_gathers 
and cpu_domains are mostly purely per-CPU, runqueue fields can bounce 
around a lot and drag those nearby fields with them (and then get dragged 
back due to those nearby fields being used per-CPU again.)

the runqueue is really supposed to be cacheline-isolated at _both_ ends 
- at its beginning and at its end as well.

> And that putting a gap in the per-cpu memory like this will reduce its 
> overall cache-friendliness.

yes - although the per-cpu runqueue overhead is nearly 5K anyway.

> Remember also that the linesize on VSMP is 4k.

that sucks ...

maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED 
area that collects per-cpu fields that also have significant cross-CPU 
use and need cacheline isolation? Such cacheline-aligned variables, if 
collected separately, would pack up more tightly and would cause only 
half of the wasted space.

Ingo


Re: [patch] sched: align rq to cacheline boundary

2007-04-09 Thread Ingo Molnar

* Siddha, Suresh B <[EMAIL PROTECTED]> wrote:

> Align the per cpu runqueue to the cacheline boundary. This will 
> minimize the number of cachelines touched during remote wakeup.

> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;

ouch!! Now how did _that_ slip through. The runqueues had been 
cacheline-aligned for ages. Or at least, they were supposed to be.

could you see any improvement in profiles or workloads with this patch 
applied? (just curious - it's an obviously right fix)

> Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>

Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

Ingo


Re: [patch] sched: align rq to cacheline boundary

2007-04-09 Thread Ravikiran G Thirumalai
On Mon, Apr 09, 2007 at 03:17:05PM -0700, Siddha, Suresh B wrote:
> On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> > On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > > On Mon, 9 Apr 2007 11:08:53 -0700
> > > "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
> 
> Kiran, can you educate me on when I am supposed to use
> ____cacheline_aligned_in_smp vs __cacheline_aligned_in_smp?

As far as my understanding goes, the four-underscore version is for
aligning members/elements within a data structure, and the two-underscore
version is for aligning statically defined variables.
The two-underscore version places the variable in a separate section meant
for cacheline-aligned variables, so that there is no false sharing on the
cacheline with an adjacent datum.  For regular statically defined data
structures the latter has to be used, but since your patch uses per-cpu data,
which is already in a separate section, you had to use the former, I guess.


> 
> > As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline 
> > aligned per-cpu data in another section, just like we do with 
> > .data.cacheline_aligned section, but keep this new section between
> > __percpu_start and __percpu_end?
> 
> Yes. But that will still waste some memory in the new section, if the data
> elements are not multiples of 4k.

Yes.  But the wastage now depends on the data structure being aligned rather
than the structure that happened to be there before.  You cannot avoid losing
some memory while padding, I guess :).  But padding for per-cpu data seems a bit 
odd, and I am not sure it is worth it for a 0.5% gain.

Thanks,
Kiran


Re: [patch] sched: align rq to cacheline boundary

2007-04-09 Thread Siddha, Suresh B
On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > On Mon, 9 Apr 2007 11:08:53 -0700
> > "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
> > > -static DEFINE_PER_CPU(struct rq, runqueues);
> > > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
> > 
> > Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
> > rather a lot.

At least on x86_64, this depends on the cpu_possible_map and not NR_CPUS.

> > 
> > Remember also that the linesize on VSMP is 4k.
> > 
> > And that putting a gap in the per-cpu memory like this will reduce its
> > overall cache-friendliness.
> > 
> 
> The internode line size, yes.  But Suresh is using 
> ____cacheline_aligned_in_smp,
> which uses SMP_CACHE_BYTES (L1_CACHE_BYTES).  So this does not align the 
> per-cpu variable to 4k.  However, if the motivation for this patch was a 
> significant performance difference, then the above padding needs to be on 
> the internode cacheline size, using ____cacheline_internodealigned_in_smp.

I see a 0.5% perf improvement on a database workload (which is a good
improvement for this workload). This patch minimizes the number of cache
lines touched during a remote task wakeup.

Kiran, can you educate me when I am supposed to use
cacheline_aligned_in_smp
Vs
__cacheline_aligned_in_smp ?

> As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline 
> aligned per-cpu data in another section, just like we do with 
> .data.cacheline_aligned section, but keep this new section between
> __percpu_start and __percpu_end?

Yes. But that will still waste some memory in the new section, if the data
elements are not multiples of 4k.

thanks,
suresh


Re: [patch] sched: align rq to cacheline boundary

2007-04-09 Thread Ravikiran G Thirumalai
On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 11:08:53 -0700
> "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
> 
> > Align the per cpu runqueue to the cacheline boundary. This will minimize the
> > number of cachelines touched during remote wakeup.
> > 
> > Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> > ---
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index b9a6837..eca33c5 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -278,7 +278,7 @@ struct rq {
> > struct lock_class_key rq_lock_key;
> >  };
> >  
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
> 
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
> rather a lot.
> 
> Remember also that the linesize on VSMP is 4k.
> 
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.
> 

The internode line size, yes.  But Suresh is using ____cacheline_aligned_in_smp,
which uses SMP_CACHE_BYTES (L1_CACHE_BYTES).  So this does not align the 
per-cpu variable to 4k.  However, if the motivation for this patch was a 
significant performance difference, then the above padding needs to be on 
the internode cacheline size, using ____cacheline_internodealigned_in_smp.
____cacheline_internodealigned_in_smp aligns a data structure to the
internode line size, which is 4k for vSMPowered systems and the L1 line size 
for all other architectures.

As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline 
aligned per-cpu data in another section, just like we do with 
.data.cacheline_aligned section, but keep this new section between
__percpu_start and __percpu_end?



Re: [patch] sched: align rq to cacheline boundary

2007-04-09 Thread Andrew Morton
On Mon, 9 Apr 2007 11:08:53 -0700
"Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:

> Align the per cpu runqueue to the cacheline boundary. This will minimize the
> number of cachelines touched during remote wakeup.
> 
> Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> ---
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index b9a6837..eca33c5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -278,7 +278,7 @@ struct rq {
>   struct lock_class_key rq_lock_key;
>  };
>  
> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;

Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
rather a lot.

Remember also that the linesize on VSMP is 4k.

And that putting a gap in the per-cpu memory like this will reduce its
overall cache-friendliness.

Need more convincing, please.


[patch] sched: align rq to cacheline boundary

2007-04-09 Thread Siddha, Suresh B
Align the per cpu runqueue to the cacheline boundary. This will minimize the
number of cachelines touched during remote wakeup.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index b9a6837..eca33c5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -278,7 +278,7 @@ struct rq {
struct lock_class_key rq_lock_key;
 };
 
-static DEFINE_PER_CPU(struct rq, runqueues);
+static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
 
 static inline int cpu_of(struct rq *rq)
 {