Re: [patch] sched: align rq to cacheline boundary
On Tue, Apr 10, 2007 at 08:36:56AM +0200, Ingo Molnar wrote:
> the runqueue is really supposed to be cacheline-isolated at _both_ ends
> - at its beginning and at its end as well.

Then either we need to define the first element in the struct as
cacheline aligned, or move it into a section where all the elements are
cacheline aligned. My current patch will not align it at the end.

> > Remember also that the linesize on VSMP is 4k.
>
> that sucks ...
>
> maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED
> area that collects per-cpu fields that also have significant cross-CPU
> use and need cacheline isolation? Such cacheline-aligned variables, if
> collected separately, would pack up more tightly and would cause only
> half of the wasted space.

I will post a patch (once Jeremy's page-align percpu patch goes into -mm).

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [patch] sched: align rq to cacheline boundary
On Tue, Apr 10, 2007 at 08:24:22AM +0200, Ingo Molnar wrote:
> * Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
> > Align the per cpu runqueue to the cacheline boundary. This will
> > minimize the number of cachelines touched during remote wakeup.
> >
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> ouch!! Now how did _that_ slip through. The runqueues had been
> cacheline-aligned for ages. Or at least, they were supposed to be.

Perhaps the per_cpu definition gave the impression of being cacheline
aligned, too.

> could you see any improvement in profiles or workloads with this patch
> applied? (just curious - it's an obviously right fix)

We have seen a 0.5% perf improvement on a database workload on a 2-node
setup. 0.5% is a very good improvement for this workload.

thanks,
suresh
Re: [patch] sched: align rq to cacheline boundary
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to ((linesize - 4) * NR_CPUS) bytes,

On x86 it is just the real possible map now -- that tends to be much
smaller. There might be some other architectures that still allocate
per-cpu data for all of NR_CPUS (or always set the possible map to
that), but those should just be fixed.

> which is rather a lot.

We should have solved the problem of limited per-cpu space in .22, at
least with some patches by Jeremy. I also plan a few other changes that
will use more per-CPU memory again.

> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

When he avoids false sharing on remote wakeup it should be more cache
friendly.

> Need more convincing, please.

Was this based on some benchmark where it showed?

-Andi
Re: [patch] sched: align rq to cacheline boundary
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to ((linesize - 4) * NR_CPUS) bytes,
> which is rather a lot.

yes - but one (special) issue here is that there are other 'hot' but
truly per-CPU structures nearby:

 8067e800 D per_cpu__current_kprobe
 8067e820 D per_cpu__kprobe_ctlblk
 8067e960 D per_cpu__mmu_gathers
 8067f960 d per_cpu__runqueues
 80680c60 d per_cpu__cpu_domains
 80680df0 d per_cpu__sched_group_cpus

cpu_domains is being dirtied too (sd->nr_balance_failed,
sd->last_balance, etc.) and mmu_gathers too. So while both mmu_gathers
and cpu_domains are mostly purely per-CPU, runqueue fields can bounce
around a lot and drag those nearby fields with them (and then get
dragged back due to those nearby fields being used per-CPU again.)

the runqueue is really supposed to be cacheline-isolated at _both_ ends
- at its beginning and at its end as well.

> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

yes - although the per-cpu runqueue overhead is nearly 5K anyway.

> Remember also that the linesize on VSMP is 4k.

that sucks ...

maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED
area that collects per-cpu fields that also have significant cross-CPU
use and need cacheline isolation? Such cacheline-aligned variables, if
collected separately, would pack up more tightly and would cause only
half of the wasted space.

	Ingo
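[The PER_CPU_CACHE_ALIGNED area Ingo floats above was only an idea at
this point; as a purely hypothetical sketch - the macro name and section
name below are invented for illustration, modeled on DEFINE_PER_CPU -
it could have looked something like:]

```c
/* Hypothetical: a per-cpu definition macro that groups cacheline-
 * aligned per-cpu variables into their own sub-section, so the padding
 * of one such variable also provides the isolation the next one needs.
 * Names are invented; this is a sketch, not kernel code of the time. */
#define DEFINE_PER_CPU_CACHE_ALIGNED(type, name)			\
	__attribute__((__section__(".data.percpu.shared_aligned")))	\
	__typeof__(type) per_cpu__##name ____cacheline_aligned_in_smp
```

[Later kernels did in fact grow a macro along these lines,
DEFINE_PER_CPU_SHARED_ALIGNED.]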
Re: [patch] sched: align rq to cacheline boundary
* Siddha, Suresh B <[EMAIL PROTECTED]> wrote:

> Align the per cpu runqueue to the cacheline boundary. This will
> minimize the number of cachelines touched during remote wakeup.
>
> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;

ouch!! Now how did _that_ slip through. The runqueues had been
cacheline-aligned for ages. Or at least, they were supposed to be.

could you see any improvement in profiles or workloads with this patch
applied? (just curious - it's an obviously right fix)

> Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>

Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

	Ingo
Re: [patch] sched: align rq to cacheline boundary
On Mon, Apr 09, 2007 at 03:17:05PM -0700, Siddha, Suresh B wrote:
> On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> > On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > > On Mon, 9 Apr 2007 11:08:53 -0700
> > > "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
>
> Kiran, can you educate me when I am supposed to use
> ____cacheline_aligned_in_smp vs __cacheline_aligned_in_smp?

As far as my understanding goes, the four-underscore version is for
aligning members/elements within a data structure, and the
two-underscore version is for aligning statically defined variables.
The two-underscore version places the variable in a separate section
meant for cacheline aligned variables, so that there is no false
sharing on the cacheline with a consecutive datum.

For regular statically defined data structures, the latter has to be
used, but since your patch uses per-cpu data, which is already in a
separate section, you had to use the former, I guess.

> > As for the ((linesize - 4) * NR_CPUS) wastage, maybe we can place the
> > cacheline aligned per-cpu data in another section, just like we do
> > with the .data.cacheline_aligned section, but keep this new section
> > between __percpu_start and __percpu_end?
>
> Yes. But that will still waste some memory in the new section, if the
> data elements are not multiples of 4k.

Yes. But the wastage now depends on the data structure being aligned
rather than on the structure that happened to be there before. You
cannot avoid losing memory when padding, I guess :). But padding for
per-cpu data seems a bit odd and I am not sure if it is worth it for a
0.5% gain.

Thanks,
Kiran
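[For reference, the distinction Kiran describes maps onto roughly the
following definitions - paraphrased from include/linux/cache.h of that
era, not quoted verbatim; on !CONFIG_SMP both variants expand to
nothing:]

```c
/* In-place alignment only: pads the object itself out to an
 * SMP_CACHE_BYTES boundary, wherever it happens to live. */
#define ____cacheline_aligned_in_smp \
	__attribute__((__aligned__(SMP_CACHE_BYTES)))

/* Alignment plus placement: additionally moves the variable into the
 * dedicated .data.cacheline_aligned section, so it is not packed next
 * to unrelated (and possibly hot) data. */
#define __cacheline_aligned_in_smp \
	__attribute__((__aligned__(SMP_CACHE_BYTES), \
		       __section__(".data.cacheline_aligned")))
```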
Re: [patch] sched: align rq to cacheline boundary
On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > On Mon, 9 Apr 2007 11:08:53 -0700
> > "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
> > > -static DEFINE_PER_CPU(struct rq, runqueues);
> > > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
> >
> > Remember that this can consume up to ((linesize - 4) * NR_CPUS)
> > bytes, which is rather a lot.

At least on x86_64, this depends on cpu_possible_map and not NR_CPUS.

> > Remember also that the linesize on VSMP is 4k.
> >
> > And that putting a gap in the per-cpu memory like this will reduce
> > its overall cache-friendliness.
>
> The internode line size, yes. But Suresh is using
> ____cacheline_aligned_in_smp, which uses SMP_CACHE_BYTES
> (L1_CACHE_BYTES). So this does not align the per-cpu variable to 4k.
> However, if the motivation for this patch was a significant
> performance difference, then the above padding needs to be on the
> internode cacheline size, using ____cacheline_internodealigned_in_smp.

I see a 0.5% perf improvement on a database workload (which is a good
improvement for this workload). This patch minimizes the number of
cache lines touched during a remote task wakeup.

Kiran, can you educate me when I am supposed to use
____cacheline_aligned_in_smp vs __cacheline_aligned_in_smp?

> As for the ((linesize - 4) * NR_CPUS) wastage, maybe we can place the
> cacheline aligned per-cpu data in another section, just like we do
> with the .data.cacheline_aligned section, but keep this new section
> between __percpu_start and __percpu_end?

Yes. But that will still waste some memory in the new section, if the
data elements are not multiples of 4k.

thanks,
suresh
Re: [patch] sched: align rq to cacheline boundary
On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 11:08:53 -0700
> "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
>
> > Align the per cpu runqueue to the cacheline boundary. This will
> > minimize the number of cachelines touched during remote wakeup.
> >
> > Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> > ---
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index b9a6837..eca33c5 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -278,7 +278,7 @@ struct rq {
> >  	struct lock_class_key rq_lock_key;
> >  };
> >
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to ((linesize - 4) * NR_CPUS) bytes,
> which is rather a lot.
>
> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

The internode line size, yes. But Suresh is using
____cacheline_aligned_in_smp, which uses SMP_CACHE_BYTES
(L1_CACHE_BYTES). So this does not align the per-cpu variable to 4k.
However, if the motivation for this patch was a significant performance
difference, then the above padding needs to be on the internode
cacheline size, using ____cacheline_internodealigned_in_smp.

____cacheline_internodealigned_in_smp aligns a data structure to the
internode line size, which is 4k for vSMPowered systems and the L1 line
size for all other architectures.

As for the ((linesize - 4) * NR_CPUS) wastage, maybe we can place the
cacheline aligned per-cpu data in another section, just like we do with
the .data.cacheline_aligned section, but keep this new section between
__percpu_start and __percpu_end?
Re: [patch] sched: align rq to cacheline boundary
On Mon, 9 Apr 2007 11:08:53 -0700
"Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:

> Align the per cpu runqueue to the cacheline boundary. This will
> minimize the number of cachelines touched during remote wakeup.
>
> Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index b9a6837..eca33c5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -278,7 +278,7 @@ struct rq {
>  	struct lock_class_key rq_lock_key;
>  };
>
> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;

Remember that this can consume up to ((linesize - 4) * NR_CPUS) bytes,
which is rather a lot.

Remember also that the linesize on VSMP is 4k.

And that putting a gap in the per-cpu memory like this will reduce its
overall cache-friendliness.

Need more convincing, please.
[patch] sched: align rq to cacheline boundary
Align the per cpu runqueue to the cacheline boundary. This will minimize
the number of cachelines touched during remote wakeup.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index b9a6837..eca33c5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -278,7 +278,7 @@ struct rq {
 	struct lock_class_key rq_lock_key;
 };
 
-static DEFINE_PER_CPU(struct rq, runqueues);
+static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
 
 static inline int cpu_of(struct rq *rq)
 {