Re: [patch] Change softlockup trigger limit using a kernel parameter
On Wed, Jul 18, 2007 at 11:09:07PM -0700, Andrew Morton wrote:
>On Wed, 18 Jul 2007 22:41:21 -0700 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>>
>> On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
>> > On Mon, 16 Jul 2007 15:26:50 -0700
>> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>> >
>> > > Kernel warns of softlockups if the softlockup thread is not able to run
>> > > on a CPU for 10s. It is useful to lower the softlockup warning
>> > > threshold in testing environments to catch potential lockups early.
>> > > The following patch adds a kernel parameter 'softlockup_lim' to control
>> > > the softlockup threshold.
>> >
>> > Why not make it tunable at runtime?
>>
>> Sure! Like a sysctl?

Hi Andrew,

Here's another take, incorporating all of your comments.

Thanks,
Kiran

Control the trigger limit for softlockup warnings. This is useful for
debugging softlockups: lowering softlockup_thresh identifies possible
softlockups earlier.

This patch:
1. Adds a sysctl softlockup_thresh with valid values of 1-60s
   (a higher value can be used to avoid false positives)
2. Changes the softlockup printk to print the cpu softlockup time

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
---
Patch applies on top of Ingo's softlockup patches.

 Documentation/sysctl/kernel.txt |    8 ++++++++
 include/linux/sched.h           |    1 +
 kernel/softlockup.c             |    8 +++---
 kernel/sysctl.c                 |   25 +++++++++++++++-----
 4 files changed, 33 insertions(+), 9 deletions(-)

Index: linux-2.6.22/kernel/softlockup.c
===================================================================
--- linux-2.6.22.orig/kernel/softlockup.c	2007-07-18 11:15:18.506614500 -0700
+++ linux-2.6.22/kernel/softlockup.c	2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int softlock_panic(struct notifier_block *this, unsigned long event,
 			  void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
 		wake_up_process(per_cpu(watchdog_task, this_cpu));
 
 	/* Warn about unreasonable 10+ seconds delays: */
-	if (now <= (touch_timestamp + 10))
+	if (now <= (touch_timestamp + softlockup_thresh))
 		return;
 
 	regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
 	per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
 	spin_lock(&print_lock);
-	printk(KERN_ERR "BUG: soft lockup detected on CPU#%d! [%s:%d]\n",
-		this_cpu, current->comm, current->pid);
+	printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
+		this_cpu, now - touch_timestamp,
+		current->comm, current->pid);
 	if (regs)
 		show_regs(regs);
 	else

Index: linux-2.6.22/kernel/sysctl.c
===================================================================
--- linux-2.6.22.orig/kernel/sysctl.c	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c	2007-07-19 13:42:50.275478000 -0700
@@ -79,6 +79,12 @@ extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
 
+/* Constants used for minimum and maximum */
+static int one = 1;
+static int zero;
+static int sixty = 60;
+static int one_hundred = 100;
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -615,16 +621,23 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "softlockup_thresh",
+		.data		= &softlockup_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+		.extra2		= &sixty,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
-/* Constants for minimum and maximum testing in vm_table.
-   We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
 static ctl_table vm_table[] = {
 	{
 		.ctl_name	= VM_OVERCOMMIT_MEMORY,

Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.22.orig/Documentation/sysctl/kernel.txt	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/Documentation/sysctl/kernel.txt	2007-07-19 13:42:09.748945250 -0700
@@ -320,6 +320,14 @@ kernel. This value defaults to SHMMAX.
 ==
+softlockup_thresh:
+
+This value can be used to lower the softlockup
Re: [patch] Change softlockup trigger limit using a kernel parameter
On Thu, Jul 19, 2007 at 11:51:14AM -0700, Jeremy Fitzhardinge wrote:
>Ingo Molnar wrote:
>> just in case someone sees false positives and wants to turn it off.
>
>Why not make 0=off?

A patch to disable softlockup during boot already went in:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=97842216b8400fe9d1a20468959e2989180f8f79

It uses a kernel boot parameter to disable softlockup, not exactly
disabling softlockup at run time though. Do we still need 0=off?

Thanks,
Kiran
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] Change softlockup trigger limit using a kernel parameter
On Thu, Jul 19, 2007 at 11:11:42AM +0200, Ingo Molnar wrote:
>
>* Andrew Morton <[EMAIL PROTECTED]> wrote:
>
>> > +softlockup_thresh:
>> > +
>> > +This value can be used to lower the softlockup tolerance
>> > +threshold. The default threshold is 10s. If a cpu is locked up
>> > +for 10s, the kernel complains. Valid values are 1-10s.
>> > +
>>
>> neato.
>
>please make sure this is applied after the softlockup watchdog patches i
>already did. (in particular the cleanup patch, which this could interact
>with)
>
>also, i think the valid threshold should be between 1 and 60 seconds.

60 seconds! Is that not a pretty high threshold? The reason for lowering
the tolerance threshold from 10s is to catch bugs early in lab
environments, but why do we need to raise the tolerance thresh beyond
10s?

Thanks,
Kiran
Re: [patch] Change softlockup trigger limit using a kernel parameter
On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
> On Mon, 16 Jul 2007 15:26:50 -0700
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > Kernel warns of softlockups if the softlockup thread is not able to run
> > on a CPU for 10s. It is useful to lower the softlockup warning
> > threshold in testing environments to catch potential lockups early.
> > The following patch adds a kernel parameter 'softlockup_lim' to control
> > the softlockup threshold.
>
> Why not make it tunable at runtime?

Sure! Like a sysctl? Here's a patch that does that (on top of Ingo's
softlockup-improve-debug-output.patch).

> > Control the trigger limit for softlockup warnings. This is useful for
> > debugging softlockups, by lowering the softlockup_lim to identify
> > possible softlockups earlier.
>
> Please check your patches with scripts/checkpatch.pl.

Yep, will do. (checkpatch emitted one warning for the patch below, but
that was because of a 'stylo' that already exists in
include/linux/sysctl.h -- which probably needs a style change patch by
itself.)

---
Control the trigger limit for softlockup warnings. This is useful for
debugging softlockups, by lowering the softlockup_thresh sysctl, to
identify possible softlockups earlier. The patch also changes the
softlockup printk to print the cpu softlockup time.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.22/kernel/softlockup.c
===================================================================
--- linux-2.6.22.orig/kernel/softlockup.c	2007-07-18 11:15:18.506614500 -0700
+++ linux-2.6.22/kernel/softlockup.c	2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int softlock_panic(struct notifier_block *this, unsigned long event,
 			  void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
 		wake_up_process(per_cpu(watchdog_task, this_cpu));
 
 	/* Warn about unreasonable 10+ seconds delays: */
-	if (now <= (touch_timestamp + 10))
+	if (now <= (touch_timestamp + softlockup_thresh))
 		return;
 
 	regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
 	per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
 	spin_lock(&print_lock);
-	printk(KERN_ERR "BUG: soft lockup detected on CPU#%d! [%s:%d]\n",
-		this_cpu, current->comm, current->pid);
+	printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
+		this_cpu, now - touch_timestamp,
+		current->comm, current->pid);
 	if (regs)
 		show_regs(regs);
 	else

Index: linux-2.6.22/kernel/sysctl.c
===================================================================
--- linux-2.6.22.orig/kernel/sysctl.c	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c	2007-07-18 21:05:57.877436750 -0700
@@ -78,6 +78,7 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
+extern int softlockup_thresh;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -206,6 +207,10 @@ static ctl_table root_table[] = {
 	{ .ctl_name = 0 }
 };
 
+/* Constants for kernel table minimum and maximum */
+static int one = 1;
+static int ten = 10;
+
 static ctl_table kern_table[] = {
 	{
 		.ctl_name	= KERN_PANIC,
@@ -615,6 +620,19 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+	{
+		.ctl_name	= KERN_SOFTLOCKUP_THRESHOLD,
+		.procname	= "softlockup_thresh",
+		.data		= &softlockup_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+		.extra2		= &ten,
+	},
+#endif
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.22/include/linux/sysctl.h
===================================================================
--- linux-2.6.22.orig/include/linux/sysctl.h	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/include/linux/sysctl.h	2007-07-18 21:41:56.584347500 -0700
@@ -165,6 +165,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74,
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_SOFTLOCKUP_THRESHOLD=77, /* int: softlockup tolerance threshold */
 };

Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.22.orig/Documentation
[patch] Change softlockup trigger limit using a kernel parameter
Kernel warns of softlockups if the softlockup thread is not able to run
on a CPU for 10s. It is useful to lower the softlockup warning threshold
in testing environments to catch potential lockups early. The following
patch adds a kernel parameter 'softlockup_lim' to control the softlockup
threshold.

Thanks,
Kiran

Control the trigger limit for softlockup warnings. This is useful for
debugging softlockups, by lowering the softlockup_lim to identify
possible softlockups earlier.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.22/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.22.orig/Documentation/kernel-parameters.txt	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/Documentation/kernel-parameters.txt	2007-07-16 14:07:34.907116001 -0700
@@ -1781,6 +1781,10 @@ and is between 256 and 4096 characters.
 
 	snd-ymfpci=	[HW,ALSA]
 
+	softlockup_lim=
+			[KNL] Soft lock up tolerance limit in seconds
+			{ 1 - 10 }
+
 	sonycd535=	[HW,CD]
 			Format: <io>[,<irq>]

Index: linux-2.6.22/kernel/softlockup.c
===================================================================
--- linux-2.6.22.orig/kernel/softlockup.c	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/softlockup.c	2007-07-11 12:50:05.280460591 -0700
@@ -21,6 +21,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic = 0;
+static int softlockup_lim = 10;
 
 static int softlock_panic(struct notifier_block *this, unsigned long event,
 	void *ptr)
@@ -34,6 +35,20 @@ static struct notifier_block panic_block
 	.notifier_call = softlock_panic,
 };
 
+static int __init softlockup_lim_setup(char *str)
+{
+	int lim = simple_strtol(str, NULL, 0);
+	if(lim > 0 && lim <= 10) {
+		softlockup_lim = lim;
+	} else {
+		printk("%s: Invalid softlockup_lim parameter. ", __func__);
+		printk("Using defaults.\n");
+	}
+	return 1;
+}
+
+__setup("softlockup_lim=", softlockup_lim_setup);
+
 /*
  * Returns seconds, approximately. We don't need nanosecond
  * resolution, and we don't need to waste time with a big divide when
@@ -97,7 +112,7 @@ void softlockup_tick(void)
 		wake_up_process(per_cpu(watchdog_task, this_cpu));
 
 	/* Warn about unreasonable 10+ seconds delays: */
-	if (now > (touch_timestamp + 10)) {
+	if (now > (touch_timestamp + softlockup_lim)) {
 		per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 		spin_lock(&print_lock);
Re: [patch] x86_64: Avoid too many remote cpu references due to /proc/stat
On Thu, Jul 12, 2007 at 07:13:17PM -0700, Andrew Morton wrote:
> On Thu, 12 Jul 2007 17:06:16 -0700 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > Too many remote cpu references due to /proc/stat.
> >
> > On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> > On every call to kstat_irqs, the process brings in per-cpu data from all
> > online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS,
> > results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> > /proc/stat is parsed by common commands like top, who etc, causing
> > lots of cacheline transfers:
> > (256+32*63) * 63 = 143136
>
> Do you have any actual numbers for how much this hurts?

Under certain load conditions with 2.6.21/22, this hurts a *lot*. In
fact, I have noticed delays in seconds when /proc/stat is read (48P
box). However, the reason for this is not the transfer of per-cpu
kstat, but dereferencing of the percpu data offset from a remote cpu
pda! Since the cpu pda is kmalloced, with 2.6.21, the zone pcp
structures happen to share the same page (and in some cases the
internode cacheline as well) with the cpu_pda. Excessive page
allocation activity on the remote node causes reads of the remote
cpus' pda to result in cache misses.

> > This statistic seems useless. Other 'big iron' arches disable this.
> > Can we disable computing/reporting this statistic? This piece of
> > statistic is not human readable on x86_64 anymore,
>
> Did you consider using percpu_counters (or such) at interrupt-time?
> (warning: percpu_counters aren't presently interrupt safe).

No, but given that alloc_percpu of a counter wastes so much memory, an
array of NR_IRQS percpu_counters will only cause more bloat, no?

> > If not, can we optimize computing this statistic so as to avoid
> > too many remote references (patch to follow)
>
> You other patch is a straightforward optimisation and should just be merged.

Sure! But I still don't see this statistic _really_ being useful
anywhere; /proc/stat output with NR_CPUS of 64 or 96 is ugly, but I
might be missing something.

> But afaict it will only provide a 2x speedup which I doubt is sufficient?

No, it provides much more than a 2x speedup with the 'hurting'
workload. This is because the optimized patch goes like this:

	for_each_possible_cpu(i) {
		...
		for (j = 0; j < NR_IRQS; j++) {
			unsigned int temp = kstat_cpu(i).irqs[j];
			...
		}
	}

But the existing code with kstat_irqs goes as:

	for (j = 0; j < NR_IRQS; j++) {
		for_each_possible_cpu(cpu) {
			sum += kstat_cpu(cpu).irqs[irq];
			...
		}
	}

The former causes far fewer cache misses on the pda per-cpu data
offset than the latter. A cache miss itself is not so bad per miss,
but the number of misses in the latter case adds up!

Yes, the cpu pda needs to be on an internode boundary by itself; I
will be submitting a patch for the same soon. I am thinking of page
aligning the cpu pda and using the 12 lower bits in %gs to store the
cpu processor id maybe :)

> Another thought is: how many of the NR_IRQS counters are actually
> non-zero? Because a pretty obvious optimisation would be to have a
> global bitmap[NR_IRQS] and do
>
> 	if (!bitmap[irq])
> 		bitmap[irq] = 1;
>
> at interrupt-time, then just print a "0" for the interrupts which have
> never occurred within show_stats().

Yep. That would work too.
[patch] Avoid too many remote cpu references due to /proc/stat
On Thu, Jul 12, 2007 at 05:06:15PM -0700, Ravikiran G Thirumalai wrote:
> Too many remote cpu references due to /proc/stat.
>
> On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> On every call to kstat_irqs, the process brings in per-cpu data from all
> online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
> results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> /proc/stat is parsed by common commands like top, who etc, causing
> lots of cacheline transfers
>
> This statistic seems useless. Other 'big iron' arches disable this.
> Can we disable computing/reporting this statistic? This piece of
> statistic is not human readable on x86_64 anymore,
>
> If not, can we optimize computing this statistic so as to avoid
> too many remote references (patch to follow)

Optimize show_stat to collect per-irq information just once.

On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem. On every call to kstat_irqs, the process brings in per-cpu data from all online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS, results in (256+32*63) * 63 remote cpu references on a 64 cpu config. Considering the fact that we already compute this value per-cpu, we can save on the remote references as below.
Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.22.orig/fs/proc/proc_misc.c	2007-07-11 14:32:33.013197741 -0700
+++ linux-2.6.22/fs/proc/proc_misc.c	2007-07-12 16:28:24.871389279 -0700
@@ -443,6 +443,11 @@ static int show_stat(struct seq_file *p,
 	unsigned long jif;
 	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 sum = 0;
+	unsigned int *per_irq_sum;
+
+	per_irq_sum = kzalloc(sizeof(unsigned int)*NR_IRQS, GFP_KERNEL);
+	if (!per_irq_sum)
+		return -ENOMEM;
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = cputime64_zero;
@@ -461,8 +466,11 @@ static int show_stat(struct seq_file *p,
 		irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
 		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
 		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
-		for (j = 0 ; j < NR_IRQS ; j++)
-			sum += kstat_cpu(i).irqs[j];
+		for (j = 0 ; j < NR_IRQS ; j++) {
+			unsigned int temp = kstat_cpu(i).irqs[j];
+			sum += temp;
+			per_irq_sum[j] += temp;
+		}
 	}
 
 	seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu\n",
@@ -500,7 +508,7 @@ static int show_stat(struct seq_file *p,
 
 #if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64)
 	for (i = 0; i < NR_IRQS; i++)
-		seq_printf(p, " %u", kstat_irqs(i));
+		seq_printf(p, " %u", per_irq_sum[i]);
 #endif
 
 	seq_printf(p,
@@ -515,6 +523,7 @@ static int show_stat(struct seq_file *p,
 		nr_running(),
 		nr_iowait());
 
+	kfree(per_irq_sum);
 	return 0;
 }
[patch] x86_64: Avoid too many remote cpu references due to /proc/stat
Too many remote cpu references due to /proc/stat.

On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem. On every call to kstat_irqs, the process brings in per-cpu data from all online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS, results in (256+32*63) * 63 remote cpu references on a 64 cpu config. /proc/stat is parsed by common commands like top, who, etc, causing lots of cacheline transfers.

This statistic seems useless. Other 'big iron' arches disable this. Can we disable computing/reporting this statistic? This piece of statistic is not human readable on x86_64 anymore.

If not, can we optimize computing this statistic so as to avoid too many remote references (patch to follow)

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.22.orig/fs/proc/proc_misc.c	2007-07-12 16:31:02.0 -0700
+++ linux-2.6.22/fs/proc/proc_misc.c	2007-07-12 16:33:45.226221759 -0700
@@ -498,7 +498,8 @@ static int show_stat(struct seq_file *p,
 	}
 	seq_printf(p, "intr %llu", (unsigned long long)sum);
-#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64)
+#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64) \
+	&& !defined(CONFIG_X86_64)
 	for (i = 0; i < NR_IRQS; i++)
 		seq_printf(p, " %u", kstat_irqs(i));
 #endif
Re: [PATCH 7/12] acpi: fix another compile warning
On Wed, Jun 20, 2007 at 09:36:30AM -0400, Len Brown wrote:
> On Wednesday 20 June 2007 04:49, Andreas Herrmann wrote:
> > On Tue, Jun 19, 2007 at 11:38:02PM -0400, Len Brown wrote:
> > > On Tuesday 19 June 2007 18:50, Andreas Herrmann wrote:
>
> I fear, however, that this patch defeats the purpose of
> b0bd35e622ffbda2c01dc67a0381c6a18817a29a -- which was to make selecting
> NUMA more user-friendly. So it might make more sense to simply revert that
> patch entirely.
>
> The underlying problem is that Kconfig doesn't support using select
> on targets which themselves have dependencies.
>
> Signed-off-by: Len Brown <[EMAIL PROTECTED]>
>
> diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
> index 5ce9443..e9d7767 100644
> --- a/arch/x86_64/Kconfig
> +++ b/arch/x86_64/Kconfig
> @@ -364,9 +364,9 @@ config NODES_SHIFT
>  config X86_64_ACPI_NUMA
>  	bool "ACPI NUMA detection"
>  	depends on NUMA
> -	select ACPI
> +	depends on ACPI
>  	select PCI
> -	select ACPI_NUMA
> +	depends on ACPI_NUMA
>  	default y
>  	help
>  	  Enable ACPI SRAT based node topology detection.

arch/x86_64/Kconfig:706:error: found recursive dependency: PCI -> ACPI -> X86_64_ACPI_NUMA -> PCI -> USB_ARCH_HAS_OHCI -> USB_ARCH_HAS_HCD -> MOUSE_APPLETOUCH -> USB -> USB_SISUSBVGA
make[1]: *** [menuconfig] Error 1

I guess we have to change PCI to 'depends on' as well. Else, select PM? I am OK with either approach.

Thanks,
Kiran

Paper over 'select' inadequacies.
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc5/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.22-rc5.orig/arch/x86_64/Kconfig	2007-06-18 16:02:19.571323415 -0700
+++ linux-2.6.22-rc5/arch/x86_64/Kconfig	2007-06-20 11:34:29.845354250 -0700
@@ -366,6 +366,7 @@ config X86_64_ACPI_NUMA
 	depends on NUMA
 	select ACPI
 	select PCI
+	select PM
 	select ACPI_NUMA
 	default y
 	help
Re: [BUG] long freezes on thinkpad t60
On Mon, Jun 18, 2007 at 01:20:55AM -0700, Andrew Morton wrote:
> On Mon, 18 Jun 2007 10:12:04 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > Subject: [patch] x86: fix spin-loop starvation bug
> > From: Ingo Molnar <[EMAIL PROTECTED]>
> >
> > Miklos Szeredi reported very long pauses (several seconds, sometimes
> > more) on his T60 (with a Core2Duo) which he managed to track down to
> > wait_task_inactive()'s open-coded busy-loop. He observed that an
> > interrupt on one core tries to acquire the runqueue-lock but does not
> > succeed in doing so for a very long time - while wait_task_inactive() on
> > the other core loops waiting for the first core to deschedule a task
> > (which it wont do while spinning in an interrupt handler).
> >
> > The problem is: both the spin_lock() code and the wait_task_inactive()
> > loop uses cpu_relax()/rep_nop(), so in theory the CPU should have
> > guaranteed MESI-fairness to the two cores - but that didnt happen: one
> > of the cores was able to monopolize the cacheline that holds the
> > runqueue lock, for extended periods of time.
> >
> > This patch changes the spin-loop to assert an atomic op after every REP
> > NOP instance - this will cause the CPU to express its "MESI interest" in
> > that cacheline after every REP NOP.
>
> Kiran, if you're still able to reproduce that zone->lru_lock starvation
> problem, this would be a good one to try...

We tried this approach a week back (speak of coincidences), and it did not help the problem. I'd changed calls to the zone->lru_lock spin_lock to do spin_trylock in a while loop with cpu_relax instead. It did not help. This was on top of 2.6.17 kernels.

But the good news is that 2.6.21, as is, does not have the starvation issue -- that is, zone->lru_lock does not seem to get contended that much under the same workload. However, this was not on the same hardware I reported zone->lru_lock contention on (8 socket dual core opteron).
I don't have access to it anymore :(

Thanks,
Kiran
2.6.21.3: NFS: Buggy server - nlink == 0!
While running a dbench stress test on an nfs mounted file system, I notice the subject error message on the client machine. The client machine is a 48 core box with NUMA characteristics and 1024 dbench processes running continuously in a loop, while another memory hog application runs in parallel. The client is on 2.6.21.3. The server is booted up with 2.6.21.3 as well. Attached is the server configuration. The same test on a 2.6.16 client does not spew out these messages. Is this really a server issue, or is the NFS client to be blamed here?

fstab on the client goes as:

vus2:/mnt/sda5	/nfstest	nfs	udp,wsize=32768,rsize=32768	0 0

/etc/exports on the server looks like:

/mnt/sda5	*(rw,no_root_squash,sync)

I will be happy to post full config/dmesg if required.

The client side NFS config goes as:

#
# Network File Systems
#
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
# CONFIG_NFSD_V4 is not set
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
# CONFIG_RPCSEC_GSS_SPKM3 is not set

The server side NFS config goes as:

#
# Network File Systems
#
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m

Thanks,
Kiran
Re: [PATCH 1/2] Define new percpu interface for shared data -- version 3
On Thu, May 24, 2007 at 11:03:56AM +0200, Martin Schwidefsky wrote:
> On Wed, 2007-05-23 at 11:57 -0700, Ravikiran G Thirumalai wrote:
>
> Current git with the patches applied and the default configuration for
> s390 decreases the section size of .data.percpu from 0x3e50 to 0x3e00.
> 0.5% decrease.

Thanks!! Fenghua, you could use my 'Acked by' if needed

Acked-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Re: [PATCH 1/2] Define new percpu interface for shared data -- version 3
On Wed, May 23, 2007 at 12:09:56PM -0700, Yu, Fenghua wrote:
> > Has there been any measurable benefit yet due to tail padding?
>
> We don't have data that tail padding actually helps. It all
> depends on what data the linker lays out in the cachelines.
>
> As of now we just want to create the infrastructure (so that
> more and more people who need it, can use it).

So what we have now is space wastage on some architectures, space savings on some, but with no measurable performance benefit due to the infrastructure itself. Why not push the infrastructure when we really need it, as against pushing it now when we are not sure if it benefits?

Thanks,
Kiran
Re: [PATCH 1/2] Define new percpu interface for shared data -- version 3
On Wed, May 23, 2007 at 11:26:53AM -0700, Yu, Fenghua wrote:
> > elements are cacheline aligned. And as such, this differentiates the
> > local only data and remotely accessed data cleanly.
>
> > OK, but could we please have a concise description of the impact
> > of these changes on kernel memory footprint? Increase or decrease?
> > And by approximately how much?
>
> Depending on how the linker places percpu data, the patches could
> increase or decrease percpu section size. Data from 2.6.21-rc7-mm2:
>
> On x86 SMP, the section size is increased from 0x7768 to 0x790c.
> 1.3% increase.
>
> On X86-64 SMP, the size is decreased from 0x72d0 to 0x6540.
> 11.8% decrease.
>
> On X86-64 VSMP, the size is increased from 0x72d0 to 0x8340.
> 14.3% increase.
>
> On IA64 SMP, the size is decreased from 0x8370 to 0x7fc0.
> 2.8% decrease.

Has there been any measurable benefit yet due to tail padding? It would also be interesting to check the wastage/savings on another large cache line architecture like s390 (which has a 256 byte cache line).

Thanks,
Kiran
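The space-versus-sharing tradeoff debated here is easy to see with plain compiler alignment attributes. The 64-byte line size and struct names below are illustrative only (s390's line, as noted, is 256 bytes):

```c
#include <assert.h>

#define CACHE_LINE 64	/* illustrative; 256 on s390 */

/* Per-cpu data only ever touched by its own cpu: packing it tightly
 * wastes no space. */
struct local_stat {
	unsigned long a, b;
};

/* Data also read by remote cpus: aligning (and thereby padding) it out
 * to a full cache line prevents false sharing, at a space cost -- the
 * compiler rounds sizeof up to the alignment. */
struct shared_stat {
	unsigned long a, b;
} __attribute__((aligned(CACHE_LINE)));
```

Whether the padding ever pays for itself depends entirely on which data the linker would otherwise have co-located, which is exactly the "no measurable benefit yet" point above.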
Re: ACPI build problem (2.6.21 CONFIG_X86_64_ACPI_NUMA)
On Sat, Apr 28, 2007 at 01:59:46AM -0400, Len Brown wrote: > On Thursday 26 April 2007 09:26, you wrote: > ... > CONFIG_ACPI depends on CONFIG_PM, yet this build fails because you have > CONFIG_ACPI=y and CONFIG_PM=n > > Unfortunately kconfig doesn't trace dependencies when "select" is used, > making select sort of evil. Ie. select can't target anything which > itself has dependencies. > > In 2.6.15, b0bd35e622ffbda2c01dc67a0381c6a18817a29a added the select below, > and subsequently ACPI became dependent on PM, which broke your build. > Technically, this could have broken your build in other ways too > since ACPI already had other dependencies. > > +# Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig. > + > +config X86_64_ACPI_NUMA > + bool "ACPI NUMA detection" > + depends on NUMA > + select ACPI > + select ACPI_NUMA > + default y > + help > + Enable ACPI SRAT based node topology detection. > > If you de-select CONFIG_X86_64_ACPI_NUMA then the select goes away > and kconfig should work properly. > > It isn't immediately clear to me how the NUMA Kconfig patch is a step > forward, but perhaps the authors can comment. > Before the above change, to select NUMA for em64T systems, you *had* to select K8_NUMA in the 'Processor type and features' sub menu to select CONFIG_NUMA, and then select ACPI NUMA from the generic PM menu, which is not very intuitive when you have a EM64T NUMA system. On x86_64 platforms, NUMA topology detection could be K8 style detection, ACPI based topology or NUMA emulation. Most of the real NUMA systems would need ACPI detection, older AMD platforms would need K8_NUMA, and NUMA emulation is a debugging/emulation option to emulate NUMA topology on a regular SMP box. Since CONFIG_NUMA + CONFIG_NUMA_EMU is a valid combination, and since CONFIG_NUMA + CONFIG_K8_NUMA is a valid combination, it makes sense to have X86_64_ACPI_NUMA on its own, and grouped along with other NUMA topology styles in the Processor menu. 
Hope this is good enough reason :)

Regarding the subject build breakage: CONFIG_ACPI depends on CONFIG_PM with an explicit 'depends' keyword. If PM does not get selected and CONFIG_ACPI gets selected, I would think it is a Kconfig bug which needs fixing, no? Else, we can add a workaround by selecting CONFIG_PM if X86_64_ACPI_NUMA is chosen. But AFAICT the right thing is to fix the Kconfig bug.

Thanks,
Kiran
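The failure mode being debated (a 'select' forcing a symbol on without honoring that symbol's own 'depends on') can be modeled in a few lines of C; the enum and helper names below are purely hypothetical, just to make the broken end state concrete:

```c
#include <assert.h>

/* Toy model of three Kconfig symbols from this thread. */
enum { SYM_PM, SYM_ACPI, SYM_X86_64_ACPI_NUMA, NSYMS };

static int sym[NSYMS];	/* 0 = n, 1 = y */

/* 'depends on' is a validity check: ACPI depends on PM. */
static int acpi_deps_met(void)
{
	return sym[SYM_PM];
}

/* 'select ACPI' from X86_64_ACPI_NUMA: kconfig flips the target on
 * without evaluating the target's own dependencies. */
static void set_x86_64_acpi_numa(void)
{
	sym[SYM_X86_64_ACPI_NUMA] = 1;
	sym[SYM_ACPI] = 1;	/* forced on; no dependency check here */
}
```

With PM left off, the select still yields ACPI=y, which is precisely the CONFIG_ACPI=y / CONFIG_PM=n combination that broke the reported build.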
Re: ACPI build problem (2.6.21 CONFIG_X86_64_ACPI_NUMA)
On Sat, Apr 28, 2007 at 01:59:46AM -0400, Len Brown wrote: On Thursday 26 April 2007 09:26, you wrote: ... CONFIG_ACPI depends on CONFIG_PM, yet this build fails because you have CONFIG_ACPI=y and CONFIG_PM=n Unfortunately kconfig doesn't trace dependencies when select is used, making select sort of evil. Ie. select can't target anything which itself has dependencies. In 2.6.15, b0bd35e622ffbda2c01dc67a0381c6a18817a29a added the select below, and subsequently ACPI became dependent on PM, which broke your build. Technically, this could have broken your build in other ways too since ACPI already had other dependencies. +# Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig. + +config X86_64_ACPI_NUMA + bool ACPI NUMA detection + depends on NUMA + select ACPI + select ACPI_NUMA + default y + help + Enable ACPI SRAT based node topology detection. If you de-select CONFIG_X86_64_ACPI_NUMA then the select goes away and kconfig should work properly. It isn't immediately clear to me how the NUMA Kconfig patch is a step forward, but perhaps the authors can comment. Before the above change, to select NUMA for em64T systems, you *had* to select K8_NUMA in the 'Processor type and features' sub menu to select CONFIG_NUMA, and then select ACPI NUMA from the generic PM menu, which is not very intuitive when you have a EM64T NUMA system. On x86_64 platforms, NUMA topology detection could be K8 style detection, ACPI based topology or NUMA emulation. Most of the real NUMA systems would need ACPI detection, older AMD platforms would need K8_NUMA, and NUMA emulation is a debugging/emulation option to emulate NUMA topology on a regular SMP box. Since CONFIG_NUMA + CONFIG_NUMA_EMU is a valid combination, and since CONFIG_NUMA + CONFIG_K8_NUMA is a valid combination, it makes sense to have X86_64_ACPI_NUMA on its own, and grouped along with other NUMA topology styles in the Processor menu. 
[patch 2.6.21-rc6] failsafe mechanism to HPET clock calibration
Provide a failsafe mechanism to avoid the kernel spinning forever in read_hpet_tsc during early kernel bootup. This failsafe mechanism was introduced in 21-rc, http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2f7a2a79c3ebb44f8b1b7d9b4fd3a650eb69e544 but it looks like the hpet split from time.c lost the commit. This patch reintroduces the failsafe mechanism.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
Cc: Jack Steiner <[EMAIL PROTECTED]>
Cc: john stultz <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5/arch/x86_64/kernel/hpet.c
===
--- linux-2.6.21-rc5.orig/arch/x86_64/kernel/hpet.c	2007-04-11 16:03:05.0 -0700
+++ linux-2.6.21-rc5/arch/x86_64/kernel/hpet.c	2007-04-11 18:49:36.0 -0700
@@ -191,6 +191,7 @@ int hpet_reenable(void)
 #define TICK_COUNT 1
 #define TICK_MIN 5000
+#define MAX_TRIES 5
 
 /*
  * Some platforms take periodic SMI interrupts with 5ms duration. Make sure none
@@ -198,13 +199,15 @@ int hpet_reenable(void)
  */
 static void __init read_hpet_tsc(int *hpet, int *tsc)
 {
-	int tsc1, tsc2, hpet1;
+	int tsc1, tsc2, hpet1, i;
 
-	do {
+	for (i = 0; i < MAX_TRIES; i++) {
 		tsc1 = get_cycles_sync();
 		hpet1 = hpet_readl(HPET_COUNTER);
 		tsc2 = get_cycles_sync();
-	} while (tsc2 - tsc1 > TICK_MIN);
+		if (tsc2 - tsc1 <= TICK_MIN)
+			break;
+	}
 	*hpet = hpet1;
 	*tsc = tsc2;
 }
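The control flow the patch introduces — retry a back-to-back timestamp read a bounded number of times instead of looping unconditionally, accepting a reading only when the two reads are close together — can be sketched in plain userspace C. Everything below is hypothetical and only illustrates the pattern: `read_pair()` mocks the cycle-counter reads and simulates an SMI-disturbed (large-delta) reading on the first two attempts, and the threshold values are invented.

```c
#include <assert.h>

#define TICK_MIN  5000   /* max acceptable delta between the two reads */
#define MAX_TRIES 5      /* failsafe: never loop forever */

/* Mock of back-to-back cycle-counter reads: the first two attempts are
 * "disturbed" (large delta, as if an SMI fired mid-read), later ones clean. */
static unsigned long read_pair(int attempt, unsigned long *t1, unsigned long *t2)
{
    *t1 = 1000000UL * (unsigned long)attempt;
    *t2 = *t1 + (attempt < 2 ? 50000UL : 100UL);
    return *t2 - *t1;
}

/* Bounded-retry loop, mirroring the patch: break on a clean reading,
 * but give up after MAX_TRIES instead of spinning forever.
 * Returns the number of attempts consumed. */
int read_with_failsafe(unsigned long *delta)
{
    unsigned long t1 = 0, t2 = 0;
    int i;

    for (i = 0; i < MAX_TRIES; i++) {
        *delta = read_pair(i, &t1, &t2);
        if (*delta <= TICK_MIN)   /* clean reading: stop retrying */
            break;
    }
    return (i < MAX_TRIES) ? i + 1 : MAX_TRIES;
}

/* Small wrappers so the behavior is easy to check. */
int demo_attempts(void)
{
    unsigned long d;
    return read_with_failsafe(&d);
}

unsigned long demo_delta(void)
{
    unsigned long d = 0;
    read_with_failsafe(&d);
    return d;
}
```

With two disturbed readings followed by a clean one, the loop settles on the third attempt; a machine whose readings never settle would still exit after MAX_TRIES with the last reading, which is the failsafe the patch is about.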
Re: [patch] sched: align rq to cacheline boundary
On Mon, Apr 09, 2007 at 03:17:05PM -0700, Siddha, Suresh B wrote:
> On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> > On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > > On Mon, 9 Apr 2007 11:08:53 -0700 "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
>
> Kiran, can you educate me when I am supposed to use
> ____cacheline_aligned_in_smp vs __cacheline_aligned_in_smp?

As far as my understanding goes, the four-underscore version is for aligning members/elements within a data structure, and the two-underscore version is for aligning statically defined variables. The two-underscore version places the variable in a separate section meant for cacheline-aligned variables, so that there is no false sharing on the cacheline with a consecutive datum. For regular statically defined data structures, the latter has to be used; but since your patch uses per-cpu data, which is already in a separate section, you had to use the former, I guess.

> > As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline
> > aligned per-cpu data in another section, just like we do with the
> > .data.cacheline_aligned section, but keep this new section between
> > __per_cpu_start and __per_cpu_end?
>
> Yes. But that will still waste some memory in the new section, if the data
> elements are not multiples of 4k.

Yes. But the wastage now depends on the data structure being aligned, rather than on whichever structure happened to sit next to it before. You cannot not lose memory while padding, I guess :). But padding for per-cpu data seems a bit odd, and I am not sure it is worth it for a 0.5% gain.

Thanks,
Kiran
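The distinction between the two annotations can be shown with a userspace sketch using the GNU `aligned` attribute. This is an illustration, not the kernel macros: 64 is a stand-in for the L1 line size (SMP_CACHE_BYTES), the struct layout is invented, and the sketch cannot show the one thing the two-underscore kernel macro additionally does — place the variable in a dedicated linker section.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_L1_LINE 64   /* stand-in for SMP_CACHE_BYTES */

/* Four-underscore style: align a *member* inside a structure so it starts
 * on its own cache line; the compiler pads the structure as needed. */
struct runqueue_like {
    char head[10];
    long hot_counter __attribute__((aligned(FAKE_L1_LINE)));
};

/* Two-underscore style: align a whole statically defined variable.
 * (The real kernel macro also moves it to a .data.cacheline_aligned
 * section so its neighbor cannot share its line.) */
static struct runqueue_like one_rq __attribute__((aligned(FAKE_L1_LINE)));

/* Where the aligned member landed inside the structure. */
size_t member_offset(void) { return offsetof(struct runqueue_like, hot_counter); }

/* Aligning a member also rounds the whole structure size up. */
size_t object_stride(void) { return sizeof(struct runqueue_like); }

/* The statically defined variable itself starts on a line boundary. */
int object_is_aligned(void) { return ((size_t)&one_rq % FAKE_L1_LINE) == 0; }
```

The member lands at offset 64 (the 10-byte `head` is padded out to a full line), and `sizeof` the structure becomes a multiple of the line size, so arrays of it also keep elements on separate lines.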
Re: [patch] Pad irq_desc to internode cacheline size
On Mon, Apr 09, 2007 at 03:47:52PM -0600, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
>
> > This will consume nearly 4k per irq won't it? What is the upper bound
> > here, across all configs and all hardware?
> >
> > Is VSMP the only arch which has cacheline_internodealigned_in_smp
> > larger than cacheline_aligned_in_smp?
>
> Ugh. We set internode aligned to 4k for all of x86_64. !!!

No. Internode aligned is 4k only if CONFIG_X86_VSMP is chosen. The internode line size defaults to SMP_CACHE_BYTES for all other machine types. Please note that an INTERNODE_CACHE_SHIFT of 12 is defined under #ifdef CONFIG_X86_VSMP in include/asm-x86_64/cache.h.

> I believe this ups our worst case memory consumption for
> the array from 1M to 32M. Although the low end might be 2M.
> I can't recall if an irq_desc takes one cache line or two
> after we have put the cpu masks in it.
>
> My gut feel says that what we want to do is delay this until
> we are dynamically allocating the array members. Then we can at
> least have the chance of allocating the memory on the proper NUMA
> node, and won't need the extra NUMA alignment.

I was thinking along those lines as well. But, until we get there, can we have this in as a stopgap? The patch does not increase the memory footprint for any architecture other than vSMPowered systems.
Re: [patch] sched: align rq to cacheline boundary
On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 11:08:53 -0700 "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
>
> > Align the per cpu runqueue to the cacheline boundary. This will minimize the
> > number of cachelines touched during remote wakeup.
> >
> > Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> > ---
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index b9a6837..eca33c5 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -278,7 +278,7 @@ struct rq {
> >  	struct lock_class_key rq_lock_key;
> >  };
> >
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
> rather a lot.
>
> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

The internode line size, yes. But Suresh is using ____cacheline_aligned_in_smp, which uses SMP_CACHE_BYTES (L1_CACHE_BYTES), so this does not align the per-cpu variable to 4k. However, if the motivation for this patch was a significant performance difference, then the above padding needs to be on the internode cacheline size, using cacheline_internodealigned_in_smp. cacheline_internodealigned_in_smp aligns a data structure to the internode line size, which is 4k for vSMPowered systems and the L1 line size for all other architectures.

As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline-aligned per-cpu data in another section, just like we do with the .data.cacheline_aligned section, but keep this new section between __per_cpu_start and __per_cpu_end?
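Andrew's (linesize-4 * NR_CPUS) upper bound is easy to quantify: rounding an object up to the line size wastes the round-up remainder once per CPU. A small sketch of that arithmetic — the object sizes, line sizes, and CPU counts below are illustrative, not taken from any real configuration:

```c
#include <assert.h>

/* Bytes lost across all CPUs when a per-cpu object of obj_size bytes is
 * padded up to a multiple of linesize. Worst case is obj_size barely past
 * a multiple (e.g. 4 bytes), wasting (linesize - 4) bytes per CPU. */
static unsigned long padding_waste(unsigned long obj_size,
                                   unsigned long linesize,
                                   unsigned long nr_cpus)
{
    unsigned long padded = (obj_size + linesize - 1) / linesize * linesize;
    return (padded - obj_size) * nr_cpus;
}

/* Wrappers for concrete, invented scenarios. */
unsigned long waste_l1_case(void)   { return padding_waste(4, 128, 64); }
unsigned long waste_vsmp_case(void) { return padding_waste(4, 4096, 64); }
unsigned long waste_exact_fit(void) { return padding_waste(256, 128, 64); }
```

The vSMP case shows why the 4k internode line makes this discussion sharper: the same 4-byte overshoot that wastes ~124 bytes per CPU on a 128-byte line wastes ~4092 bytes per CPU on a 4k line.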
[patch] Pad irq_desc to internode cacheline size
We noticed a drop in network performance due to irq_desc being cacheline aligned rather than internode aligned. We see 50% of expected performance when two e1000 NICs local to two different nodes have consecutive irq descriptors allocated, due to false sharing.

Note that this patch does away with cacheline padding for the UP case, as it does not seem useful for UP configurations.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5/include/linux/irq.h
===
--- linux-2.6.21-rc5.orig/include/linux/irq.h	2007-04-09 10:16:23.560848473 -0700
+++ linux-2.6.21-rc5/include/linux/irq.h	2007-04-09 10:16:45.401177929 -0700
@@ -175,7 +175,7 @@ struct irq_desc {
 	struct proc_dir_entry *dir;
 #endif
 	const char *name;
-} ____cacheline_aligned;
+} ____cacheline_internodealigned_in_smp;
 
 extern struct irq_desc irq_desc[NR_IRQS];

Index: linux-2.6.21-rc5/kernel/irq/handle.c
===
--- linux-2.6.21-rc5.orig/kernel/irq/handle.c	2007-02-04 10:44:54.0 -0800
+++ linux-2.6.21-rc5/kernel/irq/handle.c	2007-04-09 12:26:40.473326023 -0700
@@ -48,7 +48,7 @@ handle_bad_irq(unsigned int irq, struct
 /*
  * Controller mappings for all interrupt sources:
  */
-struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned = {
+struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
 	[0 ... NR_IRQS-1] = {
 		.status = IRQ_DISABLED,
 		.chip = &no_irq_chip,
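The fix works because an alignment attribute on a struct type also sets the array stride: once each descriptor is rounded up to the internode line size, consecutive array elements can never share a line, so the two NICs stop false-sharing. A userspace sketch of just that layout property — 4096 stands in for the vSMP internode line size, and the miniature struct is invented, not the real irq_desc:

```c
#include <assert.h>
#include <stddef.h>

#define INTERNODE_LINE 4096   /* stand-in for the vSMP internode cache line */

/* Invented miniature of an irq descriptor, padded to the internode line. */
struct irq_desc_like {
    unsigned int status;
    void *handler_data;
    char name[16];
} __attribute__((aligned(INTERNODE_LINE)));

static struct irq_desc_like descs[4];

/* Distance in bytes between two consecutive descriptors in the array. */
size_t desc_stride(void)
{
    return (size_t)((char *)&descs[1] - (char *)&descs[0]);
}

/* True if two array neighbors can ever occupy the same line. */
int neighbours_share_line(void)
{
    return ((size_t)&descs[0] / INTERNODE_LINE) ==
           ((size_t)&descs[1] / INTERNODE_LINE);
}
```

Without the attribute the ~32-byte elements would pack many descriptors per line; with it the stride becomes a full line, which is exactly the memory-for-performance trade the thread is debating.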
Re: [rfc][patch] queued spinlocks (i386)
On Fri, Mar 23, 2007 at 10:40:17AM +0100, Eric Dumazet wrote:
> On Fri, 23 Mar 2007 09:59:11 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > Implement queued spinlocks for i386. This shouldn't increase the size of
> > the spinlock structure, while still being able to handle 2^16 CPUs.
> >
> > Not completely implemented with assembly yet, to make the algorithm a bit
> > clearer.
> >
> > The queued spinlock has 2 fields, a head and a tail, which are indexes
> > into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
> > "atomic_inc_return" on the head index, and keeps the returned value as
> > a ticket. The CPU then spins until the tail index is equal to that
> > ticket.
> >
> > To unlock a spinlock, the tail index is incremented (this can be
> > non-atomic, because only the lock owner will modify tail).
> >
> > Implementation inefficiencies aside, this change should have little
> > effect on performance for uncontended locks, but will have quite a
> > large cost for highly contended locks [O(N) cacheline transfers vs
> > O(1) per lock acquisition, where N is the number of CPUs contending].
> > The benefit is that contended locks will not cause any starvation.
> >
> > Just an idea. Big NUMA hardware seems to have fairness logic that
> > prevents starvation for the regular spinlock logic. But it might be
> > interesting for the -rt kernel or systems with starvation issues.
>
> It's a very nice idea Nick.

Amen to that.

> You also have for free the number of cpus that are before you.
>
> On big SMP/NUMA, we could use this information to call a special
> lock_cpu_relax() function to avoid cacheline transfers.
>
> 	asm volatile(LOCK_PREFIX "xaddw %0, %1\n\t"
> 		     : "+r" (pos), "+m" (lock->qhead) : : "memory");
> 	for (;;) {
> 		unsigned short nwait = pos - lock->qtail;
> 		if (likely(nwait == 0))
> 			break;
> 		lock_cpu_relax(lock, nwait);
> 	}
>
> 	lock_cpu_relax(raw_spinlock_t *lock, unsigned int nwait)
> 	{
> 		unsigned int cycles = nwait * lock->min_cycles_per_round;
> 		busy_loop(cycles);
> 	}

Good idea. Hopefully, this should reduce the number of cacheline transfers in the contended case.
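Nick's head/tail ticket scheme can be sketched with C11 atomics. This is a portable sketch of the algorithm only — not the i386 assembly in the patch — and all names below are mine; it also elides memory-ordering refinements (acquire/release instead of the default sequential consistency) a real lock would use.

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal ticket ("queued") spinlock: head hands out tickets, tail says
 * whose turn it is. 16-bit fields handle up to 2^16 waiters, as in the
 * patch description. */
typedef struct {
    atomic_ushort head;   /* next ticket to hand out */
    atomic_ushort tail;   /* ticket currently allowed to hold the lock */
} ticket_lock_t;

#define TICKET_LOCK_INIT { 0, 0 }

static unsigned short ticket_lock(ticket_lock_t *l)
{
    /* Atomic fetch-and-add plays the role of xaddw/atomic_inc_return:
     * grab a ticket, then spin until our number comes up. */
    unsigned short me = atomic_fetch_add(&l->head, 1);
    while (atomic_load(&l->tail) != me)
        ;   /* spin; a real lock would cpu_relax() here */
    return me;
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Only the owner ever advances tail, so no compare-and-swap needed. */
    atomic_fetch_add(&l->tail, 1);
}

/* Single-threaded sanity check: tickets come out in FIFO order and the
 * lock is immediately re-acquirable after each unlock. */
int ticket_demo(void)
{
    ticket_lock_t l = TICKET_LOCK_INIT;
    int ok = 1;
    for (int i = 0; i < 3; i++) {
        ok &= (ticket_lock(&l) == (unsigned short)i);
        ticket_unlock(&l);
    }
    return ok;
}
```

The FIFO property is what delivers the no-starvation guarantee the thread discusses, and `pos - lock->qtail` in Eric's snippet is exactly the `me - tail` distance a waiter can use to back off proportionally.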
Re: [PATCH] [REVISED] net/ipv4/multipath_wrandom.c: check kmalloc() return value.
On Mon, Mar 12, 2007 at 01:56:13PM -0700, David Miller wrote:
> From: Pekka J Enberg <[EMAIL PROTECTED]>
> Date: Mon, 12 Mar 2007 14:15:16 +0200 (EET)
>
> > On 3/9/07, David Miller <[EMAIL PROTECTED]> wrote:
> > > The whole cache-multipath subsystem has to have its guts revamped for
> > > proper error handling.
> >
> > (Untested patch follows.)
>
> I'm not accepting untested patches; if people want to fix this
> code it cannot be done in a shoddy manner any more, it will
> need to be fixed and tested by people who really care about
> this code.

Hi Dave, We do care for the multipath cached option. We use it often with node-aware multipath. We haven't experienced oopsen or crashes for our use cases over the past year. We will test Amit's and Pekka's patches with fault injection. We will also go through the core multipath-cached code and have a try at cleaning it up.

> We need real care for this code or I will remove it in 2.6.23

Can you please point us to any open bug reports on multipath cached? We did a call for testing on netdev some time back but did not get any bug reports: http://marc.info/?l=linux-netdev&m=114827370231872&w=2 The kernel bugzilla shows zarro boogs for multipath cached as well.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
>
> Seeing the code helps.

But there was a subtle problem with the hold-time instrumentation here. The code assumed that a critical section exiting through spin_unlock_irq entered the critical section with spin_lock_irq, but that might not always be the case, and the hold-time instrumentation goes bad when that happens (as in shrink_inactive_list).

> > The instrumentation goes like this:
> >
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1, t2;
> > 	local_irq_disable();
> > 	t1 = get_cycles_sync();
> > 	preempt_disable();
> > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > 	_raw_spin_lock(lock);
> > 	t2 = get_cycles_sync();
> > 	lock->raw_lock.htsc = t2;
> > 	if (lock->spin_time < (t2 - t1))
> > 		lock->spin_time = t2 - t1;
> > }
> > ...
> >
> > void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1;
> > 	spin_release(&lock->dep_map, 1, _RET_IP_);
> > 	t1 = get_cycles_sync();
> > 	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
> > 		lock->cs_time = t1 - lock->raw_lock.htsc;
> > 	_raw_spin_unlock(lock);
> > 	local_irq_enable();
> > 	preempt_enable();
> > }
> > ...
>
> OK, now we need to do a dump_stack() each time we discover a new max hold
> time. That might be a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky. Maybe make spin_unlock_irq() return
> the hold time then do:

What I found now, after fixing the above, is that the hold time is not bad -- 249461 cycles on the 2.6 GHz Opteron with PowerNow! disabled in the BIOS. The spin time is still on the order of seconds. Hence this looks like a hardware fairness issue. Attaching the instrumentation patch with this email.
Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h	2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h	2007-01-15 15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
 #include <asm/page.h>
 #include <asm/processor.h>
 
+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+	unsigned long long ret;
+	unsigned eax;
+	/* Don't do an additional sync on CPUs where we know
+	   RDTSC is already synchronous. */
+	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+		       "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
+	rdtscll(ret);
+	return ret;
+}
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
 		"jle 3b\n\t"
 		"jmp 1b\n"
 		"2:\t" : "=m" (lock->slock) : : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 
 /*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
 		"jmp 4b\n"
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 #endif
 
@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
+	if (oldval)
+		lock->htsc = get_cycles_sync2();
 	return oldval > 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
+	unsigned long long t = get_cycles_sync2();
+	if (lock->hold_time < t - lock->htsc)
+		lock->hold_time = t - lock->htsc;
 	asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
 }

Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h	2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h	2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@
 
 typedef struct {
 	unsigned int slock;
+	unsigned long long hold_time;
+	unsigned long long htsc;
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED	{ 1,0,0 }
 
 typedef struct {
 	unsigned int lock;

Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h	2007-01-14 22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h	2007-01-14 22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do { \
 # define spin_unlock(lock)
Re: High lock spin time for zone-lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
> Seeing the code helps.

But there was a subtle problem with the hold time instrumentation here. The
code assumed that a critical section exiting through spin_unlock_irq entered
the critical section with spin_lock_irq, but that might not always be the
case, and the instrumentation for hold time goes bad when that happens (as in
shrink_inactive_list). The instrumentation goes like this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	unsigned long long t1, t2;
	local_irq_disable();
	t1 = get_cycles_sync();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
	t2 = get_cycles_sync();
	lock->raw_lock.htsc = t2;
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}
...
void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
	unsigned long long t1;
	spin_release(&lock->dep_map, 1, _RET_IP_);
	t1 = get_cycles_sync();
	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
		lock->cs_time = t1 - lock->raw_lock.htsc;
	_raw_spin_unlock(lock);
	local_irq_enable();
	preempt_enable();
}
...

> OK, now we need to do a dump_stack() each time we discover a new max hold
> time. That might be a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky. Maybe make spin_unlock_irq() return
> the hold time then do:

What I found now after fixing the above is that the hold time is not bad --
249461 cycles on the 2.6 GHz Opteron with powernow disabled in the BIOS. The
spin time is still in the order of seconds. Hence this looks like a hardware
fairness issue. Attaching the instrumentation patch with this email.

Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h	2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h	2007-01-15 15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
 #include <asm/page.h>
 #include <asm/processor.h>
 
+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+	unsigned long long ret;
+	unsigned eax;
+	/* Don't do an additional sync on CPUs where we know
+	   RDTSC is already synchronous. */
+	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+		       "=a" (eax), "0" (1) : "ebx", "ecx", "edx", "memory");
+	rdtscll(ret);
+	return ret;
+}
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
 		"jle 3b\n\t"
 		"jmp 1b\n"
 		"2:\t" : "=m" (lock->slock) : : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 
 /*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
 		"jmp 4b\n"
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 #endif
 
@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
 
+	if (oldval)
+		lock->htsc = get_cycles_sync2();
 	return oldval > 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
+	unsigned long long t = get_cycles_sync2();
+	if (lock->hold_time < t - lock->htsc)
+		lock->hold_time = t - lock->htsc;
 	asm volatile("movl $1,%0" : "=m" (lock->slock) : : "memory");
 }

Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h	2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h	2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@
 
 typedef struct {
 	unsigned int slock;
+	unsigned long long hold_time;
+	unsigned long long htsc;
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED	{ 1, 0, 0 }
 
 typedef struct {
 	unsigned int lock;

Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h	2007-01-14 22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h	2007-01-14 22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do {							\
 # define spin_unlock(lock)	__raw_spin_unlock(&(lock)->raw_lock)
 # define read_unlock(lock)	__raw_read_unlock(&(lock)->raw_lock)
 # define
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > >{
> > > >	local_irq_disable();
> > > >>	rdtsc(t1);
> > > >	preempt_disable();
> > > >	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > >	_raw_spin_lock(lock);
> > > >>	rdtsc(t2);
> > > >	if (lock->spin_time < (t2 - t1))
> > > >		lock->spin_time = t2 - t1;
> > > >}
> > > >
> > > >On some runs, we found that the zone->lru_lock spun for 33 seconds or
> > > >more while the maximal CS time was 3 seconds or so.
> > >
> > > What is the "CS time"?
> >
> > Critical Section :). This is the maximal time interval I measured from
> > t2 above to the time point we release the spin lock. This is the hold
> > time I guess.
>
> By no means. The theory here is that CPUA is taking and releasing the
> lock at high frequency, but CPUB never manages to get in and take it. In
> which case the maximum-acquisition-time is much larger than the
> maximum-hold-time.
>
> I'd suggest that you use a similar trick to measure the maximum hold time:
> start the timer after we got the lock, stop it just before we release the
> lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> course...)

Well, that is exactly what I described above as CS time. The instrumentation
goes like this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	unsigned long long t1, t2;
	local_irq_disable();
	t1 = get_cycles_sync();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
	t2 = get_cycles_sync();
	lock->raw_lock.htsc = t2;
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}
...
void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
	unsigned long long t1;
	spin_release(&lock->dep_map, 1, _RET_IP_);
	t1 = get_cycles_sync();
	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
		lock->cs_time = t1 - lock->raw_lock.htsc;
	_raw_spin_unlock(lock);
	local_irq_enable();
	preempt_enable();
}

Am I missing something? Is this not what you just described? (The
synchronizing rdtsc might not really be required at all locations, but I
doubt it would contribute a significant fraction to the 33s spin time or even
the 3s hold time on a 2.6 GHz Opteron.)
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote: > On Fri, 12 Jan 2007 17:00:39 -0800 > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote: > > > But is > > lru_lock an issue is another question. > > I doubt it, although there might be changes we can make in there to > work around it. > > I tested with PAGEVEC_SIZE define to 62 and 126 -- no difference. I still notice the atrociously high spin times. Thanks, Kiran - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> >Hi,
> >We noticed high interrupt hold off times while running some memory
> >intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> >softlockups,
>
> [...]
>
> >We did not use any lock debugging options and used plain old rdtsc to
> >measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
> >this:
> >
> >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> >{
> >	local_irq_disable();
> >>	rdtsc(t1);
> >	preempt_disable();
> >	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> >	_raw_spin_lock(lock);
> >>	rdtsc(t2);
> >	if (lock->spin_time < (t2 - t1))
> >		lock->spin_time = t2 - t1;
> >}
> >
> >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> >while the maximal CS time was 3 seconds or so.
>
> What is the "CS time"?

Critical Section :). This is the maximal time interval I measured from t2
above to the time point we release the spin lock. This is the hold time I
guess.

> It would be interesting to know how long the maximal lru_lock *hold* time
> is, which could give us a better indication of whether it is a hardware
> problem.
>
> For example, if the maximum hold time is 10ms, then it might indicate a
> hardware fairness problem.

The maximal hold time was about 3s.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> Moreover, most atomic operations are to remote memory, which is also
> increasing the problem by making the atomic ops take longer. Typically
> mature NUMA systems have implemented hardware provisions that can deal with
> such high degrees of contention. If this is simply an SMP system that was
> turned into a NUMA box then this is a new hardware scenario for the
> engineers.

This is using HT as all AMD systems do, but this is one of the 8 socket
systems. I ran the same test on a 2 node Tyan AMD box, and did not notice
the atrocious spin times. It would be interesting to see how a 4 socket HT
box would fare. Unfortunately, I do not have access to one. If someone has
access to such a box, I can provide the test case and instrumentation
patches. It could very well be a hardware limitation in this case, which
means all the more reason to enable interrupts while spinning on spin locks.
But whether lru_lock is an issue is another question.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 11:46:22AM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> > The test was simple, we have 16 processes, each allocating 3.5G of memory
> > and touching each and every page and returning. Each of the processes is
> > bound to a node (socket), with the local node being the preferred node for
> > allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node). Each
> > socket has 4G of physical memory and there are two cores on each socket. On
> > start of the test, the machine becomes unresponsive after sometime and
> > prints out softlockup and OOM messages. We then found out the cause
> > for softlockups being the excessive spin times on zone->lru_lock. The fact
> > that spin_lock_irq disables interrupts while spinning made matters very bad.
> > We instrumented the spin_lock_irq code and found that the spin time on the
> > lru locks was in the order of a few seconds (tens of seconds at times) and
> > the hold time was comparatively lesser.
>
> So the issue is two processes contending on the zone lock for one node?
> You are overallocating the 4G node with two processes attempting to
> allocate 7.5GB? So we go off node for 3.5G of the allocation?

Yes.

> Does the system scale the right way if you stay within the bounds of node
> memory? I.e. allocate 1.5GB from each process?

Yes. We see problems only when we oversubscribe memory.

> Have you tried increasing the size of the per cpu caches in
> /proc/sys/vm/percpu_pagelist_fraction?

No, not yet. I can give it a try.

> > While the softlockups and the like went away by enabling interrupts during
> > spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and
> > hence warrants a discussion on lkml, hence this post. Are there any
> > plans/patches/ideas to address the spin time under such extreme conditions?
>
> Could this be a hardware problem? Some issue with atomic ops in the
> Sun hardware?

I think that is unlikely -- because when we do not oversubscribe memory, the
tests complete quickly without softlockups and the like. Peter has also
noticed this (presumably on different hardware). I would think this could
also be a locking unfairness (cpus of the same node getting the lock and
starving out other nodes) case under extreme contention.
High lock spin time for zone->lru_lock under extreme conditions
Hi,

We noticed high interrupt hold off times while running some memory intensive
tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed softlockups,
lost ticks and even wall time drifting (which is probably a bug in the
x86_64 timer subsystem).

The test was simple: we have 16 processes, each allocating 3.5G of memory
and touching each and every page and returning. Each of the processes is
bound to a node (socket), with the local node being the preferred node for
allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node). Each
socket has 4G of physical memory and there are two cores on each socket. On
start of the test, the machine becomes unresponsive after some time and
prints out softlockup and OOM messages. We then found out the cause for the
softlockups being the excessive spin times on zone->lru_lock. The fact that
spin_lock_irq disables interrupts while spinning made matters very bad. We
instrumented the spin_lock_irq code and found that the spin time on the lru
locks was in the order of a few seconds (tens of seconds at times) and the
hold time was comparatively lesser.

We did not use any lock debugging options and used plain old rdtsc to
measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	local_irq_disable();
>	rdtsc(t1);
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
>	rdtsc(t2);
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}

On some runs, we found that the zone->lru_lock spun for 33 seconds or more
while the maximal CS time was 3 seconds or so.

While the softlockups and the like went away by enabling interrupts during
spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 , Andi thought
maybe this is exposing a problem with zone->lru_locks and hence warrants a
discussion on lkml, hence this post. Are there any plans/patches/ideas to
address the spin time under such extreme conditions?

I will be happy to provide any additional information (config/dmesg/test
case) if needed.

Thanks,
Kiran
Re: + spin_lock_irq-enable-interrupts-while-spinning-i386-implementation.patch added to -mm tree
On Sun, Jan 07, 2007 at 01:06:58PM -0800, Ravikiran G Thirumalai wrote:
>
> Question is, now we have 2 versions of the spin_lock_irq implementation
> with CONFIG_PARAVIRT -- one with regular cli/sti and the other with
> virtualized CLI/STI -- sounds odd!

Sunday morning hangovers !! spin_lock_irq is not inlined, so there is just
one version even with CONFIG_PARAVIRT.

Thanks,
Kiran
Re: + spin_lock_irq-enable-interrupts-while-spinning-i386-implementation.patch added to -mm tree
On Sun, Jan 07, 2007 at 12:05:03PM -0800, Andrew Morton wrote:
> On Sun, 07 Jan 2007 05:24:45 -0800
> Daniel Walker <[EMAIL PROTECTED]> wrote:
>
> > Now it fails with CONFIG_PARAVIRT off .
> >
> > scripts/kconfig/conf -s arch/i386/Kconfig
> >   CHK     include/linux/version.h
> >   CHK     include/linux/compile.h
> >   CHK     include/linux/utsrelease.h
> >   UPD     include/linux/compile.h
> >   CC      arch/i386/kernel/asm-offsets.s
> > In file included from include/linux/spinlock.h:88,
> >                  from include/linux/module.h:10,
> >                  from include/linux/crypto.h:22,
> >                  from arch/i386/kernel/asm-offsets.c:8:
> > include/asm/spinlock.h: In function '__raw_spin_lock_irq':
> > include/asm/spinlock.h:100: error: expected string literal before
> > '__CLI_STI_INPUT_ARGS'
>
> bah.
>
> --- a/include/asm-i386/spinlock.h~spin_lock_irq-enable-interrupts-while-spinning-i386-implementation-fix-fix
> +++ a/include/asm-i386/spinlock.h
> @@ -14,6 +14,7 @@
>  #define STI_STRING	"sti"
>  #define CLI_STI_CLOBBERS
>  #define CLI_STI_INPUT_ARGS
> +#define __CLI_STI_INPUT_ARGS
>  #endif /* CONFIG_PARAVIRT */

Apologies for the broken patch and thanks for the fix. But, the above is
needed to fix the build even with CONFIG_PARAVIRT!!! Apparently because
arch/i386/mm/boot_ioremap.c undefs CONFIG_PARAVIRT. Question is, now we have
2 versions of the spin_lock_irq implementation with CONFIG_PARAVIRT -- one
with regular cli/sti and the other with virtualized CLI/STI -- sounds odd!

Thanks,
Kiran
Re: [rfc] [patch 1/2] spin_lock_irq: Enable interrupts while spinning -- preparatory patch
On Wed, Jan 03, 2007 at 12:16:35AM -0800, Andrew Morton wrote:
> On Tue, 2 Jan 2007 23:59:23 -0800
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > The following patches do just that. The first patch is preparatory in nature
> > and the second one changes the x86_64 implementation of spin_lock_irq.
> > Patch passed overnight runs of kernbench and dbench on 4 way x86_64 smp.
>
> The end result of this is, I think, that i386 enables irqs while spinning
> in spin_lock_irqsave() but not while spinning in spin_lock_irq(). And
> x86_64 does the opposite.

No, right now we have on mainline (non PREEMPT case):

                        i386                    x86_64
                        ----                    ------
spin_lock_irq           cli when spin           cli when spin
spin_lock_irqsave       spin with intr enabled  spin with intr enabled

The posted patchset changed this to:

                        i386                    x86_64
                        ----                    ------
spin_lock_irq           cli when spin           spin with intr enabled
spin_lock_irqsave       spin with intr enabled  spin with intr enabled

> Odd, isn't it?

Well, we just implemented the x86_64 part. Here goes the i386 part as well
for spin_lock_irq.
i386: Enable interrupts while spinning for a lock with spin_lock_irq

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc1/include/asm-i386/spinlock.h
===
--- linux-2.6.20-rc1.orig/include/asm-i386/spinlock.h	2006-12-28 17:18:32.142775000 -0800
+++ linux-2.6.20-rc1/include/asm-i386/spinlock.h	2007-01-03 10:18:32.243662000 -0800
@@ -82,7 +82,22 @@ static inline void __raw_spin_lock_flags
 		CLI_STI_INPUT_ARGS
 		: "memory" CLI_STI_CLOBBERS);
 }
-# define __raw_spin_lock_irq(lock)	__raw_spin_lock(lock)
+
+static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
+{
+	asm volatile("\n1:\t"
+		     LOCK_PREFIX " ; decb %0\n\t"
+		     "jns 3f\n"
+		     STI_STRING "\n"
+		     "2:\t"
+		     "rep;nop\n\t"
+		     "cmpb $0,%0\n\t"
+		     "jle 2b\n\t"
+		     CLI_STRING "\n"
+		     "jmp 1b\n"
+		     "3:\n\t"
+		     : "+m" (lock->slock) : : "memory");
+}
 #endif
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
[rfc] [patch 2/2] spin_lock_irq: Enable interrupts while spinning -- x86_64 implementation
Implement enabling of interrupts while spinning for a lock in spin_lock_irq.

Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc1/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc1.orig/include/asm-x86_64/spinlock.h	2006-12-28 17:18:32.142775000 -0800
+++ linux-2.6.20-rc1/include/asm-x86_64/spinlock.h	2006-12-29 14:05:04.012954000 -0800
@@ -63,7 +63,21 @@ static inline void __raw_spin_lock_flags
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
 }
-#define __raw_spin_lock_irq(lock)	__raw_spin_lock(lock)
+static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
+{
+	asm volatile(
+		"\n1:\t"
+		LOCK_PREFIX " ; decl %0\n\t"
+		"jns 2f\n"
+		"sti\n"		/* Enable interrupts during spin */
+		"3:\n"
+		"rep;nop\n\t"
+		"cmpl $0,%0\n\t"
+		"jle 3b\n\t"
+		"cli\n"
+		"jmp 1b\n"
+		"2:\t" : "+m" (lock->slock) : : "memory");
+}
 #endif
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
[rfc] [patch 1/2] spin_lock_irq: Enable interrupts while spinning -- preparatory patch
There seems to be no good reason for spin_lock_irq to disable interrupts while spinning. Zwane Mwaikambo had an implementation a couple of years ago, and the only objection seemed to be concerns about buggy code using spin_lock_irq while interrupts are disabled: http://lkml.org/lkml/2004/5/26/87 That shouldn't be a concern anymore. Besides, spin_lock_irqsave now enables interrupts while spinning. As to the motivation, on a Sun x4600 8 socket 16 core x86_64 NUMA box, we noticed softlockups and quite a few lost timer ticks under extreme memory pressure. The reason turned out to be that zone->lru_lock tends to get heavily contended, and NUMA nodes try to grab the locks using spin_lock_irq. Instrumentation showed us that interrupt hold-offs can last for a few seconds, and even the main timer interrupts get held off for long -- which is not good. Enabling interrupts while spinning for spinlocks made the machine responsive, and the softlockups/lost ticks went away. Although the scenario above was an extreme condition (very high memory pressure), I guess it still makes sense to enable interrupts while spinning for a lock. The following patches do just that. The first patch is preparatory in nature and the second one changes the x86_64 implementation of spin_lock_irq. The patch passed overnight runs of kernbench and dbench on a 4 way x86_64 smp box. Comments? Thanks, Kiran Preparatory patch to enable interrupts while spinning with spinlock irqs. Any arch which needs this feature just has to implement __raw_spin_lock_irq. Signed-off-by: Pravin B.
Shelar <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/include/asm-alpha/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-alpha/spinlock.h 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-alpha/spinlock.h 2006-12-28 17:18:32.132775000 -0800 @@ -13,6 +13,7 @@ */ #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) +#define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) #define __raw_spin_is_locked(x)((x)->lock != 0) #define __raw_spin_unlock_wait(x) \ do { cpu_relax(); } while ((x)->lock) Index: linux-2.6.20-rc1/include/asm-arm/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-arm/spinlock.h2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-arm/spinlock.h 2006-12-28 17:18:32.132775000 -0800 @@ -22,6 +22,7 @@ do { while (__raw_spin_is_locked(lock)) cpu_relax(); } while (0) #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) +#define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) static inline void __raw_spin_lock(raw_spinlock_t *lock) { Index: linux-2.6.20-rc1/include/asm-cris/arch-v32/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-cris/arch-v32/spinlock.h 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-cris/arch-v32/spinlock.h 2006-12-29 16:36:27.182954000 -0800 @@ -36,7 +36,7 @@ static inline void _raw_spin_lock_flags { _raw_spin_lock(lock); } - +#define __raw_spin_lock_irq(lock) _raw_spin_lock(lock) /* * Read-write spinlocks, allowing multiple readers * but only one writer. 
Index: linux-2.6.20-rc1/include/asm-i386/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-i386/spinlock.h 2006-12-21 14:34:33.871573917 -0800 +++ linux-2.6.20-rc1/include/asm-i386/spinlock.h2006-12-28 17:18:32.142775000 -0800 @@ -82,6 +82,7 @@ static inline void __raw_spin_lock_flags CLI_STI_INPUT_ARGS : "memory" CLI_STI_CLOBBERS); } +# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) #endif static inline int __raw_spin_trylock(raw_spinlock_t *lock) Index: linux-2.6.20-rc1/include/asm-ia64/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-ia64/spinlock.h 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-ia64/spinlock.h2006-12-28 17:18:32.142775000 -0800 @@ -87,7 +87,7 @@ __raw_spin_lock_flags (raw_spinlock_t *l } #define __raw_spin_lock(lock) __raw_spin_lock_flags(lock, 0) - +# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) /* Unlock by doing an ordered store and releasing the cacheline with nta */ static inline void __raw_spin_unlock(raw_spinlock_t *x) { barrier(); @@ -96,6 +96,7 @@ static inline void __raw_spin_unlock(raw #else /* !ASM_SUPPORTED */ #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) +# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) # define __raw_spin_lock(x) \ do {
Re: [rfc] [patch 1/2] spin_lock_irq: Enable interrupts while spinning -- preparatory patch
On Wed, Jan 03, 2007 at 12:16:35AM -0800, Andrew Morton wrote: > On Tue, 2 Jan 2007 23:59:23 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote: > > The following patches do just that. The first patch is preparatory in nature and the second one changes the x86_64 implementation of spin_lock_irq. Patch passed overnight runs of kernbench and dbench on 4 way x86_64 smp. > The end result of this is, I think, that i386 enables irqs while spinning in spin_lock_irqsave() but not while spinning in spin_lock_irq(). And x86_64 does the opposite. No, right now we have on mainline (non PREEMPT case):

                    i386                    x86_64
spin_lock_irq       cli when spin           cli when spin
spin_lock_irqsave   spin with intr enabled  spin with intr enabled

The posted patchset changed this to:

                    i386                    x86_64
spin_lock_irq       cli when spin           spin with intr enabled
spin_lock_irqsave   spin with intr enabled  spin with intr enabled

Odd, isn't it? Well, we just implemented the x86_64 part. Here goes the i386 part as well for spin_lock_irq. i386: Enable interrupts while spinning for a lock with spin_lock_irq Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/include/asm-i386/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-i386/spinlock.h 2006-12-28 17:18:32.142775000 -0800 +++ linux-2.6.20-rc1/include/asm-i386/spinlock.h2007-01-03 10:18:32.243662000 -0800 @@ -82,7 +82,22 @@ static inline void __raw_spin_lock_flags CLI_STI_INPUT_ARGS : "memory" CLI_STI_CLOBBERS); } -# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) + +static inline void __raw_spin_lock_irq(raw_spinlock_t *lock) +{ + asm volatile("\n1:\t" +LOCK_PREFIX " ; decb %0\n\t" +"jns 3f\n" +STI_STRING "\n" +"2:\t" +"rep;nop\n\t" +"cmpb $0,%0\n\t" +"jle 2b\n\t" +CLI_STRING "\n" +"jmp 1b\n" +"3:\n\t" +: "+m" (lock->slock) : : "memory"); +} #endif static inline int __raw_spin_trylock(raw_spinlock_t *lock)
[patch] x86: Fix dev_to_node for x86 and x86_64
Hi Andrew, dev_to_node() does not work as expected on x86 and x86_64, as pointed out earlier here: http://lkml.org/lkml/2006/11/7/10 The following patch fixes it, please apply. (Note: the fix depends on support for PCI domains for x86/x86_64.) Thanks, Kiran dev_to_node does not work as expected on x86_64 (and i386). This is because, with the current x86_64 code, the node value returned by pcibus_to_node is initialized only after the struct device is created. We need the node value initialized before the call to pci_scan_bus_parented, as the generic devices are allocated and initialized off pci_scan_child_bus, which gets called from pci_scan_bus_parented. The following patch does that using "pci_sysdata" introduced by the PCI domain patches in -mm. Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/arch/i386/pci/acpi.c === --- linux-2.6.20-rc1.orig/arch/i386/pci/acpi.c 2006-12-28 11:51:52.542775000 -0800 +++ linux-2.6.20-rc1/arch/i386/pci/acpi.c 2006-12-28 12:01:19.242775000 -0800 @@ -9,6 +9,7 @@ struct pci_bus * __devinit pci_acpi_scan { struct pci_bus *bus; struct pci_sysdata *sd; + int pxm; /* Allocate per-root-bus (not per bus) arch-specific data. * TODO: leak; this memory is never freed. 
@@ -30,15 +31,21 @@ struct pci_bus * __devinit pci_acpi_scan } #endif /* CONFIG_PCI_DOMAINS */ + sd->node = -1; + + pxm = acpi_get_pxm(device->handle); +#ifdef CONFIG_ACPI_NUMA + if (pxm >= 0) + sd->node = pxm_to_node(pxm); +#endif + bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd); if (!bus) kfree(sd); #ifdef CONFIG_ACPI_NUMA if (bus != NULL) { - int pxm = acpi_get_pxm(device->handle); if (pxm >= 0) { - sd->node = pxm_to_node(pxm); printk("bus %d -> pxm %d -> node %d\n", busnum, pxm, sd->node); }
[patch] mm: Set HASHDIST_DEFAULT to 1 for x86_64 NUMA
Enable system hashtable memory to be distributed among nodes on x86_64 NUMA. Forcing the kernel to use node interleaved vmalloc instead of bootmem for the system hashtable memory (alloc_large_system_hash) reduces the memory imbalance on node 0 by around 40MB on an 8 node x86_64 NUMA box.

Before the following patch, on bootup of an 8 node box:

Node 0 MemTotal:      3407488 kB
Node 0 MemFree:       3206296 kB
Node 0 MemUsed:        201192 kB
Node 0 Active:           7012 kB
Node 0 Inactive:          512 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:           0 kB
Node 0 FilePages:        1912 kB
Node 0 Mapped:            420 kB
Node 0 AnonPages:        5612 kB
Node 0 PageTables:        468 kB
Node 0 NFS_Unstable:        0 kB
Node 0 Bounce:              0 kB
Node 0 Slab:             5408 kB
Node 0 SReclaimable:      644 kB
Node 0 SUnreclaim:       4764 kB

After the patch (or using hashdist=1 on the kernel command line):

Node 0 MemTotal:      3407488 kB
Node 0 MemFree:       3247608 kB
Node 0 MemUsed:        159880 kB
Node 0 Active:           3012 kB
Node 0 Inactive:          616 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:           0 kB
Node 0 FilePages:        2424 kB
Node 0 Mapped:            380 kB
Node 0 AnonPages:        1200 kB
Node 0 PageTables:        396 kB
Node 0 NFS_Unstable:        0 kB
Node 0 Bounce:              0 kB
Node 0 Slab:             6304 kB
Node 0 SReclaimable:     1596 kB
Node 0 SUnreclaim:       4708 kB

I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA since x86_64 has no dearth of vmalloc space? Or maybe enable hash distribution for all 64bit NUMA arches? The following patch does it only for x86_64. Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/include/linux/bootmem.h === --- linux-2.6.20-rc1.orig/include/linux/bootmem.h 2006-12-21 14:34:36.321610875 -0800 +++ linux-2.6.20-rc1/include/linux/bootmem.h2006-12-26 15:55:04.501064560 -0800 @@ -122,9 +122,9 @@ extern void *alloc_large_system_hash(con #define HASH_EARLY 0x0001 /* Allocating during early boot? */ /* Only NUMA needs hash distribution. - * IA64 is known to have sufficient vmalloc space. 
+ * IA64 and x86_64 have sufficient vmalloc space. */ -#if defined(CONFIG_NUMA) && defined(CONFIG_IA64) +#if defined(CONFIG_NUMA) && (defined(CONFIG_IA64) || defined(CONFIG_X86_64)) #define HASHDIST_DEFAULT 1 #else #define HASHDIST_DEFAULT 0
[patch] x86_64: Fix boot hang due to nmi watchdog init code
2.6.19 stopped booting (or booted depending on build/config) on our x86_64 systems due to a bug introduced in 2.6.19. check_nmi_watchdog schedules an IPI on all cpus to busy wait on a flag, but fails to set the busywait flag if NMI functionality is disabled. This causes the secondary cpus to spin in an endless loop, causing the kernel bootup to hang. Depending upon the build, the busywait flag got overwritten (stack variable) and caused the kernel to boot up on certain builds. The following patch fixes the bug by setting the busywait flag before returning from check_nmi_watchdog. I guess using a stack variable is not good here, as the calling function could potentially return while the busy wait loop is still spinning on the flag. I would think this is a good candidate for 2.6.19 stable as well. Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.19/arch/x86_64/kernel/nmi.c === --- linux-2.6.19.orig/arch/x86_64/kernel/nmi.c 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.19/arch/x86_64/kernel/nmi.c 2006-12-04 18:02:42.462737000 -0800 @@ -210,9 +210,10 @@ static __init void nmi_cpu_busy(void *da } #endif +static int endflag = 0; + int __init check_nmi_watchdog (void) { - volatile int endflag = 0; int *counts; int cpu; @@ -253,6 +254,7 @@ int __init check_nmi_watchdog (void) if (!atomic_read(&nmi_active)) { kfree(counts); atomic_set(&nmi_active, -1); + endflag = 1; return -1; } endflag = 1;
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 11:19:24AM +0200, Jens Axboe wrote: > On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote: > > The following patchset breaks down the global ide_lock to per-hwgroup lock. > > We have taken the following approach. > Curious, what is the point of this? > On smp machines with multiple ide interfaces, we take a per-group lock instead of a global lock, thereby breaking the lock into per-irq hwgroup locks. Thanks, Kiran
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 06:09:10PM +0100, Alan Cox wrote: > On Maw, 2005-09-06 at 16:33 -0700, Ravikiran G Thirumalai wrote: > > 2. Change the core ide code to use hwgroup->lock instead of ide_lock. > > Deprecate ide_lock (patch 2) > > hwgroups and IDE locking requirements are frequently completely > unrelated. Its clear from the changes proposed you've not tested on real > hardware for each case and you have not studied the documented errata I tested it on a 2way box with a piix controller. It got through Bonnie++. I have access to piix controllers only, so that was the only controller I changed. I did not read the errata though... :( Do you think the approach is unsafe, even if the piix tune routine is serialized with a per-driver lock? Bartlomiej? Thanks, Kiran
Re: [patch 4/4] ide: Break ide_lock -- remove ide_lock from piix driver
On Wed, Sep 07, 2005 at 06:06:23PM +0100, Alan Cox wrote: > On Maw, 2005-09-06 at 16:44 -0700, Ravikiran G Thirumalai wrote: > > Patch to convert piix driver to use per-driver/hwgroup lock and kill > > ide_lock. In the case of piix, hwgroup->lock should be sufficient. > > PIIX requires that both channels are quiescent when retuning in some > cases. It wasn't totally safe before, its now totally broken. Start by Then the change to the piix controller in my patchset is bad. How about changing the ide_lock to a per-driver lock in this case? Locking for the rest of the controllers in the system is left equivalent to what ide_lock did earlier. > fixing the IDE layer locking properly (or forward porting my patches and > then fixing them for all the refcounting changes and other stuff done > since). Can you please point me to the patchset... Thanks, Kiran
Re: [patch 2/4] ide: Break ide_lock -- replace ide_lock with hwgroup->lock in core ide
On Tue, Sep 06, 2005 at 04:40:28PM -0700, Ravikiran G Thirumalai wrote: > Patch to convert ide core code to use hwgroup lock instead of a global > ide_lock. > > Index: linux-2.6.13/drivers/ide/ide-io.c > === > --- linux-2.6.13.orig/drivers/ide/ide-io.c2005-09-06 11:22:29.0 > -0700 > @@ -1211,11 +1214,11 @@ >*/ > if (masked_irq != IDE_NO_IRQ && hwif->irq != masked_irq) > disable_irq_nosync(hwif->irq); > - spin_unlock(&ide_lock); > + spin_unlock(&hwgroup->lock); > local_irq_enable(); > /* allow other IRQs while we start this request */ > startstop = start_request(drive, rq); > - spin_lock_irq(&ide_lock); > + spin_unlock_irq(&hwgroup->lock); ^^^ My bad, Fixed patch attached. Thanks, Kiran Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Signed-off-by: Vaibhav V. Nivargi <[EMAIL PROTECTED]> Signed-off-by: Alok N. Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/ide-cd.c === --- linux-2.6.13.orig/drivers/ide/ide-cd.c 2005-09-06 15:08:07.0 -0700 +++ linux-2.6.13/drivers/ide/ide-cd.c 2005-09-06 15:16:15.0 -0700 @@ -590,7 +590,8 @@ static void cdrom_end_request (ide_drive_t *drive, int uptodate) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int nsectors = rq->hard_cur_sectors; if ((rq->flags & REQ_SENSE) && uptodate) { @@ -612,10 +613,10 @@ /* * now end failed request */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); end_that_request_chunk(failed, 0, failed->data_len); end_that_request_last(failed); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); } cdrom_analyze_sense_data(drive, failed, sense); @@ -636,7 +637,8 @@ Returns 1 if the request was ended. 
*/ static int cdrom_decode_status(ide_drive_t *drive, int good_stat, int *stat_ret) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int stat, err, sense_key; /* Check for errors. */ @@ -698,10 +700,10 @@ * request sense has completed */ if (stat & ERR_STAT) { - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); cdrom_queue_request_sense(drive, rq->sense, rq); } else @@ -741,9 +743,9 @@ * take a breather relying on the * unplug timer to kick us again */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blk_plug_device(drive->queue); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return 1; } } @@ -839,6 +841,7 @@ ide_startstop_t startstop; struct cdrom_info *info = drive->driver_data; ide_hwif_t *hwif = drive->hwif; + ide_hwgroup_t *hwgroup = hwif->hwgroup; /* Wait for the controller to be idle. */ if (ide_wait_stat(&startstop, drive, 0, BUSY_STAT, WAIT_READY)) @@ -866,10 +869,10 @@ unsigned long flags; /* packet command */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); hwif->OUTBSYNC(drive, WIN_PACKETCMD, IDE_COMMAND_REG); ndelay(400); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return (*handler) (drive); } @@ -160
Re: [patch 2/4] ide: Break ide_lock -- replace ide_lock with hwgroup-lock in core ide
On Tue, Sep 06, 2005 at 04:40:28PM -0700, Ravikiran G Thirumalai wrote: Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Index: linux-2.6.13/drivers/ide/ide-io.c === --- linux-2.6.13.orig/drivers/ide/ide-io.c2005-09-06 11:22:29.0 -0700 @@ -1211,11 +1214,11 @@ */ if (masked_irq != IDE_NO_IRQ hwif-irq != masked_irq) disable_irq_nosync(hwif-irq); - spin_unlock(ide_lock); + spin_unlock(hwgroup-lock); local_irq_enable(); /* allow other IRQs while we start this request */ startstop = start_request(drive, rq); - spin_lock_irq(ide_lock); + spin_unlock_irq(hwgroup-lock); ^^^ My bad, Fixed patch attached. Thanks, Kiran Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Signed-off-by: Vaibhav V. Nivargi [EMAIL PROTECTED] Signed-off-by: Alok N. Kataria [EMAIL PROTECTED] Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED] Signed-off-by: Shai Fultheim [EMAIL PROTECTED] Index: linux-2.6.13/drivers/ide/ide-cd.c === --- linux-2.6.13.orig/drivers/ide/ide-cd.c 2005-09-06 15:08:07.0 -0700 +++ linux-2.6.13/drivers/ide/ide-cd.c 2005-09-06 15:16:15.0 -0700 @@ -590,7 +590,8 @@ static void cdrom_end_request (ide_drive_t *drive, int uptodate) { - struct request *rq = HWGROUP(drive)-rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup-rq; int nsectors = rq-hard_cur_sectors; if ((rq-flags REQ_SENSE) uptodate) { @@ -612,10 +613,10 @@ /* * now end failed request */ - spin_lock_irqsave(ide_lock, flags); + spin_lock_irqsave(hwgroup-lock, flags); end_that_request_chunk(failed, 0, failed-data_len); end_that_request_last(failed); - spin_unlock_irqrestore(ide_lock, flags); + spin_unlock_irqrestore(hwgroup-lock, flags); } cdrom_analyze_sense_data(drive, failed, sense); @@ -636,7 +637,8 @@ Returns 1 if the request was ended. 
*/ static int cdrom_decode_status(ide_drive_t *drive, int good_stat, int *stat_ret) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int stat, err, sense_key; /* Check for errors. */ @@ -698,10 +700,10 @@ * request sense has completed */ if (stat & ERR_STAT) { - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); cdrom_queue_request_sense(drive, rq->sense, rq); } else @@ -741,9 +743,9 @@ * take a breather relying on the * unplug timer to kick us again */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blk_plug_device(drive->queue); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return 1; } } @@ -839,6 +841,7 @@ ide_startstop_t startstop; struct cdrom_info *info = drive->driver_data; ide_hwif_t *hwif = drive->hwif; + ide_hwgroup_t *hwgroup = hwif->hwgroup; /* Wait for the controller to be idle. */ if (ide_wait_stat(&startstop, drive, 0, BUSY_STAT, WAIT_READY)) @@ -866,10 +869,10 @@ unsigned long flags; /* packet command */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); hwif->OUTBSYNC(drive, WIN_PACKETCMD, IDE_COMMAND_REG); ndelay(400); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return (*handler) (drive); } @@ -1609,7 +1612,8 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive) { struct cdrom_info *info = drive
Re: [patch 4/4] ide: Break ide_lock -- remove ide_lock from piix driver
On Wed, Sep 07, 2005 at 06:06:23PM +0100, Alan Cox wrote: >On Maw, 2005-09-06 at 16:44 -0700, Ravikiran G Thirumalai wrote: >> Patch to convert piix driver to use per-driver/hwgroup lock and kill ide_lock. In the case of piix, hwgroup->lock should be sufficient. > >PIIX requires that both channels are quiescent when retuning in some cases. It wasn't totally safe before, it's now totally broken. Start by fixing the IDE layer locking properly (or forward porting my patches and then fixing them for all the refcounting changes and other stuff done since). Then the change to the piix controller in my patchset is bad. How about changing ide_lock to a per-driver lock in this case? Locking for the rest of the controllers in the system is left equivalent to what ide_lock did earlier. Can you please point me to the patchset? Thanks, Kiran - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 06:09:10PM +0100, Alan Cox wrote: >On Maw, 2005-09-06 at 16:33 -0700, Ravikiran G Thirumalai wrote: >> 2. Change the core ide code to use hwgroup->lock instead of ide_lock. Deprecate ide_lock (patch 2) > >hwgroups and IDE locking requirements are frequently completely unrelated. It's clear from the changes proposed you've not tested on real hardware for each case and you have not studied the documented errata. I tested it on a 2-way box with a piix controller. It got through Bonnie++. I have access to piix controllers only, so that was the only controller I changed. I did not read the errata though... :( Do you think the approach is unsafe, even if the piix tune routine is serialized with a per-driver lock? Bartlomiej? Thanks, Kiran
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 11:19:24AM +0200, Jens Axboe wrote: >On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote: >> The following patchset breaks down the global ide_lock to per-hwgroup lock. We have taken the following approach. > >Curious, what is the point of this? On smp machines with multiple ide interfaces, we take a per-hwgroup lock instead of a global lock, thereby breaking the lock up into per-irq hwgroups. Thanks, Kiran
[patch 3/4] ide: Break ide_lock -- change controller drivers
Patch to make ide-host controllers use hwgroup lock where serialization with hwgroup->lock is necessary. Signed-off-by: Vaibhav V. Nivargi <[EMAIL PROTECTED]> Signed-off-by: Alok N. Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/legacy/ht6560b.c === --- linux-2.6.13.orig/drivers/ide/legacy/ht6560b.c 2005-08-28 19:41:01.0 -0400 +++ linux-2.6.13/drivers/ide/legacy/ht6560b.c 2005-09-06 15:57:46.113541000 -0400 @@ -256,9 +256,11 @@ { unsigned long flags; int t = HT_PREFETCH_MODE << 8; - - spin_lock_irqsave(&ide_lock, flags); - + ide_hwgroup_t *hwgroup = HWGROUP(drive); + + spin_lock_irqsave(&hwgroup->lock, flags); + spin_lock(&ide_lock); + /* * Prefetch mode and unmask irq seems to conflict */ @@ -270,9 +272,10 @@ drive->drive_data &= ~t; /* disable prefetch mode */ drive->no_unmask = 0; } - - spin_unlock_irqrestore(&ide_lock, flags); - + + spin_unlock(&ide_lock); + spin_unlock_irqrestore(&hwgroup->lock, flags); + #ifdef DEBUG printk("ht6560b: drive %s prefetch mode %sabled\n", drive->name, (state ?
"en" : "dis")); #endif @@ -282,6 +285,7 @@ { unsigned long flags; u8 timing; + ide_hwgroup_t *hwgroup = HWGROUP(drive); switch (pio) { case 8: /* set prefetch off */ @@ -291,14 +295,15 @@ } timing = ht_pio2timings(drive, pio); - - spin_lock_irqsave(_lock, flags); + + spin_lock_irqsave(>lock, flags); + spin_lock(_lock); drive->drive_data &= 0xff00; drive->drive_data |= timing; - spin_unlock_irqrestore(_lock, flags); - + spin_unlock(_lock); + spin_unlock_irqrestore(>lock, flags); #ifdef DEBUG printk("ht6560b: drive %s tuned to pio mode %#x timing=%#x\n", drive->name, pio, timing); #endif Index: linux-2.6.13/drivers/ide/pci/cmd640.c === --- linux-2.6.13.orig/drivers/ide/pci/cmd640.c 2005-08-28 19:41:01.0 -0400 +++ linux-2.6.13/drivers/ide/pci/cmd640.c 2005-09-06 15:50:35.330618750 -0400 @@ -442,11 +442,14 @@ static void set_prefetch_mode (unsigned int index, int mode) { ide_drive_t *drive = cmd_drives[index]; + ide_hwgroup_t *hwgroup = HWGROUP(drive); int reg = prefetch_regs[index]; u8 b; unsigned long flags; - spin_lock_irqsave(_lock, flags); + spin_lock_irqsave(>lock, flags); + spin_lock(_lock); + b = __get_cmd640_reg(reg); if (mode) { /* want prefetch on? */ #if CMD640_PREFETCH_MASKS @@ -462,7 +465,9 @@ b |= prefetch_masks[index]; /* disable prefetch */ } __put_cmd640_reg(reg, b); - spin_unlock_irqrestore(_lock, flags); + + spin_unlock(_lock); + spin_unlock_irqrestore(>lock, flags); } /* Index: linux-2.6.13/drivers/ide/pci/piix.c === --- linux-2.6.13.orig/drivers/ide/pci/piix.c2005-09-06 15:49:48.271677750 -0400 +++ linux-2.6.13/drivers/ide/pci/piix.c 2005-09-06 15:56:55.982408000 -0400 @@ -215,6 +215,7 @@ { ide_hwif_t *hwif= HWIF(drive); struct pci_dev *dev = hwif->pci_dev; + ide_hwgroup_t *hwgroup = HWGROUP(drive); int is_slave= (>drives[1] == drive); int master_port = hwif->channel ? 
0x42 : 0x40; int slave_port = 0x44; @@ -229,7 +230,8 @@ { 2, 3 }, }; pio = ide_get_best_pio_mode(drive, pio, 5, NULL); - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); + spin_lock(&ide_lock); pci_read_config_word(dev, master_port, &master_data); if (is_slave) { master_data = master_data | 0x4000; @@ -249,7 +251,8 @@ pci_write_config_word(dev, master_port, master_data); if (is_slave) pci_write_config_byte(dev, slave_port, slave_data); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock(&ide_lock); + spin_unlock_irqrestore(&hwgroup->lock, flags); } /**
[patch 4/4] ide: Break ide_lock -- remove ide_lock from piix driver
Patch to convert piix driver to use per-driver/hwgroup lock and kill ide_lock. In the case of piix, hwgroup->lock should be sufficient. Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/pci/piix.c === --- linux-2.6.13.orig/drivers/ide/pci/piix.c 2005-09-06 12:00:25.0 -0700 +++ linux-2.6.13/drivers/ide/pci/piix.c 2005-09-06 13:22:49.0 -0700 @@ -231,7 +231,6 @@ pio = ide_get_best_pio_mode(drive, pio, 5, NULL); spin_lock_irqsave(&hwgroup->lock, flags); - spin_lock(&ide_lock); pci_read_config_word(dev, master_port, &master_data); if (is_slave) { master_data = master_data | 0x4000; @@ -251,7 +250,6 @@ pci_write_config_word(dev, master_port, master_data); if (is_slave) pci_write_config_byte(dev, slave_port, slave_data); - spin_unlock(&ide_lock); spin_unlock_irqrestore(&hwgroup->lock, flags); }
[patch 2/4] ide: Break ide_lock -- replace ide_lock with hwgroup->lock in core ide
Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Signed-off-by: Vaibhav V. Nivargi <[EMAIL PROTECTED]> Signed-off-by: Alok N. Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/ide-cd.c === --- linux-2.6.13.orig/drivers/ide/ide-cd.c 2005-09-06 11:22:29.0 -0700 +++ linux-2.6.13/drivers/ide/ide-cd.c 2005-09-06 11:34:43.0 -0700 @@ -590,7 +590,8 @@ static void cdrom_end_request (ide_drive_t *drive, int uptodate) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int nsectors = rq->hard_cur_sectors; if ((rq->flags & REQ_SENSE) && uptodate) { @@ -612,10 +613,10 @@ /* * now end failed request */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); end_that_request_chunk(failed, 0, failed->data_len); end_that_request_last(failed); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); } cdrom_analyze_sense_data(drive, failed, sense); @@ -636,7 +637,8 @@ Returns 1 if the request was ended.
*/ @@ -698,10 +700,10 @@ * request sense has completed */ if (stat & ERR_STAT) { - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); cdrom_queue_request_sense(drive, rq->sense, rq); } else @@ -741,9 +743,9 @@ * take a breather relying on the * unplug timer to kick us again */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blk_plug_device(drive->queue); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return 1; } } @@ -839,6 +841,7 @@ ide_startstop_t startstop; struct cdrom_info *info = drive->driver_data; ide_hwif_t *hwif = drive->hwif; + ide_hwgroup_t *hwgroup = hwif->hwgroup; /* Wait for the controller to be idle. */ if (ide_wait_stat(&startstop, drive, 0, BUSY_STAT, WAIT_READY)) @@ -866,10 +869,10 @@ unsigned long flags; /* packet command */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); hwif->OUTBSYNC(drive, WIN_PACKETCMD, IDE_COMMAND_REG); ndelay(400); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return (*handler) (drive); } @@ -1609,7 +1612,8 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive) { struct cdrom_info *info = drive->driver_data; - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int dma_error, dma, stat, ireason, len, thislen; u8 lowcyl, highcyl; xfer_func_t *xferfunc; @@ -1737,11 +1741,11 @@ if (!rq->data_len) post_transform_command(rq); - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); end_that_request_last(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return ide_stopped; } Index: linux-2.6.13/drivers/ide/ide-disk.c === --- linux-2.6.13.orig/drivers/ide/ide-disk.c 2005-09-06 11:22:29.0 -0700 +++
linux-2.6.13/drivers/ide/ide-disk.c 2005-09-06 11:34:43.0 -0700 @@ -786,11 +786,12 @@ static int
[patch 1/4] ide: Break ide_lock -- Move hwif tuning code after hwif_init
Following patch moves the hwif tuning code from probe_hwif to ideprobe_init, after ideprobe_init calls hwif_init, so that all hwifs have associated hwgroups. With this patch, we should always have hwgroups for hwifs during calls to the drive tune routines. Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/ide-probe.c === --- linux-2.6.13.orig/drivers/ide/ide-probe.c 2005-09-06 11:22:29.0 -0700 +++ linux-2.6.13/drivers/ide/ide-probe.c 2005-09-06 15:12:58.0 -0700 @@ -852,8 +852,15 @@ if (!hwif->present) { ide_hwif_release_regions(hwif); - return; } +} + +static void ide_tune_drives(ide_hwif_t *hwif) +{ + unsigned int unit; + + if (!hwif->present) + return; for (unit = 0; unit < MAX_DRIVES; ++unit) { ide_drive_t *drive = &hwif->drives[unit]; @@ -1443,9 +1450,12 @@ for (index = 0; index < MAX_HWIFS; ++index) if (probe[index]) probe_hwif(&ide_hwifs[index]); - for (index = 0; index < MAX_HWIFS; ++index) - if (probe[index]) + for (index = 0; index < MAX_HWIFS; ++index) { + if (probe[index]) { hwif_init(&ide_hwifs[index]); + ide_tune_drives(&ide_hwifs[index]); + } + } for (index = 0; index < MAX_HWIFS; ++index) { if (probe[index]) { ide_hwif_t *hwif = &ide_hwifs[index];
[patch 0/4] ide: Break ide_lock to per-hwgroup lock
The following patchset breaks down the global ide_lock to a per-hwgroup lock. We have taken the following approach. 1. Move the hwif tuning code from probe_hwif to ideprobe_init, after hwif_init, so that hwgroups are present for all the hwifs when the tune routines for the hwifs are invoked (patch 1) 2. Change the core ide code to use hwgroup->lock instead of ide_lock. Deprecate ide_lock (patch 2) 3. Change the host controllers to use hwgroup->lock (patch 3) 4. Change host controller drivers to use a per-driver lock instead of ide_lock where needed, or hwgroup->lock, on a case-by-case basis. This can be done incrementally for various controllers and we will have working code between patches -- this is done now for the piix controller only. Eventually, we can change all controllers to remove ide_lock. Thanks to Bartlomiej for comments and suggestions. Patchset follows. Patchset tested on an smp box with a piix controller. Thanks, Kiran
[patch] Additions to .data.read_mostly section
Following patch moves a few static 'read mostly' variables to the .data.read_mostly section. Typically these are vector->irq tables, boot_cpu_data, node_maps etc., which are initialized once, read from often, and rarely written to. Please include. Thanks, Kiran Patch to mark variables which are usually accessed for reads with __read_mostly. Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13-rc6/arch/i386/kernel/io_apic.c === --- linux-2.6.13-rc6.orig/arch/i386/kernel/io_apic.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/kernel/io_apic.c 2005-08-23 15:09:22.0 -0700 @@ -77,7 +77,7 @@ int apic, pin, next; } irq_2_pin[PIN_MAP_SIZE]; -int vector_irq[NR_VECTORS] = { [0 ... NR_VECTORS - 1] = -1}; +int vector_irq[NR_VECTORS] __read_mostly = { [0 ... NR_VECTORS - 1] = -1}; #ifdef CONFIG_PCI_MSI #define vector_to_irq(vector) \ (platform_legacy_irq(vector) ? vector : vector_irq[vector]) @@ -1127,7 +1127,7 @@ } /* irq_vectors is indexed by the sum of all RTEs in all I/O APICs. */ -u8 irq_vector[NR_IRQ_VECTORS] = { FIRST_DEVICE_VECTOR , 0 }; +u8 irq_vector[NR_IRQ_VECTORS] __read_mostly = { FIRST_DEVICE_VECTOR , 0 }; int assign_irq_vector(int irq) { @@ -1992,7 +1992,7 @@ * edge-triggered handler, without risking IRQ storms and other ugly * races.
*/ -static struct hw_interrupt_type ioapic_edge_type = { +static struct hw_interrupt_type ioapic_edge_type __read_mostly = { .typename = "IO-APIC-edge", .startup = startup_edge_ioapic, .shutdown = shutdown_edge_ioapic, @@ -2003,7 +2003,7 @@ .set_affinity = set_ioapic_affinity, }; -static struct hw_interrupt_type ioapic_level_type = { +static struct hw_interrupt_type ioapic_level_type __read_mostly = { .typename = "IO-APIC-level", .startup = startup_level_ioapic, .shutdown = shutdown_level_ioapic, @@ -2074,7 +2074,7 @@ static void end_lapic_irq (unsigned int i) { /* nothing */ } -static struct hw_interrupt_type lapic_irq_type = { +static struct hw_interrupt_type lapic_irq_type __read_mostly = { .typename = "local-APIC-edge", .startup = NULL, /* startup_irq() not used for IRQ0 */ .shutdown = NULL, /* shutdown_irq() not used for IRQ0 */ Index: linux-2.6.13-rc6/arch/i386/kernel/setup.c === --- linux-2.6.13-rc6.orig/arch/i386/kernel/setup.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/kernel/setup.c 2005-08-23 15:09:22.0 -0700 @@ -82,7 +82,7 @@ /* cpu data as detected by the assembly code in head.S */ struct cpuinfo_x86 new_cpu_data __initdata = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; /* common cpu data for all cpus */ -struct cpuinfo_x86 boot_cpu_data = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; +struct cpuinfo_x86 boot_cpu_data __read_mostly = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; EXPORT_SYMBOL(boot_cpu_data); unsigned long mmu_cr4_features; Index: linux-2.6.13-rc6/arch/i386/kernel/timers/timer_hpet.c === --- linux-2.6.13-rc6.orig/arch/i386/kernel/timers/timer_hpet.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/kernel/timers/timer_hpet.c 2005-08-23 15:09:22.0 -0700 @@ -18,8 +18,8 @@ #include "mach_timer.h" #include <asm/hpet.h> -static unsigned long __read_mostly hpet_usec_quotient; /* convert hpet clks to usec */ -static unsigned long tsc_hpet_quotient; /* convert tsc to hpet clks */ +static unsigned long hpet_usec_quotient __read_mostly; /* convert hpet clks to usec */ +static
unsigned long tsc_hpet_quotient __read_mostly; /* convert tsc to hpet clks */ static unsigned long hpet_last; /* hpet counter value at last tick */ static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */ static unsigned long last_tsc_high; /* msb 32 bits of Time Stamp Counter */ Index: linux-2.6.13-rc6/arch/i386/mm/discontig.c === --- linux-2.6.13-rc6.orig/arch/i386/mm/discontig.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/mm/discontig.c 2005-08-23 15:09:22.0 -0700 @@ -37,7 +37,7 @@ #include <asm/mmzone.h> #include <bios_ebda.h> -struct pglist_data *node_data[MAX_NUMNODES]; +struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); bootmem_data_t node0_bdata; @@ -49,8 +49,8 @@ * 2) node_start_pfn - the starting page frame number for a node * 3) node_end_pfn - the ending page fram number for a node */ -unsigned long node_start_pfn[MAX_NUMNODES]; -unsigned long node_end_pfn[MAX_NUMNODES]; +unsigned long node_start_pfn[MAX_NUMNODES]
[patch] ide: fix kmalloc_node breakage in ide driver
Machines with ide-interfaces which do not have PCI devices are crashing
on boot at pcibus_to_node in the ide drivers. We noticed this on an x445
running 2.6.13-rc4. A similar issue was discussed earlier, but that crash
was due to hwif being NULL:
http://marc.theaimsgroup.com/?t=11207535203&r=1&w=2

Andi and Christoph had patches, but neither went in. Here's one of those
patches with an added BUG_ON(hwif == NULL). Please include.

Thanks,
Kiran

Patch fixes oops caused by ide interfaces not on PCI; pcibus_to_node
causes the kernel to crash otherwise. Patch also adds a BUG_ON to check
if hwif is NULL.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.13-rc1/drivers/ide/ide-probe.c
===================================================================
--- linux-2.6.13-rc1.orig/drivers/ide/ide-probe.c	2005-06-29 20:06:53.0 -0400
+++ linux-2.6.13-rc1/drivers/ide/ide-probe.c	2005-08-02 10:09:20.930965408 -0400
@@ -960,6 +960,15 @@
 }
 #endif	/* MAX_HWIFS > 1 */
 
+static inline int hwif_to_node(ide_hwif_t *hwif)
+{
+	if (hwif->pci_dev)
+		return pcibus_to_node(hwif->pci_dev->bus);
+	else
+		/* Add ways to determine the node of other busses here */
+		return -1;
+}
+
 /*
  * init request queue
  */
@@ -978,8 +987,7 @@
 	 * do not.
 	 */
-	q = blk_init_queue_node(do_ide_request, &ide_lock,
-			pcibus_to_node(drive->hwif->pci_dev->bus));
+	q = blk_init_queue_node(do_ide_request, &ide_lock, hwif_to_node(hwif));
 	if (!q)
 		return 1;
 
@@ -1048,6 +1056,8 @@
 	BUG_ON(in_interrupt());
 	BUG_ON(irqs_disabled());
+	BUG_ON(hwif == NULL);
+
 	down(&ide_cfg_sem);
 	hwif->hwgroup = NULL;
 #if MAX_HWIFS > 1
@@ -1097,7 +1107,7 @@
 		spin_unlock_irq(&ide_lock);
 	} else {
 		hwgroup = kmalloc_node(sizeof(ide_hwgroup_t), GFP_KERNEL,
-			pcibus_to_node(hwif->drives[0].hwif->pci_dev->bus));
+			hwif_to_node(hwif->drives[0].hwif));
 		if (!hwgroup)
 			goto out_up;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [patch] mm: Ensure proper alignment for node_remap_start_pfn
On Thu, Jul 28, 2005 at 10:20:26AM -0700, Dave Hansen wrote:
> On Wed, 2005-07-27 at 18:31 -0700, Ravikiran G Thirumalai wrote:
> > On Wed, Jul 27, 2005 at 06:17:24PM -0700, Andrew Morton wrote:
> > > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > Yes, it does cause a crash.
>
> I don't know of any NUMA x86 sub-arches that have nodes which are
> aligned on any less than 2MB.  Is this an architecture that's supported
> in the tree, today?

SRAT need not guarantee any alignment at all in the memory affinity
structure (the address is a 64-bit byte address). And yes, there are
x86-numa machines that run the latest kernel tree and face this problem.

Thanks,
Kiran
Re: [patch] x86_64: fix cpu_to_node setup for sparse apic_ids
On Wed, Jul 27, 2005 at 06:24:45PM -0700, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > While booting with SMT disabled in bios, when using acpi srat to setup
> > cpu_to_node[], sparse apic_ids create problems.  Here's a fix for that.
>
> Again, I don't have enough info here to judge the urgency of this patch.
> What are the consequences and risks of not having this patch in 2.6.13, and
> to how many machines?

Without this patch, Intel x86_64 boxes with hyperthreading disabled in
the BIOS (and which rely on SRAT for NUMA setup) end up having incorrect
values in the cpu_to_node[] arrays, causing sched domains to be built
incorrectly, etc.

Thanks,
Kiran
Re: [patch] mm: Ensure proper alignment for node_remap_start_pfn
On Wed, Jul 27, 2005 at 06:17:24PM -0700, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > While reserving KVA for lmem_maps of node, we have to make sure that
> > node_remap_start_pfn[] is aligned to a proper pmd boundary.
> > (node_remap_start_pfn[] gets its value from node_end_pfn[])
>
> What are the effects of not having this patch applied?  Does someone's
> computer crash, or what?

Yes, it does cause a crash.

Thanks,
Kiran
[patch] x86_64: fix cpu_to_node setup for sparse apic_ids
While booting with SMT disabled in bios, when using acpi srat to setup
cpu_to_node[], sparse apic_ids create problems. Here's a fix for that.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.13-rc3/arch/x86_64/mm/srat.c
===================================================================
--- linux-2.6.13-rc3.orig/arch/x86_64/mm/srat.c	2005-06-17 12:48:29.0 -0700
+++ linux-2.6.13-rc3/arch/x86_64/mm/srat.c	2005-07-27 15:36:23.0 -0700
@@ -20,6 +20,9 @@
 
 static struct acpi_table_slit *acpi_slit;
 
+/* Internal processor count */
+static unsigned int __initdata num_processors = 0;
+
 static nodemask_t nodes_parsed __initdata;
 static nodemask_t nodes_found __initdata;
 static struct node nodes[MAX_NUMNODES] __initdata;
@@ -101,16 +104,18 @@
 		bad_srat();
 		return;
 	}
-	if (pa->apic_id >= NR_CPUS) {
-		printk(KERN_ERR "SRAT: lapic %u too large.\n",
-			pa->apic_id);
+	if (num_processors >= NR_CPUS) {
+		printk(KERN_ERR "SRAT: Processor #%d (lapic %u) INVALID. (Max ID: %d).\n",
+			num_processors, pa->apic_id, NR_CPUS);
 		bad_srat();
 		return;
 	}
-	cpu_to_node[pa->apic_id] = node;
+	cpu_to_node[num_processors] = node;
 	acpi_numa = 1;
-	printk(KERN_INFO "SRAT: PXM %u -> APIC %u -> Node %u\n",
-		pxm, pa->apic_id, node);
+	printk(KERN_INFO "SRAT: PXM %u -> APIC %u -> CPU %u -> Node %u\n",
+		pxm, pa->apic_id, num_processors, node);
+
+	num_processors++;
 }
 
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */