Re: [patch] Change softlockup trigger limit using a kernel parameter
On Wed, Jul 18, 2007 at 11:09:07PM -0700, Andrew Morton wrote:
>On Wed, 18 Jul 2007 22:41:21 -0700 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>>
>> On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
>> > On Mon, 16 Jul 2007 15:26:50 -0700
>> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>> >
>> > > Kernel warns of softlockups if the softlockup thread is not able to run
>> > > on a CPU for 10s. It is useful to lower the softlockup warning
>> > > threshold in testing environments to catch potential lockups early.
>> > > The following patch adds a kernel parameter 'softlockup_lim' to control
>> > > the softlockup threshold.
>> >
>> > Why not make it tunable at runtime?
>>
>> Sure! Like a sysctl?

Hi Andrew,

Here's another take, incorporating all of your comments.

Thanks,
Kiran

Control the trigger limit for softlockup warnings. This is useful for
debugging softlockups: lowering softlockup_thresh identifies possible
softlockups earlier.

This patch:
1. Adds a sysctl softlockup_thresh with valid values of 1-60s
   (a higher value can be used to avoid false positives)
2. Changes the softlockup printk to print the cpu softlockup time

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
---
Patch applies on top of Ingo's softlockup patches.

 Documentation/sysctl/kernel.txt |    8 ++++++++
 include/linux/sched.h           |    1 +
 kernel/softlockup.c             |    8 +++---
 kernel/sysctl.c                 |   25 +++++++++++++++-----
 4 files changed, 33 insertions(+), 9 deletions(-)

Index: linux-2.6.22/kernel/softlockup.c
===================================================================
--- linux-2.6.22.orig/kernel/softlockup.c	2007-07-18 11:15:18.506614500 -0700
+++ linux-2.6.22/kernel/softlockup.c	2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int softlock_panic(struct notifier_block *this, unsigned long event,
 			  void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
 		wake_up_process(per_cpu(watchdog_task, this_cpu));
 
 	/* Warn about unreasonable 10+ seconds delays: */
-	if (now <= (touch_timestamp + 10))
+	if (now <= (touch_timestamp + softlockup_thresh))
 		return;
 
 	regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
 	per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
 	spin_lock(&print_lock);
-	printk(KERN_ERR "BUG: soft lockup detected on CPU#%d! [%s:%d]\n",
-		this_cpu, current->comm, current->pid);
+	printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
+		this_cpu, now - touch_timestamp,
+		current->comm, current->pid);
 	if (regs)
 		show_regs(regs);
 	else

Index: linux-2.6.22/kernel/sysctl.c
===================================================================
--- linux-2.6.22.orig/kernel/sysctl.c	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c	2007-07-19 13:42:50.275478000 -0700
@@ -79,6 +79,12 @@ extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
 
+/* Constants used for minimum and maximum */
+static int one = 1;
+static int zero;
+static int sixty = 60;
+static int one_hundred = 100;
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -615,16 +621,23 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "softlockup_thresh",
+		.data		= &softlockup_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+		.extra2		= &sixty,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
-/* Constants for minimum and maximum testing in vm_table.
-   We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
 static ctl_table vm_table[] = {
 	{
 		.ctl_name	= VM_OVERCOMMIT_MEMORY,

Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.22.orig/Documentation/sysctl/kernel.txt	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/Documentation/sysctl/kernel.txt	2007-07-19 13:42:09.748945250 -0700
@@ -320,6 +320,14 @@ kernel. This value defaults to SHMMAX.
 ==
+softlockup_thresh:
+
+This value can be used to lower the softlockup
Re: [patch] Change softlockup trigger limit using a kernel parameter
On Thu, Jul 19, 2007 at 11:51:14AM -0700, Jeremy Fitzhardinge wrote:
>Ingo Molnar wrote:
>> just in case someone sees false positives and wants to turn it off.
>
>Why not make 0=off?

A patch to disable softlockup during boot already went in:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=97842216b8400fe9d1a20468959e2989180f8f79

It uses a kernel boot parameter to disable softlockup, not exactly
disabling softlockup at run time though. Do we still need 0=off?

Thanks,
Kiran
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] Change softlockup trigger limit using a kernel parameter
On Thu, Jul 19, 2007 at 11:11:42AM +0200, Ingo Molnar wrote:
>
>* Andrew Morton <[EMAIL PROTECTED]> wrote:
>
>> > +softlockup_thresh:
>> > +
>> > +This value can be used to lower the softlockup tolerance
>> > +threshold. The default threshold is 10s. If a cpu is locked up
>> > +for 10s, the kernel complains. Valid values are 1-10s.
>> > +
>>
>> neato.
>
>please make sure this is applied after the softlockup watchdog patches i
>already did. (in particular the cleanup patch, which this could interact
>with)
>
>also, i think the valid threshold should be between 1 and 60 seconds.

60 seconds! Is that not a pretty high threshold? The reason for lowering
the tolerance threshold from 10s is to catch bugs early in lab
environments, but why do we need to raise the tolerance thresh beyond
10s?

Thanks,
Kiran
Re: [patch] Change softlockup trigger limit using a kernel parameter
On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
> On Mon, 16 Jul 2007 15:26:50 -0700
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > Kernel warns of softlockups if the softlockup thread is not able to run
> > on a CPU for 10s. It is useful to lower the softlockup warning
> > threshold in testing environments to catch potential lockups early.
> > The following patch adds a kernel parameter 'softlockup_lim' to control
> > the softlockup threshold.
>
> Why not make it tunable at runtime?

Sure! Like a sysctl? Here's a patch that does that (on top of Ingo's
softlockup-improve-debug-output.patch).

> > Control the trigger limit for softlockup warnings. This is useful for
> > debugging softlockups, by lowering the softlockup_lim to identify
> > possible softlockups earlier.
>
> Please check your patches with scripts/checkpatch.pl.

Yep, will do. (checkpatch emitted one warning for the patch below, but
that was because of a 'stylo' that already exists in
include/linux/sysctl.h -- which probably needs a style change patch by
itself.)

---
Control the trigger limit for softlockup warnings. This is useful for
debugging softlockups, by lowering the softlockup_thresh sysctl, to
identify possible softlockups earlier. The patch also changes the
softlockup printk to print the cpu softlockup time.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.22/kernel/softlockup.c
===================================================================
--- linux-2.6.22.orig/kernel/softlockup.c	2007-07-18 11:15:18.506614500 -0700
+++ linux-2.6.22/kernel/softlockup.c	2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int softlock_panic(struct notifier_block *this, unsigned long event,
 			  void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
 		wake_up_process(per_cpu(watchdog_task, this_cpu));
 
 	/* Warn about unreasonable 10+ seconds delays: */
-	if (now <= (touch_timestamp + 10))
+	if (now <= (touch_timestamp + softlockup_thresh))
 		return;
 
 	regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
 	per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
 	spin_lock(&print_lock);
-	printk(KERN_ERR "BUG: soft lockup detected on CPU#%d! [%s:%d]\n",
-		this_cpu, current->comm, current->pid);
+	printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
+		this_cpu, now - touch_timestamp,
+		current->comm, current->pid);
 	if (regs)
 		show_regs(regs);
 	else

Index: linux-2.6.22/kernel/sysctl.c
===================================================================
--- linux-2.6.22.orig/kernel/sysctl.c	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c	2007-07-18 21:05:57.877436750 -0700
@@ -78,6 +78,7 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
+extern int softlockup_thresh;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -206,6 +207,10 @@ static ctl_table root_table[] = {
 	{ .ctl_name = 0 }
 };
 
+/* Constants for kernel table minimum and maximum */
+static int one = 1;
+static int ten = 10;
+
 static ctl_table kern_table[] = {
 	{
 		.ctl_name	= KERN_PANIC,
@@ -615,6 +620,19 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+	{
+		.ctl_name	= KERN_SOFTLOCKUP_THRESHOLD,
+		.procname	= "softlockup_thresh",
+		.data		= &softlockup_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+		.extra2		= &ten,
+	},
+#endif
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.22/include/linux/sysctl.h
===================================================================
--- linux-2.6.22.orig/include/linux/sysctl.h	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/include/linux/sysctl.h	2007-07-18 21:41:56.584347500 -0700
@@ -165,6 +165,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74,
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_SOFTLOCKUP_THRESHOLD=77, /* int: softlockup tolerance threshold */
 };

Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.22.orig/Documentation
[patch] Change softlockup trigger limit using a kernel parameter
Kernel warns of softlockups if the softlockup thread is not able to run
on a CPU for 10s. It is useful to lower the softlockup warning threshold
in testing environments to catch potential lockups early. The following
patch adds a kernel parameter 'softlockup_lim' to control the softlockup
threshold.

Thanks,
Kiran

Control the trigger limit for softlockup warnings. This is useful for
debugging softlockups, by lowering the softlockup_lim to identify
possible softlockups earlier.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.22/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.22.orig/Documentation/kernel-parameters.txt	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/Documentation/kernel-parameters.txt	2007-07-16 14:07:34.907116001 -0700
@@ -1781,6 +1781,10 @@ and is between 256 and 4096 characters.
 
 	snd-ymfpci=	[HW,ALSA]
 
+	softlockup_lim=
+			[KNL] Soft lock up tolerance limit in seconds
+			{ 1 - 10 }
+
 	sonycd535=	[HW,CD]
 			Format: <io>[,<irq>]

Index: linux-2.6.22/kernel/softlockup.c
===================================================================
--- linux-2.6.22.orig/kernel/softlockup.c	2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/softlockup.c	2007-07-11 12:50:05.280460591 -0700
@@ -21,6 +21,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic = 0;
+static int softlockup_lim = 10;
 
 static int softlock_panic(struct notifier_block *this, unsigned long event,
 	void *ptr)
@@ -34,6 +35,20 @@ static struct notifier_block panic_block
 	.notifier_call = softlock_panic,
 };
 
+static int __init softlockup_lim_setup(char *str)
+{
+	int lim = simple_strtol(str, NULL, 0);
+	if(lim > 0 && lim <= 10) {
+		softlockup_lim = lim;
+	} else {
+		printk("%s: Invalid softlockup_lim parameter. ", __func__);
+		printk("Using defaults.\n");
+	}
+	return 1;
+}
+
+__setup("softlockup_lim=", softlockup_lim_setup);
+
 /*
  * Returns seconds, approximately. We don't need nanosecond
  * resolution, and we don't need to waste time with a big divide when
@@ -97,7 +112,7 @@ void softlockup_tick(void)
 		wake_up_process(per_cpu(watchdog_task, this_cpu));
 
 	/* Warn about unreasonable 10+ seconds delays: */
-	if (now > (touch_timestamp + 10)) {
+	if (now > (touch_timestamp + softlockup_lim)) {
 		per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 		spin_lock(&print_lock);
Re: [patch] x86_64: Avoid too many remote cpu references due to /proc/stat
On Thu, Jul 12, 2007 at 07:13:17PM -0700, Andrew Morton wrote:
> On Thu, 12 Jul 2007 17:06:16 -0700 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > Too many remote cpu references due to /proc/stat.
> >
> > On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> > On every call to kstat_irqs, the process brings in per-cpu data from all
> > online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS,
> > results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> > /proc/stat is parsed by common commands like top, who etc, causing
> > lots of cacheline transfers:
> > (256+32*63) * 63 = 143136
>
> Do you have any actual numbers for how much this hurts?

Under certain load conditions with 2.6.21/22, this hurts a *lot*. In
fact, I have noticed delays in seconds when /proc/stat is read (48P
box). However, the reason for this is not the transfer of per-cpu
kstat, but dereferencing of the percpu data offset from a remote cpu
pda! Since the cpu pda is kmalloced, with 2.6.21, the zone pcp
structures happen to share the same page (and in some cases the
internode cacheline as well) with the cpu_pda. Excessive page
allocation activity on the remote node causes reads of the remote
cpus' pda to result in cache misses.

> > This statistic seems useless. Other 'big iron' arches disable this.
> > Can we disable computing/reporting this statistic? This piece of
> > statistic is not human readable on x86_64 anymore,
>
> Did you consider using percpu_counters (or such) at interrupt-time?
> (warning: percpu_counters aren't presently interrupt safe).

No, but given that alloc_percpu of a counter wastes so much memory, an
array of NR_IRQS percpu_counters will only cause more bloat, no?

> > If not, can we optimize computing this statistic so as to avoid
> > too many remote references (patch to follow)
>
> You other patch is a straightforward optimisation and should just be merged.

Sure! But I still don't see this statistic _really_ being useful
anywhere; /proc/stat output with NR_CPUS of 64 or 96 is ugly, but I
might be missing something.

> But afaict it will only provide a 2x speedup which I doubt is sufficient?

No, it provides much more than a 2x speedup with the 'hurting'
workload. This is because the optimized patch goes like this:

	for_each_possible_cpu(i) {
		...
		for (j = 0; j < NR_IRQS; j++) {
			unsigned int temp = kstat_cpu(i).irqs[j];
			...
		}
	}

But the existing code with kstat_irqs goes as:

	for (j = 0; j < NR_IRQS; j++) {
		for_each_possible_cpu(cpu) {
			sum += kstat_cpu(cpu).irqs[irq];
			...
		}
	}

The former causes far fewer cache misses on the pda per-cpu data
offset than the latter. A cache miss itself is not so bad per miss,
but the number of misses in the latter case adds up!

Yes, the cpu pda needs to be on an internode boundary by itself; I
will be submitting a patch for the same soon. I am thinking of page
aligning the cpu pda and using the 12 lower bits in %gs to store the
cpu processor id maybe :)

> Another thought is: how many of the NR_IRQS counters are actually
> non-zero? Because a pretty obvious optimisation would be to have a
> global bitmap[NR_IRQS] and do
>
> 	if (!bitmap[irq])
> 		bitmap[irq] = 1;
>
> at interrupt-time, then just print a "0" for the interrupts which have
> never occurred within show_stats().

Yep. That would work too.
[patch] Avoid too many remote cpu references due to /proc/stat
On Thu, Jul 12, 2007 at 05:06:15PM -0700, Ravikiran G Thirumalai wrote:
> Too many remote cpu references due to /proc/stat.
>
> On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> On every call to kstat_irqs, the process brings in per-cpu data from all
> online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
> results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> /proc/stat is parsed by common commands like top, who etc, causing
> lots of cacheline transfers
>
> This statistic seems useless. Other 'big iron' arches disable this.
> Can we disable computing/reporting this statistic? This piece of
> statistic is not human readable on x86_64 anymore,
>
> If not, can we optimize computing this statistic so as to avoid
> too many remote references (patch to follow)

Optimize show_stat to collect per-irq information just once.

On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem. On every call to kstat_irqs, the process brings in per-cpu data from all online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS, results in (256+32*63) * 63 remote cpu references on a 64 cpu config. Considering the fact that we already compute this value per-cpu, we can save on the remote references as below.
Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.22.orig/fs/proc/proc_misc.c	2007-07-11 14:32:33.013197741 -0700
+++ linux-2.6.22/fs/proc/proc_misc.c	2007-07-12 16:28:24.871389279 -0700
@@ -443,6 +443,11 @@ static int show_stat(struct seq_file *p,
 	unsigned long jif;
 	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 sum = 0;
+	unsigned int *per_irq_sum;
+
+	per_irq_sum = kzalloc(sizeof(unsigned int)*NR_IRQS, GFP_KERNEL);
+	if (!per_irq_sum)
+		return -ENOMEM;
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = cputime64_zero;
@@ -461,8 +466,11 @@ static int show_stat(struct seq_file *p,
 		irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
 		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
 		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
-		for (j = 0 ; j < NR_IRQS ; j++)
-			sum += kstat_cpu(i).irqs[j];
+		for (j = 0 ; j < NR_IRQS ; j++) {
+			unsigned int temp = kstat_cpu(i).irqs[j];
+			sum += temp;
+			per_irq_sum[j] += temp;
+		}
 	}
 
 	seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu\n",
@@ -500,7 +508,7 @@ static int show_stat(struct seq_file *p,
 
 #if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64)
 	for (i = 0; i < NR_IRQS; i++)
-		seq_printf(p, " %u", kstat_irqs(i));
+		seq_printf(p, " %u", per_irq_sum[i]);
 #endif
 
 	seq_printf(p,
@@ -515,6 +523,7 @@ static int show_stat(struct seq_file *p,
 		nr_running(),
 		nr_iowait());
 
+	kfree(per_irq_sum);
 	return 0;
 }
[patch] x86_64: Avoid too many remote cpu references due to /proc/stat
Too many remote cpu references due to /proc/stat.

On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem. On every call to kstat_irqs, the process brings in per-cpu data from all online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS, results in (256+32*63) * 63 remote cpu references on a 64 cpu config. /proc/stat is parsed by common commands like top, who, etc, causing lots of cacheline transfers.

This statistic seems useless. Other 'big iron' arches disable this. Can we disable computing/reporting this statistic? This piece of statistic is not human readable on x86_64 anymore.

If not, can we optimize computing this statistic so as to avoid too many remote references (patch to follow)

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.22.orig/fs/proc/proc_misc.c	2007-07-12 16:31:02.0 -0700
+++ linux-2.6.22/fs/proc/proc_misc.c	2007-07-12 16:33:45.226221759 -0700
@@ -498,7 +498,8 @@ static int show_stat(struct seq_file *p,
 	}
 	seq_printf(p, "intr %llu", (unsigned long long)sum);
-#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64)
+#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64) \
+	&& !defined(CONFIG_X86_64)
 	for (i = 0; i < NR_IRQS; i++)
 		seq_printf(p, " %u", kstat_irqs(i));
 #endif
Re: [PATCH 7/12] acpi: fix another compile warning
On Wed, Jun 20, 2007 at 09:36:30AM -0400, Len Brown wrote:
> On Wednesday 20 June 2007 04:49, Andreas Herrmann wrote:
> > On Tue, Jun 19, 2007 at 11:38:02PM -0400, Len Brown wrote:
> > > On Tuesday 19 June 2007 18:50, Andreas Herrmann wrote:
>
> I fear, however, that this patch defeats the purpose of
> b0bd35e622ffbda2c01dc67a0381c6a18817a29a -- which was to make selecting
> NUMA more user-friendly. So it might make more sense to simply revert that
> patch entirely.
>
> The underlying problem is that Kconfig doesn't support using select
> on targets which themselves have dependencies.
>
> Signed-off-by: Len Brown <[EMAIL PROTECTED]>
>
> diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
> index 5ce9443..e9d7767 100644
> --- a/arch/x86_64/Kconfig
> +++ b/arch/x86_64/Kconfig
> @@ -364,9 +364,9 @@ config NODES_SHIFT
>  config X86_64_ACPI_NUMA
>  	bool "ACPI NUMA detection"
>  	depends on NUMA
> -	select ACPI
> +	depends on ACPI
>  	select PCI
> -	select ACPI_NUMA
> +	depends on ACPI_NUMA
>  	default y
>  	help
>  	  Enable ACPI SRAT based node topology detection.

arch/x86_64/Kconfig:706:error: found recursive dependency: PCI -> ACPI -> X86_64_ACPI_NUMA -> PCI -> USB_ARCH_HAS_OHCI -> USB_ARCH_HAS_HCD -> MOUSE_APPLETOUCH -> USB -> USB_SISUSBVGA
make[1]: *** [menuconfig] Error 1

I guess we have to change PCI to 'depends on' as well. Else, select PM? I am OK with either approach.

Thanks,
Kiran

Paper over 'select' inadequacies.
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc5/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.22-rc5.orig/arch/x86_64/Kconfig	2007-06-18 16:02:19.571323415 -0700
+++ linux-2.6.22-rc5/arch/x86_64/Kconfig	2007-06-20 11:34:29.845354250 -0700
@@ -366,6 +366,7 @@ config X86_64_ACPI_NUMA
 	depends on NUMA
 	select ACPI
 	select PCI
+	select PM
 	select ACPI_NUMA
 	default y
 	help
Re: [BUG] long freezes on thinkpad t60
On Mon, Jun 18, 2007 at 01:20:55AM -0700, Andrew Morton wrote:
> On Mon, 18 Jun 2007 10:12:04 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > Subject: [patch] x86: fix spin-loop starvation bug
> > From: Ingo Molnar <[EMAIL PROTECTED]>
> >
> > Miklos Szeredi reported very long pauses (several seconds, sometimes
> > more) on his T60 (with a Core2Duo) which he managed to track down to
> > wait_task_inactive()'s open-coded busy-loop. He observed that an
> > interrupt on one core tries to acquire the runqueue-lock but does not
> > succeed in doing so for a very long time - while wait_task_inactive() on
> > the other core loops waiting for the first core to deschedule a task
> > (which it wont do while spinning in an interrupt handler).
> >
> > The problem is: both the spin_lock() code and the wait_task_inactive()
> > loop uses cpu_relax()/rep_nop(), so in theory the CPU should have
> > guaranteed MESI-fairness to the two cores - but that didnt happen: one
> > of the cores was able to monopolize the cacheline that holds the
> > runqueue lock, for extended periods of time.
> >
> > This patch changes the spin-loop to assert an atomic op after every REP
> > NOP instance - this will cause the CPU to express its "MESI interest" in
> > that cacheline after every REP NOP.
>
> Kiran, if you're still able to reproduce that zone->lru_lock starvation
> problem, this would be a good one to try...

We tried this approach a week back (speak of coincidences), and it did not help the problem. I'd changed calls to the zone->lru_lock spin_lock to do spin_trylock in a while loop with cpu_relax instead. It did not help. This was on top of 2.6.17 kernels.

But the good news is that 2.6.21, as is, does not have the starvation issue -- that is, zone->lru_lock does not seem to get contended that much under the same workload. However, this was not on the same hardware I reported zone->lru_lock contention on (8 socket dual core opteron).
I don't have access to it anymore :(

Thanks,
Kiran
2.6.21.3: NFS: Buggy server - nlink == 0!
While running a dbench stress test on an nfs mounted file system, I notice the subject error message on the client machine. The client machine is a 48 core box with NUMA characteristics and 1024 dbench processes running continuously in a loop, while another memory hog application runs in parallel. The client is on 2.6.21.3. The server is booted up with 2.6.21.3 as well. Attached is the server configuration. The same test on a 2.6.16 client does not spew out these messages. Is this really a server issue, or is the NFS client to be blamed here?

fstab on the client goes as:

vus2:/mnt/sda5	/nfstest	nfs	udp,wsize=32768,rsize=32768	0 0

/etc/exports on the server looks like:

/mnt/sda5	*(rw,no_root_squash,sync)

I will be happy to post full config/dmesg if required.

The client side NFS config goes as:

#
# Network File Systems
#
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
# CONFIG_NFSD_V4 is not set
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
# CONFIG_RPCSEC_GSS_SPKM3 is not set

The server side NFS config goes as:

#
# Network File Systems
#
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m

Thanks,
Kiran
Re: [PATCH 1/2] Define new percpu interface for shared data -- version 3
On Thu, May 24, 2007 at 11:03:56AM +0200, Martin Schwidefsky wrote:
> On Wed, 2007-05-23 at 11:57 -0700, Ravikiran G Thirumalai wrote:
>
> Current git with the patches applied and the default configuration for
> s390 decreases the section size of .data.percpu from 0x3e50 to 0x3e00.
> 0.5% decrease.

Thanks!! Fenghua, you could use my 'Acked by' if needed

Acked-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Re: [PATCH 1/2] Define new percpu interface for shared data -- version 3
On Wed, May 23, 2007 at 12:09:56PM -0700, Yu, Fenghua wrote:
> > Has there been any measurable benefit yet due to tail padding?
>
> We don't have data that tail padding actually helps. It all
> depends on what data the linker lays out in the cachelines.
>
> As of now we just want to create the infrastructure (so that
> more and more people who need it, can use it).

So what we have now is space wastage on some architectures, space savings on some, but with no measurable performance benefit due to the infrastructure itself. Why not push the infrastructure when we really need it, as against pushing it now when we are not sure if it benefits?

Thanks,
Kiran
Re: [PATCH 1/2] Define new percpu interface for shared data -- version 3
On Wed, May 23, 2007 at 11:26:53AM -0700, Yu, Fenghua wrote:
> > elements are cacheline aligned. And as such, this differentiates the
> > local only data and remotely accessed data cleanly.
>
> > OK, but could we please have a concise description of the impact
> > of these changes on kernel memory footprint? Increase or decrease?
> > And by approximately how much?
>
> Depending on how the linker places percpu data, the patches could
> increase or decrease percpu section size. Data from 2.6.21-rc7-mm2:
>
> On x86 SMP, the section size is increased from 0x7768 to 0x790c.
> 1.3% increase.
>
> On X86-64 SMP, the size is decreased from 0x72d0 to 0x6540.
> 11.8% decrease.
>
> On X86-64 VSMP, the size is increased from 0x72d0 to 0x8340.
> 14.3% increase.
>
> On IA64 SMP, the size is decreased from 0x8370 to 0x7fc0.
> 2.8% decrease.

Has there been any measurable benefit yet due to tail padding? It would also be interesting to check the wastage/savings on another large cache line architecture like s390 (which has a 256 byte cache line).

Thanks,
Kiran
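The space-versus-sharing tradeoff debated here is easy to see with plain compiler alignment attributes. The 64-byte line size and struct names below are illustrative only (s390's line, as noted, is 256 bytes):

```c
#include <assert.h>

#define CACHE_LINE 64	/* illustrative; 256 on s390 */

/* Per-cpu data only ever touched by its own cpu: packing it tightly
 * wastes no space. */
struct local_stat {
	unsigned long a, b;
};

/* Data also read by remote cpus: aligning (and thereby padding) it out
 * to a full cache line prevents false sharing, at a space cost -- the
 * compiler rounds sizeof up to the alignment. */
struct shared_stat {
	unsigned long a, b;
} __attribute__((aligned(CACHE_LINE)));
```

Whether the padding ever pays for itself depends entirely on which data the linker would otherwise have co-located, which is exactly the "no measurable benefit yet" point above.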
Re: ACPI build problem (2.6.21 CONFIG_X86_64_ACPI_NUMA)
On Sat, Apr 28, 2007 at 01:59:46AM -0400, Len Brown wrote: > On Thursday 26 April 2007 09:26, you wrote: > ... > CONFIG_ACPI depends on CONFIG_PM, yet this build fails because you have > CONFIG_ACPI=y and CONFIG_PM=n > > Unfortunately kconfig doesn't trace dependencies when "select" is used, > making select sort of evil. Ie. select can't target anything which > itself has dependencies. > > In 2.6.15, b0bd35e622ffbda2c01dc67a0381c6a18817a29a added the select below, > and subsequently ACPI became dependent on PM, which broke your build. > Technically, this could have broken your build in other ways too > since ACPI already had other dependencies. > > +# Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig. > + > +config X86_64_ACPI_NUMA > + bool "ACPI NUMA detection" > + depends on NUMA > + select ACPI > + select ACPI_NUMA > + default y > + help > + Enable ACPI SRAT based node topology detection. > > If you de-select CONFIG_X86_64_ACPI_NUMA then the select goes away > and kconfig should work properly. > > It isn't immediately clear to me how the NUMA Kconfig patch is a step > forward, but perhaps the authors can comment. > Before the above change, to select NUMA for em64T systems, you *had* to select K8_NUMA in the 'Processor type and features' sub menu to select CONFIG_NUMA, and then select ACPI NUMA from the generic PM menu, which is not very intuitive when you have a EM64T NUMA system. On x86_64 platforms, NUMA topology detection could be K8 style detection, ACPI based topology or NUMA emulation. Most of the real NUMA systems would need ACPI detection, older AMD platforms would need K8_NUMA, and NUMA emulation is a debugging/emulation option to emulate NUMA topology on a regular SMP box. Since CONFIG_NUMA + CONFIG_NUMA_EMU is a valid combination, and since CONFIG_NUMA + CONFIG_K8_NUMA is a valid combination, it makes sense to have X86_64_ACPI_NUMA on its own, and grouped along with other NUMA topology styles in the Processor menu. 
Hope this is good enough reason :)

Regarding the subject build breakage: CONFIG_ACPI depends on CONFIG_PM with an explicit 'depends' keyword. If PM does not get selected and CONFIG_ACPI gets selected, I would think it is a Kconfig bug which needs fixing, no? Else, we can add a workaround by selecting CONFIG_PM if X86_64_ACPI_NUMA is chosen. But AFAICT the right thing is to fix the Kconfig bug.

Thanks,
Kiran
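The failure mode being debated (a 'select' forcing a symbol on without honoring that symbol's own 'depends on') can be modeled in a few lines of C; the enum and helper names below are purely hypothetical, just to make the broken end state concrete:

```c
#include <assert.h>

/* Toy model of three Kconfig symbols from this thread. */
enum { SYM_PM, SYM_ACPI, SYM_X86_64_ACPI_NUMA, NSYMS };

static int sym[NSYMS];	/* 0 = n, 1 = y */

/* 'depends on' is a validity check: ACPI depends on PM. */
static int acpi_deps_met(void)
{
	return sym[SYM_PM];
}

/* 'select ACPI' from X86_64_ACPI_NUMA: kconfig flips the target on
 * without evaluating the target's own dependencies. */
static void set_x86_64_acpi_numa(void)
{
	sym[SYM_X86_64_ACPI_NUMA] = 1;
	sym[SYM_ACPI] = 1;	/* forced on; no dependency check here */
}
```

With PM left off, the select still yields ACPI=y, which is precisely the CONFIG_ACPI=y / CONFIG_PM=n combination that broke the reported build.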
Re: ACPI build problem (2.6.21 CONFIG_X86_64_ACPI_NUMA)
On Sat, Apr 28, 2007 at 01:59:46AM -0400, Len Brown wrote: On Thursday 26 April 2007 09:26, you wrote: ... CONFIG_ACPI depends on CONFIG_PM, yet this build fails because you have CONFIG_ACPI=y and CONFIG_PM=n Unfortunately kconfig doesn't trace dependencies when select is used, making select sort of evil. Ie. select can't target anything which itself has dependencies. In 2.6.15, b0bd35e622ffbda2c01dc67a0381c6a18817a29a added the select below, and subsequently ACPI became dependent on PM, which broke your build. Technically, this could have broken your build in other ways too since ACPI already had other dependencies. +# Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig. + +config X86_64_ACPI_NUMA + bool ACPI NUMA detection + depends on NUMA + select ACPI + select ACPI_NUMA + default y + help + Enable ACPI SRAT based node topology detection. If you de-select CONFIG_X86_64_ACPI_NUMA then the select goes away and kconfig should work properly. It isn't immediately clear to me how the NUMA Kconfig patch is a step forward, but perhaps the authors can comment. Before the above change, to select NUMA for em64T systems, you *had* to select K8_NUMA in the 'Processor type and features' sub menu to select CONFIG_NUMA, and then select ACPI NUMA from the generic PM menu, which is not very intuitive when you have a EM64T NUMA system. On x86_64 platforms, NUMA topology detection could be K8 style detection, ACPI based topology or NUMA emulation. Most of the real NUMA systems would need ACPI detection, older AMD platforms would need K8_NUMA, and NUMA emulation is a debugging/emulation option to emulate NUMA topology on a regular SMP box. Since CONFIG_NUMA + CONFIG_NUMA_EMU is a valid combination, and since CONFIG_NUMA + CONFIG_K8_NUMA is a valid combination, it makes sense to have X86_64_ACPI_NUMA on its own, and grouped along with other NUMA topology styles in the Processor menu. 
[patch 2.6.21-rc6] failsafe mechanism to HPET clock calibration
Provide a failsafe mechanism to avoid the kernel spinning forever in read_hpet_tsc during early kernel bootup. This failsafe mechanism was introduced in 21-rc, http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2f7a2a79c3ebb44f8b1b7d9b4fd3a650eb69e544 but it looks like the hpet split from time.c lost the commit. This patch reintroduces the failsafe mechanism.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
Cc: Jack Steiner <[EMAIL PROTECTED]>
Cc: john stultz <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5/arch/x86_64/kernel/hpet.c
===
--- linux-2.6.21-rc5.orig/arch/x86_64/kernel/hpet.c	2007-04-11 16:03:05.0 -0700
+++ linux-2.6.21-rc5/arch/x86_64/kernel/hpet.c	2007-04-11 18:49:36.0 -0700
@@ -191,6 +191,7 @@ int hpet_reenable(void)
 #define TICK_COUNT 1
 #define TICK_MIN 5000
+#define MAX_TRIES 5
 
 /*
  * Some platforms take periodic SMI interrupts with 5ms duration. Make sure none
@@ -198,13 +199,15 @@ int hpet_reenable(void)
  */
 static void __init read_hpet_tsc(int *hpet, int *tsc)
 {
-	int tsc1, tsc2, hpet1;
+	int tsc1, tsc2, hpet1, i;
 
-	do {
+	for (i = 0; i < MAX_TRIES; i++) {
 		tsc1 = get_cycles_sync();
 		hpet1 = hpet_readl(HPET_COUNTER);
 		tsc2 = get_cycles_sync();
-	} while (tsc2 - tsc1 > TICK_MIN);
+		if (tsc2 - tsc1 <= TICK_MIN)
+			break;
+	}
 	*hpet = hpet1;
 	*tsc = tsc2;
 }
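The control flow the patch introduces — retry a back-to-back timestamp read a bounded number of times instead of looping unconditionally, accepting a reading only when the two reads are close together — can be sketched in plain userspace C. Everything below is hypothetical and only illustrates the pattern: `read_pair()` mocks the cycle-counter reads and simulates an SMI-disturbed (large-delta) reading on the first two attempts, and the threshold values are invented.

```c
#include <assert.h>

#define TICK_MIN  5000   /* max acceptable delta between the two reads */
#define MAX_TRIES 5      /* failsafe: never loop forever */

/* Mock of back-to-back cycle-counter reads: the first two attempts are
 * "disturbed" (large delta, as if an SMI fired mid-read), later ones clean. */
static unsigned long read_pair(int attempt, unsigned long *t1, unsigned long *t2)
{
    *t1 = 1000000UL * (unsigned long)attempt;
    *t2 = *t1 + (attempt < 2 ? 50000UL : 100UL);
    return *t2 - *t1;
}

/* Bounded-retry loop, mirroring the patch: break on a clean reading,
 * but give up after MAX_TRIES instead of spinning forever.
 * Returns the number of attempts consumed. */
int read_with_failsafe(unsigned long *delta)
{
    unsigned long t1 = 0, t2 = 0;
    int i;

    for (i = 0; i < MAX_TRIES; i++) {
        *delta = read_pair(i, &t1, &t2);
        if (*delta <= TICK_MIN)   /* clean reading: stop retrying */
            break;
    }
    return (i < MAX_TRIES) ? i + 1 : MAX_TRIES;
}

/* Small wrappers so the behavior is easy to check. */
int demo_attempts(void)
{
    unsigned long d;
    return read_with_failsafe(&d);
}

unsigned long demo_delta(void)
{
    unsigned long d = 0;
    read_with_failsafe(&d);
    return d;
}
```

With two disturbed readings followed by a clean one, the loop settles on the third attempt; a machine whose readings never settle would still exit after MAX_TRIES with the last reading, which is the failsafe the patch is about.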
Re: [patch] sched: align rq to cacheline boundary
On Mon, Apr 09, 2007 at 03:17:05PM -0700, Siddha, Suresh B wrote:
> On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> > On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > > On Mon, 9 Apr 2007 11:08:53 -0700 "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
>
> Kiran, can you educate me when I am supposed to use
> ____cacheline_aligned_in_smp vs __cacheline_aligned_in_smp?

As far as my understanding goes, the four-underscore version is for aligning members/elements within a data structure, and the two-underscore version is for aligning statically defined variables. The two-underscore version places the variable in a separate section meant for cacheline-aligned variables, so that there is no false sharing on the cacheline with a consecutive datum. For regular statically defined data structures, the latter has to be used; but since your patch uses per-cpu data, which is already in a separate section, you had to use the former, I guess.

> > As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline
> > aligned per-cpu data in another section, just like we do with the
> > .data.cacheline_aligned section, but keep this new section between
> > __per_cpu_start and __per_cpu_end?
>
> Yes. But that will still waste some memory in the new section, if the data
> elements are not multiples of 4k.

Yes. But the wastage now depends on the data structure being aligned, rather than on whichever structure happened to sit next to it before. You cannot not lose memory while padding, I guess :). But padding for per-cpu data seems a bit odd, and I am not sure it is worth it for a 0.5% gain.

Thanks,
Kiran
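The distinction between the two annotations can be shown with a userspace sketch using the GNU `aligned` attribute. This is an illustration, not the kernel macros: 64 is a stand-in for the L1 line size (SMP_CACHE_BYTES), the struct layout is invented, and the sketch cannot show the one thing the two-underscore kernel macro additionally does — place the variable in a dedicated linker section.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_L1_LINE 64   /* stand-in for SMP_CACHE_BYTES */

/* Four-underscore style: align a *member* inside a structure so it starts
 * on its own cache line; the compiler pads the structure as needed. */
struct runqueue_like {
    char head[10];
    long hot_counter __attribute__((aligned(FAKE_L1_LINE)));
};

/* Two-underscore style: align a whole statically defined variable.
 * (The real kernel macro also moves it to a .data.cacheline_aligned
 * section so its neighbor cannot share its line.) */
static struct runqueue_like one_rq __attribute__((aligned(FAKE_L1_LINE)));

/* Where the aligned member landed inside the structure. */
size_t member_offset(void) { return offsetof(struct runqueue_like, hot_counter); }

/* Aligning a member also rounds the whole structure size up. */
size_t object_stride(void) { return sizeof(struct runqueue_like); }

/* The statically defined variable itself starts on a line boundary. */
int object_is_aligned(void) { return ((size_t)&one_rq % FAKE_L1_LINE) == 0; }
```

The member lands at offset 64 (the 10-byte `head` is padded out to a full line), and `sizeof` the structure becomes a multiple of the line size, so arrays of it also keep elements on separate lines.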
Re: [patch] Pad irq_desc to internode cacheline size
On Mon, Apr 09, 2007 at 03:47:52PM -0600, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
>
> > This will consume nearly 4k per irq won't it? What is the upper bound
> > here, across all configs and all hardware?
> >
> > Is VSMP the only arch which has cacheline_internodealigned_in_smp
> > larger than cacheline_aligned_in_smp?
>
> Ugh. We set internode aligned to 4k for all of x86_64. !!!

No. Internode aligned is 4k only if CONFIG_X86_VSMP is chosen. The internode line size defaults to SMP_CACHE_BYTES for all other machine types. Please note that an INTERNODE_CACHE_SHIFT of 12 is defined under #ifdef CONFIG_X86_VSMP in include/asm-x86_64/cache.h.

> I believe this ups our worst case memory consumption for
> the array from 1M to 32M. Although the low end might be 2M.
> I can't recall if an irq_desc takes one cache line or two
> after we have put the cpu masks in it.
>
> My gut feel says that what we want to do is delay this until
> we are dynamically allocating the array members. Then we can at
> least have the chance of allocating the memory on the proper NUMA
> node, and won't need the extra NUMA alignment.

I was thinking along those lines as well. But, until we get there, can we have this in as a stopgap? The patch does not increase the memory footprint for any architecture other than vSMPowered systems.
Re: [patch] sched: align rq to cacheline boundary
On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 11:08:53 -0700 "Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:
>
> > Align the per cpu runqueue to the cacheline boundary. This will minimize the
> > number of cachelines touched during remote wakeup.
> >
> > Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> > ---
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index b9a6837..eca33c5 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -278,7 +278,7 @@ struct rq {
> >  	struct lock_class_key rq_lock_key;
> >  };
> >
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
> rather a lot.
>
> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

The internode line size, yes. But Suresh is using ____cacheline_aligned_in_smp, which uses SMP_CACHE_BYTES (L1_CACHE_BYTES), so this does not align the per-cpu variable to 4k. However, if the motivation for this patch was a significant performance difference, then the above padding needs to be on the internode cacheline size, using cacheline_internodealigned_in_smp. cacheline_internodealigned_in_smp aligns a data structure to the internode line size, which is 4k for vSMPowered systems and the L1 line size for all other architectures.

As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline-aligned per-cpu data in another section, just like we do with the .data.cacheline_aligned section, but keep this new section between __per_cpu_start and __per_cpu_end?
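Andrew's (linesize-4 * NR_CPUS) upper bound is easy to quantify: rounding an object up to the line size wastes the round-up remainder once per CPU. A small sketch of that arithmetic — the object sizes, line sizes, and CPU counts below are illustrative, not taken from any real configuration:

```c
#include <assert.h>

/* Bytes lost across all CPUs when a per-cpu object of obj_size bytes is
 * padded up to a multiple of linesize. Worst case is obj_size barely past
 * a multiple (e.g. 4 bytes), wasting (linesize - 4) bytes per CPU. */
static unsigned long padding_waste(unsigned long obj_size,
                                   unsigned long linesize,
                                   unsigned long nr_cpus)
{
    unsigned long padded = (obj_size + linesize - 1) / linesize * linesize;
    return (padded - obj_size) * nr_cpus;
}

/* Wrappers for concrete, invented scenarios. */
unsigned long waste_l1_case(void)   { return padding_waste(4, 128, 64); }
unsigned long waste_vsmp_case(void) { return padding_waste(4, 4096, 64); }
unsigned long waste_exact_fit(void) { return padding_waste(256, 128, 64); }
```

The vSMP case shows why the 4k internode line makes this discussion sharper: the same 4-byte overshoot that wastes ~124 bytes per CPU on a 128-byte line wastes ~4092 bytes per CPU on a 4k line.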
[patch] Pad irq_desc to internode cacheline size
We noticed a drop in network performance due to irq_desc being cacheline aligned rather than internode aligned. We see 50% of expected performance when two e1000 NICs local to two different nodes have consecutive irq descriptors allocated, due to false sharing.

Note that this patch does away with cacheline padding for the UP case, as it does not seem useful for UP configurations.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5/include/linux/irq.h
===
--- linux-2.6.21-rc5.orig/include/linux/irq.h	2007-04-09 10:16:23.560848473 -0700
+++ linux-2.6.21-rc5/include/linux/irq.h	2007-04-09 10:16:45.401177929 -0700
@@ -175,7 +175,7 @@ struct irq_desc {
 	struct proc_dir_entry *dir;
 #endif
 	const char *name;
-} ____cacheline_aligned;
+} ____cacheline_internodealigned_in_smp;
 
 extern struct irq_desc irq_desc[NR_IRQS];

Index: linux-2.6.21-rc5/kernel/irq/handle.c
===
--- linux-2.6.21-rc5.orig/kernel/irq/handle.c	2007-02-04 10:44:54.0 -0800
+++ linux-2.6.21-rc5/kernel/irq/handle.c	2007-04-09 12:26:40.473326023 -0700
@@ -48,7 +48,7 @@ handle_bad_irq(unsigned int irq, struct
 /*
  * Controller mappings for all interrupt sources:
  */
-struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned = {
+struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
 	[0 ... NR_IRQS-1] = {
 		.status = IRQ_DISABLED,
 		.chip = &no_irq_chip,
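The fix works because an alignment attribute on a struct type also sets the array stride: once each descriptor is rounded up to the internode line size, consecutive array elements can never share a line, so the two NICs stop false-sharing. A userspace sketch of just that layout property — 4096 stands in for the vSMP internode line size, and the miniature struct is invented, not the real irq_desc:

```c
#include <assert.h>
#include <stddef.h>

#define INTERNODE_LINE 4096   /* stand-in for the vSMP internode cache line */

/* Invented miniature of an irq descriptor, padded to the internode line. */
struct irq_desc_like {
    unsigned int status;
    void *handler_data;
    char name[16];
} __attribute__((aligned(INTERNODE_LINE)));

static struct irq_desc_like descs[4];

/* Distance in bytes between two consecutive descriptors in the array. */
size_t desc_stride(void)
{
    return (size_t)((char *)&descs[1] - (char *)&descs[0]);
}

/* True if two array neighbors can ever occupy the same line. */
int neighbours_share_line(void)
{
    return ((size_t)&descs[0] / INTERNODE_LINE) ==
           ((size_t)&descs[1] / INTERNODE_LINE);
}
```

Without the attribute the ~32-byte elements would pack many descriptors per line; with it the stride becomes a full line, which is exactly the memory-for-performance trade the thread is debating.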
Re: [rfc][patch] queued spinlocks (i386)
On Fri, Mar 23, 2007 at 10:40:17AM +0100, Eric Dumazet wrote:
> On Fri, 23 Mar 2007 09:59:11 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > Implement queued spinlocks for i386. This shouldn't increase the size of
> > the spinlock structure, while still being able to handle 2^16 CPUs.
> >
> > Not completely implemented with assembly yet, to make the algorithm a bit
> > clearer.
> >
> > The queued spinlock has 2 fields, a head and a tail, which are indexes
> > into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
> > "atomic_inc_return" on the head index, and keeps the returned value as
> > a ticket. The CPU then spins until the tail index is equal to that
> > ticket.
> >
> > To unlock a spinlock, the tail index is incremented (this can be
> > non-atomic, because only the lock owner will modify tail).
> >
> > Implementation inefficiencies aside, this change should have little
> > effect on performance for uncontended locks, but will have quite a
> > large cost for highly contended locks [O(N) cacheline transfers vs
> > O(1) per lock acquisition, where N is the number of CPUs contending].
> > The benefit is that contended locks will not cause any starvation.
> >
> > Just an idea. Big NUMA hardware seems to have fairness logic that
> > prevents starvation for the regular spinlock logic. But it might be
> > interesting for the -rt kernel or systems with starvation issues.
>
> It's a very nice idea Nick.

Amen to that.

> You also have for free the number of cpus that are before you.
>
> On big SMP/NUMA, we could use this information to call a special
> lock_cpu_relax() function to avoid cacheline transfers.
>
> 	asm volatile(LOCK_PREFIX "xaddw %0, %1\n\t"
> 		     : "+r" (pos), "+m" (lock->qhead) : : "memory");
> 	for (;;) {
> 		unsigned short nwait = pos - lock->qtail;
> 		if (likely(nwait == 0))
> 			break;
> 		lock_cpu_relax(lock, nwait);
> 	}
>
> 	lock_cpu_relax(raw_spinlock_t *lock, unsigned int nwait)
> 	{
> 		unsigned int cycles = nwait * lock->min_cycles_per_round;
> 		busy_loop(cycles);
> 	}

Good idea. Hopefully, this should reduce the number of cacheline transfers in the contended case.
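Nick's head/tail ticket scheme can be sketched with C11 atomics. This is a portable sketch of the algorithm only — not the i386 assembly in the patch — and all names below are mine; it also elides memory-ordering refinements (acquire/release instead of the default sequential consistency) a real lock would use.

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal ticket ("queued") spinlock: head hands out tickets, tail says
 * whose turn it is. 16-bit fields handle up to 2^16 waiters, as in the
 * patch description. */
typedef struct {
    atomic_ushort head;   /* next ticket to hand out */
    atomic_ushort tail;   /* ticket currently allowed to hold the lock */
} ticket_lock_t;

#define TICKET_LOCK_INIT { 0, 0 }

static unsigned short ticket_lock(ticket_lock_t *l)
{
    /* Atomic fetch-and-add plays the role of xaddw/atomic_inc_return:
     * grab a ticket, then spin until our number comes up. */
    unsigned short me = atomic_fetch_add(&l->head, 1);
    while (atomic_load(&l->tail) != me)
        ;   /* spin; a real lock would cpu_relax() here */
    return me;
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Only the owner ever advances tail, so no compare-and-swap needed. */
    atomic_fetch_add(&l->tail, 1);
}

/* Single-threaded sanity check: tickets come out in FIFO order and the
 * lock is immediately re-acquirable after each unlock. */
int ticket_demo(void)
{
    ticket_lock_t l = TICKET_LOCK_INIT;
    int ok = 1;
    for (int i = 0; i < 3; i++) {
        ok &= (ticket_lock(&l) == (unsigned short)i);
        ticket_unlock(&l);
    }
    return ok;
}
```

The FIFO property is what delivers the no-starvation guarantee the thread discusses, and `pos - lock->qtail` in Eric's snippet is exactly the `me - tail` distance a waiter can use to back off proportionally.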
Re: [PATCH] [REVISED] net/ipv4/multipath_wrandom.c: check kmalloc() return value.
On Mon, Mar 12, 2007 at 01:56:13PM -0700, David Miller wrote:
> From: Pekka J Enberg <[EMAIL PROTECTED]>
> Date: Mon, 12 Mar 2007 14:15:16 +0200 (EET)
>
> > On 3/9/07, David Miller <[EMAIL PROTECTED]> wrote:
> > > The whole cache-multipath subsystem has to have its guts revamped for
> > > proper error handling.
> >
> > (Untested patch follows.)
>
> I'm not accepting untested patches; if people want to fix this
> code it cannot be done in a shoddy manner any more, it will
> need to be fixed and tested by people who really care about
> this code.

Hi Dave, We do care for the multipath cached option. We use it often with node-aware multipath. We haven't experienced oopsen or crashes for our use cases over the past year. We will test Amit's and Pekka's patches with fault injection. We will also go through the core multipath-cached code and have a try at cleaning it up.

> We need real care for this code or I will remove it in 2.6.23

Can you please point us to any open bug reports on multipath cached? We did a call for testing on netdev some time back but did not get any bug reports: http://marc.info/?l=linux-netdev&m=114827370231872&w=2 The kernel bugzilla shows zarro boogs for multipath cached as well.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
>
> Seeing the code helps.

But there was a subtle problem with the hold-time instrumentation here. The code assumed that a critical section exiting through spin_unlock_irq entered the critical section with spin_lock_irq, but that might not always be the case, and the hold-time instrumentation goes bad when that happens (as in shrink_inactive_list).

> > The instrumentation goes like this:
> >
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1, t2;
> > 	local_irq_disable();
> > 	t1 = get_cycles_sync();
> > 	preempt_disable();
> > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > 	_raw_spin_lock(lock);
> > 	t2 = get_cycles_sync();
> > 	lock->raw_lock.htsc = t2;
> > 	if (lock->spin_time < (t2 - t1))
> > 		lock->spin_time = t2 - t1;
> > }
> > ...
> >
> > void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1;
> > 	spin_release(&lock->dep_map, 1, _RET_IP_);
> > 	t1 = get_cycles_sync();
> > 	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
> > 		lock->cs_time = t1 - lock->raw_lock.htsc;
> > 	_raw_spin_unlock(lock);
> > 	local_irq_enable();
> > 	preempt_enable();
> > }
> > ...
>
> OK, now we need to do a dump_stack() each time we discover a new max hold
> time. That might be a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky. Maybe make spin_unlock_irq() return
> the hold time then do:

What I found now, after fixing the above, is that the hold time is not bad -- 249461 cycles on the 2.6 GHz Opteron with PowerNow! disabled in the BIOS. The spin time is still on the order of seconds. Hence this looks like a hardware fairness issue. Attaching the instrumentation patch with this email.
Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h	2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h	2007-01-15 15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
 #include <asm/page.h>
 #include <asm/processor.h>
 
+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+	unsigned long long ret;
+	unsigned eax;
+	/* Don't do an additional sync on CPUs where we know
+	   RDTSC is already synchronous. */
+	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+		       "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
+	rdtscll(ret);
+	return ret;
+}
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
 		"jle 3b\n\t"
 		"jmp 1b\n"
 		"2:\t" : "=m" (lock->slock) : : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 
 /*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
 		"jmp 4b\n"
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 #endif
 
@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
+	if (oldval)
+		lock->htsc = get_cycles_sync2();
 	return oldval > 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
+	unsigned long long t = get_cycles_sync2();
+	if (lock->hold_time < t - lock->htsc)
+		lock->hold_time = t - lock->htsc;
 	asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
 }

Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h	2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h	2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@
 
 typedef struct {
 	unsigned int slock;
+	unsigned long long hold_time;
+	unsigned long long htsc;
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED	{ 1,0,0 }
 
 typedef struct {
 	unsigned int lock;

Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h	2007-01-14 22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h	2007-01-14 22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do { \
 # define spin_unlock(lock)
Re: High lock spin time for zone-lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
> Seeing the code helps.

But there was a subtle problem with the hold time instrumentation here. The
code assumed that a critical section exiting through spin_unlock_irq entered
the critical section with spin_lock_irq, but that might not always be the
case, and the instrumentation for hold time goes bad when that happens (as in
shrink_inactive_list). The instrumentation goes like this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	unsigned long long t1, t2;
	local_irq_disable();
	t1 = get_cycles_sync();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
	t2 = get_cycles_sync();
	lock->raw_lock.htsc = t2;
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}
...
void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
	unsigned long long t1;
	spin_release(&lock->dep_map, 1, _RET_IP_);
	t1 = get_cycles_sync();
	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
		lock->cs_time = t1 - lock->raw_lock.htsc;
	_raw_spin_unlock(lock);
	local_irq_enable();
	preempt_enable();
}
...

> OK, now we need to do a dump_stack() each time we discover a new max hold
> time. That might be a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky. Maybe make spin_unlock_irq() return
> the hold time then do:

What I found now after fixing the above is that the hold time is not bad --
249461 cycles on the 2.6 GHz Opteron with powernow disabled in the BIOS. The
spin time is still in the order of seconds. Hence this looks like a hardware
fairness issue. Attaching the instrumentation patch with this email.

Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h	2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h	2007-01-15 15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
 #include <asm/page.h>
 #include <asm/processor.h>
 
+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+	unsigned long long ret;
+	unsigned eax;
+	/* Don't do an additional sync on CPUs where we know
+	   RDTSC is already synchronous. */
+	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+		       "=a" (eax), "0" (1) : "ebx", "ecx", "edx", "memory");
+	rdtscll(ret);
+	return ret;
+}
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
 		"jle 3b\n\t"
 		"jmp 1b\n"
 		"2:\t" : "=m" (lock->slock) : : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 
 /*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
 		"jmp 4b\n"
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 #endif
 
@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
 
+	if (oldval)
+		lock->htsc = get_cycles_sync2();
 	return oldval > 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
+	unsigned long long t = get_cycles_sync2();
+	if (lock->hold_time < t - lock->htsc)
+		lock->hold_time = t - lock->htsc;
 	asm volatile("movl $1,%0" : "=m" (lock->slock) : : "memory");
 }

Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h	2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h	2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@
 
 typedef struct {
 	unsigned int slock;
+	unsigned long long hold_time;
+	unsigned long long htsc;
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED	{ 1, 0, 0 }
 
 typedef struct {
 	unsigned int lock;

Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h	2007-01-14 22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h	2007-01-14 22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do {							\
 # define spin_unlock(lock)	__raw_spin_unlock(&(lock)->raw_lock)
 # define read_unlock(lock)	__raw_read_unlock(&(lock)->raw_lock)
 # define
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > >{
> > > >	local_irq_disable();
> > > >>	rdtsc(t1);
> > > >	preempt_disable();
> > > >	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > >	_raw_spin_lock(lock);
> > > >>	rdtsc(t2);
> > > >	if (lock->spin_time < (t2 - t1))
> > > >		lock->spin_time = t2 - t1;
> > > >}
> > > >
> > > >On some runs, we found that the zone->lru_lock spun for 33 seconds or
> > > >more while the maximal CS time was 3 seconds or so.
> > >
> > > What is the "CS time"?
> >
> > Critical Section :). This is the maximal time interval I measured from
> > t2 above to the time point we release the spin lock. This is the hold
> > time I guess.
>
> By no means. The theory here is that CPUA is taking and releasing the
> lock at high frequency, but CPUB never manages to get in and take it. In
> which case the maximum-acquisition-time is much larger than the
> maximum-hold-time.
>
> I'd suggest that you use a similar trick to measure the maximum hold time:
> start the timer after we got the lock, stop it just before we release the
> lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> course...)

Well, that is exactly what I described above as CS time. The instrumentation
goes like this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	unsigned long long t1, t2;
	local_irq_disable();
	t1 = get_cycles_sync();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
	t2 = get_cycles_sync();
	lock->raw_lock.htsc = t2;
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}
...
void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
	unsigned long long t1;
	spin_release(&lock->dep_map, 1, _RET_IP_);
	t1 = get_cycles_sync();
	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
		lock->cs_time = t1 - lock->raw_lock.htsc;
	_raw_spin_unlock(lock);
	local_irq_enable();
	preempt_enable();
}

Am I missing something? Is this not what you just described? (The
synchronizing rdtsc might not really be required at all locations, but I
doubt it would contribute a significant fraction to the 33s spin time or even
the 3s hold time on a 2.6 GHz Opteron.)
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote: > On Fri, 12 Jan 2007 17:00:39 -0800 > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote: > > > But is > > lru_lock an issue is another question. > > I doubt it, although there might be changes we can make in there to > work around it. > > I tested with PAGEVEC_SIZE define to 62 and 126 -- no difference. I still notice the atrociously high spin times. Thanks, Kiran - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> >Hi,
> >We noticed high interrupt hold off times while running some memory
> >intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> >softlockups,
>
> [...]
>
> >We did not use any lock debugging options and used plain old rdtsc to
> >measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
> >this:
> >
> >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> >{
> >	local_irq_disable();
> >>	rdtsc(t1);
> >	preempt_disable();
> >	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> >	_raw_spin_lock(lock);
> >>	rdtsc(t2);
> >	if (lock->spin_time < (t2 - t1))
> >		lock->spin_time = t2 - t1;
> >}
> >
> >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> >while the maximal CS time was 3 seconds or so.
>
> What is the "CS time"?

Critical Section :). This is the maximal time interval I measured from t2
above to the time point we release the spin lock. This is the hold time I
guess.

> It would be interesting to know how long the maximal lru_lock *hold* time
> is, which could give us a better indication of whether it is a hardware
> problem.
>
> For example, if the maximum hold time is 10ms, then it might indicate a
> hardware fairness problem.

The maximal hold time was about 3s.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> Moreover, most atomic operations are to remote memory, which is also
> increasing the problem by making the atomic ops take longer. Typically
> mature NUMA systems have implemented hardware provisions that can deal with
> such high degrees of contention. If this is simply an SMP system that was
> turned into a NUMA box then this is a new hardware scenario for the
> engineers.

This is using HT as all AMD systems do, but this is one of the 8 socket
systems. I ran the same test on a 2 node Tyan AMD box, and did not notice
the atrocious spin times. It would be interesting to see how a 4 socket HT
box would fare. Unfortunately, I do not have access to one. If someone has
access to such a box, I can provide the test case and instrumentation
patches. It could very well be a hardware limitation in this case, which
means all the more reason to enable interrupts while spinning on spin locks.
But whether lru_lock is an issue is another question.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 11:46:22AM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> > The test was simple, we have 16 processes, each allocating 3.5G of memory
> > and touching each and every page and returning. Each of the processes is
> > bound to a node (socket), with the local node being the preferred node for
> > allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node). Each
> > socket has 4G of physical memory and there are two cores on each socket. On
> > start of the test, the machine becomes unresponsive after sometime and
> > prints out softlockup and OOM messages. We then found out the cause
> > for softlockups being the excessive spin times on zone->lru_lock. The fact
> > that spin_lock_irq disables interrupts while spinning made matters very bad.
> > We instrumented the spin_lock_irq code and found that the spin time on the
> > lru locks was in the order of a few seconds (tens of seconds at times) and
> > the hold time was comparatively lesser.
>
> So the issue is two processes contending on the zone lock for one node?
> You are overallocating the 4G node with two processes attempting to
> allocate 7.5GB? So we go off node for 3.5G of the allocation?

Yes.

> Does the system scale the right way if you stay within the bounds of node
> memory? I.e. allocate 1.5GB from each process?

Yes. We see problems only when we oversubscribe memory.

> Have you tried increasing the size of the per cpu caches in
> /proc/sys/vm/percpu_pagelist_fraction?

No, not yet. I can give it a try.

> > While the softlockups and the like went away by enabling interrupts during
> > spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and
> > hence warrants a discussion on lkml, hence this post. Are there any
> > plans/patches/ideas to address the spin time under such extreme conditions?
>
> Could this be a hardware problem? Some issue with atomic ops in the
> Sun hardware?

I think that is unlikely -- because when we do not oversubscribe memory, the
tests complete quickly without softlockups and the like. Peter has also
noticed this (presumably on different hardware). I would think this could
also be a locking unfairness (cpus of the same node getting the lock and
starving out other nodes) case under extreme contention.
High lock spin time for zone->lru_lock under extreme conditions
Hi,

We noticed high interrupt hold off times while running some memory intensive
tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed softlockups,
lost ticks and even wall time drifting (which is probably a bug in the
x86_64 timer subsystem).

The test was simple: we have 16 processes, each allocating 3.5G of memory
and touching each and every page and returning. Each of the processes is
bound to a node (socket), with the local node being the preferred node for
allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node). Each
socket has 4G of physical memory and there are two cores on each socket. On
start of the test, the machine becomes unresponsive after some time and
prints out softlockup and OOM messages. We then found out the cause for the
softlockups being the excessive spin times on zone->lru_lock. The fact that
spin_lock_irq disables interrupts while spinning made matters very bad. We
instrumented the spin_lock_irq code and found that the spin time on the lru
locks was in the order of a few seconds (tens of seconds at times) and the
hold time was comparatively lesser.

We did not use any lock debugging options and used plain old rdtsc to
measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	local_irq_disable();
>	rdtsc(t1);
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
>	rdtsc(t2);
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}

On some runs, we found that the zone->lru_lock spun for 33 seconds or more
while the maximal CS time was 3 seconds or so.

While the softlockups and the like went away by enabling interrupts during
spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 , Andi thought
maybe this is exposing a problem with zone->lru_locks and hence warrants a
discussion on lkml, hence this post. Are there any plans/patches/ideas to
address the spin time under such extreme conditions?

I will be happy to provide any additional information (config/dmesg/test
case) if needed.

Thanks,
Kiran
Re: + spin_lock_irq-enable-interrupts-while-spinning-i386-implementation.patch added to -mm tree
On Sun, Jan 07, 2007 at 01:06:58PM -0800, Ravikiran G Thirumalai wrote:
>
> Question is, now we have 2 versions of the spin_lock_irq implementation
> with CONFIG_PARAVIRT -- one with regular cli/sti and the other with
> virtualized CLI/STI -- sounds odd!

Sunday morning hangovers !! spin_lock_irq is not inlined, so there is just
one version even with CONFIG_PARAVIRT.

Thanks,
Kiran
Re: + spin_lock_irq-enable-interrupts-while-spinning-i386-implementation.patch added to -mm tree
On Sun, Jan 07, 2007 at 12:05:03PM -0800, Andrew Morton wrote:
> On Sun, 07 Jan 2007 05:24:45 -0800
> Daniel Walker <[EMAIL PROTECTED]> wrote:
>
> > Now it fails with CONFIG_PARAVIRT off .
> >
> > scripts/kconfig/conf -s arch/i386/Kconfig
> >   CHK     include/linux/version.h
> >   CHK     include/linux/compile.h
> >   CHK     include/linux/utsrelease.h
> >   UPD     include/linux/compile.h
> >   CC      arch/i386/kernel/asm-offsets.s
> > In file included from include/linux/spinlock.h:88,
> >                  from include/linux/module.h:10,
> >                  from include/linux/crypto.h:22,
> >                  from arch/i386/kernel/asm-offsets.c:8:
> > include/asm/spinlock.h: In function '__raw_spin_lock_irq':
> > include/asm/spinlock.h:100: error: expected string literal before
> > '__CLI_STI_INPUT_ARGS'
>
> bah.
>
> --- a/include/asm-i386/spinlock.h~spin_lock_irq-enable-interrupts-while-spinning-i386-implementation-fix-fix
> +++ a/include/asm-i386/spinlock.h
> @@ -14,6 +14,7 @@
>  #define STI_STRING	"sti"
>  #define CLI_STI_CLOBBERS
>  #define CLI_STI_INPUT_ARGS
> +#define __CLI_STI_INPUT_ARGS
>  #endif /* CONFIG_PARAVIRT */

Apologies for the broken patch and thanks for the fix. But, the above is
needed to fix the build even with CONFIG_PARAVIRT!!! Apparently because
arch/i386/mm/boot_ioremap.c undefs CONFIG_PARAVIRT. Question is, now we have
2 versions of the spin_lock_irq implementation with CONFIG_PARAVIRT -- one
with regular cli/sti and the other with virtualized CLI/STI -- sounds odd!

Thanks,
Kiran
Re: [rfc] [patch 1/2] spin_lock_irq: Enable interrupts while spinning -- preparatory patch
On Wed, Jan 03, 2007 at 12:16:35AM -0800, Andrew Morton wrote:
> On Tue, 2 Jan 2007 23:59:23 -0800
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > The following patches do just that. The first patch is preparatory in nature
> > and the second one changes the x86_64 implementation of spin_lock_irq.
> > Patch passed overnight runs of kernbench and dbench on 4 way x86_64 smp.
>
> The end result of this is, I think, that i386 enables irqs while spinning
> in spin_lock_irqsave() but not while spinning in spin_lock_irq(). And
> x86_64 does the opposite.

No, right now we have on mainline (non PREEMPT case):

                        i386                    x86_64
                        ----                    ------
spin_lock_irq           cli when spin           cli when spin
spin_lock_irqsave       spin with intr enabled  spin with intr enabled

The posted patchset changed this to:

                        i386                    x86_64
                        ----                    ------
spin_lock_irq           cli when spin           spin with intr enabled
spin_lock_irqsave       spin with intr enabled  spin with intr enabled

> Odd, isn't it?

Well, we just implemented the x86_64 part. Here goes the i386 part as well
for spin_lock_irq.
i386: Enable interrupts while spinning for a lock with spin_lock_irq

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc1/include/asm-i386/spinlock.h
===
--- linux-2.6.20-rc1.orig/include/asm-i386/spinlock.h	2006-12-28 17:18:32.142775000 -0800
+++ linux-2.6.20-rc1/include/asm-i386/spinlock.h	2007-01-03 10:18:32.243662000 -0800
@@ -82,7 +82,22 @@ static inline void __raw_spin_lock_flags
 		CLI_STI_INPUT_ARGS
 		: "memory" CLI_STI_CLOBBERS);
 }
-# define __raw_spin_lock_irq(lock)	__raw_spin_lock(lock)
+
+static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
+{
+	asm volatile("\n1:\t"
+		     LOCK_PREFIX " ; decb %0\n\t"
+		     "jns 3f\n"
+		     STI_STRING "\n"
+		     "2:\t"
+		     "rep;nop\n\t"
+		     "cmpb $0,%0\n\t"
+		     "jle 2b\n\t"
+		     CLI_STRING "\n"
+		     "jmp 1b\n"
+		     "3:\n\t"
+		     : "+m" (lock->slock) : : "memory");
+}
 #endif
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
[rfc] [patch 2/2] spin_lock_irq: Enable interrupts while spinning -- x86_64 implementation
Implement enabling of interrupts while spinning for a lock in spin_lock_irq.

Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc1/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc1.orig/include/asm-x86_64/spinlock.h	2006-12-28 17:18:32.142775000 -0800
+++ linux-2.6.20-rc1/include/asm-x86_64/spinlock.h	2006-12-29 14:05:04.012954000 -0800
@@ -63,7 +63,21 @@ static inline void __raw_spin_lock_flags
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
 }
-#define __raw_spin_lock_irq(lock)	__raw_spin_lock(lock)
+static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
+{
+	asm volatile(
+		"\n1:\t"
+		LOCK_PREFIX " ; decl %0\n\t"
+		"jns 2f\n"
+		"sti\n"		/* Enable interrupts during spin */
+		"3:\n"
+		"rep;nop\n\t"
+		"cmpl $0,%0\n\t"
+		"jle 3b\n\t"
+		"cli\n"
+		"jmp 1b\n"
+		"2:\t" : "+m" (lock->slock) : : "memory");
+}
 #endif
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
[rfc] [patch 1/2] spin_lock_irq: Enable interrupts while spinning -- preparatory patch
There seems to be no good reason for spin_lock_irq to disable interrupts while spinning. Zwane Mwaikambo had an implementation a couple of years ago, and the only objection seemed to be concerns about buggy code using spin_lock_irq while interrupts are disabled: http://lkml.org/lkml/2004/5/26/87 That shouldn't be a concern anymore. Besides, spin_lock_irqsave now enables interrupts while spinning. As to the motivation, on a Sun x4600 8 socket 16 core x86_64 NUMA box, we noticed softlockups and quite a few lost timer ticks under extreme memory pressure. The reason turned out to be that zone->lru_lock tends to get heavily contended, and NUMA nodes try to grab the locks using spin_lock_irq. Instrumentation showed us that interrupt hold-offs can last for a few seconds, and even the main timer interrupts get held off for long -- which is not good. Enabling interrupts while spinning for spinlocks made the machine responsive, and the softlockups/lost ticks went away. Although the scenario above was an extreme condition (very high memory pressure), I guess it still makes sense to enable interrupts while spinning for a lock. The following patches do just that. The first patch is preparatory in nature and the second one changes the x86_64 implementation of spin_lock_irq. The patch passed overnight runs of kernbench and dbench on a 4 way x86_64 smp box. Comments? Thanks, Kiran Preparatory patch to enable interrupts while spinning with spinlock irqs. Any arch which needs this feature just has to implement __raw_spin_lock_irq. Signed-off-by: Pravin B.
Shelar <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/include/asm-alpha/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-alpha/spinlock.h 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-alpha/spinlock.h 2006-12-28 17:18:32.132775000 -0800 @@ -13,6 +13,7 @@ */ #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) +#define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) #define __raw_spin_is_locked(x)((x)->lock != 0) #define __raw_spin_unlock_wait(x) \ do { cpu_relax(); } while ((x)->lock) Index: linux-2.6.20-rc1/include/asm-arm/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-arm/spinlock.h2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-arm/spinlock.h 2006-12-28 17:18:32.132775000 -0800 @@ -22,6 +22,7 @@ do { while (__raw_spin_is_locked(lock)) cpu_relax(); } while (0) #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) +#define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) static inline void __raw_spin_lock(raw_spinlock_t *lock) { Index: linux-2.6.20-rc1/include/asm-cris/arch-v32/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-cris/arch-v32/spinlock.h 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-cris/arch-v32/spinlock.h 2006-12-29 16:36:27.182954000 -0800 @@ -36,7 +36,7 @@ static inline void _raw_spin_lock_flags { _raw_spin_lock(lock); } - +#define __raw_spin_lock_irq(lock) _raw_spin_lock(lock) /* * Read-write spinlocks, allowing multiple readers * but only one writer. 
Index: linux-2.6.20-rc1/include/asm-i386/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-i386/spinlock.h 2006-12-21 14:34:33.871573917 -0800 +++ linux-2.6.20-rc1/include/asm-i386/spinlock.h2006-12-28 17:18:32.142775000 -0800 @@ -82,6 +82,7 @@ static inline void __raw_spin_lock_flags CLI_STI_INPUT_ARGS : "memory" CLI_STI_CLOBBERS); } +# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) #endif static inline int __raw_spin_trylock(raw_spinlock_t *lock) Index: linux-2.6.20-rc1/include/asm-ia64/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-ia64/spinlock.h 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.20-rc1/include/asm-ia64/spinlock.h2006-12-28 17:18:32.142775000 -0800 @@ -87,7 +87,7 @@ __raw_spin_lock_flags (raw_spinlock_t *l } #define __raw_spin_lock(lock) __raw_spin_lock_flags(lock, 0) - +# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) /* Unlock by doing an ordered store and releasing the cacheline with nta */ static inline void __raw_spin_unlock(raw_spinlock_t *x) { barrier(); @@ -96,6 +96,7 @@ static inline void __raw_spin_unlock(raw #else /* !ASM_SUPPORTED */ #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) +# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) # define __raw_spin_lock(x) \ do {
Re: [rfc] [patch 1/2] spin_lock_irq: Enable interrupts while spinning -- preparatory patch
On Wed, Jan 03, 2007 at 12:16:35AM -0800, Andrew Morton wrote: > On Tue, 2 Jan 2007 23:59:23 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote: > > The following patches do just that. The first patch is preparatory in nature and the second one changes the x86_64 implementation of spin_lock_irq. Patch passed overnight runs of kernbench and dbench on 4 way x86_64 smp. > The end result of this is, I think, that i386 enables irqs while spinning in spin_lock_irqsave() but not while spinning in spin_lock_irq(). And x86_64 does the opposite. No, right now we have on mainline (non PREEMPT case):

                    i386                    x86_64
spin_lock_irq       cli when spin           cli when spin
spin_lock_irqsave   spin with intr enabled  spin with intr enabled

The posted patchset changed this to:

                    i386                    x86_64
spin_lock_irq       cli when spin           spin with intr enabled
spin_lock_irqsave   spin with intr enabled  spin with intr enabled

Odd, isn't it? Well, we just implemented the x86_64 part. Here goes the i386 part as well for spin_lock_irq. i386: Enable interrupts while spinning for a lock with spin_lock_irq Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/include/asm-i386/spinlock.h === --- linux-2.6.20-rc1.orig/include/asm-i386/spinlock.h 2006-12-28 17:18:32.142775000 -0800 +++ linux-2.6.20-rc1/include/asm-i386/spinlock.h2007-01-03 10:18:32.243662000 -0800 @@ -82,7 +82,22 @@ static inline void __raw_spin_lock_flags CLI_STI_INPUT_ARGS : "memory" CLI_STI_CLOBBERS); } -# define __raw_spin_lock_irq(lock) __raw_spin_lock(lock) + +static inline void __raw_spin_lock_irq(raw_spinlock_t *lock) +{ + asm volatile("\n1:\t" +LOCK_PREFIX " ; decb %0\n\t" +"jns 3f\n" +STI_STRING "\n" +"2:\t" +"rep;nop\n\t" +"cmpb $0,%0\n\t" +"jle 2b\n\t" +CLI_STRING "\n" +"jmp 1b\n" +"3:\n\t" +: "+m" (lock->slock) : : "memory"); +} #endif static inline int __raw_spin_trylock(raw_spinlock_t *lock)
[patch] x86: Fix dev_to_node for x86 and x86_64
Hi Andrew, dev_to_node() does not work as expected on x86 and x86_64, as pointed out earlier here: http://lkml.org/lkml/2006/11/7/10 The following patch fixes it, please apply. (Note: the fix depends on support for PCI domains for x86/x86_64.) Thanks, Kiran dev_to_node does not work as expected on x86_64 (and i386). This is because, with the current x86_64 code, the node value returned by pcibus_to_node is initialized only after the struct device is created. We need the node value initialized before the call to pci_scan_bus_parented, as the generic devices are allocated and initialized off pci_scan_child_bus, which gets called from pci_scan_bus_parented. The following patch does that using "pci_sysdata" introduced by the PCI domain patches in -mm. Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/arch/i386/pci/acpi.c === --- linux-2.6.20-rc1.orig/arch/i386/pci/acpi.c 2006-12-28 11:51:52.542775000 -0800 +++ linux-2.6.20-rc1/arch/i386/pci/acpi.c 2006-12-28 12:01:19.242775000 -0800 @@ -9,6 +9,7 @@ struct pci_bus * __devinit pci_acpi_scan { struct pci_bus *bus; struct pci_sysdata *sd; + int pxm; /* Allocate per-root-bus (not per bus) arch-specific data. * TODO: leak; this memory is never freed. 
@@ -30,15 +31,21 @@ struct pci_bus * __devinit pci_acpi_scan } #endif /* CONFIG_PCI_DOMAINS */ + sd->node = -1; + + pxm = acpi_get_pxm(device->handle); +#ifdef CONFIG_ACPI_NUMA + if (pxm >= 0) + sd->node = pxm_to_node(pxm); +#endif + bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd); if (!bus) kfree(sd); #ifdef CONFIG_ACPI_NUMA if (bus != NULL) { - int pxm = acpi_get_pxm(device->handle); if (pxm >= 0) { - sd->node = pxm_to_node(pxm); printk("bus %d -> pxm %d -> node %d\n", busnum, pxm, sd->node); }
[patch] mm: Set HASHDIST_DEFAULT to 1 for x86_64 NUMA
Enable system hashtable memory to be distributed among nodes on x86_64 NUMA. Forcing the kernel to use node interleaved vmalloc instead of bootmem for the system hashtable memory (alloc_large_system_hash) reduces the memory imbalance on node 0 by around 40MB on an 8 node x86_64 NUMA box.

Before the following patch, on bootup of an 8 node box:

Node 0 MemTotal:      3407488 kB
Node 0 MemFree:       3206296 kB
Node 0 MemUsed:        201192 kB
Node 0 Active:           7012 kB
Node 0 Inactive:          512 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:           0 kB
Node 0 FilePages:        1912 kB
Node 0 Mapped:            420 kB
Node 0 AnonPages:        5612 kB
Node 0 PageTables:        468 kB
Node 0 NFS_Unstable:        0 kB
Node 0 Bounce:              0 kB
Node 0 Slab:             5408 kB
Node 0 SReclaimable:      644 kB
Node 0 SUnreclaim:       4764 kB

After the patch (or using hashdist=1 on the kernel command line):

Node 0 MemTotal:      3407488 kB
Node 0 MemFree:       3247608 kB
Node 0 MemUsed:        159880 kB
Node 0 Active:           3012 kB
Node 0 Inactive:          616 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:           0 kB
Node 0 FilePages:        2424 kB
Node 0 Mapped:            380 kB
Node 0 AnonPages:        1200 kB
Node 0 PageTables:        396 kB
Node 0 NFS_Unstable:        0 kB
Node 0 Bounce:              0 kB
Node 0 Slab:             6304 kB
Node 0 SReclaimable:     1596 kB
Node 0 SUnreclaim:       4708 kB

I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA since x86_64 has no dearth of vmalloc space? Or maybe enable hash distribution for all 64bit NUMA arches? The following patch does it only for x86_64. Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/include/linux/bootmem.h === --- linux-2.6.20-rc1.orig/include/linux/bootmem.h 2006-12-21 14:34:36.321610875 -0800 +++ linux-2.6.20-rc1/include/linux/bootmem.h2006-12-26 15:55:04.501064560 -0800 @@ -122,9 +122,9 @@ extern void *alloc_large_system_hash(con #define HASH_EARLY 0x0001 /* Allocating during early boot? */ /* Only NUMA needs hash distribution. - * IA64 is known to have sufficient vmalloc space. 
+ * IA64 and x86_64 have sufficient vmalloc space. */ -#if defined(CONFIG_NUMA) && defined(CONFIG_IA64) +#if defined(CONFIG_NUMA) && (defined(CONFIG_IA64) || defined(CONFIG_X86_64)) #define HASHDIST_DEFAULT 1 #else #define HASHDIST_DEFAULT 0
[patch] x86_64: Fix boot hang due to nmi watchdog init code
2.6.19 stopped booting (or booted depending on build/config) on our x86_64 systems due to a bug introduced in 2.6.19. check_nmi_watchdog schedules an IPI on all cpus to busy wait on a flag, but fails to set the busywait flag if NMI functionality is disabled. This causes the secondary cpus to spin in an endless loop, causing the kernel bootup to hang. Depending upon the build, the busywait flag got overwritten (stack variable) and caused the kernel to boot up on certain builds. The following patch fixes the bug by setting the busywait flag before returning from check_nmi_watchdog. I guess using a stack variable is not good here, as the calling function could potentially return while the busy wait loop is still spinning on the flag. I would think this is a good candidate for 2.6.19 stable as well. Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.19/arch/x86_64/kernel/nmi.c === --- linux-2.6.19.orig/arch/x86_64/kernel/nmi.c 2006-11-29 13:57:37.0 -0800 +++ linux-2.6.19/arch/x86_64/kernel/nmi.c 2006-12-04 18:02:42.462737000 -0800 @@ -210,9 +210,10 @@ static __init void nmi_cpu_busy(void *da } #endif +static int endflag = 0; + int __init check_nmi_watchdog (void) { - volatile int endflag = 0; int *counts; int cpu; @@ -253,6 +254,7 @@ int __init check_nmi_watchdog (void) if (!atomic_read(&nmi_active)) { kfree(counts); atomic_set(&nmi_active, -1); + endflag = 1; return -1; } endflag = 1;
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 11:19:24AM +0200, Jens Axboe wrote: > On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote: > > The following patchset breaks down the global ide_lock to per-hwgroup lock. > > We have taken the following approach. > Curious, what is the point of this? > On smp machines with multiple ide interfaces, we take a per-group lock instead of a global lock, thereby breaking the lock into per-irq hwgroup locks. Thanks, Kiran
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 06:09:10PM +0100, Alan Cox wrote: > On Maw, 2005-09-06 at 16:33 -0700, Ravikiran G Thirumalai wrote: > > 2. Change the core ide code to use hwgroup->lock instead of ide_lock. > > Deprecate ide_lock (patch 2) > > hwgroups and IDE locking requirements are frequently completely > unrelated. Its clear from the changes proposed you've not tested on real > hardware for each case and you have not studied the documented errata I tested it on a 2way box with a piix controller. It got through Bonnie++. I have access to piix controllers only, so that was the only controller I changed. I did not read the errata though... :( Do you think the approach is unsafe, even if the piix tune routine is serialized with a per-driver lock? Bartlomiej? Thanks, Kiran
Re: [patch 4/4] ide: Break ide_lock -- remove ide_lock from piix driver
On Wed, Sep 07, 2005 at 06:06:23PM +0100, Alan Cox wrote: > On Maw, 2005-09-06 at 16:44 -0700, Ravikiran G Thirumalai wrote: > > Patch to convert piix driver to use per-driver/hwgroup lock and kill > > ide_lock. In the case of piix, hwgroup->lock should be sufficient. > > PIIX requires that both channels are quiescent when retuning in some > cases. It wasn't totally safe before, its now totally broken. Start by Then the change to the piix controller in my patchset is bad. How about changing the ide_lock to a per-driver lock in this case? Locking for the rest of the controllers in the system is left equivalent to what ide_lock did earlier. > fixing the IDE layer locking properly (or forward porting my patches and > then fixing them for all the refcounting changes and other stuff done > since). Can you please point me to the patchset... Thanks, Kiran
Re: [patch 2/4] ide: Break ide_lock -- replace ide_lock with hwgroup->lock in core ide
On Tue, Sep 06, 2005 at 04:40:28PM -0700, Ravikiran G Thirumalai wrote: > Patch to convert ide core code to use hwgroup lock instead of a global > ide_lock. > > Index: linux-2.6.13/drivers/ide/ide-io.c > === > --- linux-2.6.13.orig/drivers/ide/ide-io.c2005-09-06 11:22:29.0 > -0700 > @@ -1211,11 +1214,11 @@ >*/ > if (masked_irq != IDE_NO_IRQ && hwif->irq != masked_irq) > disable_irq_nosync(hwif->irq); > - spin_unlock(&ide_lock); > + spin_unlock(&hwgroup->lock); > local_irq_enable(); > /* allow other IRQs while we start this request */ > startstop = start_request(drive, rq); > - spin_lock_irq(&ide_lock); > + spin_unlock_irq(&hwgroup->lock); ^^^ My bad, Fixed patch attached. Thanks, Kiran Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Signed-off-by: Vaibhav V. Nivargi <[EMAIL PROTECTED]> Signed-off-by: Alok N. Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/ide-cd.c === --- linux-2.6.13.orig/drivers/ide/ide-cd.c 2005-09-06 15:08:07.0 -0700 +++ linux-2.6.13/drivers/ide/ide-cd.c 2005-09-06 15:16:15.0 -0700 @@ -590,7 +590,8 @@ static void cdrom_end_request (ide_drive_t *drive, int uptodate) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int nsectors = rq->hard_cur_sectors; if ((rq->flags & REQ_SENSE) && uptodate) { @@ -612,10 +613,10 @@ /* * now end failed request */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); end_that_request_chunk(failed, 0, failed->data_len); end_that_request_last(failed); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); } cdrom_analyze_sense_data(drive, failed, sense); @@ -636,7 +637,8 @@ Returns 1 if the request was ended. 
*/ static int cdrom_decode_status(ide_drive_t *drive, int good_stat, int *stat_ret) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int stat, err, sense_key; /* Check for errors. */ @@ -698,10 +700,10 @@ * request sense has completed */ if (stat & ERR_STAT) { - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); cdrom_queue_request_sense(drive, rq->sense, rq); } else @@ -741,9 +743,9 @@ * take a breather relying on the * unplug timer to kick us again */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blk_plug_device(drive->queue); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return 1; } } @@ -839,6 +841,7 @@ ide_startstop_t startstop; struct cdrom_info *info = drive->driver_data; ide_hwif_t *hwif = drive->hwif; + ide_hwgroup_t *hwgroup = hwif->hwgroup; /* Wait for the controller to be idle. */ if (ide_wait_stat(&startstop, drive, 0, BUSY_STAT, WAIT_READY)) @@ -866,10 +869,10 @@ unsigned long flags; /* packet command */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); hwif->OUTBSYNC(drive, WIN_PACKETCMD, IDE_COMMAND_REG); ndelay(400); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return (*handler) (drive); } @@ -160
Re: [patch 2/4] ide: Break ide_lock -- replace ide_lock with hwgroup-lock in core ide
On Tue, Sep 06, 2005 at 04:40:28PM -0700, Ravikiran G Thirumalai wrote: Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Index: linux-2.6.13/drivers/ide/ide-io.c === --- linux-2.6.13.orig/drivers/ide/ide-io.c2005-09-06 11:22:29.0 -0700 @@ -1211,11 +1214,11 @@ */ if (masked_irq != IDE_NO_IRQ hwif-irq != masked_irq) disable_irq_nosync(hwif-irq); - spin_unlock(ide_lock); + spin_unlock(hwgroup-lock); local_irq_enable(); /* allow other IRQs while we start this request */ startstop = start_request(drive, rq); - spin_lock_irq(ide_lock); + spin_unlock_irq(hwgroup-lock); ^^^ My bad, Fixed patch attached. Thanks, Kiran Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Signed-off-by: Vaibhav V. Nivargi [EMAIL PROTECTED] Signed-off-by: Alok N. Kataria [EMAIL PROTECTED] Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED] Signed-off-by: Shai Fultheim [EMAIL PROTECTED] Index: linux-2.6.13/drivers/ide/ide-cd.c === --- linux-2.6.13.orig/drivers/ide/ide-cd.c 2005-09-06 15:08:07.0 -0700 +++ linux-2.6.13/drivers/ide/ide-cd.c 2005-09-06 15:16:15.0 -0700 @@ -590,7 +590,8 @@ static void cdrom_end_request (ide_drive_t *drive, int uptodate) { - struct request *rq = HWGROUP(drive)-rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup-rq; int nsectors = rq-hard_cur_sectors; if ((rq-flags REQ_SENSE) uptodate) { @@ -612,10 +613,10 @@ /* * now end failed request */ - spin_lock_irqsave(ide_lock, flags); + spin_lock_irqsave(hwgroup-lock, flags); end_that_request_chunk(failed, 0, failed-data_len); end_that_request_last(failed); - spin_unlock_irqrestore(ide_lock, flags); + spin_unlock_irqrestore(hwgroup-lock, flags); } cdrom_analyze_sense_data(drive, failed, sense); @@ -636,7 +637,8 @@ Returns 1 if the request was ended. 
*/ static int cdrom_decode_status(ide_drive_t *drive, int good_stat, int *stat_ret) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int stat, err, sense_key; /* Check for errors. */ @@ -698,10 +700,10 @@ * request sense has completed */ if (stat & ERR_STAT) { - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); cdrom_queue_request_sense(drive, rq->sense, rq); } else @@ -741,9 +743,9 @@ * take a breather relying on the * unplug timer to kick us again */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blk_plug_device(drive->queue); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return 1; } } @@ -839,6 +841,7 @@ ide_startstop_t startstop; struct cdrom_info *info = drive->driver_data; ide_hwif_t *hwif = drive->hwif; + ide_hwgroup_t *hwgroup = hwif->hwgroup; /* Wait for the controller to be idle. */ if (ide_wait_stat(&startstop, drive, 0, BUSY_STAT, WAIT_READY)) @@ -866,10 +869,10 @@ unsigned long flags; /* packet command */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); hwif->OUTBSYNC(drive, WIN_PACKETCMD, IDE_COMMAND_REG); ndelay(400); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return (*handler) (drive); } @@ -1609,7 +1612,8 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive) { struct cdrom_info *info = drive
Re: [patch 4/4] ide: Break ide_lock -- remove ide_lock from piix driver
On Wed, Sep 07, 2005 at 06:06:23PM +0100, Alan Cox wrote: >On Maw, 2005-09-06 at 16:44 -0700, Ravikiran G Thirumalai wrote: >> Patch to convert piix driver to use per-driver/hwgroup lock and kill ide_lock. In the case of piix, hwgroup->lock should be sufficient. > >PIIX requires that both channels are quiescent when retuning in some cases. It wasn't totally safe before, it's now totally broken. Start by fixing the IDE layer locking properly (or forward porting my patches and then fixing them for all the refcounting changes and other stuff done since). Then the change to the piix controller in my patchset is bad. How about changing ide_lock to a per-driver lock in this case? Locking for the rest of the controllers in the system is left equivalent to what ide_lock did earlier. Can you please point me to the patchset? Thanks, Kiran - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 06:09:10PM +0100, Alan Cox wrote: >On Maw, 2005-09-06 at 16:33 -0700, Ravikiran G Thirumalai wrote: >> 2. Change the core ide code to use hwgroup->lock instead of ide_lock. Deprecate ide_lock (patch 2) > >hwgroups and IDE locking requirements are frequently completely unrelated. It's clear from the changes proposed you've not tested on real hardware for each case and you have not studied the documented errata. I tested it on a 2-way box with a piix controller. It got through Bonnie++. I have access to piix controllers only, so that was the only controller I changed. I did not read the errata though... :( Do you think the approach is unsafe, even if the piix tune routine is serialized with a per-driver lock? Bartlomiej? Thanks, Kiran
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07, 2005 at 11:19:24AM +0200, Jens Axboe wrote: >On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote: >> The following patchset breaks down the global ide_lock to per-hwgroup lock. We have taken the following approach. > >Curious, what is the point of this? On smp machines with multiple ide interfaces, we take a per-hwgroup lock instead of a global lock, thereby breaking the lock up into per-irq hwgroups. Thanks, Kiran
[patch 3/4] ide: Break ide_lock -- change controller drivers
Patch to make ide-host controllers use hwgroup lock where serialization with hwgroup->lock is necessary. Signed-off-by: Vaibhav V. Nivargi <[EMAIL PROTECTED]> Signed-off-by: Alok N. Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/legacy/ht6560b.c === --- linux-2.6.13.orig/drivers/ide/legacy/ht6560b.c 2005-08-28 19:41:01.0 -0400 +++ linux-2.6.13/drivers/ide/legacy/ht6560b.c 2005-09-06 15:57:46.113541000 -0400 @@ -256,9 +256,11 @@ { unsigned long flags; int t = HT_PREFETCH_MODE << 8; - - spin_lock_irqsave(&ide_lock, flags); - + ide_hwgroup_t *hwgroup = HWGROUP(drive); + + spin_lock_irqsave(&hwgroup->lock, flags); + spin_lock(&ide_lock); + /* * Prefetch mode and unmask irq seems to conflict */ @@ -270,9 +272,10 @@ drive->drive_data &= ~t; /* disable prefetch mode */ drive->no_unmask = 0; } - - spin_unlock_irqrestore(&ide_lock, flags); - + + spin_unlock(&ide_lock); + spin_unlock_irqrestore(&hwgroup->lock, flags); + #ifdef DEBUG printk("ht6560b: drive %s prefetch mode %sabled\n", drive->name, (state ?
"en" : "dis")); #endif @@ -282,6 +285,7 @@ { unsigned long flags; u8 timing; + ide_hwgroup_t *hwgroup = HWGROUP(drive); switch (pio) { case 8: /* set prefetch off */ @@ -291,14 +295,15 @@ } timing = ht_pio2timings(drive, pio); - - spin_lock_irqsave(_lock, flags); + + spin_lock_irqsave(>lock, flags); + spin_lock(_lock); drive->drive_data &= 0xff00; drive->drive_data |= timing; - spin_unlock_irqrestore(_lock, flags); - + spin_unlock(_lock); + spin_unlock_irqrestore(>lock, flags); #ifdef DEBUG printk("ht6560b: drive %s tuned to pio mode %#x timing=%#x\n", drive->name, pio, timing); #endif Index: linux-2.6.13/drivers/ide/pci/cmd640.c === --- linux-2.6.13.orig/drivers/ide/pci/cmd640.c 2005-08-28 19:41:01.0 -0400 +++ linux-2.6.13/drivers/ide/pci/cmd640.c 2005-09-06 15:50:35.330618750 -0400 @@ -442,11 +442,14 @@ static void set_prefetch_mode (unsigned int index, int mode) { ide_drive_t *drive = cmd_drives[index]; + ide_hwgroup_t *hwgroup = HWGROUP(drive); int reg = prefetch_regs[index]; u8 b; unsigned long flags; - spin_lock_irqsave(_lock, flags); + spin_lock_irqsave(>lock, flags); + spin_lock(_lock); + b = __get_cmd640_reg(reg); if (mode) { /* want prefetch on? */ #if CMD640_PREFETCH_MASKS @@ -462,7 +465,9 @@ b |= prefetch_masks[index]; /* disable prefetch */ } __put_cmd640_reg(reg, b); - spin_unlock_irqrestore(_lock, flags); + + spin_unlock(_lock); + spin_unlock_irqrestore(>lock, flags); } /* Index: linux-2.6.13/drivers/ide/pci/piix.c === --- linux-2.6.13.orig/drivers/ide/pci/piix.c2005-09-06 15:49:48.271677750 -0400 +++ linux-2.6.13/drivers/ide/pci/piix.c 2005-09-06 15:56:55.982408000 -0400 @@ -215,6 +215,7 @@ { ide_hwif_t *hwif= HWIF(drive); struct pci_dev *dev = hwif->pci_dev; + ide_hwgroup_t *hwgroup = HWGROUP(drive); int is_slave= (>drives[1] == drive); int master_port = hwif->channel ? 
0x42 : 0x40; int slave_port = 0x44; @@ -229,7 +230,8 @@ { 2, 3 }, }; pio = ide_get_best_pio_mode(drive, pio, 5, NULL); - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); + spin_lock(&ide_lock); pci_read_config_word(dev, master_port, &master_data); if (is_slave) { master_data = master_data | 0x4000; @@ -249,7 +251,8 @@ pci_write_config_word(dev, master_port, master_data); if (is_slave) pci_write_config_byte(dev, slave_port, slave_data); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock(&ide_lock); + spin_unlock_irqrestore(&hwgroup->lock, flags); } /**
[patch 4/4] ide: Break ide_lock -- remove ide_lock from piix driver
Patch to convert piix driver to use per-driver/hwgroup lock and kill ide_lock. In the case of piix, hwgroup->lock should be sufficient. Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/pci/piix.c === --- linux-2.6.13.orig/drivers/ide/pci/piix.c 2005-09-06 12:00:25.0 -0700 +++ linux-2.6.13/drivers/ide/pci/piix.c 2005-09-06 13:22:49.0 -0700 @@ -231,7 +231,6 @@ pio = ide_get_best_pio_mode(drive, pio, 5, NULL); spin_lock_irqsave(&hwgroup->lock, flags); - spin_lock(&ide_lock); pci_read_config_word(dev, master_port, &master_data); if (is_slave) { master_data = master_data | 0x4000; @@ -251,7 +250,6 @@ pci_write_config_word(dev, master_port, master_data); if (is_slave) pci_write_config_byte(dev, slave_port, slave_data); - spin_unlock(&ide_lock); spin_unlock_irqrestore(&hwgroup->lock, flags); }
[patch 2/4] ide: Break ide_lock -- replace ide_lock with hwgroup->lock in core ide
Patch to convert ide core code to use hwgroup lock instead of a global ide_lock. Signed-off-by: Vaibhav V. Nivargi <[EMAIL PROTECTED]> Signed-off-by: Alok N. Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/ide-cd.c === --- linux-2.6.13.orig/drivers/ide/ide-cd.c 2005-09-06 11:22:29.0 -0700 +++ linux-2.6.13/drivers/ide/ide-cd.c 2005-09-06 11:34:43.0 -0700 @@ -590,7 +590,8 @@ static void cdrom_end_request (ide_drive_t *drive, int uptodate) { - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int nsectors = rq->hard_cur_sectors; if ((rq->flags & REQ_SENSE) && uptodate) { @@ -612,10 +613,10 @@ /* * now end failed request */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); end_that_request_chunk(failed, 0, failed->data_len); end_that_request_last(failed); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); } cdrom_analyze_sense_data(drive, failed, sense); @@ -636,7 +637,8 @@ Returns 1 if the request was ended.
*/ @@ -698,10 +700,10 @@ * request sense has completed */ if (stat & ERR_STAT) { - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); cdrom_queue_request_sense(drive, rq->sense, rq); } else @@ -741,9 +743,9 @@ * take a breather relying on the * unplug timer to kick us again */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blk_plug_device(drive->queue); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return 1; } } @@ -839,6 +841,7 @@ ide_startstop_t startstop; struct cdrom_info *info = drive->driver_data; ide_hwif_t *hwif = drive->hwif; + ide_hwgroup_t *hwgroup = hwif->hwgroup; /* Wait for the controller to be idle. */ if (ide_wait_stat(&startstop, drive, 0, BUSY_STAT, WAIT_READY)) @@ -866,10 +869,10 @@ unsigned long flags; /* packet command */ - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); hwif->OUTBSYNC(drive, WIN_PACKETCMD, IDE_COMMAND_REG); ndelay(400); - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return (*handler) (drive); } @@ -1609,7 +1612,8 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive) { struct cdrom_info *info = drive->driver_data; - struct request *rq = HWGROUP(drive)->rq; + ide_hwgroup_t *hwgroup = HWGROUP(drive); + struct request *rq = hwgroup->rq; int dma_error, dma, stat, ireason, len, thislen; u8 lowcyl, highcyl; xfer_func_t *xferfunc; @@ -1737,11 +1741,11 @@ if (!rq->data_len) post_transform_command(rq); - spin_lock_irqsave(&ide_lock, flags); + spin_lock_irqsave(&hwgroup->lock, flags); blkdev_dequeue_request(rq); end_that_request_last(rq); HWGROUP(drive)->rq = NULL; - spin_unlock_irqrestore(&ide_lock, flags); + spin_unlock_irqrestore(&hwgroup->lock, flags); return ide_stopped; } Index: linux-2.6.13/drivers/ide/ide-disk.c === --- linux-2.6.13.orig/drivers/ide/ide-disk.c 2005-09-06 11:22:29.0 -0700 +++
linux-2.6.13/drivers/ide/ide-disk.c 2005-09-06 11:34:43.0 -0700 @@ -786,11 +786,12 @@ static int
[patch 1/4] ide: Break ide_lock -- Move hwif tuning code after hwif_init
Following patch moves the hwif tuning code from probe_hwif to ideprobe_init, after ideprobe_init calls hwif_init, so that all hwifs have associated hwgroups. With this patch, we should always have hwgroups for hwifs during calls to the drive tune routines. Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13/drivers/ide/ide-probe.c === --- linux-2.6.13.orig/drivers/ide/ide-probe.c 2005-09-06 11:22:29.0 -0700 +++ linux-2.6.13/drivers/ide/ide-probe.c 2005-09-06 15:12:58.0 -0700 @@ -852,8 +852,15 @@ if (!hwif->present) { ide_hwif_release_regions(hwif); - return; } +} + +static void ide_tune_drives(ide_hwif_t *hwif) +{ + unsigned int unit; + + if (!hwif->present) + return; for (unit = 0; unit < MAX_DRIVES; ++unit) { ide_drive_t *drive = &hwif->drives[unit]; @@ -1443,9 +1450,12 @@ for (index = 0; index < MAX_HWIFS; ++index) if (probe[index]) probe_hwif(&ide_hwifs[index]); - for (index = 0; index < MAX_HWIFS; ++index) - if (probe[index]) + for (index = 0; index < MAX_HWIFS; ++index) { + if (probe[index]) { hwif_init(&ide_hwifs[index]); + ide_tune_drives(&ide_hwifs[index]); + } + } for (index = 0; index < MAX_HWIFS; ++index) { if (probe[index]) { ide_hwif_t *hwif = &ide_hwifs[index];
[patch 0/4] ide: Break ide_lock to per-hwgroup lock
The following patchset breaks down the global ide_lock to a per-hwgroup lock. We have taken the following approach. 1. Move the hwif tuning code from probe_hwif to ideprobe_init, after hwif_init, so that hwgroups are present for all the hwifs when the tune routines for the hwifs are invoked (patch 1) 2. Change the core ide code to use hwgroup->lock instead of ide_lock. Deprecate ide_lock (patch 2) 3. Change the host controllers to use hwgroup->lock (patch 3) 4. Change host controller drivers to use a per-driver lock instead of ide_lock where needed, or hwgroup->lock, on a case-by-case basis. This can be done incrementally for various controllers and we will have working code between patches -- this is done now for the piix controller only. Eventually, we can change all controllers to remove ide_lock. Thanks to Bartlomiej for comments and suggestions. Patchset follows. Patchset tested on an smp box with a piix controller. Thanks, Kiran
[patch] Additions to .data.read_mostly section
Following patch moves a few static 'read mostly' variables to the .data.read_mostly section. Typically these are vector->irq tables, boot_cpu_data, node_maps etc., which are initialized once, read from often, and rarely written to. Please include. Thanks, Kiran Patch to mark variables which are usually accessed for reads with __read_mostly. Signed-off-by: Alok N Kataria <[EMAIL PROTECTED]> Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]> Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]> Index: linux-2.6.13-rc6/arch/i386/kernel/io_apic.c === --- linux-2.6.13-rc6.orig/arch/i386/kernel/io_apic.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/kernel/io_apic.c 2005-08-23 15:09:22.0 -0700 @@ -77,7 +77,7 @@ int apic, pin, next; } irq_2_pin[PIN_MAP_SIZE]; -int vector_irq[NR_VECTORS] = { [0 ... NR_VECTORS - 1] = -1}; +int vector_irq[NR_VECTORS] __read_mostly = { [0 ... NR_VECTORS - 1] = -1}; #ifdef CONFIG_PCI_MSI #define vector_to_irq(vector) \ (platform_legacy_irq(vector) ? vector : vector_irq[vector]) @@ -1127,7 +1127,7 @@ } /* irq_vectors is indexed by the sum of all RTEs in all I/O APICs. */ -u8 irq_vector[NR_IRQ_VECTORS] = { FIRST_DEVICE_VECTOR , 0 }; +u8 irq_vector[NR_IRQ_VECTORS] __read_mostly = { FIRST_DEVICE_VECTOR , 0 }; int assign_irq_vector(int irq) { @@ -1992,7 +1992,7 @@ * edge-triggered handler, without risking IRQ storms and other ugly * races.
*/ -static struct hw_interrupt_type ioapic_edge_type = { +static struct hw_interrupt_type ioapic_edge_type __read_mostly = { .typename = "IO-APIC-edge", .startup = startup_edge_ioapic, .shutdown = shutdown_edge_ioapic, @@ -2003,7 +2003,7 @@ .set_affinity = set_ioapic_affinity, }; -static struct hw_interrupt_type ioapic_level_type = { +static struct hw_interrupt_type ioapic_level_type __read_mostly = { .typename = "IO-APIC-level", .startup = startup_level_ioapic, .shutdown = shutdown_level_ioapic, @@ -2074,7 +2074,7 @@ static void end_lapic_irq (unsigned int i) { /* nothing */ } -static struct hw_interrupt_type lapic_irq_type = { +static struct hw_interrupt_type lapic_irq_type __read_mostly = { .typename = "local-APIC-edge", .startup = NULL, /* startup_irq() not used for IRQ0 */ .shutdown = NULL, /* shutdown_irq() not used for IRQ0 */ Index: linux-2.6.13-rc6/arch/i386/kernel/setup.c === --- linux-2.6.13-rc6.orig/arch/i386/kernel/setup.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/kernel/setup.c 2005-08-23 15:09:22.0 -0700 @@ -82,7 +82,7 @@ /* cpu data as detected by the assembly code in head.S */ struct cpuinfo_x86 new_cpu_data __initdata = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; /* common cpu data for all cpus */ -struct cpuinfo_x86 boot_cpu_data = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; +struct cpuinfo_x86 boot_cpu_data __read_mostly = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; EXPORT_SYMBOL(boot_cpu_data); unsigned long mmu_cr4_features; Index: linux-2.6.13-rc6/arch/i386/kernel/timers/timer_hpet.c === --- linux-2.6.13-rc6.orig/arch/i386/kernel/timers/timer_hpet.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/kernel/timers/timer_hpet.c 2005-08-23 15:09:22.0 -0700 @@ -18,8 +18,8 @@ #include "mach_timer.h" #include <asm/hpet.h> -static unsigned long __read_mostly hpet_usec_quotient; /* convert hpet clks to usec */ -static unsigned long tsc_hpet_quotient; /* convert tsc to hpet clks */ +static unsigned long hpet_usec_quotient __read_mostly; /* convert hpet clks to usec */ +static
unsigned long tsc_hpet_quotient __read_mostly; /* convert tsc to hpet clks */ static unsigned long hpet_last; /* hpet counter value at last tick */ static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */ static unsigned long last_tsc_high; /* msb 32 bits of Time Stamp Counter */ Index: linux-2.6.13-rc6/arch/i386/mm/discontig.c === --- linux-2.6.13-rc6.orig/arch/i386/mm/discontig.c 2005-08-23 15:05:49.0 -0700 +++ linux-2.6.13-rc6/arch/i386/mm/discontig.c 2005-08-23 15:09:22.0 -0700 @@ -37,7 +37,7 @@ #include <asm/mmzone.h> #include <bios_ebda.h> -struct pglist_data *node_data[MAX_NUMNODES]; +struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); bootmem_data_t node0_bdata; @@ -49,8 +49,8 @@ * 2) node_start_pfn - the starting page frame number for a node * 3) node_end_pfn - the ending page fram number for a node */ -unsigned long node_start_pfn[MAX_NUMNODES]; -unsigned long node_end_pfn[MAX_NUMNODES]; +unsigned long node_start_pfn[MAX_NUMNODES]
[patch] ide: fix kmalloc_node breakage in ide driver
Machines with ide-interfaces which do not have PCI devices are crashing
on boot at pcibus_to_node in the ide drivers. We noticed this on an x445
running 2.6.13-rc4. A similar issue was discussed earlier, but that crash
was due to hwif being NULL:
http://marc.theaimsgroup.com/?t=11207535203&r=1&w=2

Andi and Christoph had patches, but neither went in. Here's one of those
patches with an added BUG_ON(hwif == NULL). Please include.

Thanks,
Kiran

Patch fixes oops caused by ide interfaces not on PCI; pcibus_to_node
causes the kernel to crash otherwise. Patch also adds a BUG_ON to check
if hwif is NULL.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.13-rc1/drivers/ide/ide-probe.c
===================================================================
--- linux-2.6.13-rc1.orig/drivers/ide/ide-probe.c	2005-06-29 20:06:53.0 -0400
+++ linux-2.6.13-rc1/drivers/ide/ide-probe.c	2005-08-02 10:09:20.930965408 -0400
@@ -960,6 +960,15 @@
 }
 #endif	/* MAX_HWIFS > 1 */
 
+static inline int hwif_to_node(ide_hwif_t *hwif)
+{
+	if (hwif->pci_dev)
+		return pcibus_to_node(hwif->pci_dev->bus);
+	else
+		/* Add ways to determine the node of other busses here */
+		return -1;
+}
+
 /*
  * init request queue
  */
@@ -978,8 +987,7 @@
 	 * do not.
 	 */
-	q = blk_init_queue_node(do_ide_request, &ide_lock,
-			pcibus_to_node(drive->hwif->pci_dev->bus));
+	q = blk_init_queue_node(do_ide_request, &ide_lock, hwif_to_node(hwif));
 	if (!q)
 		return 1;
 
@@ -1048,6 +1056,8 @@
 	BUG_ON(in_interrupt());
 	BUG_ON(irqs_disabled());
+	BUG_ON(hwif == NULL);
+
 	down(&ide_cfg_sem);
 	hwif->hwgroup = NULL;
 #if MAX_HWIFS > 1
@@ -1097,7 +1107,7 @@
 		spin_unlock_irq(&ide_lock);
 	} else {
 		hwgroup = kmalloc_node(sizeof(ide_hwgroup_t), GFP_KERNEL,
-			pcibus_to_node(hwif->drives[0].hwif->pci_dev->bus));
+			hwif_to_node(hwif->drives[0].hwif));
 		if (!hwgroup)
 			goto out_up;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [patch] mm: Ensure proper alignment for node_remap_start_pfn
On Thu, Jul 28, 2005 at 10:20:26AM -0700, Dave Hansen wrote:
> On Wed, 2005-07-27 at 18:31 -0700, Ravikiran G Thirumalai wrote:
> > On Wed, Jul 27, 2005 at 06:17:24PM -0700, Andrew Morton wrote:
> > > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > Yes, it does cause a crash.
>
> I don't know of any NUMA x86 sub-arches that have nodes which are
> aligned on any less than 2MB.  Is this an architecture that's supported
> in the tree, today?

SRAT need not guarantee any alignment at all in the memory affinity
structure (the address is a 64-bit byte address). And yes, there are
x86-numa machines that run the latest kernel tree and face this problem.

Thanks,
Kiran
Re: [patch] x86_64: fix cpu_to_node setup for sparse apic_ids
On Wed, Jul 27, 2005 at 06:24:45PM -0700, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > While booting with SMT disabled in bios, when using acpi srat to setup
> > cpu_to_node[], sparse apic_ids create problems.  Here's a fix for that.
>
> Again, I don't have enough info here to judge the urgency of this patch.
> What are the consequences and risks of not having this patch in 2.6.13, and
> to how many machines?

Without this patch, Intel x86_64 boxes with hyperthreading disabled in
the BIOS (and which rely on SRAT for NUMA setup) end up having incorrect
values in the cpu_to_node[] arrays, causing sched domains to be built
incorrectly, etc.

Thanks,
Kiran
Re: [patch] mm: Ensure proper alignment for node_remap_start_pfn
On Wed, Jul 27, 2005 at 06:17:24PM -0700, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > While reserving KVA for lmem_maps of node, we have to make sure that
> > node_remap_start_pfn[] is aligned to a proper pmd boundary.
> > (node_remap_start_pfn[] gets its value from node_end_pfn[])
>
> What are the effects of not having this patch applied?  Does someone's
> computer crash, or what?

Yes, it does cause a crash.

Thanks,
Kiran
[patch] x86_64: fix cpu_to_node setup for sparse apic_ids
While booting with SMT disabled in bios, when using acpi srat to setup
cpu_to_node[], sparse apic_ids create problems. Here's a fix for that.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.13-rc3/arch/x86_64/mm/srat.c
===================================================================
--- linux-2.6.13-rc3.orig/arch/x86_64/mm/srat.c	2005-06-17 12:48:29.0 -0700
+++ linux-2.6.13-rc3/arch/x86_64/mm/srat.c	2005-07-27 15:36:23.0 -0700
@@ -20,6 +20,9 @@
 
 static struct acpi_table_slit *acpi_slit;
 
+/* Internal processor count */
+static unsigned int __initdata num_processors = 0;
+
 static nodemask_t nodes_parsed __initdata;
 static nodemask_t nodes_found __initdata;
 static struct node nodes[MAX_NUMNODES] __initdata;
@@ -101,16 +104,18 @@
 		bad_srat();
 		return;
 	}
-	if (pa->apic_id >= NR_CPUS) {
-		printk(KERN_ERR "SRAT: lapic %u too large.\n",
-			pa->apic_id);
+	if (num_processors >= NR_CPUS) {
+		printk(KERN_ERR "SRAT: Processor #%d (lapic %u) INVALID. (Max ID: %d).\n",
+			num_processors, pa->apic_id, NR_CPUS);
 		bad_srat();
 		return;
 	}
-	cpu_to_node[pa->apic_id] = node;
+	cpu_to_node[num_processors] = node;
 	acpi_numa = 1;
-	printk(KERN_INFO "SRAT: PXM %u -> APIC %u -> Node %u\n",
-		pxm, pa->apic_id, node);
+	printk(KERN_INFO "SRAT: PXM %u -> APIC %u -> CPU %u -> Node %u\n",
+		pxm, pa->apic_id, num_processors, node);
+
+	num_processors++;
 }
 
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */