Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-24 Thread Rafael J. Wysocki
On Sunday, 24 June 2007 02:45, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> 
> > wrote:
> >
> >> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
> >> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> >> > > 
> >> > > This fixes the problem!  Hurrah!
> >> > 
> >> > Great!  Andrew, please include the appended patch in -mm.
> >> > 
> >> > 
> >> > Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
> >> > From: Suresh Siddha <[EMAIL PROTECTED]>
> >> > 
> >> > Force irq migration path during cpu offline, is not using proper
> >> > locks and irq_chip mask/unmask routines. This will result in
> >> > some races(especially the device generating the interrupt can see
> >> > some inconsistent state, resulting in issues like stuck irq,..).
> >> > 
> >> > Appended patch fixes the issue by taking proper lock and
> >> > encapsulating irq_chip set_affinity() with a mask() before and an
> >> > unmask() after.
> >> > 
> >> > This fixes a MSI irq stuck issue reported by Darrick Wong.
> >> > 
> >> > There are several more general bugs in this area(irq migration in the
> >> > process context). For example,
> >> > 
> >> > 1. Possibility of missing edge triggered irq.
> >> > 2. Reliable method of migrating level triggered irq in the process context.
> >> > 
> >> > We plan to look and close these in the near future.
> >> 
> >> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).
> >> 
> >> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
> >> time.
> >> 
> >
> > Thanks, I dropped it.
> 
> Hmm.  It looks like Siddha sent the wrong version of the patch.
> The working tested version had an additional test to ensure
> the mask and unmask methods were implemented.
> 
> i.e.
> +	if (irq_desc[irq].chip->mask)
> +		irq_desc[irq].chip->mask(irq);
> and
> 
> +	if (irq_desc[irq].chip->unmask)
> +		irq_desc[irq].chip->unmask(irq);
> +
> 
> Siddha, think you can resend the correct version?
> 
> Rafael, think you can add those two ifs and see if your test bed box
> works?

Yes, that helps.

For reference I'm appending the complete patch that I have tested.

Greetings,
Rafael


---
 arch/x86_64/kernel/irq.c |   32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

Index: linux-2.6.22-rc5/arch/x86_64/kernel/irq.c
===================================================================
--- linux-2.6.22-rc5.orig/arch/x86_64/kernel/irq.c  2007-06-24 14:28:33.0 +0200
+++ linux-2.6.22-rc5/arch/x86_64/kernel/irq.c   2007-06-24 14:31:11.0 +0200
@@ -144,17 +144,43 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* interrupt's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
+		if (!irq_has_action(irq) ||
+		    cpus_equal(irq_desc[irq].affinity, map)) {
+			spin_unlock(&irq_desc[irq].lock);
+			continue;
+		}
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
-		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+		if (cpus_empty(mask)) {
+			break_affinity = 1;
 			mask = map;
 		}
+
+		if (irq_desc[irq].chip->mask)
+			irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		else if (!(warned++))
+			set_affinity = 0;
+
+		if (irq_desc[irq].chip->unmask)
+			irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-24 Thread Rafael J. Wysocki
On Sunday, 24 June 2007 02:28, Siddha, Suresh B wrote:
> On Sun, Jun 24, 2007 at 01:54:52AM +0200, Rafael J. Wysocki wrote:
> > This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC 
> > nx6325).
> > 
> > _cpu_down() just hangs as though there were a deadlock in there, 100% of the
> > time.
> 
> Does the patch at this URL work for you?
> 
> http://marc.info/?l=linux-kernel&m=118228358826737&w=2

Yes, it does.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Siddha, Suresh B
On Sat, Jun 23, 2007 at 06:45:05PM -0600, Eric W. Biederman wrote:
> 
> Hmm.  It looks like Siddha sent the wrong version of the patch.
> The working tested version had an additional test to ensure
> the mask and unmask methods were implemented.
> 
> i.e.
> +	if (irq_desc[irq].chip->mask)
> +		irq_desc[irq].chip->mask(irq);
> and
> 
> +	if (irq_desc[irq].chip->unmask)
> +		irq_desc[irq].chip->unmask(irq);
> +
> 
> Siddha, think you can resend the correct version?

Eric, in this version I added the irq_has_action() check and hence
removed the check which ensures the presence of mask/unmask. My tests
showed that it was working fine. Maybe I am missing something.

> 
> Rafael, think you can add those two ifs and see if your test bed box
> works?
> 
> I'm still not convinced that we can make fixup_irqs work in general
> but if we aren't going to yank it we should at least make it
> consistent with the rest of the code.

I agree.

thanks,
suresh


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Eric W. Biederman
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> 
> wrote:
>
>> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
>> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
>> > > 
>> > > This fixes the problem!  Hurrah!
>> > 
>> > Great!  Andrew, please include the appended patch in -mm.
>> > 
>> > 
>> > Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
>> > From: Suresh Siddha <[EMAIL PROTECTED]>
>> > 
>> > Force irq migration path during cpu offline, is not using proper
>> > locks and irq_chip mask/unmask routines. This will result in
>> > some races(especially the device generating the interrupt can see
>> > some inconsistent state, resulting in issues like stuck irq,..).
>> > 
>> > Appended patch fixes the issue by taking proper lock and
>> > encapsulating irq_chip set_affinity() with a mask() before and an
>> > unmask() after.
>> > 
>> > This fixes a MSI irq stuck issue reported by Darrick Wong.
>> > 
>> > There are several more general bugs in this area(irq migration in the
>> > process context). For example,
>> > 
>> > 1. Possibility of missing edge triggered irq.
>> > 2. Reliable method of migrating level triggered irq in the process context.
>> > 
>> > We plan to look and close these in the near future.
>> 
>> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC 
>> nx6325).
>> 
>> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
>> time.
>> 
>
> Thanks, I dropped it.

Hmm.  It looks like Siddha sent the wrong version of the patch.
The working tested version had an additional test to ensure
the mask and unmask methods were implemented.

i.e.
+	if (irq_desc[irq].chip->mask)
+		irq_desc[irq].chip->mask(irq);
and

+	if (irq_desc[irq].chip->unmask)
+		irq_desc[irq].chip->unmask(irq);
+

Siddha, think you can resend the correct version?

Rafael, think you can add those two ifs and see if your test bed box
works?

I'm still not convinced that we can make fixup_irqs work in general
but if we aren't going to yank it we should at least make it
consistent with the rest of the code.
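
To illustrate why the two ifs matter, here is a minimal user-space sketch
of calling an optional callback only when the chip provides it.  It is not
kernel code: mini_chip, demo_mask(), call_if_present() and DEMO_NR_IRQS are
made-up names for illustration, and the sketch only assumes that a chip may
leave mask/unmask as NULL.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_IRQS 16   /* hypothetical table size, for this sketch only */

/* Hypothetical, reduced stand-in for irq_chip: just the optional callbacks. */
struct mini_chip {
	void (*mask)(unsigned int irq);
	void (*unmask)(unsigned int irq);
};

static int demo_masked[DEMO_NR_IRQS];

/* Example callback: records that the irq was masked. */
static void demo_mask(unsigned int irq)
{
	assert(irq < DEMO_NR_IRQS);
	demo_masked[irq] = 1;
}

/* Example callback: records that the irq was unmasked again. */
static void demo_unmask(unsigned int irq)
{
	assert(irq < DEMO_NR_IRQS);
	demo_masked[irq] = 0;
}

/*
 * Call an optional callback only if it is implemented.
 * Returns 1 if the callback ran, 0 if the pointer was NULL.
 * Calling fn(irq) without this test would dereference NULL
 * for a chip that does not implement the method.
 */
static int call_if_present(void (*fn)(unsigned int), unsigned int irq)
{
	if (fn == NULL)
		return 0;
	fn(irq);
	return 1;
}
```

With a chip initialised as { NULL, NULL }, call_if_present(chip.mask, 3)
simply returns 0; with { demo_mask, demo_unmask } it runs the callback.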

Eric



Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Siddha, Suresh B
On Sun, Jun 24, 2007 at 01:54:52AM +0200, Rafael J. Wysocki wrote:
> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).
> 
> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
> time.

Does the patch at this URL work for you?

http://marc.info/?l=linux-kernel&m=118228358826737&w=2

thanks,
suresh


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Andrew Morton
On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> 
wrote:

> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> > > 
> > > This fixes the problem!  Hurrah!
> > 
> > Great!  Andrew, please include the appended patch in -mm.
> > 
> > 
> > Subject: [patch] x86_64, irq: use mask/unmask and proper locking in 
> > fixup_irqs
> > From: Suresh Siddha <[EMAIL PROTECTED]>
> > 
> > Force irq migration path during cpu offline, is not using proper
> > locks and irq_chip mask/unmask routines. This will result in
> > some races(especially the device generating the interrupt can see
> > some inconsistent state, resulting in issues like stuck irq,..).
> > 
> > Appended patch fixes the issue by taking proper lock and
> > encapsulating irq_chip set_affinity() with a mask() before and an
> > unmask() after.
> > 
> > This fixes a MSI irq stuck issue reported by Darrick Wong.
> > 
> > There are several more general bugs in this area(irq migration in the
> > process context). For example,
> > 
> > 1. Possibility of missing edge triggered irq.
> > 2. Reliable method of migrating level triggered irq in the process context.
> > 
> > We plan to look and close these in the near future.
> 
> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).
> 
> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
> time.
> 

Thanks, I dropped it.


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-23 Thread Rafael J. Wysocki
On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
> On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> > 
> > This fixes the problem!  Hurrah!
> 
> Great!  Andrew, please include the appended patch in -mm.
> 
> 
> Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
> From: Suresh Siddha <[EMAIL PROTECTED]>
> 
> Force irq migration path during cpu offline, is not using proper
> locks and irq_chip mask/unmask routines. This will result in
> some races(especially the device generating the interrupt can see
> some inconsistent state, resulting in issues like stuck irq,..).
> 
> Appended patch fixes the issue by taking proper lock and
> encapsulating irq_chip set_affinity() with a mask() before and an
> unmask() after.
> 
> This fixes a MSI irq stuck issue reported by Darrick Wong.
> 
> There are several more general bugs in this area(irq migration in the
> process context). For example,
> 
> 1. Possibility of missing edge triggered irq.
> 2. Reliable method of migrating level triggered irq in the process context.
> 
> We plan to look and close these in the near future.

This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).

_cpu_down() just hangs as though there were a deadlock in there, 100% of the
time.

Greetings,
Rafael




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> 
> This fixes the problem!  Hurrah!

Great!  Andrew, please include the appended patch in -mm.


Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
From: Suresh Siddha <[EMAIL PROTECTED]>

The forced irq migration path during cpu offline is not using proper
locks or the irq_chip mask/unmask routines. This can result in
races (in particular, the device generating the interrupt can see
inconsistent state, resulting in issues like a stuck irq).

The appended patch fixes the issue by taking the proper lock and
wrapping irq_chip set_affinity() with a mask() before and an
unmask() after.

This fixes an MSI irq stuck issue reported by Darrick Wong.

There are several more general bugs in this area (irq migration in
process context). For example:

1. Possibility of missing an edge-triggered irq.
2. No reliable method of migrating a level-triggered irq in process context.

We plan to look into and close these in the near future.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>
Reported-by: Darrick Wong <[EMAIL PROTECTED]>
---

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..55b2733 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,41 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* interrupt's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
+		if (!irq_has_action(irq) ||
+		    cpus_equal(irq_desc[irq].affinity, map)) {
+			spin_unlock(&irq_desc[irq].lock);
+			continue;
+		}
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
-		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+		if (cpus_empty(mask)) {
+			break_affinity = 1;
 			mask = map;
 		}
+
+		irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		else if (!(warned++))
+			set_affinity = 0;
+
+		irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Darrick J. Wong
On Tue, Jun 19, 2007 at 12:59:27PM -0700, Siddha, Suresh B wrote:

> hmm.. Please try this instead. This is intended only for debugging. Based on your
> test results, we can come up with a more decent fix.

This fixes the problem!  Hurrah!

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 12:06:37PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote:
> > Anyhow, Darrick there is a general bug in this area, can you try this and
> > see if it helps?
> 
> Er... that instantly locked up the system.

hmm.. Please try this instead. This is intended only for debugging. Based on your
test results, we can come up with a more decent fix.

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..3997679 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,37 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* irq's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
 		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+			break_affinity = 1;
 			mask = map;
 		}
+
+		if (irq_desc[irq].chip->mask)
+			irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
 		else if (irq_desc[irq].action && !(warned++))
+			set_affinity = 0;
+
+		if (irq_desc[irq].chip->unmask)
+			irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Darrick J. Wong
On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote:
> Anyhow, Darrick there is a general bug in this area, can you try this and
> see if it helps?

Er... that instantly locked up the system.

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Eric W. Biederman
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes:

> On Tue, Jun 19, 2007 at 11:54:45AM -0600, Eric W. Biederman wrote:
>> "Darrick J. Wong" <[EMAIL PROTECTED]> writes:
>> 
>> > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:
>> >
>> >> > 
>> >> > [  256.298787] irq=4341 affinity=d
>> >> > 
>> >> 
>> >> And just to make sure, at this point, your MSI irq 4341 affinity
>> >> (/proc/irq/4341/smp_affinity) still points to '2'?
>> >
>> > Actually, it's 0xD.  From the kernel's perspective the mask has been
>> > updated (and I even stuck a printk into set_msi_irq_affinity to verify
>> > that the writes are happening) but ... the hardware doesn't seem to
>> > reflect this.  I also tried putting read_msi_msg right afterwards to
>> > compare contents, though it complained about all the MSIs _except_ for
>> > 4341.  (Of course, I could just be way off on the effectiveness of
>> > that.)
>> 
>> The fact that MSI interrupts are having problems is odd.  It is possible
>> that we still have a bug in there somewhere, but MSI interrupts should
>> be safe to migrate outside of irq context (no known hardware bugs),
>> as we can actually synchronize with the irq source and eliminate all
>> of the migration races.
>> 
>> The non-msi case requires hitting a hardware race that is rare enough
>> you should not normally have problems.
>
> Yep. But Darrick seems to say the problem happens consistently.
>
> Anyhow, Darrick there is a general bug in this area, can you try this and
> see if it helps?

There are several general bugs in this area.  But yes, your patch
should help things, especially for MSI, where masking the irq before
migration is required.  Adding the proper locking and masking
should bring this code path much closer to how set_affinity is
expected to be called.

I just gave up on fixing these things because we can't eliminate
the races, so the real problem is the existence of this code path
with its unsupportable semantics in the first place.

> diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
> index 3eaceac..a0e11c9 100644
> --- a/arch/x86_64/kernel/irq.c
> +++ b/arch/x86_64/kernel/irq.c
> @@ -144,17 +144,35 @@ void fixup_irqs(cpumask_t map)
>  
>  	for (irq = 0; irq < NR_IRQS; irq++) {
>  		cpumask_t mask;
> +		int break_affinity = 0;
> +		int set_affinity = 1;
> +
>  		if (irq == 2)
>  			continue;
>  
> +		/* irq's are disabled at this point */
> +		spin_lock(&irq_desc[irq].lock);
> +
>  		cpus_and(mask, irq_desc[irq].affinity, map);
>  		if (any_online_cpu(mask) == NR_CPUS) {
> -			printk("Breaking affinity for irq %i\n", irq);
> +			break_affinity = 1;
>  			mask = map;
>  		}

We should really express the "any_online_cpu(mask) == NR_CPUS" test as
"cpus_empty(mask)"; it would be much clearer.

Further, we should skip the migration if "cpus_equal(mask,
irq_desc[irq].affinity)", or if "!irq_has_action(irq)" because no one has
called request_irq.
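
For illustration, the two checks suggested above can be sketched outside the
kernel. This is a minimal user-space sketch, not kernel code: toy_cpumask,
toy_fixup_decision and the FIXUP_* names are hypothetical, a 64-bit integer
stands in for cpumask_t, and the comments name the kernel helper each line is
meant to mirror.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for cpumask_t: one bit per CPU (an assumption, not the kernel type). */
typedef uint64_t toy_cpumask;

enum fixup_action {
    FIXUP_SKIP,     /* nothing to migrate */
    FIXUP_NARROW,   /* keep only the still-online part of the affinity */
    FIXUP_BREAK     /* no online CPU left: fall back to the online map */
};

/*
 * Decide what to do with one irq. 'has_action' mirrors irq_has_action():
 * zero means nobody has called request_irq() for this irq.
 */
static enum fixup_action toy_fixup_decision(toy_cpumask affinity,
                                            toy_cpumask online,
                                            int has_action,
                                            toy_cpumask *new_mask)
{
    toy_cpumask mask = affinity & online;       /* cpus_and() */

    assert(new_mask != NULL);

    if (!has_action || mask == affinity) {      /* cpus_equal(mask, affinity) */
        *new_mask = affinity;
        return FIXUP_SKIP;
    }
    if (mask == 0) {                            /* cpus_empty(mask) */
        *new_mask = online;
        return FIXUP_BREAK;
    }
    *new_mask = mask;
    return FIXUP_NARROW;
}
```

With the values from Darrick's log (affinity 0x2, online map 0xd) this takes
the FIXUP_BREAK branch and hands back 0xd as the new mask.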


> +		irq_desc[irq].chip->mask(irq);
> +
>  		if (irq_desc[irq].chip->set_affinity)
>  			irq_desc[irq].chip->set_affinity(irq, mask);
>  		else if (irq_desc[irq].action && !(warned++))
> +			set_affinity = 0;
> +
> +		irq_desc[irq].chip->unmask(irq);
> +
> +		spin_unlock(&irq_desc[irq].lock);
> +
> +		if (break_affinity && set_affinity)
> +			printk("Broke affinity for irq %i\n", irq);
> +		else if (!set_affinity)
>  			printk("Cannot set affinity for irq %i\n", irq);
>  	}
>  


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 11:54:45AM -0600, Eric W. Biederman wrote:
> "Darrick J. Wong" <[EMAIL PROTECTED]> writes:
> 
> > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:
> >
> >> > <call to set_affinity>
> >> > [  256.298787] irq=4341 affinity=d
> >> > <ethernet on irq 4341 stops working>
> >> 
> >> And just to make sure, at this point, your MSI irq 4341 affinity
> >> (/proc/irq/4341/smp_affinity) still points to '2'?
> >
> > Actually, it's 0xD.  From the kernel's perspective the mask has been
> > updated (and I even stuck a printk into set_msi_irq_affinity to verify
> > that the writes are happening) but ... the hardware doesn't seem to
> > reflect this.  I also tried putting read_msi_msg right afterwards to
> > compare contents, though it complained about all the MSIs _except_ for
> > 4341.  (Of course, I could just be way off on the effectiveness of
> > that.)
> 
> The fact that MSI interrupts are having problems is odd.  It is possible
> that we still have a bug in there somewhere but msi interrupts should
> be safe to migrate outside of irq context (no known hardware bugs),
> as we can actually synchronize with the irq source and eliminate all
> of the migration races.
> 
> The non-msi case requires hitting a hardware race that is rare enough
> you should not normally have problems.

Yep. But Darrick seems to say the problem happens consistently.

Anyhow, Darrick there is a general bug in this area, can you try this and
see if it helps?

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..a0e11c9 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,35 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* irq's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
 		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+			break_affinity = 1;
 			mask = map;
 		}
+
+		irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
 		else if (irq_desc[irq].action && !(warned++))
+			set_affinity = 0;
+
+		irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Eric W. Biederman
"Darrick J. Wong" <[EMAIL PROTECTED]> writes:

> On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:
>
>> > <call to set_affinity>
>> > [  256.298787] irq=4341 affinity=d
>> > <ethernet on irq 4341 stops working>
>> 
>> And just to make sure, at this point, your MSI irq 4341 affinity
>> (/proc/irq/4341/smp_affinity) still points to '2'?
>
> Actually, it's 0xD.  From the kernel's perspective the mask has been
> updated (and I even stuck a printk into set_msi_irq_affinity to verify
> that the writes are happening) but ... the hardware doesn't seem to
> reflect this.  I also tried putting read_msi_msg right afterwards to
> compare contents, though it complained about all the MSIs _except_ for
> 4341.  (Of course, I could just be way off on the effectiveness of
> that.)

The fact that MSI interrupts are having problems is odd.  It is possible
that we still have a bug in there somewhere but msi interrupts should
be safe to migrate outside of irq context (no known hardware bugs),
as we can actually synchronize with the irq source and eliminate all
of the migration races.

The non-msi case requires hitting a hardware race that is rare enough
you should not normally have problems.

Eric


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 12:06:37PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote:
> > Anyhow, Darrick there is a general bug in this area, can you try this and
> > see if it helps?
> 
> Er... that instantly locked up the system.

hmm.. Please try this instead. This is intended only for debug. Based on your
test results, we can come up with a more decent fix.

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..3997679 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,37 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* irq's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
 		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+			break_affinity = 1;
 			mask = map;
 		}
+
+		if (irq_desc[irq].chip->mask)
+			irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
 		else if (irq_desc[irq].action && !(warned++))
+			set_affinity = 0;
+
+		if (irq_desc[irq].chip->unmask)
+			irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
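
The visible difference from the earlier attempt is that each irq_chip hook is
tested before it is called. A minimal user-space sketch of that pattern
follows, assuming a trimmed-down stand-in for struct irq_chip. All toy_ names
are hypothetical and exist only for this example; the locking done by the
real patch is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct irq_chip. */
struct toy_irq_chip {
    void (*mask)(unsigned int irq);
    void (*unmask)(unsigned int irq);
    void (*set_affinity)(unsigned int irq, unsigned long cpumask);
};

/* Records the order in which hooks ran: 'M'ask, 'S'et_affinity, 'U'nmask. */
static char toy_trace[8];
static size_t toy_trace_len;

static void toy_log(char c)
{
    if (toy_trace_len < sizeof(toy_trace))
        toy_trace[toy_trace_len++] = c;
}

static void toy_mask(unsigned int irq) { (void)irq; toy_log('M'); }
static void toy_unmask(unsigned int irq) { (void)irq; toy_log('U'); }
static void toy_set_affinity(unsigned int irq, unsigned long cpumask)
{
    (void)irq;
    (void)cpumask;
    toy_log('S');
}

/* A chip with every hook, and one that only knows how to set affinity. */
static const struct toy_irq_chip toy_full_chip = {
    toy_mask, toy_unmask, toy_set_affinity
};
static const struct toy_irq_chip toy_bare_chip = {
    NULL, NULL, toy_set_affinity
};

/*
 * Retarget one irq: mask, set_affinity, unmask, calling each hook only if
 * the chip provides it. Returns 1 if set_affinity ran, 0 if the chip has
 * no set_affinity hook.
 */
static int toy_migrate_irq(const struct toy_irq_chip *chip,
                           unsigned int irq, unsigned long cpumask)
{
    int moved = 0;

    assert(chip != NULL);

    if (chip->mask)
        chip->mask(irq);

    if (chip->set_affinity) {
        chip->set_affinity(irq, cpumask);
        moved = 1;
    }

    if (chip->unmask)
        chip->unmask(irq);

    return moved;
}
```

Calling toy_migrate_irq() on toy_bare_chip skips the two missing hooks instead
of jumping through a NULL pointer, which is the only behavior this sketch is
meant to show.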
 


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Darrick J. Wong
On Tue, Jun 19, 2007 at 12:59:27PM -0700, Siddha, Suresh B wrote:

> hmm.. Please try this instead. This is intended only for debug. Based on your
> test results, we can come up with a more decent fix.

This fixes the problem!  Hurrah!

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-19 Thread Siddha, Suresh B
On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
 
> This fixes the problem!  Hurrah!

Great!  Andrew, please include the appended patch in -mm.


Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
From: Suresh Siddha <[EMAIL PROTECTED]>

The forced irq migration path during cpu offline is not using the proper
locks or the irq_chip mask/unmask routines. This can result in
races (in particular, the device generating the interrupt can see
inconsistent state, resulting in issues like a stuck irq).

The appended patch fixes the issue by taking the proper lock and
wrapping irq_chip set_affinity() with a mask() before and an
unmask() after.

This fixes an MSI irq stuck issue reported by Darrick Wong.

There are several more general bugs in this area (irq migration in
process context). For example,

1. Possibility of missing edge triggered irq.
2. Reliable method of migrating level triggered irq in the process context.

We plan to look at and close these in the near future.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>
Reported-by: Darrick Wong <[EMAIL PROTECTED]>
---

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..55b2733 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,41 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* interrupt's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
+		if (!irq_has_action(irq) ||
+		    cpus_equal(irq_desc[irq].affinity, map)) {
+			spin_unlock(&irq_desc[irq].lock);
+			continue;
+		}
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
-		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+		if (cpus_empty(mask)) {
+			break_affinity = 1;
 			mask = map;
 		}
+
+		irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		else if (!(warned++))
+			set_affinity = 0;
+
+		irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-18 Thread Darrick J. Wong
On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:

> > <call to set_affinity>
> > [  256.298787] irq=4341 affinity=d
> > <ethernet on irq 4341 stops working>
> 
> And just to make sure, at this point, your MSI irq 4341 affinity
> (/proc/irq/4341/smp_affinity) still points to '2'?

Actually, it's 0xD.  From the kernel's perspective the mask has been
updated (and I even stuck a printk into set_msi_irq_affinity to verify
that the writes are happening) but ... the hardware doesn't seem to
reflect this.  I also tried putting read_msi_msg right afterwards to
compare contents, though it complained about all the MSIs _except_ for
4341.  (Of course, I could just be way off on the effectiveness of
that.)

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-18 Thread Siddha, Suresh B
On Mon, Jun 18, 2007 at 03:38:20PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 07, 2007 at 05:57:26PM -0700, Siddha, Suresh B wrote:
> 
> > As you have the failing system, you need to do more detective work and
> > help me out. Can you try this debug patch and send across the dmesg after the
> > bug happens and also can you try different compiler to see if something
> > changes..
> 
> Hrm, I just updated to -rc5.  Interrupts being handled by the IOAPIC
> don't suffer from this problem, but MSI interrupts are still affected.
> I added a few printks to the kernel to figure out what IRQ affinity
> masks were being passed around and saw this:
> 
> [  256.298773] Breaking affinity for irq 4341
> [  256.298774] irq=4341 affinity=2 mask=d
> <call to set_affinity>
> [  256.298787] irq=4341 affinity=d
> <ethernet on irq 4341 stops working>

And just to make sure, at this point, your MSI irq 4341 affinity
(/proc/irq/4341/smp_affinity) still points to '2'?

> I'll keep digging, but at least it appears that the problem has been
> shrunk down to something in the MSI code.

thanks,
suresh


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-18 Thread Darrick J. Wong
On Thu, Jun 07, 2007 at 05:57:26PM -0700, Siddha, Suresh B wrote:

> As you have the failing system, you need to do more detective work and
> help me out. Can you try this debug patch and send across the dmesg after the
> bug happens and also can you try different compiler to see if something
> changes..

Hrm, I just updated to -rc5.  Interrupts being handled by the IOAPIC
don't suffer from this problem, but MSI interrupts are still affected.
I added a few printks to the kernel to figure out what IRQ affinity
masks were being passed around and saw this:

[  256.298773] Breaking affinity for irq 4341
[  256.298774] irq=4341 affinity=2 mask=d
<call to set_affinity>
[  256.298787] irq=4341 affinity=d
<ethernet on irq 4341 stops working>

I'll keep digging, but at least it appears that the problem has been
shrunk down to something in the MSI code.
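
To read the masks in the log above: each hex value is a bitmap of CPUs, with
bit N standing for CPU N. A small sketch that decodes them, assuming the
mask= field in the added printk is the online-CPU map passed to fixup_irqs()
(the helper name toy_mask_to_list is made up for this example):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the CPU numbers set in 'mask' into 'buf' as a comma-separated list
 * such as "0,2,3". Returns how many CPUs were written.
 */
static int toy_mask_to_list(unsigned long mask, char *buf, size_t len)
{
    int cpu, count = 0;
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (cpu = 0; mask != 0; cpu++, mask >>= 1) {
        int n;

        if (!(mask & 1UL))
            continue;
        n = snprintf(buf + used, len - used, "%s%d", count ? "," : "", cpu);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* out of room: drop the partial entry */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With these values, 0x2 decodes to CPU 1 and 0xd to CPUs 0, 2 and 3. The two
sets do not overlap, which is consistent with the "Breaking affinity" message.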

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-07 Thread Siddha, Suresh B
On Wed, Jun 06, 2007 at 04:16:42PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote:
> 
> > Weird. Then the bug can only happen if for some reason, "mask = map"
> > didn't happen in fixup_irqs(). Can you send us the disassembly of the
> > fixup_irqs()?
> 
> Attached.

hmm.. Darrick, I can't find anything wrong in there.

I am very much puzzled. The main thing I am confused about is how
"/proc/irq/<irq#-hung>/smp_affinity" can still be pointing at the old
offlined cpu while the calls to set_affinity() with the cpu_online_map mask
in fixup_irqs() don't show any failure.

As you have the failing system, you need to do more detective work and
help me out. Can you try this debug patch and send across the dmesg after the
bug happens and also can you try different compiler to see if something
changes..

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..fc2a576 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -152,9 +152,11 @@ void fixup_irqs(cpumask_t map)
 			printk("Breaking affinity for irq %i\n", irq);
 			mask = map;
 		}
-		if (irq_desc[irq].chip->set_affinity)
+		if (irq_desc[irq].chip->set_affinity) {
+			printk("calling set affinity for %i, with mask %lx\n",
+				irq, cpus_addr(mask)[0]);
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		} else if (irq_desc[irq].action && !(warned++))
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-06 Thread Darrick J. Wong
On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote:

> Weird. Then the bug can only happen if for some reason, "mask = map"
> didn't happen in fixup_irqs(). Can you send us the disassembly of the
> fixup_irqs()?

Attached.

--D
(gdb) disassemble fixup_irqs
Dump of assembler code for function fixup_irqs:
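0x8020bf50 <fixup_irqs+0>:	push   %rbp
0x8020bf51 <fixup_irqs+1>:	mov    %rsp,%rbp
0x8020bf54 <fixup_irqs+4>:	push   %r13
0x8020bf56 <fixup_irqs+6>:	xor    %r13d,%r13d
0x8020bf59 <fixup_irqs+9>:	push   %r12
0x8020bf5b <fixup_irqs+11>:	push   %rbx
0x8020bf5c <fixup_irqs+12>:	sub    $0x28,%rsp
0x8020bf60 <fixup_irqs+16>:	mov    %rdi,0xffc0(%rbp)
0x8020bf64 <fixup_irqs+20>:	mov    %rsi,0xffc8(%rbp)
0x8020bf68 <fixup_irqs+24>:	jmp    0x8020bf73 <fixup_irqs+35>
0x8020bf6a <fixup_irqs+26>:	inc    %r13d
0x8020bf6d <fixup_irqs+29>:	cmp    $0x2,%r13d
0x8020bf71 <fixup_irqs+33>:	je     0x8020bf6a <fixup_irqs+26>
0x8020bf73 <fixup_irqs+35>:	mov    %r13d,%r12d
0x8020bf76 <fixup_irqs+38>:	lea    0xffd0(%rbp),%rbx
0x8020bf7a <fixup_irqs+42>:	lea    0xffc0(%rbp),%rdx
0x8020bf7e <fixup_irqs+46>:	shl    $0x8,%r12
0x8020bf82 <fixup_irqs+50>:	mov    $0x80,%ecx
0x8020bf87 <fixup_irqs+55>:	lea    0x805505f8(%r12),%rsi
0x8020bf8f <fixup_irqs+63>:	mov    %rbx,%rdi
0x8020bf92 <fixup_irqs+66>:	callq  0x802fb606 <__bitmap_and>
0x8020bf97 <fixup_irqs+71>:	mov    %rbx,%rdi
0x8020bf9a <fixup_irqs+74>:	callq  0x802fc6ad <__any_online_cpu>
0x8020bf9f <fixup_irqs+79>:	add    $0xff80,%eax
0x8020bfa2 <fixup_irqs+82>:	jne    0x8020bfc5 <fixup_irqs+117>
0x8020bfa4 <fixup_irqs+84>:	mov    %r13d,%esi
0x8020bfa7 <fixup_irqs+87>:	mov    $0x804a52b0,%rdi
0x8020bfae <fixup_irqs+94>:	xor    %eax,%eax
0x8020bfb0 <fixup_irqs+96>:	callq  0x80233d28
0x8020bfb5 <fixup_irqs+101>:	mov    0xffc0(%rbp),%rax
0x8020bfb9 <fixup_irqs+105>:	mov    %rax,0xffd0(%rbp)
0x8020bfbd <fixup_irqs+109>:	mov    0xffc8(%rbp),%rax
0x8020bfc1 <fixup_irqs+113>:	mov    %rax,0xffd8(%rbp)
0x8020bfc5 <fixup_irqs+117>:	mov    0x80550588(%r12),%rax
0x8020bfcd <fixup_irqs+125>:	mov    0x58(%rax),%rax
0x8020bfd1 <fixup_irqs+129>:	test   %rax,%rax
0x8020bfd4 <fixup_irqs+132>:	je     0x8020bfe5 <fixup_irqs+149>
0x8020bfd6 <fixup_irqs+134>:	mov    0xffd0(%rbp),%rsi
0x8020bfda <fixup_irqs+138>:	mov    0xffd8(%rbp),%rdx
0x8020bfde <fixup_irqs+142>:	mov    %r13d,%edi
0x8020bfe1 <fixup_irqs+145>:	callq  *%rax
0x8020bfe3 <fixup_irqs+147>:	jmp    0x8020c013 <fixup_irqs+195>
0x8020bfe5 <fixup_irqs+149>:	cmpq   $0x0,0x805505a8(%r12)
0x8020bfee <fixup_irqs+158>:	je     0x8020c013 <fixup_irqs+195>
0x8020bff0 <fixup_irqs+160>:	mov    5181486(%rip),%eax        # 0x806fd024
0x8020bff6 <fixup_irqs+166>:	inc    %eax
0x8020bff8 <fixup_irqs+168>:	mov    %eax,5181478(%rip)        # 0x806fd024
0x8020bffe <fixup_irqs+174>:	dec    %eax
0x8020c000 <fixup_irqs+176>:	jne    0x8020c013 <fixup_irqs+195>
0x8020c002 <fixup_irqs+178>:	mov    %r13d,%esi
0x8020c005 <fixup_irqs+181>:	mov    $0x804a52ce,%rdi
0x8020c00c <fixup_irqs+188>:	xor    %eax,%eax
0x8020c00e <fixup_irqs+190>:	callq  0x80233d28
0x8020c013 <fixup_irqs+195>:	lea    0x1(%r13),%eax
0x8020c017 <fixup_irqs+199>:	cmp    $0x10ff,%eax
0x8020c01c <fixup_irqs+204>:	jbe    0x8020bf6a <fixup_irqs+26>
0x8020c022 <fixup_irqs+210>:	callq  0x8024e46e
0x8020c027 <fixup_irqs+215>:	sti
0x8020c028 <fixup_irqs+216>:	mov    $0x418958,%edi
0x8020c02d <fixup_irqs+221>:	callq  0x803018cf <__const_udelay>
0x8020c032 <fixup_irqs+226>:	cli
0x8020c033 <fixup_irqs+227>:	callq  0x8024cf31
0x8020c038 <fixup_irqs+232>:	add    $0x28,%rsp
0x8020c03c <fixup_irqs+236>:	pop    %rbx
0x8020c03d <fixup_irqs+237>:	pop    %r12
0x8020c03f <fixup_irqs+239>:	pop    %r13
0x8020c041 <fixup_irqs+241>:	leaveq
0x8020c042 <fixup_irqs+242>:	retq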


signature.asc
Description: Digital signature


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-06 Thread Siddha, Suresh B
On Wed, Jun 06, 2007 at 11:58:29AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 06:37:59PM -0700, Siddha, Suresh B wrote:
> > On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
> > >  
> > > > Can you send us your system's dmesg as well as the output of
> > > > /proc/interrupts?
> > > 
> > > http://sweaglesw.net/~djwong/docs/dmesg
> > > http://sweaglesw.net/~djwong/docs/interrupts
> > 
> > Didn't find anything wrong in that information. Can you try this
> > appended debug patch and see if you see this error msg in dmesg, when you
> > hit the bug? Thanks.
> 
> I don't see that message.

Weird. Then the bug can only happen if for some reason, "mask = map"
didn't happen in fixup_irqs(). Can you send us the disassembly of the
fixup_irqs()?

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-06 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 06:37:59PM -0700, Siddha, Suresh B wrote:
> On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
> >  
> > > Can you send us your system's dmesg as well as the output of /proc/interrupts?
> > 
> > http://sweaglesw.net/~djwong/docs/dmesg
> > http://sweaglesw.net/~djwong/docs/interrupts
> 
> Didn't find anything wrong in that information. Can you try this
> appended debug patch and see if you see this error msg in dmesg, when you
> hit the bug? Thanks.

I don't see that message.

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
>  
> > Can you send us your system's dmesg as well as the output of /proc/interrupts?
> 
> http://sweaglesw.net/~djwong/docs/dmesg
> http://sweaglesw.net/~djwong/docs/interrupts

Didn't find anything wrong in that information. Can you try this
appended debug patch and see if you see this error msg in dmesg, when you
hit the bug? Thanks.

diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index d8bfe31..3409c1f 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -720,10 +720,13 @@ static int assign_irq_vector(int irq, cpumask_t mask)
 {
 	int err;
 	unsigned long flags;
+	int cpu = smp_processor_id();
 
 	spin_lock_irqsave(&vector_lock, flags);
 	err = __assign_irq_vector(irq, mask);
 	spin_unlock_irqrestore(&vector_lock, flags);
+	if (err && !cpu_isset(cpu, cpu_online_map))
+		printk("assigning irq to a vector failed : %d\n", err);
 	return err;
 }
 


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
 
> Can you send us your system's dmesg as well as the output of /proc/interrupts?

http://sweaglesw.net/~djwong/docs/dmesg
http://sweaglesw.net/~djwong/docs/interrupts

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 01:09:54PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 11:40:15AM -0700, Siddha, Suresh B wrote:
> 
> > Does this problem happen only under certain stress or something simple, like
> > 
> > boot the kernel
> > echo 2 > /proc/irq/114/smp_affinity
> > wait for irq to hit the cpu1.
> > echo 0 > /sys/devices/system/cpu/cpu1/online
> > 
> > will immediately trigger this?
> 
> The system is not under any stress at all.

Can you send us your system's dmesg as well as the output of /proc/interrupts?

thanks,
suresh


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 11:40:15AM -0700, Siddha, Suresh B wrote:

> Does this problem happen only under certain stress or something simple, like
> 
> boot the kernel
> echo 2 > /proc/irq/114/smp_affinity
> wait for irq to hit the cpu1.
> echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> will immediately trigger this?

The system is not under any stress at all.

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 11:33:01AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 11:13:42AM -0700, Siddha, Suresh B wrote:
> > I see. Your system should have 4 or 8 logical cpu's right. So you must be
> > using logical flat mode, right?
> 
> I believe so.  The system has two Xeon 5150s with an Intel 5000 chipset
> of some sort.
> 
> > When this bug happens, what does /proc/irq/<irq-no>/smp_affinity show?
> 
> [EMAIL PROTECTED]:~# cat /proc/irq/114/smp_affinity 
> 02

Ok. What this shows is that fixup_irqs() failed to move the irq properly.
Ideally we should see cpu_online_map here (i.e., 0xfd).

So most likely __assign_irq_vector() failed for some reason, and I am
puzzled as to why...

Does this problem happen only under certain stress or something simple, like

boot the kernel
echo 2 > /proc/irq/114/smp_affinity
wait for irq to hit the cpu1.
echo 0 > /sys/devices/system/cpu/cpu1/online

will immediately trigger this?

thanks,
suresh


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 11:13:42AM -0700, Siddha, Suresh B wrote:
> I see. Your system should have 4 or 8 logical cpu's right. So you must be
> using logical flat mode, right?

I believe so.  The system has two Xeon 5150s with an Intel 5000 chipset
of some sort.

> When this bug happens, what does /proc/irq/<irq-no>/smp_affinity show?

[EMAIL PROTECTED]:~# cat /proc/irq/114/smp_affinity 
02

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Tue, Jun 05, 2007 at 10:36:47AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 10:23:10AM -0700, Siddha, Suresh B wrote:
> 
> > Darrick, I see a kernel bug in this area(which is already filled with bugs,
> > and I am looking into ways to fix them). Are you making sure that
> > between step-1 and step-2, that interrupts actually started arriving at 
> > cpu1?
> > 
> > i.e., do step-1 and wait till the irq's start hitting at cpu1. At this point
> > do step-2 and let us know if you still hit this bug?
> 
> Yes, the bug only happens after CPU1 begins to receive interrupts.

I see. Your system should have 4 or 8 logical cpu's right. So you must be
using logical flat mode, right?

When this bug happens, what does /proc/irq/<irq-no>/smp_affinity show?

thanks,
suresh


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Darrick J. Wong
On Tue, Jun 05, 2007 at 10:23:10AM -0700, Siddha, Suresh B wrote:

> Darrick, I see a kernel bug in this area(which is already filled with bugs,
> and I am looking into ways to fix them). Are you making sure that
> between step-1 and step-2, that interrupts actually started arriving at cpu1?
> 
> i.e., do step-1 and wait till the irq's start hitting at cpu1. At this point
> do step-2 and let us know if you still hit this bug?

Yes, the bug only happens after CPU1 begins to receive interrupts.

> > There exists a similar scenario.  Set the IRQ affinity to a bunch of
> > CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> > interrupts, then offline that CPU.  The kernel does not reroute the IRQ
> > to any of the other CPUs and the device also hangs.
> 
> Is this a theory or did you observe this problem happening?

Nope, I've observed this situation too.

--D




Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-05 Thread Siddha, Suresh B
On Thu, May 31, 2007 at 05:44:27PM -0700, Darrick J. Wong wrote:
> Hi there,
> 
> I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
> about offlining CPUs.  I suspect that this problem extends beyond a
> particular machine, as I've been able to replicate it with an IBM x3650
> and an IBM x3755.  This is what I'm doing:
> 
> 1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity (IRQ
> 4341 is the network card and we're picking on CPU1 in this example):
> echo 2 > /proc/irq/4341/smp_affinity

Darrick, I see a kernel bug in this area(which is already filled with bugs,
and I am looking into ways to fix them). Are you making sure that
between step-1 and step-2, that interrupts actually started arriving at cpu1?

i.e., do step-1 and wait till the irq's start hitting at cpu1. At this point
do step-2 and let us know if you still hit this bug?

> 
> 2) I then take CPU1 offline:
> echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> 3) The kernel prints this:
> [ 1101.968040] Breaking affinity for irq 4341
> [ 1102.074019] CPU 1 is now offline
> [ 1102.081593] lockdep: not fixing up alternatives.
> [ 1112.886919] nfs: server 9.47.66.169 not responding, still trying
> 
> After step 2 the system never sees interrupts from the network card and
> remains hung like that until CPU1 is brought back up.  It looks as
> though the kernel is trying to reroute the IRQ (or so I'm assuming from
> the "Breaking affinity" message), but this doesn't ever happen, so the
> kernel stops seeing interrupts from the device.
> 
> Granted, one should not be offlining the CPU that is currently
> designated to handle an IRQ, but I suspect that the kernel ought at a
> minimum to reject the offlining or route the IRQ to any online CPU
> instead of screwing things up.
> 
> There exists a similar scenario.  Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> interrupts, then offline that CPU.  The kernel does not reroute the IRQ
> to any of the other CPUs and the device also hangs.

Is this a theory or did you observe this problem happening?

thanks,
suresh

> 
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21.  Will git-bisect further, but I wanted to know if
> anyone else has seen this sort of problem.  afaik, this seems to happen
> with both IOAPIC and MSI interrupts, possibly more.


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-03 Thread Emmanuel Fusté
> 
> This is just getting confusing.
> 
> Emmanuel Fusté.  Please play with /proc/irq/*/smp_affinity by hand and
> confirm that you can move your irqs.  This will confirm it is the decision
> part.
> 
Ok, as planned, you're right ;-) , playing with
/proc/irq/*/smp_affinity let me move irqs.

Emmanuel.


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Eric W. Biederman
"Darrick J. Wong" <[EMAIL PROTECTED]> writes:

> On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote:
>
>> I doubt it.  The practical problem is that cpu_down does not
>> and by design can not call the irq balancing part properly
>> and I haven't yet seen anything to suggest that we don't migrate
>> irq properly.
>> 
>> So I'm guessing it was the decision part.
>
> I'm not using any IRQ balancer, afaik.  As I recall, CONFIG_IRQBALANCE
> is i386-only, and I'm not running the userland irqbalance program
> either.  Just messing around with /proc/irq/*/smp_affinity by hand. :)

This is just getting confusing.

Emmanuel Fusté.  Please play with /proc/irq/*/smp_affinity by hand and
confirm that you can move your irqs.  This will confirm it is the decision
part.

Darrick.  The cpu hotplug architecture makes it impossible to properly
call irq migration code that backs /proc/irq/*/smp_affinity.  Therefore
the cpu hotplug interface to irq migration is broken by design.  There
are some other bugs in the implementation of migrating irqs off of cpus
as well.  I'm pretty certain that some combination of those problems is
biting you.

Eric


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Darrick J. Wong
On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote:

> I doubt it.  The practical problem is that cpu_down does not
> and by design can not call the irq balancing part properly
> and I haven't yet seen anything to suggest that we don't migrate
> irq properly.
> 
> So I'm guessing it was the decision part.

I'm not using any IRQ balancer, afaik.  As I recall, CONFIG_IRQBALANCE
is i386-only, and I'm not running the userland irqbalance program
either.  Just messing around with /proc/irq/*/smp_affinity by hand. :)

--D


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Eric W. Biederman

> As a side note, on my very old SMP machine, 2.6.20 correctly
> load-balances IRQs across CPUs but 2.6.21 does not. I know that the
> in-kernel IRQ load balancer is marked as deprecated and
> somewhat broken, but your report makes me think it
> could be a bug in the IRQ rerouting part in my case too, and
> not necessarily in the load-balancer (decision) part.

I doubt it.  The practical problem is that cpu_down does not
and by design can not call the irq balancing part properly
and I haven't yet seen anything to suggest that we don't migrate
irq properly.

So I'm guessing it was the decision part.

Eric


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Emmanuel Fusté
> There exists a similar scenario.  Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> interrupts, then offline that CPU.  The kernel does not reroute the IRQ
> to any of the other CPUs and the device also hangs.
>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21.  Will git-bisect further, but I wanted to know if
> anyone else has seen this sort of problem.  afaik, this seems to happen
> with both IOAPIC and MSI interrupts, possibly more.

Hi,
As a side note, on my very old SMP machine, 2.6.20 correctly
load-balances IRQs across CPUs but 2.6.21 does not. I know that the
in-kernel IRQ load balancer is marked as deprecated and
somewhat broken, but your report makes me think it
could be a bug in the IRQ rerouting part in my case too, and
not necessarily in the load-balancer (decision) part.

Emmanuel.



Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Eric W. Biederman
"Darrick J. Wong" <[EMAIL PROTECTED]> writes:

> Hi there,
>
> I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
> about offlining CPUs.  I suspect that this problem extends beyond a
> particular machine, as I've been able to replicate it with an IBM x3650
> and an IBM x3755.  This is what I'm doing:
>
> 1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity (IRQ
> 4341 is the network card and we're picking on CPU1 in this example):
> echo 2 > /proc/irq/4341/smp_affinity
>
> 2) I then take CPU1 offline:
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> 3) The kernel prints this:
> [ 1101.968040] Breaking affinity for irq 4341
> [ 1102.074019] CPU 1 is now offline
> [ 1102.081593] lockdep: not fixing up alternatives.
> [ 1112.886919] nfs: server 9.47.66.169 not responding, still trying
>
> After step 2 the system never sees interrupts from the network card and
> remains hung like that until CPU1 is brought back up.  It looks as
> though the kernel is trying to reroute the IRQ (or so I'm assuming from
> the "Breaking affinity" message), but this doesn't ever happen, so the
> kernel stops seeing interrupts from the device.
>
> Granted, one should not be offlining the CPU that is currently
> designated to handle an IRQ, but I suspect that the kernel ought at a
> minimum to reject the offlining or route the IRQ to any online CPU
> instead of screwing things up.

I agree.

> There exists a similar scenario.  Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> interrupts, then offline that CPU.  The kernel does not reroute the IRQ
> to any of the other CPUs and the device also hangs.
>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21.  Will git-bisect further, but I wanted to know if
> anyone else has seen this sort of problem.  afaik, this seems to happen
> with both IOAPIC and MSI interrupts, possibly more.

Thanks for the bug report.  I'm chuckling because I just submitted a
patch to count that whole code path as broken, based on code review.
It is trying to do something that the hardware can not reliably
accomplish.

Now I am surprised you were seeing this with MSI as well because
the hardware should theoretically work in that case.  However the
irq_fixup code has enough issues that I wouldn't be surprised if
it was just doing something stupid and wrong.

Eric


Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-01 Thread Eric W. Biederman
"Darrick J. Wong" <[EMAIL PROTECTED]> writes:

> On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote:
>
> > I doubt it.  The practical problem is that cpu_down does not
> > and by design can not call the irq balancing part properly
> > and I haven't yet seen anything to suggest that we don't migrate
> > irq properly.
> >
> > So I'm guessing it was the decision part.
>
> I'm not using any IRQ balancer, afaik.  As I recall, CONFIG_IRQBALANCE
> is i386-only, and I'm not running the userland irqbalance program
> either.  Just messing around with /proc/irq/*/smp_affinity by hand. :)

This is just getting confusing.

Emmanuel Fusté.  Please play with /proc/irq/*/smp_affinity by hand and
confirm that you can move your irqs.  This will confirm it is the decision
part.
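
One way to run that check is to snapshot the per-CPU counts for the IRQ, change smp_affinity, wait a little, and snapshot again; only the newly selected CPU's count should have grown. The following is a rough sketch under stated assumptions: irq_counts and count_deltas are hypothetical helper names, and the parsing assumes the usual /proc/interrupts layout (a header row of CPU labels, then one row per IRQ starting with "IRQ:").

```sh
# irq_counts IRQ [FILE]: print the per-CPU counts for one IRQ.
# FILE defaults to /proc/interrupts. Hypothetical helper, not an existing tool.
irq_counts() {
    awk -v irq="$1:" '
        NR == 1 { ncpu = NF; next }           # header row: one label per CPU
        $1 == irq {
            for (i = 2; i <= ncpu + 1; i++)   # fields 2..ncpu+1 are the counts
                printf "%s%s", (i > 2 ? " " : ""), $i
            print ""
            exit
        }
    ' "${2:-/proc/interrupts}"
}

# count_deltas "BEFORE" "AFTER": print how much each per-CPU count grew.
count_deltas() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 1; i <= NF; i++) before[i] = $i; next }
        {
            for (i = 1; i <= NF; i++)
                printf "%s%d", (i > 1 ? " " : ""), $i - before[i]
            print ""
        }
    '
}

# Intended use, as root (the IRQ number is whatever your device uses):
#   before=$(irq_counts 4341)
#   echo 1 > /proc/irq/4341/smp_affinity
#   sleep 5
#   count_deltas "$before" "$(irq_counts 4341)"
```

If the deltas stay on the old CPU after the affinity change, the irq did not move.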

Darrick.  The cpu hotplug architecture makes it impossible to properly
call irq migration code that backs /proc/irq/*/smp_affinity.  Therefore
the cpu hotplug interface to irq migration is broken by design.  There
are some other bugs in the implementation of migrating irqs off of cpus
as well.  I'm pretty certain that some combination of those problems is
biting you.

Eric


Device hang when offlining a CPU due to IRQ misrouting

2007-05-31 Thread Darrick J. Wong
Hi there,

I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
about offlining CPUs.  I suspect that this problem extends beyond a
particular machine, as I've been able to replicate it with an IBM x3650
and an IBM x3755.  This is what I'm doing:

1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity (IRQ
4341 is the network card and we're picking on CPU1 in this example):
echo 2 > /proc/irq/4341/smp_affinity

2) I then take CPU1 offline:
echo 0 > /sys/devices/system/cpu/cpu1/online

3) The kernel prints this:
[ 1101.968040] Breaking affinity for irq 4341
[ 1102.074019] CPU 1 is now offline
[ 1102.081593] lockdep: not fixing up alternatives.
[ 1112.886919] nfs: server 9.47.66.169 not responding, still trying

After step 2 the system never sees interrupts from the network card and
remains hung like that until CPU1 is brought back up.  It looks as
though the kernel is trying to reroute the IRQ (or so I'm assuming from
the "Breaking affinity" message), but this doesn't ever happen, so the
kernel stops seeing interrupts from the device.
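
For reference, the value written to smp_affinity is a hexadecimal CPU bitmask in which bit N stands for CPU N, so the 2 in step 1 selects CPU1 only. A minimal sketch of that conversion follows; cpus_to_mask is a hypothetical helper name, and the sketch assumes CPU numbers below 32 so that the mask fits in a single 32-bit group.

```sh
# cpus_to_mask CPU...: print the hex bitmask for a list of CPU numbers.
# Hypothetical helper, not part of any kernel tool.
# Assumes every CPU number is below 32.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 1          # CPU1 only: prints 2, the value used in step 1
cpus_to_mask 0 1 2 3    # CPUs 0-3: prints f
```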

Granted, one should not be offlining the CPU that is currently
designated to handle an IRQ, but I suspect that the kernel ought at a
minimum to reject the offlining or route the IRQ to any online CPU
instead of screwing things up.
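
The fallback asked for here (route the IRQ to any online CPU once the requested one goes away) comes down to a small mask computation. The sketch below is only an illustration in shell arithmetic, not the kernel's fixup code; pick_target_mask is a hypothetical name, and masks are passed as hex without a 0x prefix.

```sh
# pick_target_mask AFFINITY ONLINE CPU
#   AFFINITY  requested affinity mask (hex)
#   ONLINE    mask of currently online CPUs (hex)
#   CPU       number of the CPU about to go offline
# Prints the mask the IRQ could be routed to afterwards.
# Hypothetical illustration, not kernel code.
pick_target_mask() {
    affinity=$(( 0x$1 ))
    online=$(( 0x$2 & ~(1 << $3) ))    # CPUs that remain online
    target=$(( affinity & online ))
    if [ "$target" -eq 0 ]; then
        # None of the requested CPUs remains: use any remaining online CPU.
        target=$online
    fi
    printf '%x\n' "$target"
}

pick_target_mask 2 f 1   # affinity was CPU1 only: prints d (CPUs 0, 2, 3)
pick_target_mask 6 f 1   # affinity was CPUs 1-2: prints 4 (CPU2)
```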

There exists a similar scenario.  Set the IRQ affinity to a bunch of
CPUs, watch /proc/interrupts to see which CPU is actually servicing the
interrupts, then offline that CPU.  The kernel does not reroute the IRQ
to any of the other CPUs and the device also hangs.
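
To pick out the CPU that is actually servicing the interrupts in this second scenario, the per-CPU counts in /proc/interrupts can be compared. A rough sketch, assuming the usual layout (a header row of CPU labels, then one row per IRQ starting with "IRQ:"); busiest_cpu is a hypothetical helper name.

```sh
# busiest_cpu IRQ [FILE]: print the header label (for example CPU1) of the
# column holding the largest count for the given IRQ.
# FILE defaults to /proc/interrupts. Hypothetical helper, not an existing tool.
busiest_cpu() {
    awk -v irq="$1:" '
        # Header row: remember one label per CPU column.
        NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) label[i] = $i; next }
        # Row for the requested IRQ: field 1 is "IRQ:", the counts follow.
        $1 == irq {
            best = 1; max = -1
            for (i = 1; i <= ncpu; i++) {
                n = $(i + 1) + 0
                if (n > max) { max = n; best = i }
            }
            print label[best]
            exit
        }
    ' "${2:-/proc/interrupts}"
}

# Intended use: busiest_cpu 4341
```

The counts are cumulative since boot, so the largest column shows where most interrupts have landed so far, which is not necessarily where they are landing right now.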

The furthest that I've dug is that it works on 2.6.17 and is broken in
2.6.22-rc3 and 2.6.21.  Will git-bisect further, but I wanted to know if
anyone else has seen this sort of problem.  afaik, this seems to happen
with both IOAPIC and MSI interrupts, possibly more.

--D

