Re: Device hang when offlining a CPU due to IRQ misrouting
On Sunday, 24 June 2007 02:45, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
>
> > On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote:
> >
> >> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
> >> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> >> > >
> >> > > This fixes the problem! Hurrah!
> >> >
> >> > Great! Andrew, please include the appended patch in -mm.
> >> >
> >> >
> >> > Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
> >> > From: Suresh Siddha <[EMAIL PROTECTED]>
> >> >
> >> > Force irq migration path during cpu offline, is not using proper
> >> > locks and irq_chip mask/unmask routines. This will result in
> >> > some races(especially the device generating the interrupt can see
> >> > some inconsistent state, resulting in issues like stuck irq,..).
> >> >
> >> > Appended patch fixes the issue by taking proper lock and
> >> > encapsulating irq_chip set_affinity() with a mask() before and an
> >> > unmask() after.
> >> >
> >> > This fixes a MSI irq stuck issue reported by Darrick Wong.
> >> >
> >> > There are several more general bugs in this area(irq migration in the
> >> > process context). For example,
> >> >
> >> > 1. Possibility of missing edge triggered irq.
> >> > 2. Reliable method of migrating level triggered irq in the process
> >> > context.
> >> >
> >> > We plan to look and close these in the near future.
> >>
> >> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC
> >> nx6325).
> >>
> >> _cpu_down() just hangs as though there were a deadlock in there, 100% of
> >> the time.
> >>
> >
> > Thanks, I dropped it.
>
> Hmm. It looks like Siddha sent the wrong version of the patch.
> The working tested version had an additional test to ensure
> the mask and unmask methods were implemented.
>
> i.e.
> +		if (irq_desc[irq].chip->mask)
> +			irq_desc[irq].chip->mask(irq);
> and
>
> +		if (irq_desc[irq].chip->unmask)
> +			irq_desc[irq].chip->unmask(irq);
> +
>
> Siddha think you can resend the correct version.
>
> Rafael. Think you can add those two ifs and see if you test bed box
> works?

Yes, that helps.

For reference I'm appending the complete patch that I have tested.

Greetings,
Rafael

---
 arch/x86_64/kernel/irq.c | 32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

Index: linux-2.6.22-rc5/arch/x86_64/kernel/irq.c
===
--- linux-2.6.22-rc5.orig/arch/x86_64/kernel/irq.c	2007-06-24 14:28:33.0 +0200
+++ linux-2.6.22-rc5/arch/x86_64/kernel/irq.c	2007-06-24 14:31:11.0 +0200
@@ -144,17 +144,43 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* interrupt's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
+		if (!irq_has_action(irq) ||
+		    cpus_equal(irq_desc[irq].affinity, map)) {
+			spin_unlock(&irq_desc[irq].lock);
+			continue;
+		}
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
-		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+		if (cpus_empty(mask)) {
+			break_affinity = 1;
 			mask = map;
 		}
+
+		if (irq_desc[irq].chip->mask)
+			irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		else if (!(warned++))
+			set_affinity = 0;
+
+		if (irq_desc[irq].chip->unmask)
+			irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
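For readers following along, the difference between the two patch versions can be sketched in user space. This is a minimal illustration only, not kernel code: `struct mock_irq_chip`, `migrate_irq()` and the logging helpers are hypothetical names invented here, and the sketch assumes nothing more than what the thread states, namely that a chip may leave its mask/unmask methods unimplemented and that set_affinity() should be bracketed by mask() and unmask().

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's irq_chip. Only the three
 * callbacks discussed in this thread are modelled; any of them may
 * be NULL when a chip does not implement the method. */
struct mock_irq_chip {
	void (*mask)(unsigned int irq);
	void (*unmask)(unsigned int irq);
	void (*set_affinity)(unsigned int irq, unsigned long cpus);
};

/* Records the order of callback invocations: 'M', 'S', 'U'. */
static char call_log[8];
static int calls_logged;

static void reset_log(void)
{
	calls_logged = 0;
	call_log[0] = '\0';
}

static void log_call(char c)
{
	if (calls_logged < (int)sizeof(call_log) - 1) {
		call_log[calls_logged++] = c;
		call_log[calls_logged] = '\0';
	}
}

static void mock_mask(unsigned int irq) { (void)irq; log_call('M'); }
static void mock_unmask(unsigned int irq) { (void)irq; log_call('U'); }
static void mock_set_affinity(unsigned int irq, unsigned long cpus)
{
	(void)irq;
	(void)cpus;
	log_call('S');
}

/* Mirrors the guarded sequence from the tested patch: every optional
 * method is checked before it is called, so a chip without mask or
 * unmask is skipped instead of dereferencing a NULL pointer.
 * Returns 1 if the affinity was changed, 0 if the chip cannot do it. */
static int migrate_irq(const struct mock_irq_chip *chip,
		       unsigned int irq, unsigned long cpus)
{
	int changed = 0;

	if (chip->mask)
		chip->mask(irq);

	if (chip->set_affinity) {
		chip->set_affinity(irq, cpus);
		changed = 1;
	}

	if (chip->unmask)
		chip->unmask(irq);

	return changed;
}
```

In the real fixup_irqs() the same sequence runs with irq_desc[irq].lock held, which this sketch leaves out.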
Re: Device hang when offlining a CPU due to IRQ misrouting
On Sunday, 24 June 2007 02:28, Siddha, Suresh B wrote:
> On Sun, Jun 24, 2007 at 01:54:52AM +0200, Rafael J. Wysocki wrote:
> > This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC
> > nx6325).
> >
> > _cpu_down() just hangs as though there were a deadlock in there, 100% of the
> > time.
>
> Does the patch at this URL work for you?
>
> http://marc.info/?l=linux-kernel&m=118228358826737&w=2

Yes, it does.

Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth
Re: Device hang when offlining a CPU due to IRQ misrouting
On Sat, Jun 23, 2007 at 06:45:05PM -0600, Eric W. Biederman wrote:
>
> Hmm. It looks like Siddha sent the wrong version of the patch.
> The working tested version had an additional test to ensure
> the mask and unmask methods were implemented.
>
> i.e.
> +		if (irq_desc[irq].chip->mask)
> +			irq_desc[irq].chip->mask(irq);
> and
>
> +		if (irq_desc[irq].chip->unmask)
> +			irq_desc[irq].chip->unmask(irq);
> +
>
> Siddha think you can resend the correct version.

Eric, In this version, I added the irq_has_action() check and hence removed
the check which ensures the presence for mask/unmask. My tests showed that it
was working fine. May be I am missing something.

>
> Rafael. Think you can add those two ifs and see if you test bed box
> works?
>
> I'm still not convinced that we can make fixup_irqs work in general
> but if we aren't going to yank it we should at least make it
> consistent with the rest of the code.

I agree.

thanks,
suresh
Re: Device hang when offlining a CPU due to IRQ misrouting
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote:
>
>> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
>> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
>> > >
>> > > This fixes the problem! Hurrah!
>> >
>> > Great! Andrew, please include the appended patch in -mm.
>> >
>> >
>> > Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
>> > From: Suresh Siddha <[EMAIL PROTECTED]>
>> >
>> > Force irq migration path during cpu offline, is not using proper
>> > locks and irq_chip mask/unmask routines. This will result in
>> > some races(especially the device generating the interrupt can see
>> > some inconsistent state, resulting in issues like stuck irq,..).
>> >
>> > Appended patch fixes the issue by taking proper lock and
>> > encapsulating irq_chip set_affinity() with a mask() before and an
>> > unmask() after.
>> >
>> > This fixes a MSI irq stuck issue reported by Darrick Wong.
>> >
>> > There are several more general bugs in this area(irq migration in the
>> > process context). For example,
>> >
>> > 1. Possibility of missing edge triggered irq.
>> > 2. Reliable method of migrating level triggered irq in the process context.
>> >
>> > We plan to look and close these in the near future.
>>
>> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC
>> nx6325).
>>
>> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
>> time.
>>
>
> Thanks, I dropped it.

Hmm. It looks like Siddha sent the wrong version of the patch.
The working tested version had an additional test to ensure
the mask and unmask methods were implemented.

i.e.
+		if (irq_desc[irq].chip->mask)
+			irq_desc[irq].chip->mask(irq);
and

+		if (irq_desc[irq].chip->unmask)
+			irq_desc[irq].chip->unmask(irq);
+

Siddha think you can resend the correct version.

Rafael. Think you can add those two ifs and see if you test bed box
works?

I'm still not convinced that we can make fixup_irqs work in general
but if we aren't going to yank it we should at least make it
consistent with the rest of the code.

Eric
Re: Device hang when offlining a CPU due to IRQ misrouting
On Sun, Jun 24, 2007 at 01:54:52AM +0200, Rafael J. Wysocki wrote:
> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).
>
> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
> time.

Does the patch at this URL work for you?

http://marc.info/?l=linux-kernel&m=118228358826737&w=2

thanks,
suresh
Re: Device hang when offlining a CPU due to IRQ misrouting
On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote:

> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> > >
> > > This fixes the problem! Hurrah!
> >
> > Great! Andrew, please include the appended patch in -mm.
> >
> >
> > Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
> > From: Suresh Siddha <[EMAIL PROTECTED]>
> >
> > Force irq migration path during cpu offline, is not using proper
> > locks and irq_chip mask/unmask routines. This will result in
> > some races(especially the device generating the interrupt can see
> > some inconsistent state, resulting in issues like stuck irq,..).
> >
> > Appended patch fixes the issue by taking proper lock and
> > encapsulating irq_chip set_affinity() with a mask() before and an
> > unmask() after.
> >
> > This fixes a MSI irq stuck issue reported by Darrick Wong.
> >
> > There are several more general bugs in this area(irq migration in the
> > process context). For example,
> >
> > 1. Possibility of missing edge triggered irq.
> > 2. Reliable method of migrating level triggered irq in the process context.
> >
> > We plan to look and close these in the near future.
>
> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).
>
> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
> time.
>

Thanks, I dropped it.
Re: Device hang when offlining a CPU due to IRQ misrouting
On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
> On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> >
> > This fixes the problem! Hurrah!
>
> Great! Andrew, please include the appended patch in -mm.
>
>
> Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
> From: Suresh Siddha <[EMAIL PROTECTED]>
>
> Force irq migration path during cpu offline, is not using proper
> locks and irq_chip mask/unmask routines. This will result in
> some races(especially the device generating the interrupt can see
> some inconsistent state, resulting in issues like stuck irq,..).
>
> Appended patch fixes the issue by taking proper lock and
> encapsulating irq_chip set_affinity() with a mask() before and an
> unmask() after.
>
> This fixes a MSI irq stuck issue reported by Darrick Wong.
>
> There are several more general bugs in this area(irq migration in the
> process context). For example,
>
> 1. Possibility of missing edge triggered irq.
> 2. Reliable method of migrating level triggered irq in the process context.
>
> We plan to look and close these in the near future.

This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).

_cpu_down() just hangs as though there were a deadlock in there, 100% of the
time.

Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
>
> This fixes the problem! Hurrah!

Great! Andrew, please include the appended patch in -mm.


Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
From: Suresh Siddha <[EMAIL PROTECTED]>

Force irq migration path during cpu offline, is not using proper
locks and irq_chip mask/unmask routines. This will result in
some races(especially the device generating the interrupt can see
some inconsistent state, resulting in issues like stuck irq,..).

Appended patch fixes the issue by taking proper lock and
encapsulating irq_chip set_affinity() with a mask() before and an
unmask() after.

This fixes a MSI irq stuck issue reported by Darrick Wong.

There are several more general bugs in this area(irq migration in the
process context). For example,

1. Possibility of missing edge triggered irq.
2. Reliable method of migrating level triggered irq in the process context.

We plan to look and close these in the near future.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>
Reported-by: Darrick Wong <[EMAIL PROTECTED]>
---

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..55b2733 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,41 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* interrupt's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
+		if (!irq_has_action(irq) ||
+		    cpus_equal(irq_desc[irq].affinity, map)) {
+			spin_unlock(&irq_desc[irq].lock);
+			continue;
+		}
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
-		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+		if (cpus_empty(mask)) {
+			break_affinity = 1;
 			mask = map;
 		}
+
+		irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		else if (!(warned++))
+			set_affinity = 0;
+
+		irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 12:59:27PM -0700, Siddha, Suresh B wrote:

> hmm.. Please try this instead. This is intended only for debug. Based on your
> test results, we can comeup with a more decent fix.

This fixes the problem! Hurrah!

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 12:06:37PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote:
> > Anyhow, Darrick there is a general bug in this area, can you try this and
> > see if it helps?
>
> Er... that instantly locked up the system.

hmm.. Please try this instead. This is intended only for debug. Based on your
test results, we can comeup with a more decent fix.

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..3997679 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,37 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* irq's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
 		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+			break_affinity = 1;
 			mask = map;
 		}
+
+		if (irq_desc[irq].chip->mask)
+			irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
 		else if (irq_desc[irq].action && !(warned++))
+			set_affinity = 0;
+
+		if (irq_desc[irq].chip->unmask)
+			irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote:

> Anyhow, Darrick there is a general bug in this area, can you try this and
> see if it helps?

Er... that instantly locked up the system.

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes: > On Tue, Jun 19, 2007 at 11:54:45AM -0600, Eric W. Biederman wrote: >> "Darrick J. Wong" <[EMAIL PROTECTED]> writes: >> >> > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote: >> > >> >> > >> >> > [ 256.298787] irq=4341 affinity=d >> >> > >> >> >> >> And just to make sure, at this point, your MSI irq 4341 affinity >> >> (/proc/irq/4341/smp_affinity) still points to '2'? >> > >> > Actually, it's 0xD. From the kernel's perspective the mask has been >> > updated (and I even stuck a printk into set_msi_irq_affinity to verify >> > that the writes are happening) but ... the hardware doesn't seem to >> > reflect this. I also tried putting read_msi_msg right afterwards to >> > compare contents, though it complained about all the MSIs _except_ for >> > 4341. (Of course, I could just be way off on the effectiveness of >> > that.) >> >> The fact that MSI interrupts are having problems is odd. It is possible >> that we still have a bug in there somewhere but msi interrupts should >> be safe to migrate outside of irq context (no known hardware bugs). >> As we can actually synchronize with the irq source and eliminate all >> of the migration races. >> >> The non-msi case requires hitting a hardware race that is rare enough >> you should not normally have problems. > > Yep. But Darrick's seems to say, problem happens consistently. > > Anyhow, Darrick there is a general bug in this area, can you try this and > see if it helps? There are several general bugs in this area. But yes your patch should help things, especially for MSI where masking the irq before migration is required. Adding locking the proper locking and masking should make things quite a bit more how set_affinity is expected to be called. I just gave up on fixing these things because we can't eliminate the races, so the real problem is the existence of this code path with it's unsupportable semantics in the first place. 
> diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c > index 3eaceac..a0e11c9 100644 > --- a/arch/x86_64/kernel/irq.c > +++ b/arch/x86_64/kernel/irq.c > @@ -144,17 +144,35 @@ void fixup_irqs(cpumask_t map) > > for (irq = 0; irq < NR_IRQS; irq++) { > cpumask_t mask; > + int break_affinity = 0; > + int set_affinity = 1; > + > if (irq == 2) > continue; > > + /* irq's are disabled at this point */ > + spin_lock(_desc[irq].lock); > + > cpus_and(mask, irq_desc[irq].affinity, map); > if (any_online_cpu(mask) == NR_CPUS) { > - printk("Breaking affinity for irq %i\n", irq); > + break_affinity = 1; > mask = map; > } We should really express the "any_online_cpu(mask) == NR_CPUS" test as: "cpus_empty(mask)" it would be much clearer. Further we should skip the migration if "cpus_equal(mask, irq_desc[irq].affinity)" or "!irq_has_action(irq)" because no one has called request_irq. > + irq_desc[irq].chip->mask(irq); > + > if (irq_desc[irq].chip->set_affinity) > irq_desc[irq].chip->set_affinity(irq, mask); > else if (irq_desc[irq].action && !(warned++)) > + set_affinity = 0; > + > + irq_desc[irq].chip->unmask(irq); > + > + spin_unlock(_desc[irq].lock); > + > + if (break_affinity && set_affinity) > + printk("Broke affinity for irq %i\n", irq); > + else if (!set_affinity) > printk("Cannot set affinity for irq %i\n", irq); > } > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
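The two review suggestions in this message (test for an empty intersection, and skip interrupts that need no migration) can be sketched as a small user-space decision function. This is an illustration under stated assumptions, not kernel code: a plain 64-bit word stands in for cpumask_t, and `cpumask64`, `enum fixup_decision` and `decide_fixup()` are hypothetical names introduced here. The skip test follows the wording in this message, comparing the intersection with the current affinity; the patch that was later posted compares the affinity with the map instead.

```c
#include <stdint.h>

/* Hypothetical simplification: bit n set means CPU n is in the mask. */
typedef uint64_t cpumask64;

enum fixup_decision {
	FIXUP_SKIP,           /* nothing to migrate */
	FIXUP_MIGRATE,        /* retarget to the still-online subset */
	FIXUP_BREAK_AFFINITY  /* no requested CPU remains; use all online CPUs */
};

/* Decide what to do with one irq when the set of online CPUs shrinks.
 * has_action models irq_has_action(): non-zero once request_irq()
 * has installed a handler. On MIGRATE or BREAK_AFFINITY the new
 * target mask is stored in *target; on SKIP it is left untouched. */
static enum fixup_decision decide_fixup(int has_action, cpumask64 affinity,
					cpumask64 online, cpumask64 *target)
{
	cpumask64 mask = affinity & online;	/* models cpus_and() */

	if (!has_action)
		return FIXUP_SKIP;	/* no handler was ever requested */

	if (mask == affinity)
		return FIXUP_SKIP;	/* models cpus_equal(mask, affinity):
					 * every requested CPU is still online */

	if (mask == 0) {		/* models cpus_empty(mask) */
		*target = online;
		return FIXUP_BREAK_AFFINITY;
	}

	*target = mask;
	return FIXUP_MIGRATE;
}
```

With the affinity 0xD reported earlier in the thread and a hypothetical online map of 0x7, the function asks for a migration to 0x5; an affinity of 0x8 against the same map has no online CPU left, so the affinity is broken and the full online map is used.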
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 11:54:45AM -0600, Eric W. Biederman wrote:
> "Darrick J. Wong" <[EMAIL PROTECTED]> writes:
>
> > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:
> >> > <call to set_affinity>
> >> > [ 256.298787] irq=4341 affinity=d
> >> > <ethernet on irq 4341 stops working>
> >>
> >> And just to make sure, at this point, your MSI irq 4341 affinity
> >> (/proc/irq/4341/smp_affinity) still points to '2'?
> >
> > Actually, it's 0xD.  From the kernel's perspective the mask has been
> > updated (and I even stuck a printk into set_msi_irq_affinity to verify
> > that the writes are happening) but ... the hardware doesn't seem to
> > reflect this.  I also tried putting read_msi_msg right afterwards to
> > compare contents, though it complained about all the MSIs _except_ for
> > 4341.  (Of course, I could just be way off on the effectiveness of
> > that.)
>
> The fact that MSI interrupts are having problems is odd.  It is possible
> that we still have a bug in there somewhere but msi interrupts should
> be safe to migrate outside of irq context (no known hardware bugs).
> As we can actually synchronize with the irq source and eliminate all
> of the migration races.
>
> The non-msi case requires hitting a hardware race that is rare enough
> you should not normally have problems.

Yep. But Darrick's seems to say, problem happens consistently.

Anyhow, Darrick there is a general bug in this area, can you try this and
see if it helps?

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..a0e11c9 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,35 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* irq's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
 		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+			break_affinity = 1;
 			mask = map;
 		}
+
+		irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
 		else if (irq_desc[irq].action && !(warned++))
+			set_affinity = 0;
+
+		irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 
Re: Device hang when offlining a CPU due to IRQ misrouting
"Darrick J. Wong" <[EMAIL PROTECTED]> writes:

> On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:
>> > <call to set_affinity>
>> > [ 256.298787] irq=4341 affinity=d
>> > <ethernet on irq 4341 stops working>
>>
>> And just to make sure, at this point, your MSI irq 4341 affinity
>> (/proc/irq/4341/smp_affinity) still points to '2'?
>
> Actually, it's 0xD.  From the kernel's perspective the mask has been
> updated (and I even stuck a printk into set_msi_irq_affinity to verify
> that the writes are happening) but ... the hardware doesn't seem to
> reflect this.  I also tried putting read_msi_msg right afterwards to
> compare contents, though it complained about all the MSIs _except_ for
> 4341.  (Of course, I could just be way off on the effectiveness of
> that.)

The fact that MSI interrupts are having problems is odd.  It is possible
that we still have a bug in there somewhere but msi interrupts should
be safe to migrate outside of irq context (no known hardware bugs).
As we can actually synchronize with the irq source and eliminate all
of the migration races.

The non-msi case requires hitting a hardware race that is rare enough
you should not normally have problems.

Eric
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote:
> Anyhow, Darrick there is a general bug in this area, can you try this and
> see if it helps?

Er... that instantly locked up the system.

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 12:06:37PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 19, 2007 at 11:00:03AM -0700, Siddha, Suresh B wrote:
> > Anyhow, Darrick there is a general bug in this area, can you try this and
> > see if it helps?
>
> Er... that instantly locked up the system.

hmm.. Please try this instead. This is intended only for debug. Based on
your test results, we can come up with a more decent fix.

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..3997679 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,37 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* irq's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
 		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+			break_affinity = 1;
 			mask = map;
 		}
+
+		if (irq_desc[irq].chip->mask)
+			irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
 		else if (irq_desc[irq].action && !(warned++))
+			set_affinity = 0;
+
+		if (irq_desc[irq].chip->unmask)
+			irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 12:59:27PM -0700, Siddha, Suresh B wrote:
> hmm.. Please try this instead. This is intended only for debug. Based on
> your test results, we can come up with a more decent fix.

This fixes the problem!  Hurrah!

--D
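The debug patch that worked differs from the one that locked up only in the NULL checks around chip->mask and chip->unmask. A plausible reading is that at least one irq_chip on the test machine leaves one of those hooks unset, so the unguarded call went through a NULL pointer. The sketch below is a user-space illustration of that guarded pattern, not kernel code: struct toy_irq_chip, toy_retarget_irq(), toy_record_affinity() and toy_last_mask are hypothetical names, and the CPU mask is a plain integer rather than a cpumask_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct irq_chip: any hook may be NULL. */
struct toy_irq_chip {
    void (*mask)(unsigned int irq);
    void (*unmask)(unsigned int irq);
    void (*set_affinity)(unsigned int irq, unsigned int cpumask);
};

/* Last mask handed to toy_record_affinity(), so a caller can inspect it. */
static unsigned int toy_last_mask;

/* Example set_affinity hook that only records what it was asked to do. */
static void toy_record_affinity(unsigned int irq, unsigned int cpumask)
{
    (void)irq;
    toy_last_mask = cpumask;
}

/*
 * Retarget one irq, bracketing set_affinity() with mask()/unmask() only
 * when the chip provides them. Returns 1 if the affinity was set and 0 if
 * the chip has no set_affinity hook.
 */
static int toy_retarget_irq(const struct toy_irq_chip *chip,
                            unsigned int irq, unsigned int cpumask)
{
    int done = 0;

    assert(chip != NULL);

    if (chip->mask)
        chip->mask(irq);

    if (chip->set_affinity) {
        chip->set_affinity(irq, cpumask);
        done = 1;
    }

    if (chip->unmask)
        chip->unmask(irq);

    return done;
}
```

A chip with every hook left NULL is simply reported as "cannot set affinity" (return value 0) instead of crashing, and a chip that only implements set_affinity is still retargeted.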
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
> This fixes the problem!  Hurrah!

Great! Andrew, please include the appended patch in -mm.

Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
From: Suresh Siddha <[EMAIL PROTECTED]>

The forced irq migration path during cpu offline is not using proper
locks and irq_chip mask/unmask routines. This will result in
some races (especially, the device generating the interrupt can see
some inconsistent state, resulting in issues like a stuck irq).

Appended patch fixes the issue by taking the proper lock and
encapsulating irq_chip set_affinity() with a mask() before and an
unmask() after.

This fixes an MSI irq stuck issue reported by Darrick Wong.

There are several more general bugs in this area (irq migration in the
process context). For example,

1. Possibility of missing an edge triggered irq.
2. Reliable method of migrating a level triggered irq in the process
   context.

We plan to look into and close these in the near future.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>
Reported-by: Darrick Wong <[EMAIL PROTECTED]>
---

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..55b2733 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,41 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* interrupt's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
+		if (!irq_has_action(irq) ||
+		    cpus_equal(irq_desc[irq].affinity, map)) {
+			spin_unlock(&irq_desc[irq].lock);
+			continue;
+		}
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
-		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+		if (cpus_empty(mask)) {
+			break_affinity = 1;
 			mask = map;
 		}
+
+		irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		else if (!(warned++))
+			set_affinity = 0;
+
+		irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 
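The early "continue" in this patch and the skip condition proposed in Eric's review are close but not identical, which is easy to miss in diff form. The following is a rough user-space sketch of both tests, assuming CPU masks fit in a plain unsigned int (bit n = CPU n); toy_skip_as_posted() and toy_skip_as_reviewed() are made-up names for illustration and do not exist in the kernel.

```c
#include <assert.h>

/*
 * Skip test as it appears in the posted patch: nothing to do when no
 * handler is registered, or when the irq's affinity is exactly the set of
 * CPUs that remain online.
 */
static int toy_skip_as_posted(int has_action, unsigned int affinity,
                              unsigned int online_map)
{
    assert(online_map != 0);    /* at least one CPU must stay online */
    return !has_action || affinity == online_map;
}

/*
 * Skip test as worded in the review: nothing to do when no handler is
 * registered, or when masking the affinity with the online map leaves it
 * unchanged, i.e. every CPU the irq may target is still online.
 */
static int toy_skip_as_reviewed(int has_action, unsigned int affinity,
                                unsigned int online_map)
{
    assert(online_map != 0);    /* at least one CPU must stay online */
    return !has_action || (affinity & online_map) == affinity;
}
```

With an online map of 0xd, an irq bound only to CPU 0 (affinity 0x1) is skipped by the reviewed variant but still retargeted by the posted one, while an irq bound to the departing CPU 1 (affinity 0x2) is retargeted by both.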
Re: Device hang when offlining a CPU due to IRQ misrouting
On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:
> > <call to set_affinity>
> > [ 256.298787] irq=4341 affinity=d
> > <ethernet on irq 4341 stops working>
>
> And just to make sure, at this point, your MSI irq 4341 affinity
> (/proc/irq/4341/smp_affinity) still points to '2'?

Actually, it's 0xD.  From the kernel's perspective the mask has been
updated (and I even stuck a printk into set_msi_irq_affinity to verify
that the writes are happening) but ... the hardware doesn't seem to
reflect this.  I also tried putting read_msi_msg right afterwards to
compare contents, though it complained about all the MSIs _except_ for
4341.  (Of course, I could just be way off on the effectiveness of
that.)

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Mon, Jun 18, 2007 at 03:38:20PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 07, 2007 at 05:57:26PM -0700, Siddha, Suresh B wrote:
>
> > As you have the failing system, you need to do more detective work and
> > help me out. Can you try this debug patch and send across the dmesg after
> > the bug happens and also can you try different compiler to see if
> > something changes..
>
> Hrm, I just updated to -rc5.  Interrupts being handled by the IOAPIC
> don't suffer from this problem, but MSI interrupts are still affected.
> I added a few printks to the kernel to figure out what IRQ affinity
> masks were being passed around and saw this:
>
> [ 256.298773] Breaking affinity for irq 4341
> [ 256.298774] irq=4341 affinity=2 mask=d
> <call to set_affinity>
> [ 256.298787] irq=4341 affinity=d
> <ethernet on irq 4341 stops working>

And just to make sure, at this point, your MSI irq 4341 affinity
(/proc/irq/4341/smp_affinity) still points to '2'?

> I'll keep digging, but at least it appears that the problem has been
> shrunk down to something in the MSI code.

thanks,
suresh
Re: Device hang when offlining a CPU due to IRQ misrouting
On Thu, Jun 07, 2007 at 05:57:26PM -0700, Siddha, Suresh B wrote:

> As you have the failing system, you need to do more detective work and
> help me out. Can you try this debug patch and send across the dmesg after the
> bug happens and also can you try different compiler to see if something
> changes..

Hrm, I just updated to -rc5.  Interrupts being handled by the IOAPIC
don't suffer from this problem, but MSI interrupts are still affected.
I added a few printks to the kernel to figure out what IRQ affinity
masks were being passed around and saw this:

[ 256.298773] Breaking affinity for irq 4341
[ 256.298774] irq=4341 affinity=2 mask=d
<call to set_affinity>
[ 256.298787] irq=4341 affinity=d
<ethernet on irq 4341 stops working>

I'll keep digging, but at least it appears that the problem has been
shrunk down to something in the MSI code.

--D
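The affinity=2 / mask=d values in this log match the mask selection in fixup_irqs() if one assumes a four-CPU machine on which CPU 1 is going offline, so the online map is 0xd. That assumption is inferred from the printed values, not stated in the thread. The sketch below replays the arithmetic in user space; toy_cpumask and pick_target_mask() are hypothetical names, with one bit per CPU standing in for the kernel's cpumask_t.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for cpumask_t: one bit per CPU, at most 32 CPUs. */
typedef unsigned int toy_cpumask;

/*
 * Mirror of the mask selection in fixup_irqs(): keep the CPUs the irq is
 * already allowed on if any of them stay online; otherwise fall back to
 * the whole online map and report that affinity had to be broken.
 */
static toy_cpumask pick_target_mask(toy_cpumask affinity,
                                    toy_cpumask online_map,
                                    int *broke_affinity)
{
    toy_cpumask mask = affinity & online_map;   /* cpus_and() */

    assert(broke_affinity != NULL);
    *broke_affinity = 0;

    if (mask == 0) {                            /* no allowed CPU left online */
        *broke_affinity = 1;
        mask = online_map;
    }
    return mask;
}
```

Calling pick_target_mask(0x2, 0xd, &broke) returns 0xd with broke set to 1, which corresponds to the "Breaking affinity for irq 4341" line followed by mask=d. An irq allowed on CPUs 0 and 1 (0x3) would instead keep CPU 0 (0x1) without breaking affinity.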
Re: Device hang when offlining a CPU due to IRQ misrouting
On Wed, Jun 06, 2007 at 04:16:42PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote:
>
> > Weird. Then the bug can only happen if for some reason, "mask = map"
> > didn't happen in fixup_irqs(). Can you send us the disassembly of the
> > fixup_irqs()?
>
> Attached.

hmm.. Darrick, can't find anything wrong in there.

I am very much puzzled and the main thing I am confused about is, that
how come "/proc/irq/<irq#-hung>/smp_affinity" is still pointing at the
old offlined cpu, while calls to set_affinity() with cpu_online_map mask
in fixup_irqs() don't show any failure..

As you have the failing system, you need to do more detective work and
help me out. Can you try this debug patch and send across the dmesg after
the bug happens and also can you try different compiler to see if
something changes..

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..fc2a576 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -152,9 +152,11 @@ void fixup_irqs(cpumask_t map)
 			printk("Breaking affinity for irq %i\n", irq);
 			mask = map;
 		}
-		if (irq_desc[irq].chip->set_affinity)
+		if (irq_desc[irq].chip->set_affinity) {
+			printk("calling set affinity for %i, with mask %lx\n",
+				irq, cpus_addr(mask)[0]);
 			irq_desc[irq].chip->set_affinity(irq, mask);
-		else if (irq_desc[irq].action && !(warned++))
+		} else if (irq_desc[irq].action && !(warned++))
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 
Re: Device hang when offlining a CPU due to IRQ misrouting
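On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote:
> Weird. Then the bug can only happen if for some reason, "mask = map"
> didn't happen in fixup_irqs(). Can you send us the disassembly of the
> fixup_irqs()?

Attached.

--D

(gdb) disassemble fixup_irqs
Dump of assembler code for function fixup_irqs:
0x8020bf50 <fixup_irqs+0>:      push   %rbp
0x8020bf51 <fixup_irqs+1>:      mov    %rsp,%rbp
0x8020bf54 <fixup_irqs+4>:      push   %r13
0x8020bf56 <fixup_irqs+6>:      xor    %r13d,%r13d
0x8020bf59 <fixup_irqs+9>:      push   %r12
0x8020bf5b <fixup_irqs+11>:     push   %rbx
0x8020bf5c <fixup_irqs+12>:     sub    $0x28,%rsp
0x8020bf60 <fixup_irqs+16>:     mov    %rdi,0xffc0(%rbp)
0x8020bf64 <fixup_irqs+20>:     mov    %rsi,0xffc8(%rbp)
0x8020bf68 <fixup_irqs+24>:     jmp    0x8020bf73 <fixup_irqs+35>
0x8020bf6a <fixup_irqs+26>:     inc    %r13d
0x8020bf6d <fixup_irqs+29>:     cmp    $0x2,%r13d
0x8020bf71 <fixup_irqs+33>:     je     0x8020bf6a <fixup_irqs+26>
0x8020bf73 <fixup_irqs+35>:     mov    %r13d,%r12d
0x8020bf76 <fixup_irqs+38>:     lea    0xffd0(%rbp),%rbx
0x8020bf7a <fixup_irqs+42>:     lea    0xffc0(%rbp),%rdx
0x8020bf7e <fixup_irqs+46>:     shl    $0x8,%r12
0x8020bf82 <fixup_irqs+50>:     mov    $0x80,%ecx
0x8020bf87 <fixup_irqs+55>:     lea    0x805505f8(%r12),%rsi
0x8020bf8f <fixup_irqs+63>:     mov    %rbx,%rdi
0x8020bf92 <fixup_irqs+66>:     callq  0x802fb606 <__bitmap_and>
0x8020bf97 <fixup_irqs+71>:     mov    %rbx,%rdi
0x8020bf9a <fixup_irqs+74>:     callq  0x802fc6ad <__any_online_cpu>
0x8020bf9f <fixup_irqs+79>:     add    $0xff80,%eax
0x8020bfa2 <fixup_irqs+82>:     jne    0x8020bfc5 <fixup_irqs+117>
0x8020bfa4 <fixup_irqs+84>:     mov    %r13d,%esi
0x8020bfa7 <fixup_irqs+87>:     mov    $0x804a52b0,%rdi
0x8020bfae <fixup_irqs+94>:     xor    %eax,%eax
0x8020bfb0 <fixup_irqs+96>:     callq  0x80233d28 <printk>
0x8020bfb5 <fixup_irqs+101>:    mov    0xffc0(%rbp),%rax
0x8020bfb9 <fixup_irqs+105>:    mov    %rax,0xffd0(%rbp)
0x8020bfbd <fixup_irqs+109>:    mov    0xffc8(%rbp),%rax
0x8020bfc1 <fixup_irqs+113>:    mov    %rax,0xffd8(%rbp)
0x8020bfc5 <fixup_irqs+117>:    mov    0x80550588(%r12),%rax
0x8020bfcd <fixup_irqs+125>:    mov    0x58(%rax),%rax
0x8020bfd1 <fixup_irqs+129>:    test   %rax,%rax
0x8020bfd4 <fixup_irqs+132>:    je     0x8020bfe5 <fixup_irqs+149>
0x8020bfd6 <fixup_irqs+134>:    mov    0xffd0(%rbp),%rsi
0x8020bfda <fixup_irqs+138>:    mov    0xffd8(%rbp),%rdx
0x8020bfde <fixup_irqs+142>:    mov    %r13d,%edi
0x8020bfe1 <fixup_irqs+145>:    callq  *%rax
0x8020bfe3 <fixup_irqs+147>:    jmp    0x8020c013 <fixup_irqs+195>
0x8020bfe5 <fixup_irqs+149>:    cmpq   $0x0,0x805505a8(%r12)
0x8020bfee <fixup_irqs+158>:    je     0x8020c013 <fixup_irqs+195>
0x8020bff0 <fixup_irqs+160>:    mov    5181486(%rip),%eax        # 0x806fd024
0x8020bff6 <fixup_irqs+166>:    inc    %eax
0x8020bff8 <fixup_irqs+168>:    mov    %eax,5181478(%rip)        # 0x806fd024
0x8020bffe <fixup_irqs+174>:    dec    %eax
0x8020c000 <fixup_irqs+176>:    jne    0x8020c013 <fixup_irqs+195>
0x8020c002 <fixup_irqs+178>:    mov    %r13d,%esi
0x8020c005 <fixup_irqs+181>:    mov    $0x804a52ce,%rdi
0x8020c00c <fixup_irqs+188>:    xor    %eax,%eax
0x8020c00e <fixup_irqs+190>:    callq  0x80233d28 <printk>
0x8020c013 <fixup_irqs+195>:    lea    0x1(%r13),%eax
0x8020c017 <fixup_irqs+199>:    cmp    $0x10ff,%eax
0x8020c01c <fixup_irqs+204>:    jbe    0x8020bf6a <fixup_irqs+26>
0x8020c022 <fixup_irqs+210>:    callq  0x8024e46e
0x8020c027 <fixup_irqs+215>:    sti
0x8020c028 <fixup_irqs+216>:    mov    $0x418958,%edi
0x8020c02d <fixup_irqs+221>:    callq  0x803018cf <__const_udelay>
0x8020c032 <fixup_irqs+226>:    cli
0x8020c033 <fixup_irqs+227>:    callq  0x8024cf31
0x8020c038 <fixup_irqs+232>:    add    $0x28,%rsp
0x8020c03c <fixup_irqs+236>:    pop    %rbx
0x8020c03d <fixup_irqs+237>:    pop    %r12
0x8020c03f <fixup_irqs+239>:    pop    %r13
0x8020c041 <fixup_irqs+241>:    leaveq
0x8020c042 <fixup_irqs+242>:    retq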
Re: Device hang when offlining a CPU due to IRQ misrouting
On Wed, Jun 06, 2007 at 11:58:29AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 06:37:59PM -0700, Siddha, Suresh B wrote:
> > On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
> > >
> > > > Can you send us your system's dmesg as well as output of
> > > > /proc/interrupts?
> > >
> > > http://sweaglesw.net/~djwong/docs/dmesg
> > > http://sweaglesw.net/~djwong/docs/interrupts
> >
> > Didn't find anything wrong in that information. Can you try this
> > appended debug patch and see if you see this error msg in dmesg, when you
> > hit the bug? Thanks.
>
> I don't see that message.

Weird. Then the bug can only happen if for some reason, "mask = map"
didn't happen in fixup_irqs(). Can you send us the disassembly of the
fixup_irqs()?

thanks,
suresh
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 06:37:59PM -0700, Siddha, Suresh B wrote:
> On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
> >
> > > Can you send us your system's dmesg as well as output of /proc/interrupts?
> >
> > http://sweaglesw.net/~djwong/docs/dmesg
> > http://sweaglesw.net/~djwong/docs/interrupts
>
> Didn't find anything wrong in that information. Can you try this
> appended debug patch and see if you see this error msg in dmesg, when you
> hit the bug? Thanks.

I don't see that message.

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote:
> Weird. Then the bug can only happen if for some reason, mask = map didn't
> happen in fixup_irqs(). Can you send us the disassembly of the fixup_irqs()?

Attached.

--D

(gdb) disassemble fixup_irqs
Dump of assembler code for function fixup_irqs:
0x8020bf50 <fixup_irqs+0>:	push   %rbp
0x8020bf51 <fixup_irqs+1>:	mov    %rsp,%rbp
0x8020bf54 <fixup_irqs+4>:	push   %r13
0x8020bf56 <fixup_irqs+6>:	xor    %r13d,%r13d
0x8020bf59 <fixup_irqs+9>:	push   %r12
0x8020bf5b <fixup_irqs+11>:	push   %rbx
0x8020bf5c <fixup_irqs+12>:	sub    $0x28,%rsp
0x8020bf60 <fixup_irqs+16>:	mov    %rdi,0xffc0(%rbp)
0x8020bf64 <fixup_irqs+20>:	mov    %rsi,0xffc8(%rbp)
0x8020bf68 <fixup_irqs+24>:	jmp    0x8020bf73 <fixup_irqs+35>
0x8020bf6a <fixup_irqs+26>:	inc    %r13d
0x8020bf6d <fixup_irqs+29>:	cmp    $0x2,%r13d
0x8020bf71 <fixup_irqs+33>:	je     0x8020bf6a <fixup_irqs+26>
0x8020bf73 <fixup_irqs+35>:	mov    %r13d,%r12d
0x8020bf76 <fixup_irqs+38>:	lea    0xffd0(%rbp),%rbx
0x8020bf7a <fixup_irqs+42>:	lea    0xffc0(%rbp),%rdx
0x8020bf7e <fixup_irqs+46>:	shl    $0x8,%r12
0x8020bf82 <fixup_irqs+50>:	mov    $0x80,%ecx
0x8020bf87 <fixup_irqs+55>:	lea    0x805505f8(%r12),%rsi
0x8020bf8f <fixup_irqs+63>:	mov    %rbx,%rdi
0x8020bf92 <fixup_irqs+66>:	callq  0x802fb606 <__bitmap_and>
0x8020bf97 <fixup_irqs+71>:	mov    %rbx,%rdi
0x8020bf9a <fixup_irqs+74>:	callq  0x802fc6ad <__any_online_cpu>
0x8020bf9f <fixup_irqs+79>:	add    $0xff80,%eax
0x8020bfa2 <fixup_irqs+82>:	jne    0x8020bfc5 <fixup_irqs+117>
0x8020bfa4 <fixup_irqs+84>:	mov    %r13d,%esi
0x8020bfa7 <fixup_irqs+87>:	mov    $0x804a52b0,%rdi
0x8020bfae <fixup_irqs+94>:	xor    %eax,%eax
0x8020bfb0 <fixup_irqs+96>:	callq  0x80233d28 <printk>
0x8020bfb5 <fixup_irqs+101>:	mov    0xffc0(%rbp),%rax
0x8020bfb9 <fixup_irqs+105>:	mov    %rax,0xffd0(%rbp)
0x8020bfbd <fixup_irqs+109>:	mov    0xffc8(%rbp),%rax
0x8020bfc1 <fixup_irqs+113>:	mov    %rax,0xffd8(%rbp)
0x8020bfc5 <fixup_irqs+117>:	mov    0x80550588(%r12),%rax
0x8020bfcd <fixup_irqs+125>:	mov    0x58(%rax),%rax
0x8020bfd1 <fixup_irqs+129>:	test   %rax,%rax
0x8020bfd4 <fixup_irqs+132>:	je     0x8020bfe5 <fixup_irqs+149>
0x8020bfd6 <fixup_irqs+134>:	mov    0xffd0(%rbp),%rsi
0x8020bfda <fixup_irqs+138>:	mov    0xffd8(%rbp),%rdx
0x8020bfde <fixup_irqs+142>:	mov    %r13d,%edi
0x8020bfe1 <fixup_irqs+145>:	callq  *%rax
0x8020bfe3 <fixup_irqs+147>:	jmp    0x8020c013 <fixup_irqs+195>
0x8020bfe5 <fixup_irqs+149>:	cmpq   $0x0,0x805505a8(%r12)
0x8020bfee <fixup_irqs+158>:	je     0x8020c013 <fixup_irqs+195>
0x8020bff0 <fixup_irqs+160>:	mov    5181486(%rip),%eax        # 0x806fd024 <warned.11720>
0x8020bff6 <fixup_irqs+166>:	inc    %eax
0x8020bff8 <fixup_irqs+168>:	mov    %eax,5181478(%rip)        # 0x806fd024 <warned.11720>
0x8020bffe <fixup_irqs+174>:	dec    %eax
0x8020c000 <fixup_irqs+176>:	jne    0x8020c013 <fixup_irqs+195>
0x8020c002 <fixup_irqs+178>:	mov    %r13d,%esi
0x8020c005 <fixup_irqs+181>:	mov    $0x804a52ce,%rdi
0x8020c00c <fixup_irqs+188>:	xor    %eax,%eax
0x8020c00e <fixup_irqs+190>:	callq  0x80233d28 <printk>
0x8020c013 <fixup_irqs+195>:	lea    0x1(%r13),%eax
0x8020c017 <fixup_irqs+199>:	cmp    $0x10ff,%eax
0x8020c01c <fixup_irqs+204>:	jbe    0x8020bf6a <fixup_irqs+26>
0x8020c022 <fixup_irqs+210>:	callq  0x8024e46e <trace_hardirqs_on>
0x8020c027 <fixup_irqs+215>:	sti
0x8020c028 <fixup_irqs+216>:	mov    $0x418958,%edi
0x8020c02d <fixup_irqs+221>:	callq  0x803018cf <__const_udelay>
0x8020c032 <fixup_irqs+226>:	cli
0x8020c033 <fixup_irqs+227>:	callq  0x8024cf31 <trace_hardirqs_off>
0x8020c038 <fixup_irqs+232>:	add    $0x28,%rsp
0x8020c03c <fixup_irqs+236>:	pop    %rbx
0x8020c03d <fixup_irqs+237>:	pop    %r12
0x8020c03f <fixup_irqs+239>:	pop    %r13
0x8020c041 <fixup_irqs+241>:	leaveq
0x8020c042 <fixup_irqs+242>:	retq
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 04:57:07PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
>
> > Can you send us your system's dmesg as well as output of /proc/interrupts?
>
> http://sweaglesw.net/~djwong/docs/dmesg
> http://sweaglesw.net/~djwong/docs/interrupts

Didn't find anything wrong in that information. Can you try this
appended debug patch and see if you see this error msg in dmesg, when you
hit the bug? Thanks.

diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index d8bfe31..3409c1f 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -720,10 +720,13 @@ static int assign_irq_vector(int irq, cpumask_t mask)
 {
 	int err;
 	unsigned long flags;
+	int cpu = smp_processor_id();
 
 	spin_lock_irqsave(&vector_lock, flags);
 	err = __assign_irq_vector(irq, mask);
 	spin_unlock_irqrestore(&vector_lock, flags);
+	if (err && !cpu_isset(cpu, cpu_online_map))
+		printk("assigning irq to a vector failed : %d\n", err);
 	return err;
 }
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 02:14:51PM -0700, Siddha, Suresh B wrote:
> Can you send us your system's dmesg as well as output of /proc/interrupts?

http://sweaglesw.net/~djwong/docs/dmesg
http://sweaglesw.net/~djwong/docs/interrupts

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 01:09:54PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 11:40:15AM -0700, Siddha, Suresh B wrote:
>
> > Does this problem happen only under certain stress or something simple, like
> >
> > boot the kernel
> > echo 2 > /proc/irq/114/smp_affinity
> > wait for irq to hit the cpu1.
> > echo 0 > /sys/devices/system/cpu/cpu1/online
> >
> > will immediately trigger this?
>
> The system is not under any stress at all.

Can you send us your system's dmesg as well as output of /proc/interrupts?

thanks,
suresh
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 11:40:15AM -0700, Siddha, Suresh B wrote:
> Does this problem happen only under certain stress or something simple, like
>
> boot the kernel
> echo 2 > /proc/irq/114/smp_affinity
> wait for irq to hit the cpu1.
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> will immediately trigger this?

The system is not under any stress at all.

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 11:33:01AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 11:13:42AM -0700, Siddha, Suresh B wrote:
> > I see. Your system should have 4 or 8 logical cpus, right? So you must be
> > using logical flat mode, right?
>
> I believe so. The system has two Xeon 5150s with an Intel 5000 chipset
> of some sort.
>
> > When this bug happens, what does /proc/irq/<irq-no>/smp_affinity show?
>
> [EMAIL PROTECTED]:~# cat /proc/irq/114/smp_affinity
> 02

Ok. What this shows is that fixup_irqs() failed to move the irq properly.
Ideally we should see cpu_online_map here (i.e., 0xfd). So most likely
__assign_irq_vector() failed for some reason, and I am puzzled as to why...

Does this problem happen only under certain stress or something simple, like

boot the kernel
echo 2 > /proc/irq/114/smp_affinity
wait for irq to hit the cpu1.
echo 0 > /sys/devices/system/cpu/cpu1/online

will immediately trigger this?

thanks,
suresh
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 11:13:42AM -0700, Siddha, Suresh B wrote:
> I see. Your system should have 4 or 8 logical cpus, right? So you must be
> using logical flat mode, right?

I believe so. The system has two Xeon 5150s with an Intel 5000 chipset
of some sort.

> When this bug happens, what does /proc/irq/<irq-no>/smp_affinity show?

[EMAIL PROTECTED]:~# cat /proc/irq/114/smp_affinity
02

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 10:36:47AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 05, 2007 at 10:23:10AM -0700, Siddha, Suresh B wrote:
>
> > Darrick, I see a kernel bug in this area (which is already filled with bugs,
> > and I am looking into ways to fix them). Are you making sure that
> > between step-1 and step-2, interrupts actually started arriving at
> > cpu1?
> >
> > i.e., do step-1 and wait till the irqs start hitting cpu1. At this point
> > do step-2 and let us know if you still hit this bug?
>
> Yes, the bug only happens after CPU1 begins to receive interrupts.

I see. Your system should have 4 or 8 logical cpus, right? So you must be
using logical flat mode, right?

When this bug happens, what does /proc/irq/<irq-no>/smp_affinity show?

thanks,
suresh
Re: Device hang when offlining a CPU due to IRQ misrouting
On Tue, Jun 05, 2007 at 10:23:10AM -0700, Siddha, Suresh B wrote:
> Darrick, I see a kernel bug in this area (which is already filled with bugs,
> and I am looking into ways to fix them). Are you making sure that
> between step-1 and step-2, interrupts actually started arriving at cpu1?
>
> i.e., do step-1 and wait till the irqs start hitting cpu1. At this point
> do step-2 and let us know if you still hit this bug?

Yes, the bug only happens after CPU1 begins to receive interrupts.

> > There exists a similar scenario. Set the IRQ affinity to a bunch of
> > CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> > interrupts, then offline that CPU. The kernel does not reroute the IRQ
> > to any of the other CPUs and the device also hangs.
>
> Is this a theory or did you observe this problem happening?

Nope, I've observed this situation too.

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
On Thu, May 31, 2007 at 05:44:27PM -0700, Darrick J. Wong wrote:
> Hi there,
>
> I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
> about offlining CPUs. I suspect that this problem extends beyond a
> particular machine, as I've been able to replicate it with an IBM x3650
> and an IBM x3755. This is what I'm doing:
>
> 1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity (IRQ
> 4341 is the network card and we're picking on CPU1 in this example):
> echo 2 > /proc/irq/4341/smp_affinity

Darrick, I see a kernel bug in this area (which is already filled with bugs,
and I am looking into ways to fix them). Are you making sure that
between step-1 and step-2, interrupts actually started arriving at cpu1?

i.e., do step-1 and wait till the irqs start hitting cpu1. At this point
do step-2 and let us know if you still hit this bug?

>
> 2) I then take CPU1 offline:
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> 3) The kernel prints this:
> [ 1101.968040] Breaking affinity for irq 4341
> [ 1102.074019] CPU 1 is now offline
> [ 1102.081593] lockdep: not fixing up alternatives.
> [ 1112.886919] nfs: server 9.47.66.169 not responding, still trying
>
> After step 2 the system never sees interrupts from the network card and
> remains hung like that until CPU1 is brought back up. It looks as
> though the kernel is trying to reroute the IRQ (or so I'm assuming from
> the "Breaking affinity" message), but this doesn't ever happen, so
> the kernel stops seeing interrupts from the device.
>
> Granted, one should not be offlining the CPU that is currently
> designated to handle an IRQ, but I suspect that the kernel ought at a
> minimum to reject the offlining or route the IRQ to any online CPU
> instead of screwing things up.
>
> There exists a similar scenario. Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> interrupts, then offline that CPU. The kernel does not reroute the IRQ
> to any of the other CPUs and the device also hangs.

Is this a theory or did you observe this problem happening?

thanks,
suresh

>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21. Will git-bisect further, but I wanted to know if
> anyone else has seen this sort of problem. afaik, this seems to happen
> with both IOAPIC and MSI interrupts, possibly more.
Re: Device hang when offlining a CPU due to IRQ misrouting
>
> This is just getting confusing.
>
> Emmanuel Fust. Please play with /proc/irq/*/smp_affinity by hand and
> confirm that you can move your irqs. This will confirm it is the decision
> part.
>

Ok, as planned, you're right ;-) Playing with /proc/irq/*/smp_affinity
let me move irqs.

Emmanuel.
Re: Device hang when offlining a CPU due to IRQ misrouting
"Darrick J. Wong" <[EMAIL PROTECTED]> writes:

> On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote:
>
>> I doubt it. The practical problem is that cpu_down does not
>> and by design can not call the irq balancing part properly
>> and I haven't yet seen anything to suggest that we don't migrate
>> irq properly.
>>
>> So I'm guessing it was the decision part.
>
> I'm not using any IRQ balancer, afaik. As I recall, CONFIG_IRQBALANCE
> is i386-only, and I'm not running the userland irqbalance program
> either. Just messing around with /proc/irq/*/smp_affinity by hand. :)

This is just getting confusing.

Emmanuel Fust. Please play with /proc/irq/*/smp_affinity by hand and
confirm that you can move your irqs. This will confirm it is the decision
part.

Darrick. The cpu hotplug architecture makes it impossible to properly
call irq migration code that backs /proc/irq/*/smp_affinity. Therefore
the cpu hotplug interface to irq migration is broken by design.
There are some other bugs in the implementation of migrating
irqs off of cpus as well. I'm pretty certain that some combination
of those problems is biting you.

Eric
Re: Device hang when offlining a CPU due to IRQ misrouting
On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote:

> I doubt it. The practical problem is that cpu_down does not
> and by design can not call the irq balancing part properly
> and I haven't yet seen anything to suggest that we don't migrate
> irq properly.
>
> So I'm guessing it was the decision part.

I'm not using any IRQ balancer, afaik. As I recall, CONFIG_IRQBALANCE
is i386-only, and I'm not running the userland irqbalance program
either. Just messing around with /proc/irq/*/smp_affinity by hand. :)

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
> As a side note, on my very old SMP machine, 2.6.20 correctly
> load-balances IRQs across CPUs but 2.6.21 does not. I know that the
> in-kernel IRQ load balancer is marked as deprecated and
> somewhat broken, but your report makes me think it
> could be a bug in the IRQ rerouting part in my case too, and
> not necessarily in the load-balancer (decision) part.

I doubt it. The practical problem is that cpu_down does not
and by design can not call the irq balancing part properly
and I haven't yet seen anything to suggest that we don't migrate
irq properly.

So I'm guessing it was the decision part.

Eric
Re: Device hang when offlining a CPU due to IRQ misrouting
> There exists a similar scenario. Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> interrupts, then offline that CPU. The kernel does not reroute the IRQ
> to any of the other CPUs and the device also hangs.
>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21. Will git-bisect further, but I wanted to know if
> anyone else has seen this sort of problem. afaik, this seems to happen
> with both IOAPIC and MSI interrupts, possibly more.

Hi,

As a side note, on my very old SMP machine, 2.6.20 correctly
load-balances IRQs across CPUs but 2.6.21 does not. I know that the
in-kernel IRQ load balancer is marked as deprecated and somewhat
broken, but your report makes me think it could be a bug in the
IRQ rerouting part in my case too, and not necessarily in the
load-balancer (decision) part.

Emmanuel.
Re: Device hang when offlining a CPU due to IRQ misrouting
"Darrick J. Wong" <[EMAIL PROTECTED]> writes:

> Hi there,
>
> I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
> about offlining CPUs. I suspect that this problem extends beyond a
> particular machine, as I've been able to replicate it with an IBM x3650
> and an IBM x3755. This is what I'm doing:
>
> 1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity (IRQ
> 4341 is the network card and we're picking on CPU1 in this example):
> echo 2 > /proc/irq/4341/smp_affinity
>
> 2) I then take CPU1 offline:
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> 3) The kernel prints this:
> [ 1101.968040] Breaking affinity for irq 4341
> [ 1102.074019] CPU 1 is now offline
> [ 1102.081593] lockdep: not fixing up alternatives.
> [ 1112.886919] nfs: server 9.47.66.169 not responding, still trying
>
> After step 2 the system never sees interrupts from the network card and
> remains hung like that until CPU1 is brought back up. It looks as
> though the kernel is trying to reroute the IRQ (or so I'm assuming from
> the "Breaking affinity" message), but this doesn't ever happen, so
> the kernel stops seeing interrupts from the device.
>
> Granted, one should not be offlining the CPU that is currently
> designated to handle an IRQ, but I suspect that the kernel ought at a
> minimum to reject the offlining or route the IRQ to any online CPU
> instead of screwing things up.

I agree.

> There exists a similar scenario. Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> interrupts, then offline that CPU. The kernel does not reroute the IRQ
> to any of the other CPUs and the device also hangs.
>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21. Will git-bisect further, but I wanted to know if
> anyone else has seen this sort of problem. afaik, this seems to happen
> with both IOAPIC and MSI interrupts, possibly more.

Thanks for the bug report.

I'm chuckling because I just submitted a patch to count that whole
code path as broken, based on code review. It is trying to do
something that the hardware can not reliably accomplish.

Now I am surprised you were seeing this with MSI as well because
the hardware should theoretically work in that case. However the
irq_fixup code has enough issues that I wouldn't be surprised if
it was just doing something stupid and wrong.

Eric
Re: Device hang when offlining a CPU due to IRQ misrouting
Darrick J. Wong <[EMAIL PROTECTED]> writes:

> Hi there,
>
> I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
> about offlining CPUs. I suspect that this problem extends beyond a
> particular machine, as I've been able to replicate it with an IBM
> x3650 and an IBM x3755. This is what I'm doing:
>
> 1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity
> (IRQ 4341 is the network card and we're picking on CPU1 in this
> example):
>
> echo 2 > /proc/irq/4341/smp_affinity
>
> 2) I then take CPU1 offline:
>
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> 3) The kernel prints this:
>
> [ 1101.968040] Breaking affinity for irq 4341
> [ 1102.074019] CPU 1 is now offline
> [ 1102.081593] lockdep: not fixing up alternatives.
> [ 1112.886919] nfs: server 9.47.66.169 not responding, still trying
>
> After step 2 the system never sees interrupts from the network card
> and remains hung like that until CPU1 is brought back up. It looks as
> though the kernel is trying to reroute the IRQ (or so I'm assuming
> from the "Breaking affinity" message), but this doesn't ever happen,
> so the kernel stops seeing interrupts from the device.
>
> Granted, one should not be offlining the CPU that is currently
> designated to handle an IRQ, but I suspect that the kernel ought at a
> minimum to reject the offlining or route the IRQ to any online CPU
> instead of screwing things up.

I agree.

> There exists a similar scenario. Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing
> the interrupts, then offline that CPU. The kernel does not reroute the
> IRQ to any of the other CPUs and the device also hangs.
>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21. Will git-bisect further, but I wanted to know
> if anyone else has seen this sort of problem. afaik, this seems to
> happen with both IOAPIC and MSI interrupts, possibly more.

Thanks for the bug report.

I'm chuckling because I just submitted a patch to count that whole code
path as broken, based on code review. It is trying to do something that
the hardware can not reliably accomplish.

Now I am surprised you were seeing this with MSI as well because the
hardware should theoretically work in that case. However the irq_fixup
code has enough issues that I wouldn't be surprised if it was just
doing something stupid and wrong.

Eric
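The behaviour asked for in the report (fall back to any online CPU when the requested affinity no longer contains one) comes down to a mask intersection. Below is a minimal sketch in POSIX shell, modelled loosely on the cpus_and()/cpus_empty() logic of the fixup_irqs patch posted elsewhere in this thread. The function name `fixup_target` and its output format are hypothetical, and it only computes masks; it never touches an IRQ.

```shell
# fixup_target AFFINITY ONLINE: both arguments are hex CPU masks without "0x".
# Prints the mask the IRQ should be routed to, followed by "kept" or "broke".
# Hypothetical user-space illustration; the kernel does this with cpumask_t.
fixup_target() {
    target=$(( 0x$1 & 0x$2 ))      # CPUs that are both allowed and online
    if [ "$target" -eq 0 ]; then
        # No allowed CPU is left online: break affinity, use any online CPU.
        printf '%x broke\n' "$(( 0x$2 ))"
    else
        printf '%x kept\n' "$target"
    fi
}
```

With the report's affinity of 2 (CPU1 only) and, purely as an assumption for illustration, a four-CPU machine whose online mask becomes d once CPU1 goes down, `fixup_target 2 d` prints `d broke`.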
Re: Device hang when offlining a CPU due to IRQ misrouting
> There exists a similar scenario. Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing
> the interrupts, then offline that CPU. The kernel does not reroute the
> IRQ to any of the other CPUs and the device also hangs.
>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21. Will git-bisect further, but I wanted to know
> if anyone else has seen this sort of problem. afaik, this seems to
> happen with both IOAPIC and MSI interrupts, possibly more.

Hi,

As a side note, on my very old SMP machine, 2.6.20 correctly
load-balances IRQs across CPUs but 2.6.21 does not. I know that the
in-kernel IRQ load balancer is marked as deprecated and somewhat
broken, but your report makes me think it could be a bug in the IRQ
rerouting part in my case too, and not necessarily in the load-balancer
(decision) part.

Emmanuel.
Re: Device hang when offlining a CPU due to IRQ misrouting
> As a side note, on my very old SMP machine, 2.6.20 correctly
> load-balances IRQs across CPUs but 2.6.21 does not. I know that the
> in-kernel IRQ load balancer is marked as deprecated and somewhat
> broken, but your report makes me think it could be a bug in the IRQ
> rerouting part in my case too, and not necessarily in the
> load-balancer (decision) part.

I doubt it. The practical problem is that cpu_down does not and by
design can not call the irq balancing part properly and I haven't yet
seen anything to suggest that we don't migrate irq properly.

So I'm guessing it was the decision part.

Eric
Re: Device hang when offlining a CPU due to IRQ misrouting
On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote:

> I doubt it. The practical problem is that cpu_down does not and by
> design can not call the irq balancing part properly and I haven't yet
> seen anything to suggest that we don't migrate irq properly.
>
> So I'm guessing it was the decision part.

I'm not using any IRQ balancer, afaik. As I recall, CONFIG_IRQBALANCE
is i386-only, and I'm not running the userland irqbalance program
either. Just messing around with /proc/irq/*/smp_affinity by hand. :)

--D
Re: Device hang when offlining a CPU due to IRQ misrouting
Darrick J. Wong <[EMAIL PROTECTED]> writes:

> On Fri, Jun 01, 2007 at 06:18:32PM -0600, Eric W. Biederman wrote:
>
> > I doubt it. The practical problem is that cpu_down does not and by
> > design can not call the irq balancing part properly and I haven't
> > yet seen anything to suggest that we don't migrate irq properly.
> >
> > So I'm guessing it was the decision part.
>
> I'm not using any IRQ balancer, afaik. As I recall, CONFIG_IRQBALANCE
> is i386-only, and I'm not running the userland irqbalance program
> either. Just messing around with /proc/irq/*/smp_affinity by hand. :)

This is just getting confusing.

Emmanuel Fust. Please play with /proc/irq/*/smp_affinity by hand and
confirm that you can move your irqs. This will confirm it is the
decision part.

Darrick. The cpu hotplug architecture makes it impossible to properly
call irq migration code that backs /proc/irq/*/smp_affinity. Therefore
the cpu hotplug interface to irq migration is broken by design. There
are some other bugs in the implementation of migrating irqs off of cpus
as well. I'm pretty certain that some combination of those problems is
biting you.

Eric
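To confirm by hand that an IRQ has actually moved, one can check which CPU column in /proc/interrupts is accumulating counts. Below is a rough sketch in POSIX shell with awk. The function name `servicing_cpu` is hypothetical, and the sketch assumes the usual layout in which the first line lists the CPU columns and each IRQ line begins with `NUMBER:`.

```shell
# servicing_cpu IRQ: read /proc/interrupts-style text on stdin and print the
# label of the CPU column with the highest count for that IRQ.
# Hypothetical helper; assumes line 1 is the "CPU0 CPU1 ..." header.
servicing_cpu() {
    awk -v irq="$1:" '
        NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
        $1 == irq {
            best = 1
            for (i = 2; i <= n; i++)
                if ($(i + 1) + 0 > $(best + 1) + 0) best = i
            print name[best]
        }'
}
```

It would be run as `servicing_cpu 4341 < /proc/interrupts`. The counts are cumulative, so comparing two snapshots taken before and after a change to smp_affinity is a more reliable check than a single reading.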
Device hang when offlining a CPU due to IRQ misrouting
Hi there,

I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
about offlining CPUs. I suspect that this problem extends beyond a
particular machine, as I've been able to replicate it with an IBM x3650
and an IBM x3755. This is what I'm doing:

1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity
(IRQ 4341 is the network card and we're picking on CPU1 in this
example):

echo 2 > /proc/irq/4341/smp_affinity

2) I then take CPU1 offline:

echo 0 > /sys/devices/system/cpu/cpu1/online

3) The kernel prints this:

[ 1101.968040] Breaking affinity for irq 4341
[ 1102.074019] CPU 1 is now offline
[ 1102.081593] lockdep: not fixing up alternatives.
[ 1112.886919] nfs: server 9.47.66.169 not responding, still trying

After step 2 the system never sees interrupts from the network card and
remains hung like that until CPU1 is brought back up. It looks as
though the kernel is trying to reroute the IRQ (or so I'm assuming from
the "Breaking affinity" message), but this doesn't ever happen, so the
kernel stops seeing interrupts from the device.

Granted, one should not be offlining the CPU that is currently
designated to handle an IRQ, but I suspect that the kernel ought at a
minimum to reject the offlining or route the IRQ to any online CPU
instead of screwing things up.

There exists a similar scenario. Set the IRQ affinity to a bunch of
CPUs, watch /proc/interrupts to see which CPU is actually servicing the
interrupts, then offline that CPU. The kernel does not reroute the IRQ
to any of the other CPUs and the device also hangs.

The furthest that I've dug is that it works on 2.6.17 and is broken in
2.6.22-rc3 and 2.6.21. Will git-bisect further, but I wanted to know if
anyone else has seen this sort of problem. afaik, this seems to happen
with both IOAPIC and MSI interrupts, possibly more.

--D
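For readers unfamiliar with the mask format: the value written to smp_affinity is a hexadecimal CPU bitmask, which is why `echo 2` selects CPU1 in the report above. Below is a minimal sketch of building such a mask in POSIX shell. The helper name `cpus_to_mask` is hypothetical and not part of any kernel tooling, and it assumes CPU indexes small enough to fit in the shell's integer arithmetic.

```shell
# cpus_to_mask CPU...: print the hex bitmask with one bit set per CPU index.
# Hypothetical helper for illustration only; it does not touch /proc.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))   # set the bit for this CPU index
    done
    printf '%x\n' "$mask"
}
```

`cpus_to_mask 1` prints `2`, matching step 1 of the report. As root, the result could be written with `echo "$(cpus_to_mask 1)" > /proc/irq/4341/smp_affinity`.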