Re: [PATCH 2/2] Debug handling of early spurious interrupts

2007-07-30 Thread Andrew Morton
On Tue, 31 Jul 2007 11:25:26 +0900 Fernando Luis Vázquez Cao <[EMAIL 
PROTECTED]> wrote:

> > runtime.  Some drivers don't get used by many people and users of some
> > architectures (esp embedded) tend to lag kernel.org by a long time.  So it
> > could be years before all the fallout from this change is finally wrapped
> > up.
> Yes, that is a big concern. However, the same embedded people is
> starting to use both kexec and kdump, so they may suffer the issues we
> are trying to weed out anyway, even if these patches are not applied.
> The difference is that with this new functionality it is possible to
> catch potential problems relatively easily, because any incorrect
> behaviour this may cause will be easily reproducible and, in most cases,
> will reveal itself early at boot time.
> 
> As things stand now, I guess we will keep seeing occasional crashes and
> strange behaviour in kexec-booted kernels, which in some cases will be
> due to incorrect handling of spurious interrupts. Besides, such problems
> are really difficult to reproduce because, commonly, we would need to
> hit an obscure corner case.

Please have a think about what we can do to aid this debugging.  For
example, add an uncondtional printk into request_irq() for now which tells
us what irq is being registered - a print_symbol() of the irq_handler_t
would be pretty good, although I suspect there are a lot of interrupt
handlers with the same name..


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Debug handling of early spurious interrupts

2007-07-30 Thread Fernando Luis Vázquez Cao
On Mon, 2007-07-30 at 11:22 -0700, Andrew Morton wrote: 
> On Mon, 30 Jul 2007 18:58:14 +0900
> Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote:
> 
> > > 
> > > So bad things might happen because of this change.  And if they do, they
> > > will take a lng time to be discovered, because non-shared interrupt
> > > handlers tend to dwell in crufty old drivers which not many people use.
> > > 
> > > So I'm wondering if it would be more prudent to only do all this for 
> > > shared
> > > handlers?
> > 
> > I have been testing this patches on all my machines, which spans three
> > different architectures: i386, x86_64 (EM64T and Opteron), and ia64.
> > 
> > The good news is that nothing catastrophic happened and, in the process,
> > I managed to find some somewhat buggy interrupt handlers.
> > 
> > I will present a brief summary of my findings here.
> 
> That's quite a lot of breakage for such a small sample of machines.  I do
> suspect that if we were to merge this lot into mainline, all sorts of bad
> stuff would happen.
Yup.

> otoh, as you point out, pretty much everthing which goes wrong is due to
> incorrect or dubious driver behaviour, and there is value in weeding these
> things out. 
> 
> But the problem with this process is that we're weeding things out at
> runtime.  Some drivers don't get used by many people and users of some
> architectures (esp embedded) tend to lag kernel.org by a long time.  So it
> could be years before all the fallout from this change is finally wrapped
> up.
Yes, that is a big concern. However, the same embedded people is
starting to use both kexec and kdump, so they may suffer the issues we
are trying to weed out anyway, even if these patches are not applied.
The difference is that with this new functionality it is possible to
catch potential problems relatively easily, because any incorrect
behaviour this may cause will be easily reproducible and, in most cases,
will reveal itself early at boot time.

As things stand now, I guess we will keep seeing occasional crashes and
strange behaviour in kexec-booted kernels, which in some cases will be
due to incorrect handling of spurious interrupts. Besides, such problems
are really difficult to reproduce because, commonly, we would need to
hit an obscure corner case.

> > If we find drivers that are not fixable we should probably consider this
> > new behaviour on a per-driver basis, as Andrew suggested.
> 
> We haven't found any such yet, have we?
Not yet, fortunately.

> > This would
> > probably require passing a new flag into request_irq. Besides, when such
> > a driver is detected we should emit a warning that it may not work
> > properly in a kdump kernel.
> > 
> > I would appreciate your comments on this.
> 
> Oh well, let's just keep maintaining it in -mm for now, gather more
> information so that we can make a more informed decision later on.
Makes sense. I will look into all the issues I have found so far, do
some more testing (I think a new approach is needed to speed up testing
of these issues...), and get back to you.

Thank you.

Fernando

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Debug handling of early spurious interrupts

2007-07-30 Thread Andrew Morton
On Mon, 30 Jul 2007 18:58:14 +0900
Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote:

> > 
> > So bad things might happen because of this change.  And if they do, they
> > will take a lng time to be discovered, because non-shared interrupt
> > handlers tend to dwell in crufty old drivers which not many people use.
> > 
> > So I'm wondering if it would be more prudent to only do all this for shared
> > handlers?
> 
> I have been testing this patches on all my machines, which spans three
> different architectures: i386, x86_64 (EM64T and Opteron), and ia64.
> 
> The good news is that nothing catastrophic happened and, in the process,
> I managed to find some somewhat buggy interrupt handlers.
> 
> I will present a brief summary of my findings here.

That's quite a lot of breakage for such a small sample of machines.  I do
suspect that if we were to merge this lot into mainline, all sorts of bad
stuff would happen.

otoh, as you point out, pretty much everthing which goes wrong is due to
incorrect or dubious driver behaviour, and there is value in weeding these
things out. 

But the problem with this process is that we're weeding things out at
runtime.  Some drivers don't get used by many people and users of some
architectures (esp embedded) tend to lag kernel.org by a long time.  So it
could be years before all the fallout from this change is finally wrapped
up.

> 
> If we find drivers that are not fixable we should probably consider this
> new behaviour on a per-driver basis, as Andrew suggested.

We haven't found any such yet, have we?

> This would
> probably require passing a new flag into request_irq. Besides, when such
> a driver is detected we should emit a warning that it may not work
> properly in a kdump kernel.
> 
> I would appreciate your comments on this.

Oh well, let's just keep maintaining it in -mm for now, gather more
information so that we can make a more informed decision later on.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Debug handling of early spurious interrupts

2007-07-30 Thread Fernando Luis Vázquez Cao
On Fri, 2007-07-20 at 14:43 -0700, Andrew Morton wrote: 
> On Fri, 20 Jul 2007 11:20:43 +0900
> Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote:
> 
> > With the advent of kdump it is possible that device drivers receive
> > interrupts generated in the context of a previous kernel. Ideally
> > quiescing the underlying devices should suffice but not all drivers do
> > this, either because it is not possible or because they did not
> > contemplate this case. Thus drivers ought to be able to handle
> > interrupts coming in as soon as the interrupt handler is registered.
> > 
> > Signed-off-by: Fernando Luis Vazquez Cao <[EMAIL PROTECTED]>
> > ---
> > 
> > diff -urNp linux-2.6.22-orig/kernel/irq/manage.c 
> > linux-2.6.22-pendirq/kernel/irq/manage.c
> > --- linux-2.6.22-orig/kernel/irq/manage.c   2007-07-19 18:18:53.0 
> > +0900
> > +++ linux-2.6.22-pendirq/kernel/irq/manage.c2007-07-19 
> > 19:43:41.0 +0900
> > @@ -533,21 +533,32 @@ int request_irq(unsigned int irq, irq_ha
> >  
> > select_smp_affinity(irq);
> >  
> > -   if (irqflags & IRQF_SHARED) {
> > -   /*
> > -* It's a shared IRQ -- the driver ought to be prepared for it
> > -* to happen immediately, so let's make sure
> > -* We do this before actually registering it, to make sure that
> > -* a 'real' IRQ doesn't run in parallel with our fake
> > -*/
> > -   if (irqflags & IRQF_DISABLED) {
> > -   unsigned long flags;
> > +   /*
> > +* With the advent of kdump it possible that device drivers receive
> > +* interrupts generated in the context of a previous kernel. Ideally
> > +* quiescing the underlying devices should suffice but not all drivers
> > +* do this, either because it is not possible or because they did not
> > +* contemplate this case. Thus drivers ought to be able to handle
> > +* interrupts coming in as soon as the interrupt handler is registered.
> > +*
> > +* Besides, if it is a shared IRQ the driver ought to be prepared for
> > +* it to happen immediately too.
> > +*
> > +* We do this before actually registering it, to make sure that
> > +* a 'real' IRQ doesn't run in parallel with our fake.
> > +*/
> > +   if (irqflags & IRQF_DISABLED) {
> > +   unsigned long flags;
> >  
> > -   local_irq_save(flags);
> > -   handler(irq, dev_id);
> > -   local_irq_restore(flags);
> > -   } else
> > -   handler(irq, dev_id);
> > +   local_irq_save(flags);
> > +   retval = handler(irq, dev_id);
> > +   local_irq_restore(flags);
> > +   } else
> > +   retval = handler(irq, dev_id);
> > +   if (retval == IRQ_HANDLED) {
> > +   printk(KERN_WARNING
> > +  "%s (IRQ %d) handled a spurious interrupt\n",
> > +  devname, irq);
> > }
> >  
> 
> This change means that we'll run the irq handler at request_irq()-time even
> for non-shared interrupts.
> 
> This is a bit of a worry.  See, shared-interrupt handlers are required to
> be able to cope with being called when their device isn't interrupting. 
> But nobody ever said that non-shared interrupt handlers need to be able to
> cope with that.
> 
> Hence these non-shared handlers are within their rights to emit warning
> printks, go BUG or accuse the operator of having busted hardware.
> 
> So bad things might happen because of this change.  And if they do, they
> will take a lng time to be discovered, because non-shared interrupt
> handlers tend to dwell in crufty old drivers which not many people use.
> 
> So I'm wondering if it would be more prudent to only do all this for shared
> handlers?

I have been testing this patches on all my machines, which spans three
different architectures: i386, x86_64 (EM64T and Opteron), and ia64.

The good news is that nothing catastrophic happened and, in the process,
I managed to find some somewhat buggy interrupt handlers.

I will present a brief summary of my findings here. That said, I have to
admit that these are still preliminary results and have not had time to
dig deep into all the issues. I would like hear from you first.

- Test environment

-- x86_64 [EM64T]
   CPU: Intel(R) Xeon(TM) CPU 3.20GHz (2 cpus)
   SCSI: Adaptec AIC-7902B U320
   IDE: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller

-- x86_64 [Opteron]
   CPU: AMD Opteron(tm) Processor 240 (2 cpus)
   IDE: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)

-- ia64
   CPU: Itanium 2 Madison (2 cpus)
   SCSI: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual
Ultra320 SCSI
   IDE: Intel Corporation 82801DB (ICH4) IDE Controller

-- i386
   CPU: Intel(R) Xeon(TM) CPU 2.40GHz (2 cpus)
   IDE: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev
02)

- Test results for i386

--i8042 (IRQ 12) handled a spurious interrupt -> i8042
The culprit here was the i

Re: [PATCH 2/2] Debug handling of early spurious interrupts

2007-07-20 Thread Andrew Morton
On Fri, 20 Jul 2007 11:20:43 +0900
Fernando Luis V__zquez Cao <[EMAIL PROTECTED]> wrote:

> With the advent of kdump it is possible that device drivers receive
> interrupts generated in the context of a previous kernel. Ideally
> quiescing the underlying devices should suffice but not all drivers do
> this, either because it is not possible or because they did not
> contemplate this case. Thus drivers ought to be able to handle
> interrupts coming in as soon as the interrupt handler is registered.
> 
> Signed-off-by: Fernando Luis Vazquez Cao <[EMAIL PROTECTED]>
> ---
> 
> diff -urNp linux-2.6.22-orig/kernel/irq/manage.c 
> linux-2.6.22-pendirq/kernel/irq/manage.c
> --- linux-2.6.22-orig/kernel/irq/manage.c 2007-07-19 18:18:53.0 
> +0900
> +++ linux-2.6.22-pendirq/kernel/irq/manage.c  2007-07-19 19:43:41.0 
> +0900
> @@ -533,21 +533,32 @@ int request_irq(unsigned int irq, irq_ha
>  
>   select_smp_affinity(irq);
>  
> - if (irqflags & IRQF_SHARED) {
> - /*
> -  * It's a shared IRQ -- the driver ought to be prepared for it
> -  * to happen immediately, so let's make sure
> -  * We do this before actually registering it, to make sure that
> -  * a 'real' IRQ doesn't run in parallel with our fake
> -  */
> - if (irqflags & IRQF_DISABLED) {
> - unsigned long flags;
> + /*
> +  * With the advent of kdump it possible that device drivers receive
> +  * interrupts generated in the context of a previous kernel. Ideally
> +  * quiescing the underlying devices should suffice but not all drivers
> +  * do this, either because it is not possible or because they did not
> +  * contemplate this case. Thus drivers ought to be able to handle
> +  * interrupts coming in as soon as the interrupt handler is registered.
> +  *
> +  * Besides, if it is a shared IRQ the driver ought to be prepared for
> +  * it to happen immediately too.
> +  *
> +  * We do this before actually registering it, to make sure that
> +  * a 'real' IRQ doesn't run in parallel with our fake.
> +  */
> + if (irqflags & IRQF_DISABLED) {
> + unsigned long flags;
>  
> - local_irq_save(flags);
> - handler(irq, dev_id);
> - local_irq_restore(flags);
> - } else
> - handler(irq, dev_id);
> + local_irq_save(flags);
> + retval = handler(irq, dev_id);
> + local_irq_restore(flags);
> + } else
> + retval = handler(irq, dev_id);
> + if (retval == IRQ_HANDLED) {
> + printk(KERN_WARNING
> +"%s (IRQ %d) handled a spurious interrupt\n",
> +devname, irq);
>   }
>  

This change means that we'll run the irq handler at request_irq()-time even
for non-shared interrupts.

This is a bit of a worry.  See, shared-interrupt handlers are required to
be able to cope with being called when their device isn't interrupting. 
But nobody ever said that non-shared interrupt handlers need to be able to
cope with that.

Hence these non-shared handlers are within their rights to emit warning
printks, go BUG or accuse the operator of having busted hardware.

So bad things might happen because of this change.  And if they do, they
will take a lng time to be discovered, because non-shared interrupt
handlers tend to dwell in crufty old drivers which not many people use.

So I'm wondering if it would be more prudent to only do all this for shared
handlers?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] Debug handling of early spurious interrupts

2007-07-19 Thread Fernando Luis Vázquez Cao
With the advent of kdump it is possible that device drivers receive
interrupts generated in the context of a previous kernel. Ideally
quiescing the underlying devices should suffice but not all drivers do
this, either because it is not possible or because they did not
contemplate this case. Thus drivers ought to be able to handle
interrupts coming in as soon as the interrupt handler is registered.

Signed-off-by: Fernando Luis Vazquez Cao <[EMAIL PROTECTED]>
---

diff -urNp linux-2.6.22-orig/kernel/irq/manage.c 
linux-2.6.22-pendirq/kernel/irq/manage.c
--- linux-2.6.22-orig/kernel/irq/manage.c   2007-07-19 18:18:53.0 
+0900
+++ linux-2.6.22-pendirq/kernel/irq/manage.c2007-07-19 19:43:41.0 
+0900
@@ -533,21 +533,32 @@ int request_irq(unsigned int irq, irq_ha
 
select_smp_affinity(irq);
 
-   if (irqflags & IRQF_SHARED) {
-   /*
-* It's a shared IRQ -- the driver ought to be prepared for it
-* to happen immediately, so let's make sure
-* We do this before actually registering it, to make sure that
-* a 'real' IRQ doesn't run in parallel with our fake
-*/
-   if (irqflags & IRQF_DISABLED) {
-   unsigned long flags;
+   /*
+* With the advent of kdump it possible that device drivers receive
+* interrupts generated in the context of a previous kernel. Ideally
+* quiescing the underlying devices should suffice but not all drivers
+* do this, either because it is not possible or because they did not
+* contemplate this case. Thus drivers ought to be able to handle
+* interrupts coming in as soon as the interrupt handler is registered.
+*
+* Besides, if it is a shared IRQ the driver ought to be prepared for
+* it to happen immediately too.
+*
+* We do this before actually registering it, to make sure that
+* a 'real' IRQ doesn't run in parallel with our fake.
+*/
+   if (irqflags & IRQF_DISABLED) {
+   unsigned long flags;
 
-   local_irq_save(flags);
-   handler(irq, dev_id);
-   local_irq_restore(flags);
-   } else
-   handler(irq, dev_id);
+   local_irq_save(flags);
+   retval = handler(irq, dev_id);
+   local_irq_restore(flags);
+   } else
+   retval = handler(irq, dev_id);
+   if (retval == IRQ_HANDLED) {
+   printk(KERN_WARNING
+  "%s (IRQ %d) handled a spurious interrupt\n",
+  devname, irq);
}
 
retval = setup_irq(irq, action);


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/