subject:"Serial related oops"

Re: Serial related oops

2007-03-01 Thread Jose Goncalves

Russell King wrote:
> On Thu, Mar 01, 2007 at 01:33:28PM +, Jose Goncalves wrote:
>   
>> I've also done your suggestion and I've inserted "msleep(10);" just
>> before the "And clear the interrupt registers again for luck." and my
>> application is now running without problems fore more than 24H! So,
>> inserting a delay in this point definitely makes some difference (has
>> was with adding some extra printk() in several points of
>> serial8250_startup()).
>>
>> This said, for me, this is definitely a software problem. The question
>> is were?
>> 
>
> I'm personally convinced it's hardware because according to my analysis
> your CPU behaving in a way that the code is not asking it to do so.
>   

It's not possible that a interrupt is hitting just after enabling
interrupts with "serial_outp(up, UART_IER, up->ier);" which triggers the
execution of some code that is not reported by the Oops dump (at least
with my current configuration) ?

José Gonçalves

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-03-01 Thread Russell King

On Thu, Mar 01, 2007 at 01:33:28PM +, Jose Goncalves wrote:
> I've also done your suggestion and I've inserted "msleep(10);" just
> before the "And clear the interrupt registers again for luck." and my
> application is now running without problems fore more than 24H! So,
> inserting a delay in this point definitely makes some difference (has
> was with adding some extra printk() in several points of
> serial8250_startup()).
> 
> This said, for me, this is definitely a software problem. The question
> is were?

I'm personally convinced it's hardware because according to my analysis
your CPU behaving in a way that the code is not asking it to do so.

Maybe others have some further insight; I certainly don't.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-03-01 Thread Jose Goncalves

Hi again Russel,

I'm back, after some more testing. Here goes my report.

I've switched to another SBC and the kernel still Oops, so is not a
one-off fault on the hardware.

I've also run memtest86+ on this board for the maximum period that I
reach an Oops with my application (24 H) and it not detected any fault
(in 21 passes).

As I've said earlier, our hardware as an extra serial controller
(TL16C554A). To isolate the problem, I've removed the board with this
extra controller and used only the SBC (Vortex86-6070 -
http://www.icop.com.tw/products_detail.asp?ProductID=70). Still, with
that setup and with my application using only ttyS1, I get kernel Oops,
and always in the same point:

<1>[43477.986867] Unable to handle kernel NULL pointer dereference at
virtual address 0012
<1>[43477.995067]  printing eip:
<4>[43478.003087] c01bfa7a
<1>[43478.003116] *pde = 
<0>[43478.011231] Oops:  [#1]
<4>[43478.019188] Modules linked in:
<0>[43478.027308] CPU:0
<4>[43478.027325] EIP:0060:[]Not tainted VLI
<4>[43478.027341] EFLAGS: 00010202   (2.6.16.41-mtm6-debug1 #1)
<0>[43478.052490] EIP is at serial_in+0xa/0x4a
<0>[43478.061448] eax: 0060   ebx:    ecx:    edx:

<0>[43478.070945] esi:    edi: 0040   ebp: c7237e1c   esp:
c7237e18
<0>[43478.080720] ds: 007b   es: 007b   ss: 0068
<0>[43478.090470] Process gp_position (pid: 26205, threadinfo=c7236000
task=c775dab0)
<0>[43478.091319] Stack: <0>  c01c0f88  
c031fef0 0005 0202
<0>[43478.113464]c717fa1c c031fef0 c124b510 c7237e60 c01bd97d
c031fef0 c124b510 c124b510
<0>[43478.126484] c760c52c c7237e7c c01befe7 c124b510
 ffed c760c52c
<0>[43478.139984] Call Trace:
<0>[43478.152627]  [] show_stack_log_lvl+0xa5/0xad
<0>[43478.166200]  [] show_registers+0x106/0x16f
<0>[43478.179852]  [] die+0xb6/0x127
<0>[43478.193589]  [] do_page_fault+0x380/0x4b3
<0>[43478.207616]  [] error_code+0x4f/0x60
<0>[43478.221803]  [] serial8250_startup+0x28f/0x2a9
<0>[43478.236340] Code: 38 43 78 75 02 b2 01 89 d0 eb 10 8b 41 70 39 43
70 0f 94 c0 0f b6 c0 eb 02 31 c0 5b 5d c3 90 90 90 55 89 e5 53 8b 5d 08
8b 55 0c <0f> b6 4b 12 0f b6 43 13 d3 e2 83 f8 02 74 1a 7f 05 48 74 09 eb
<4>[43478.322255]  BUG: gp_position/26205, lock held at task exit time!
<4>[43478.341721]  [c124b528] {uart_register_driver}
<4>[43478.359168] .. held by:   gp_position:26205 [c775dab0, 117]
<4>[43478.377112] ... acquired at:   uart_get+0x28/0xde

I've also done your suggestion and I've inserted "msleep(10);" just
before the "And clear the interrupt registers again for luck." and my
application is now running without problems fore more than 24H! So,
inserting a delay in this point definitely makes some difference (has
was with adding some extra printk() in several points of
serial8250_startup()).

This said, for me, this is definitely a software problem. The question
is were?
I would appreciate if you (or anyone) could give me any pointers on how
to detect the cause of my kernel Oops (perhaps activating extra kernel
debug?)

Thanks,
José Gonçalves


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-03-01 Thread Jose Goncalves

Hi again Russel,

I'm back, after some more testing. Here goes my report.

I've switched to another SBC and the kernel still Oops, so is not a
one-off fault on the hardware.

I've also run memtest86+ on this board for the maximum period that I
reach an Oops with my application (24 H) and it not detected any fault
(in 21 passes).

As I've said earlier, our hardware as an extra serial controller
(TL16C554A). To isolate the problem, I've removed the board with this
extra controller and used only the SBC (Vortex86-6070 -
http://www.icop.com.tw/products_detail.asp?ProductID=70). Still, with
that setup and with my application using only ttyS1, I get kernel Oops,
and always in the same point:

1[43477.986867] Unable to handle kernel NULL pointer dereference at
virtual address 0012
1[43477.995067]  printing eip:
4[43478.003087] c01bfa7a
1[43478.003116] *pde = 
0[43478.011231] Oops:  [#1]
4[43478.019188] Modules linked in:
0[43478.027308] CPU:0
4[43478.027325] EIP:0060:[c01bfa7a]Not tainted VLI
4[43478.027341] EFLAGS: 00010202   (2.6.16.41-mtm6-debug1 #1)
0[43478.052490] EIP is at serial_in+0xa/0x4a
0[43478.061448] eax: 0060   ebx:    ecx:    edx:

0[43478.070945] esi:    edi: 0040   ebp: c7237e1c   esp:
c7237e18
0[43478.080720] ds: 007b   es: 007b   ss: 0068
0[43478.090470] Process gp_position (pid: 26205, threadinfo=c7236000
task=c775dab0)
0[43478.091319] Stack: 0  c01c0f88  
c031fef0 0005 0202
0[43478.113464]c717fa1c c031fef0 c124b510 c7237e60 c01bd97d
c031fef0 c124b510 c124b510
0[43478.126484] c760c52c c7237e7c c01befe7 c124b510
 ffed c760c52c
0[43478.139984] Call Trace:
0[43478.152627]  [c0102a35] show_stack_log_lvl+0xa5/0xad
0[43478.166200]  [c0102b70] show_registers+0x106/0x16f
0[43478.179852]  [c0102d06] die+0xb6/0x127
0[43478.193589]  [c0109677] do_page_fault+0x380/0x4b3
0[43478.207616]  [c01026bf] error_code+0x4f/0x60
0[43478.221803]  [c01c0f88] serial8250_startup+0x28f/0x2a9
0[43478.236340] Code: 38 43 78 75 02 b2 01 89 d0 eb 10 8b 41 70 39 43
70 0f 94 c0 0f b6 c0 eb 02 31 c0 5b 5d c3 90 90 90 55 89 e5 53 8b 5d 08
8b 55 0c 0f b6 4b 12 0f b6 43 13 d3 e2 83 f8 02 74 1a 7f 05 48 74 09 eb
4[43478.322255]  BUG: gp_position/26205, lock held at task exit time!
4[43478.341721]  [c124b528] {uart_register_driver}
4[43478.359168] .. held by:   gp_position:26205 [c775dab0, 117]
4[43478.377112] ... acquired at:   uart_get+0x28/0xde

I've also done your suggestion and I've inserted msleep(10); just
before the And clear the interrupt registers again for luck. and my
application is now running without problems fore more than 24H! So,
inserting a delay in this point definitely makes some difference (has
was with adding some extra printk() in several points of
serial8250_startup()).

This said, for me, this is definitely a software problem. The question
is were?
I would appreciate if you (or anyone) could give me any pointers on how
to detect the cause of my kernel Oops (perhaps activating extra kernel
debug?)

Thanks,
José Gonçalves


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-03-01 Thread Russell King

On Thu, Mar 01, 2007 at 01:33:28PM +, Jose Goncalves wrote:
 I've also done your suggestion and I've inserted msleep(10); just
 before the And clear the interrupt registers again for luck. and my
 application is now running without problems fore more than 24H! So,
 inserting a delay in this point definitely makes some difference (has
 was with adding some extra printk() in several points of
 serial8250_startup()).
 
 This said, for me, this is definitely a software problem. The question
 is were?

I'm personally convinced it's hardware because according to my analysis
your CPU behaving in a way that the code is not asking it to do so.

Maybe others have some further insight; I certainly don't.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-03-01 Thread Jose Goncalves

Russell King wrote:
 On Thu, Mar 01, 2007 at 01:33:28PM +, Jose Goncalves wrote:
   
 I've also done your suggestion and I've inserted msleep(10); just
 before the And clear the interrupt registers again for luck. and my
 application is now running without problems fore more than 24H! So,
 inserting a delay in this point definitely makes some difference (has
 was with adding some extra printk() in several points of
 serial8250_startup()).

 This said, for me, this is definitely a software problem. The question
 is were?
 

 I'm personally convinced it's hardware because according to my analysis
 your CPU behaving in a way that the code is not asking it to do so.
   

It's not possible that a interrupt is hitting just after enabling
interrupts with serial_outp(up, UART_IER, up-ier); which triggers the
execution of some code that is not reported by the Oops dump (at least
with my current configuration) ?

José Gonçalves

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-23 Thread Michael K. Edwards


Russell, thanks again for offering to look at this; the more oopses
and soft lockups I see on this board, the more I think you're right
and we have an IRQ handling race.

Here's the struct irqchip setup:

/* mask irq, refer ssection 2.6 under chip 8618 document */
static void mv88w8xx8_mask_irq(unsigned int irq)
{
   MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_CLR,(1 << irq));
}

/* unmask irq, refer ssection 2.6 under chip 8618 document */
static void mv88w8xx8_unmask_irq(unsigned int irq)
{
   MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_SET,(1 << irq));
}

/* ack to CPU interrupts and also individual timer interrupts */
static void mv88w8xx8_mask_ack_irq(unsigned int irq)
{
   mv88w8xx8_mask_irq(irq);

   if (irq < IRQ_TIMER1 || irq > IRQ_TIMER4) return;

   /* write 0 to clear interrupt and  re-enable further interrupts */
   MV88W8XX8_REG_WRITE(MV88W8XX8_TIMER_INT_SOURCE, ~(1<<(irq-4)));
}

static struct irqchip mv88w8xx8_chip = {
   .ack= mv88w8xx8_mask_ack_irq,
   .mask   = mv88w8xx8_mask_irq,
   .unmask = mv88w8xx8_unmask_irq,
};

/**
* called by core.c to  initialize the IRQ module
*/
void mv88w8xx8_init_irq(void)
{
   int irq;

   for (irq = 0; irq < NR_IRQS; irq++) {
   set_irq_chip(irq, _chip);
   set_irq_handler(irq, do_level_IRQ);
   set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
   }
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-23 Thread Michael K. Edwards


Russell, thanks again for offering to look at this; the more oopses
and soft lockups I see on this board, the more I think you're right
and we have an IRQ handling race.

Here's the struct irqchip setup:

/* mask irq, refer ssection 2.6 under chip 8618 document */
static void mv88w8xx8_mask_irq(unsigned int irq)
{
   MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_CLR,(1  irq));
}

/* unmask irq, refer ssection 2.6 under chip 8618 document */
static void mv88w8xx8_unmask_irq(unsigned int irq)
{
   MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_SET,(1  irq));
}

/* ack to CPU interrupts and also individual timer interrupts */
static void mv88w8xx8_mask_ack_irq(unsigned int irq)
{
   mv88w8xx8_mask_irq(irq);

   if (irq  IRQ_TIMER1 || irq  IRQ_TIMER4) return;

   /* write 0 to clear interrupt and  re-enable further interrupts */
   MV88W8XX8_REG_WRITE(MV88W8XX8_TIMER_INT_SOURCE, ~(1(irq-4)));
}

static struct irqchip mv88w8xx8_chip = {
   .ack= mv88w8xx8_mask_ack_irq,
   .mask   = mv88w8xx8_mask_irq,
   .unmask = mv88w8xx8_unmask_irq,
};

/**
* called by core.c to  initialize the IRQ module
*/
void mv88w8xx8_init_irq(void)
{
   int irq;

   for (irq = 0; irq  NR_IRQS; irq++) {
   set_irq_chip(irq, mv88w8xx8_chip);
   set_irq_handler(irq, do_level_IRQ);
   set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
   }
}
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Paul Fulghum


On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:

What I find real hard to understand is why a hardware fault happens
always in the same software instruction! I would expect a hardware fault
to hit randomly...


I've experienced just such a hardware fault.

The Infineon DSCC4 serial controller has a hardware bug
in the PCI request/grant handling that can lead to the
device driving the PCI bus in conflict with another device.

While the results were random (as the oops in this problem
seem to be), the trigger was always activating certain
devices in combination.

In your case, altering the timing/behavior of the serial
device during open may be provoking the hardware fault.

--
Paul Fulghum
Microgate Systems, Ltd.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread jose . goncalves

Quoting Russell King <[EMAIL PROTECTED]>:

On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:

Russell King wrote:
> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>
>> Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
>> to us, at least on an ARM target ...
>>
>
> That's ruled out.  Please think about it for a moment - serial_in()
> managed to work correctly most of the time, and then spontaneously
> changes its well-defined ABI behaviour in a way that analysis of the
> asm doesn't allow it to.
>

I'm using gcc 3.4.6.
But I agree with Russell, if it was such a problem it would hit on the
first iteration of my application and not after 1 day of executing the
same piece of code...

One thing you might think about is running memtest86 on the machine
for the same kind of time interval, just in case it's something trivial
like bad ram.

OK. That's another thing to do.

Meanwhile I've switched to another SBC and I'm now running my 
application on the new unit. Lets wait and see...

José Gonçalves

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread jose . goncalves


Quoting Russell King <[EMAIL PROTECTED]>:


On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:

It could be a silly question (tamper with me as I'm not familiar with
such low level programming), but couldn't it be possible for a interrupt
to hit in the middle of the serial_in() calls and mess with %ebx?


I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
have random crashes verywhere - userspace, kernel space in unpredicatable
ways.


What I find real hard to understand is why a hardware fault happens
always in the same software instruction! I would expect a hardware fault
to hit randomly...


Well, compared with your previous report, your latest report is different.
Your first report had  both EIP and %ebx being zero (because they got
corrupted when returning from serial_in).  This time only %ebx was
corrupted.

Consequently, this time we oopsed in the subsequent serial_in() rather
than trying to return to serial8250_startup() as last time.


But there was also another difference. I CONFIGed the kernel to produce 
more debug info. This should influence the Oops report...





I left my application running this night, with a 2.6.16.41 kernel
unpatched  on the serial driver (my last Oops report was with Frederik
patch to remove the insertion made in 2.6.12) and it crashed again on
exactly the same point!



From that I take it that you removed the test in serial8250_startup which

sets UART_BUG_TXEN, and the problem persisted.  That tends to suggest
that it's not the culpret.


From that I mean that with or without this code - 
http://lkml.org/lkml/2007/2/19/124 - the problem persisted. The 
difference is that, without it, the crashes happens more sparsly.


José Gonçalves


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Russell King

On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
> It could be a silly question (tamper with me as I'm not familiar with
> such low level programming), but couldn't it be possible for a interrupt
> to hit in the middle of the serial_in() calls and mess with %ebx?

I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
have random crashes verywhere - userspace, kernel space in unpredicatable
ways.

> What I find real hard to understand is why a hardware fault happens
> always in the same software instruction! I would expect a hardware fault
> to hit randomly...

Well, compared with your previous report, your latest report is different.
Your first report had  both EIP and %ebx being zero (because they got
corrupted when returning from serial_in).  This time only %ebx was
corrupted.

Consequently, this time we oopsed in the subsequent serial_in() rather
than trying to return to serial8250_startup() as last time.

> I left my application running this night, with a 2.6.16.41 kernel
> unpatched  on the serial driver (my last Oops report was with Frederik
> patch to remove the insertion made in 2.6.12) and it crashed again on
> exactly the same point!

>From that I take it that you removed the test in serial8250_startup which
sets UART_BUG_TXEN, and the problem persisted.  That tends to suggest
that it's not the culpret.

> > For all we know, it could be a one-off fault on the hardware you
> > happen to have - other identical units may not behave the same (can
> > you check?)
> 
> Yes I have other units that I can test it. I'll do that to see if it's
> really a one-off fault on the hardware.

Would be nice to know.

> If it continues to crash with other units I will then test with the
> msleep(10) before the "And clear the interrupt registers again for
> luck.", as you suggested earlier.
> 
> > If it is a one off case, you are welcome to patch that test out in
> > your kernel build to remove the problem, and if it's an isolated case
> > I encourage you to do this.  This is one of the great advantages of
> > open source - if you hit such a problem rather than throwing the
> > hardware away you can work around such issues.
> 
> I didn't understand what you mean by "you are welcome to patch that test
> out in your kernel build to remove the problem". Which test are you
> talking about?

The one which sets UART_BUG_TXEN.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Russell King

On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:
> Russell King wrote:
> > On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
> >   
> >> Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
> >> to us, at least on an ARM target ...
> >> 
> >
> > That's ruled out.  Please think about it for a moment - serial_in()
> > managed to work correctly most of the time, and then spontaneously
> > changes its well-defined ABI behaviour in a way that analysis of the
> > asm doesn't allow it to.
> >   
> 
> I'm using gcc 3.4.6.
> But I agree with Russell, if it was such a problem it would hit on the
> first iteration of my application and not after 1 day of executing the
> same piece of code...

One thing you might think about is running memtest86 on the machine
for the same kind of time interval, just in case it's something trivial
like bad ram.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Jose Goncalves

Russell King wrote:
> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>   
>> Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
>> to us, at least on an ARM target ...
>> 
>
> That's ruled out.  Please think about it for a moment - serial_in()
> managed to work correctly most of the time, and then spontaneously
> changes its well-defined ABI behaviour in a way that analysis of the
> asm doesn't allow it to.
>   

I'm using gcc 3.4.6.
But I agree with Russell, if it was such a problem it would hit on the
first iteration of my application and not after 1 day of executing the
same piece of code...

Regards,
José Gonçalves

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Jose Goncalves

Russell King wrote:
> On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
>   
>> <1>[18840.304048] Unable to handle kernel NULL pointer dereference at 
>> virtual address 0012
>> <1>[18840.313046]  printing eip:
>> <4>[18840.321687] c01bfa7a
>> <1>[18840.321714] *pde = 
>> <0>[18840.331287] Oops:  [#1]
>> <4>[18840.340687] Modules linked in:
>> <0>[18840.349749] CPU:0
>> <4>[18840.349767] EIP:0060:[]Not tainted VLI
>> <4>[18840.349782] EFLAGS: 00010202   (2.6.16.41-mtm5-debug1 #1) 
>> <0>[18840.377277] EIP is at serial_in+0xa/0x4a
>> <0>[18840.387221] eax: 0060   ebx:    ecx:    edx: 
>> 
>> <0>[18840.397805] esi:    edi: 0040   ebp: c728fe1c   esp: 
>> c728fe18
>> <0>[18840.408579] ds: 007b   es: 007b   ss: 0068
>> <0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 
>> task=c7443a90)
>> <0>[18840.420509] Stack: <0>  c01c0f88   
>> c031fef0 0005 0202 
>> <0>[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d 
>> c031fef0 c124b510 c124b510 
>> <0>[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 
>>  ffed c773dbcc 
>> 
>
> Okay, this one is even more plainly "not a coding error".
>
>   
>> <0>[18840.566645]  [] serial8250_startup+0x28f/0x2a9
>> 
>
> The code around this point (with the return point marked) is:
>
>   
>> c01c0f78:6a 05   push   $0x5
>> c01c0f7a:53  push   %ebx
>> c01c0f7b:e8 f0 ea ff ff  call   c01bfa70 
>> c01c0f80:6a 00   push   $0x0
>> c01c0f82:53  push   %ebx
>> c01c0f83:e8 e8 ea ff ff  call   c01bfa70 
>> c01c0f88<<<  6a 02   push   $0x2
>> c01c0f8a:53  push   %ebx
>> c01c0f8b:e8 e0 ea ff ff  call   c01bfa70 
>> 
>
> and corresponds with this C code:
>
> (void) serial_inp(up, UART_LSR);
> (void) serial_inp(up, UART_RX);
> (void) serial_inp(up, UART_IIR);
>
> Now let's look at the words pushed on the stack around this code:
>
>   
>   
>   c01c0f88 <- return address for serial_in (serial8250_startup+0x28f/0x2a9)
>    <- from push %ebx at c01c0f82
>    <- from push $0x0 at c01c0f80
>   c031fef0 <- from push %ebx at c01c0f7a
>   0005 <- from push %0x5 at c01c0f78
>
> Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> First thing to notice is this violates the C code - "up" can not
> change.
>
> Now let's look at serial_in:
>
> c01bfa70:   55  push   %ebp
> c01bfa71:   89 e5   mov%esp,%ebp
> c01bfa73:   53  push   %ebx
> ...
> c01bfab7:   5b  pop%ebx
> c01bfab8:   5d  pop%ebp
> c01bfab9:   c3  ret
>
> This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
> _wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
> told it to do.
>
> Moreover, serial_in() has preserved %ebx in the past otherwise we'd
> never got past all the other serial_in()s in serial8250_startup().
>
> So I think it's very demonstrably a hardware fault, and not software
> related.
>   

It could be a silly question (tamper with me as I'm not familiar with
such low level programming), but couldn't it be possible for a interrupt
to hit in the middle of the serial_in() calls and mess with %ebx?

What I find real hard to understand is why a hardware fault happens
always in the same software instruction! I would expect a hardware fault
to hit randomly...

I left my application running this night, with a 2.6.16.41 kernel
unpatched  on the serial driver (my last Oops report was with Frederik
patch to remove the insertion made in 2.6.12) and it crashed again on
exactly the same point!

> For all we know, it could be a one-off fault on the hardware you
> happen to have - other identical units may not behave the same (can
> you check?)
>   

Yes I have other units that I can test it. I'll do that to see if it's
really a one-off fault on the hardware.
If it continues to crash with other units I will then test with the
msleep(10) before the "And clear the interrupt registers again for
luck.", as you suggested earlier.

> If it is a one off case, you are welcome to patch that test out in
> your kernel build to remove the problem, and if it's an isolated case
> I encourage you to do this.  This is one of the great advantages of
> open source - if you hit such a problem rather than throwing the
> hardware away you can work around such issues.
>   

I didn't understand what you mean by "you are welcome to patch that test
out in your kernel build to remove the problem". Which test are you
talking about?

Regards,
José Gonçalves

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to

Re: Serial related oops

2007-02-22 Thread Russell King

On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
> Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
> to us, at least on an ARM target ...

That's ruled out.  Please think about it for a moment - serial_in()
managed to work correctly most of the time, and then spontaneously
changes its well-defined ABI behaviour in a way that analysis of the
asm doesn't allow it to.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Russell King

On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
> Russell King wrote:
> 
> >
> >Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> >First thing to notice is this violates the C code - "up" can not
> >change.
> >
> >Now let's look at serial_in:
> >
> >c01bfa70:   55  push   %ebp
> >c01bfa71:   89 e5   mov%esp,%ebp
> >c01bfa73:   53  push   %ebx
> >...
> >c01bfab7:   5b  pop%ebx
> >c01bfab8:   5d  pop%ebp
> >c01bfab9:   c3  ret
> >
> >This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
> >_wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
> >told it to do.
> >
> 
> ... assuming nothing else clobbered the stack slot (which would be a 
> compiler error, or a wild pointer.)
> 
> Got a disassembly of the whole function?

See Jose's subsequent message to the one I replied to.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Russell King

On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
 Russell King wrote:
 
 
 Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
 First thing to notice is this violates the C code - up can not
 change.
 
 Now let's look at serial_in:
 
 c01bfa70:   55  push   %ebp
 c01bfa71:   89 e5   mov%esp,%ebp
 c01bfa73:   53  push   %ebx
 ...
 c01bfab7:   5b  pop%ebx
 c01bfab8:   5d  pop%ebp
 c01bfab9:   c3  ret
 
 This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
 _wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
 told it to do.
 
 
 ... assuming nothing else clobbered the stack slot (which would be a 
 compiler error, or a wild pointer.)
 
 Got a disassembly of the whole function?

See Jose's subsequent message to the one I replied to.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Russell King

On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
 Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
 to us, at least on an ARM target ...

That's ruled out.  Please think about it for a moment - serial_in()
managed to work correctly most of the time, and then spontaneously
changes its well-defined ABI behaviour in a way that analysis of the
asm doesn't allow it to.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Jose Goncalves

Russell King wrote:
 On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
   
 1[18840.304048] Unable to handle kernel NULL pointer dereference at 
 virtual address 0012
 1[18840.313046]  printing eip:
 4[18840.321687] c01bfa7a
 1[18840.321714] *pde = 
 0[18840.331287] Oops:  [#1]
 4[18840.340687] Modules linked in:
 0[18840.349749] CPU:0
 4[18840.349767] EIP:0060:[c01bfa7a]Not tainted VLI
 4[18840.349782] EFLAGS: 00010202   (2.6.16.41-mtm5-debug1 #1) 
 0[18840.377277] EIP is at serial_in+0xa/0x4a
 0[18840.387221] eax: 0060   ebx:    ecx:    edx: 
 
 0[18840.397805] esi:    edi: 0040   ebp: c728fe1c   esp: 
 c728fe18
 0[18840.408579] ds: 007b   es: 007b   ss: 0068
 0[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 
 task=c7443a90)
 0[18840.420509] Stack: 0  c01c0f88   
 c031fef0 0005 0202 
 0[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d 
 c031fef0 c124b510 c124b510 
 0[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 
  ffed c773dbcc 
 

 Okay, this one is even more plainly not a coding error.

   
 0[18840.566645]  [c01c0f88] serial8250_startup+0x28f/0x2a9
 

 The code around this point (with the return point marked) is:

   
 c01c0f78:6a 05   push   $0x5
 c01c0f7a:53  push   %ebx
 c01c0f7b:e8 f0 ea ff ff  call   c01bfa70 serial_in
 c01c0f80:6a 00   push   $0x0
 c01c0f82:53  push   %ebx
 c01c0f83:e8 e8 ea ff ff  call   c01bfa70 serial_in
 c01c0f88  6a 02   push   $0x2
 c01c0f8a:53  push   %ebx
 c01c0f8b:e8 e0 ea ff ff  call   c01bfa70 serial_in
 

 and corresponds with this C code:

 (void) serial_inp(up, UART_LSR);
 (void) serial_inp(up, UART_RX);
 (void) serial_inp(up, UART_IIR);

 Now let's look at the words pushed on the stack around this code:

   
   
   c01c0f88 - return address for serial_in (serial8250_startup+0x28f/0x2a9)
    - from push %ebx at c01c0f82
    - from push $0x0 at c01c0f80
   c031fef0 - from push %ebx at c01c0f7a
   0005 - from push %0x5 at c01c0f78

 Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
 First thing to notice is this violates the C code - up can not
 change.

 Now let's look at serial_in:

 c01bfa70:   55  push   %ebp
 c01bfa71:   89 e5   mov%esp,%ebp
 c01bfa73:   53  push   %ebx
 ...
 c01bfab7:   5b  pop%ebx
 c01bfab8:   5d  pop%ebp
 c01bfab9:   c3  ret

 This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
 _wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
 told it to do.

 Moreover, serial_in() has preserved %ebx in the past otherwise we'd
 never got past all the other serial_in()s in serial8250_startup().

 So I think it's very demonstrably a hardware fault, and not software
 related.
   

It could be a silly question (tamper with me as I'm not familiar with
such low level programming), but couldn't it be possible for a interrupt
to hit in the middle of the serial_in() calls and mess with %ebx?

What I find real hard to understand is why a hardware fault happens
always in the same software instruction! I would expect a hardware fault
to hit randomly...

I left my application running this night, with a 2.6.16.41 kernel
unpatched  on the serial driver (my last Oops report was with Frederik
patch to remove the insertion made in 2.6.12) and it crashed again on
exactly the same point!

 For all we know, it could be a one-off fault on the hardware you
 happen to have - other identical units may not behave the same (can
 you check?)
   

Yes I have other units that I can test it. I'll do that to see if it's
really a one-off fault on the hardware.
If it continues to crash with other units I will then test with the
msleep(10) before the And clear the interrupt registers again for
luck., as you suggested earlier.

 If it is a one off case, you are welcome to patch that test out in
 your kernel build to remove the problem, and if it's an isolated case
 I encourage you to do this.  This is one of the great advantages of
 open source - if you hit such a problem rather than throwing the
 hardware away you can work around such issues.
   

I didn't understand what you mean by you are welcome to patch that test
out in your kernel build to remove the problem. Which test are you
talking about?

Regards,
José Gonçalves

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Jose Goncalves

Russell King wrote:
 On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
   
 Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
 to us, at least on an ARM target ...
 

 That's ruled out.  Please think about it for a moment - serial_in()
 managed to work correctly most of the time, and then spontaneously
 changes its well-defined ABI behaviour in a way that analysis of the
 asm doesn't allow it to.
   

I'm using gcc 3.4.6.
But I agree with Russell, if it was such a problem it would hit on the
first iteration of my application and not after 1 day of executing the
same piece of code...

Regards,
José Gonçalves

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Russell King

On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:
 Russell King wrote:
  On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:

  Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
  to us, at least on an ARM target ...
  
 
  That's ruled out.  Please think about it for a moment - serial_in()
  managed to work correctly most of the time, and then spontaneously
  changes its well-defined ABI behaviour in a way that analysis of the
  asm doesn't allow it to.

 
 I'm using gcc 3.4.6.
 But I agree with Russell, if it was such a problem it would hit on the
 first iteration of my application and not after 1 day of executing the
 same piece of code...

One thing you might think about is running memtest86 on the machine
for the same kind of time interval, just in case it's something trivial
like bad ram.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Russell King

On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
 It could be a silly question (tamper with me as I'm not familiar with
 such low level programming), but couldn't it be possible for a interrupt
 to hit in the middle of the serial_in() calls and mess with %ebx?

I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
have random crashes verywhere - userspace, kernel space in unpredicatable
ways.

 What I find real hard to understand is why a hardware fault happens
 always in the same software instruction! I would expect a hardware fault
 to hit randomly...

Well, compared with your previous report, your latest report is different.
Your first report had  both EIP and %ebx being zero (because they got
corrupted when returning from serial_in).  This time only %ebx was
corrupted.

Consequently, this time we oopsed in the subsequent serial_in() rather
than trying to return to serial8250_startup() as last time.

 I left my application running this night, with a 2.6.16.41 kernel
 unpatched  on the serial driver (my last Oops report was with Frederik
 patch to remove the insertion made in 2.6.12) and it crashed again on
 exactly the same point!

From that I take it that you removed the test in serial8250_startup which
sets UART_BUG_TXEN, and the problem persisted.  That tends to suggest
that it's not the culpret.

  For all we know, it could be a one-off fault on the hardware you
  happen to have - other identical units may not behave the same (can
  you check?)
 
 Yes I have other units that I can test it. I'll do that to see if it's
 really a one-off fault on the hardware.

Would be nice to know.

 If it continues to crash with other units I will then test with the
 msleep(10) before the And clear the interrupt registers again for
 luck., as you suggested earlier.
 
  If it is a one off case, you are welcome to patch that test out in
  your kernel build to remove the problem, and if it's an isolated case
  I encourage you to do this.  This is one of the great advantages of
  open source - if you hit such a problem rather than throwing the
  hardware away you can work around such issues.
 
 I didn't understand what you mean by you are welcome to patch that test
 out in your kernel build to remove the problem. Which test are you
 talking about?

The one which sets UART_BUG_TXEN.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread jose . goncalves


Quoting Russell King [EMAIL PROTECTED]:


On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:

It could be a silly question (tamper with me as I'm not familiar with
such low level programming), but couldn't it be possible for a interrupt
to hit in the middle of the serial_in() calls and mess with %ebx?


I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
have random crashes verywhere - userspace, kernel space in unpredicatable
ways.


What I find real hard to understand is why a hardware fault happens
always in the same software instruction! I would expect a hardware fault
to hit randomly...


Well, compared with your previous report, your latest report is different.
Your first report had  both EIP and %ebx being zero (because they got
corrupted when returning from serial_in).  This time only %ebx was
corrupted.

Consequently, this time we oopsed in the subsequent serial_in() rather
than trying to return to serial8250_startup() as last time.


But there was also another difference. I CONFIGed the kernel to produce 
more debug info. This should influence the Oops report...





I left my application running this night, with a 2.6.16.41 kernel
unpatched  on the serial driver (my last Oops report was with Frederik
patch to remove the insertion made in 2.6.12) and it crashed again on
exactly the same point!



From that I take it that you removed the test in serial8250_startup which

sets UART_BUG_TXEN, and the problem persisted.  That tends to suggest
that it's not the culpret.


From that I mean that with or without this code - 
http://lkml.org/lkml/2007/2/19/124 - the problem persisted. The 
difference is that, without it, the crashes happens more sparsly.


José Gonçalves


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread jose . goncalves


Quoting Russell King [EMAIL PROTECTED]:


On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:

Russell King wrote:
 On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:

 Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
 to us, at least on an ARM target ...


 That's ruled out.  Please think about it for a moment - serial_in()
 managed to work correctly most of the time, and then spontaneously
 changes its well-defined ABI behaviour in a way that analysis of the
 asm doesn't allow it to.


I'm using gcc 3.4.6.
But I agree with Russell, if it was such a problem it would hit on the
first iteration of my application and not after 1 day of executing the
same piece of code...


One thing you might think about is running memtest86 on the machine
for the same kind of time interval, just in case it's something trivial
like bad ram.



OK. That's another thing to do.

Meanwhile I've switched to another SBC and I'm now running my 
application on the new unit. Lets wait and see...


José Gonçalves


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-22 Thread Paul Fulghum


On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:

What I find real hard to understand is why a hardware fault happens
always in the same software instruction! I would expect a hardware fault
to hit randomly...


I've experienced just such a hardware fault.

The Infineon DSCC4 serial controller has a hardware bug
in the PCI request/grant handling that can lead to the
device driving the PCI bus in conflict with another device.

While the results were random (as the oops in this problem
seem to be), the trigger was always activating certain
devices in combination.

In your case, altering the timing/behavior of the serial
device during open may be provoking the hardware fault.

--
Paul Fulghum
Microgate Systems, Ltd.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Frederik Deweerdt

On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
> Russell King wrote:
> 
> >Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> >First thing to notice is this violates the C code - "up" can not
> >change.
> >Now let's look at serial_in:
> >c01bfa70:   55  push   %ebp
> >c01bfa71:   89 e5   mov%esp,%ebp
> >c01bfa73:   53  push   %ebx
> >...
> >c01bfab7:   5b  pop%ebx
> >c01bfab8:   5d  pop%ebp
> >c01bfab9:   c3  ret
> >This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
> >_wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
> >told it to do.
> 
> ... assuming nothing else clobbered the stack slot (which would be a compiler 
> error, or a wild pointer.)
> 
> Got a disassembly of the whole function?
> 
Jose posted it higher in the thread:
http://lkml.org/lkml/2007/2/21/139

Regards,
Frederik
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread H. Peter Anvin


Russell King wrote:



Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
First thing to notice is this violates the C code - "up" can not
change.

Now let's look at serial_in:

c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
...
c01bfab7:   5b  pop%ebx
c01bfab8:   5d  pop%ebp
c01bfab9:   c3  ret

This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
_wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
told it to do.



... assuming nothing else clobbered the stack slot (which would be a 
compiler error, or a wild pointer.)


Got a disassembly of the whole function?

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Michael K. Edwards


Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
to us, at least on an ARM target ...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Russell King

On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
> <1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual 
> address 0012
> <1>[18840.313046]  printing eip:
> <4>[18840.321687] c01bfa7a
> <1>[18840.321714] *pde = 
> <0>[18840.331287] Oops:  [#1]
> <4>[18840.340687] Modules linked in:
> <0>[18840.349749] CPU:0
> <4>[18840.349767] EIP:0060:[]Not tainted VLI
> <4>[18840.349782] EFLAGS: 00010202   (2.6.16.41-mtm5-debug1 #1) 
> <0>[18840.377277] EIP is at serial_in+0xa/0x4a
> <0>[18840.387221] eax: 0060   ebx:    ecx:    edx: 
> 
> <0>[18840.397805] esi:    edi: 0040   ebp: c728fe1c   esp: 
> c728fe18
> <0>[18840.408579] ds: 007b   es: 007b   ss: 0068
> <0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 
> task=c7443a90)
> <0>[18840.420509] Stack: <0>  c01c0f88   
> c031fef0 0005 0202 
> <0>[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d 
> c031fef0 c124b510 c124b510 
> <0>[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 
>  ffed c773dbcc 

Okay, this one is even more plainly "not a coding error".

> <0>[18840.566645]  [] serial8250_startup+0x28f/0x2a9

The code around this point (with the return point marked) is:

> c01c0f78: 6a 05   push   $0x5
> c01c0f7a: 53  push   %ebx
> c01c0f7b: e8 f0 ea ff ff  call   c01bfa70 
> c01c0f80: 6a 00   push   $0x0
> c01c0f82: 53  push   %ebx
> c01c0f83: e8 e8 ea ff ff  call   c01bfa70 
> c01c0f88<<<   6a 02   push   $0x2
> c01c0f8a: 53  push   %ebx
> c01c0f8b: e8 e0 ea ff ff  call   c01bfa70 

and corresponds with this C code:

(void) serial_inp(up, UART_LSR);
(void) serial_inp(up, UART_RX);
(void) serial_inp(up, UART_IIR);

Now let's look at the words pushed on the stack around this code:

  
  
  c01c0f88 <- return address for serial_in (serial8250_startup+0x28f/0x2a9)
   <- from push %ebx at c01c0f82
   <- from push $0x0 at c01c0f80
  c031fef0 <- from push %ebx at c01c0f7a
  0005 <- from push %0x5 at c01c0f78

Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
First thing to notice is this violates the C code - "up" can not
change.

Now let's look at serial_in:

c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
...
c01bfab7:   5b  pop%ebx
c01bfab8:   5d  pop%ebp
c01bfab9:   c3  ret

This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
_wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
told it to do.

Moreover, serial_in() has preserved %ebx in the past otherwise we'd
never got past all the other serial_in()s in serial8250_startup().

So I think it's very demonstrably a hardware fault, and not software
related.

For all we know, it could be a one-off fault on the hardware you
happen to have - other identical units may not behave the same (can
you check?)

If it is a one off case, you are welcome to patch that test out in
your kernel build to remove the problem, and if it's an isolated case
I encourage you to do this.  This is one of the great advantages of
open source - if you hit such a problem rather than throwing the
hardware away you can work around such issues.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Frederik Deweerdt

On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
> New devolpments.
> I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
> removed the changed made in http://lkml.org/lkml/2005/6/23/266 and
> activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
> CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
> CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
> (thanks to vda for pointing me to the right doc.).
> At first it seemed to work fine, but after some days of continuous
> running I've got another kernel Oops!
> I attach the dmesg output and the assembly dump of serial8250_startup()
> and serial8250_shutdown().
> 
As suspected by Russell, the badness seems to happen just at the
end of the serial_inp on LSR, drivers/serial/8250.c:1650.
The NULL deref happens at the beginning of the serial_inp(up, UART_RX)
call, when trying to dereference *up.

c01bfa70 :
c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
c01bfa74:   8b 5d 08mov0x8(%ebp),%ebx << %ebx = up 
(which is NULL)
c01bfa77:   8b 55 0cmov0xc(%ebp),%edx
c01bfa7a:   0f b6 4b 12 movzbl 0x12(%ebx),%ecx << %ecx = 
*(%ebx+12) Oops
c01bfa7e:   0f b6 43 13 movzbl 0x13(%ebx),%eax

It seems that somehow, the pop %ebx at the end of
the serial_inp(up, UART_LSR) function poped a NULL value instead of the
expected pointer. Any suggestion on how this could happen?
Jose, did you try to msleep(10) before the "And clear the interrupt
registers again for luck." as suggested by Russell?

You should also revert the change I suggested, it seems I missed the
target by a few lines of code :).

Regards,
Frederik

diff --git a/drivers/serial/8250.c b/drivers/serial/8250.c
index 7aca22c..385cc51 100644
--- a/drivers/serial/8250.c
+++ b/drivers/serial/8250.c
@@ -1643,6 +1643,7 @@ static int serial8250_startup(struct uart_port *port)
(void) inb_p(icp);
}
 
+   msleep(10);
/*
 * And clear the interrupt registers again for luck.
 */



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Jose Goncalves

Jose Goncalves wrote:
> New devolpments.
> I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
> removed the changed made in http://lkml.org/lkml/2005/6/23/266 and
> activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
> CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
> CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
> (thanks to vda for pointing me to the right doc.).
> At first it seemed to work fine, but after some days of continuous
> running I've got another kernel Oops!
> I attach the dmesg output and the assembly dump of serial8250_startup()
> and serial8250_shutdown().
>   

And also the assembly dump of serial_in() were the NULL pointer
dereference happens.

José Gonçalves


vmlinux-2.6.16.41-mtm5-debug1: file format elf32-i386

Disassembly of section .text:

c01bfa70 :
c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
c01bfa74:   8b 5d 08mov0x8(%ebp),%ebx
c01bfa77:   8b 55 0cmov0xc(%ebp),%edx
c01bfa7a:   0f b6 4b 12 movzbl 0x12(%ebx),%ecx
c01bfa7e:   0f b6 43 13 movzbl 0x13(%ebx),%eax
c01bfa82:   d3 e2   shl%cl,%edx
c01bfa84:   83 f8 02cmp$0x2,%eax
c01bfa87:   74 1a   je c01bfaa3 
c01bfa89:   7f 05   jg c01bfa90 
c01bfa8b:   48  dec%eax
c01bfa8c:   74 09   je c01bfa97 
c01bfa8e:   eb 21   jmpc01bfab1 
c01bfa90:   83 f8 03cmp$0x3,%eax
c01bfa93:   74 15   je c01bfaaa 
c01bfa95:   eb 1a   jmpc01bfab1 
c01bfa97:   8a 43 78mov0x78(%ebx),%al
c01bfa9a:   01 d0   add%edx,%eax
c01bfa9c:   8b 13   mov(%ebx),%edx
c01bfa9e:   48  dec%eax
c01bfa9f:   ee  out%al,(%dx)
c01bfaa0:   42  inc%edx
c01bfaa1:   eb 10   jmpc01bfab3 
c01bfaa3:   03 53 04add0x4(%ebx),%edx
c01bfaa6:   8a 02   mov(%edx),%al
c01bfaa8:   eb 0a   jmpc01bfab4 
c01bfaaa:   03 53 04add0x4(%ebx),%edx
c01bfaad:   8b 02   mov(%edx),%eax
c01bfaaf:   eb 06   jmpc01bfab7 
c01bfab1:   03 13   add(%ebx),%edx
c01bfab3:   ec  in (%dx),%al
c01bfab4:   0f b6 c0movzbl %al,%eax
c01bfab7:   5b  pop%ebx
c01bfab8:   5d  pop%ebp
c01bfab9:   c3  ret
Disassembly of section .init.text:
Disassembly of section .altinstr_replacement:
Disassembly of section .exit.text:

Re: Serial related oops

2007-02-21 Thread Jose Goncalves

New devolpments.
I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
removed the changed made in http://lkml.org/lkml/2005/6/23/266 and
activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
(thanks to vda for pointing me to the right doc.).
At first it seemed to work fine, but after some days of continuous
running I've got another kernel Oops!
I attach the dmesg output and the assembly dump of serial8250_startup()
and serial8250_shutdown().

Regards,
José Gonçalves

<1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 0012
<1>[18840.313046]  printing eip:
<4>[18840.321687] c01bfa7a
<1>[18840.321714] *pde = 
<0>[18840.331287] Oops:  [#1]
<4>[18840.340687] Modules linked in:
<0>[18840.349749] CPU:0
<4>[18840.349767] EIP:0060:[]Not tainted VLI
<4>[18840.349782] EFLAGS: 00010202   (2.6.16.41-mtm5-debug1 #1) 
<0>[18840.377277] EIP is at serial_in+0xa/0x4a
<0>[18840.387221] eax: 0060   ebx:    ecx:    edx: 
<0>[18840.397805] esi:    edi: 0040   ebp: c728fe1c   esp: c728fe18
<0>[18840.408579] ds: 007b   es: 007b   ss: 0068
<0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
<0>[18840.420509] Stack: <0>  c01c0f88   c031fef0 0005 0202 
<0>[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510 
<0>[18840.460540] c773dbcc c728fe7c c01befe7 c124b510  ffed c773dbcc 
<0>[18840.475892] Call Trace:
<0>[18840.490039]  [] show_stack_log_lvl+0xa5/0xad
<0>[18840.504944]  [] show_registers+0x106/0x16f
<0>[18840.520104]  [] die+0xb6/0x127
<0>[18840.535497]  [] do_page_fault+0x380/0x4b3
<0>[18840.550828]  [] error_code+0x4f/0x60
<0>[18840.566645]  [] serial8250_startup+0x28f/0x2a9
<0>[18840.582471] Code: 38 43 78 75 02 b2 01 89 d0 eb 10 8b 41 70 39 43 70 0f 94 c0 0f b6 c0 eb 02 31 c0 5b 5d c3 90 90 90 55 89 e5 53 8b 5d 08 8b 55 0c <0f> b6 4b 12 0f b6 43 13 d3 e2 83 f8 02 74 1a 7f 05 48 74 09 eb 
<4>[18840.680471]  BUG: gp_position/11629, lock held at task exit time!
<4>[18840.702808]  [c124b528] {uart_register_driver}
<4>[18840.722346] .. held by:   gp_position:11629 [c7443a90, 117]
<4>[18840.742113] ... acquired at:   uart_get+0x28/0xde

vmlinux-2.6.16.41-mtm5-debug1: file format elf32-i386

Disassembly of section .text:

c01c0cf9 :
c01c0cf9:   55  push   %ebp
c01c0cfa:   89 e5   mov%esp,%ebp
c01c0cfc:   57  push   %edi
c01c0cfd:   56  push   %esi
c01c0cfe:   53  push   %ebx
c01c0cff:   53  push   %ebx
c01c0d00:   8b 5d 08mov0x8(%ebp),%ebx
c01c0d03:   8b 43 60mov0x60(%ebx),%eax
c01c0d06:   c6 83 a7 00 00 00 00movb   $0x0,0xa7(%ebx)
c01c0d0d:   89 c2   mov%eax,%edx
c01c0d0f:   c1 e2 04shl$0x4,%edx
c01c0d12:   83 f8 0acmp$0xa,%eax
c01c0d15:   8b 92 ac 37 25 c0   mov0xc02537ac(%edx),%edx
c01c0d1b:   66 89 93 9c 00 00 00mov%dx,0x9c(%ebx)
c01c0d22:   75 66   jnec01c0d8a 

c01c0d24:   c6 83 a4 00 00 00 00movb   $0x0,0xa4(%ebx)
c01c0d2b:   68 bf 00 00 00  push   $0xbf
c01c0d30:   6a 03   push   $0x3
c01c0d32:   53  push   %ebx
c01c0d33:   e8 82 ed ff ff  call   c01bfaba 
c01c0d38:   6a 10   push   $0x10
c01c0d3a:   6a 02   push   $0x2
c01c0d3c:   53  push   %ebx
c01c0d3d:   e8 78 ed ff ff  call   c01bfaba 
c01c0d42:   6a 00   push   $0x0
c01c0d44:   6a 01   push   $0x1
c01c0d46:   53  push   %ebx
c01c0d47:   e8 6e ed ff ff  call   c01bfaba 
c01c0d4c:   83 c4 24add$0x24,%esp
c01c0d4f:   6a 00   push   $0x0
c01c0d51:   6a 03   push   $0x3
c01c0d53:   53  push   %ebx
c01c0d54:   e8 61 ed ff ff  call   c01bfaba 
c01c0d59:   6a 00   push   $0x0
c01c0d5b:   6a 0c   push   $0xc
c01c0d5d:   53  push   %ebx
c01c0d5e:   e8 a7 ed ff ff  call   c01bfb0a 
c01c0d63:   68 bf 00 00 00  push   $0xbf
c01c0d68:   6a 03   push   $0x3
c01c0d6a:   53  push   %ebx
c01c0d6b:   e8 4a ed ff ff  call   c01bfaba 
c01c0d70:   83 c4 24add$0x24,%esp
c01c0d73:   6a 10   push   $0x10
c01c0d75:   6a 02   push

Re: Serial related oops

2007-02-21 Thread Jose Goncalves

New devolpments.
I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
removed the changed made in http://lkml.org/lkml/2005/6/23/266 and
activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
(thanks to vda for pointing me to the right doc.).
At first it seemed to work fine, but after some days of continuous
running I've got another kernel Oops!
I attach the dmesg output and the assembly dump of serial8250_startup()
and serial8250_shutdown().

Regards,
José Gonçalves

1[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 0012
1[18840.313046]  printing eip:
4[18840.321687] c01bfa7a
1[18840.321714] *pde = 
0[18840.331287] Oops:  [#1]
4[18840.340687] Modules linked in:
0[18840.349749] CPU:0
4[18840.349767] EIP:0060:[c01bfa7a]Not tainted VLI
4[18840.349782] EFLAGS: 00010202   (2.6.16.41-mtm5-debug1 #1) 
0[18840.377277] EIP is at serial_in+0xa/0x4a
0[18840.387221] eax: 0060   ebx:    ecx:    edx: 
0[18840.397805] esi:    edi: 0040   ebp: c728fe1c   esp: c728fe18
0[18840.408579] ds: 007b   es: 007b   ss: 0068
0[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
0[18840.420509] Stack: 0  c01c0f88   c031fef0 0005 0202 
0[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510 
0[18840.460540] c773dbcc c728fe7c c01befe7 c124b510  ffed c773dbcc 
0[18840.475892] Call Trace:
0[18840.490039]  [c0102a35] show_stack_log_lvl+0xa5/0xad
0[18840.504944]  [c0102b70] show_registers+0x106/0x16f
0[18840.520104]  [c0102d06] die+0xb6/0x127
0[18840.535497]  [c0109677] do_page_fault+0x380/0x4b3
0[18840.550828]  [c01026bf] error_code+0x4f/0x60
0[18840.566645]  [c01c0f88] serial8250_startup+0x28f/0x2a9
0[18840.582471] Code: 38 43 78 75 02 b2 01 89 d0 eb 10 8b 41 70 39 43 70 0f 94 c0 0f b6 c0 eb 02 31 c0 5b 5d c3 90 90 90 55 89 e5 53 8b 5d 08 8b 55 0c 0f b6 4b 12 0f b6 43 13 d3 e2 83 f8 02 74 1a 7f 05 48 74 09 eb 
4[18840.680471]  BUG: gp_position/11629, lock held at task exit time!
4[18840.702808]  [c124b528] {uart_register_driver}
4[18840.722346] .. held by:   gp_position:11629 [c7443a90, 117]
4[18840.742113] ... acquired at:   uart_get+0x28/0xde

vmlinux-2.6.16.41-mtm5-debug1: file format elf32-i386

Disassembly of section .text:

c01c0cf9 serial8250_startup:
c01c0cf9:   55  push   %ebp
c01c0cfa:   89 e5   mov%esp,%ebp
c01c0cfc:   57  push   %edi
c01c0cfd:   56  push   %esi
c01c0cfe:   53  push   %ebx
c01c0cff:   53  push   %ebx
c01c0d00:   8b 5d 08mov0x8(%ebp),%ebx
c01c0d03:   8b 43 60mov0x60(%ebx),%eax
c01c0d06:   c6 83 a7 00 00 00 00movb   $0x0,0xa7(%ebx)
c01c0d0d:   89 c2   mov%eax,%edx
c01c0d0f:   c1 e2 04shl$0x4,%edx
c01c0d12:   83 f8 0acmp$0xa,%eax
c01c0d15:   8b 92 ac 37 25 c0   mov0xc02537ac(%edx),%edx
c01c0d1b:   66 89 93 9c 00 00 00mov%dx,0x9c(%ebx)
c01c0d22:   75 66   jnec01c0d8a 
serial8250_startup+0x91
c01c0d24:   c6 83 a4 00 00 00 00movb   $0x0,0xa4(%ebx)
c01c0d2b:   68 bf 00 00 00  push   $0xbf
c01c0d30:   6a 03   push   $0x3
c01c0d32:   53  push   %ebx
c01c0d33:   e8 82 ed ff ff  call   c01bfaba serial_out
c01c0d38:   6a 10   push   $0x10
c01c0d3a:   6a 02   push   $0x2
c01c0d3c:   53  push   %ebx
c01c0d3d:   e8 78 ed ff ff  call   c01bfaba serial_out
c01c0d42:   6a 00   push   $0x0
c01c0d44:   6a 01   push   $0x1
c01c0d46:   53  push   %ebx
c01c0d47:   e8 6e ed ff ff  call   c01bfaba serial_out
c01c0d4c:   83 c4 24add$0x24,%esp
c01c0d4f:   6a 00   push   $0x0
c01c0d51:   6a 03   push   $0x3
c01c0d53:   53  push   %ebx
c01c0d54:   e8 61 ed ff ff  call   c01bfaba serial_out
c01c0d59:   6a 00   push   $0x0
c01c0d5b:   6a 0c   push   $0xc
c01c0d5d:   53  push   %ebx
c01c0d5e:   e8 a7 ed ff ff  call   c01bfb0a serial_icr_write
c01c0d63:   68 bf 00 00 00  push   $0xbf
c01c0d68:   6a 03   push   $0x3
c01c0d6a:   53  push   %ebx
c01c0d6b:   e8 4a ed ff ff  call   c01bfaba serial_out
c01c0d70:   83 c4 24add$0x24,%esp

Re: Serial related oops

2007-02-21 Thread Jose Goncalves

Jose Goncalves wrote:
 New devolpments.
 I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
 removed the changed made in http://lkml.org/lkml/2005/6/23/266 and
 activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
 CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
 CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
 (thanks to vda for pointing me to the right doc.).
 At first it seemed to work fine, but after some days of continuous
 running I've got another kernel Oops!
 I attach the dmesg output and the assembly dump of serial8250_startup()
 and serial8250_shutdown().
   

And also the assembly dump of serial_in() were the NULL pointer
dereference happens.

José Gonçalves


vmlinux-2.6.16.41-mtm5-debug1: file format elf32-i386

Disassembly of section .text:

c01bfa70 serial_in:
c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
c01bfa74:   8b 5d 08mov0x8(%ebp),%ebx
c01bfa77:   8b 55 0cmov0xc(%ebp),%edx
c01bfa7a:   0f b6 4b 12 movzbl 0x12(%ebx),%ecx
c01bfa7e:   0f b6 43 13 movzbl 0x13(%ebx),%eax
c01bfa82:   d3 e2   shl%cl,%edx
c01bfa84:   83 f8 02cmp$0x2,%eax
c01bfa87:   74 1a   je c01bfaa3 serial_in+0x33
c01bfa89:   7f 05   jg c01bfa90 serial_in+0x20
c01bfa8b:   48  dec%eax
c01bfa8c:   74 09   je c01bfa97 serial_in+0x27
c01bfa8e:   eb 21   jmpc01bfab1 serial_in+0x41
c01bfa90:   83 f8 03cmp$0x3,%eax
c01bfa93:   74 15   je c01bfaaa serial_in+0x3a
c01bfa95:   eb 1a   jmpc01bfab1 serial_in+0x41
c01bfa97:   8a 43 78mov0x78(%ebx),%al
c01bfa9a:   01 d0   add%edx,%eax
c01bfa9c:   8b 13   mov(%ebx),%edx
c01bfa9e:   48  dec%eax
c01bfa9f:   ee  out%al,(%dx)
c01bfaa0:   42  inc%edx
c01bfaa1:   eb 10   jmpc01bfab3 serial_in+0x43
c01bfaa3:   03 53 04add0x4(%ebx),%edx
c01bfaa6:   8a 02   mov(%edx),%al
c01bfaa8:   eb 0a   jmpc01bfab4 serial_in+0x44
c01bfaaa:   03 53 04add0x4(%ebx),%edx
c01bfaad:   8b 02   mov(%edx),%eax
c01bfaaf:   eb 06   jmpc01bfab7 serial_in+0x47
c01bfab1:   03 13   add(%ebx),%edx
c01bfab3:   ec  in (%dx),%al
c01bfab4:   0f b6 c0movzbl %al,%eax
c01bfab7:   5b  pop%ebx
c01bfab8:   5d  pop%ebp
c01bfab9:   c3  ret
Disassembly of section .init.text:
Disassembly of section .altinstr_replacement:
Disassembly of section .exit.text:

Re: Serial related oops

2007-02-21 Thread Frederik Deweerdt

On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
 New devolpments.
 I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
 removed the changed made in http://lkml.org/lkml/2005/6/23/266 and
 activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
 CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
 CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
 (thanks to vda for pointing me to the right doc.).
 At first it seemed to work fine, but after some days of continuous
 running I've got another kernel Oops!
 I attach the dmesg output and the assembly dump of serial8250_startup()
 and serial8250_shutdown().
 
As suspected by Russell, the badness seems to happen just at the
end of the serial_inp on LSR, drivers/serial/8250.c:1650.
The NULL deref happens at the beginning of the serial_inp(up, UART_RX)
call, when trying to dereference *up.

c01bfa70 serial_in:
c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
c01bfa74:   8b 5d 08mov0x8(%ebp),%ebx  %ebx = up 
(which is NULL)
c01bfa77:   8b 55 0cmov0xc(%ebp),%edx
c01bfa7a:   0f b6 4b 12 movzbl 0x12(%ebx),%ecx  %ecx = 
*(%ebx+12) Oops
c01bfa7e:   0f b6 43 13 movzbl 0x13(%ebx),%eax

It seems that somehow, the pop %ebx at the end of
the serial_inp(up, UART_LSR) function poped a NULL value instead of the
expected pointer. Any suggestion on how this could happen?
Jose, did you try to msleep(10) before the And clear the interrupt
registers again for luck. as suggested by Russell?

You should also revert the change I suggested, it seems I missed the
target by a few lines of code :).

Regards,
Frederik

diff --git a/drivers/serial/8250.c b/drivers/serial/8250.c
index 7aca22c..385cc51 100644
--- a/drivers/serial/8250.c
+++ b/drivers/serial/8250.c
@@ -1643,6 +1643,7 @@ static int serial8250_startup(struct uart_port *port)
(void) inb_p(icp);
}
 
+   msleep(10);
/*
 * And clear the interrupt registers again for luck.
 */



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Russell King

On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
 1[18840.304048] Unable to handle kernel NULL pointer dereference at virtual 
 address 0012
 1[18840.313046]  printing eip:
 4[18840.321687] c01bfa7a
 1[18840.321714] *pde = 
 0[18840.331287] Oops:  [#1]
 4[18840.340687] Modules linked in:
 0[18840.349749] CPU:0
 4[18840.349767] EIP:0060:[c01bfa7a]Not tainted VLI
 4[18840.349782] EFLAGS: 00010202   (2.6.16.41-mtm5-debug1 #1) 
 0[18840.377277] EIP is at serial_in+0xa/0x4a
 0[18840.387221] eax: 0060   ebx:    ecx:    edx: 
 
 0[18840.397805] esi:    edi: 0040   ebp: c728fe1c   esp: 
 c728fe18
 0[18840.408579] ds: 007b   es: 007b   ss: 0068
 0[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 
 task=c7443a90)
 0[18840.420509] Stack: 0  c01c0f88   
 c031fef0 0005 0202 
 0[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d 
 c031fef0 c124b510 c124b510 
 0[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 
  ffed c773dbcc 

Okay, this one is even more plainly not a coding error.

 0[18840.566645]  [c01c0f88] serial8250_startup+0x28f/0x2a9

The code around this point (with the return point marked) is:

 c01c0f78: 6a 05   push   $0x5
 c01c0f7a: 53  push   %ebx
 c01c0f7b: e8 f0 ea ff ff  call   c01bfa70 serial_in
 c01c0f80: 6a 00   push   $0x0
 c01c0f82: 53  push   %ebx
 c01c0f83: e8 e8 ea ff ff  call   c01bfa70 serial_in
 c01c0f88   6a 02   push   $0x2
 c01c0f8a: 53  push   %ebx
 c01c0f8b: e8 e0 ea ff ff  call   c01bfa70 serial_in

and corresponds with this C code:

(void) serial_inp(up, UART_LSR);
(void) serial_inp(up, UART_RX);
(void) serial_inp(up, UART_IIR);

Now let's look at the words pushed on the stack around this code:

  
  
  c01c0f88 - return address for serial_in (serial8250_startup+0x28f/0x2a9)
   - from push %ebx at c01c0f82
   - from push $0x0 at c01c0f80
  c031fef0 - from push %ebx at c01c0f7a
  0005 - from push %0x5 at c01c0f78

Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
First thing to notice is this violates the C code - up can not
change.

Now let's look at serial_in:

c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
...
c01bfab7:   5b  pop%ebx
c01bfab8:   5d  pop%ebp
c01bfab9:   c3  ret

This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
_wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
told it to do.

Moreover, serial_in() has preserved %ebx in the past otherwise we'd
never got past all the other serial_in()s in serial8250_startup().

So I think it's very demonstrably a hardware fault, and not software
related.

For all we know, it could be a one-off fault on the hardware you
happen to have - other identical units may not behave the same (can
you check?)

If it is a one off case, you are welcome to patch that test out in
your kernel build to remove the problem, and if it's an isolated case
I encourage you to do this.  This is one of the great advantages of
open source - if you hit such a problem rather than throwing the
hardware away you can work around such issues.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Michael K. Edwards


Are you using an unpatched gcc 4.1.1?  Its optimizer did nasty things
to us, at least on an ARM target ...
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread H. Peter Anvin


Russell King wrote:



Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
First thing to notice is this violates the C code - up can not
change.

Now let's look at serial_in:

c01bfa70:   55  push   %ebp
c01bfa71:   89 e5   mov%esp,%ebp
c01bfa73:   53  push   %ebx
...
c01bfab7:   5b  pop%ebx
c01bfab8:   5d  pop%ebp
c01bfab9:   c3  ret

This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
_wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
told it to do.



... assuming nothing else clobbered the stack slot (which would be a 
compiler error, or a wild pointer.)


Got a disassembly of the whole function?

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-21 Thread Frederik Deweerdt

On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
 Russell King wrote:
 
 Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
 First thing to notice is this violates the C code - up can not
 change.
 Now let's look at serial_in:
 c01bfa70:   55  push   %ebp
 c01bfa71:   89 e5   mov%esp,%ebp
 c01bfa73:   53  push   %ebx
 ...
 c01bfab7:   5b  pop%ebx
 c01bfab8:   5d  pop%ebp
 c01bfab9:   c3  ret
 This code tells the CPU to preserves %ebx and %ebp.  But we know %ebx
 _wasn't_ preserved.  Ergo, your CPU is plainly not doing what the code
 told it to do.
 
 ... assuming nothing else clobbered the stack slot (which would be a compiler 
 error, or a wild pointer.)
 
 Got a disassembly of the whole function?
 
Jose posted it higher in the thread:
http://lkml.org/lkml/2007/2/21/139

Regards,
Frederik
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Robert Hancock


Michael K. Edwards wrote:

Of course not.  But dealing with a stuck IRQ line by locking up isn't
very practical either.  IRQ sharing is stupid yet universal, and it


And we don't, that's why we have that "nobody cared" logic that disables 
the interrupt line if no driver services the interrupt. That doesn't 
provide a clean recovery, of course, it's meant to notify the user of 
what happened so that the problem can be fixed.



happens all the time that a device that has been sitting there minding
its own business since power-up, with no driver to drive it, decides
to assert its IRQ.  Maybe it just got hot-plugged, maybe it just got
its first dribble of input, whatever.  Other devices on the shared IRQ
are screwed (or at least semi-screwed; you could periodically
re-enable the IRQ long enough to make a run through the ISR chain
servicing the other devices).  But if you run "lspci" (or whatever)
and load a driver for the newly awake device, everything goes back to
normal.

For devices compiled into the kernel, you shouldn't have to play these
games.  If, that is, there were three stages of driver initialization,
called in successive passes:


Exactly, for devices compiled into the kernel. In most setups this is 
only a fraction of all devices, so solving this problem only for drivers 
built into the kernel is no solution.




1) installing an ISR with a fallback STFU path (device-specific but
not dependent on any particular pre-existing chip state), quiescing it
if you know how and registering for the IRQ if you know which it is;

2) going through the chip's soft-reset-wake-up-shut-up cycle and
populating driver data structures, possibly correcting the IRQ
registration along the way;

3) ready-as-we'll-ever-be, bring on the interrupts.

You probably can't help enabling the IRQ briefly during 2) so that you
can do tests like Russell's loopback.  But it's a needless gamble to
do that without doing 1) for all compiled-in drivers and platform
devices first, in a previous discovery pass.  And it's stupid to do 3)
in the same pass as 2), because you'll just open race condition
windows that will only bite when an all-the-way-live device raises its
IRQ at a moment when the writer of the wake-up-shut-up code wasn't
expecting it.  All code has bugs and they're only a problem when they
bite in the field.

If a system has a device that generates interrupts before they're 
enabled,

and the firmware doesn't fix it, then some platform-specific quirk has to
handle it and shut off the interrupt before it allows any interrupts
to be enabled. (We have such a quirk for certain network controllers 
where

the boot ROM can leave the chip generating interrupts on bootup.)


You don't need quirks if your driver initialization is bomb-proof to
begin with.  Devices that are quiet on power-up are purely
coincidental and should not be construed.


It's not coincidental, it is the only sane way to design hardware. You 
just can't go firing off interrupts without a driver having 
intentionally enabled them. There are a few devices that have had such 
issues, but they have been few and far between, certainly not enough to 
warrant the complexity of the scheme you propose.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Robert Hancock <[EMAIL PROTECTED]> wrote:

How do you propose to do this? Drivers can get loaded and unloaded at any
time. If you have a device generating spurious interrupts on a shared IRQ
line, there's no way you can use any device on that line until that interrupt
is shut off. Requiring all drivers to be loaded before any of them can use
interrupts is simply not practical.


Of course not.  But dealing with a stuck IRQ line by locking up isn't
very practical either.  IRQ sharing is stupid yet universal, and it
happens all the time that a device that has been sitting there minding
its own business since power-up, with no driver to drive it, decides
to assert its IRQ.  Maybe it just got hot-plugged, maybe it just got
its first dribble of input, whatever.  Other devices on the shared IRQ
are screwed (or at least semi-screwed; you could periodically
re-enable the IRQ long enough to make a run through the ISR chain
servicing the other devices).  But if you run "lspci" (or whatever)
and load a driver for the newly awake device, everything goes back to
normal.

For devices compiled into the kernel, you shouldn't have to play these
games.  If, that is, there were three stages of driver initialization,
called in successive passes:

1) installing an ISR with a fallback STFU path (device-specific but
not dependent on any particular pre-existing chip state), quiescing it
if you know how and registering for the IRQ if you know which it is;

2) going through the chip's soft-reset-wake-up-shut-up cycle and
populating driver data structures, possibly correcting the IRQ
registration along the way;

3) ready-as-we'll-ever-be, bring on the interrupts.

You probably can't help enabling the IRQ briefly during 2) so that you
can do tests like Russell's loopback.  But it's a needless gamble to
do that without doing 1) for all compiled-in drivers and platform
devices first, in a previous discovery pass.  And it's stupid to do 3)
in the same pass as 2), because you'll just open race condition
windows that will only bite when an all-the-way-live device raises its
IRQ at a moment when the writer of the wake-up-shut-up code wasn't
expecting it.  All code has bugs and they're only a problem when they
bite in the field.


If a system has a device that generates interrupts before they're enabled,
and the firmware doesn't fix it, then some platform-specific quirk has to
handle it and shut off the interrupt before it allows any interrupts
to be enabled. (We have such a quirk for certain network controllers where
the boot ROM can leave the chip generating interrupts on bootup.)


You don't need quirks if your driver initialization is bomb-proof to
begin with.  Devices that are quiet on power-up are purely
coincidental and should not be construed.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Robert Hancock


Michael K. Edwards wrote:

Still open, though it's a pity you're more interested in my flawed
understanding that in the possibility that the kernel could be
systematically made more robust against hardware bugs and coding
errors by the simple expedient of putting all the ISRs in before
turning on any IRQ that might be shared.  Or are you telling me that's
already been done?  (Yes, I am aware that this interacts
entertainingly with hot-plug PCI.  Yes, I am aware that there is a
limit to how much software can fix stupid hardware.  But surely there
is room for an emergency IRQ suppressor to let chip initialization
code kick in and force the hardware to a known state.)


How do you propose to do this? Drivers can get loaded and unloaded at any
time. If you have a device generating spurious interrupts on a shared IRQ
line, there's no way you can use any device on that line until that interrupt
is shut off. Requiring all drivers to be loaded before any of them can use
interrupts is simply not practical.

If a system has a device that generates interrupts before they're enabled,
and the firmware doesn't fix it, then some platform-specific quirk has to
handle it and shut off the interrupt before it allows any interrupts
to be enabled. (We have such a quirk for certain network controllers where
the boot ROM can leave the chip generating interrupts on bootup.)

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

This can't happen because when __do_irq unmasks the interrupt source,
the CPU mask is set, thereby preventing any further interrupt exceptions
being taken.  This is done precisely to prevent this situation happening.

If you are seeing recursion for the same interrupt (two or more stack
frames containing asm_do_IRQ for that very same IRQ) then your interrupt
handling is buggy, plain and simple.


Imaginable.  I'll look at the mask/unmask code.  Thanks.


I don't doubt that it is on the same IRQ line - I have such setups here
and it works perfectly - multiple 8250 UARTs connected to a single
level-triggered interrupt input which also happens to be shared with
a SCSI host chip as well.  Absolutely no problems.


Can you do me a favor?  In the sys_open("/dev/console") path, turn on
the right bits in that second uart's IER, then insert a sleep in
request_irq or something (wherever seems best based on that
backtrace), and feed enough characters into the second UART during
that sleep to generate an IRQ.  Do you not get the same soft lockup?


I still say that your understanding is completely flawed.  Moreover,
you haven't read what I've said about the ordering of initialisation,
the stress on when we disable interrupts for the ports, etc.


Well, all I can say is that that's a real backtrace and it shouldn't
be hard to reproduce if it's anything other than a broken interrupt
controller or broken code called by the __do_irq postamble.  I don't
see any platform-provided unmask routines in that backtrace, but maybe
it got inlined; I'll go back and check.


You're actually *not* helping.  You're causing utter confusion through
misunderstanding, but it seems you're not open to the possibility that
your understanding is flawed.


Still open, though it's a pity you're more interested in my flawed
understanding that in the possibility that the kernel could be
systematically made more robust against hardware bugs and coding
errors by the simple expedient of putting all the ISRs in before
turning on any IRQ that might be shared.  Or are you telling me that's
already been done?  (Yes, I am aware that this interacts
entertainingly with hot-plug PCI.  Yes, I am aware that there is a
limit to how much software can fix stupid hardware.  But surely there
is room for an emergency IRQ suppressor to let chip initialization
code kick in and force the hardware to a known state.)


I'm offering to look through your code and point you at the source of
your issue for free.  Please don't throw that offer away without first
considering that maybe I have a clue about what's going on here.


I appreciate that offer, and I hope to take advantage of it as soon as
I have the source code at my fingertips (not just the chat log where I
recorded the backtrace).


... which showed the port being opened well after system initialisation
of devices, including all serial ports - including disabling of their
interrupt source at the IER, has been completed.


Now that you mention it, the backtrace I sent is the
serial8250_startup one, not the serial8250_init one.  Sorry, this
one's probably an artifact of brain damage specific to this UART.  I
need to dig through a different account to find the init-path example;
but in either case, we're getting a new interrupt during the __do_irq
postamble.  If you're telling me that that shouldn't happen, what
should the backtrace for a soft lockup due to a stuck level-triggered
IRQ look like on ARM?


Yes, and it's the same for any serial console with functioning break
support.  You'll find it in Documentation/sysrq.txt, though it does
misleadingly say "PC style standard serial ports only" whereas the
reality is "where possible".


Thank you very much; this will help me get to the bottom of some other
chip-support nastiness on this device.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 04:04:26PM -0800, Michael K. Edwards wrote:
> On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:
> >The second interrupt comes in, and when you go to disable that
> >source, you inadvertently re-enable the UART interrupt, despite it
> >still being serviced.
> 
> Incorrect.  An attempt has been made to service the interrupt using
> the only ISR currently in the chain for that IRQ -- the ISR for the
> first UART.  That attempt was not successful, and when __do_irq
> unmasks the interrupt source preparatory to exiting interrupt context,
> __irq_svc is dispatched anew.

This can't happen because when __do_irq unmasks the interrupt source,
the CPU mask is set, thereby preventing any further interrupt exceptions
being taken.  This is done precisely to prevent this situation happening.

If you are seeing recursion for the same interrupt (two or more stack
frames containing asm_do_IRQ for that very same IRQ) then your interrupt
handling is buggy, plain and simple.

> >Please show your interrupt controller (mask, unmask, mask_ack)
> >handling functions corresponding with the interrupt which your
> >UART is connected to.
> 
> Don't have 'em handy; I'll be happy to post them when I do, perhaps
> later today.  I would hope they're pretty generic, though; it's a
> Feroceon core pretending to be an ARM926EJ-S, hooked to the usual
> half-assed Marvell imitation of an ARM licensed functional block.
> Trust me for the moment, it's the same IRQ line.

I don't doubt that it is on the same IRQ line - I have such setups here
and it works perfectly - multiple 8250 UARTs connected to a single
level-triggered interrupt input which also happens to be shared with
a SCSI host chip as well.  Absolutely no problems.

> If you don't enjoy this sort of forensics (which I
> for one do not, especially not when there is a project deadline
> looming and a Heisenbug starts firing 9 times out of 10), you might
> consider systematically installing ISRs that know how to shut
> everything up before turning on any interrupt sources at all.

I still say that your understanding is completely flawed.  Moreover,
you haven't read what I've said about the ordering of initialisation,
the stress on when we disable interrupts for the ports, etc.

> I'm not asking for anyone's help except in the
> let's-all-help-one-another spirit.

You're actually *not* helping.  You're causing utter confusion through
misunderstanding, but it seems you're not open to the possibility that
your understanding is flawed.

I'm offering to look through your code and point you at the source of
your issue for free.  Please don't throw that offer away without first
considering that maybe I have a clue about what's going on here.

> Now please take a second look at the backtrace before toasting me
> lightly again.

... which showed the port being opened well after system initialisation
of devices, including all serial ports - including disabling of their
interrupt source at the IER, has been completed.

> Mmm'kay?  Oh, and by the way -- is there an Alt-SysRq
> equivalent on an ARM serial console?

Yes, and it's the same for any serial console with functioning break
support.  You'll find it in Documentation/sysrq.txt, though it does
misleadingly say "PC style standard serial ports only" whereas the
reality is "where possible".

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

I think something else is going on here.  I think you're getting
an interrupt for the UART, and another interrupt is also pending.


Correct.  An interrupt for the other UART on the same IRQ.


When the UART interrupt is handled, it is masked at the interrupt
controller, and the CPU mask is dropped.


Correct.


The second interrupt comes in, and when you go to disable that
source, you inadvertently re-enable the UART interrupt, despite it
still being serviced.


Incorrect.  An attempt has been made to service the interrupt using
the only ISR currently in the chain for that IRQ -- the ISR for the
first UART.  That attempt was not successful, and when __do_irq
unmasks the interrupt source preparatory to exiting interrupt context,
__irq_svc is dispatched anew.


This leads to the UART interrupt again triggering an IRQ.


Right.  The _second_ UART's interrupt.  There's another problem with
these UARTs having to do with the implementor's inability to read and
follow a bog-standard twenty-year-old spec without asking software to
fix up corner cases, but that's another backtrace for another day.


Please show your interrupt controller (mask, unmask, mask_ack)
handling functions corresponding with the interrupt which your
UART is connected to.


Don't have 'em handy; I'll be happy to post them when I do, perhaps
later today.  I would hope they're pretty generic, though; it's a
Feroceon core pretending to be an ARM926EJ-S, hooked to the usual
half-assed Marvell imitation of an ARM licensed functional block.
Trust me for the moment, it's the same IRQ line.


This shows that you don't actually have an understanding of the Linux
kernel boot, especially in respect of serial devices.  At boot, devices
are detected and initialised to a safe state, where they will not
spuriously generate interrupts.


Sorry, 'taint so.  Not unless the chip support droid has put the right
stuff in arch/arm/mach-foo.  LKML is littered with the fall-out of the
decision to trust whoever jumped to main() to have left the hardware
in a sane state.  If you don't enjoy this sort of forensics (which I
for one do not, especially not when there is a project deadline
looming and a Heisenbug starts firing 9 times out of 10), you might
consider systematically installing ISRs that know how to shut
everything up before turning on any interrupt sources at all.

As I said, this is not going to happen overnight, and is not even
particularly in the economic interest of people who get paid by the
hour to wear bringup wizard hats.  That category currently includes
me, but I am intensely bored with this game and aspire to greater
things.


When a userspace program opens a serial port, which can only happen
once the kernel boot has completed (ergo, devices have been initialised
and placed in a safe state) the interrupts are claimed, and enabled
at the source.


As you can see from the console dump I posted (which begins with
"Freeing init memory: 92K" and ends with do_exit -> init -> sys_open,
which is obviously sys_open("/dev/console")), this happens long before
userspace comes into the picture.  Our 8250.c has some nasty hacks in
it but otherwise this call chain is from a very nearly vanilla
2.6.16.recent.

We've already worked around this on our board, and the whole kit and
kaboodle will eventually be posted to linux-arm-kernel in tidy patches
when my client lets me spend billable hours on it (immediately after
the damn thing passes its first functional test, long before it
ships).  I'm not asking for anyone's help except in the
let's-all-help-one-another spirit.  I'm trying to help with root cause
analysis of Frederik's (Jose's?) fandango on core.  If it's not
relevant, my apologies; and although it goes without saying, I salute
you for both the serial driver and the ARM port.

Now please take a second look at the backtrace before toasting me
lightly again.  Mmm'kay?  Oh, and by the way -- is there an Alt-SysRq
equivalent on an ARM serial console?

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 02:16:41PM -0800, Michael K. Edwards wrote:
> Right.  But as soon as you turn the source back on, in the postamble
> of the interrupt dispatch handler, it fires again.  At least on ARM,
> that gives you recursive hits to __irq_svc and a couple of nested
> calls within it.

I think something else is going on here.  I think you're getting
an interrupt for the UART, and another interrupt is also pending.

When the UART interrupt is handled, it is masked at the interrupt
controller, and the CPU mask is dropped.

The second interrupt comes in, and when you go to disable that
source, you inadvertently re-enable the UART interrupt, despite it
still being serviced.

This leads to the UART interrupt again triggering an IRQ.

Please show your interrupt controller (mask, unmask, mask_ack)
handling functions corresponding with the interrupt which your
UART is connected to.

> >> But its context is not.  Shared IRQ lines are a _problem_.  You cannot
> >> safely enable an IRQ until all devices that share it have had their
> >> ISRs installed, unless you can absolutely guarantee at a hardware
> >> level that the unitialized ones cannot assert the IRQ line.
> >
> >Linux assumes that all interrupt sources on a shared IRQ line are
> >disabled at the point in time when the kernel boots.  When a device
> >is to be used, an interrupt handler is installed and then the kernel
> >will enable the interrupt on the device, not before.
> 
> Linux assumes incorrectly in this instance.
> It would improve the
> kernel if all drivers' __init code were refactored into an
> IRQ-discovery-ISR-installation pass, followed by a
> chip-reset-data-structure-initialization pass, followed by a
> chip-configuration-driver-activation pass.  This is unlikely to happen
> overnight.

This shows that you don't actually have an understanding of the Linux
kernel boot, especially in respect of serial devices.  At boot, devices
are detected and initialised to a safe state, where they will not
spuriously generate interrupts.

When a userspace program opens a serial port, which can only happen
once the kernel boot has completed (ergo, devices have been initialised
and placed in a safe state) the interrupts are claimed, and enabled
at the source.

> In the meantime, weird UART states on entry into platform_device_init
> are a reality.

Yes, uart states are indeterminent at this point.  However, as soon as
the 8250 driver loads it takes control of the 8250 ports, and DISABLES
the interrupt on ALL ports found, LONG BEFORE any service handlers are
installed.

So, by the time the system is up and running _all_ 8250 ports have
had their IERs written with zero.  Interrupts disabled at source.

By the time you get to open any serial port, the initialisation has
completed.

> >We follow that rule in the 8250 driver - in fact, when we initialise
> >we ensure that interrupts are disabled on any devices we find.
> 
> No, you rely on the caller of serial8250_init to have punctured the
> abstraction

Can you add any other useless complex words into that sentence?

> and forced any and all UARTs to a state where they cannot
> possibly generate an IRQ.

That is being done already at initialisation time.

Now, please show your interrupt mask/unmask/mask_ack code, which is
where I believe your problem to lie.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

> setup_irq() is where things go wrong, at least for us, at least on
> 2.6.16.x.  Interrupts are not disabled at the point in request_irq()
> when the interrupt controller is poked to enable the IRQ source.  If
> you're lucky, and you're on an architecture where the UART interrupt
> is properly level-triggered, and the worst thing that happens when you
> attempt to service an interrupt that isn't yours is that it stays on,
> then you get a soft lockup with two or three recursive __irq_svc hits
> in the backtrace.  If you're not lucky you do a fandango on core.

That should not happen if your interrupt handling is correct - okay, you
might get an interrupt at that point, but while servicing that interrupt
the source will be disabled on the interrupt controller.


Right.  But as soon as you turn the source back on, in the postamble
of the interrupt dispatch handler, it fires again.  At least on ARM,
that gives you recursive hits to __irq_svc and a couple of nested
calls within it.  Here's a backtrace (embedded in a chat log with some
commentary):

6:42 PM me: we have definitely confirmed that the serial ISR is
failing to clear the interrupt and the (presumably level-triggered)
IRQ is firing again on exit from the ISR.

6:43 PM The reason that __do_softirq is usually the last function
entrypoint in the backtrace before the __irq_svc associated with the
timer is that it is the first place where interrupts are enabled
during the IRQ dispatcher postamble.

6:44 PM Here is a backtrace from a case where the timer interrupt hit
during the perpetually firing ISR instead of during the dispatch code
surrounding it (which is not visible in backtraces)

6:45 PM [ 54.23] Freeing init memory: 92K
[ 52.24] rcu_do_batch: rcu node is 0xC03D7540, callback is 0xC00864C8
[ 52.24] rcu_do_batch: rcu node is 0xC02CCDA0, callback is 0xC006E7E4
[ 52.25] rcu_do_batch: rcu node is 0xC03D7730, callback is 0xC00864C8
[ 52.26] rcu_do_batch: rcu node is 0xC03D7920, callback is 0xC00864C8
[ 51.24] BUG: soft lockup detected on CPU#0!
[ 52.24] [] (dump_stack+0x0/0x14) from []
(softlockup_tick+0xa8/0xe8)
[ 52.24] [] (softlockup_tick+0x0/0xe8) from []
(run_local_timers+0x18/0x1c)
[ 52.24] r8 = 00010105 r7 = 0005 r6 =  r5 = 
[ 52.24] r4 = C0299B40
[ 52.24] [] (run_local_timers+0x0/0x1c) from
[] (update_process_times+0x50/0x7c)
[ 52.24] [] (update_process_times+0x0/0x7c) from
[] (timer_tick+0xc4/0xe0)
[ 52.24] r6 =  r5 = C029DB48 r4 = C029DB48
[ 52.24] [] (timer_tick+0x0/0xe0) from []
(mv88w8xx8_timer_interrupt+0x30/0x68)
[ 52.24] r6 =  r5 = C029DB48 r4 = C024775C
[ 52.24] [] (mv88w8xx8_timer_interrupt+0x0/0x68) from
[] (__do_irq+0xf0/0x140)
[ 52.24] r5 =  r4 = C0204280
6:46 PM [ 52.24] [] (__do_irq+0x0/0x140) from
[] (do_level_IRQ+0x70/0xc8)
[ 52.24] [] (do_level_IRQ+0x0/0xc8) from []
(asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DB48 r5 = C0240E24 r4 = 0005
[ 52.24] [] (asm_do_IRQ+0x0/0x134) from []
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0020 r5 = C029DB7C r4 = 
[ 52.24] [] (__do_irq+0x0/0x140) from []
(do_level_IRQ+0x70/0xc8)
[ 52.24] [] (do_level_IRQ+0x0/0xc8) from []
(asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DBFC r5 = C0240F5C r4 = 000B
[ 52.24] [] (asm_do_IRQ+0x0/0x134) from []
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0800 r5 = C029DC30 r4 = 
[ 52.24] [] (__do_softirq+0x0/0xd8) from []
(irq_exit+0x48/0x5c)
[ 52.24] r6 = C029DC94 r5 = C0240E24 r4 = 0005
[ 52.24] [] (irq_exit+0x0/0x5c) from []
(asm_do_IRQ+0x11c/0x134)
[ 52.24] [] (asm_do_IRQ+0x0/0x134) from []
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0820 r5 = C029DCC8 r4 = 
[ 52.24] [] (setup_irq+0x0/0x15c) from []
(request_irq+0xa4/0xd0)
[ 52.24] r7 =  r6 =  r5 = 000B r4 = C0C1B5C0
[ 52.24] [] (request_irq+0x0/0xd0) from []
(serial_link_irq_chain+0x264/0x2a0)
[ 52.24] [] (serial_link_irq_chain+0x0/0x2a0) from
[] (serial8250_startup+0x2f4/0x4f0)
[ 52.24] [] (serial8250_startup+0x0/0x4f0) from
[] (uart_startup+0x164/0x48c)
[ 52.24] [] (uart_startup+0x0/0x48c) from []
(uart_open+0x1a8/0x238)
[ 52.24] [] (uart_open+0x0/0x238) from []
(tty_open+0x1cc/0x390)
[ 52.24] [] (tty_open+0x0/0x390) from []
(chrdev_open+0x1e4/0x220)
[ 52.24] [] (chrdev_open+0x0/0x220) from []
(__dentry_open+0x13c/0x294)
[ 52.24] r8 = C028E2A0 r7 = C0077C60 r6 = C0C29B94 r5 = 
[ 52.24] r4 = C02CC300
[ 52.24] [] (__dentry_open+0x0/0x294) from []
(nameidata_to_filp+0x34/0x48)
[ 52.24] [] (nameidata_to_filp+0x0/0x48) from
[] (do_filp_open+0x44/0x4c)
[ 52.24] r4 = 0002
[ 52.24] [] (do_filp_open+0x0/0x4c) from []
(do_sys_open+0x50/0x94)
[ 52.24] r5 =  r4 = 0002
[ 52.24] [] (do_sys_open+0x0/0x94) from []
(sys_open+0x24/0x28)
[ 52.24] r8 =  r7 =

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 01:24:17PM -0800, Michael K. Edwards wrote:
> On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:
> >On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> >> What we've seen on our embedded ARM is that enabling an interrupt that
> >> is shared between multiple UARTs, at a stage when you have not set up
> >> all the data structures touched by the ISR and softirq, can have
> >> horrible consequences, including soft lockups and fandangos on core.
> >
> >Incorrect.  We have:
> >
> >1. registered an interrupt handler at this point.
> >2. disabled interrupts (we're under the spin lock)
> 
> setup_irq() is where things go wrong, at least for us, at least on
> 2.6.16.x.  Interrupts are not disabled at the point in request_irq()
> when the interrupt controller is poked to enable the IRQ source.  If
> you're lucky, and you're on an architecture where the UART interrupt
> is properly level-triggered, and the worst thing that happens when you
> attempt to service an interrupt that isn't yours is that it stays on,
> then you get a soft lockup with two or three recursive __irq_svc hits
> in the backtrace.  If you're not lucky you do a fandango on core.

That should not happen if your interrupt handling is correct - okay, you
might get an interrupt at that point, but while servicing that interrupt
the source will be disabled on the interrupt controller.

You should _never_ _ever_ get recusive interrupts for the same interrupt
source.  Ever.  If you do, your platforms interrupt handling is seriously
buggy.

> But its context is not.  Shared IRQ lines are a _problem_.  You cannot
> safely enable an IRQ until all devices that share it have had their
> ISRs installed, unless you can absolutely guarantee at a hardware
> level that the unitialized ones cannot assert the IRQ line.

Linux assumes that all interrupt sources on a shared IRQ line are
disabled at the point in time when the kernel boots.  When a device
is to be used, an interrupt handler is installed and then the kernel
will enable the interrupt on the device, not before.

We follow that rule in the 8250 driver - in fact, when we initialise
we ensure that interrupts are disabled on any devices we find.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards

On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:

On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> What we've seen on our embedded ARM is that enabling an interrupt that
> is shared between multiple UARTs, at a stage when you have not set up
> all the data structures touched by the ISR and softirq, can have
> horrible consequences, including soft lockups and fandangos on core.

Incorrect.  We have:

1. registered an interrupt handler at this point.
2. disabled interrupts (we're under the spin lock)

setup_irq() is where things go wrong, at least for us, at least on
2.6.16.x.  Interrupts are not disabled at the point in request_irq()
when the interrupt controller is poked to enable the IRQ source.  If
you're lucky, and you're on an architecture where the UART interrupt
is properly level-triggered, and the worst thing that happens when you
attempt to service an interrupt that isn't yours is that it stays on,
then you get a soft lockup with two or three recursive __irq_svc hits
in the backtrace.  If you're not lucky you do a fandango on core.

So, no interrupt will be seen by the CPU since the interrupt is masked.

The interrupt would need to be masked for the entire duration of the
outer loop that calls serial8250_init() or the equivalent for all
platform devices that share the IRQ.

The test is intentionally designed to be safe from the interrupt
generation point of view.

But its context is not.  Shared IRQ lines are a _problem_.  You cannot
safely enable an IRQ until all devices that share it have had their
ISRs installed, unless you can absolutely guarantee at a hardware
level that the unitialized ones cannot assert the IRQ line.  That does
not apply to any device that might have been touched by the bootloader
or the early init code, especially a UART.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 05:54:52PM +, Jose Goncalves wrote:
> Russell King wrote:
> Result is attached.

Right... in depth analysis follows.

[15423.650518] [] uart_startup+0x63/0xf4 equates to 0xc01ba49a, which
is indeed the instruction after the call to port->ops->startup.

The important code leading up to this is:

c01ba437 :
c01ba437:  55  push   %ebp
c01ba438:  57  push   %edi
c01ba439:  56  push   %esi
c01ba43a:  53  push   %ebx
c01ba43b:  8b 7c 24 14 mov0x14(%esp),%edi  @ load state
c01ba43f:  31 d2   xor%edx,%edx
c01ba441:  8b 5f 10mov0x10(%edi),%ebx  @ load state->info
c01ba444:  8b 77 14mov0x14(%edi),%esi  @ load state->port

c01ba493:  8b 46 64mov0x64(%esi),%eax  @ load port->ops
c01ba496:  56  push   %esi @ push "port" onto stack
c01ba497:  ff 50 24call   *0x24(%eax)  @ ops->startup(port)

c01bd74b :
c01bd74b:  55  push   %ebp
c01bd74c:  57  push   %edi
c01bd74d:  56  push   %esi
c01bd74e:  53  push   %ebx
c01bd74f:  8b 5c 24 14 mov0x14(%esp),%ebx

Comparing this with the stack dump:

Stack:
 c02fae70
 0005
 c74304e0 <- %ebx (pushed by serial8250_startup)
 c02fae70 <- %esi (pushed by serial8250_startup)
 c128d5e4 <- %edi (pushed by serial8250_startup)
 c7a69a80 <- %ebp (pushed by serial8250_startup)
 c01ba49a <- uart_startup+0x63/0xf4 (pushed by function called by
ops->startup, iow serial8250_startup)
 c02fae70 <- pushed on by "push %esi" at c01ba496, this is "port"
 c128d5e4 <- %ebx (pushed by uart_startup)
  <- %esi (pushed by uart_startup)
 c7a69a80 <- %edi (pushed by uart_startup)
 c7a69a80 <- %ebp (pushed by uart_startup)
 c01bbaa0 <- probably uart_open+0xaa/0xec

Once the instruction at c01bd74f completes, we have pushed into the stack
the structure commented above, but not the first two uncommented values.
%ebx contains the value of "port" at this point.  We're looking for some
place in the code which pushes a value of '5' and '%ebx' on to the stack,
and the CPUs registers contain values which correspond with the values
provided in your oops.

The code corresponding with the buggy uart check is as follows.  Comments
interspersed:

c01bd910:  9c  pushf
c01bd911:  5d  pop%ebp
c01bd912:  fa  cli

This code pushes the processors flag register onto the stack, pops it off
into the %ebp register, and then disables interrupts.

Your oops dump contained "ebp: 0202" which is a reasonable value for
x86 processors flags, which have been saved into the ebp register by the
above code sequence (according to Wikipedia).

c01bd937:  ff 73 58pushl  0x58(%ebx)
c01bd93a:  53  push   %ebx
c01bd93b:  e8 7f fd ff ff  call   c01bd6bf 
c01bd940:  6a 02   push   $0x2
c01bd942:  6a 01   push   $0x1
c01bd944:  53  push   %ebx
c01bd945:  e8 f6 eb ff ff  call   c01bc540  @ write IER
c01bd94a:  6a 05   push   $0x5
c01bd94c:  53  push   %ebx
c01bd94d:  e8 a6 eb ff ff  call   c01bc4f8   @ reads LSR
c01bd952:  89 c7   mov%eax,%edi @ saves result 
in %edi
c01bd954:  6a 02   push   $0x2
c01bd956:  53  push   %ebx
c01bd957:  e8 9c eb ff ff  call   c01bc4f8   @ reads IIR
c01bd95c:  83 c4 24add$0x24,%esp
c01bd95f:  89 c6   mov%eax,%esi @ saves result 
in %esi

This is the code corresponding with part of the buggy uart check - you can
see the call to serial8250_set_mctrl() there which confirms this.  The
sequence at c01bd94a pushes "5" and "port" (%ebx) onto the stack, but
this isn't the right place because before this we pushed "2", "1", and
"port" on the stack, and those are not present in the stack dump.

However, the reason for showing this is that a little while later, we
have:

c01bd96e:  83 e7 40and$0x40,%edi
c01bd971:  74 1c   je c01bd98f 
c01bd973:  83 e6 01and$0x1,%esi
c01bd976:  74 17   je c01bd98f 

The normal value we would read from the LSR (stored in %edi) would be
0x60, and if a transmit interrupt was pending (which is what the test
is trying to find out) the IIR (%esi) would be 0x02.

The above code sequence which involves masking these values would
therefore give:
 0x40 & 0x60 (%edi) -> 0x40 in %edi
 0x01 & 0x02 (%esi) -> 0x00 in %esi

>From your oops dump "edi: 0040" and "esi: " - that ties up,
so we know that the place we got to must be after this point.

We eventually come to this sequence.  The words previously pushed onto
the stack have been removed at this point, and %ebp, %edi nor %esi have
been touched by any other code since they were last

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> What we've seen on our embedded ARM is that enabling an interrupt that
> is shared between multiple UARTs, at a stage when you have not set up
> all the data structures touched by the ISR and softirq, can have
> horrible consequences, including soft lockups and fandangos on core.

Incorrect.  We have:

1. registered an interrupt handler at this point.
2. disabled interrupts (we're under the spin lock)

So, no interrupt will be seen by the CPU since the interrupt is masked.

The test is intentionally designed to be safe from the interrupt
generation point of view.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


What we've seen on our embedded ARM is that enabling an interrupt that
is shared between multiple UARTs, at a stage when you have not set up
all the data structures touched by the ISR and softirq, can have
horrible consequences, including soft lockups and fandangos on core.
You will be vulnerable to this unless you lock out the interrupt
source (at the interrupt controller or, if you have to, globally)
across the UART registration process in your platform's
arch/mach-dependent core.c, in which case the TX irq test will of
course fail.  Roll-your-own SoC UARTs with bugs or "extended features"
in IRQ enabling and delivery make things worse.

I would love to see this disentangled in a maintainable way.  It's
such a nasty problem (especially given that bootloaders and early boot
code frequently turn on one or more UARTs and leave them in an unknown
state) that all we've been able to do so far is hack around it.  I'll
send an example patch when we've more or less isolated it, but it will
be of limited use to you unless you have the exact set of UART warpage
we do.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Jose Goncalves

Russell King wrote:
> On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote:
>   
>> Russell King wrote:
>> 
>>> On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
>>>   
>>>   
 (trimmed tie-fei.zang from the CC, added by mistake)
 On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
 
 
>> Neither did I, but introducing printk's through the function, we narrowed
>> the problem to this part of the code. And removing it makes the problem
>> go away. We inserted 37 printk's in the function body, and Jose bisected
>> those until the problem went away.
>> 
>> 
> Well, there's still little clue about why this is causing a NULL pointer
> dereference.  The only thing I can think is that somehow performing
> this test is causing a power glitch to your CPU, causing its registers
> to get corrupted, and which results in it doing a NULL pointer deref.
>   
>   
 That may be the case, indeed.
 
 
>> But if the problem was a power glitch I should get Oops with or without
>> printk() inserted, shouldn't I?
>> 
>
> That depends if the printk() changes the timing such that it doesn't occur.
> Don't know, I'm only grasping at straws due to the lack of any concrete
> information.
>
>   
 If you see other tests to be performed...
 
>>> Maybe adding some delays in that bit of code?  I'm sure you've already
>>> thought of that though.  Since no one has a proper understanding of the
>>> problem, the only suggestions possible are mere shots in the dark.
>>>   
>> I'm no kernel expert, but it's not possible to trace what is the
>> instruction that is causing the NULL pointer dereference?
>> 
>
> The reported dump shows that the kernel tried to access virtual address 0,
> and the instruction pointer seems to be the cause of that - it has a value
> of zero in that dump.
>
> The call trace indicates that the last function was called from around
> "uart_startup+0x63/0xf4" which is probably the indirect function call
> to serial8250_startup().  That's unconfirmed - the only way to get it
> confirmed is if you could dump the entire uart_startup() function.
>
> $ grep uart_startup System.map
> (address) T uart_startup
> $ objdump -r -d vmlinux --start-addr=0x --stop-addr=0x
>
> The grep should get you the address of uart_startup.  Replace 
> with that value and  with the value plus 256 (0x100) and
> mail the result.
>   

Result is attached.

José Gonçalves


vmlinux-2.6.16.38-mtm4-debug: file format elf32-i386

Disassembly of section .text:

c01ba437 :
c01ba437:	55   	push   %ebp
c01ba438:	57   	push   %edi
c01ba439:	56   	push   %esi
c01ba43a:	53   	push   %ebx
c01ba43b:	8b 7c 24 14  	mov0x14(%esp),%edi
c01ba43f:	31 d2	xor%edx,%edx
c01ba441:	8b 5f 10 	mov0x10(%edi),%ebx
c01ba444:	8b 77 14 	mov0x14(%edi),%esi
c01ba447:	83 7b 10 00  	cmpl   $0x0,0x10(%ebx)
c01ba44b:	0f 88 d3 00 00 00	js c01ba524 
c01ba451:	8b 03	mov(%ebx),%eax
c01ba453:	0f ba a8 b4 00 00 00 	btsl   $0x1,0xb4(%eax)
c01ba45a:	01 
c01ba45b:	83 7e 60 00  	cmpl   $0x0,0x60(%esi)
c01ba45f:	0f 84 bf 00 00 00	je c01ba524 
c01ba465:	83 7b 04 00  	cmpl   $0x0,0x4(%ebx)
c01ba469:	75 28	jnec01ba493 
c01ba46b:	b8 d0 00 00 00   	mov$0xd0,%eax
c01ba470:	e8 36 c7 f6 ff   	call   c0126bab 
c01ba475:	ba f4 ff ff ff   	mov$0xfff4,%edx
c01ba47a:	85 c0	test   %eax,%eax
c01ba47c:	0f 84 a2 00 00 00	je c01ba524 
c01ba482:	89 43 04 	mov%eax,0x4(%ebx)
c01ba485:	c7 43 0c 00 00 00 00 	movl   $0x0,0xc(%ebx)
c01ba48c:	c7 43 08 00 00 00 00 	movl   $0x0,0x8(%ebx)
c01ba493:	8b 46 64 	mov0x64(%esi),%eax
c01ba496:	56   	push   %esi
c01ba497:	ff 50 24 	call   *0x24(%eax)
c01ba49a:	89 c5	mov%eax,%ebp
c01ba49c:	58   	pop%eax
c01ba49d:	85 ed	test   %ebp,%ebp
c01ba49f:	75 6d	jnec01ba50e 
c01ba4a1:	83 7c 24 18 00   	cmpl   $0x0,0x18(%esp)
c01ba4a6:	74 36	je c01ba4de 
c01ba4a8:	6a 00	push   $0x0
c01ba4aa:	57   	push   %edi
c01ba4ab:	e8 5f 02 00 00   	call   c01ba70f 
c01ba4b0:	8b 03	mov(%ebx),%eax
c01ba4b2:	8b 40 64 	mov0x64(%eax),%eax
c01ba4b5:	59   	pop%ecx
c01ba4b6:	5f   	pop%edi
c01ba4b7:	f7 40 08 0f 10 00 00 	testl  $0x100f,0x8(%eax)
c01ba4be:	74 1e	je c01ba4de 
c01ba4c0:	9c   	pushf  
c01ba4c1:	5f   	pop%edi
c01ba4c2:	fa   	cli
c01ba4c3:	8b 46 58 	mov0x58(%esi),%eax
c01ba4c6:	89 c2

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote:
> Russell King wrote:
> > On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> >   
> >> (trimmed tie-fei.zang from the CC, added by mistake)
> >> On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> >> 
>  Neither did I, but introducing printk's through the function, we narrowed
>  the problem to this part of the code. And removing it makes the problem
>  go away. We inserted 37 printk's in the function body, and Jose bisected
>  those until the problem went away.
>  
> >>> Well, there's still little clue about why this is causing a NULL pointer
> >>> dereference.  The only thing I can think is that somehow performing
> >>> this test is causing a power glitch to your CPU, causing its registers
> >>> to get corrupted, and which results in it doing a NULL pointer deref.
> >>>   
> >> That may be the case, indeed.
> >> 
> 
> But if the problem was a power glitch I should get Oops with or without
> printk() inserted, shouldn't I?

That depends if the printk() changes the timing such that it doesn't occur.
Don't know, I'm only grasping at straws due to the lack of any concrete
information.

> >> If you see other tests to be performed...
> >
> > Maybe adding some delays in that bit of code?  I'm sure you've already
> > thought of that though.  Since no one has a proper understanding of the
> > problem, the only suggestions possible are mere shots in the dark.
> 
> I'm no kernel expert, but it's not possible to trace what is the
> instruction that is causing the NULL pointer dereference?

The reported dump shows that the kernel tried to access virtual address 0,
and the instruction pointer seems to be the cause of that - it has a value
of zero in that dump.

The call trace indicates that the last function was called from around
"uart_startup+0x63/0xf4" which is probably the indirect function call
to serial8250_startup().  That's unconfirmed - the only way to get it
confirmed is if you could dump the entire uart_startup() function.

$ grep uart_startup System.map
(address) T uart_startup
$ objdump -r -d vmlinux --start-addr=0x --stop-addr=0x

The grep should get you the address of uart_startup.  Replace 
with that value and  with the value plus 256 (0x100) and
mail the result.

> I have no clue on what is causing this problem but, what I know, is
> that I can always reproduce it, and it always happens in the same code
> section of serial8250_startup().

We're both at the same level of clue about the problem then.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Jose Goncalves

Russell King wrote:
> On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
>   
>> (trimmed tie-fei.zang from the CC, added by mistake)
>> On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
>> 
 Neither did I, but introducing printk's through the function, we narrowed
 the problem to this part of the code. And removing it makes the problem
 go away. We inserted 37 printk's in the function body, and Jose bisected
 those until the problem went away.
 
>>> Well, there's still little clue about why this is causing a NULL pointer
>>> dereference.  The only thing I can think is that somehow performing
>>> this test is causing a power glitch to your CPU, causing its registers
>>> to get corrupted, and which results in it doing a NULL pointer deref.
>>>   
>> That may be the case, indeed.
>> 

But if the problem was a power glitch I should get Oops with or without
printk() inserted, shouldn't I?

>>> Are you saying that the NULL pointer occurred while executing this code?
>>> If not, where does the NULL pointer occur?
>>>   
>> The thing is, the NULL pointer deref dissapeared as soon as we
>> instrumented (printk'ed) the code. So it's seems to be triggered by
>> check+timing+hardware.
>> 
>
> So to summarise, we have some code somewhere which is causing a NULL
> pointer deref in uart_startup().  If we remove some code, the NULL
> pointer deref stops happening.
>
> And that's about the sum total of the information we know.  We don't
> know precisely where the NULL pointer deref occurs, and we don't know
> what's causing it.
>
> It doesn't sound like there's much understanding of the problem at hand. ;(
>
>   
>>> Andrew's said no (in that the thread you refer to) and suggested an
>>> alternative, I've said no, how many more 'no's do you need to turn
>>> you away from the wrong approach?
>>>   
>> One is usually sufficient once I've understood :). I missed the module
>> option approach. Is it ok with you? If yes, I'll put up a patch to do
>> this.
>> 
>
> I guess so, but how does the user know whether they need this enabled or
> disabled?
>
>   
>> The problem appears to be reproducible on Jose's hardware within 2-3 days.
>> 

In a kernel without instrumentation I get problems within a 1 day period.

>> If you see other tests to be performed...
>> 
>
> Maybe adding some delays in that bit of code?  I'm sure you've already
> thought of that though.  Since no one has a proper understanding of the
> problem, the only suggestions possible are mere shots in the dark.
>   

I'm no kernel expert, but it's not possible to trace what is the
instruction that is causing the NULL pointer dereference? The kernel
dump does not show this?

I have no clue on what is causing this problem but, what I know, is that
I can always reproduce it, and it always happens in the same code
section of serial8250_startup().

Regards,
José Gonçalves
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> (trimmed tie-fei.zang from the CC, added by mistake)
> On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > > Neither did I, but introducing printk's through the function, we narrowed
> > > the problem to this part of the code. And removing it makes the problem
> > > go away. We inserted 37 printk's in the function body, and Jose bisected
> > > those until the problem went away.
> > 
> > Well, there's still little clue about why this is causing a NULL pointer
> > dereference.  The only thing I can think is that somehow performing
> > this test is causing a power glitch to your CPU, causing its registers
> > to get corrupted, and which results in it doing a NULL pointer deref.
> That may be the case, indeed.
> > 
> > Are you saying that the NULL pointer occurred while executing this code?
> > If not, where does the NULL pointer occur?
> The thing is, the NULL pointer deref dissapeared as soon as we
> instrumented (printk'ed) the code. So it's seems to be triggered by
> check+timing+hardware.

So to summarise, we have some code somewhere which is causing a NULL
pointer deref in uart_startup().  If we remove some code, the NULL
pointer deref stops happening.

And that's about the sum total of the information we know.  We don't
know precisely where the NULL pointer deref occurs, and we don't know
what's causing it.

It doesn't sound like there's much understanding of the problem at hand. ;(

> > Andrew's said no (in that the thread you refer to) and suggested an
> > alternative, I've said no, how many more 'no's do you need to turn
> > you away from the wrong approach?
> One is usually sufficient once I've understood :). I missed the module
> option approach. Is it ok with you? If yes, I'll put up a patch to do
> this.

I guess so, but how does the user know whether they need this enabled or
disabled?

> The problem appears to be reproducible on Jose's hardware within 2-3 days.
> If you see other tests to be performed...

Maybe adding some delays in that bit of code?  I'm sure you've already
thought of that though.  Since no one has a proper understanding of the
problem, the only suggestions possible are mere shots in the dark.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Frederik Deweerdt

(trimmed tie-fei.zang from the CC, added by mistake)
On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > Neither did I, but introducing printk's through the function, we narrowed
> > the problem to this part of the code. And removing it makes the problem
> > go away. We inserted 37 printk's in the function body, and Jose bisected
> > those until the problem went away.
> 
> Well, there's still little clue about why this is causing a NULL pointer
> dereference.  The only thing I can think is that somehow performing
> this test is causing a power glitch to your CPU, causing its registers
> to get corrupted, and which results in it doing a NULL pointer deref.
That may be the case, indeed.
> 
> Are you saying that the NULL pointer occurred while executing this code?
> If not, where does the NULL pointer occur?
The thing is, the NULL pointer deref dissapeared as soon as we
instrumented (printk'ed) the code. So it's seems to be triggered by
check+timing+hardware.
> 
> > > No, it's only runtime because you can't tell which ports might be
> > > affected, and you might have a mixture of ports which are affected
> > > and those which aren't.
> > Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?
> 
> Andrew's said no (in that the thread you refer to) and suggested an
> alternative, I've said no, how many more 'no's do you need to turn
> you away from the wrong approach?
One is usually sufficient once I've understood :). I missed the module
option approach. Is it ok with you? If yes, I'll put up a patch to do
this.
> 
> > > > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > > > http://lkml.org/lkml/2006/6/13/21
> > > 
> > > I don't see any reference to this problem there.
> >
> > Sorry, I suck, I got that mixed with that one:
> > http://lkml.org/lkml/2006/12/26/63
> > "probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
> > ARM boards"
> 
> The "weird effects" were never quantified, so that's one of the reasons
> I ignored that report (another being is that I stopped being the serial
> maintainer a while ago, and now serial is maintainerless.)
> 
The problem appears to be reproducible on Jose's hardware within 2-3 days.
If you see other tests to be performed...

Regards,
Frederik
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Tue, Feb 20, 2007 at 02:24:42PM +, Frederik Deweerdt wrote:
> On Mon, Feb 19, 2007 at 01:45:39PM +, Russell King wrote:
> > On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
> > > (Sorry for the resend, I forgot to cc the list)
> > > Hi Russell,
> > > 
> > > It seems that the following change in drivers/serial/8250.c
> > > 
> > > +
> > > + /*
> > > +  * Do a quick test to see if we receive an
> > > +  * interrupt when we enable the TX irq.
> > > +  */
> > > + serial_outp(up, UART_IER, UART_IER_THRI);
> > > + lsr = serial_in(up, UART_LSR);
> > > + iir = serial_in(up, UART_IIR);
> > > + serial_outp(up, UART_IER, 0);
> > > +
> > > + if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
> > > + if (!(up->capabilities & UART_BUG_TXEN)) {
> > > + up->capabilities |= UART_BUG_TXEN;
> > > + pr_debug("ttyS%d - enabling bad tx status 
> > > workarounds\n",
> > > +  port->line);
> > > + }
> > > + } else {
> > > + up->capabilities &= ~UART_BUG_TXEN;
> > > + }
> > > +
> > > 
> > > that was introduced in 2.6.12[1], is causing oopses on some hardware. In
> > > particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
> > 
> > I don't see that.  The oops your referring to is a NULL pointer
> > dereference.  The only dereferences the above code does is via
> > 'up' and 'port' both of which are provably always non-null here.
>
> Neither did I, but introducing printk's through the function, we narrowed
> the problem to this part of the code. And removing it makes the problem
> go away. We inserted 37 printk's in the function body, and Jose bisected
> those until the problem went away.

Well, there's still little clue about why this is causing a NULL pointer
dereference.  The only thing I can think is that somehow performing
this test is causing a power glitch to your CPU, causing its registers
to get corrupted, and which results in it doing a NULL pointer deref.

Are you saying that the NULL pointer occurred while executing this code?
If not, where does the NULL pointer occur?

> > No, it's only runtime because you can't tell which ports might be
> > affected, and you might have a mixture of ports which are affected
> > and those which aren't.
> Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?

Andrew's said no (in that the thread you refer to) and suggested an
alternative, I've said no, how many more 'no's do you need to turn
you away from the wrong approach?

> > > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > > http://lkml.org/lkml/2006/6/13/21
> > 
> > I don't see any reference to this problem there.
>
> Sorry, I suck, I got that mixed with that one:
> http://lkml.org/lkml/2006/12/26/63
> "probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
> ARM boards"

The "weird effects" were never quantified, so that's one of the reasons
I ignored that report (another being is that I stopped being the serial
maintainer a while ago, and now serial is maintainerless.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Frederik Deweerdt

On Mon, Feb 19, 2007 at 01:45:39PM +, Russell King wrote:
> On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
> > (Sorry for the resend, I forgot to cc the list)
> > Hi Russell,
> > 
> > It seems that the following change in drivers/serial/8250.c
> > 
> > +
> > +   /*
> > +* Do a quick test to see if we receive an
> > +* interrupt when we enable the TX irq.
> > +*/
> > +   serial_outp(up, UART_IER, UART_IER_THRI);
> > +   lsr = serial_in(up, UART_LSR);
> > +   iir = serial_in(up, UART_IIR);
> > +   serial_outp(up, UART_IER, 0);
> > +
> > +   if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
> > +   if (!(up->capabilities & UART_BUG_TXEN)) {
> > +   up->capabilities |= UART_BUG_TXEN;
> > +   pr_debug("ttyS%d - enabling bad tx status 
> > workarounds\n",
> > +port->line);
> > +   }
> > +   } else {
> > +   up->capabilities &= ~UART_BUG_TXEN;
> > +   }
> > +
> > 
> > that was introduced in 2.6.12[1], is causing oopses on some hardware. In
> > particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
> 
> I don't see that.  The oops your referring to is a NULL pointer
> dereference.  The only dereferences the above code does is via
> 'up' and 'port' both of which are provably always non-null here.
Neither did I, but introducing printk's through the function, we narrowed
the problem to this part of the code. And removing it makes the problem
go away. We inserted 37 printk's in the function body, and Jose bisected
those until the problem went away.
>
> > I was wondering:
> > - what is the goal of the test?
> 
> To detect UARTs which do not assert an interrupt when the transmit
> interrupt is enabled, which then causes no data to ever be transmitted
> without this work-around.
OK, thanks for clarifying.
> 
> > - could this be CONFIGed ?
> 
> No, it's only runtime because you can't tell which ports might be
> affected, and you might have a mixture of ports which are affected
> and those which aren't.
Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?
> 
> > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > http://lkml.org/lkml/2006/6/13/21
> 
> I don't see any reference to this problem there.
Sorry, I suck, I got that mixed with that one:
http://lkml.org/lkml/2006/12/26/63
"probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
ARM boards"

Regards,
Frederik
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
> (Sorry for the resend, I forgot to cc the list)
> Hi Russell,
> 
> It seems that the following change in drivers/serial/8250.c
> 
> +
> + /*
> +  * Do a quick test to see if we receive an
> +  * interrupt when we enable the TX irq.
> +  */
> + serial_outp(up, UART_IER, UART_IER_THRI);
> + lsr = serial_in(up, UART_LSR);
> + iir = serial_in(up, UART_IIR);
> + serial_outp(up, UART_IER, 0);
> +
> + if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
> + if (!(up->capabilities & UART_BUG_TXEN)) {
> + up->capabilities |= UART_BUG_TXEN;
> + pr_debug("ttyS%d - enabling bad tx status 
> workarounds\n",
> +  port->line);
> + }
> + } else {
> + up->capabilities &= ~UART_BUG_TXEN;
> + }
> +
> 
> that was introduced in 2.6.12[1], is causing oopses on some hardware. In
> particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible

I don't see that.  The oops your referring to is a NULL pointer
dereference.  The only dereferences the above code does is via
'up' and 'port' both of which are provably always non-null here.

> I was wondering:
> - what is the goal of the test?

To detect UARTs which do not assert an interrupt when the transmit
interrupt is enabled, which then causes no data to ever be transmitted
without this work-around.

> - could this be CONFIGed ?

No, it's only runtime because you can't tell which ports might be
affected, and you might have a mixture of ports which are affected
and those which aren't.

> PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> http://lkml.org/lkml/2006/6/13/21

I don't see any reference to this problem there.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Serial related oops

2007-02-19 Thread Frederik Deweerdt

(Sorry for the resend, I forgot to cc the list)
Hi Russell,

It seems that the following change in drivers/serial/8250.c

+
+   /*
+* Do a quick test to see if we receive an
+* interrupt when we enable the TX irq.
+*/
+   serial_outp(up, UART_IER, UART_IER_THRI);
+   lsr = serial_in(up, UART_LSR);
+   iir = serial_in(up, UART_IIR);
+   serial_outp(up, UART_IER, 0);
+
+   if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
+   if (!(up->capabilities & UART_BUG_TXEN)) {
+   up->capabilities |= UART_BUG_TXEN;
+   pr_debug("ttyS%d - enabling bad tx status 
workarounds\n",
+port->line);
+   }
+   } else {
+   up->capabilities &= ~UART_BUG_TXEN;
+   }
+

that was introduced in 2.6.12[1], is causing oopses on some hardware. In
particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
(after a few days of open()/close() on the serial port). He bisected this
to that change -thanks for the long debugging Jose ;)-. and reverting
that part of the 2.6.12 git patch seems to fix the problem.
I was wondering:
- what is the goal of the test?
- could this be CONFIGed ?

Regards,
Frederik

PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
http://lkml.org/lkml/2006/6/13/21

[1]http://lkml.org/lkml/2005/6/23/266
[2]http://lkml.org/lkml/2007/1/26/157
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Serial related oops

2007-02-19 Thread Frederik Deweerdt

(Sorry for the resend, I forgot to cc the list)
Hi Russell,

It seems that the following change in drivers/serial/8250.c

+
+   /*
+* Do a quick test to see if we receive an
+* interrupt when we enable the TX irq.
+*/
+   serial_outp(up, UART_IER, UART_IER_THRI);
+   lsr = serial_in(up, UART_LSR);
+   iir = serial_in(up, UART_IIR);
+   serial_outp(up, UART_IER, 0);
+
+   if (lsr  UART_LSR_TEMT  iir  UART_IIR_NO_INT) {
+   if (!(up-capabilities  UART_BUG_TXEN)) {
+   up-capabilities |= UART_BUG_TXEN;
+   pr_debug(ttyS%d - enabling bad tx status 
workarounds\n,
+port-line);
+   }
+   } else {
+   up-capabilities = ~UART_BUG_TXEN;
+   }
+

that was introduced in 2.6.12[1], is causing oopses on some hardware. In
particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
(after a few days of open()/close() on the serial port). He bisected this
to that change -thanks for the long debugging Jose ;)-. and reverting
that part of the 2.6.12 git patch seems to fix the problem.
I was wondering:
- what is the goal of the test?
- could this be CONFIGed ?

Regards,
Frederik

PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
http://lkml.org/lkml/2006/6/13/21

[1]http://lkml.org/lkml/2005/6/23/266
[2]http://lkml.org/lkml/2007/1/26/157
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
 (Sorry for the resend, I forgot to cc the list)
 Hi Russell,
 
 It seems that the following change in drivers/serial/8250.c
 
 +
 + /*
 +  * Do a quick test to see if we receive an
 +  * interrupt when we enable the TX irq.
 +  */
 + serial_outp(up, UART_IER, UART_IER_THRI);
 + lsr = serial_in(up, UART_LSR);
 + iir = serial_in(up, UART_IIR);
 + serial_outp(up, UART_IER, 0);
 +
 + if (lsr  UART_LSR_TEMT  iir  UART_IIR_NO_INT) {
 + if (!(up-capabilities  UART_BUG_TXEN)) {
 + up-capabilities |= UART_BUG_TXEN;
 + pr_debug(ttyS%d - enabling bad tx status 
 workarounds\n,
 +  port-line);
 + }
 + } else {
 + up-capabilities = ~UART_BUG_TXEN;
 + }
 +
 
 that was introduced in 2.6.12[1], is causing oopses on some hardware. In
 particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible

I don't see that.  The oops your referring to is a NULL pointer
dereference.  The only dereferences the above code does is via
'up' and 'port' both of which are provably always non-null here.

 I was wondering:
 - what is the goal of the test?

To detect UARTs which do not assert an interrupt when the transmit
interrupt is enabled, which then causes no data to ever be transmitted
without this work-around.

 - could this be CONFIGed ?

No, it's only runtime because you can't tell which ports might be
affected, and you might have a mixture of ports which are affected
and those which aren't.

 PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
 http://lkml.org/lkml/2006/6/13/21

I don't see any reference to this problem there.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Frederik Deweerdt

On Mon, Feb 19, 2007 at 01:45:39PM +, Russell King wrote:
 On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
  (Sorry for the resend, I forgot to cc the list)
  Hi Russell,
  
  It seems that the following change in drivers/serial/8250.c
  
  +
  +   /*
  +* Do a quick test to see if we receive an
  +* interrupt when we enable the TX irq.
  +*/
  +   serial_outp(up, UART_IER, UART_IER_THRI);
  +   lsr = serial_in(up, UART_LSR);
  +   iir = serial_in(up, UART_IIR);
  +   serial_outp(up, UART_IER, 0);
  +
  +   if (lsr  UART_LSR_TEMT  iir  UART_IIR_NO_INT) {
  +   if (!(up-capabilities  UART_BUG_TXEN)) {
  +   up-capabilities |= UART_BUG_TXEN;
  +   pr_debug(ttyS%d - enabling bad tx status 
  workarounds\n,
  +port-line);
  +   }
  +   } else {
  +   up-capabilities = ~UART_BUG_TXEN;
  +   }
  +
  
  that was introduced in 2.6.12[1], is causing oopses on some hardware. In
  particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
 
 I don't see that.  The oops your referring to is a NULL pointer
 dereference.  The only dereferences the above code does is via
 'up' and 'port' both of which are provably always non-null here.
Neither did I, but introducing printk's through the function, we narrowed
the problem to this part of the code. And removing it makes the problem
go away. We inserted 37 printk's in the function body, and Jose bisected
those until the problem went away.

  I was wondering:
  - what is the goal of the test?
 
 To detect UARTs which do not assert an interrupt when the transmit
 interrupt is enabled, which then causes no data to ever be transmitted
 without this work-around.
OK, thanks for clarifying.
 
  - could this be CONFIGed ?
 
 No, it's only runtime because you can't tell which ports might be
 affected, and you might have a mixture of ports which are affected
 and those which aren't.
Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?
 
  PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
  http://lkml.org/lkml/2006/6/13/21
 
 I don't see any reference to this problem there.
Sorry, I suck, I got that mixed with that one:
http://lkml.org/lkml/2006/12/26/63
probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
ARM boards

Regards,
Frederik
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Tue, Feb 20, 2007 at 02:24:42PM +, Frederik Deweerdt wrote:
 On Mon, Feb 19, 2007 at 01:45:39PM +, Russell King wrote:
  On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
   (Sorry for the resend, I forgot to cc the list)
   Hi Russell,
   
   It seems that the following change in drivers/serial/8250.c
   
   +
   + /*
   +  * Do a quick test to see if we receive an
   +  * interrupt when we enable the TX irq.
   +  */
   + serial_outp(up, UART_IER, UART_IER_THRI);
   + lsr = serial_in(up, UART_LSR);
   + iir = serial_in(up, UART_IIR);
   + serial_outp(up, UART_IER, 0);
   +
   + if (lsr  UART_LSR_TEMT  iir  UART_IIR_NO_INT) {
   + if (!(up-capabilities  UART_BUG_TXEN)) {
   + up-capabilities |= UART_BUG_TXEN;
   + pr_debug(ttyS%d - enabling bad tx status 
   workarounds\n,
   +  port-line);
   + }
   + } else {
   + up-capabilities = ~UART_BUG_TXEN;
   + }
   +
   
   that was introduced in 2.6.12[1], is causing oopses on some hardware. In
   particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
  
  I don't see that.  The oops your referring to is a NULL pointer
  dereference.  The only dereferences the above code does is via
  'up' and 'port' both of which are provably always non-null here.

 Neither did I, but introducing printk's through the function, we narrowed
 the problem to this part of the code. And removing it makes the problem
 go away. We inserted 37 printk's in the function body, and Jose bisected
 those until the problem went away.

Well, there's still little clue about why this is causing a NULL pointer
dereference.  The only thing I can think is that somehow performing
this test is causing a power glitch to your CPU, causing its registers
to get corrupted, and which results in it doing a NULL pointer deref.

Are you saying that the NULL pointer occurred while executing this code?
If not, where does the NULL pointer occur?

  No, it's only runtime because you can't tell which ports might be
  affected, and you might have a mixture of ports which are affected
  and those which aren't.
 Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?

Andrew's said no (in that the thread you refer to) and suggested an
alternative, I've said no, how many more 'no's do you need to turn
you away from the wrong approach?

   PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
   http://lkml.org/lkml/2006/6/13/21
  
  I don't see any reference to this problem there.

 Sorry, I suck, I got that mixed with that one:
 http://lkml.org/lkml/2006/12/26/63
 probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
 ARM boards

The weird effects were never quantified, so that's one of the reasons
I ignored that report (another being is that I stopped being the serial
maintainer a while ago, and now serial is maintainerless.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Frederik Deweerdt

(trimmed tie-fei.zang from the CC, added by mistake)
On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
  Neither did I, but introducing printk's through the function, we narrowed
  the problem to this part of the code. And removing it makes the problem
  go away. We inserted 37 printk's in the function body, and Jose bisected
  those until the problem went away.
 
 Well, there's still little clue about why this is causing a NULL pointer
 dereference.  The only thing I can think is that somehow performing
 this test is causing a power glitch to your CPU, causing its registers
 to get corrupted, and which results in it doing a NULL pointer deref.
That may be the case, indeed.
 
 Are you saying that the NULL pointer occurred while executing this code?
 If not, where does the NULL pointer occur?
The thing is, the NULL pointer deref dissapeared as soon as we
instrumented (printk'ed) the code. So it's seems to be triggered by
check+timing+hardware.
 
   No, it's only runtime because you can't tell which ports might be
   affected, and you might have a mixture of ports which are affected
   and those which aren't.
  Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?
 
 Andrew's said no (in that the thread you refer to) and suggested an
 alternative, I've said no, how many more 'no's do you need to turn
 you away from the wrong approach?
One is usually sufficient once I've understood :). I missed the module
option approach. Is it ok with you? If yes, I'll put up a patch to do
this.
 
PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
http://lkml.org/lkml/2006/6/13/21
   
   I don't see any reference to this problem there.
 
  Sorry, I suck, I got that mixed with that one:
  http://lkml.org/lkml/2006/12/26/63
  probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
  ARM boards
 
 The weird effects were never quantified, so that's one of the reasons
 I ignored that report (another being is that I stopped being the serial
 maintainer a while ago, and now serial is maintainerless.)
 
The problem appears to be reproducible on Jose's hardware within 2-3 days.
If you see other tests to be performed...

Regards,
Frederik
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
 (trimmed tie-fei.zang from the CC, added by mistake)
 On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
   Neither did I, but introducing printk's through the function, we narrowed
   the problem to this part of the code. And removing it makes the problem
   go away. We inserted 37 printk's in the function body, and Jose bisected
   those until the problem went away.
  
  Well, there's still little clue about why this is causing a NULL pointer
  dereference.  The only thing I can think is that somehow performing
  this test is causing a power glitch to your CPU, causing its registers
  to get corrupted, and which results in it doing a NULL pointer deref.
 That may be the case, indeed.
  
  Are you saying that the NULL pointer occurred while executing this code?
  If not, where does the NULL pointer occur?
 The thing is, the NULL pointer deref dissapeared as soon as we
 instrumented (printk'ed) the code. So it's seems to be triggered by
 check+timing+hardware.

So to summarise, we have some code somewhere which is causing a NULL
pointer deref in uart_startup().  If we remove some code, the NULL
pointer deref stops happening.

And that's about the sum total of the information we know.  We don't
know precisely where the NULL pointer deref occurs, and we don't know
what's causing it.

It doesn't sound like there's much understanding of the problem at hand. ;(

  Andrew's said no (in that the thread you refer to) and suggested an
  alternative, I've said no, how many more 'no's do you need to turn
  you away from the wrong approach?
 One is usually sufficient once I've understood :). I missed the module
 option approach. Is it ok with you? If yes, I'll put up a patch to do
 this.

I guess so, but how does the user know whether they need this enabled or
disabled?

 The problem appears to be reproducible on Jose's hardware within 2-3 days.
 If you see other tests to be performed...

Maybe adding some delays in that bit of code?  I'm sure you've already
thought of that though.  Since no one has a proper understanding of the
problem, the only suggestions possible are mere shots in the dark.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Jose Goncalves

Russell King wrote:
 On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
   
 (trimmed tie-fei.zang from the CC, added by mistake)
 On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
 
 Neither did I, but introducing printk's through the function, we narrowed
 the problem to this part of the code. And removing it makes the problem
 go away. We inserted 37 printk's in the function body, and Jose bisected
 those until the problem went away.
 
 Well, there's still little clue about why this is causing a NULL pointer
 dereference.  The only thing I can think is that somehow performing
 this test is causing a power glitch to your CPU, causing its registers
 to get corrupted, and which results in it doing a NULL pointer deref.
   
 That may be the case, indeed.
 

But if the problem was a power glitch I should get Oops with or without
printk() inserted, shouldn't I?

 Are you saying that the NULL pointer occurred while executing this code?
 If not, where does the NULL pointer occur?
   
 The thing is, the NULL pointer deref dissapeared as soon as we
 instrumented (printk'ed) the code. So it's seems to be triggered by
 check+timing+hardware.
 

 So to summarise, we have some code somewhere which is causing a NULL
 pointer deref in uart_startup().  If we remove some code, the NULL
 pointer deref stops happening.

 And that's about the sum total of the information we know.  We don't
 know precisely where the NULL pointer deref occurs, and we don't know
 what's causing it.

 It doesn't sound like there's much understanding of the problem at hand. ;(

   
 Andrew's said no (in that the thread you refer to) and suggested an
 alternative, I've said no, how many more 'no's do you need to turn
 you away from the wrong approach?
   
 One is usually sufficient once I've understood :). I missed the module
 option approach. Is it ok with you? If yes, I'll put up a patch to do
 this.
 

 I guess so, but how does the user know whether they need this enabled or
 disabled?

   
 The problem appears to be reproducible on Jose's hardware within 2-3 days.
 

In a kernel without instrumentation I get problems within a 1 day period.

 If you see other tests to be performed...
 

 Maybe adding some delays in that bit of code?  I'm sure you've already
 thought of that though.  Since no one has a proper understanding of the
 problem, the only suggestions possible are mere shots in the dark.
   

I'm no kernel expert, but it's not possible to trace what is the
instruction that is causing the NULL pointer dereference? The kernel
dump does not show this?

I have no clue on what is causing this problem but, what I know, is that
I can always reproduce it, and it always happens in the same code
section of serial8250_startup().

Regards,
José Gonçalves
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote:
 Russell King wrote:
  On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:

  (trimmed tie-fei.zang from the CC, added by mistake)
  On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
  
  Neither did I, but introducing printk's through the function, we narrowed
  the problem to this part of the code. And removing it makes the problem
  go away. We inserted 37 printk's in the function body, and Jose bisected
  those until the problem went away.
  
  Well, there's still little clue about why this is causing a NULL pointer
  dereference.  The only thing I can think is that somehow performing
  this test is causing a power glitch to your CPU, causing its registers
  to get corrupted, and which results in it doing a NULL pointer deref.

  That may be the case, indeed.
  
 
 But if the problem was a power glitch I should get Oops with or without
 printk() inserted, shouldn't I?

That depends if the printk() changes the timing such that it doesn't occur.
Don't know, I'm only grasping at straws due to the lack of any concrete
information.

  If you see other tests to be performed...
 
  Maybe adding some delays in that bit of code?  I'm sure you've already
  thought of that though.  Since no one has a proper understanding of the
  problem, the only suggestions possible are mere shots in the dark.
 
 I'm no kernel expert, but it's not possible to trace what is the
 instruction that is causing the NULL pointer dereference?

The reported dump shows that the kernel tried to access virtual address 0,
and the instruction pointer seems to be the cause of that - it has a value
of zero in that dump.

The call trace indicates that the last function was called from around
uart_startup+0x63/0xf4 which is probably the indirect function call
to serial8250_startup().  That's unconfirmed - the only way to get it
confirmed is if you could dump the entire uart_startup() function.

$ grep uart_startup System.map
(address) T uart_startup
$ objdump -r -d vmlinux --start-addr=0xaddress --stop-addr=0xaddress+256

The grep should get you the address of uart_startup.  Replace address
with that value and address+256 with the value plus 256 (0x100) and
mail the result.

 I have no clue on what is causing this problem but, what I know, is
 that I can always reproduce it, and it always happens in the same code
 section of serial8250_startup().

We're both at the same level of clue about the problem then.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Jose Goncalves

Russell King wrote:
 On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote:
   
 Russell King wrote:
 
 On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
   
   
 (trimmed tie-fei.zang from the CC, added by mistake)
 On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
 
 
 Neither did I, but introducing printk's through the function, we narrowed
 the problem to this part of the code. And removing it makes the problem
 go away. We inserted 37 printk's in the function body, and Jose bisected
 those until the problem went away.
 
 
 Well, there's still little clue about why this is causing a NULL pointer
 dereference.  The only thing I can think is that somehow performing
 this test is causing a power glitch to your CPU, causing its registers
 to get corrupted, and which results in it doing a NULL pointer deref.
   
   
 That may be the case, indeed.
 
 
 But if the problem was a power glitch I should get Oops with or without
 printk() inserted, shouldn't I?
 

 That depends if the printk() changes the timing such that it doesn't occur.
 Don't know, I'm only grasping at straws due to the lack of any concrete
 information.

   
 If you see other tests to be performed...
 
 Maybe adding some delays in that bit of code?  I'm sure you've already
 thought of that though.  Since no one has a proper understanding of the
 problem, the only suggestions possible are mere shots in the dark.
   
 I'm no kernel expert, but it's not possible to trace what is the
 instruction that is causing the NULL pointer dereference?
 

 The reported dump shows that the kernel tried to access virtual address 0,
 and the instruction pointer seems to be the cause of that - it has a value
 of zero in that dump.

 The call trace indicates that the last function was called from around
 uart_startup+0x63/0xf4 which is probably the indirect function call
 to serial8250_startup().  That's unconfirmed - the only way to get it
 confirmed is if you could dump the entire uart_startup() function.

 $ grep uart_startup System.map
 (address) T uart_startup
 $ objdump -r -d vmlinux --start-addr=0xaddress --stop-addr=0xaddress+256

 The grep should get you the address of uart_startup.  Replace address
 with that value and address+256 with the value plus 256 (0x100) and
 mail the result.
   

Result is attached.

José Gonçalves


vmlinux-2.6.16.38-mtm4-debug: file format elf32-i386

Disassembly of section .text:

c01ba437 uart_startup:
c01ba437:	55   	push   %ebp
c01ba438:	57   	push   %edi
c01ba439:	56   	push   %esi
c01ba43a:	53   	push   %ebx
c01ba43b:	8b 7c 24 14  	mov0x14(%esp),%edi
c01ba43f:	31 d2	xor%edx,%edx
c01ba441:	8b 5f 10 	mov0x10(%edi),%ebx
c01ba444:	8b 77 14 	mov0x14(%edi),%esi
c01ba447:	83 7b 10 00  	cmpl   $0x0,0x10(%ebx)
c01ba44b:	0f 88 d3 00 00 00	js c01ba524 uart_startup+0xed
c01ba451:	8b 03	mov(%ebx),%eax
c01ba453:	0f ba a8 b4 00 00 00 	btsl   $0x1,0xb4(%eax)
c01ba45a:	01 
c01ba45b:	83 7e 60 00  	cmpl   $0x0,0x60(%esi)
c01ba45f:	0f 84 bf 00 00 00	je c01ba524 uart_startup+0xed
c01ba465:	83 7b 04 00  	cmpl   $0x0,0x4(%ebx)
c01ba469:	75 28	jnec01ba493 uart_startup+0x5c
c01ba46b:	b8 d0 00 00 00   	mov$0xd0,%eax
c01ba470:	e8 36 c7 f6 ff   	call   c0126bab get_zeroed_page
c01ba475:	ba f4 ff ff ff   	mov$0xfff4,%edx
c01ba47a:	85 c0	test   %eax,%eax
c01ba47c:	0f 84 a2 00 00 00	je c01ba524 uart_startup+0xed
c01ba482:	89 43 04 	mov%eax,0x4(%ebx)
c01ba485:	c7 43 0c 00 00 00 00 	movl   $0x0,0xc(%ebx)
c01ba48c:	c7 43 08 00 00 00 00 	movl   $0x0,0x8(%ebx)
c01ba493:	8b 46 64 	mov0x64(%esi),%eax
c01ba496:	56   	push   %esi
c01ba497:	ff 50 24 	call   *0x24(%eax)
c01ba49a:	89 c5	mov%eax,%ebp
c01ba49c:	58   	pop%eax
c01ba49d:	85 ed	test   %ebp,%ebp
c01ba49f:	75 6d	jnec01ba50e uart_startup+0xd7
c01ba4a1:	83 7c 24 18 00   	cmpl   $0x0,0x18(%esp)
c01ba4a6:	74 36	je c01ba4de uart_startup+0xa7
c01ba4a8:	6a 00	push   $0x0
c01ba4aa:	57   	push   %edi
c01ba4ab:	e8 5f 02 00 00   	call   c01ba70f uart_change_speed
c01ba4b0:	8b 03	mov(%ebx),%eax
c01ba4b2:	8b 40 64 	mov0x64(%eax),%eax
c01ba4b5:	59   	pop%ecx
c01ba4b6:	5f   	pop%edi
c01ba4b7:	f7 40 08 0f 10 00 00 	testl  $0x100f,0x8(%eax)
c01ba4be:	74 1e	je c01ba4de uart_startup+0xa7
c01ba4c0:	9c   	pushf  
c01ba4c1:	5f   	pop%edi
c01ba4c2:	fa   	cli
c01ba4c3:	8b 46 58 	mov

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


What we've seen on our embedded ARM is that enabling an interrupt that
is shared between multiple UARTs, at a stage when you have not set up
all the data structures touched by the ISR and softirq, can have
horrible consequences, including soft lockups and fandangos on core.
You will be vulnerable to this unless you lock out the interrupt
source (at the interrupt controller or, if you have to, globally)
across the UART registration process in your platform's
arch/mach-dependent core.c, in which case the TX irq test will of
course fail.  Roll-your-own SoC UARTs with bugs or extended features
in IRQ enabling and delivery make things worse.

I would love to see this disentangled in a maintainable way.  It's
such a nasty problem (especially given that bootloaders and early boot
code frequently turn on one or more UARTs and leave them in an unknown
state) that all we've been able to do so far is hack around it.  I'll
send an example patch when we've more or less isolated it, but it will
be of limited use to you unless you have the exact set of UART warpage
we do.

Cheers,
- Michael
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
 What we've seen on our embedded ARM is that enabling an interrupt that
 is shared between multiple UARTs, at a stage when you have not set up
 all the data structures touched by the ISR and softirq, can have
 horrible consequences, including soft lockups and fandangos on core.

Incorrect.  We have:

1. registered an interrupt handler at this point.
2. disabled interrupts (we're under the spin lock)

So, no interrupt will be seen by the CPU since the interrupt is masked.

The test is intentionally designed to be safe from the interrupt
generation point of view.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 05:54:52PM +, Jose Goncalves wrote:
 Russell King wrote:
 Result is attached.

Right... in depth analysis follows.

[15423.650518] [] uart_startup+0x63/0xf4 equates to 0xc01ba49a, which
is indeed the instruction after the call to port-ops-startup.

The important code leading up to this is:

c01ba437 uart_startup:
c01ba437:  55  push   %ebp
c01ba438:  57  push   %edi
c01ba439:  56  push   %esi
c01ba43a:  53  push   %ebx
c01ba43b:  8b 7c 24 14 mov0x14(%esp),%edi  @ load state
c01ba43f:  31 d2   xor%edx,%edx
c01ba441:  8b 5f 10mov0x10(%edi),%ebx  @ load state-info
c01ba444:  8b 77 14mov0x14(%edi),%esi  @ load state-port

c01ba493:  8b 46 64mov0x64(%esi),%eax  @ load port-ops
c01ba496:  56  push   %esi @ push port onto stack
c01ba497:  ff 50 24call   *0x24(%eax)  @ ops-startup(port)

c01bd74b serial8250_startup:
c01bd74b:  55  push   %ebp
c01bd74c:  57  push   %edi
c01bd74d:  56  push   %esi
c01bd74e:  53  push   %ebx
c01bd74f:  8b 5c 24 14 mov0x14(%esp),%ebx

Comparing this with the stack dump:

Stack:
 c02fae70
 0005
 c74304e0 - %ebx (pushed by serial8250_startup)
 c02fae70 - %esi (pushed by serial8250_startup)
 c128d5e4 - %edi (pushed by serial8250_startup)
 c7a69a80 - %ebp (pushed by serial8250_startup)
 c01ba49a - uart_startup+0x63/0xf4 (pushed by function called by
ops-startup, iow serial8250_startup)
 c02fae70 - pushed on by push %esi at c01ba496, this is port
 c128d5e4 - %ebx (pushed by uart_startup)
  - %esi (pushed by uart_startup)
 c7a69a80 - %edi (pushed by uart_startup)
 c7a69a80 - %ebp (pushed by uart_startup)
 c01bbaa0 - probably uart_open+0xaa/0xec

Once the instruction at c01bd74f completes, we have pushed into the stack
the structure commented above, but not the first two uncommented values.
%ebx contains the value of port at this point.  We're looking for some
place in the code which pushes a value of '5' and '%ebx' on to the stack,
and the CPUs registers contain values which correspond with the values
provided in your oops.

The code corresponding with the buggy uart check is as follows.  Comments
interspersed:

c01bd910:  9c  pushf
c01bd911:  5d  pop%ebp
c01bd912:  fa  cli

This code pushes the processors flag register onto the stack, pops it off
into the %ebp register, and then disables interrupts.

Your oops dump contained ebp: 0202 which is a reasonable value for
x86 processors flags, which have been saved into the ebp register by the
above code sequence (according to Wikipedia).

c01bd937:  ff 73 58pushl  0x58(%ebx)
c01bd93a:  53  push   %ebx
c01bd93b:  e8 7f fd ff ff  call   c01bd6bf serial8250_set_mctrl
c01bd940:  6a 02   push   $0x2
c01bd942:  6a 01   push   $0x1
c01bd944:  53  push   %ebx
c01bd945:  e8 f6 eb ff ff  call   c01bc540 serial_out @ write IER
c01bd94a:  6a 05   push   $0x5
c01bd94c:  53  push   %ebx
c01bd94d:  e8 a6 eb ff ff  call   c01bc4f8 serial_in  @ reads LSR
c01bd952:  89 c7   mov%eax,%edi @ saves result 
in %edi
c01bd954:  6a 02   push   $0x2
c01bd956:  53  push   %ebx
c01bd957:  e8 9c eb ff ff  call   c01bc4f8 serial_in  @ reads IIR
c01bd95c:  83 c4 24add$0x24,%esp
c01bd95f:  89 c6   mov%eax,%esi @ saves result 
in %esi

This is the code corresponding with part of the buggy uart check - you can
see the call to serial8250_set_mctrl() there which confirms this.  The
sequence at c01bd94a pushes 5 and port (%ebx) onto the stack, but
this isn't the right place because before this we pushed 2, 1, and
port on the stack, and those are not present in the stack dump.

However, the reason for showing this is that a little while later, we
have:

c01bd96e:  83 e7 40and$0x40,%edi
c01bd971:  74 1c   je c01bd98f serial8250_startup+0x244
c01bd973:  83 e6 01and$0x1,%esi
c01bd976:  74 17   je c01bd98f serial8250_startup+0x244

The normal value we would read from the LSR (stored in %edi) would be
0x60, and if a transmit interrupt was pending (which is what the test
is trying to find out) the IIR (%esi) would be 0x02.

The above code sequence which involves masking these values would
therefore give:
 0x40  0x60 (%edi) - 0x40 in %edi
 0x01  0x02 (%esi) - 0x00 in %esi

From your oops dump edi: 0040 and esi:  - that ties up,
so we know that the place we got to must be after this point.

We eventually come to this sequence.  The words previously pushed onto
the stack have been removed at this point, and %ebp,

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Russell King [EMAIL PROTECTED] wrote:

On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
 What we've seen on our embedded ARM is that enabling an interrupt that
 is shared between multiple UARTs, at a stage when you have not set up
 all the data structures touched by the ISR and softirq, can have
 horrible consequences, including soft lockups and fandangos on core.

Incorrect.  We have:

1. registered an interrupt handler at this point.
2. disabled interrupts (we're under the spin lock)


setup_irq() is where things go wrong, at least for us, at least on
2.6.16.x.  Interrupts are not disabled at the point in request_irq()
when the interrupt controller is poked to enable the IRQ source.  If
you're lucky, and you're on an architecture where the UART interrupt
is properly level-triggered, and the worst thing that happens when you
attempt to service an interrupt that isn't yours is that it stays on,
then you get a soft lockup with two or three recursive __irq_svc hits
in the backtrace.  If you're not lucky you do a fandango on core.


So, no interrupt will be seen by the CPU since the interrupt is masked.


The interrupt would need to be masked for the entire duration of the
outer loop that calls serial8250_init() or the equivalent for all
platform devices that share the IRQ.


The test is intentionally designed to be safe from the interrupt
generation point of view.


But its context is not.  Shared IRQ lines are a _problem_.  You cannot
safely enable an IRQ until all devices that share it have had their
ISRs installed, unless you can absolutely guarantee at a hardware
level that the unitialized ones cannot assert the IRQ line.  That does
not apply to any device that might have been touched by the bootloader
or the early init code, especially a UART.

Cheers,
- Michael
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 01:24:17PM -0800, Michael K. Edwards wrote:
 On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
 On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
  What we've seen on our embedded ARM is that enabling an interrupt that
  is shared between multiple UARTs, at a stage when you have not set up
  all the data structures touched by the ISR and softirq, can have
  horrible consequences, including soft lockups and fandangos on core.
 
 Incorrect.  We have:
 
 1. registered an interrupt handler at this point.
 2. disabled interrupts (we're under the spin lock)
 
 setup_irq() is where things go wrong, at least for us, at least on
 2.6.16.x.  Interrupts are not disabled at the point in request_irq()
 when the interrupt controller is poked to enable the IRQ source.  If
 you're lucky, and you're on an architecture where the UART interrupt
 is properly level-triggered, and the worst thing that happens when you
 attempt to service an interrupt that isn't yours is that it stays on,
 then you get a soft lockup with two or three recursive __irq_svc hits
 in the backtrace.  If you're not lucky you do a fandango on core.

That should not happen if your interrupt handling is correct - okay, you
might get an interrupt at that point, but while servicing that interrupt
the source will be disabled on the interrupt controller.

You should _never_ _ever_ get recusive interrupts for the same interrupt
source.  Ever.  If you do, your platforms interrupt handling is seriously
buggy.

 But its context is not.  Shared IRQ lines are a _problem_.  You cannot
 safely enable an IRQ until all devices that share it have had their
 ISRs installed, unless you can absolutely guarantee at a hardware
 level that the unitialized ones cannot assert the IRQ line.

Linux assumes that all interrupt sources on a shared IRQ line are
disabled at the point in time when the kernel boots.  When a device
is to be used, an interrupt handler is installed and then the kernel
will enable the interrupt on the device, not before.

We follow that rule in the 8250 driver - in fact, when we initialise
we ensure that interrupts are disabled on any devices we find.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Russell King [EMAIL PROTECTED] wrote:

 setup_irq() is where things go wrong, at least for us, at least on
 2.6.16.x.  Interrupts are not disabled at the point in request_irq()
 when the interrupt controller is poked to enable the IRQ source.  If
 you're lucky, and you're on an architecture where the UART interrupt
 is properly level-triggered, and the worst thing that happens when you
 attempt to service an interrupt that isn't yours is that it stays on,
 then you get a soft lockup with two or three recursive __irq_svc hits
 in the backtrace.  If you're not lucky you do a fandango on core.

That should not happen if your interrupt handling is correct - okay, you
might get an interrupt at that point, but while servicing that interrupt
the source will be disabled on the interrupt controller.


Right.  But as soon as you turn the source back on, in the postamble
of the interrupt dispatch handler, it fires again.  At least on ARM,
that gives you recursive hits to __irq_svc and a couple of nested
calls within it.  Here's a backtrace (embedded in a chat log with some
commentary):

6:42 PM me: we have definitely confirmed that the serial ISR is
failing to clear the interrupt and the (presumably level-triggered)
IRQ is firing again on exit from the ISR.

6:43 PM The reason that __do_softirq is usually the last function
entrypoint in the backtrace before the __irq_svc associated with the
timer is that it is the first place where interrupts are enabled
during the IRQ dispatcher postamble.

6:44 PM Here is a backtrace from a case where the timer interrupt hit
during the perpetually firing ISR instead of during the dispatch code
surrounding it (which is not visible in backtraces)

6:45 PM [ 54.23] Freeing init memory: 92K
[ 52.24] rcu_do_batch: rcu node is 0xC03D7540, callback is 0xC00864C8
[ 52.24] rcu_do_batch: rcu node is 0xC02CCDA0, callback is 0xC006E7E4
[ 52.25] rcu_do_batch: rcu node is 0xC03D7730, callback is 0xC00864C8
[ 52.26] rcu_do_batch: rcu node is 0xC03D7920, callback is 0xC00864C8
[ 51.24] BUG: soft lockup detected on CPU#0!
[ 52.24] [c0025834] (dump_stack+0x0/0x14) from [c0050e40]
(softlockup_tick+0xa8/0xe8)
[ 52.24] [c0050d98] (softlockup_tick+0x0/0xe8) from [c003bb18]
(run_local_timers+0x18/0x1c)
[ 52.24] r8 = 00010105 r7 = 0005 r6 =  r5 = 
[ 52.24] r4 = C0299B40
[ 52.24] [c003bb00] (run_local_timers+0x0/0x1c) from
[c003bdec] (update_process_times+0x50/0x7c)
[ 52.24] [c003bd9c] (update_process_times+0x0/0x7c) from
[c0024f24] (timer_tick+0xc4/0xe0)
[ 52.24] r6 =  r5 = C029DB48 r4 = C029DB48
[ 52.24] [c0024e60] (timer_tick+0x0/0xe0) from [c002a79c]
(mv88w8xx8_timer_interrupt+0x30/0x68)
[ 52.24] r6 =  r5 = C029DB48 r4 = C024775C
[ 52.24] [c002a76c] (mv88w8xx8_timer_interrupt+0x0/0x68) from
[c0020c84] (__do_irq+0xf0/0x140)
[ 52.24] r5 =  r4 = C0204280
6:46 PM [ 52.24] [c0020b94] (__do_irq+0x0/0x140) from
[c0020f48] (do_level_IRQ+0x70/0xc8)
[ 52.24] [c0020ed8] (do_level_IRQ+0x0/0xc8) from [c00212b8]
(asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DB48 r5 = C0240E24 r4 = 0005
[ 52.24] [c0021268] (asm_do_IRQ+0x0/0x134) from [c001f978]
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0020 r5 = C029DB7C r4 = 
[ 52.24] [c0020b94] (__do_irq+0x0/0x140) from [c0020f48]
(do_level_IRQ+0x70/0xc8)
[ 52.24] [c0020ed8] (do_level_IRQ+0x0/0xc8) from [c00212b8]
(asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DBFC r5 = C0240F5C r4 = 000B
[ 52.24] [c0021268] (asm_do_IRQ+0x0/0x134) from [c001f978]
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0800 r5 = C029DC30 r4 = 
[ 52.24] [c0036d68] (__do_softirq+0x0/0xd8) from [c00370e0]
(irq_exit+0x48/0x5c)
[ 52.24] r6 = C029DC94 r5 = C0240E24 r4 = 0005
[ 52.24] [c0037098] (irq_exit+0x0/0x5c) from [c0021384]
(asm_do_IRQ+0x11c/0x134)
[ 52.24] [c0021268] (asm_do_IRQ+0x0/0x134) from [c001f978]
(__irq_svc+0x38/0x190)
[ 52.24] r6 = 0820 r5 = C029DCC8 r4 = 
[ 52.24] [c0020968] (setup_irq+0x0/0x15c) from [c0020b68]
(request_irq+0xa4/0xd0)
[ 52.24] r7 =  r6 =  r5 = 000B r4 = C0C1B5C0
[ 52.24] [c0020ac4] (request_irq+0x0/0xd0) from [c0100c24]
(serial_link_irq_chain+0x264/0x2a0)
[ 52.24] [c01009c0] (serial_link_irq_chain+0x0/0x2a0) from
[c0101558] (serial8250_startup+0x2f4/0x4f0)
[ 52.24] [c0101264] (serial8250_startup+0x0/0x4f0) from
[c00f85ec] (uart_startup+0x164/0x48c)
[ 52.24] [c00f8488] (uart_startup+0x0/0x48c) from [c00fca98]
(uart_open+0x1a8/0x238)
[ 52.24] [c00fc8f0] (uart_open+0x0/0x238) from [c00f2160]
(tty_open+0x1cc/0x390)
[ 52.24] [c00f1f94] (tty_open+0x0/0x390) from [c0077e44]
(chrdev_open+0x1e4/0x220)
[ 52.24] [c0077c60] (chrdev_open+0x0/0x220) from [c006ba58]
(__dentry_open+0x13c/0x294)
[ 52.24] r8 = C028E2A0 r7 = C0077C60 r6 = C0C29B94 r5 = 
[ 52.24] r4 = C02CC300
[ 52.24] [c006b91c]

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 02:16:41PM -0800, Michael K. Edwards wrote:
 Right.  But as soon as you turn the source back on, in the postamble
 of the interrupt dispatch handler, it fires again.  At least on ARM,
 that gives you recursive hits to __irq_svc and a couple of nested
 calls within it.

I think something else is going on here.  I think you're getting
an interrupt for the UART, and another interrupt is also pending.

When the UART interrupt is handled, it is masked at the interrupt
controller, and the CPU mask is dropped.

The second interrupt comes in, and when you go to disable that
source, you inadvertently re-enable the UART interrupt, despite it
still being serviced.

This leads to the UART interrupt again triggering an IRQ.

Please show your interrupt controller (mask, unmask, mask_ack)
handling functions corresponding with the interrupt which your
UART is connected to.

  But its context is not.  Shared IRQ lines are a _problem_.  You cannot
  safely enable an IRQ until all devices that share it have had their
  ISRs installed, unless you can absolutely guarantee at a hardware
  level that the unitialized ones cannot assert the IRQ line.
 
 Linux assumes that all interrupt sources on a shared IRQ line are
 disabled at the point in time when the kernel boots.  When a device
 is to be used, an interrupt handler is installed and then the kernel
 will enable the interrupt on the device, not before.
 
 Linux assumes incorrectly in this instance.
 It would improve the
 kernel if all drivers' __init code were refactored into an
 IRQ-discovery-ISR-installation pass, followed by a
 chip-reset-data-structure-initialization pass, followed by a
 chip-configuration-driver-activation pass.  This is unlikely to happen
 overnight.

This shows that you don't actually have an understanding of the Linux
kernel boot, especially in respect of serial devices.  At boot, devices
are detected and initialised to a safe state, where they will not
spuriously generate interrupts.

When a userspace program opens a serial port, which can only happen
once the kernel boot has completed (ergo, devices have been initialised
and placed in a safe state) the interrupts are claimed, and enabled
at the source.

 In the meantime, weird UART states on entry into platform_device_init
 are a reality.

Yes, uart states are indeterminent at this point.  However, as soon as
the 8250 driver loads it takes control of the 8250 ports, and DISABLES
the interrupt on ALL ports found, LONG BEFORE any service handlers are
installed.

So, by the time the system is up and running _all_ 8250 ports have
had their IERs written with zero.  Interrupts disabled at source.

By the time you get to open any serial port, the initialisation has
completed.

 We follow that rule in the 8250 driver - in fact, when we initialise
 we ensure that interrupts are disabled on any devices we find.
 
 No, you rely on the caller of serial8250_init to have punctured the
 abstraction

Can you add any other useless complex words into that sentence?

 and forced any and all UARTs to a state where they cannot
 possibly generate an IRQ.

That is being done already at initialisation time.

Now, please show your interrupt mask/unmask/mask_ack code, which is
where I believe your problem to lie.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Russell King [EMAIL PROTECTED] wrote:

I think something else is going on here.  I think you're getting
an interrupt for the UART, and another interrupt is also pending.


Correct.  An interrupt for the other UART on the same IRQ.


When the UART interrupt is handled, it is masked at the interrupt
controller, and the CPU mask is dropped.


Correct.


The second interrupt comes in, and when you go to disable that
source, you inadvertently re-enable the UART interrupt, despite it
still being serviced.


Incorrect.  An attempt has been made to service the interrupt using
the only ISR currently in the chain for that IRQ -- the ISR for the
first UART.  That attempt was not successful, and when __do_irq
unmasks the interrupt source preparatory to exiting interrupt context,
__irq_svc is dispatched anew.


This leads to the UART interrupt again triggering an IRQ.


Right.  The _second_ UART's interrupt.  There's another problem with
these UARTs having to do with the implementor's inability to read and
follow a bog-standard twenty-year-old spec without asking software to
fix up corner cases, but that's another backtrace for another day.


Please show your interrupt controller (mask, unmask, mask_ack)
handling functions corresponding with the interrupt which your
UART is connected to.


Don't have 'em handy; I'll be happy to post them when I do, perhaps
later today.  I would hope they're pretty generic, though; it's a
Feroceon core pretending to be an ARM926EJ-S, hooked to the usual
half-assed Marvell imitation of an ARM licensed functional block.
Trust me for the moment, it's the same IRQ line.


This shows that you don't actually have an understanding of the Linux
kernel boot, especially in respect of serial devices.  At boot, devices
are detected and initialised to a safe state, where they will not
spuriously generate interrupts.


Sorry, 'taint so.  Not unless the chip support droid has put the right
stuff in arch/arm/mach-foo.  LKML is littered with the fall-out of the
decision to trust whoever jumped to main() to have left the hardware
in a sane state.  If you don't enjoy this sort of forensics (which I
for one do not, especially not when there is a project deadline
looming and a Heisenbug starts firing 9 times out of 10), you might
consider systematically installing ISRs that know how to shut
everything up before turning on any interrupt sources at all.

As I said, this is not going to happen overnight, and is not even
particularly in the economic interest of people who get paid by the
hour to wear bringup wizard hats.  That category currently includes
me, but I am intensely bored with this game and aspire to greater
things.


When a userspace program opens a serial port, which can only happen
once the kernel boot has completed (ergo, devices have been initialised
and placed in a safe state) the interrupts are claimed, and enabled
at the source.


As you can see from the console dump I posted (which begins with
Freeing init memory: 92K and ends with do_exit - init - sys_open,
which is obviously sys_open(/dev/console)), this happens long before
userspace comes into the picture.  Our 8250.c has some nasty hacks in
it but otherwise this call chain is from a very nearly vanilla
2.6.16.recent.

We've already worked around this on our board, and the whole kit and
kaboodle will eventually be posted to linux-arm-kernel in tidy patches
when my client lets me spend billable hours on it (immediately after
the damn thing passes its first functional test, long before it
ships).  I'm not asking for anyone's help except in the
let's-all-help-one-another spirit.  I'm trying to help with root cause
analysis of Frederik's (Jose's?) fandango on core.  If it's not
relevant, my apologies; and although it goes without saying, I salute
you for both the serial driver and the ARM port.

Now please take a second look at the backtrace before toasting me
lightly again.  Mmm'kay?  Oh, and by the way -- is there an Alt-SysRq
equivalent on an ARM serial console?

Cheers,
- Michael
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Russell King

On Mon, Feb 19, 2007 at 04:04:26PM -0800, Michael K. Edwards wrote:
 On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
 The second interrupt comes in, and when you go to disable that
 source, you inadvertently re-enable the UART interrupt, despite it
 still being serviced.
 
 Incorrect.  An attempt has been made to service the interrupt using
 the only ISR currently in the chain for that IRQ -- the ISR for the
 first UART.  That attempt was not successful, and when __do_irq
 unmasks the interrupt source preparatory to exiting interrupt context,
 __irq_svc is dispatched anew.

This can't happen because when __do_irq unmasks the interrupt source,
the CPU mask is set, thereby preventing any further interrupt exceptions
being taken.  This is done precisely to prevent this situation happening.

If you are seeing recursion for the same interrupt (two or more stack
frames containing asm_do_IRQ for that very same IRQ) then your interrupt
handling is buggy, plain and simple.

 Please show your interrupt controller (mask, unmask, mask_ack)
 handling functions corresponding with the interrupt which your
 UART is connected to.
 
 Don't have 'em handy; I'll be happy to post them when I do, perhaps
 later today.  I would hope they're pretty generic, though; it's a
 Feroceon core pretending to be an ARM926EJ-S, hooked to the usual
 half-assed Marvell imitation of an ARM licensed functional block.
 Trust me for the moment, it's the same IRQ line.

I don't doubt that it is on the same IRQ line - I have such setups here
and it works perfectly - multiple 8250 UARTs connected to a single
level-triggered interrupt input which also happens to be shared with
a SCSI host chip as well.  Absolutely no problems.

 If you don't enjoy this sort of forensics (which I
 for one do not, especially not when there is a project deadline
 looming and a Heisenbug starts firing 9 times out of 10), you might
 consider systematically installing ISRs that know how to shut
 everything up before turning on any interrupt sources at all.

I still say that your understanding is completely flawed.  Moreover,
you haven't read what I've said about the ordering of initialisation,
the stress on when we disable interrupts for the ports, etc.

 I'm not asking for anyone's help except in the
 let's-all-help-one-another spirit.

You're actually *not* helping.  You're causing utter confusion through
misunderstanding, but it seems you're not open to the possibility that
your understanding is flawed.

I'm offering to look through your code and point you at the source of
your issue for free.  Please don't throw that offer away without first
considering that maybe I have a clue about what's going on here.

 Now please take a second look at the backtrace before toasting me
 lightly again.

... which showed the port being opened well after system initialisation
of devices, including all serial ports - including disabling of their
interrupt source at the IER, has been completed.

 Mmm'kay?  Oh, and by the way -- is there an Alt-SysRq
 equivalent on an ARM serial console?

Yes, and it's the same for any serial console with functioning break
support.  You'll find it in Documentation/sysrq.txt, though it does
misleadingly say PC style standard serial ports only whereas the
reality is where possible.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Russell King [EMAIL PROTECTED] wrote:

This can't happen because when __do_irq unmasks the interrupt source,
the CPU mask is set, thereby preventing any further interrupt exceptions
being taken.  This is done precisely to prevent this situation happening.

If you are seeing recursion for the same interrupt (two or more stack
frames containing asm_do_IRQ for that very same IRQ) then your interrupt
handling is buggy, plain and simple.


Imaginable.  I'll look at the mask/unmask code.  Thanks.


I don't doubt that it is on the same IRQ line - I have such setups here
and it works perfectly - multiple 8250 UARTs connected to a single
level-triggered interrupt input which also happens to be shared with
a SCSI host chip as well.  Absolutely no problems.


Can you do me a favor?  In the sys_open(/dev/console) path, turn on
the right bits in that second uart's IER, then insert a sleep in
request_irq or something (wherever seems best based on that
backtrace), and feed enough characters into the second UART during
that sleep to generate an IRQ.  Do you not get the same soft lockup?


I still say that your understanding is completely flawed.  Moreover,
you haven't read what I've said about the ordering of initialisation,
the stress on when we disable interrupts for the ports, etc.


Well, all I can say is that that's a real backtrace and it shouldn't
be hard to reproduce if it's anything other than a broken interrupt
controller or broken code called by the __do_irq postamble.  I don't
see any platform-provided unmask routines in that backtrace, but maybe
it got inlined; I'll go back and check.


You're actually *not* helping.  You're causing utter confusion through
misunderstanding, but it seems you're not open to the possibility that
your understanding is flawed.


Still open, though it's a pity you're more interested in my flawed
understanding that in the possibility that the kernel could be
systematically made more robust against hardware bugs and coding
errors by the simple expedient of putting all the ISRs in before
turning on any IRQ that might be shared.  Or are you telling me that's
already been done?  (Yes, I am aware that this interacts
entertainingly with hot-plug PCI.  Yes, I am aware that there is a
limit to how much software can fix stupid hardware.  But surely there
is room for an emergency IRQ suppressor to let chip initialization
code kick in and force the hardware to a known state.)


I'm offering to look through your code and point you at the source of
your issue for free.  Please don't throw that offer away without first
considering that maybe I have a clue about what's going on here.


I appreciate that offer, and I hope to take advantage of it as soon as
I have the source code at my fingertips (not just the chat log where I
recorded the backtrace).


... which showed the port being opened well after system initialisation
of devices, including all serial ports - including disabling of their
interrupt source at the IER, has been completed.


Now that you mention it, the backtrace I sent is the
serial8250_startup one, not the serial8250_init one.  Sorry, this
one's probably an artifact of brain damage specific to this UART.  I
need to dig through a different account to find the init-path example;
but in either case, we're getting a new interrupt during the __do_irq
postamble.  If you're telling me that that shouldn't happen, what
should the backtrace for a soft lockup due to a stuck level-triggered
IRQ look like on ARM?


Yes, and it's the same for any serial console with functioning break
support.  You'll find it in Documentation/sysrq.txt, though it does
misleadingly say PC style standard serial ports only whereas the
reality is where possible.


Thank you very much; this will help me get to the bottom of some other
chip-support nastiness on this device.

Cheers,
- Michael
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Robert Hancock


Michael K. Edwards wrote:

Still open, though it's a pity you're more interested in my flawed
understanding that in the possibility that the kernel could be
systematically made more robust against hardware bugs and coding
errors by the simple expedient of putting all the ISRs in before
turning on any IRQ that might be shared.  Or are you telling me that's
already been done?  (Yes, I am aware that this interacts
entertainingly with hot-plug PCI.  Yes, I am aware that there is a
limit to how much software can fix stupid hardware.  But surely there
is room for an emergency IRQ suppressor to let chip initialization
code kick in and force the hardware to a known state.)


How do you propose to do this? Drivers can get loaded and unloaded at any
time. If you have a device generating spurious interrupts on a shared IRQ
line, there's no way you can use any device on that line until that interrupt
is shut off. Requiring all drivers to be loaded before any of them can use
interrupts is simply not practical.

If a system has a device that generates interrupts before they're enabled,
and the firmware doesn't fix it, then some platform-specific quirk has to
handle it and shut off the interrupt before it allows any interrupts
to be enabled. (We have such a quirk for certain network controllers where
the boot ROM can leave the chip generating interrupts on bootup.)

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Michael K. Edwards


On 2/19/07, Robert Hancock [EMAIL PROTECTED] wrote:

How do you propose to do this? Drivers can get loaded and unloaded at any
time. If you have a device generating spurious interrupts on a shared IRQ
line, there's no way you can use any device on that line until that interrupt
is shut off. Requiring all drivers to be loaded before any of them can use
interrupts is simply not practical.


Of course not.  But dealing with a stuck IRQ line by locking up isn't
very practical either.  IRQ sharing is stupid yet universal, and it
happens all the time that a device that has been sitting there minding
its own business since power-up, with no driver to drive it, decides
to assert its IRQ.  Maybe it just got hot-plugged, maybe it just got
its first dribble of input, whatever.  Other devices on the shared IRQ
are screwed (or at least semi-screwed; you could periodically
re-enable the IRQ long enough to make a run through the ISR chain
servicing the other devices).  But if you run lspci (or whatever)
and load a driver for the newly awake device, everything goes back to
normal.

For devices compiled into the kernel, you shouldn't have to play these
games.  If, that is, there were three stages of driver initialization,
called in successive passes:

1) installing an ISR with a fallback STFU path (device-specific but
not dependent on any particular pre-existing chip state), quiescing it
if you know how and registering for the IRQ if you know which it is;

2) going through the chip's soft-reset-wake-up-shut-up cycle and
populating driver data structures, possibly correcting the IRQ
registration along the way;

3) ready-as-we'll-ever-be, bring on the interrupts.

You probably can't help enabling the IRQ briefly during 2) so that you
can do tests like Russell's loopback.  But it's a needless gamble to
do that without doing 1) for all compiled-in drivers and platform
devices first, in a previous discovery pass.  And it's stupid to do 3)
in the same pass as 2), because you'll just open race condition
windows that will only bite when an all-the-way-live device raises its
IRQ at a moment when the writer of the wake-up-shut-up code wasn't
expecting it.  All code has bugs and they're only a problem when they
bite in the field.


If a system has a device that generates interrupts before they're enabled,
and the firmware doesn't fix it, then some platform-specific quirk has to
handle it and shut off the interrupt before it allows any interrupts
to be enabled. (We have such a quirk for certain network controllers where
the boot ROM can leave the chip generating interrupts on bootup.)


You don't need quirks if your driver initialization is bomb-proof to
begin with.  Devices that are quiet on power-up are purely
coincidental and should not be construed.

Cheers,
- Michael
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Serial related oops

2007-02-19 Thread Robert Hancock


Michael K. Edwards wrote:

Of course not.  But dealing with a stuck IRQ line by locking up isn't
very practical either.  IRQ sharing is stupid yet universal, and it


And we don't, that's why we have that nobody cared logic that disables 
the interrupt line if no driver services the interrupt. That doesn't 
provide a clean recovery, of course, it's meant to notify the user of 
what happened so that the problem can be fixed.



happens all the time that a device that has been sitting there minding
its own business since power-up, with no driver to drive it, decides
to assert its IRQ.  Maybe it just got hot-plugged, maybe it just got
its first dribble of input, whatever.  Other devices on the shared IRQ
are screwed (or at least semi-screwed; you could periodically
re-enable the IRQ long enough to make a run through the ISR chain
servicing the other devices).  But if you run lspci (or whatever)
and load a driver for the newly awake device, everything goes back to
normal.

For devices compiled into the kernel, you shouldn't have to play these
games.  If, that is, there were three stages of driver initialization,
called in successive passes:


Exactly, for devices compiled into the kernel. In most setups this is 
only a fraction of all devices, so solving this problem only for drivers 
built into the kernel is no solution.




1) installing an ISR with a fallback STFU path (device-specific but
not dependent on any particular pre-existing chip state), quiescing it
if you know how and registering for the IRQ if you know which it is;

2) going through the chip's soft-reset-wake-up-shut-up cycle and
populating driver data structures, possibly correcting the IRQ
registration along the way;

3) ready-as-we'll-ever-be, bring on the interrupts.

You probably can't help enabling the IRQ briefly during 2) so that you
can do tests like Russell's loopback.  But it's a needless gamble to
do that without doing 1) for all compiled-in drivers and platform
devices first, in a previous discovery pass.  And it's stupid to do 3)
in the same pass as 2), because you'll just open race condition
windows that will only bite when an all-the-way-live device raises its
IRQ at a moment when the writer of the wake-up-shut-up code wasn't
expecting it.  All code has bugs and they're only a problem when they
bite in the field.

If a system has a device that generates interrupts before they're 
enabled,

and the firmware doesn't fix it, then some platform-specific quirk has to
handle it and shut off the interrupt before it allows any interrupts
to be enabled. (We have such a quirk for certain network controllers 
where

the boot ROM can leave the chip generating interrupts on bootup.)


You don't need quirks if your driver initialization is bomb-proof to
begin with.  Devices that are quiet on power-up are purely
coincidental and should not be construed.


It's not coincidental, it is the only sane way to design hardware. You 
just can't go firing off interrupts without a driver having 
intentionally enabled them. There are a few devices that have had such 
issues, but they have been few and far between, certainly not enough to 
warrant the complexity of the scheme you propose.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

84 matches

Mail list logo