Re: Serial related oops
Russell King wrote:
> On Thu, Mar 01, 2007 at 01:33:28PM +, Jose Goncalves wrote:
>> I've also done your suggestion and I've inserted "msleep(10);" just
>> before the "And clear the interrupt registers again for luck." and my
>> application is now running without problems for more than 24H! So,
>> inserting a delay in this point definitely makes some difference (as
>> was with adding some extra printk() in several points of
>> serial8250_startup()).
>>
>> This said, for me, this is definitely a software problem. The question
>> is where?
>
> I'm personally convinced it's hardware because according to my analysis
> your CPU is behaving in a way that the code is not asking it to.

Isn't it possible that an interrupt is hitting just after enabling
interrupts with "serial_outp(up, UART_IER, up->ier);", and that it triggers
the execution of some code that is not reported by the Oops dump (at least
with my current configuration)?

José Gonçalves
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Serial related oops
On Thu, Mar 01, 2007 at 01:33:28PM +, Jose Goncalves wrote:
> I've also done your suggestion and I've inserted "msleep(10);" just
> before the "And clear the interrupt registers again for luck." and my
> application is now running without problems for more than 24H! So,
> inserting a delay in this point definitely makes some difference (as
> was with adding some extra printk() in several points of
> serial8250_startup()).
>
> This said, for me, this is definitely a software problem. The question
> is where?

I'm personally convinced it's hardware because according to my analysis
your CPU is behaving in a way that the code is not asking it to.

Maybe others have some further insight; I certainly don't.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:

Re: Serial related oops
Hi again Russell,

I'm back, after some more testing. Here goes my report.

I've switched to another SBC and the kernel still oopses, so it is not a
one-off fault on the hardware. I've also run memtest86+ on this board for
the maximum period it takes to reach an Oops with my application (24 H) and
it did not detect any fault (in 21 passes).

As I've said earlier, our hardware has an extra serial controller
(TL16C554A). To isolate the problem, I've removed the board with this extra
controller and used only the SBC (Vortex86-6070 -
http://www.icop.com.tw/products_detail.asp?ProductID=70). Still, with that
setup and with my application using only ttyS1, I get a kernel Oops, and
always at the same point:

<1>[43477.986867] Unable to handle kernel NULL pointer dereference at virtual address 0012
<1>[43477.995067] printing eip:
<4>[43478.003087] c01bfa7a
<1>[43478.003116] *pde =
<0>[43478.011231] Oops: [#1]
<4>[43478.019188] Modules linked in:
<0>[43478.027308] CPU:0
<4>[43478.027325] EIP:0060:[<c01bfa7a>]Not tainted VLI
<4>[43478.027341] EFLAGS: 00010202 (2.6.16.41-mtm6-debug1 #1)
<0>[43478.052490] EIP is at serial_in+0xa/0x4a
<0>[43478.061448] eax: 0060 ebx: ecx: edx:
<0>[43478.070945] esi: edi: 0040 ebp: c7237e1c esp: c7237e18
<0>[43478.080720] ds: 007b es: 007b ss: 0068
<0>[43478.090470] Process gp_position (pid: 26205, threadinfo=c7236000 task=c775dab0)
<0>[43478.091319] Stack: <0> c01c0f88 c031fef0 0005 0202
<0>[43478.113464]c717fa1c c031fef0 c124b510 c7237e60 c01bd97d c031fef0 c124b510 c124b510
<0>[43478.126484] c760c52c c7237e7c c01befe7 c124b510 ffed c760c52c
<0>[43478.139984] Call Trace:
<0>[43478.152627] [<c0102a35>] show_stack_log_lvl+0xa5/0xad
<0>[43478.166200] [<c0102b70>] show_registers+0x106/0x16f
<0>[43478.179852] [<c0102d06>] die+0xb6/0x127
<0>[43478.193589] [<c0109677>] do_page_fault+0x380/0x4b3
<0>[43478.207616] [<c01026bf>] error_code+0x4f/0x60
<0>[43478.221803] [<c01c0f88>] serial8250_startup+0x28f/0x2a9
<0>[43478.236340] Code: 38 43 78 75 02 b2 01 89 d0 eb 10 8b 41 70 39 43 70 0f 94 c0 0f b6 c0 eb 02 31 c0 5b 5d c3 90 90 90 55 89 e5 53 8b 5d 08 8b 55 0c <0f> b6 4b 12 0f b6 43 13 d3 e2 83 f8 02 74 1a 7f 05 48 74 09 eb
<4>[43478.322255] BUG: gp_position/26205, lock held at task exit time!
<4>[43478.341721] [c124b528] {uart_register_driver}
<4>[43478.359168] .. held by: gp_position:26205 [c775dab0, 117]
<4>[43478.377112] ... acquired at: uart_get+0x28/0xde

I've also done your suggestion and I've inserted "msleep(10);" just before
the "And clear the interrupt registers again for luck." and my application
is now running without problems for more than 24H! So, inserting a delay in
this point definitely makes some difference (as was with adding some extra
printk() in several points of serial8250_startup()).

This said, for me, this is definitely a software problem. The question is
where?

I would appreciate it if you (or anyone) could give me any pointers on how
to detect the cause of my kernel Oops (perhaps activating extra kernel
debug?)

Thanks,
José Gonçalves

Re: Serial related oops
Russell, thanks again for offering to look at this; the more oopses and
soft lockups I see on this board, the more I think you're right and we
have an IRQ handling race. Here's the struct irqchip setup:

/* mask irq, refer to section 2.6 of the chip 8618 document */
static void mv88w8xx8_mask_irq(unsigned int irq)
{
	MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_CLR, (1 << irq));
}

/* unmask irq, refer to section 2.6 of the chip 8618 document */
static void mv88w8xx8_unmask_irq(unsigned int irq)
{
	MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_SET, (1 << irq));
}

/* ack to CPU interrupts and also individual timer interrupts */
static void mv88w8xx8_mask_ack_irq(unsigned int irq)
{
	mv88w8xx8_mask_irq(irq);

	if (irq < IRQ_TIMER1 || irq > IRQ_TIMER4)
		return;

	/* write 0 to clear interrupt and re-enable further interrupts */
	MV88W8XX8_REG_WRITE(MV88W8XX8_TIMER_INT_SOURCE, ~(1 << (irq - 4)));
}

static struct irqchip mv88w8xx8_chip = {
	.ack	= mv88w8xx8_mask_ack_irq,
	.mask	= mv88w8xx8_mask_irq,
	.unmask	= mv88w8xx8_unmask_irq,
};

/**
 * called by core.c to initialize the IRQ module
 */
void mv88w8xx8_init_irq(void)
{
	int irq;

	for (irq = 0; irq < NR_IRQS; irq++) {
		set_irq_chip(irq, &mv88w8xx8_chip);
		set_irq_handler(irq, do_level_IRQ);
		set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
	}
}

Re: Serial related oops
On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
> What I find real hard to understand is why a hardware fault happens
> always in the same software instruction! I would expect a hardware fault
> to hit randomly...

I've experienced just such a hardware fault.

The Infineon DSCC4 serial controller has a hardware bug in the PCI
request/grant handling that can lead to the device driving the PCI bus in
conflict with another device. While the results were random (as the oops
in this problem seem to be), the trigger was always activating certain
devices in combination.

In your case, altering the timing/behavior of the serial device during
open may be provoking the hardware fault.

--
Paul Fulghum
Microgate Systems, Ltd.

Re: Serial related oops
Quoting Russell King <[EMAIL PROTECTED]>:

> On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:
>> Russell King wrote:
>>> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>>>> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
>>>> to us, at least on an ARM target ...
>>>
>>> That's ruled out. Please think about it for a moment - serial_in()
>>> managed to work correctly most of the time, and then spontaneously
>>> changes its well-defined ABI behaviour in a way that analysis of the
>>> asm doesn't allow it to.
>>
>> I'm using gcc 3.4.6.
>> But I agree with Russell, if it was such a problem it would hit on the
>> first iteration of my application and not after 1 day of executing the
>> same piece of code...
>
> One thing you might think about is running memtest86 on the machine
> for the same kind of time interval, just in case it's something
> trivial like bad ram.

OK. That's another thing to do. Meanwhile I've switched to another SBC and
I'm now running my application on the new unit. Let's wait and see...

José Gonçalves

Re: Serial related oops
Quoting Russell King <[EMAIL PROTECTED]>:

> On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
>> It could be a silly question (bear with me as I'm not familiar with
>> such low level programming), but couldn't it be possible for an
>> interrupt to hit in the middle of the serial_in() calls and mess with
>> %ebx?
>
> I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
> have random crashes everywhere - userspace, kernel space in
> unpredictable ways.
>
>> What I find really hard to understand is why a hardware fault happens
>> always in the same software instruction! I would expect a hardware
>> fault to hit randomly...
>
> Well, compared with your previous report, your latest report is
> different. Your first report had both EIP and %ebx being zero (because
> they got corrupted when returning from serial_in). This time only %ebx
> was corrupted. Consequently, this time we oopsed in the subsequent
> serial_in() rather than trying to return to serial8250_startup() as
> last time.

But there was also another difference. I configured the kernel to produce
more debug info. This should influence the Oops report...

>> I left my application running this night, with a 2.6.16.41 kernel
>> unpatched on the serial driver (my last Oops report was with Frederik's
>> patch to remove the insertion made in 2.6.12) and it crashed again on
>> exactly the same point!
>
> From that I take it that you removed the test in serial8250_startup
> which sets UART_BUG_TXEN, and the problem persisted. That tends to
> suggest that it's not the culprit.

By that I mean that with or without this code -
http://lkml.org/lkml/2007/2/19/124 - the problem persisted. The difference
is that, without it, the crashes happen more sparsely.

José Gonçalves

Re: Serial related oops
On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
> It could be a silly question (bear with me as I'm not familiar with
> such low level programming), but couldn't it be possible for an interrupt
> to hit in the middle of the serial_in() calls and mess with %ebx?

I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
have random crashes everywhere - userspace, kernel space in unpredictable
ways.

> What I find really hard to understand is why a hardware fault happens
> always in the same software instruction! I would expect a hardware fault
> to hit randomly...

Well, compared with your previous report, your latest report is
different. Your first report had both EIP and %ebx being zero (because
they got corrupted when returning from serial_in). This time only %ebx
was corrupted. Consequently, this time we oopsed in the subsequent
serial_in() rather than trying to return to serial8250_startup() as
last time.

> I left my application running this night, with a 2.6.16.41 kernel
> unpatched on the serial driver (my last Oops report was with Frederik's
> patch to remove the insertion made in 2.6.12) and it crashed again on
> exactly the same point!

From that I take it that you removed the test in serial8250_startup
which sets UART_BUG_TXEN, and the problem persisted. That tends to
suggest that it's not the culprit.

> > For all we know, it could be a one-off fault on the hardware you
> > happen to have - other identical units may not behave the same (can
> > you check?)
>
> Yes I have other units that I can test it on. I'll do that to see if
> it's really a one-off fault on the hardware.

Would be nice to know.

> If it continues to crash with other units I will then test with the
> msleep(10) before the "And clear the interrupt registers again for
> luck.", as you suggested earlier.
>
> > If it is a one off case, you are welcome to patch that test out in
> > your kernel build to remove the problem, and if it's an isolated case
> > I encourage you to do this. This is one of the great advantages of
> > open source - if you hit such a problem rather than throwing the
> > hardware away you can work around such issues.
>
> I didn't understand what you mean by "you are welcome to patch that test
> out in your kernel build to remove the problem". Which test are you
> talking about?

The one which sets UART_BUG_TXEN.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:

Re: Serial related oops
On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:
> Russell King wrote:
> > On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
> >> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
> >> to us, at least on an ARM target ...
> >
> > That's ruled out. Please think about it for a moment - serial_in()
> > managed to work correctly most of the time, and then spontaneously
> > changes its well-defined ABI behaviour in a way that analysis of the
> > asm doesn't allow it to.
>
> I'm using gcc 3.4.6.
> But I agree with Russell, if it was such a problem it would hit on the
> first iteration of my application and not after 1 day of executing the
> same piece of code...

One thing you might think about is running memtest86 on the machine
for the same kind of time interval, just in case it's something
trivial like bad ram.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:

Re: Serial related oops
Russell King wrote:
> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
>> to us, at least on an ARM target ...
>
> That's ruled out. Please think about it for a moment - serial_in()
> managed to work correctly most of the time, and then spontaneously
> changes its well-defined ABI behaviour in a way that analysis of the
> asm doesn't allow it to.

I'm using gcc 3.4.6.
But I agree with Russell, if it was such a problem it would hit on the
first iteration of my application and not after 1 day of executing the
same piece of code...

Regards,
José Gonçalves

Re: Serial related oops
Russell King wrote:
> On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
>> <1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 0012
>> <1>[18840.313046] printing eip:
>> <4>[18840.321687] c01bfa7a
>> <1>[18840.321714] *pde =
>> <0>[18840.331287] Oops: [#1]
>> <4>[18840.340687] Modules linked in:
>> <0>[18840.349749] CPU:0
>> <4>[18840.349767] EIP:0060:[<c01bfa7a>]Not tainted VLI
>> <4>[18840.349782] EFLAGS: 00010202 (2.6.16.41-mtm5-debug1 #1)
>> <0>[18840.377277] EIP is at serial_in+0xa/0x4a
>> <0>[18840.387221] eax: 0060 ebx: ecx: edx:
>> <0>[18840.397805] esi: edi: 0040 ebp: c728fe1c esp: c728fe18
>> <0>[18840.408579] ds: 007b es: 007b ss: 0068
>> <0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
>> <0>[18840.420509] Stack: <0> c01c0f88 c031fef0 0005 0202
>> <0>[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510
>> <0>[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 ffed c773dbcc
>
> Okay, this one is even more plainly "not a coding error".
>
>> <0>[18840.566645] [<c01c0f88>] serial8250_startup+0x28f/0x2a9
>
> The code around this point (with the return point marked) is:
>
>> c01c0f78:    6a 05             push $0x5
>> c01c0f7a:    53                push %ebx
>> c01c0f7b:    e8 f0 ea ff ff    call c01bfa70 <serial_in>
>> c01c0f80:    6a 00             push $0x0
>> c01c0f82:    53                push %ebx
>> c01c0f83:    e8 e8 ea ff ff    call c01bfa70 <serial_in>
>> c01c0f88<<<  6a 02             push $0x2
>> c01c0f8a:    53                push %ebx
>> c01c0f8b:    e8 e0 ea ff ff    call c01bfa70 <serial_in>
>
> and corresponds with this C code:
>
>	(void) serial_inp(up, UART_LSR);
>	(void) serial_inp(up, UART_RX);
>	(void) serial_inp(up, UART_IIR);
>
> Now let's look at the words pushed on the stack around this code:
>
> c01c0f88 <- return address for serial_in (serial8250_startup+0x28f/0x2a9)
>          <- from push %ebx at c01c0f82
>          <- from push $0x0 at c01c0f80
> c031fef0 <- from push %ebx at c01c0f7a
> 0005     <- from push $0x5 at c01c0f78
>
> Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> First thing to notice is this violates the C code - "up" can not
> change.
>
> Now let's look at serial_in:
>
> c01bfa70:    55       push %ebp
> c01bfa71:    89 e5    mov  %esp,%ebp
> c01bfa73:    53       push %ebx
> ...
> c01bfab7:    5b       pop  %ebx
> c01bfab8:    5d       pop  %ebp
> c01bfab9:    c3       ret
>
> This code tells the CPU to preserve %ebx and %ebp. But we know %ebx
> _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> told it to do.
>
> Moreover, serial_in() has preserved %ebx in the past otherwise we'd
> never have got past all the other serial_in()s in serial8250_startup().
>
> So I think it's very demonstrably a hardware fault, and not software
> related.

It could be a silly question (bear with me as I'm not familiar with such
low level programming), but couldn't it be possible for an interrupt to hit
in the middle of the serial_in() calls and mess with %ebx?

What I find really hard to understand is why a hardware fault happens
always in the same software instruction! I would expect a hardware fault
to hit randomly...

I left my application running this night, with a 2.6.16.41 kernel unpatched
on the serial driver (my last Oops report was with Frederik's patch to
remove the insertion made in 2.6.12) and it crashed again on exactly the
same point!

> For all we know, it could be a one-off fault on the hardware you
> happen to have - other identical units may not behave the same (can
> you check?)

Yes I have other units that I can test it on. I'll do that to see if it's
really a one-off fault on the hardware.

If it continues to crash with other units I will then test with the
msleep(10) before the "And clear the interrupt registers again for luck.",
as you suggested earlier.

> If it is a one off case, you are welcome to patch that test out in
> your kernel build to remove the problem, and if it's an isolated case
> I encourage you to do this. This is one of the great advantages of
> open source - if you hit such a problem rather than throwing the
> hardware away you can work around such issues.

I didn't understand what you mean by "you are welcome to patch that test
out in your kernel build to remove the problem". Which test are you
talking about?

Regards,
José Gonçalves

Re: Serial related oops
On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
> to us, at least on an ARM target ...

That's ruled out. Please think about it for a moment - serial_in()
managed to work correctly most of the time, and then spontaneously
changes its well-defined ABI behaviour in a way that analysis of the
asm doesn't allow it to.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:

Re: Serial related oops
On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
> Russell King wrote:
> >Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> >First thing to notice is this violates the C code - "up" can not
> >change.
> >
> >Now let's look at serial_in:
> >
> >c01bfa70:    55       push %ebp
> >c01bfa71:    89 e5    mov  %esp,%ebp
> >c01bfa73:    53       push %ebx
> >...
> >c01bfab7:    5b       pop  %ebx
> >c01bfab8:    5d       pop  %ebp
> >c01bfab9:    c3       ret
> >
> >This code tells the CPU to preserves %ebx and %ebp. But we know %ebx
> >_wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> >told it to do.
>
> ... assuming nothing else clobbered the stack slot (which would be a
> compiler error, or a wild pointer.)
>
> Got a disassembly of the whole function?

See Jose's subsequent message to the one I replied to.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:

Re: Serial related oops
On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote: Russell King wrote: Plainly, %ebx changed across the call to serial_in() at c01c0f7b. First thing to notice is this violates the C code - up can not change. Now let's look at serial_in: c01bfa70: 55 push %ebp c01bfa71: 89 e5 mov%esp,%ebp c01bfa73: 53 push %ebx ... c01bfab7: 5b pop%ebx c01bfab8: 5d pop%ebp c01bfab9: c3 ret This code tells the CPU to preserves %ebx and %ebp. But we know %ebx _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code told it to do. ... assuming nothing else clobbered the stack slot (which would be a compiler error, or a wild pointer.) Got a disassembly of the whole function? See Jose's subsequent message to the one I replied to. -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops
On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote: Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things to us, at least on an ARM target ... That's ruled out. Please think about it for a moment - serial_in() managed to work correctly most of the time, and then spontaneously changes its well-defined ABI behaviour in a way that analysis of the asm doesn't allow it to. -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops
Russell King wrote:
> On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
>> <1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 0012
>> <1>[18840.313046] printing eip:
>> <4>[18840.321687] c01bfa7a
>> <1>[18840.321714] *pde =
>> <0>[18840.331287] Oops: [#1]
>> <4>[18840.340687] Modules linked in:
>> <0>[18840.349749] CPU: 0
>> <4>[18840.349767] EIP: 0060:[<c01bfa7a>] Not tainted VLI
>> <4>[18840.349782] EFLAGS: 00010202 (2.6.16.41-mtm5-debug1 #1)
>> <0>[18840.377277] EIP is at serial_in+0xa/0x4a
>> <0>[18840.387221] eax: 0060 ebx: ecx: edx:
>> <0>[18840.397805] esi: edi: 0040 ebp: c728fe1c esp: c728fe18
>> <0>[18840.408579] ds: 007b es: 007b ss: 0068
>> <0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
>> <0>[18840.420509] Stack: <0> c01c0f88 c031fef0 0005 0202
>> <0>[18840.445655] c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510
>> <0>[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 ffed c773dbcc
>
> Okay, this one is even more plainly "not a coding error".
>
>> <0>[18840.566645] [<c01c0f88>] serial8250_startup+0x28f/0x2a9
>
> The code around this point (with the return point marked) is:
>
>> c01c0f78:    6a 05             push $0x5
>> c01c0f7a:    53                push %ebx
>> c01c0f7b:    e8 f0 ea ff ff    call c01bfa70 <serial_in>
>> c01c0f80:    6a 00             push $0x0
>> c01c0f82:    53                push %ebx
>> c01c0f83:    e8 e8 ea ff ff    call c01bfa70 <serial_in>
>> c01c0f88<<<  6a 02             push $0x2
>> c01c0f8a:    53                push %ebx
>> c01c0f8b:    e8 e0 ea ff ff    call c01bfa70 <serial_in>
>
> and corresponds with this C code:
>
>     (void) serial_inp(up, UART_LSR);
>     (void) serial_inp(up, UART_RX);
>     (void) serial_inp(up, UART_IIR);
>
> Now let's look at the words pushed on the stack around this code:
>
>     c01c0f88 <- return address for serial_in (serial8250_startup+0x28f/0x2a9)
>              <- from push %ebx at c01c0f82
>              <- from push $0x0 at c01c0f80
>     c031fef0 <- from push %ebx at c01c0f7a
>     0005     <- from push $0x5 at c01c0f78
>
> Plainly, %ebx changed across the call to serial_in() at c01c0f7b. First thing to notice is this violates the C code - "up" can not change.
>
> Now let's look at serial_in:
>
>     c01bfa70: 55      push %ebp
>     c01bfa71: 89 e5   mov  %esp,%ebp
>     c01bfa73: 53      push %ebx
>     ...
>     c01bfab7: 5b      pop  %ebx
>     c01bfab8: 5d      pop  %ebp
>     c01bfab9: c3      ret
>
> This code tells the CPU to preserve %ebx and %ebp. But we know %ebx _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code told it to do.
>
> Moreover, serial_in() has preserved %ebx in the past, otherwise we'd never have got past all the other serial_in()s in serial8250_startup().
>
> So I think it's very demonstrably a hardware fault, and not software related.

It could be a silly question (bear with me, as I'm not familiar with such low-level programming), but couldn't it be possible for an interrupt to hit in the middle of the serial_in() calls and mess with %ebx?

What I find really hard to understand is why a hardware fault always happens at the same software instruction! I would expect a hardware fault to hit randomly...

I left my application running last night, with a 2.6.16.41 kernel unpatched in the serial driver (my last Oops report was with Frederik's patch to remove the insertion made in 2.6.12), and it crashed again at exactly the same point!

> For all we know, it could be a one-off fault on the hardware you happen to have - other identical units may not behave the same (can you check?)

Yes, I have other units that I can test. I'll do that to see if it's really a one-off fault on the hardware. If it continues to crash with other units I will then test with the msleep(10) before the "And clear the interrupt registers again for luck." comment, as you suggested earlier.

> If it is a one off case, you are welcome to patch that test out in your kernel build to remove the problem, and if it's an isolated case I encourage you to do this. This is one of the great advantages of open source - if you hit such a problem rather than throwing the hardware away you can work around such issues.

I didn't understand what you mean by "you are welcome to patch that test out in your kernel build to remove the problem". Which test are you talking about?

Regards,
José Gonçalves
Re: Serial related oops
Russell King wrote:
> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things to us, at least on an ARM target ...
>
> That's ruled out. Please think about it for a moment - serial_in() managed to work correctly most of the time, and then spontaneously changes its well-defined ABI behaviour in a way that analysis of the asm doesn't allow it to.

I'm using gcc 3.4.6. But I agree with Russell: if it were such a problem, it would hit on the first iteration of my application and not after 1 day of executing the same piece of code...

Regards,
José Gonçalves
Re: Serial related oops
On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:
> Russell King wrote:
>> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>>> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things to us, at least on an ARM target ...
>>
>> That's ruled out. Please think about it for a moment - serial_in() managed to work correctly most of the time, and then spontaneously changes its well-defined ABI behaviour in a way that analysis of the asm doesn't allow it to.
>
> I'm using gcc 3.4.6. But I agree with Russell, if it was such a problem it would hit on the first iteration of my application and not after 1 day of executing the same piece of code...

One thing you might think about is running memtest86 on the machine for the same kind of time interval, just in case it's something trivial like bad RAM.

--
Russell King
Re: Serial related oops
On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
> It could be a silly question (bear with me, as I'm not familiar with such low-level programming), but couldn't it be possible for an interrupt to hit in the middle of the serial_in() calls and mess with %ebx?

I'm no expert on x86, but if an interrupt was messing with %ebx, you'd have random crashes everywhere - userspace, kernel space - in unpredictable ways.

> What I find really hard to understand is why a hardware fault always happens at the same software instruction! I would expect a hardware fault to hit randomly...

Well, compared with your previous report, your latest report is different. Your first report had both EIP and %ebx being zero (because they got corrupted when returning from serial_in). This time only %ebx was corrupted. Consequently, this time we oopsed in the subsequent serial_in() rather than trying to return to serial8250_startup() as last time.

> I left my application running last night, with a 2.6.16.41 kernel unpatched in the serial driver (my last Oops report was with Frederik's patch to remove the insertion made in 2.6.12), and it crashed again at exactly the same point!

From that I take it that you removed the test in serial8250_startup which sets UART_BUG_TXEN, and the problem persisted. That tends to suggest that it's not the culprit.

>> For all we know, it could be a one-off fault on the hardware you happen to have - other identical units may not behave the same (can you check?)
>
> Yes, I have other units that I can test. I'll do that to see if it's really a one-off fault on the hardware.

Would be nice to know.

> If it continues to crash with other units I will then test with the msleep(10) before the "And clear the interrupt registers again for luck." comment, as you suggested earlier.
>
>> If it is a one off case, you are welcome to patch that test out in your kernel build to remove the problem, and if it's an isolated case I encourage you to do this. This is one of the great advantages of open source - if you hit such a problem rather than throwing the hardware away you can work around such issues.
>
> I didn't understand what you mean by "you are welcome to patch that test out in your kernel build to remove the problem". Which test are you talking about?

The one which sets UART_BUG_TXEN.

--
Russell King
Re: Serial related oops
Quoting Russell King [EMAIL PROTECTED]:
> On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
>> It could be a silly question (bear with me, as I'm not familiar with such low-level programming), but couldn't it be possible for an interrupt to hit in the middle of the serial_in() calls and mess with %ebx?
>
> I'm no expert on x86, but if an interrupt was messing with %ebx, you'd have random crashes everywhere - userspace, kernel space - in unpredictable ways.
>
>> What I find really hard to understand is why a hardware fault always happens at the same software instruction! I would expect a hardware fault to hit randomly...
>
> Well, compared with your previous report, your latest report is different. Your first report had both EIP and %ebx being zero (because they got corrupted when returning from serial_in). This time only %ebx was corrupted. Consequently, this time we oopsed in the subsequent serial_in() rather than trying to return to serial8250_startup() as last time.

But there was also another difference: I configured the kernel to produce more debug info. This should influence the Oops report...

>> I left my application running last night, with a 2.6.16.41 kernel unpatched in the serial driver (my last Oops report was with Frederik's patch to remove the insertion made in 2.6.12), and it crashed again at exactly the same point!
>
> From that I take it that you removed the test in serial8250_startup which sets UART_BUG_TXEN, and the problem persisted. That tends to suggest that it's not the culprit.

What I mean is that with or without this code - http://lkml.org/lkml/2007/2/19/124 - the problem persisted. The difference is that, without it, the crashes happen more sparsely.

José Gonçalves
Re: Serial related oops
Quoting Russell King [EMAIL PROTECTED]:
> On Thu, Feb 22, 2007 at 03:07:18PM +, Jose Goncalves wrote:
>> Russell King wrote:
>>> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>>>> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things to us, at least on an ARM target ...
>>>
>>> That's ruled out. Please think about it for a moment - serial_in() managed to work correctly most of the time, and then spontaneously changes its well-defined ABI behaviour in a way that analysis of the asm doesn't allow it to.
>>
>> I'm using gcc 3.4.6. But I agree with Russell, if it was such a problem it would hit on the first iteration of my application and not after 1 day of executing the same piece of code...
>
> One thing you might think about is running memtest86 on the machine for the same kind of time interval, just in case it's something trivial like bad ram.

OK. That's another thing to do. Meanwhile I've switched to another SBC and I'm now running my application on the new unit. Let's wait and see...

José Gonçalves
Re: Serial related oops
On Thu, Feb 22, 2007 at 03:02:46PM +, Jose Goncalves wrote:
> What I find real hard to understand is why a hardware fault happens always in the same software instruction! I would expect a hardware fault to hit randomly...

I've experienced just such a hardware fault. The Infineon DSCC4 serial controller has a hardware bug in the PCI request/grant handling that can lead to the device driving the PCI bus in conflict with another device.

While the results were random (as the oops in this problem seems to be), the trigger was always activating certain devices in combination. In your case, altering the timing/behavior of the serial device during open may be provoking the hardware fault.

--
Paul Fulghum
Microgate Systems, Ltd.
Re: Serial related oops
On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
> Russell King wrote:
> >
> >Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> >First thing to notice is this violates the C code - "up" can not
> >change.
> >
> >Now let's look at serial_in:
> >
> >c01bfa70: 55      push %ebp
> >c01bfa71: 89 e5   mov  %esp,%ebp
> >c01bfa73: 53      push %ebx
> >...
> >c01bfab7: 5b      pop  %ebx
> >c01bfab8: 5d      pop  %ebp
> >c01bfab9: c3      ret
> >
> >This code tells the CPU to preserves %ebx and %ebp. But we know %ebx
> >_wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> >told it to do.
> >
>
> ... assuming nothing else clobbered the stack slot (which would be a compiler
> error, or a wild pointer.)
>
> Got a disassembly of the whole function?
>

Jose posted it higher in the thread: http://lkml.org/lkml/2007/2/21/139

Regards,
Frederik
Re: Serial related oops
Russell King wrote:
> Plainly, %ebx changed across the call to serial_in() at c01c0f7b. First thing to notice is this violates the C code - "up" can not change.
>
> Now let's look at serial_in:
>
>     c01bfa70: 55      push %ebp
>     c01bfa71: 89 e5   mov  %esp,%ebp
>     c01bfa73: 53      push %ebx
>     ...
>     c01bfab7: 5b      pop  %ebx
>     c01bfab8: 5d      pop  %ebp
>     c01bfab9: c3      ret
>
> This code tells the CPU to preserves %ebx and %ebp. But we know %ebx _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code told it to do.

... assuming nothing else clobbered the stack slot (which would be a compiler error, or a wild pointer.)

Got a disassembly of the whole function?

-hpa
Re: Serial related oops
Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things to us, at least on an ARM target ...
Re: Serial related oops
On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
> <1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 0012
> <1>[18840.313046] printing eip:
> <4>[18840.321687] c01bfa7a
> <1>[18840.321714] *pde =
> <0>[18840.331287] Oops: [#1]
> <4>[18840.340687] Modules linked in:
> <0>[18840.349749] CPU: 0
> <4>[18840.349767] EIP: 0060:[<c01bfa7a>] Not tainted VLI
> <4>[18840.349782] EFLAGS: 00010202 (2.6.16.41-mtm5-debug1 #1)
> <0>[18840.377277] EIP is at serial_in+0xa/0x4a
> <0>[18840.387221] eax: 0060 ebx: ecx: edx:
> <0>[18840.397805] esi: edi: 0040 ebp: c728fe1c esp: c728fe18
> <0>[18840.408579] ds: 007b es: 007b ss: 0068
> <0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
> <0>[18840.420509] Stack: <0> c01c0f88 c031fef0 0005 0202
> <0>[18840.445655] c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510
> <0>[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 ffed c773dbcc

Okay, this one is even more plainly "not a coding error".

> <0>[18840.566645] [<c01c0f88>] serial8250_startup+0x28f/0x2a9

The code around this point (with the return point marked) is:

> c01c0f78:    6a 05             push $0x5
> c01c0f7a:    53                push %ebx
> c01c0f7b:    e8 f0 ea ff ff    call c01bfa70 <serial_in>
> c01c0f80:    6a 00             push $0x0
> c01c0f82:    53                push %ebx
> c01c0f83:    e8 e8 ea ff ff    call c01bfa70 <serial_in>
> c01c0f88<<<  6a 02             push $0x2
> c01c0f8a:    53                push %ebx
> c01c0f8b:    e8 e0 ea ff ff    call c01bfa70 <serial_in>

and corresponds with this C code:

    (void) serial_inp(up, UART_LSR);
    (void) serial_inp(up, UART_RX);
    (void) serial_inp(up, UART_IIR);

Now let's look at the words pushed on the stack around this code:

    c01c0f88 <- return address for serial_in (serial8250_startup+0x28f/0x2a9)
             <- from push %ebx at c01c0f82
             <- from push $0x0 at c01c0f80
    c031fef0 <- from push %ebx at c01c0f7a
    0005     <- from push $0x5 at c01c0f78

Plainly, %ebx changed across the call to serial_in() at c01c0f7b. First thing to notice is this violates the C code - "up" can not change.

Now let's look at serial_in:

    c01bfa70: 55      push %ebp
    c01bfa71: 89 e5   mov  %esp,%ebp
    c01bfa73: 53      push %ebx
    ...
    c01bfab7: 5b      pop  %ebx
    c01bfab8: 5d      pop  %ebp
    c01bfab9: c3      ret

This code tells the CPU to preserve %ebx and %ebp. But we know %ebx _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code told it to do.

Moreover, serial_in() has preserved %ebx in the past, otherwise we'd never have got past all the other serial_in()s in serial8250_startup().

So I think it's very demonstrably a hardware fault, and not software related.

For all we know, it could be a one-off fault on the hardware you happen to have - other identical units may not behave the same (can you check?)

If it is a one off case, you are welcome to patch that test out in your kernel build to remove the problem, and if it's an isolated case I encourage you to do this. This is one of the great advantages of open source - if you hit such a problem rather than throwing the hardware away you can work around such issues.

--
Russell King
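The address arithmetic in the analysis above can be cross-checked mechanically. The following is only a small user-space sketch, not kernel code: the helper names (call_rel32, call_return_address, call_target, check_oops_numbers) are made up for illustration, and the only inputs are addresses and opcode bytes copied from the dumps posted in this thread. It relies on the standard x86 encoding of a near call (opcode e8 followed by a little-endian 32-bit displacement relative to the next instruction).

```c
#include <assert.h>
#include <stdint.h>

/* Decode the little-endian rel32 operand that follows an e8 opcode. */
uint32_t call_rel32(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
	return (uint32_t)b0 | ((uint32_t)b1 << 8) |
	       ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* A near call (e8 + rel32) is 5 bytes long, so it pushes insn_addr + 5. */
uint32_t call_return_address(uint32_t insn_addr)
{
	return insn_addr + 5;
}

/*
 * The displacement is relative to the next instruction. Unsigned
 * arithmetic wraps modulo 2^32, which handles negative displacements.
 */
uint32_t call_target(uint32_t insn_addr, uint32_t rel32)
{
	return call_return_address(insn_addr) + rel32;
}

/* Cross-check the numbers quoted in the thread (hypothetical helper). */
void check_oops_numbers(void)
{
	const uint32_t serial_in = 0xc01bfa70u;          /* start of serial_in dump */
	const uint32_t serial8250_startup = 0xc01c0cf9u; /* start of startup dump */

	/* All three calls in the quoted snippet target serial_in. */
	assert(call_target(0xc01c0f7bu, call_rel32(0xf0, 0xea, 0xff, 0xff)) == serial_in);
	assert(call_target(0xc01c0f83u, call_rel32(0xe8, 0xea, 0xff, 0xff)) == serial_in);
	assert(call_target(0xc01c0f8bu, call_rel32(0xe0, 0xea, 0xff, 0xff)) == serial_in);

	/* The stack word c01c0f88 is the return address of the second call... */
	assert(call_return_address(0xc01c0f83u) == 0xc01c0f88u);
	/* ...which the oops prints as serial8250_startup+0x28f. */
	assert(0xc01c0f88u - serial8250_startup == 0x28fu);

	/* The faulting EIP c01bfa7a is serial_in+0xa. */
	assert(serial_in + 0xau == 0xc01bfa7au);
}
```

Calling check_oops_numbers() from a test program passes silently if the dump is self-consistent, which supports the reading that the oops happened inside the second serial_in() call, the one for UART_RX.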
Re: Serial related oops
On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
> New developments.
> I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
> removed the change made in http://lkml.org/lkml/2005/6/23/266 and
> activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
> CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
> CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
> (thanks to vda for pointing me to the right doc.).
> At first it seemed to work fine, but after some days of continuous
> running I've got another kernel Oops!
> I attach the dmesg output and the assembly dump of serial8250_startup()
> and serial8250_shutdown().
>

As suspected by Russell, the badness seems to happen just at the end of the serial_inp on LSR, drivers/serial/8250.c:1650. The NULL deref happens at the beginning of the serial_inp(up, UART_RX) call, when trying to dereference *up.

c01bfa70 <serial_in>:
c01bfa70: 55            push   %ebp
c01bfa71: 89 e5         mov    %esp,%ebp
c01bfa73: 53            push   %ebx
c01bfa74: 8b 5d 08      mov    0x8(%ebp),%ebx    << %ebx = up (which is NULL)
c01bfa77: 8b 55 0c      mov    0xc(%ebp),%edx
c01bfa7a: 0f b6 4b 12   movzbl 0x12(%ebx),%ecx   << %ecx = *(%ebx+0x12) Oops
c01bfa7e: 0f b6 43 13   movzbl 0x13(%ebx),%eax

It seems that somehow, the pop %ebx at the end of the serial_inp(up, UART_LSR) function popped a NULL value instead of the expected pointer. Any suggestion on how this could happen?

Jose, did you try to msleep(10) before the "And clear the interrupt registers again for luck." as suggested by Russell? You should also revert the change I suggested, it seems I missed the target by a few lines of code :).

Regards,
Frederik

diff --git a/drivers/serial/8250.c b/drivers/serial/8250.c
index 7aca22c..385cc51 100644
--- a/drivers/serial/8250.c
+++ b/drivers/serial/8250.c
@@ -1643,6 +1643,7 @@ static int serial8250_startup(struct uart_port *port)
 		(void) inb_p(icp);
 	}
 
+	msleep(10);
 	/*
 	 * And clear the interrupt registers again for luck.
 	 */
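Why a NULL "up" shows up as a fault at virtual address 0012 can be illustrated in user space. This is a hedged sketch only: struct port_model is a hypothetical stand-in, not the kernel's structure. The two offsets 0x12 and 0x13 are taken from the movzbl instructions in the disassembly; the field names regshift and iotype and the padding are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: only the offsets 0x12 and 0x13 come from the
 * disassembly. The field names and the padding are assumptions.
 */
struct port_model {
	unsigned char pad[0x12];
	unsigned char regshift;	/* loaded by movzbl 0x12(%ebx),%ecx */
	unsigned char iotype;	/* loaded by movzbl 0x13(%ebx),%eax */
};

/* The model must place the two fields where the instructions read them. */
static_assert(offsetof(struct port_model, regshift) == 0x12, "regshift offset");
static_assert(offsetof(struct port_model, iotype) == 0x13, "iotype offset");

/* Address the CPU touches when it loads a field through a base pointer. */
uintptr_t field_address(uintptr_t base, size_t field_offset)
{
	return base + field_offset;
}
```

With a base of 0, field_address(0, offsetof(struct port_model, regshift)) evaluates to 0x12, which matches the faulting address in the oops, and the first load at serial_in+0xa is the one that traps. With the valid pointer c031fef0 seen earlier on the stack, the same load would touch c031ff02 instead.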
Re: Serial related oops
New developments.

I have upgraded to 2.6.16.41, applied a patch sent by Frederik that removed the change made in http://lkml.org/lkml/2005/6/23/266 and activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL, CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB, CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING (thanks to vda for pointing me to the right doc.).

At first it seemed to work fine, but after some days of continuous running I've got another kernel Oops!

I attach the dmesg output and the assembly dump of serial8250_startup() and serial8250_shutdown().

Regards,
José Gonçalves

<1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 0012
<1>[18840.313046] printing eip:
<4>[18840.321687] c01bfa7a
<1>[18840.321714] *pde =
<0>[18840.331287] Oops: [#1]
<4>[18840.340687] Modules linked in:
<0>[18840.349749] CPU: 0
<4>[18840.349767] EIP: 0060:[<c01bfa7a>] Not tainted VLI
<4>[18840.349782] EFLAGS: 00010202 (2.6.16.41-mtm5-debug1 #1)
<0>[18840.377277] EIP is at serial_in+0xa/0x4a
<0>[18840.387221] eax: 0060 ebx: ecx: edx:
<0>[18840.397805] esi: edi: 0040 ebp: c728fe1c esp: c728fe18
<0>[18840.408579] ds: 007b es: 007b ss: 0068
<0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
<0>[18840.420509] Stack: <0> c01c0f88 c031fef0 0005 0202
<0>[18840.445655] c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510
<0>[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 ffed c773dbcc
<0>[18840.475892] Call Trace:
<0>[18840.490039] [<c0102a35>] show_stack_log_lvl+0xa5/0xad
<0>[18840.504944] [<c0102b70>] show_registers+0x106/0x16f
<0>[18840.520104] [<c0102d06>] die+0xb6/0x127
<0>[18840.535497] [<c0109677>] do_page_fault+0x380/0x4b3
<0>[18840.550828] [<c01026bf>] error_code+0x4f/0x60
<0>[18840.566645] [<c01c0f88>] serial8250_startup+0x28f/0x2a9
<0>[18840.582471] Code: 38 43 78 75 02 b2 01 89 d0 eb 10 8b 41 70 39 43 70 0f 94 c0 0f b6 c0 eb 02 31 c0 5b 5d c3 90 90 90 55 89 e5 53 8b 5d 08 8b 55 0c <0f> b6 4b 12 0f b6 43 13 d3 e2 83 f8 02 74 1a 7f 05 48 74 09 eb
<4>[18840.680471] BUG: gp_position/11629, lock held at task exit time!
<4>[18840.702808] [c124b528] {uart_register_driver}
<4>[18840.722346] .. held by: gp_position:11629 [c7443a90, 117]
<4>[18840.742113] ... acquired at: uart_get+0x28/0xde

vmlinux-2.6.16.41-mtm5-debug1: file format elf32-i386

Disassembly of section .text:

c01c0cf9 <serial8250_startup>:
c01c0cf9: 55                     push   %ebp
c01c0cfa: 89 e5                  mov    %esp,%ebp
c01c0cfc: 57                     push   %edi
c01c0cfd: 56                     push   %esi
c01c0cfe: 53                     push   %ebx
c01c0cff: 53                     push   %ebx
c01c0d00: 8b 5d 08               mov    0x8(%ebp),%ebx
c01c0d03: 8b 43 60               mov    0x60(%ebx),%eax
c01c0d06: c6 83 a7 00 00 00 00   movb   $0x0,0xa7(%ebx)
c01c0d0d: 89 c2                  mov    %eax,%edx
c01c0d0f: c1 e2 04               shl    $0x4,%edx
c01c0d12: 83 f8 0a               cmp    $0xa,%eax
c01c0d15: 8b 92 ac 37 25 c0      mov    0xc02537ac(%edx),%edx
c01c0d1b: 66 89 93 9c 00 00 00   mov    %dx,0x9c(%ebx)
c01c0d22: 75 66                  jne    c01c0d8a <serial8250_startup+0x91>
c01c0d24: c6 83 a4 00 00 00 00   movb   $0x0,0xa4(%ebx)
c01c0d2b: 68 bf 00 00 00         push   $0xbf
c01c0d30: 6a 03                  push   $0x3
c01c0d32: 53                     push   %ebx
c01c0d33: e8 82 ed ff ff         call   c01bfaba <serial_out>
c01c0d38: 6a 10                  push   $0x10
c01c0d3a: 6a 02                  push   $0x2
c01c0d3c: 53                     push   %ebx
c01c0d3d: e8 78 ed ff ff         call   c01bfaba <serial_out>
c01c0d42: 6a 00                  push   $0x0
c01c0d44: 6a 01                  push   $0x1
c01c0d46: 53                     push   %ebx
c01c0d47: e8 6e ed ff ff         call   c01bfaba <serial_out>
c01c0d4c: 83 c4 24               add    $0x24,%esp
c01c0d4f: 6a 00                  push   $0x0
c01c0d51: 6a 03                  push   $0x3
c01c0d53: 53                     push   %ebx
c01c0d54: e8 61 ed ff ff         call   c01bfaba <serial_out>
c01c0d59: 6a 00                  push   $0x0
c01c0d5b: 6a 0c                  push   $0xc
c01c0d5d: 53                     push   %ebx
c01c0d5e: e8 a7 ed ff ff         call   c01bfb0a <serial_icr_write>
c01c0d63: 68 bf 00 00 00         push   $0xbf
c01c0d68: 6a 03                  push   $0x3
c01c0d6a: 53                     push   %ebx
c01c0d6b: e8 4a ed ff ff         call   c01bfaba <serial_out>
c01c0d70: 83 c4 24               add    $0x24,%esp
c01c0d73: 6a 10                  push   $0x10
c01c0d75: 6a 02                  push
Re: Serial related oops
Jose Goncalves wrote:
> New developments.
> I have upgraded to 2.6.16.41, applied a patch sent by Frederik that
> removed the change made in http://lkml.org/lkml/2005/6/23/266 and
> activated some more kernel debug, i.e., CONFIG_KALLSYMS_ALL,
> CONFIG_DEBUG_KERNEL, CONFIG_DETECT_SOFTLOCKUP, CONFIG_DEBUG_SLAB,
> CONFIG_DEBUG_MUTEXES, CONFIG_FRAME_POINTER and CONFIG_FORCED_INLINING
> (thanks to vda for pointing me to the right doc.).
> At first it seemed to work fine, but after some days of continuous
> running I've got another kernel Oops!
> I attach the dmesg output and the assembly dump of serial8250_startup()
> and serial8250_shutdown().
>

And also the assembly dump of serial_in(), where the NULL pointer dereference happens.

José Gonçalves

vmlinux-2.6.16.41-mtm5-debug1: file format elf32-i386

Disassembly of section .text:

c01bfa70 <serial_in>:
c01bfa70: 55            push   %ebp
c01bfa71: 89 e5         mov    %esp,%ebp
c01bfa73: 53            push   %ebx
c01bfa74: 8b 5d 08      mov    0x8(%ebp),%ebx
c01bfa77: 8b 55 0c      mov    0xc(%ebp),%edx
c01bfa7a: 0f b6 4b 12   movzbl 0x12(%ebx),%ecx
c01bfa7e: 0f b6 43 13   movzbl 0x13(%ebx),%eax
c01bfa82: d3 e2         shl    %cl,%edx
c01bfa84: 83 f8 02      cmp    $0x2,%eax
c01bfa87: 74 1a         je     c01bfaa3 <serial_in+0x33>
c01bfa89: 7f 05         jg     c01bfa90 <serial_in+0x20>
c01bfa8b: 48            dec    %eax
c01bfa8c: 74 09         je     c01bfa97 <serial_in+0x27>
c01bfa8e: eb 21         jmp    c01bfab1 <serial_in+0x41>
c01bfa90: 83 f8 03      cmp    $0x3,%eax
c01bfa93: 74 15         je     c01bfaaa <serial_in+0x3a>
c01bfa95: eb 1a         jmp    c01bfab1 <serial_in+0x41>
c01bfa97: 8a 43 78      mov    0x78(%ebx),%al
c01bfa9a: 01 d0         add    %edx,%eax
c01bfa9c: 8b 13         mov    (%ebx),%edx
c01bfa9e: 48            dec    %eax
c01bfa9f: ee            out    %al,(%dx)
c01bfaa0: 42            inc    %edx
c01bfaa1: eb 10         jmp    c01bfab3 <serial_in+0x43>
c01bfaa3: 03 53 04      add    0x4(%ebx),%edx
c01bfaa6: 8a 02         mov    (%edx),%al
c01bfaa8: eb 0a         jmp    c01bfab4 <serial_in+0x44>
c01bfaaa: 03 53 04      add    0x4(%ebx),%edx
c01bfaad: 8b 02         mov    (%edx),%eax
c01bfaaf: eb 06         jmp    c01bfab7 <serial_in+0x47>
c01bfab1: 03 13         add    (%ebx),%edx
c01bfab3: ec            in     (%dx),%al
c01bfab4: 0f b6 c0      movzbl %al,%eax
c01bfab7: 5b            pop    %ebx
c01bfab8: 5d            pop    %ebp
c01bfab9: c3            ret

Disassembly of section .init.text:
Disassembly of section .altinstr_replacement:
Disassembly of section .exit.text:
Re: Serial related oops
On Wed, Feb 21, 2007 at 02:13:15PM +, Jose Goncalves wrote:
> 1[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 0012
> 1[18840.313046] printing eip:
> 4[18840.321687] c01bfa7a
> 1[18840.321714] *pde =
> 0[18840.331287] Oops: [#1]
> 4[18840.340687] Modules linked in:
> 0[18840.349749] CPU:0
> 4[18840.349767] EIP:0060:[c01bfa7a]Not tainted VLI
> 4[18840.349782] EFLAGS: 00010202 (2.6.16.41-mtm5-debug1 #1)
> 0[18840.377277] EIP is at serial_in+0xa/0x4a
> 0[18840.387221] eax: 0060 ebx: ecx: edx:
> 0[18840.397805] esi: edi: 0040 ebp: c728fe1c esp: c728fe18
> 0[18840.408579] ds: 007b es: 007b ss: 0068
> 0[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
> 0[18840.420509] Stack: 0 c01c0f88 c031fef0 0005 0202
> 0[18840.445655]c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510
> 0[18840.460540] c773dbcc c728fe7c c01befe7 c124b510 ffed c773dbcc

Okay, this one is even more plainly not a coding error.

> 0[18840.566645] [c01c0f88] serial8250_startup+0x28f/0x2a9

The code around this point (with the return point marked) is:

c01c0f78: 6a 05            push $0x5
c01c0f7a: 53               push %ebx
c01c0f7b: e8 f0 ea ff ff   call c01bfa70 serial_in
c01c0f80: 6a 00            push $0x0
c01c0f82: 53               push %ebx
c01c0f83: e8 e8 ea ff ff   call c01bfa70 serial_in
c01c0f88: 6a 02            push $0x2        <-- return point
c01c0f8a: 53               push %ebx
c01c0f8b: e8 e0 ea ff ff   call c01bfa70 serial_in

and corresponds with this C code:

(void) serial_inp(up, UART_LSR);
(void) serial_inp(up, UART_RX);
(void) serial_inp(up, UART_IIR);

Now let's look at the words pushed on the stack around this code:

c01c0f88 - return address for serial_in (serial8250_startup+0x28f/0x2a9)
         - from push %ebx at c01c0f82
         - from push $0x0 at c01c0f80
c031fef0 - from push %ebx at c01c0f7a
0005     - from push $0x5 at c01c0f78

Plainly, %ebx changed across the call to serial_in() at c01c0f7b. First thing to notice is that this violates the C code - up cannot change.

Now let's look at serial_in:

c01bfa70: 55      push %ebp
c01bfa71: 89 e5   mov  %esp,%ebp
c01bfa73: 53      push %ebx
...
c01bfab7: 5b      pop  %ebx
c01bfab8: 5d      pop  %ebp
c01bfab9: c3      ret

This code tells the CPU to preserve %ebx and %ebp. But we know %ebx _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code told it to do. Moreover, serial_in() has preserved %ebx in the past, otherwise we'd never have got past all the other serial_in()s in serial8250_startup().

So I think it's very demonstrably a hardware fault, and not software related. For all we know, it could be a one-off fault on the hardware you happen to have - other identical units may not behave the same (can you check?).

If it is a one-off case, you are welcome to patch that test out in your kernel build to remove the problem, and if it's an isolated case I encourage you to do this. This is one of the great advantages of open source - if you hit such a problem, rather than throwing the hardware away you can work around such issues.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things to us, at least on an ARM target ...
Re: Serial related oops
Russell King wrote:
> Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> First thing to notice is this violates the C code - up can not change.
>
> Now let's look at serial_in:
>
> c01bfa70: 55      push %ebp
> c01bfa71: 89 e5   mov  %esp,%ebp
> c01bfa73: 53      push %ebx
> ...
> c01bfab7: 5b      pop  %ebx
> c01bfab8: 5d      pop  %ebp
> c01bfab9: c3      ret
>
> This code tells the CPU to preserve %ebx and %ebp. But we know %ebx
> _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> told it to do.

... assuming nothing else clobbered the stack slot (which would be a compiler error, or a wild pointer.)

Got a disassembly of the whole function?

-hpa
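To make the caveat above concrete, here is a minimal userspace sketch of how a stray write to the saved stack slot would produce exactly the symptom in the oops: a correct push/pop pair that still "restores" a NULL. This is a toy model under stated assumptions, not kernel code and not real x86 semantics; every toy_* name is hypothetical and invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STACK_WORDS 16

/* Toy CPU: one callee-saved register and a small downward-growing stack. */
struct toy_cpu {
    uint32_t ebx;
    uint32_t stack[TOY_STACK_WORDS];
    size_t sp;                      /* index of the current top of stack */
};

void toy_push(struct toy_cpu *cpu, uint32_t value)
{
    assert(cpu->sp > 0);
    cpu->stack[--cpu->sp] = value;
}

uint32_t toy_pop(struct toy_cpu *cpu)
{
    assert(cpu->sp < TOY_STACK_WORDS);
    return cpu->stack[cpu->sp++];
}

/*
 * Model of a callee that saves and restores %ebx, like serial_in().
 * If clobber_saved_slot is nonzero, something else (a wild pointer,
 * miscompiled code) overwrites the saved copy in between.
 */
void toy_callee(struct toy_cpu *cpu, int clobber_saved_slot)
{
    toy_push(cpu, cpu->ebx);        /* push %ebx */
    cpu->ebx = 0x12345678u;         /* callee uses %ebx as scratch */
    if (clobber_saved_slot)
        cpu->stack[cpu->sp] = 0;    /* stray write hits the saved slot */
    cpu->ebx = toy_pop(cpu);        /* pop %ebx */
}

/* Returns the value the caller sees in %ebx after the call returns. */
uint32_t toy_ebx_after_call(uint32_t ebx_before, int clobber_saved_slot)
{
    struct toy_cpu cpu = { 0 };

    cpu.ebx = ebx_before;
    cpu.sp = TOY_STACK_WORDS;
    toy_callee(&cpu, clobber_saved_slot);
    return cpu.ebx;
}
```

Calling toy_ebx_after_call(0xc031fef0, 0) gives back the original pointer, while passing 1 as the second argument gives 0, even though the push/pop sequence itself is unchanged. The model cannot say which of the two explanations (hardware fault or clobbered slot) applies to the real oops; it only shows that both are consistent with the dump.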
Re: Serial related oops
On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
> Russell King wrote:
> > Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> > First thing to notice is this violates the C code - up can not change.
> >
> > Now let's look at serial_in:
> >
> > c01bfa70: 55      push %ebp
> > c01bfa71: 89 e5   mov  %esp,%ebp
> > c01bfa73: 53      push %ebx
> > ...
> > c01bfab7: 5b      pop  %ebx
> > c01bfab8: 5d      pop  %ebp
> > c01bfab9: c3      ret
> >
> > This code tells the CPU to preserve %ebx and %ebp. But we know %ebx
> > _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> > told it to do.
>
> ... assuming nothing else clobbered the stack slot (which would be a
> compiler error, or a wild pointer.)
>
> Got a disassembly of the whole function?

Jose posted it higher in the thread: http://lkml.org/lkml/2007/2/21/139

Regards,
Frederik
Re: Serial related oops
Michael K. Edwards wrote: Of course not. But dealing with a stuck IRQ line by locking up isn't very practical either. IRQ sharing is stupid yet universal, and it And we don't, that's why we have that "nobody cared" logic that disables the interrupt line if no driver services the interrupt. That doesn't provide a clean recovery, of course, it's meant to notify the user of what happened so that the problem can be fixed. happens all the time that a device that has been sitting there minding its own business since power-up, with no driver to drive it, decides to assert its IRQ. Maybe it just got hot-plugged, maybe it just got its first dribble of input, whatever. Other devices on the shared IRQ are screwed (or at least semi-screwed; you could periodically re-enable the IRQ long enough to make a run through the ISR chain servicing the other devices). But if you run "lspci" (or whatever) and load a driver for the newly awake device, everything goes back to normal. For devices compiled into the kernel, you shouldn't have to play these games. If, that is, there were three stages of driver initialization, called in successive passes: Exactly, for devices compiled into the kernel. In most setups this is only a fraction of all devices, so solving this problem only for drivers built into the kernel is no solution. 1) installing an ISR with a fallback STFU path (device-specific but not dependent on any particular pre-existing chip state), quiescing it if you know how and registering for the IRQ if you know which it is; 2) going through the chip's soft-reset-wake-up-shut-up cycle and populating driver data structures, possibly correcting the IRQ registration along the way; 3) ready-as-we'll-ever-be, bring on the interrupts. You probably can't help enabling the IRQ briefly during 2) so that you can do tests like Russell's loopback. But it's a needless gamble to do that without doing 1) for all compiled-in drivers and platform devices first, in a previous discovery pass. 
And it's stupid to do 3) in the same pass as 2), because you'll just open race condition windows that will only bite when an all-the-way-live device raises its IRQ at a moment when the writer of the wake-up-shut-up code wasn't expecting it. All code has bugs and they're only a problem when they bite in the field. If a system has a device that generates interrupts before they're enabled, and the firmware doesn't fix it, then some platform-specific quirk has to handle it and shut off the interrupt before it allows any interrupts to be enabled. (We have such a quirk for certain network controllers where the boot ROM can leave the chip generating interrupts on bootup.) You don't need quirks if your driver initialization is bomb-proof to begin with. Devices that are quiet on power-up are purely coincidental and should not be construed. It's not coincidental, it is the only sane way to design hardware. You just can't go firing off interrupts without a driver having intentionally enabled them. There are a few devices that have had such issues, but they have been few and far between, certainly not enough to warrant the complexity of the scheme you propose. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
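For readers unfamiliar with the "nobody cared" logic mentioned above, here is a rough userspace sketch of the idea: count how many interrupts on a line went unclaimed within a window, and disable the line if nearly all of them did. This is a simplified model, not the kernel's implementation; the window size, the threshold, and all toy_* names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values; the real kernel uses its own window and threshold. */
#define TOY_WINDOW          1000u
#define TOY_UNHANDLED_LIMIT  900u

struct toy_irq_line {
    unsigned int count;      /* interrupts seen in the current window */
    unsigned int unhandled;  /* of those, how many no handler claimed */
    bool disabled;           /* line shut off after a storm */
};

/*
 * Account for one interrupt. 'handled' says whether any handler on the
 * line claimed it. Returns true if the line is (now) disabled.
 */
bool toy_note_interrupt(struct toy_irq_line *line, bool handled)
{
    assert(line != NULL);

    if (line->disabled)
        return true;
    if (!handled)
        line->unhandled++;
    if (++line->count < TOY_WINDOW)
        return false;

    /* End of window: too many unclaimed interrupts means "nobody cared". */
    if (line->unhandled > TOY_UNHANDLED_LIMIT)
        line->disabled = true;
    line->count = 0;
    line->unhandled = 0;
    return line->disabled;
}

/* Deliver up to n interrupts; returns how many arrived before shutdown. */
unsigned int toy_storm(struct toy_irq_line *line, unsigned int n, bool handled)
{
    unsigned int delivered = 0;

    while (delivered < n && !line->disabled) {
        toy_note_interrupt(line, handled);
        delivered++;
    }
    return delivered;
}
```

With these assumed numbers, a storm of unclaimed interrupts is cut off after the first full window, while a line whose interrupts are all claimed is never disabled. As noted above, this is a notification and damage-limiting mechanism rather than a clean recovery: other devices sharing the disabled line stop receiving interrupts too.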
Re: Serial related oops
On 2/19/07, Robert Hancock <[EMAIL PROTECTED]> wrote: How do you propose to do this? Drivers can get loaded and unloaded at any time. If you have a device generating spurious interrupts on a shared IRQ line, there's no way you can use any device on that line until that interrupt is shut off. Requiring all drivers to be loaded before any of them can use interrupts is simply not practical. Of course not. But dealing with a stuck IRQ line by locking up isn't very practical either. IRQ sharing is stupid yet universal, and it happens all the time that a device that has been sitting there minding its own business since power-up, with no driver to drive it, decides to assert its IRQ. Maybe it just got hot-plugged, maybe it just got its first dribble of input, whatever. Other devices on the shared IRQ are screwed (or at least semi-screwed; you could periodically re-enable the IRQ long enough to make a run through the ISR chain servicing the other devices). But if you run "lspci" (or whatever) and load a driver for the newly awake device, everything goes back to normal. For devices compiled into the kernel, you shouldn't have to play these games. If, that is, there were three stages of driver initialization, called in successive passes: 1) installing an ISR with a fallback STFU path (device-specific but not dependent on any particular pre-existing chip state), quiescing it if you know how and registering for the IRQ if you know which it is; 2) going through the chip's soft-reset-wake-up-shut-up cycle and populating driver data structures, possibly correcting the IRQ registration along the way; 3) ready-as-we'll-ever-be, bring on the interrupts. You probably can't help enabling the IRQ briefly during 2) so that you can do tests like Russell's loopback. But it's a needless gamble to do that without doing 1) for all compiled-in drivers and platform devices first, in a previous discovery pass. 
And it's stupid to do 3) in the same pass as 2), because you'll just open race condition windows that will only bite when an all-the-way-live device raises its IRQ at a moment when the writer of the wake-up-shut-up code wasn't expecting it. All code has bugs and they're only a problem when they bite in the field. If a system has a device that generates interrupts before they're enabled, and the firmware doesn't fix it, then some platform-specific quirk has to handle it and shut off the interrupt before it allows any interrupts to be enabled. (We have such a quirk for certain network controllers where the boot ROM can leave the chip generating interrupts on bootup.) You don't need quirks if your driver initialization is bomb-proof to begin with. Devices that are quiet on power-up are purely coincidental and should not be construed. Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
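The three-stage proposal above can be sketched as three separate loops over all built-in drivers, so that no device reaches a later stage until every device has completed the earlier one. This is only an illustration of the ordering being argued for; it is not how the kernel's initcalls are organised, and the struct, field, and function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-driver record; each field holds the tick at which
 * that stage ran, so the ordering can be inspected afterwards. */
struct toy_driver {
    const char *name;
    unsigned int t_quiesce;  /* stage 1: fallback ISR installed, device quiesced */
    unsigned int t_reset;    /* stage 2: soft reset, data structures populated */
    unsigned int t_live;     /* stage 3: interrupts enabled at the device */
};

/*
 * Run every driver through stage 1, then every driver through stage 2,
 * then every driver through stage 3. The point of the separate passes
 * is that an IRQ shared between two devices is never enabled while one
 * of them still lacks a handler able to shut it up.
 */
void toy_init_all(struct toy_driver *drv, size_t n)
{
    unsigned int tick = 0;
    size_t i;

    assert(drv != NULL || n == 0);

    for (i = 0; i < n; i++)
        drv[i].t_quiesce = ++tick;   /* pass 1 for all devices */
    for (i = 0; i < n; i++)
        drv[i].t_reset = ++tick;     /* pass 2 for all devices */
    for (i = 0; i < n; i++)
        drv[i].t_live = ++tick;      /* pass 3 for all devices */
}
```

After toy_init_all() on two drivers, the last stage-1 tick is lower than the first stage-2 tick, and likewise for stages 2 and 3. The sketch says nothing about modules loaded later or hot-plugged devices, which is the objection raised elsewhere in the thread.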
Re: Serial related oops
Michael K. Edwards wrote:
> Still open, though it's a pity you're more interested in my flawed
> understanding than in the possibility that the kernel could be
> systematically made more robust against hardware bugs and coding
> errors by the simple expedient of putting all the ISRs in before
> turning on any IRQ that might be shared. Or are you telling me that's
> already been done? (Yes, I am aware that this interacts
> entertainingly with hot-plug PCI. Yes, I am aware that there is a
> limit to how much software can fix stupid hardware. But surely there
> is room for an emergency IRQ suppressor to let chip initialization
> code kick in and force the hardware to a known state.)

How do you propose to do this? Drivers can get loaded and unloaded at any time. If you have a device generating spurious interrupts on a shared IRQ line, there's no way you can use any device on that line until that interrupt is shut off. Requiring all drivers to be loaded before any of them can use interrupts is simply not practical.

If a system has a device that generates interrupts before they're enabled, and the firmware doesn't fix it, then some platform-specific quirk has to handle it and shut off the interrupt before it allows any interrupts to be enabled. (We have such a quirk for certain network controllers where the boot ROM can leave the chip generating interrupts on bootup.)

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: Serial related oops
On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote: This can't happen because when __do_irq unmasks the interrupt source, the CPU mask is set, thereby preventing any further interrupt exceptions being taken. This is done precisely to prevent this situation happening. If you are seeing recursion for the same interrupt (two or more stack frames containing asm_do_IRQ for that very same IRQ) then your interrupt handling is buggy, plain and simple. Imaginable. I'll look at the mask/unmask code. Thanks. I don't doubt that it is on the same IRQ line - I have such setups here and it works perfectly - multiple 8250 UARTs connected to a single level-triggered interrupt input which also happens to be shared with a SCSI host chip as well. Absolutely no problems. Can you do me a favor? In the sys_open("/dev/console") path, turn on the right bits in that second uart's IER, then insert a sleep in request_irq or something (wherever seems best based on that backtrace), and feed enough characters into the second UART during that sleep to generate an IRQ. Do you not get the same soft lockup? I still say that your understanding is completely flawed. Moreover, you haven't read what I've said about the ordering of initialisation, the stress on when we disable interrupts for the ports, etc. Well, all I can say is that that's a real backtrace and it shouldn't be hard to reproduce if it's anything other than a broken interrupt controller or broken code called by the __do_irq postamble. I don't see any platform-provided unmask routines in that backtrace, but maybe it got inlined; I'll go back and check. You're actually *not* helping. You're causing utter confusion through misunderstanding, but it seems you're not open to the possibility that your understanding is flawed. 
Still open, though it's a pity you're more interested in my flawed understanding that in the possibility that the kernel could be systematically made more robust against hardware bugs and coding errors by the simple expedient of putting all the ISRs in before turning on any IRQ that might be shared. Or are you telling me that's already been done? (Yes, I am aware that this interacts entertainingly with hot-plug PCI. Yes, I am aware that there is a limit to how much software can fix stupid hardware. But surely there is room for an emergency IRQ suppressor to let chip initialization code kick in and force the hardware to a known state.) I'm offering to look through your code and point you at the source of your issue for free. Please don't throw that offer away without first considering that maybe I have a clue about what's going on here. I appreciate that offer, and I hope to take advantage of it as soon as I have the source code at my fingertips (not just the chat log where I recorded the backtrace). ... which showed the port being opened well after system initialisation of devices, including all serial ports - including disabling of their interrupt source at the IER, has been completed. Now that you mention it, the backtrace I sent is the serial8250_startup one, not the serial8250_init one. Sorry, this one's probably an artifact of brain damage specific to this UART. I need to dig through a different account to find the init-path example; but in either case, we're getting a new interrupt during the __do_irq postamble. If you're telling me that that shouldn't happen, what should the backtrace for a soft lockup due to a stuck level-triggered IRQ look like on ARM? Yes, and it's the same for any serial console with functioning break support. You'll find it in Documentation/sysrq.txt, though it does misleadingly say "PC style standard serial ports only" whereas the reality is "where possible". 
Thank you very much; this will help me get to the bottom of some other chip-support nastiness on this device. Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops
On Mon, Feb 19, 2007 at 04:04:26PM -0800, Michael K. Edwards wrote: > On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote: > >The second interrupt comes in, and when you go to disable that > >source, you inadvertently re-enable the UART interrupt, despite it > >still being serviced. > > Incorrect. An attempt has been made to service the interrupt using > the only ISR currently in the chain for that IRQ -- the ISR for the > first UART. That attempt was not successful, and when __do_irq > unmasks the interrupt source preparatory to exiting interrupt context, > __irq_svc is dispatched anew. This can't happen because when __do_irq unmasks the interrupt source, the CPU mask is set, thereby preventing any further interrupt exceptions being taken. This is done precisely to prevent this situation happening. If you are seeing recursion for the same interrupt (two or more stack frames containing asm_do_IRQ for that very same IRQ) then your interrupt handling is buggy, plain and simple. > >Please show your interrupt controller (mask, unmask, mask_ack) > >handling functions corresponding with the interrupt which your > >UART is connected to. > > Don't have 'em handy; I'll be happy to post them when I do, perhaps > later today. I would hope they're pretty generic, though; it's a > Feroceon core pretending to be an ARM926EJ-S, hooked to the usual > half-assed Marvell imitation of an ARM licensed functional block. > Trust me for the moment, it's the same IRQ line. I don't doubt that it is on the same IRQ line - I have such setups here and it works perfectly - multiple 8250 UARTs connected to a single level-triggered interrupt input which also happens to be shared with a SCSI host chip as well. Absolutely no problems. 
> If you don't enjoy this sort of forensics (which I > for one do not, especially not when there is a project deadline > looming and a Heisenbug starts firing 9 times out of 10), you might > consider systematically installing ISRs that know how to shut > everything up before turning on any interrupt sources at all. I still say that your understanding is completely flawed. Moreover, you haven't read what I've said about the ordering of initialisation, the stress on when we disable interrupts for the ports, etc. > I'm not asking for anyone's help except in the > let's-all-help-one-another spirit. You're actually *not* helping. You're causing utter confusion through misunderstanding, but it seems you're not open to the possibility that your understanding is flawed. I'm offering to look through your code and point you at the source of your issue for free. Please don't throw that offer away without first considering that maybe I have a clue about what's going on here. > Now please take a second look at the backtrace before toasting me > lightly again. ... which showed the port being opened well after system initialisation of devices, including all serial ports - including disabling of their interrupt source at the IER, has been completed. > Mmm'kay? Oh, and by the way -- is there an Alt-SysRq > equivalent on an ARM serial console? Yes, and it's the same for any serial console with functioning break support. You'll find it in Documentation/sysrq.txt, though it does misleadingly say "PC style standard serial ports only" whereas the reality is "where possible". -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops
On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote: I think something else is going on here. I think you're getting an interrupt for the UART, and another interrupt is also pending. Correct. An interrupt for the other UART on the same IRQ. When the UART interrupt is handled, it is masked at the interrupt controller, and the CPU mask is dropped. Correct. The second interrupt comes in, and when you go to disable that source, you inadvertently re-enable the UART interrupt, despite it still being serviced. Incorrect. An attempt has been made to service the interrupt using the only ISR currently in the chain for that IRQ -- the ISR for the first UART. That attempt was not successful, and when __do_irq unmasks the interrupt source preparatory to exiting interrupt context, __irq_svc is dispatched anew. This leads to the UART interrupt again triggering an IRQ. Right. The _second_ UART's interrupt. There's another problem with these UARTs having to do with the implementor's inability to read and follow a bog-standard twenty-year-old spec without asking software to fix up corner cases, but that's another backtrace for another day. Please show your interrupt controller (mask, unmask, mask_ack) handling functions corresponding with the interrupt which your UART is connected to. Don't have 'em handy; I'll be happy to post them when I do, perhaps later today. I would hope they're pretty generic, though; it's a Feroceon core pretending to be an ARM926EJ-S, hooked to the usual half-assed Marvell imitation of an ARM licensed functional block. Trust me for the moment, it's the same IRQ line. This shows that you don't actually have an understanding of the Linux kernel boot, especially in respect of serial devices. At boot, devices are detected and initialised to a safe state, where they will not spuriously generate interrupts. Sorry, 'taint so. Not unless the chip support droid has put the right stuff in arch/arm/mach-foo. 
LKML is littered with the fall-out of the decision to trust whoever jumped to main() to have left the hardware in a sane state. If you don't enjoy this sort of forensics (which I for one do not, especially not when there is a project deadline looming and a Heisenbug starts firing 9 times out of 10), you might consider systematically installing ISRs that know how to shut everything up before turning on any interrupt sources at all. As I said, this is not going to happen overnight, and is not even particularly in the economic interest of people who get paid by the hour to wear bringup wizard hats. That category currently includes me, but I am intensely bored with this game and aspire to greater things. When a userspace program opens a serial port, which can only happen once the kernel boot has completed (ergo, devices have been initialised and placed in a safe state) the interrupts are claimed, and enabled at the source. As you can see from the console dump I posted (which begins with "Freeing init memory: 92K" and ends with do_exit -> init -> sys_open, which is obviously sys_open("/dev/console")), this happens long before userspace comes into the picture. Our 8250.c has some nasty hacks in it but otherwise this call chain is from a very nearly vanilla 2.6.16.recent. We've already worked around this on our board, and the whole kit and kaboodle will eventually be posted to linux-arm-kernel in tidy patches when my client lets me spend billable hours on it (immediately after the damn thing passes its first functional test, long before it ships). I'm not asking for anyone's help except in the let's-all-help-one-another spirit. I'm trying to help with root cause analysis of Frederik's (Jose's?) fandango on core. If it's not relevant, my apologies; and although it goes without saying, I salute you for both the serial driver and the ARM port. Now please take a second look at the backtrace before toasting me lightly again. Mmm'kay? 
Oh, and by the way -- is there an Alt-SysRq equivalent on an ARM serial console? Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops
On Mon, Feb 19, 2007 at 02:16:41PM -0800, Michael K. Edwards wrote:
> Right. But as soon as you turn the source back on, in the postamble
> of the interrupt dispatch handler, it fires again. At least on ARM,
> that gives you recursive hits to __irq_svc and a couple of nested
> calls within it.

I think something else is going on here. I think you're getting an interrupt for the UART, and another interrupt is also pending. When the UART interrupt is handled, it is masked at the interrupt controller, and the CPU mask is dropped. The second interrupt comes in, and when you go to disable that source, you inadvertently re-enable the UART interrupt, despite it still being serviced. This leads to the UART interrupt again triggering an IRQ.

Please show your interrupt controller (mask, unmask, mask_ack) handling functions corresponding with the interrupt which your UART is connected to.

> >> But its context is not. Shared IRQ lines are a _problem_. You cannot
> >> safely enable an IRQ until all devices that share it have had their
> >> ISRs installed, unless you can absolutely guarantee at a hardware
> >> level that the uninitialized ones cannot assert the IRQ line.
> >
> >Linux assumes that all interrupt sources on a shared IRQ line are
> >disabled at the point in time when the kernel boots. When a device
> >is to be used, an interrupt handler is installed and then the kernel
> >will enable the interrupt on the device, not before.
>
> Linux assumes incorrectly in this instance.
> It would improve the
> kernel if all drivers' __init code were refactored into an
> IRQ-discovery-ISR-installation pass, followed by a
> chip-reset-data-structure-initialization pass, followed by a
> chip-configuration-driver-activation pass. This is unlikely to happen
> overnight.

This shows that you don't actually have an understanding of the Linux kernel boot, especially in respect of serial devices.

At boot, devices are detected and initialised to a safe state, where they will not spuriously generate interrupts.

When a userspace program opens a serial port, which can only happen once the kernel boot has completed (ergo, devices have been initialised and placed in a safe state), the interrupts are claimed, and enabled at the source.

> In the meantime, weird UART states on entry into platform_device_init
> are a reality.

Yes, UART states are indeterminate at this point. However, as soon as the 8250 driver loads it takes control of the 8250 ports, and DISABLES the interrupt on ALL ports found, LONG BEFORE any service handlers are installed.

So, by the time the system is up and running, _all_ 8250 ports have had their IERs written with zero. Interrupts disabled at source. By the time you get to open any serial port, the initialisation has completed.

> >We follow that rule in the 8250 driver - in fact, when we initialise
> >we ensure that interrupts are disabled on any devices we find.
>
> No, you rely on the caller of serial8250_init to have punctured the
> abstraction

Can you add any other useless complex words into that sentence?

> and forced any and all UARTs to a state where they cannot
> possibly generate an IRQ.

That is being done already at initialisation time.

Now, please show your interrupt mask/unmask/mask_ack code, which is where I believe your problem to lie.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
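The ordering described above (write zero to every port's IER at driver load, enable interrupts at the source only when a port is opened) can be modelled in a few lines of userspace C. This is a sketch with simulated registers, not the 8250 driver's code; the toy_* names and the startup signature are assumptions, and only the IER register offset (1) is taken from the standard 16550 register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_UART_IER   1   /* interrupt enable register offset on 16550-style UARTs */
#define TOY_UART_NREGS 8

/* Simulated UART: a register file plus a flag for "handler is registered". */
struct toy_uart {
    uint8_t regs[TOY_UART_NREGS];
    int handler_installed;
};

/* Driver load: whatever state the firmware left behind, silence every port. */
void toy_uart_init_all(struct toy_uart *ports, size_t n)
{
    size_t i;

    assert(ports != NULL || n == 0);
    for (i = 0; i < n; i++) {
        ports[i].regs[TOY_UART_IER] = 0;   /* interrupts disabled at source */
        ports[i].handler_installed = 0;
    }
}

/* Port open: claim the interrupt first, then enable it at the device. */
void toy_uart_startup(struct toy_uart *port, uint8_t ier)
{
    assert(port != NULL);
    port->handler_installed = 1;           /* handler registered first */
    port->regs[TOY_UART_IER] = ier;        /* only now may the device interrupt */
}

/* A port can raise an interrupt only if some IER bit is set. */
int toy_uart_can_interrupt(const struct toy_uart *port)
{
    return port->regs[TOY_UART_IER] != 0;
}
```

In this model, a port whose IER held garbage before toy_uart_init_all() cannot interrupt afterwards, and a port that has not been through toy_uart_startup() stays silent even while its neighbour on the same line is live. Whether real hardware honours a zeroed IER is exactly what the thread is arguing about.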
Re: Serial related oops
On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote: > setup_irq() is where things go wrong, at least for us, at least on > 2.6.16.x. Interrupts are not disabled at the point in request_irq() > when the interrupt controller is poked to enable the IRQ source. If > you're lucky, and you're on an architecture where the UART interrupt > is properly level-triggered, and the worst thing that happens when you > attempt to service an interrupt that isn't yours is that it stays on, > then you get a soft lockup with two or three recursive __irq_svc hits > in the backtrace. If you're not lucky you do a fandango on core. That should not happen if your interrupt handling is correct - okay, you might get an interrupt at that point, but while servicing that interrupt the source will be disabled on the interrupt controller. Right. But as soon as you turn the source back on, in the postamble of the interrupt dispatch handler, it fires again. At least on ARM, that gives you recursive hits to __irq_svc and a couple of nested calls within it. Here's a backtrace (embedded in a chat log with some commentary): 6:42 PM me: we have definitely confirmed that the serial ISR is failing to clear the interrupt and the (presumably level-triggered) IRQ is firing again on exit from the ISR. 6:43 PM The reason that __do_softirq is usually the last function entrypoint in the backtrace before the __irq_svc associated with the timer is that it is the first place where interrupts are enabled during the IRQ dispatcher postamble. 
6:44 PM Here is a backtrace from a case where the timer interrupt hit during the perpetually firing ISR instead of during the dispatch code surrounding it (which is not visible in backtraces) 6:45 PM [ 54.23] Freeing init memory: 92K [ 52.24] rcu_do_batch: rcu node is 0xC03D7540, callback is 0xC00864C8 [ 52.24] rcu_do_batch: rcu node is 0xC02CCDA0, callback is 0xC006E7E4 [ 52.25] rcu_do_batch: rcu node is 0xC03D7730, callback is 0xC00864C8 [ 52.26] rcu_do_batch: rcu node is 0xC03D7920, callback is 0xC00864C8 [ 51.24] BUG: soft lockup detected on CPU#0! [ 52.24] [] (dump_stack+0x0/0x14) from [] (softlockup_tick+0xa8/0xe8) [ 52.24] [] (softlockup_tick+0x0/0xe8) from [] (run_local_timers+0x18/0x1c) [ 52.24] r8 = 00010105 r7 = 0005 r6 = r5 = [ 52.24] r4 = C0299B40 [ 52.24] [] (run_local_timers+0x0/0x1c) from [] (update_process_times+0x50/0x7c) [ 52.24] [] (update_process_times+0x0/0x7c) from [] (timer_tick+0xc4/0xe0) [ 52.24] r6 = r5 = C029DB48 r4 = C029DB48 [ 52.24] [] (timer_tick+0x0/0xe0) from [] (mv88w8xx8_timer_interrupt+0x30/0x68) [ 52.24] r6 = r5 = C029DB48 r4 = C024775C [ 52.24] [] (mv88w8xx8_timer_interrupt+0x0/0x68) from [] (__do_irq+0xf0/0x140) [ 52.24] r5 = r4 = C0204280 6:46 PM [ 52.24] [] (__do_irq+0x0/0x140) from [] (do_level_IRQ+0x70/0xc8) [ 52.24] [] (do_level_IRQ+0x0/0xc8) from [] (asm_do_IRQ+0x50/0x134) [ 52.24] r6 = C029DB48 r5 = C0240E24 r4 = 0005 [ 52.24] [] (asm_do_IRQ+0x0/0x134) from [] (__irq_svc+0x38/0x190) [ 52.24] r6 = 0020 r5 = C029DB7C r4 = [ 52.24] [] (__do_irq+0x0/0x140) from [] (do_level_IRQ+0x70/0xc8) [ 52.24] [] (do_level_IRQ+0x0/0xc8) from [] (asm_do_IRQ+0x50/0x134) [ 52.24] r6 = C029DBFC r5 = C0240F5C r4 = 000B [ 52.24] [] (asm_do_IRQ+0x0/0x134) from [] (__irq_svc+0x38/0x190) [ 52.24] r6 = 0800 r5 = C029DC30 r4 = [ 52.24] [] (__do_softirq+0x0/0xd8) from [] (irq_exit+0x48/0x5c) [ 52.24] r6 = C029DC94 r5 = C0240E24 r4 = 0005 [ 52.24] [] (irq_exit+0x0/0x5c) from [] (asm_do_IRQ+0x11c/0x134) [ 52.24] [] (asm_do_IRQ+0x0/0x134) from [] 
(__irq_svc+0x38/0x190) [ 52.24] r6 = 0820 r5 = C029DCC8 r4 = [ 52.24] [] (setup_irq+0x0/0x15c) from [] (request_irq+0xa4/0xd0) [ 52.24] r7 = r6 = r5 = 000B r4 = C0C1B5C0 [ 52.24] [] (request_irq+0x0/0xd0) from [] (serial_link_irq_chain+0x264/0x2a0) [ 52.24] [] (serial_link_irq_chain+0x0/0x2a0) from [] (serial8250_startup+0x2f4/0x4f0) [ 52.24] [] (serial8250_startup+0x0/0x4f0) from [] (uart_startup+0x164/0x48c) [ 52.24] [] (uart_startup+0x0/0x48c) from [] (uart_open+0x1a8/0x238) [ 52.24] [] (uart_open+0x0/0x238) from [] (tty_open+0x1cc/0x390) [ 52.24] [] (tty_open+0x0/0x390) from [] (chrdev_open+0x1e4/0x220) [ 52.24] [] (chrdev_open+0x0/0x220) from [] (__dentry_open+0x13c/0x294) [ 52.24] r8 = C028E2A0 r7 = C0077C60 r6 = C0C29B94 r5 = [ 52.24] r4 = C02CC300 [ 52.24] [] (__dentry_open+0x0/0x294) from [] (nameidata_to_filp+0x34/0x48) [ 52.24] [] (nameidata_to_filp+0x0/0x48) from [] (do_filp_open+0x44/0x4c) [ 52.24] r4 = 0002 [ 52.24] [] (do_filp_open+0x0/0x4c) from [] (do_sys_open+0x50/0x94) [ 52.24] r5 = r4 = 0002 [ 52.24] [] (do_sys_open+0x0/0x94) from [] (sys_open+0x24/0x28) [ 52.24] r8 = r7 =
Re: Serial related oops
On Mon, Feb 19, 2007 at 01:24:17PM -0800, Michael K. Edwards wrote:
> On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:
> >On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> >> What we've seen on our embedded ARM is that enabling an interrupt that
> >> is shared between multiple UARTs, at a stage when you have not set up
> >> all the data structures touched by the ISR and softirq, can have
> >> horrible consequences, including soft lockups and fandangos on core.
> >
> >Incorrect.  We have:
> >
> >1. registered an interrupt handler at this point.
> >2. disabled interrupts (we're under the spin lock)
>
> setup_irq() is where things go wrong, at least for us, at least on
> 2.6.16.x.  Interrupts are not disabled at the point in request_irq()
> when the interrupt controller is poked to enable the IRQ source.  If
> you're lucky, and you're on an architecture where the UART interrupt
> is properly level-triggered, and the worst thing that happens when you
> attempt to service an interrupt that isn't yours is that it stays on,
> then you get a soft lockup with two or three recursive __irq_svc hits
> in the backtrace.  If you're not lucky you do a fandango on core.

That should not happen if your interrupt handling is correct - okay,
you might get an interrupt at that point, but while servicing that
interrupt the source will be disabled on the interrupt controller.

You should _never_ _ever_ get recursive interrupts for the same
interrupt source.  Ever.  If you do, your platform's interrupt handling
is seriously buggy.

> But its context is not.  Shared IRQ lines are a _problem_.  You cannot
> safely enable an IRQ until all devices that share it have had their
> ISRs installed, unless you can absolutely guarantee at a hardware
> level that the uninitialized ones cannot assert the IRQ line.

Linux assumes that all interrupt sources on a shared IRQ line are
disabled at the point in time when the kernel boots.  When a device is
to be used, an interrupt handler is installed and then the kernel will
enable the interrupt on the device, not before.

We follow that rule in the 8250 driver - in fact, when we initialise we
ensure that interrupts are disabled on any devices we find.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
On 2/19/07, Russell King <[EMAIL PROTECTED]> wrote:
> On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> > What we've seen on our embedded ARM is that enabling an interrupt that
> > is shared between multiple UARTs, at a stage when you have not set up
> > all the data structures touched by the ISR and softirq, can have
> > horrible consequences, including soft lockups and fandangos on core.
>
> Incorrect.  We have:
>
> 1. registered an interrupt handler at this point.
> 2. disabled interrupts (we're under the spin lock)

setup_irq() is where things go wrong, at least for us, at least on
2.6.16.x.  Interrupts are not disabled at the point in request_irq()
when the interrupt controller is poked to enable the IRQ source.  If
you're lucky, and you're on an architecture where the UART interrupt
is properly level-triggered, and the worst thing that happens when you
attempt to service an interrupt that isn't yours is that it stays on,
then you get a soft lockup with two or three recursive __irq_svc hits
in the backtrace.  If you're not lucky you do a fandango on core.

> So, no interrupt will be seen by the CPU since the interrupt is masked.

The interrupt would need to be masked for the entire duration of the
outer loop that calls serial8250_init() or the equivalent for all
platform devices that share the IRQ.

> The test is intentionally designed to be safe from the interrupt
> generation point of view.

But its context is not.  Shared IRQ lines are a _problem_.  You cannot
safely enable an IRQ until all devices that share it have had their
ISRs installed, unless you can absolutely guarantee at a hardware
level that the uninitialized ones cannot assert the IRQ line.  That
does not apply to any device that might have been touched by the
bootloader or the early init code, especially a UART.

Cheers,
- Michael
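
[Editor's sketch, not part of the original mail.] The shared-line argument above can be modelled in a few lines of user-space C. This is a toy simulation under stated assumptions, not kernel code: `struct sim_dev`, `line_asserted()`, `dispatch_once()` and `irq_storm()` are hypothetical names invented for illustration. It assumes a level-triggered line that stays asserted while any device is both enabled and pending, and a dispatcher that can only quiet devices whose handler is installed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device on a shared, level-triggered IRQ line. */
struct sim_dev {
    bool irq_enabled;   /* device-level enable, e.g. left on by a bootloader */
    bool irq_pending;   /* device has an unserviced interrupt condition */
    bool has_handler;   /* an ISR for this device has been installed */
};

/* The line is asserted while any device is both enabled and pending. */
static bool line_asserted(const struct sim_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].irq_enabled && devs[i].irq_pending)
            return true;
    return false;
}

/* One dispatcher pass: only devices with a handler get serviced. */
static void dispatch_once(struct sim_dev *devs, size_t n)
{
    assert(devs != NULL);
    for (size_t i = 0; i < n; i++)
        if (devs[i].has_handler)
            devs[i].irq_pending = false;
}

/* Returns true if the line is still asserted after max_passes dispatches. */
static bool irq_storm(struct sim_dev *devs, size_t n, int max_passes)
{
    for (int pass = 0; pass < max_passes && line_asserted(devs, n); pass++)
        dispatch_once(devs, n);
    return line_asserted(devs, n);
}
```

In this model a device that is enabled and pending but has no handler keeps the line asserted however often the dispatcher runs, which is the situation described for a UART left live by the bootloader. If every device on the line either has a handler or is disabled at the device, the line goes quiet after one pass, which is the precondition the 8250 driver is said to rely on.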
Re: Serial related oops
On Mon, Feb 19, 2007 at 05:54:52PM +, Jose Goncalves wrote: > Russell King wrote: > Result is attached. Right... in depth analysis follows. [15423.650518] [] uart_startup+0x63/0xf4 equates to 0xc01ba49a, which is indeed the instruction after the call to port->ops->startup. The important code leading up to this is: c01ba437 : c01ba437: 55 push %ebp c01ba438: 57 push %edi c01ba439: 56 push %esi c01ba43a: 53 push %ebx c01ba43b: 8b 7c 24 14 mov0x14(%esp),%edi @ load state c01ba43f: 31 d2 xor%edx,%edx c01ba441: 8b 5f 10mov0x10(%edi),%ebx @ load state->info c01ba444: 8b 77 14mov0x14(%edi),%esi @ load state->port c01ba493: 8b 46 64mov0x64(%esi),%eax @ load port->ops c01ba496: 56 push %esi @ push "port" onto stack c01ba497: ff 50 24call *0x24(%eax) @ ops->startup(port) c01bd74b : c01bd74b: 55 push %ebp c01bd74c: 57 push %edi c01bd74d: 56 push %esi c01bd74e: 53 push %ebx c01bd74f: 8b 5c 24 14 mov0x14(%esp),%ebx Comparing this with the stack dump: Stack: c02fae70 0005 c74304e0 <- %ebx (pushed by serial8250_startup) c02fae70 <- %esi (pushed by serial8250_startup) c128d5e4 <- %edi (pushed by serial8250_startup) c7a69a80 <- %ebp (pushed by serial8250_startup) c01ba49a <- uart_startup+0x63/0xf4 (pushed by function called by ops->startup, iow serial8250_startup) c02fae70 <- pushed on by "push %esi" at c01ba496, this is "port" c128d5e4 <- %ebx (pushed by uart_startup) <- %esi (pushed by uart_startup) c7a69a80 <- %edi (pushed by uart_startup) c7a69a80 <- %ebp (pushed by uart_startup) c01bbaa0 <- probably uart_open+0xaa/0xec Once the instruction at c01bd74f completes, we have pushed into the stack the structure commented above, but not the first two uncommented values. %ebx contains the value of "port" at this point. We're looking for some place in the code which pushes a value of '5' and '%ebx' on to the stack, and the CPUs registers contain values which correspond with the values provided in your oops. The code corresponding with the buggy uart check is as follows. 
Comments interspersed: c01bd910: 9c pushf c01bd911: 5d pop%ebp c01bd912: fa cli This code pushes the processors flag register onto the stack, pops it off into the %ebp register, and then disables interrupts. Your oops dump contained "ebp: 0202" which is a reasonable value for x86 processors flags, which have been saved into the ebp register by the above code sequence (according to Wikipedia). c01bd937: ff 73 58pushl 0x58(%ebx) c01bd93a: 53 push %ebx c01bd93b: e8 7f fd ff ff call c01bd6bf c01bd940: 6a 02 push $0x2 c01bd942: 6a 01 push $0x1 c01bd944: 53 push %ebx c01bd945: e8 f6 eb ff ff call c01bc540 @ write IER c01bd94a: 6a 05 push $0x5 c01bd94c: 53 push %ebx c01bd94d: e8 a6 eb ff ff call c01bc4f8 @ reads LSR c01bd952: 89 c7 mov%eax,%edi @ saves result in %edi c01bd954: 6a 02 push $0x2 c01bd956: 53 push %ebx c01bd957: e8 9c eb ff ff call c01bc4f8 @ reads IIR c01bd95c: 83 c4 24add$0x24,%esp c01bd95f: 89 c6 mov%eax,%esi @ saves result in %esi This is the code corresponding with part of the buggy uart check - you can see the call to serial8250_set_mctrl() there which confirms this. The sequence at c01bd94a pushes "5" and "port" (%ebx) onto the stack, but this isn't the right place because before this we pushed "2", "1", and "port" on the stack, and those are not present in the stack dump. However, the reason for showing this is that a little while later, we have: c01bd96e: 83 e7 40and$0x40,%edi c01bd971: 74 1c je c01bd98f c01bd973: 83 e6 01and$0x1,%esi c01bd976: 74 17 je c01bd98f The normal value we would read from the LSR (stored in %edi) would be 0x60, and if a transmit interrupt was pending (which is what the test is trying to find out) the IIR (%esi) would be 0x02. The above code sequence which involves masking these values would therefore give: 0x40 & 0x60 (%edi) -> 0x40 in %edi 0x01 & 0x02 (%esi) -> 0x00 in %esi >From your oops dump "edi: 0040" and "esi: " - that ties up, so we know that the place we got to must be after this point. 
We eventually come to this sequence. The words previously pushed onto the stack have been removed at this point, and neither %ebp, %edi nor %esi has been touched by any other code since they were last
Re: Serial related oops
On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> What we've seen on our embedded ARM is that enabling an interrupt that
> is shared between multiple UARTs, at a stage when you have not set up
> all the data structures touched by the ISR and softirq, can have
> horrible consequences, including soft lockups and fandangos on core.

Incorrect.  We have:

1. registered an interrupt handler at this point.
2. disabled interrupts (we're under the spin lock)

So, no interrupt will be seen by the CPU since the interrupt is masked.

The test is intentionally designed to be safe from the interrupt
generation point of view.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
What we've seen on our embedded ARM is that enabling an interrupt that
is shared between multiple UARTs, at a stage when you have not set up
all the data structures touched by the ISR and softirq, can have
horrible consequences, including soft lockups and fandangos on core.
You will be vulnerable to this unless you lock out the interrupt
source (at the interrupt controller or, if you have to, globally)
across the UART registration process in your platform's
arch/mach-dependent core.c, in which case the TX irq test will of
course fail.  Roll-your-own SoC UARTs with bugs or "extended features"
in IRQ enabling and delivery make things worse.

I would love to see this disentangled in a maintainable way.  It's
such a nasty problem (especially given that bootloaders and early boot
code frequently turn on one or more UARTs and leave them in an unknown
state) that all we've been able to do so far is hack around it.  I'll
send an example patch when we've more or less isolated it, but it will
be of limited use to you unless you have the exact set of UART warpage
we do.

Cheers,
- Michael
Re: Serial related oops
Russell King wrote: > On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote: > >> Russell King wrote: >> >>> On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote: >>> >>> (trimmed tie-fei.zang from the CC, added by mistake) On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote: >> Neither did I, but introducing printk's through the function, we narrowed >> the problem to this part of the code. And removing it makes the problem >> go away. We inserted 37 printk's in the function body, and Jose bisected >> those until the problem went away. >> >> > Well, there's still little clue about why this is causing a NULL pointer > dereference. The only thing I can think is that somehow performing > this test is causing a power glitch to your CPU, causing its registers > to get corrupted, and which results in it doing a NULL pointer deref. > > That may be the case, indeed. >> But if the problem was a power glitch I should get Oops with or without >> printk() inserted, shouldn't I? >> > > That depends if the printk() changes the timing such that it doesn't occur. > Don't know, I'm only grasping at straws due to the lack of any concrete > information. > > If you see other tests to be performed... >>> Maybe adding some delays in that bit of code? I'm sure you've already >>> thought of that though. Since no one has a proper understanding of the >>> problem, the only suggestions possible are mere shots in the dark. >>> >> I'm no kernel expert, but it's not possible to trace what is the >> instruction that is causing the NULL pointer dereference? >> > > The reported dump shows that the kernel tried to access virtual address 0, > and the instruction pointer seems to be the cause of that - it has a value > of zero in that dump. > > The call trace indicates that the last function was called from around > "uart_startup+0x63/0xf4" which is probably the indirect function call > to serial8250_startup(). 
That's unconfirmed - the only way to get it > confirmed is if you could dump the entire uart_startup() function. > > $ grep uart_startup System.map > (address) T uart_startup > $ objdump -r -d vmlinux --start-addr=0x --stop-addr=0x > > The grep should get you the address of uart_startup. Replace > with that value and with the value plus 256 (0x100) and > mail the result. > Result is attached. José Gonçalves vmlinux-2.6.16.38-mtm4-debug: file format elf32-i386 Disassembly of section .text: c01ba437 : c01ba437: 55 push %ebp c01ba438: 57 push %edi c01ba439: 56 push %esi c01ba43a: 53 push %ebx c01ba43b: 8b 7c 24 14 mov0x14(%esp),%edi c01ba43f: 31 d2 xor%edx,%edx c01ba441: 8b 5f 10 mov0x10(%edi),%ebx c01ba444: 8b 77 14 mov0x14(%edi),%esi c01ba447: 83 7b 10 00 cmpl $0x0,0x10(%ebx) c01ba44b: 0f 88 d3 00 00 00 js c01ba524 c01ba451: 8b 03 mov(%ebx),%eax c01ba453: 0f ba a8 b4 00 00 00 btsl $0x1,0xb4(%eax) c01ba45a: 01 c01ba45b: 83 7e 60 00 cmpl $0x0,0x60(%esi) c01ba45f: 0f 84 bf 00 00 00 je c01ba524 c01ba465: 83 7b 04 00 cmpl $0x0,0x4(%ebx) c01ba469: 75 28 jnec01ba493 c01ba46b: b8 d0 00 00 00 mov$0xd0,%eax c01ba470: e8 36 c7 f6 ff call c0126bab c01ba475: ba f4 ff ff ff mov$0xfff4,%edx c01ba47a: 85 c0 test %eax,%eax c01ba47c: 0f 84 a2 00 00 00 je c01ba524 c01ba482: 89 43 04 mov%eax,0x4(%ebx) c01ba485: c7 43 0c 00 00 00 00 movl $0x0,0xc(%ebx) c01ba48c: c7 43 08 00 00 00 00 movl $0x0,0x8(%ebx) c01ba493: 8b 46 64 mov0x64(%esi),%eax c01ba496: 56 push %esi c01ba497: ff 50 24 call *0x24(%eax) c01ba49a: 89 c5 mov%eax,%ebp c01ba49c: 58 pop%eax c01ba49d: 85 ed test %ebp,%ebp c01ba49f: 75 6d jnec01ba50e c01ba4a1: 83 7c 24 18 00 cmpl $0x0,0x18(%esp) c01ba4a6: 74 36 je c01ba4de c01ba4a8: 6a 00 push $0x0 c01ba4aa: 57 push %edi c01ba4ab: e8 5f 02 00 00 call c01ba70f c01ba4b0: 8b 03 mov(%ebx),%eax c01ba4b2: 8b 40 64 mov0x64(%eax),%eax c01ba4b5: 59 pop%ecx c01ba4b6: 5f pop%edi c01ba4b7: f7 40 08 0f 10 00 00 testl $0x100f,0x8(%eax) c01ba4be: 74 1e je c01ba4de c01ba4c0: 9c pushf c01ba4c1: 5f 
pop%edi c01ba4c2: fa cli c01ba4c3: 8b 46 58 mov0x58(%esi),%eax c01ba4c6: 89 c2
Re: Serial related oops
On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote:
> Russell King wrote:
> > On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> >
> >> (trimmed tie-fei.zang from the CC, added by mistake)
> >> On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> >>
> Neither did I, but introducing printk's through the function, we narrowed
> the problem to this part of the code. And removing it makes the problem
> go away. We inserted 37 printk's in the function body, and Jose bisected
> those until the problem went away.
>
> >>> Well, there's still little clue about why this is causing a NULL pointer
> >>> dereference.  The only thing I can think is that somehow performing
> >>> this test is causing a power glitch to your CPU, causing its registers
> >>> to get corrupted, and which results in it doing a NULL pointer deref.
> >>>
> >> That may be the case, indeed.
> >>
>
> But if the problem was a power glitch I should get Oops with or without
> printk() inserted, shouldn't I?

That depends if the printk() changes the timing such that it doesn't
occur.  Don't know, I'm only grasping at straws due to the lack of any
concrete information.

> >> If you see other tests to be performed...
> >
> > Maybe adding some delays in that bit of code?  I'm sure you've already
> > thought of that though.  Since no one has a proper understanding of the
> > problem, the only suggestions possible are mere shots in the dark.
>
> I'm no kernel expert, but it's not possible to trace what is the
> instruction that is causing the NULL pointer dereference?

The reported dump shows that the kernel tried to access virtual
address 0, and the instruction pointer seems to be the cause of that -
it has a value of zero in that dump.

The call trace indicates that the last function was called from around
"uart_startup+0x63/0xf4" which is probably the indirect function call
to serial8250_startup().  That's unconfirmed - the only way to get it
confirmed is if you could dump the entire uart_startup() function.

$ grep uart_startup System.map
(address) T uart_startup
$ objdump -r -d vmlinux --start-addr=0x --stop-addr=0x

The grep should get you the address of uart_startup.  Replace with
that value and with the value plus 256 (0x100) and mail the result.

> I have no clue on what is causing this problem but, what I know, is
> that I can always reproduce it, and it always happens in the same code
> section of serial8250_startup().

We're both at the same level of clue about the problem then.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
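
[Editor's sketch, not part of the original mail.] The address arithmetic behind a "symbol+0xOFF/0xSIZE" trace entry and the objdump window can be written out in C. The helper names `trace_address()` and `objdump_stop()` and the `struct trace_entry` type are hypothetical; the base address 0xc01ba437 is the start of uart_startup as it appears in the disassembly posted elsewhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* One call-trace entry in the "symbol+0xOFF/0xSIZE" form. */
struct trace_entry {
    uint32_t sym_base;  /* symbol address, as listed in System.map */
    uint32_t offset;    /* 0xOFF: distance from the start of the symbol */
    uint32_t size;      /* 0xSIZE: length of the function in bytes */
};

/* Absolute address a trace entry refers to (a return address on the stack). */
static uint32_t trace_address(struct trace_entry e)
{
    assert(e.offset <= e.size);   /* the offset must lie inside the function */
    return e.sym_base + e.offset;
}

/* Value for objdump's --stop-addr when dumping `window` bytes from the symbol. */
static uint32_t objdump_stop(uint32_t sym_base, uint32_t window)
{
    return sym_base + window;
}
```

With the values from this thread, uart_startup+0x63/0xf4 gives 0xc01ba437 + 0x63 = 0xc01ba49a, the instruction after the indirect call, and a 0x100-byte window ends at 0xc01ba537, which covers the whole 0xf4-byte function.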
Re: Serial related oops
Russell King wrote: > On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote: > >> (trimmed tie-fei.zang from the CC, added by mistake) >> On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote: >> Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away. We inserted 37 printk's in the function body, and Jose bisected those until the problem went away. >>> Well, there's still little clue about why this is causing a NULL pointer >>> dereference. The only thing I can think is that somehow performing >>> this test is causing a power glitch to your CPU, causing its registers >>> to get corrupted, and which results in it doing a NULL pointer deref. >>> >> That may be the case, indeed. >> But if the problem was a power glitch I should get Oops with or without printk() inserted, shouldn't I? >>> Are you saying that the NULL pointer occurred while executing this code? >>> If not, where does the NULL pointer occur? >>> >> The thing is, the NULL pointer deref dissapeared as soon as we >> instrumented (printk'ed) the code. So it's seems to be triggered by >> check+timing+hardware. >> > > So to summarise, we have some code somewhere which is causing a NULL > pointer deref in uart_startup(). If we remove some code, the NULL > pointer deref stops happening. > > And that's about the sum total of the information we know. We don't > know precisely where the NULL pointer deref occurs, and we don't know > what's causing it. > > It doesn't sound like there's much understanding of the problem at hand. ;( > > >>> Andrew's said no (in that the thread you refer to) and suggested an >>> alternative, I've said no, how many more 'no's do you need to turn >>> you away from the wrong approach? >>> >> One is usually sufficient once I've understood :). I missed the module >> option approach. Is it ok with you? If yes, I'll put up a patch to do >> this. 
>> > > I guess so, but how does the user know whether they need this enabled or > disabled? > > >> The problem appears to be reproducible on Jose's hardware within 2-3 days. >> In a kernel without instrumentation I get problems within a 1 day period. >> If you see other tests to be performed... >> > > Maybe adding some delays in that bit of code? I'm sure you've already > thought of that though. Since no one has a proper understanding of the > problem, the only suggestions possible are mere shots in the dark. > I'm no kernel expert, but it's not possible to trace what is the instruction that is causing the NULL pointer dereference? The kernel dump does not show this? I have no clue on what is causing this problem but, what I know, is that I can always reproduce it, and it always happens in the same code section of serial8250_startup(). Regards, José Gonçalves - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops
On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> (trimmed tie-fei.zang from the CC, added by mistake)
> On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > > Neither did I, but introducing printk's through the function, we narrowed
> > > the problem to this part of the code. And removing it makes the problem
> > > go away. We inserted 37 printk's in the function body, and Jose bisected
> > > those until the problem went away.
> >
> > Well, there's still little clue about why this is causing a NULL pointer
> > dereference.  The only thing I can think is that somehow performing
> > this test is causing a power glitch to your CPU, causing its registers
> > to get corrupted, and which results in it doing a NULL pointer deref.
> That may be the case, indeed.
> >
> > Are you saying that the NULL pointer occurred while executing this code?
> > If not, where does the NULL pointer occur?
> The thing is, the NULL pointer deref disappeared as soon as we
> instrumented (printk'ed) the code. So it seems to be triggered by
> check+timing+hardware.

So to summarise, we have some code somewhere which is causing a NULL
pointer deref in uart_startup().  If we remove some code, the NULL
pointer deref stops happening.

And that's about the sum total of the information we know.  We don't
know precisely where the NULL pointer deref occurs, and we don't know
what's causing it.

It doesn't sound like there's much understanding of the problem at
hand. ;(

> > Andrew's said no (in that the thread you refer to) and suggested an
> > alternative, I've said no, how many more 'no's do you need to turn
> > you away from the wrong approach?
> One is usually sufficient once I've understood :). I missed the module
> option approach. Is it ok with you? If yes, I'll put up a patch to do
> this.

I guess so, but how does the user know whether they need this enabled
or disabled?

> The problem appears to be reproducible on Jose's hardware within 2-3 days.
> If you see other tests to be performed...

Maybe adding some delays in that bit of code?  I'm sure you've already
thought of that though.  Since no one has a proper understanding of the
problem, the only suggestions possible are mere shots in the dark.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
(trimmed tie-fei.zang from the CC, added by mistake)
On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > Neither did I, but introducing printk's through the function, we narrowed
> > the problem to this part of the code. And removing it makes the problem
> > go away. We inserted 37 printk's in the function body, and Jose bisected
> > those until the problem went away.
>
> Well, there's still little clue about why this is causing a NULL pointer
> dereference.  The only thing I can think is that somehow performing
> this test is causing a power glitch to your CPU, causing its registers
> to get corrupted, and which results in it doing a NULL pointer deref.

That may be the case, indeed.

>
> Are you saying that the NULL pointer occurred while executing this code?
> If not, where does the NULL pointer occur?

The thing is, the NULL pointer deref disappeared as soon as we
instrumented (printk'ed) the code. So it seems to be triggered by
check+timing+hardware.

>
> > > No, it's only runtime because you can't tell which ports might be
> > > affected, and you might have a mixture of ports which are affected
> > > and those which aren't.
> > Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?
>
> Andrew's said no (in that the thread you refer to) and suggested an
> alternative, I've said no, how many more 'no's do you need to turn
> you away from the wrong approach?

One is usually sufficient once I've understood :). I missed the module
option approach. Is it ok with you? If yes, I'll put up a patch to do
this.

>
> > > > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > > > http://lkml.org/lkml/2006/6/13/21
> > >
> > > I don't see any reference to this problem there.
> >
> > Sorry, I suck, I got that mixed with that one:
> > http://lkml.org/lkml/2006/12/26/63
> > "probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
> > ARM boards"
>
> The "weird effects" were never quantified, so that's one of the reasons
> I ignored that report (another being is that I stopped being the serial
> maintainer a while ago, and now serial is maintainerless.)
>

The problem appears to be reproducible on Jose's hardware within 2-3
days.  If you see other tests to be performed...

Regards,
Frederik
Re: Serial related oops
On Tue, Feb 20, 2007 at 02:24:42PM +, Frederik Deweerdt wrote: > On Mon, Feb 19, 2007 at 01:45:39PM +, Russell King wrote: > > On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote: > > > (Sorry for the resend, I forgot to cc the list) > > > Hi Russell, > > > > > > It seems that the following change in drivers/serial/8250.c > > > > > > + > > > + /* > > > + * Do a quick test to see if we receive an > > > + * interrupt when we enable the TX irq. > > > + */ > > > + serial_outp(up, UART_IER, UART_IER_THRI); > > > + lsr = serial_in(up, UART_LSR); > > > + iir = serial_in(up, UART_IIR); > > > + serial_outp(up, UART_IER, 0); > > > + > > > + if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) { > > > + if (!(up->capabilities & UART_BUG_TXEN)) { > > > + up->capabilities |= UART_BUG_TXEN; > > > + pr_debug("ttyS%d - enabling bad tx status > > > workarounds\n", > > > + port->line); > > > + } > > > + } else { > > > + up->capabilities &= ~UART_BUG_TXEN; > > > + } > > > + > > > > > > that was introduced in 2.6.12[1], is causing oopses on some hardware. In > > > particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible > > > > I don't see that. The oops your referring to is a NULL pointer > > dereference. The only dereferences the above code does is via > > 'up' and 'port' both of which are provably always non-null here. > > Neither did I, but introducing printk's through the function, we narrowed > the problem to this part of the code. And removing it makes the problem > go away. We inserted 37 printk's in the function body, and Jose bisected > those until the problem went away. Well, there's still little clue about why this is causing a NULL pointer dereference. The only thing I can think is that somehow performing this test is causing a power glitch to your CPU, causing its registers to get corrupted, and which results in it doing a NULL pointer deref. Are you saying that the NULL pointer occurred while executing this code? 
If not, where does the NULL pointer occur? > > No, it's only runtime because you can't tell which ports might be > > affected, and you might have a mixture of ports which are affected > > and those which aren't. > Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option? Andrew's said no (in that the thread you refer to) and suggested an alternative, I've said no, how many more 'no's do you need to turn you away from the wrong approach? > > > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in > > > http://lkml.org/lkml/2006/6/13/21 > > > > I don't see any reference to this problem there. > > Sorry, I suck, I got that mixed with that one: > http://lkml.org/lkml/2006/12/26/63 > "probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some > ARM boards" The "weird effects" were never quantified, so that's one of the reasons I ignored that report (another being is that I stopped being the serial maintainer a while ago, and now serial is maintainerless.) -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops
On Mon, Feb 19, 2007 at 01:45:39PM +, Russell King wrote:
> On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
> > (Sorry for the resend, I forgot to cc the list)
> > Hi Russell,
> >
> > It seems that the following change in drivers/serial/8250.c
> >
> > +
> > +    /*
> > +     * Do a quick test to see if we receive an
> > +     * interrupt when we enable the TX irq.
> > +     */
> > +    serial_outp(up, UART_IER, UART_IER_THRI);
> > +    lsr = serial_in(up, UART_LSR);
> > +    iir = serial_in(up, UART_IIR);
> > +    serial_outp(up, UART_IER, 0);
> > +
> > +    if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
> > +        if (!(up->capabilities & UART_BUG_TXEN)) {
> > +            up->capabilities |= UART_BUG_TXEN;
> > +            pr_debug("ttyS%d - enabling bad tx status workarounds\n",
> > +                     port->line);
> > +        }
> > +    } else {
> > +        up->capabilities &= ~UART_BUG_TXEN;
> > +    }
> > +
> >
> > that was introduced in 2.6.12[1], is causing oopses on some hardware. In
> > particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
>
> I don't see that. The oops you're referring to is a NULL pointer
> dereference. The only dereferences the above code does are via
> 'up' and 'port', both of which are provably always non-null here.

Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away.
We inserted 37 printk's in the function body, and Jose bisected those until the problem went away.

> > I was wondering:
> > - what is the goal of the test?
>
> To detect UARTs which do not assert an interrupt when the transmit
> interrupt is enabled, which then causes no data to ever be transmitted
> without this work-around.

OK, thanks for clarifying.

> > - could this be CONFIGed ?
>
> No, it's only runtime because you can't tell which ports might be
> affected, and you might have a mixture of ports which are affected
> and those which aren't.

Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?

> > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > http://lkml.org/lkml/2006/6/13/21
>
> I don't see any reference to this problem there.

Sorry, I suck, I got that mixed up with this one:
http://lkml.org/lkml/2006/12/26/63
"probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some ARM boards"

Regards,
Frederik
Re: Serial related oops
On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
> (Sorry for the resend, I forgot to cc the list)
> Hi Russell,
>
> It seems that the following change in drivers/serial/8250.c
>
> +
> +    /*
> +     * Do a quick test to see if we receive an
> +     * interrupt when we enable the TX irq.
> +     */
> +    serial_outp(up, UART_IER, UART_IER_THRI);
> +    lsr = serial_in(up, UART_LSR);
> +    iir = serial_in(up, UART_IIR);
> +    serial_outp(up, UART_IER, 0);
> +
> +    if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
> +        if (!(up->capabilities & UART_BUG_TXEN)) {
> +            up->capabilities |= UART_BUG_TXEN;
> +            pr_debug("ttyS%d - enabling bad tx status workarounds\n",
> +                     port->line);
> +        }
> +    } else {
> +        up->capabilities &= ~UART_BUG_TXEN;
> +    }
> +
>
> that was introduced in 2.6.12[1], is causing oopses on some hardware. In
> particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible

I don't see that. The oops you're referring to is a NULL pointer dereference. The only dereferences the above code does are via 'up' and 'port', both of which are provably always non-null here.

> I was wondering:
> - what is the goal of the test?

To detect UARTs which do not assert an interrupt when the transmit interrupt is enabled, which then causes no data to ever be transmitted without this work-around.

> - could this be CONFIGed ?

No, it's only runtime because you can't tell which ports might be affected, and you might have a mixture of ports which are affected and those which aren't.

> PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> http://lkml.org/lkml/2006/6/13/21

I don't see any reference to this problem there.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
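The decision made by the quoted test can be sketched in isolation. This is only an illustration, not the driver code: the two bit masks carry the values defined in the kernel's serial_reg.h, while the function name `txen_workaround_needed` is hypothetical and the register values are passed in rather than read from hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in the kernel's <linux/serial_reg.h>. */
#define UART_LSR_TEMT   0x40 /* LSR: transmitter empty */
#define UART_IIR_NO_INT 0x01 /* IIR: set when NO interrupt is pending */

/*
 * Hypothetical helper (not part of the 8250 driver): given the LSR and
 * IIR values sampled right after the TX interrupt was enabled, report
 * whether the UART_BUG_TXEN workaround would be switched on.
 */
bool txen_workaround_needed(uint8_t lsr, uint8_t iir)
{
    /* Transmitter idle, yet the UART claims no interrupt is pending. */
    return (lsr & UART_LSR_TEMT) && (iir & UART_IIR_NO_INT);
}
```

With the values discussed later in the thread, a well-behaved UART reads LSR 0x60 and IIR 0x02, so the helper returns false; an affected UART would read IIR with bit 0 still set (for example 0x01) and the helper returns true.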
Serial related oops
(Sorry for the resend, I forgot to cc the list)

Hi Russell,

It seems that the following change in drivers/serial/8250.c

+
+    /*
+     * Do a quick test to see if we receive an
+     * interrupt when we enable the TX irq.
+     */
+    serial_outp(up, UART_IER, UART_IER_THRI);
+    lsr = serial_in(up, UART_LSR);
+    iir = serial_in(up, UART_IIR);
+    serial_outp(up, UART_IER, 0);
+
+    if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
+        if (!(up->capabilities & UART_BUG_TXEN)) {
+            up->capabilities |= UART_BUG_TXEN;
+            pr_debug("ttyS%d - enabling bad tx status workarounds\n",
+                     port->line);
+        }
+    } else {
+        up->capabilities &= ~UART_BUG_TXEN;
+    }
+

that was introduced in 2.6.12[1], is causing oopses on some hardware. In particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible (after a few days of open()/close() on the serial port). He bisected this to that change (thanks for the long debugging, Jose ;)), and reverting that part of the 2.6.12 git patch seems to fix the problem.

I was wondering:
- what is the goal of the test?
- could this be CONFIGed ?

Regards,
Frederik

PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in http://lkml.org/lkml/2006/6/13/21

[1] http://lkml.org/lkml/2005/6/23/266
[2] http://lkml.org/lkml/2007/1/26/157
Re: Serial related oops
On Tue, Feb 20, 2007 at 02:24:42PM +, Frederik Deweerdt wrote:
> On Mon, Feb 19, 2007 at 01:45:39PM +, Russell King wrote:
> > On Tue, Feb 20, 2007 at 01:29:09PM +, Frederik Deweerdt wrote:
> > > (Sorry for the resend, I forgot to cc the list)
> > > Hi Russell,
> > >
> > > It seems that the following change in drivers/serial/8250.c
> > >
> > > +
> > > +    /*
> > > +     * Do a quick test to see if we receive an
> > > +     * interrupt when we enable the TX irq.
> > > +     */
> > > +    serial_outp(up, UART_IER, UART_IER_THRI);
> > > +    lsr = serial_in(up, UART_LSR);
> > > +    iir = serial_in(up, UART_IIR);
> > > +    serial_outp(up, UART_IER, 0);
> > > +
> > > +    if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
> > > +        if (!(up->capabilities & UART_BUG_TXEN)) {
> > > +            up->capabilities |= UART_BUG_TXEN;
> > > +            pr_debug("ttyS%d - enabling bad tx status workarounds\n",
> > > +                     port->line);
> > > +        }
> > > +    } else {
> > > +        up->capabilities &= ~UART_BUG_TXEN;
> > > +    }
> > > +
> > >
> > > that was introduced in 2.6.12[1], is causing oopses on some hardware. In
> > > particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
> >
> > I don't see that. The oops you're referring to is a NULL pointer
> > dereference. The only dereferences the above code does are via
> > 'up' and 'port', both of which are provably always non-null here.
>
> Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away.
> We inserted 37 printk's in the function body, and Jose bisected those until the problem went away.

Well, there's still little clue about why this is causing a NULL pointer dereference. The only thing I can think is that somehow performing this test is causing a power glitch to your CPU, causing its registers to get corrupted, and which results in it doing a NULL pointer deref.

Are you saying that the NULL pointer occurred while executing this code? If not, where does the NULL pointer occur?

> > No, it's only runtime because you can't tell which ports might be
> > affected, and you might have a mixture of ports which are affected
> > and those which aren't.
>
> Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?

Andrew's said no (in the thread you refer to) and suggested an alternative, I've said no, how many more 'no's do you need to turn you away from the wrong approach?

> > > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > > http://lkml.org/lkml/2006/6/13/21
> >
> > I don't see any reference to this problem there.
>
> Sorry, I suck, I got that mixed up with this one:
> http://lkml.org/lkml/2006/12/26/63
> "probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some ARM boards"

The weird effects were never quantified, so that's one of the reasons I ignored that report (another being that I stopped being the serial maintainer a while ago, and now serial is maintainerless.)

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
(trimmed tie-fei.zang from the CC, added by mistake)

On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away.
> > We inserted 37 printk's in the function body, and Jose bisected those until the problem went away.
>
> Well, there's still little clue about why this is causing a NULL pointer dereference. The only thing I can think is that somehow performing this test is causing a power glitch to your CPU, causing its registers to get corrupted, and which results in it doing a NULL pointer deref.

That may be the case, indeed.

> Are you saying that the NULL pointer occurred while executing this code? If not, where does the NULL pointer occur?

The thing is, the NULL pointer deref disappeared as soon as we instrumented (printk'ed) the code. So it seems to be triggered by check+timing+hardware.

> > > No, it's only runtime because you can't tell which ports might be
> > > affected, and you might have a mixture of ports which are affected
> > > and those which aren't.
> >
> > Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?
>
> Andrew's said no (in the thread you refer to) and suggested an alternative, I've said no, how many more 'no's do you need to turn you away from the wrong approach?

One is usually sufficient once I've understood :). I missed the module option approach. Is it ok with you? If yes, I'll put up a patch to do this.

> > > > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > > > http://lkml.org/lkml/2006/6/13/21
> > >
> > > I don't see any reference to this problem there.
> >
> > Sorry, I suck, I got that mixed up with this one:
> > http://lkml.org/lkml/2006/12/26/63
> > "probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some ARM boards"
>
> The weird effects were never quantified, so that's one of the reasons I ignored that report (another being that I stopped being the serial maintainer a while ago, and now serial is maintainerless.)

The problem appears to be reproducible on Jose's hardware within 2-3 days. If you see other tests to be performed...

Regards,
Frederik
Re: Serial related oops
On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> (trimmed tie-fei.zang from the CC, added by mistake)
> On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > > Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away.
> > > We inserted 37 printk's in the function body, and Jose bisected those until the problem went away.
> >
> > Well, there's still little clue about why this is causing a NULL pointer dereference. The only thing I can think is that somehow performing this test is causing a power glitch to your CPU, causing its registers to get corrupted, and which results in it doing a NULL pointer deref.
>
> That may be the case, indeed.
>
> > Are you saying that the NULL pointer occurred while executing this code? If not, where does the NULL pointer occur?
>
> The thing is, the NULL pointer deref disappeared as soon as we instrumented (printk'ed) the code. So it seems to be triggered by check+timing+hardware.

So to summarise, we have some code somewhere which is causing a NULL pointer deref in uart_startup(). If we remove some code, the NULL pointer deref stops happening. And that's about the sum total of the information we know. We don't know precisely where the NULL pointer deref occurs, and we don't know what's causing it.

It doesn't sound like there's much understanding of the problem at hand. ;(

> > Andrew's said no (in the thread you refer to) and suggested an alternative, I've said no, how many more 'no's do you need to turn you away from the wrong approach?
>
> One is usually sufficient once I've understood :). I missed the module option approach. Is it ok with you? If yes, I'll put up a patch to do this.

I guess so, but how does the user know whether they need this enabled or disabled?

> The problem appears to be reproducible on Jose's hardware within 2-3 days. If you see other tests to be performed...

Maybe adding some delays in that bit of code? I'm sure you've already thought of that though. Since no one has a proper understanding of the problem, the only suggestions possible are mere shots in the dark.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
Russell King wrote:
> On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> > (trimmed tie-fei.zang from the CC, added by mistake)
> > On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > > > Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away.
> > > > We inserted 37 printk's in the function body, and Jose bisected those until the problem went away.
> > >
> > > Well, there's still little clue about why this is causing a NULL pointer dereference. The only thing I can think is that somehow performing this test is causing a power glitch to your CPU, causing its registers to get corrupted, and which results in it doing a NULL pointer deref.
> >
> > That may be the case, indeed.

But if the problem was a power glitch I should get an Oops with or without printk() inserted, shouldn't I?

> > > Are you saying that the NULL pointer occurred while executing this code? If not, where does the NULL pointer occur?
> >
> > The thing is, the NULL pointer deref disappeared as soon as we instrumented (printk'ed) the code. So it seems to be triggered by check+timing+hardware.
>
> So to summarise, we have some code somewhere which is causing a NULL pointer deref in uart_startup(). If we remove some code, the NULL pointer deref stops happening. And that's about the sum total of the information we know. We don't know precisely where the NULL pointer deref occurs, and we don't know what's causing it.
>
> It doesn't sound like there's much understanding of the problem at hand. ;(
>
> > > Andrew's said no (in the thread you refer to) and suggested an alternative, I've said no, how many more 'no's do you need to turn you away from the wrong approach?
> >
> > One is usually sufficient once I've understood :). I missed the module option approach. Is it ok with you? If yes, I'll put up a patch to do this.
>
> I guess so, but how does the user know whether they need this enabled or disabled?
>
> > The problem appears to be reproducible on Jose's hardware within 2-3 days.

In a kernel without instrumentation I get problems within a 1 day period.

> > If you see other tests to be performed...
>
> Maybe adding some delays in that bit of code? I'm sure you've already thought of that though. Since no one has a proper understanding of the problem, the only suggestions possible are mere shots in the dark.

I'm no kernel expert, but isn't it possible to trace which instruction is causing the NULL pointer dereference? Doesn't the kernel dump show this?

I have no clue what is causing this problem but, what I do know, is that I can always reproduce it, and it always happens in the same code section of serial8250_startup().

Regards,
José Gonçalves
Re: Serial related oops
On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote:
> Russell King wrote:
> > On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> > > (trimmed tie-fei.zang from the CC, added by mistake)
> > > On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > > > > Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away.
> > > > > We inserted 37 printk's in the function body, and Jose bisected those until the problem went away.
> > > >
> > > > Well, there's still little clue about why this is causing a NULL pointer dereference. The only thing I can think is that somehow performing this test is causing a power glitch to your CPU, causing its registers to get corrupted, and which results in it doing a NULL pointer deref.
> > >
> > > That may be the case, indeed.
>
> But if the problem was a power glitch I should get an Oops with or without printk() inserted, shouldn't I?

That depends if the printk() changes the timing such that it doesn't occur. Don't know, I'm only grasping at straws due to the lack of any concrete information.

> > > If you see other tests to be performed...
> >
> > Maybe adding some delays in that bit of code? I'm sure you've already thought of that though. Since no one has a proper understanding of the problem, the only suggestions possible are mere shots in the dark.
>
> I'm no kernel expert, but isn't it possible to trace which instruction is causing the NULL pointer dereference?

The reported dump shows that the kernel tried to access virtual address 0, and the instruction pointer seems to be the cause of that - it has a value of zero in that dump.

The call trace indicates that the last function was called from around uart_startup+0x63/0xf4 which is probably the indirect function call to serial8250_startup().

That's unconfirmed - the only way to get it confirmed is if you could dump the entire uart_startup() function.

$ grep uart_startup System.map
(address) T uart_startup
$ objdump -r -d vmlinux --start-addr=0xaddress --stop-addr=0xaddress+256

The grep should get you the address of uart_startup. Replace address with that value and address+256 with the value plus 256 (0x100) and mail the result.

> I have no clue what is causing this problem but, what I do know, is that I can always reproduce it, and it always happens in the same code section of serial8250_startup().

We're both at the same level of clue about the problem then.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
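The address arithmetic in those instructions can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, the sample address 0xc01ba437 is the uart_startup address that appears in the disassembly posted later in the thread, and the objdump options are written in their full spelling (`--start-address`, `--stop-address`).

```c
#include <stdio.h>

/* Window requested in the message: 256 bytes (0x100) past the symbol. */
#define DUMP_WINDOW 0x100UL

/* Hypothetical helper: turn "symbol+offset" from a call trace into an address. */
unsigned long resolve_symbol_offset(unsigned long symbol_addr, unsigned long offset)
{
    return symbol_addr + offset;
}

/* Hypothetical helper: end of the disassembly window for a symbol. */
unsigned long dump_stop_address(unsigned long symbol_addr)
{
    return symbol_addr + DUMP_WINDOW;
}

/*
 * Hypothetical helper: format the objdump command line for a symbol
 * address taken from System.map. Returns the snprintf() result.
 */
int format_objdump_cmd(char *buf, size_t len, unsigned long symbol_addr)
{
    return snprintf(buf, len,
                    "objdump -r -d vmlinux --start-address=0x%lx --stop-address=0x%lx",
                    symbol_addr, dump_stop_address(symbol_addr));
}
```

For uart_startup at 0xc01ba437 the window ends at 0xc01ba537, and the call-trace entry uart_startup+0x63 resolves to 0xc01ba49a, the return address examined later in the thread.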
Re: Serial related oops
Russell King wrote:
> On Mon, Feb 19, 2007 at 04:29:39PM +, Jose Goncalves wrote:
> > Russell King wrote:
> > > On Tue, Feb 20, 2007 at 02:48:14PM +, Frederik Deweerdt wrote:
> > > > (trimmed tie-fei.zang from the CC, added by mistake)
> > > > On Mon, Feb 19, 2007 at 02:35:20PM +, Russell King wrote:
> > > > > > Neither did I, but introducing printk's through the function, we narrowed the problem to this part of the code. And removing it makes the problem go away.
> > > > > > We inserted 37 printk's in the function body, and Jose bisected those until the problem went away.
> > > > >
> > > > > Well, there's still little clue about why this is causing a NULL pointer dereference. The only thing I can think is that somehow performing this test is causing a power glitch to your CPU, causing its registers to get corrupted, and which results in it doing a NULL pointer deref.
> > > >
> > > > That may be the case, indeed.
> >
> > But if the problem was a power glitch I should get Oops with or without printk() inserted, shouldn't I?
>
> That depends if the printk() changes the timing such that it doesn't occur. Don't know, I'm only grasping at straws due to the lack of any concrete information.
>
> > > > If you see other tests to be performed...
> > >
> > > Maybe adding some delays in that bit of code? I'm sure you've already thought of that though. Since no one has a proper understanding of the problem, the only suggestions possible are mere shots in the dark.
> >
> > I'm no kernel expert, but it's not possible to trace what is the instruction that is causing the NULL pointer dereference?
>
> The reported dump shows that the kernel tried to access virtual address 0, and the instruction pointer seems to be the cause of that - it has a value of zero in that dump.
>
> The call trace indicates that the last function was called from around uart_startup+0x63/0xf4 which is probably the indirect function call to serial8250_startup().
>
> That's unconfirmed - the only way to get it confirmed is if you could dump the entire uart_startup() function.
>
> $ grep uart_startup System.map
> (address) T uart_startup
> $ objdump -r -d vmlinux --start-addr=0xaddress --stop-addr=0xaddress+256
>
> The grep should get you the address of uart_startup. Replace address with that value and address+256 with the value plus 256 (0x100) and mail the result.

Result is attached.

José Gonçalves

vmlinux-2.6.16.38-mtm4-debug: file format elf32-i386

Disassembly of section .text:

c01ba437 <uart_startup>:
c01ba437:  55                    push   %ebp
c01ba438:  57                    push   %edi
c01ba439:  56                    push   %esi
c01ba43a:  53                    push   %ebx
c01ba43b:  8b 7c 24 14           mov    0x14(%esp),%edi
c01ba43f:  31 d2                 xor    %edx,%edx
c01ba441:  8b 5f 10              mov    0x10(%edi),%ebx
c01ba444:  8b 77 14              mov    0x14(%edi),%esi
c01ba447:  83 7b 10 00           cmpl   $0x0,0x10(%ebx)
c01ba44b:  0f 88 d3 00 00 00     js     c01ba524 <uart_startup+0xed>
c01ba451:  8b 03                 mov    (%ebx),%eax
c01ba453:  0f ba a8 b4 00 00 00  btsl   $0x1,0xb4(%eax)
c01ba45a:  01
c01ba45b:  83 7e 60 00           cmpl   $0x0,0x60(%esi)
c01ba45f:  0f 84 bf 00 00 00     je     c01ba524 <uart_startup+0xed>
c01ba465:  83 7b 04 00           cmpl   $0x0,0x4(%ebx)
c01ba469:  75 28                 jne    c01ba493 <uart_startup+0x5c>
c01ba46b:  b8 d0 00 00 00        mov    $0xd0,%eax
c01ba470:  e8 36 c7 f6 ff        call   c0126bab <get_zeroed_page>
c01ba475:  ba f4 ff ff ff        mov    $0xfffffff4,%edx
c01ba47a:  85 c0                 test   %eax,%eax
c01ba47c:  0f 84 a2 00 00 00     je     c01ba524 <uart_startup+0xed>
c01ba482:  89 43 04              mov    %eax,0x4(%ebx)
c01ba485:  c7 43 0c 00 00 00 00  movl   $0x0,0xc(%ebx)
c01ba48c:  c7 43 08 00 00 00 00  movl   $0x0,0x8(%ebx)
c01ba493:  8b 46 64              mov    0x64(%esi),%eax
c01ba496:  56                    push   %esi
c01ba497:  ff 50 24              call   *0x24(%eax)
c01ba49a:  89 c5                 mov    %eax,%ebp
c01ba49c:  58                    pop    %eax
c01ba49d:  85 ed                 test   %ebp,%ebp
c01ba49f:  75 6d                 jne    c01ba50e <uart_startup+0xd7>
c01ba4a1:  83 7c 24 18 00        cmpl   $0x0,0x18(%esp)
c01ba4a6:  74 36                 je     c01ba4de <uart_startup+0xa7>
c01ba4a8:  6a 00                 push   $0x0
c01ba4aa:  57                    push   %edi
c01ba4ab:  e8 5f 02 00 00        call   c01ba70f <uart_change_speed>
c01ba4b0:  8b 03                 mov    (%ebx),%eax
c01ba4b2:  8b 40 64              mov    0x64(%eax),%eax
c01ba4b5:  59                    pop    %ecx
c01ba4b6:  5f                    pop    %edi
c01ba4b7:  f7 40 08 0f 10 00 00  testl  $0x100f,0x8(%eax)
c01ba4be:  74 1e                 je     c01ba4de <uart_startup+0xa7>
c01ba4c0:  9c                    pushf
c01ba4c1:  5f                    pop    %edi
c01ba4c2:  fa                    cli
c01ba4c3:  8b 46 58              mov
Re: Serial related oops
What we've seen on our embedded ARM is that enabling an interrupt that is shared between multiple UARTs, at a stage when you have not set up all the data structures touched by the ISR and softirq, can have horrible consequences, including soft lockups and fandangos on core. You will be vulnerable to this unless you lock out the interrupt source (at the interrupt controller or, if you have to, globally) across the UART registration process in your platform's arch/mach-dependent core.c, in which case the TX irq test will of course fail.

Roll-your-own SoC UARTs with bugs or extended features in IRQ enabling and delivery make things worse. I would love to see this disentangled in a maintainable way. It's such a nasty problem (especially given that bootloaders and early boot code frequently turn on one or more UARTs and leave them in an unknown state) that all we've been able to do so far is hack around it. I'll send an example patch when we've more or less isolated it, but it will be of limited use to you unless you have the exact set of UART warpage we do.

Cheers,
- Michael
Re: Serial related oops
On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> What we've seen on our embedded ARM is that enabling an interrupt that is shared between multiple UARTs, at a stage when you have not set up all the data structures touched by the ISR and softirq, can have horrible consequences, including soft lockups and fandangos on core.

Incorrect. We have:

1. registered an interrupt handler at this point.
2. disabled interrupts (we're under the spin lock)

So, no interrupt will be seen by the CPU since the interrupt is masked.

The test is intentionally designed to be safe from the interrupt generation point of view.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
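On x86 the "disabled interrupts" state mentioned above shows up as the IF bit of the flags register, which is what the pushf / pop / cli sequence examined later in the thread saves and then clears. The sketch below only models that bit manipulation in user space; the helper names are hypothetical, and the sample value 0x202 is the saved flags value (ebp) reported in the oops.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_FIXED 0x0002u /* bit 1: reserved, always reads as 1 */
#define X86_EFLAGS_IF    0x0200u /* bit 9: interrupt enable flag */

/* Hypothetical helper: were interrupts enabled when the flags were saved? */
bool irqs_were_enabled(uint32_t saved_flags)
{
    return (saved_flags & X86_EFLAGS_IF) != 0;
}

/* Hypothetical helper: model the effect of `cli` on the live flags. */
uint32_t flags_after_cli(uint32_t flags)
{
    return flags & ~X86_EFLAGS_IF;
}
```

A saved value of 0x202 therefore means interrupts were enabled before the lock was taken, and after the cli only the always-set bit 1 remains (0x002), so the CPU takes no maskable interrupt until the saved flags are restored.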
Re: Serial related oops
On Mon, Feb 19, 2007 at 05:54:52PM +, Jose Goncalves wrote: Russell King wrote: Result is attached. Right... in depth analysis follows. [15423.650518] [] uart_startup+0x63/0xf4 equates to 0xc01ba49a, which is indeed the instruction after the call to port-ops-startup. The important code leading up to this is: c01ba437 uart_startup: c01ba437: 55 push %ebp c01ba438: 57 push %edi c01ba439: 56 push %esi c01ba43a: 53 push %ebx c01ba43b: 8b 7c 24 14 mov0x14(%esp),%edi @ load state c01ba43f: 31 d2 xor%edx,%edx c01ba441: 8b 5f 10mov0x10(%edi),%ebx @ load state-info c01ba444: 8b 77 14mov0x14(%edi),%esi @ load state-port c01ba493: 8b 46 64mov0x64(%esi),%eax @ load port-ops c01ba496: 56 push %esi @ push port onto stack c01ba497: ff 50 24call *0x24(%eax) @ ops-startup(port) c01bd74b serial8250_startup: c01bd74b: 55 push %ebp c01bd74c: 57 push %edi c01bd74d: 56 push %esi c01bd74e: 53 push %ebx c01bd74f: 8b 5c 24 14 mov0x14(%esp),%ebx Comparing this with the stack dump: Stack: c02fae70 0005 c74304e0 - %ebx (pushed by serial8250_startup) c02fae70 - %esi (pushed by serial8250_startup) c128d5e4 - %edi (pushed by serial8250_startup) c7a69a80 - %ebp (pushed by serial8250_startup) c01ba49a - uart_startup+0x63/0xf4 (pushed by function called by ops-startup, iow serial8250_startup) c02fae70 - pushed on by push %esi at c01ba496, this is port c128d5e4 - %ebx (pushed by uart_startup) - %esi (pushed by uart_startup) c7a69a80 - %edi (pushed by uart_startup) c7a69a80 - %ebp (pushed by uart_startup) c01bbaa0 - probably uart_open+0xaa/0xec Once the instruction at c01bd74f completes, we have pushed into the stack the structure commented above, but not the first two uncommented values. %ebx contains the value of port at this point. We're looking for some place in the code which pushes a value of '5' and '%ebx' on to the stack, and the CPUs registers contain values which correspond with the values provided in your oops. The code corresponding with the buggy uart check is as follows. 
Comments interspersed: c01bd910: 9c pushf c01bd911: 5d pop%ebp c01bd912: fa cli This code pushes the processors flag register onto the stack, pops it off into the %ebp register, and then disables interrupts. Your oops dump contained ebp: 0202 which is a reasonable value for x86 processors flags, which have been saved into the ebp register by the above code sequence (according to Wikipedia). c01bd937: ff 73 58pushl 0x58(%ebx) c01bd93a: 53 push %ebx c01bd93b: e8 7f fd ff ff call c01bd6bf serial8250_set_mctrl c01bd940: 6a 02 push $0x2 c01bd942: 6a 01 push $0x1 c01bd944: 53 push %ebx c01bd945: e8 f6 eb ff ff call c01bc540 serial_out @ write IER c01bd94a: 6a 05 push $0x5 c01bd94c: 53 push %ebx c01bd94d: e8 a6 eb ff ff call c01bc4f8 serial_in @ reads LSR c01bd952: 89 c7 mov%eax,%edi @ saves result in %edi c01bd954: 6a 02 push $0x2 c01bd956: 53 push %ebx c01bd957: e8 9c eb ff ff call c01bc4f8 serial_in @ reads IIR c01bd95c: 83 c4 24add$0x24,%esp c01bd95f: 89 c6 mov%eax,%esi @ saves result in %esi This is the code corresponding with part of the buggy uart check - you can see the call to serial8250_set_mctrl() there which confirms this. The sequence at c01bd94a pushes 5 and port (%ebx) onto the stack, but this isn't the right place because before this we pushed 2, 1, and port on the stack, and those are not present in the stack dump. However, the reason for showing this is that a little while later, we have: c01bd96e: 83 e7 40and$0x40,%edi c01bd971: 74 1c je c01bd98f serial8250_startup+0x244 c01bd973: 83 e6 01and$0x1,%esi c01bd976: 74 17 je c01bd98f serial8250_startup+0x244 The normal value we would read from the LSR (stored in %edi) would be 0x60, and if a transmit interrupt was pending (which is what the test is trying to find out) the IIR (%esi) would be 0x02. 
The above code sequence which involves masking these values would therefore give:

  0x40 & 0x60 (%edi) -> 0x40 in %edi
  0x01 & 0x02 (%esi) -> 0x00 in %esi

From your oops dump edi: 0040 and esi: - that ties up, so we know that the place we got to must be after this point.

We eventually come to this sequence. The words previously pushed onto the stack have been removed at this point, and %ebp,
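
The arithmetic in the analysis above can be re-checked mechanically. What follows is a minimal sketch that uses only the addresses, opcode bytes and register values quoted in the mail; the helper names are hypothetical, and the constant names are borrowed from the usual 8250 register and x86 flag definitions for readability.

```python
# Values below are copied from the disassembly and oops dump quoted above.
UART_LSR_TEMT = 0x40    # LSR bit 6: transmitter empty
UART_IIR_NO_INT = 0x01  # IIR bit 0: set when no interrupt is pending
X86_EFLAGS_IF = 0x200   # EFLAGS bit 9: interrupts enabled


def symbol_address(base, offset):
    """Resolve 'symbol+offset' from an oops line to an absolute address."""
    return base + offset


def rel32_call_target(insn_addr, disp_bytes):
    """Target of a 5-byte x86 'e8 rel32' call located at insn_addr."""
    disp = int.from_bytes(disp_bytes, "little", signed=True)
    return (insn_addr + 5 + disp) & 0xFFFFFFFF


def tx_test_masks(lsr, iir):
    """Apply the two 'and' instructions at c01bd96e and c01bd973."""
    return lsr & UART_LSR_TEMT, iir & UART_IIR_NO_INT


if __name__ == "__main__":
    # uart_startup+0x63
    print(hex(symbol_address(0xC01BA437, 0x63)))
    # call at c01bd93b with bytes e8 7f fd ff ff
    print(hex(rel32_call_target(0xC01BD93B, bytes.fromhex("7ffdffff"))))
    # LSR = 0x60, IIR = 0x02 as assumed in the analysis
    print([hex(v) for v in tx_test_masks(0x60, 0x02)])
    # saved flags value 0x202 has the interrupt flag set
    print(bool(0x202 & X86_EFLAGS_IF))
```

The same `rel32_call_target` helper can be pointed at the other `e8` calls in the listing to confirm that they land on serial_out and serial_in.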
Re: Serial related oops
On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
> On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> > What we've seen on our embedded ARM is that enabling an interrupt that is shared between multiple UARTs, at a stage when you have not set up all the data structures touched by the ISR and softirq, can have horrible consequences, including soft lockups and fandangos on core.
>
> Incorrect. We have:
>
> 1. registered an interrupt handler at this point.
> 2. disabled interrupts (we're under the spin lock)

setup_irq() is where things go wrong, at least for us, at least on 2.6.16.x. Interrupts are not disabled at the point in request_irq() when the interrupt controller is poked to enable the IRQ source. If you're lucky, and you're on an architecture where the UART interrupt is properly level-triggered, and the worst thing that happens when you attempt to service an interrupt that isn't yours is that it stays on, then you get a soft lockup with two or three recursive __irq_svc hits in the backtrace. If you're not lucky you do a fandango on core.

> So, no interrupt will be seen by the CPU since the interrupt is masked.

The interrupt would need to be masked for the entire duration of the outer loop that calls serial8250_init() or the equivalent for all platform devices that share the IRQ.

> The test is intentionally designed to be safe from the interrupt generation point of view.

But its context is not. Shared IRQ lines are a _problem_. You cannot safely enable an IRQ until all devices that share it have had their ISRs installed, unless you can absolutely guarantee at a hardware level that the uninitialized ones cannot assert the IRQ line. That does not apply to any device that might have been touched by the bootloader or the early init code, especially a UART.
Cheers,
- Michael
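
The scenario described in this message can be modeled in a few lines. This is a toy sketch under stated assumptions, not kernel code: `Device`, `SharedLine` and `deliveries_until_quiet` are hypothetical names, and the model deliberately ignores masking at the interrupt controller and at the CPU. It only shows why a level-triggered line stays asserted while the device asserting it has no handler in the chain.

```python
class Device:
    """A device that can assert a shared, level-triggered IRQ line."""

    def __init__(self, name):
        self.name = name
        self.asserting = False

    def isr(self):
        """Service the device; returns True if it had work to do."""
        if not self.asserting:
            return False          # "not mine"
        self.asserting = False    # servicing the device drops its request
        return True


class SharedLine:
    def __init__(self, devices):
        self.devices = devices    # everything wired to the line
        self.handlers = []        # ISRs installed so far

    def level(self):
        """The line is active while any attached device asserts it."""
        return any(d.asserting for d in self.devices)

    def dispatch(self):
        """Run the installed handler chain once; True if any ISR claimed it."""
        handled = False
        for isr in self.handlers:
            handled = isr() or handled
        return handled


def deliveries_until_quiet(line, limit=5):
    """Count how often the line fires before it goes quiet (capped at limit)."""
    count = 0
    while line.level() and count < limit:
        line.dispatch()
        count += 1
    return count


if __name__ == "__main__":
    uart0, uart1 = Device("uart0"), Device("uart1")
    line = SharedLine([uart0, uart1])
    line.handlers.append(uart0.isr)      # only the first UART has an ISR
    uart1.asserting = True               # the second UART raises the line
    print(deliveries_until_quiet(line))  # hits the cap: nobody can clear it
    line.handlers.append(uart1.isr)
    print(deliveries_until_quiet(line))  # one delivery, then the line is quiet
```

Whether real hardware reaches this state depends on the device actually asserting its interrupt before its handler is installed, which is the point under dispute in the thread.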
Re: Serial related oops
On Mon, Feb 19, 2007 at 01:24:17PM -0800, Michael K. Edwards wrote:
> On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
> > On Mon, Feb 19, 2007 at 12:37:00PM -0800, Michael K. Edwards wrote:
> > > What we've seen on our embedded ARM is that enabling an interrupt that is shared between multiple UARTs, at a stage when you have not set up all the data structures touched by the ISR and softirq, can have horrible consequences, including soft lockups and fandangos on core.
> >
> > Incorrect. We have:
> >
> > 1. registered an interrupt handler at this point.
> > 2. disabled interrupts (we're under the spin lock)
>
> setup_irq() is where things go wrong, at least for us, at least on 2.6.16.x. Interrupts are not disabled at the point in request_irq() when the interrupt controller is poked to enable the IRQ source. If you're lucky, and you're on an architecture where the UART interrupt is properly level-triggered, and the worst thing that happens when you attempt to service an interrupt that isn't yours is that it stays on, then you get a soft lockup with two or three recursive __irq_svc hits in the backtrace. If you're not lucky you do a fandango on core.

That should not happen if your interrupt handling is correct - okay, you might get an interrupt at that point, but while servicing that interrupt the source will be disabled on the interrupt controller.

You should _never_ _ever_ get recursive interrupts for the same interrupt source. Ever. If you do, your platform's interrupt handling is seriously buggy.

> But its context is not. Shared IRQ lines are a _problem_. You cannot safely enable an IRQ until all devices that share it have had their ISRs installed, unless you can absolutely guarantee at a hardware level that the uninitialized ones cannot assert the IRQ line.

Linux assumes that all interrupt sources on a shared IRQ line are disabled at the point in time when the kernel boots.
When a device is to be used, an interrupt handler is installed and then the kernel will enable the interrupt on the device, not before.

We follow that rule in the 8250 driver - in fact, when we initialise we ensure that interrupts are disabled on any devices we find.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
> > setup_irq() is where things go wrong, at least for us, at least on 2.6.16.x. Interrupts are not disabled at the point in request_irq() when the interrupt controller is poked to enable the IRQ source. If you're lucky, and you're on an architecture where the UART interrupt is properly level-triggered, and the worst thing that happens when you attempt to service an interrupt that isn't yours is that it stays on, then you get a soft lockup with two or three recursive __irq_svc hits in the backtrace. If you're not lucky you do a fandango on core.
>
> That should not happen if your interrupt handling is correct - okay, you might get an interrupt at that point, but while servicing that interrupt the source will be disabled on the interrupt controller.

Right. But as soon as you turn the source back on, in the postamble of the interrupt dispatch handler, it fires again. At least on ARM, that gives you recursive hits to __irq_svc and a couple of nested calls within it. Here's a backtrace (embedded in a chat log with some commentary):

6:42 PM me: we have definitely confirmed that the serial ISR is failing to clear the interrupt and the (presumably level-triggered) IRQ is firing again on exit from the ISR.
6:43 PM The reason that __do_softirq is usually the last function entrypoint in the backtrace before the __irq_svc associated with the timer is that it is the first place where interrupts are enabled during the IRQ dispatcher postamble.
6:44 PM Here is a backtrace from a case where the timer interrupt hit during the perpetually firing ISR instead of during the dispatch code surrounding it (which is not visible in backtraces)
6:45 PM
[ 54.23] Freeing init memory: 92K
[ 52.24] rcu_do_batch: rcu node is 0xC03D7540, callback is 0xC00864C8
[ 52.24] rcu_do_batch: rcu node is 0xC02CCDA0, callback is 0xC006E7E4
[ 52.25] rcu_do_batch: rcu node is 0xC03D7730, callback is 0xC00864C8
[ 52.26] rcu_do_batch: rcu node is 0xC03D7920, callback is 0xC00864C8
[ 51.24] BUG: soft lockup detected on CPU#0!
[ 52.24] [c0025834] (dump_stack+0x0/0x14) from [c0050e40] (softlockup_tick+0xa8/0xe8)
[ 52.24] [c0050d98] (softlockup_tick+0x0/0xe8) from [c003bb18] (run_local_timers+0x18/0x1c)
[ 52.24] r8 = 00010105 r7 = 0005 r6 = r5 =
[ 52.24] r4 = C0299B40
[ 52.24] [c003bb00] (run_local_timers+0x0/0x1c) from [c003bdec] (update_process_times+0x50/0x7c)
[ 52.24] [c003bd9c] (update_process_times+0x0/0x7c) from [c0024f24] (timer_tick+0xc4/0xe0)
[ 52.24] r6 = r5 = C029DB48 r4 = C029DB48
[ 52.24] [c0024e60] (timer_tick+0x0/0xe0) from [c002a79c] (mv88w8xx8_timer_interrupt+0x30/0x68)
[ 52.24] r6 = r5 = C029DB48 r4 = C024775C
[ 52.24] [c002a76c] (mv88w8xx8_timer_interrupt+0x0/0x68) from [c0020c84] (__do_irq+0xf0/0x140)
[ 52.24] r5 = r4 = C0204280
6:46 PM
[ 52.24] [c0020b94] (__do_irq+0x0/0x140) from [c0020f48] (do_level_IRQ+0x70/0xc8)
[ 52.24] [c0020ed8] (do_level_IRQ+0x0/0xc8) from [c00212b8] (asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DB48 r5 = C0240E24 r4 = 0005
[ 52.24] [c0021268] (asm_do_IRQ+0x0/0x134) from [c001f978] (__irq_svc+0x38/0x190)
[ 52.24] r6 = 0020 r5 = C029DB7C r4 =
[ 52.24] [c0020b94] (__do_irq+0x0/0x140) from [c0020f48] (do_level_IRQ+0x70/0xc8)
[ 52.24] [c0020ed8] (do_level_IRQ+0x0/0xc8) from [c00212b8] (asm_do_IRQ+0x50/0x134)
[ 52.24] r6 = C029DBFC r5 = C0240F5C r4 = 000B
[ 52.24] [c0021268] (asm_do_IRQ+0x0/0x134) from [c001f978] (__irq_svc+0x38/0x190)
[ 52.24] r6 = 0800 r5 = C029DC30 r4 =
[ 52.24] [c0036d68] (__do_softirq+0x0/0xd8) from [c00370e0] (irq_exit+0x48/0x5c)
[ 52.24] r6 = C029DC94 r5 = C0240E24 r4 = 0005
[ 52.24] [c0037098] (irq_exit+0x0/0x5c) from [c0021384] (asm_do_IRQ+0x11c/0x134)
[ 52.24] [c0021268] (asm_do_IRQ+0x0/0x134) from [c001f978] (__irq_svc+0x38/0x190)
[ 52.24] r6 = 0820 r5 = C029DCC8 r4 =
[ 52.24] [c0020968] (setup_irq+0x0/0x15c) from [c0020b68] (request_irq+0xa4/0xd0)
[ 52.24] r7 = r6 = r5 = 000B r4 = C0C1B5C0
[ 52.24] [c0020ac4] (request_irq+0x0/0xd0) from [c0100c24] (serial_link_irq_chain+0x264/0x2a0)
[ 52.24] [c01009c0] (serial_link_irq_chain+0x0/0x2a0) from [c0101558] (serial8250_startup+0x2f4/0x4f0)
[ 52.24] [c0101264] (serial8250_startup+0x0/0x4f0) from [c00f85ec] (uart_startup+0x164/0x48c)
[ 52.24] [c00f8488] (uart_startup+0x0/0x48c) from [c00fca98] (uart_open+0x1a8/0x238)
[ 52.24] [c00fc8f0] (uart_open+0x0/0x238) from [c00f2160] (tty_open+0x1cc/0x390)
[ 52.24] [c00f1f94] (tty_open+0x0/0x390) from [c0077e44] (chrdev_open+0x1e4/0x220)
[ 52.24] [c0077c60] (chrdev_open+0x0/0x220) from [c006ba58] (__dentry_open+0x13c/0x294)
[ 52.24] r8 = C028E2A0 r7 = C0077C60 r6 = C0C29B94 r5 =
[ 52.24] r4 = C02CC300
[ 52.24] [c006b91c]
Re: Serial related oops
On Mon, Feb 19, 2007 at 02:16:41PM -0800, Michael K. Edwards wrote:
> Right. But as soon as you turn the source back on, in the postamble of the interrupt dispatch handler, it fires again. At least on ARM, that gives you recursive hits to __irq_svc and a couple of nested calls within it.

I think something else is going on here. I think you're getting an interrupt for the UART, and another interrupt is also pending.

When the UART interrupt is handled, it is masked at the interrupt controller, and the CPU mask is dropped. The second interrupt comes in, and when you go to disable that source, you inadvertently re-enable the UART interrupt, despite it still being serviced.

This leads to the UART interrupt again triggering an IRQ.

Please show your interrupt controller (mask, unmask, mask_ack) handling functions corresponding with the interrupt which your UART is connected to.

> > > But its context is not. Shared IRQ lines are a _problem_. You cannot safely enable an IRQ until all devices that share it have had their ISRs installed, unless you can absolutely guarantee at a hardware level that the uninitialized ones cannot assert the IRQ line.
> >
> > Linux assumes that all interrupt sources on a shared IRQ line are disabled at the point in time when the kernel boots. When a device is to be used, an interrupt handler is installed and then the kernel will enable the interrupt on the device, not before.
>
> Linux assumes incorrectly in this instance. It would improve the kernel if all drivers' __init code were refactored into an IRQ-discovery-ISR-installation pass, followed by a chip-reset-data-structure-initialization pass, followed by a chip-configuration-driver-activation pass. This is unlikely to happen overnight.

This shows that you don't actually have an understanding of the Linux kernel boot, especially in respect of serial devices.

At boot, devices are detected and initialised to a safe state, where they will not spuriously generate interrupts.
When a userspace program opens a serial port, which can only happen once the kernel boot has completed (ergo, devices have been initialised and placed in a safe state) the interrupts are claimed, and enabled at the source.

> In the meantime, weird UART states on entry into platform_device_init are a reality.

Yes, uart states are indeterminate at this point. However, as soon as the 8250 driver loads it takes control of the 8250 ports, and DISABLES the interrupt on ALL ports found, LONG BEFORE any service handlers are installed.

So, by the time the system is up and running _all_ 8250 ports have had their IERs written with zero. Interrupts disabled at source.

By the time you get to open any serial port, the initialisation has completed.

> > We follow that rule in the 8250 driver - in fact, when we initialise we ensure that interrupts are disabled on any devices we find.
>
> No, you rely on the caller of serial8250_init to have punctured the abstraction

Can you add any other useless complex words into that sentence?

> and forced any and all UARTs to a state where they cannot possibly generate an IRQ.

That is being done already at initialisation time.

Now, please show your interrupt mask/unmask/mask_ack code, which is where I believe your problem to lie.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
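
The failure mode suspected in this message, where masking one source inadvertently re-enables another, can be illustrated with a toy controller. This is a hypothetical sketch: the thread never shows the platform's real mask/unmask code, and the `Controller` class, its register layout and the IRQ numbers are all made up for the example.

```python
class Controller:
    """Toy interrupt controller: bit n of `enabled` set means IRQ n is unmasked."""

    def __init__(self, nr_irqs=32):
        self.all_bits = (1 << nr_irqs) - 1
        self.enabled = self.all_bits

    def mask_buggy(self, irq):
        # Rebuilds the whole register from scratch, so any source that was
        # masked earlier is silently unmasked again.
        self.enabled = self.all_bits & ~(1 << irq)

    def mask(self, irq):
        # Read-modify-write: only the requested bit changes.
        self.enabled &= ~(1 << irq)

    def unmask(self, irq):
        self.enabled |= 1 << irq

    def is_enabled(self, irq):
        return bool(self.enabled & (1 << irq))


if __name__ == "__main__":
    UART_IRQ, OTHER_IRQ = 11, 5   # arbitrary numbers for the illustration

    bad = Controller()
    bad.mask(UART_IRQ)            # UART interrupt is being serviced
    bad.mask_buggy(OTHER_IRQ)     # second interrupt arrives and is masked
    print(bad.is_enabled(UART_IRQ))   # UART source is live again

    good = Controller()
    good.mask(UART_IRQ)
    good.mask(OTHER_IRQ)
    print(good.is_enabled(UART_IRQ))  # UART source stays masked
```

With the buggy variant a level-triggered UART interrupt that is still pending would fire again as soon as the CPU-level mask is dropped, which is the kind of recursion the backtrace appears to show.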
Re: Serial related oops
On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
> I think something else is going on here. I think you're getting an interrupt for the UART, and another interrupt is also pending.

Correct. An interrupt for the other UART on the same IRQ.

> When the UART interrupt is handled, it is masked at the interrupt controller, and the CPU mask is dropped.

Correct.

> The second interrupt comes in, and when you go to disable that source, you inadvertently re-enable the UART interrupt, despite it still being serviced.

Incorrect. An attempt has been made to service the interrupt using the only ISR currently in the chain for that IRQ -- the ISR for the first UART. That attempt was not successful, and when __do_irq unmasks the interrupt source preparatory to exiting interrupt context, __irq_svc is dispatched anew.

> This leads to the UART interrupt again triggering an IRQ.

Right. The _second_ UART's interrupt. There's another problem with these UARTs having to do with the implementor's inability to read and follow a bog-standard twenty-year-old spec without asking software to fix up corner cases, but that's another backtrace for another day.

> Please show your interrupt controller (mask, unmask, mask_ack) handling functions corresponding with the interrupt which your UART is connected to.

Don't have 'em handy; I'll be happy to post them when I do, perhaps later today. I would hope they're pretty generic, though; it's a Feroceon core pretending to be an ARM926EJ-S, hooked to the usual half-assed Marvell imitation of an ARM licensed functional block. Trust me for the moment, it's the same IRQ line.

> This shows that you don't actually have an understanding of the Linux kernel boot, especially in respect of serial devices. At boot, devices are detected and initialised to a safe state, where they will not spuriously generate interrupts.

Sorry, 'taint so. Not unless the chip support droid has put the right stuff in arch/arm/mach-foo.
LKML is littered with the fall-out of the decision to trust whoever jumped to main() to have left the hardware in a sane state. If you don't enjoy this sort of forensics (which I for one do not, especially not when there is a project deadline looming and a Heisenbug starts firing 9 times out of 10), you might consider systematically installing ISRs that know how to shut everything up before turning on any interrupt sources at all.

As I said, this is not going to happen overnight, and is not even particularly in the economic interest of people who get paid by the hour to wear bringup wizard hats. That category currently includes me, but I am intensely bored with this game and aspire to greater things.

> When a userspace program opens a serial port, which can only happen once the kernel boot has completed (ergo, devices have been initialised and placed in a safe state) the interrupts are claimed, and enabled at the source.

As you can see from the console dump I posted (which begins with "Freeing init memory: 92K" and ends with do_exit -> init -> sys_open, which is obviously sys_open(/dev/console)), this happens long before userspace comes into the picture. Our 8250.c has some nasty hacks in it but otherwise this call chain is from a very nearly vanilla 2.6.16.recent.

We've already worked around this on our board, and the whole kit and kaboodle will eventually be posted to linux-arm-kernel in tidy patches when my client lets me spend billable hours on it (immediately after the damn thing passes its first functional test, long before it ships). I'm not asking for anyone's help except in the let's-all-help-one-another spirit.

I'm trying to help with root cause analysis of Frederik's (Jose's?) fandango on core. If it's not relevant, my apologies; and although it goes without saying, I salute you for both the serial driver and the ARM port. Now please take a second look at the backtrace before toasting me lightly again. Mmm'kay?
Oh, and by the way -- is there an Alt-SysRq equivalent on an ARM serial console?

Cheers,
- Michael
Re: Serial related oops
On Mon, Feb 19, 2007 at 04:04:26PM -0800, Michael K. Edwards wrote:
> On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
> > The second interrupt comes in, and when you go to disable that source, you inadvertently re-enable the UART interrupt, despite it still being serviced.
>
> Incorrect. An attempt has been made to service the interrupt using the only ISR currently in the chain for that IRQ -- the ISR for the first UART. That attempt was not successful, and when __do_irq unmasks the interrupt source preparatory to exiting interrupt context, __irq_svc is dispatched anew.

This can't happen because when __do_irq unmasks the interrupt source, the CPU mask is set, thereby preventing any further interrupt exceptions being taken. This is done precisely to prevent this situation happening.

If you are seeing recursion for the same interrupt (two or more stack frames containing asm_do_IRQ for that very same IRQ) then your interrupt handling is buggy, plain and simple.

> > Please show your interrupt controller (mask, unmask, mask_ack) handling functions corresponding with the interrupt which your UART is connected to.
>
> Don't have 'em handy; I'll be happy to post them when I do, perhaps later today. I would hope they're pretty generic, though; it's a Feroceon core pretending to be an ARM926EJ-S, hooked to the usual half-assed Marvell imitation of an ARM licensed functional block. Trust me for the moment, it's the same IRQ line.

I don't doubt that it is on the same IRQ line - I have such setups here and it works perfectly - multiple 8250 UARTs connected to a single level-triggered interrupt input which also happens to be shared with a SCSI host chip as well. Absolutely no problems.

> If you don't enjoy this sort of forensics (which I for one do not, especially not when there is a project deadline looming and a Heisenbug starts firing 9 times out of 10), you might consider systematically installing ISRs that know how to shut everything up before turning on any interrupt sources at all.
I still say that your understanding is completely flawed. Moreover, you haven't read what I've said about the ordering of initialisation, the stress on when we disable interrupts for the ports, etc.

> I'm not asking for anyone's help except in the let's-all-help-one-another spirit.

You're actually *not* helping. You're causing utter confusion through misunderstanding, but it seems you're not open to the possibility that your understanding is flawed.

I'm offering to look through your code and point you at the source of your issue for free. Please don't throw that offer away without first considering that maybe I have a clue about what's going on here.

> Now please take a second look at the backtrace before toasting me lightly again.

... which showed the port being opened well after system initialisation of devices, including all serial ports - including disabling of their interrupt source at the IER, has been completed.

> Mmm'kay? Oh, and by the way -- is there an Alt-SysRq equivalent on an ARM serial console?

Yes, and it's the same for any serial console with functioning break support. You'll find it in Documentation/sysrq.txt, though it does misleadingly say "PC style standard serial ports only" whereas the reality is "where possible".

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:
Re: Serial related oops
On 2/19/07, Russell King [EMAIL PROTECTED] wrote:
> This can't happen because when __do_irq unmasks the interrupt source, the CPU mask is set, thereby preventing any further interrupt exceptions being taken. This is done precisely to prevent this situation happening.
>
> If you are seeing recursion for the same interrupt (two or more stack frames containing asm_do_IRQ for that very same IRQ) then your interrupt handling is buggy, plain and simple.

Imaginable. I'll look at the mask/unmask code. Thanks.

> I don't doubt that it is on the same IRQ line - I have such setups here and it works perfectly - multiple 8250 UARTs connected to a single level-triggered interrupt input which also happens to be shared with a SCSI host chip as well. Absolutely no problems.

Can you do me a favor? In the sys_open(/dev/console) path, turn on the right bits in that second uart's IER, then insert a sleep in request_irq or something (wherever seems best based on that backtrace), and feed enough characters into the second UART during that sleep to generate an IRQ. Do you not get the same soft lockup?

> I still say that your understanding is completely flawed. Moreover, you haven't read what I've said about the ordering of initialisation, the stress on when we disable interrupts for the ports, etc.

Well, all I can say is that that's a real backtrace and it shouldn't be hard to reproduce if it's anything other than a broken interrupt controller or broken code called by the __do_irq postamble. I don't see any platform-provided unmask routines in that backtrace, but maybe it got inlined; I'll go back and check.

> You're actually *not* helping. You're causing utter confusion through misunderstanding, but it seems you're not open to the possibility that your understanding is flawed.
Still open, though it's a pity you're more interested in my flawed understanding than in the possibility that the kernel could be systematically made more robust against hardware bugs and coding errors by the simple expedient of putting all the ISRs in before turning on any IRQ that might be shared. Or are you telling me that's already been done?

(Yes, I am aware that this interacts entertainingly with hot-plug PCI. Yes, I am aware that there is a limit to how much software can fix stupid hardware. But surely there is room for an emergency IRQ suppressor to let chip initialization code kick in and force the hardware to a known state.)

> I'm offering to look through your code and point you at the source of your issue for free. Please don't throw that offer away without first considering that maybe I have a clue about what's going on here.

I appreciate that offer, and I hope to take advantage of it as soon as I have the source code at my fingertips (not just the chat log where I recorded the backtrace).

> ... which showed the port being opened well after system initialisation of devices, including all serial ports - including disabling of their interrupt source at the IER, has been completed.

Now that you mention it, the backtrace I sent is the serial8250_startup one, not the serial8250_init one. Sorry, this one's probably an artifact of brain damage specific to this UART. I need to dig through a different account to find the init-path example; but in either case, we're getting a new interrupt during the __do_irq postamble. If you're telling me that that shouldn't happen, what should the backtrace for a soft lockup due to a stuck level-triggered IRQ look like on ARM?

> Yes, and it's the same for any serial console with functioning break support. You'll find it in Documentation/sysrq.txt, though it does misleadingly say "PC style standard serial ports only" whereas the reality is "where possible".
Thank you very much; this will help me get to the bottom of some other chip-support nastiness on this device.

Cheers,
- Michael
Re: Serial related oops
Michael K. Edwards wrote:
> Still open, though it's a pity you're more interested in my flawed understanding than in the possibility that the kernel could be systematically made more robust against hardware bugs and coding errors by the simple expedient of putting all the ISRs in before turning on any IRQ that might be shared. Or are you telling me that's already been done?
>
> (Yes, I am aware that this interacts entertainingly with hot-plug PCI. Yes, I am aware that there is a limit to how much software can fix stupid hardware. But surely there is room for an emergency IRQ suppressor to let chip initialization code kick in and force the hardware to a known state.)

How do you propose to do this? Drivers can get loaded and unloaded at any time. If you have a device generating spurious interrupts on a shared IRQ line, there's no way you can use any device on that line until that interrupt is shut off. Requiring all drivers to be loaded before any of them can use interrupts is simply not practical.

If a system has a device that generates interrupts before they're enabled, and the firmware doesn't fix it, then some platform-specific quirk has to handle it and shut off the interrupt before it allows any interrupts to be enabled. (We have such a quirk for certain network controllers where the boot ROM can leave the chip generating interrupts on bootup.)

--
Robert Hancock Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: Serial related oops
On 2/19/07, Robert Hancock [EMAIL PROTECTED] wrote:
> How do you propose to do this? Drivers can get loaded and unloaded at any time. If you have a device generating spurious interrupts on a shared IRQ line, there's no way you can use any device on that line until that interrupt is shut off. Requiring all drivers to be loaded before any of them can use interrupts is simply not practical.

Of course not. But dealing with a stuck IRQ line by locking up isn't very practical either. IRQ sharing is stupid yet universal, and it happens all the time that a device that has been sitting there minding its own business since power-up, with no driver to drive it, decides to assert its IRQ. Maybe it just got hot-plugged, maybe it just got its first dribble of input, whatever. Other devices on the shared IRQ are screwed (or at least semi-screwed; you could periodically re-enable the IRQ long enough to make a run through the ISR chain servicing the other devices). But if you run lspci (or whatever) and load a driver for the newly awake device, everything goes back to normal.

For devices compiled into the kernel, you shouldn't have to play these games. If, that is, there were three stages of driver initialization, called in successive passes:

1) installing an ISR with a fallback STFU path (device-specific but not dependent on any particular pre-existing chip state), quiescing it if you know how and registering for the IRQ if you know which it is;
2) going through the chip's soft-reset-wake-up-shut-up cycle and populating driver data structures, possibly correcting the IRQ registration along the way;
3) ready-as-we'll-ever-be, bring on the interrupts.

You probably can't help enabling the IRQ briefly during 2) so that you can do tests like Russell's loopback. But it's a needless gamble to do that without doing 1) for all compiled-in drivers and platform devices first, in a previous discovery pass.
And it's stupid to do 3) in the same pass as 2), because you'll just open race condition windows that will only bite when an all-the-way-live device raises its IRQ at a moment when the writer of the wake-up-shut-up code wasn't expecting it. All code has bugs and they're only a problem when they bite in the field.

> If a system has a device that generates interrupts before they're enabled, and the firmware doesn't fix it, then some platform-specific quirk has to handle it and shut off the interrupt before it allows any interrupts to be enabled. (We have such a quirk for certain network controllers where the boot ROM can leave the chip generating interrupts on bootup.)

You don't need quirks if your driver initialization is bomb-proof to begin with. Devices that are quiet on power-up are purely coincidental and should not be construed.

Cheers,
- Michael
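
The three-pass scheme proposed in this message can be sketched as follows. This is an illustration of the ordering only, under the assumption that each driver exposes the three stages as separate hooks; the `Driver` class, its method names and `init_in_passes` are hypothetical and do not correspond to any existing kernel interface.

```python
class Driver:
    """Hypothetical driver exposing the three stages as separate hooks."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def install_isr(self):
        # Stage 1: register an ISR that at least knows how to silence the device.
        self.log.append(("install_isr", self.name))

    def reset_and_setup(self):
        # Stage 2: soft-reset the chip and populate driver data structures.
        self.log.append(("reset_and_setup", self.name))

    def go_live(self):
        # Stage 3: enable interrupt generation at the device.
        self.log.append(("go_live", self.name))


def init_in_passes(drivers):
    """Run each stage for every driver before any driver starts the next one."""
    for stage in ("install_isr", "reset_and_setup", "go_live"):
        for drv in drivers:
            getattr(drv, stage)()


if __name__ == "__main__":
    log = []
    init_in_passes([Driver("uart0", log), Driver("uart1", log)])
    for stage, name in log:
        print(stage, name)
```

The point of the ordering is that every compiled-in device sharing a line already has an ISR in place before any of them is allowed to raise it. It does nothing for drivers loaded later as modules.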
Re: Serial related oops
Michael K. Edwards wrote:
> Of course not. But dealing with a stuck IRQ line by locking up isn't very practical either. IRQ sharing is stupid yet universal, and it

And we don't, that's why we have that "nobody cared" logic that disables the interrupt line if no driver services the interrupt. That doesn't provide a clean recovery, of course, it's meant to notify the user of what happened so that the problem can be fixed.

> happens all the time that a device that has been sitting there minding its own business since power-up, with no driver to drive it, decides to assert its IRQ. Maybe it just got hot-plugged, maybe it just got its first dribble of input, whatever. Other devices on the shared IRQ are screwed (or at least semi-screwed; you could periodically re-enable the IRQ long enough to make a run through the ISR chain servicing the other devices). But if you run lspci (or whatever) and load a driver for the newly awake device, everything goes back to normal.
>
> For devices compiled into the kernel, you shouldn't have to play these games. If, that is, there were three stages of driver initialization, called in successive passes:

Exactly, for devices compiled into the kernel. In most setups this is only a fraction of all devices, so solving this problem only for drivers built into the kernel is no solution.

> 1) installing an ISR with a fallback STFU path (device-specific but not dependent on any particular pre-existing chip state), quiescing it if you know how and registering for the IRQ if you know which it is;
> 2) going through the chip's soft-reset-wake-up-shut-up cycle and populating driver data structures, possibly correcting the IRQ registration along the way;
> 3) ready-as-we'll-ever-be, bring on the interrupts.
>
> You probably can't help enabling the IRQ briefly during 2) so that you can do tests like Russell's loopback. But it's a needless gamble to do that without doing 1) for all compiled-in drivers and platform devices first, in a previous discovery pass.
> And it's stupid to do 3) in the same pass as 2), because you'll just open race condition windows that will only bite when an all-the-way-live device raises its IRQ at a moment when the writer of the wake-up-shut-up code wasn't expecting it. All code has bugs and they're only a problem when they bite in the field.
>
> > If a system has a device that generates interrupts before they're enabled, and the firmware doesn't fix it, then some platform-specific quirk has to handle it and shut off the interrupt before it allows any interrupts to be enabled. (We have such a quirk for certain network controllers where the boot ROM can leave the chip generating interrupts on bootup.)
>
> You don't need quirks if your driver initialization is bomb-proof to begin with. Devices that are quiet on power-up are purely coincidental and should not be construed.

It's not coincidental, it is the only sane way to design hardware. You just can't go firing off interrupts without a driver having intentionally enabled them. There are a few devices that have had such issues, but they have been few and far between, certainly not enough to warrant the complexity of the scheme you propose.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
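
The "nobody cared" logic mentioned in this message boils down to counting how many deliveries on a line went unclaimed. The sketch below is a simplified model of that idea, not the kernel's implementation: the `IrqStats` class is hypothetical, and the window and threshold values are placeholders chosen for the example.

```python
class IrqStats:
    """Toy version of the 'nobody cared' bookkeeping for one IRQ line."""

    def __init__(self, window=1000, threshold=990):
        self.window = window        # deliveries per evaluation window (placeholder)
        self.threshold = threshold  # unhandled count that trips the check (placeholder)
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        """Record one delivery; disable the line if almost none were handled."""
        if self.disabled:
            return
        self.count += 1
        if not handled:
            self.unhandled += 1
        if self.count < self.window:
            return
        # End of a window: decide, then start counting afresh.
        if self.unhandled > self.threshold:
            self.disabled = True    # the kernel would also log a warning here
        self.count = 0
        self.unhandled = 0


if __name__ == "__main__":
    stuck = IrqStats()
    for _ in range(1000):
        stuck.note_interrupt(handled=False)  # no ISR ever claims the interrupt
    print(stuck.disabled)

    healthy = IrqStats()
    for i in range(5000):
        healthy.note_interrupt(handled=(i % 2 == 0))  # half are claimed
    print(healthy.disabled)
```

As the message says, disabling the line is a diagnostic rather than a recovery: every other device sharing that line loses its interrupt as well.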