Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Tue, 5 Jan 2010, Sean wrote: > Thanks so much for your help Alan. I also think this is definitely a bug > in the hardware. Let's leave it at that, I'm happy. Okay. You should mark the Bugzilla report as REJECTED and close it out. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: On Mon, 4 Jan 2010, Sean wrote: Alan Stern wrote: Um, when you say it does the job, what do you mean? It traps the error and prevents the kernel from crashing. As did some of the earlier patches, right? Yes, but this one is good because it works without a debug kernel. Still, at this point I'm not sure it's worthwhile to pursue this any farther. I'm convinced it's a hardware problem. Do you want to continue, or are you happy to switch computers and forget about it? Thanks so much for your help Alan. I also think this is definitely a bug in the hardware. Let's leave it at that, I'm happy. Sean -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Mon, 4 Jan 2010, Sean wrote: > Alan Stern wrote: > > Um, when you say it does the job, what do you mean? > It traps the error and prevents the kernel from crashing. As did some of the earlier patches, right? > > The job it was _intended_ to do was to prove that your problems are > > caused by hardware errors rather than software bugs. If the patch > > causes the problems to stop, without printing any error messages in the > > log, then it does indeed prove this. After all, the only places the > > patch changes any persistent values are after it prints an error > > message. > > > It did print out error messages: > .ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 > c677b900 > ...ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 > c677b900 > .ohci_hcd :00:0b.0: Circular hash: 32 c669f800 c677b800 > c677b800 Ooh, that's odd. The "Circular hash" message occurs in the same spot as the "Circular pointer #2a" message in the previous patch -- and that message never got printed! > > I noticed that your CPU is a Cyrix. Perhaps it is the culprit. Have > > you tried running the program on a different computer? > > > Yes, on other computers I don't get this error. Same os image. Though I > haven't found a computer with an ohci controller yet. So that's not a real test, unfortunately. Still, at this point I'm not sure it's worthwhile to pursue this any farther. I'm convinced it's a hardware problem. Do you want to continue, or are you happy to switch computers and forget about it? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: Um, when you say it does the job, what do you mean? It traps the error and prevents the kernel from crashing. The job it was _intended_ to do was to prove that your problems are caused by hardware errors rather than software bugs. If the patch causes the problems to stop, without printing any error messages in the log, then it does indeed prove this. After all, the only places the patch changes any persistent values are after it prints an error message. It did print out error messages: usb 4-2: new full speed USB device using ohci_hcd and address 2 usb 4-2: New USB device found, idVendor=093a, idProduct=2460 usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 usb 4-2: Product: CIF Single Chip usb 4-2: Manufacturer: Pixart Imaging Inc. usb 4-2: configuration #1 chosen from 1 choice [r...@x-linux]:~ # modprobe gspca_pac207 Linux video capture interface: v2.00 gspca: main v2.8.0 registered gspca: probing 093a:2460 pac207: Pixart Sensor ID 0x27 Chips ID 0x09 pac207: Pixart PAC207BCA Image Processor and Control Chip detected (vid/pid 0x093A:0x2460) gspca: video0 created usbcore: registered new interface driver pac207 pac207: registered [r...@x-linux]:~ # capture-example .. capture-example used greatest stack depth: 5848 bytes left [r...@x-linux]:~ # capture-example .ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ...ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 .ohci_hcd :00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ..ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ohci_hcd :00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd :00:0b.0: Circular hash: 32 c669f8
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Mon, 4 Jan 2010, Sean wrote: > Alan, > This last patch seems to do the job. Thanks so much for your help! Where > do I donate/send beer? Um, when you say it does the job, what do you mean? The job it was _intended_ to do was to prove that your problems are caused by hardware errors rather than software bugs. If the patch causes the problems to stop, without printing any error messages in the log, then it does indeed prove this. After all, the only places the patch changes any persistent values are after it prints an error message. (Admittedly, I didn't expect the problem to stop; I expected to get a bunch of messages from the second ohci_err(). Just out of curiosity, does it make any difference if you remove all those "volatile"s in the declaration line for td1 and td2?) I noticed that your CPU is a Cyrix. Perhaps it is the culprit. Have you tried running the program on a different computer? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: ... All right. Let's try this patch in place of all the others, then. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-q.c === --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -505,6 +505,7 @@ td_fill (struct ohci_hcd *ohci, u32 info struct urb_priv *urb_priv = urb->hcpriv; int is_iso = info & TD_ISO; int hash; + volatile struct td * volatile td1, * volatile td2; // ASSERT (index < urb_priv->length); @@ -558,11 +559,30 @@ td_fill (struct ohci_hcd *ohci, u32 info /* hash it for later reverse mapping */ hash = TD_HASH_FUNC (td->td_dma); + + td1 = ohci->td_hash[hash]; + td2 = NULL; + if (td1) { + td2 = td1->td_hash; + if (td2 == td1 || td2 == td) { + ohci_err(ohci, "Circular hash: %d %p %p %p\n", + hash, td1, td2, td); + td2 = td1->td_hash = NULL; + } + } + td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; /* HC might read the TD (or cachelines) right away ... */ wmb (); + + if (td1 && td1->td_hash != td2) { + ohci_err(ohci, "Hash value changed: %d %p %p %p\n", + hash, td1, td2, td); + td1->td_hash = (struct td *) td2; + } + td->ed->hwTailP = td->hwNextTD; } Alan, This last patch seems to do the job. Thanks so much for your help! Where do I donate/send beer? Sean -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Mon, 4 Jan 2010, Sean wrote: > Alan Stern wrote: > > Try inserting a line saying: > > > > td_check(ohci, hash, "#2c"); > > > > two lines above the #2b line, i.e., just after the wmb(). That'll help > > narrow down the search for the bug. > Alan, > > I put the extra line in and ran capture-example twice. This is what I got: > > ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 > ohci_hcd :00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ... All right. Let's try this patch in place of all the others, then. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-q.c === --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -505,6 +505,7 @@ td_fill (struct ohci_hcd *ohci, u32 info struct urb_priv *urb_priv = urb->hcpriv; int is_iso = info & TD_ISO; int hash; + volatile struct td * volatile td1, * volatile td2; // ASSERT (index < urb_priv->length); @@ -558,11 +559,30 @@ td_fill (struct ohci_hcd *ohci, u32 info /* hash it for later reverse mapping */ hash = TD_HASH_FUNC (td->td_dma); + + td1 = ohci->td_hash[hash]; + td2 = NULL; + if (td1) { + td2 = td1->td_hash; + if (td2 == td1 || td2 == td) { + ohci_err(ohci, "Circular hash: %d %p %p %p\n", + hash, td1, td2, td); + td2 = td1->td_hash = NULL; + } + } + td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; /* HC might read the TD (or cachelines) right away ... */ wmb (); + + if (td1 && td1->td_hash != td2) { + ohci_err(ohci, "Hash value changed: %d %p %p %p\n", + hash, td1, td2, td); + td1->td_hash = (struct td *) td2; + } + td->ed->hwTailP = td->hwNextTD; } -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: Try inserting a line saying: td_check(ohci, hash, "#2c"); two lines above the #2b line, i.e., just after the wmb(). That'll help narrow down the search for the bug. Alan, I put the extra line in and ran capture-example twice. This is what I got: ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd :00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 Sean -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Sun, 3 Jan 2010, Sean wrote: > Alan, > > I applied the patches and ran capture-example twice. On the second run > of capture-example a circular pointer popped up. I did not need to > remove the camera. Attached are the serial console capture as well as > the dmesg log in debug4.tar.gz. Did you want me to try to reproduce the > poison message? No. Among the things that patch did was to fix up the errors that caused the invalid pointers. Hence there should not have been any "poisoned hash" messages -- and indeed there weren't. The interesting part of the log is the error messages: ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd :00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 There are two different hash chains here (32 and 1), but in each case see the message is #2b, never #2a. This means the problem occurs between the places where the #2a and #2b messages were inserted, i.e., in td_fill(). The hash chain contained a single TD and was fine to begin with; then another TD was added at the start of the chain and the pointer in the earlier TD (now at the second position in the chain) got messed up. For example, the error message in the first line above implies that originally the 32nd hash chain contained only the TD at c6542800 with its td_hash member set to NULL. But then c6774800 was added to the start of the chain, after which c6542800's td_hash pointed to c6774800. Try inserting a line saying: td_check(ohci, hash, "#2c"); two lines above the #2b line, i.e., just after the wmb(). That'll help narrow down the search for the bug. And by the way, you don't need to post your entire dmesg log. Just the portion containing the new debugging messages will be enough. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: Here is a completely new patch. This one is more directed, to catch the sort of errors we now know to look for. Alan Stern Alan, I applied the patches and ran capture-example twice. On the second run of capture-example a circular pointer popped up. I did not need to remove the camera. Attached are the serial console capture as well as the dmesg log in debug4.tar.gz. Did you want me to try to reproduce the poison message? Sean debug4.tar.gz Description: Unix tar archive
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Sat, 2 Jan 2010, Sean wrote: > Hmm, I applied the changes and I did not see a place where *prev differs > from td->td_hash. I have run memtest86+ on this box and it has passed 16 > times, so I do not suspect a hardware memory error. What do you think? > Attached is the latest dmesg output. I don't know. The same pattern as before appears here: $ egrep -n '[1b]e(40|5c)' dmesg3.log 167:kobject: 'fs' (c7801e40): kobject_add_internal: parent: '', set: '' 4990:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6691e40 5023:ohci_hcd :00:0b.0: hash c6691e40 to 57 -> (null) 5181:ohci_hcd :00:0b.0: td alloc for 2 ep85: c676be40 5214:ohci_hcd :00:0b.0: hash c676be40 to 57 -> c6691e40 5271:ohci_hcd :00:0b.0: td free c6691e40 5272:ohci_hcd :00:0b.0: (57 1) c676be5c -> (null) [(null)] 5294:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6691e40 5327:ohci_hcd :00:0b.0: hash c6691e40 to 57 -> c676be40 5533:ohci_hcd :00:0b.0: td free c676be40 5534:ohci_hcd :00:0b.0: (57 1) c6691e5c -> (null) [(null)] 5556:ohci_hcd :00:0b.0: td alloc for 2 ep85: c676be40 5589:ohci_hcd :00:0b.0: hash c676be40 to 57 -> c6691e40 5640:ohci_hcd :00:0b.0: td free c6691e40 5641:ohci_hcd :00:0b.0: (57 1) c676be5c -> c676be40 [c676be40] 5713:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6691e40 5746:ohci_hcd :00:0b.0: hash c6691e40 to 57 -> c676be40 5899:ohci_hcd :00:0b.0: td free c676be40 5900:ohci_hcd :00:0b.0: (57 1) c6691e5c -> c676be40 [c676be40] At line 5641 we see that the td_hash pointer in c676be40 gets corrupted. It is copied from the pointer in c6691e40, which was set to NULL in line 5534, but now it points to c676be40. The question is whether this corruption was caused by a hardware fault or a software bug. We have added debugging printouts to the only two places where the driver assigns anything to td->td_hash, and they don't show anything wrong. This leads me to suspect the hardware, but of course this is still just a guess. Here is a completely new patch. This one is more directed, to catch the sort of errors we now know to look for. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-mem.c === --- usb-2.6.orig/drivers/usb/host/ohci-mem.c +++ usb-2.6/drivers/usb/host/ohci-mem.c @@ -98,17 +98,56 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem return td; } +static void td_check(struct ohci_hcd *hc, int hash, char *msg) +{ + struct td *td, *first; + + first = hc->td_hash[hash]; + for (td = first; td; td = td->td_hash) { + if (td->td_hash == first || td->td_hash == td) { + ohci_err(hc, "Circular pointer %s: %d %p %p %p\n", + msg, hash, first, td, td->td_hash); + td->td_hash = NULL; + return; + } + } +} + +static void td_check_all(struct ohci_hcd *hc, char *msg) +{ + int hash; + + for (hash = 0; hash < TD_HASH_SIZE; ++hash) + td_check(hc, hash, msg); +} + static void td_free (struct ohci_hcd *hc, struct td *td) { - struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; + int hash = TD_HASH_FUNC(td->td_dma); + struct td **prev = &hc->td_hash[hash]; - while (*prev && *prev != td) + td_check(hc, hash, "#1a"); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_err(hc, "poisoned hash at %p (%d) %p %p\n", prev, + hash, td, hc->td_hash[hash]); + return; + } prev = &(*prev)->td_hash; - if (*prev) + } + if (*prev) { *prev = td->td_hash; + if (*prev == td) { + ohci_err(hc, "invalid hash at %p (%d) %p %p\n", prev, + hash, td, hc->td_hash[hash]); + *prev = NULL; + } + } else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) ohci_dbg (hc, "no hash for td %p\n", td); + mb(); + td_check(hc, hash, "#1b"); dma_pool_free (hc->td_cache, td, td->td_dma); } Index: usb-2.6/drivers/usb/host/ohci-q.c === --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -558,12 +558,14 @@ td_fill (struct ohci_hcd *ohci, u32 info /* hash it for later reverse mapping */ hash = TD_HASH_FUNC (td->td_dma); + td_check(ohci, hash, "#2a"); td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; /* HC might read the TD (or cachelines) right away ... */ wmb (); td->ed->hwTailP = td->hwNextTD; + td_check(ohci, hash, "#2b"); } /*---
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: In other words, I'm guessing that you're suffering from hardware memory errors. A possible way to test this is to modify the patch. In td_free() where it adds the line: + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev); instead add this code: + barrier(); + ohci_dbg(hc, "(%d %d) %p -> %p [%p]\n", hash, n, + prev, *prev, td->td_hash); If we find that the value of *prev differs from the value of td->td_hash then we'll know for certain. (Or maybe the presence of the barrier() will cause the object code to change in a way that prevents the error from occurring.) Alan Stern Hmm, I applied the changes and I did not see a place where *prev differs from td->td_hash. I have run memtest86+ on this box and it has passed 16 times, so I do not suspect a hardware memory error. What do you think? Attached is the latest dmesg output. Sean dmesg3.log.tar.gz Description: Unix tar archive
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Sat, 2 Jan 2010, Sean wrote: > Alan, > > Thanks again. I was able to get the full dmesg output this time. I ran > capture-example three times and each time removing the webcam before > capture-example finished. On the third time I got the poisoned hash > message and I captured the output to a file. Attached is the dmesg output. Okay. Take a look at the following output: $ egrep -n '[2e]e(80|9c)' dmesg2.log 680:pci :00:0c.0: reg 14 io port: [0xee80-0xee83] 727:kobject: 'ieee80211' (c79d5e1c): kobject_add_internal: parent: 'class', set: 'class' 728:kobject: 'ieee80211' (c79d5e1c): kobject_uevent_env 729:kobject: 'ieee80211' (c79d5e1c): fill_kobj_path: path = '/class/ieee80211' 4994:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6662e80 5027:ohci_hcd :00:0b.0: hash c6662e80 to 58 -> (null) 5185:ohci_hcd :00:0b.0: td alloc for 2 ep85: c676ee80 5218:ohci_hcd :00:0b.0: hash c676ee80 to 58 -> c6662e80 5276:ohci_hcd :00:0b.0: td free c6662e80 5277:ohci_hcd :00:0b.0: (58 1) c676ee9c -> (null) 5296:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6662e80 5329:ohci_hcd :00:0b.0: hash c6662e80 to 58 -> c676ee80 5538:ohci_hcd :00:0b.0: td free c676ee80 5539:ohci_hcd :00:0b.0: (58 1) c6662e9c -> (null) 5558:ohci_hcd :00:0b.0: td alloc for 2 ep85: c676ee80 5591:ohci_hcd :00:0b.0: hash c676ee80 to 58 -> c6662e80 5644:ohci_hcd :00:0b.0: td free c6662e80 5645:ohci_hcd :00:0b.0: (58 1) c676ee9c -> (null) 5720:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6662e80 5753:ohci_hcd :00:0b.0: hash c6662e80 to 58 -> c676ee80 5900:ohci_hcd :00:0b.0: td free c676ee80 5901:ohci_hcd :00:0b.0: (58 1) c6662e9c -> (null) 5978:ohci_hcd :00:0b.0: td alloc for 2 ep85: c676ee80 6011:ohci_hcd :00:0b.0: hash c676ee80 to 58 -> c6662e80 6072:ohci_hcd :00:0b.0: td free c6662e80 6073:ohci_hcd :00:0b.0: (58 1) c676ee9c -> (null) 6088:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6662e80 6121:ohci_hcd :00:0b.0: hash c6662e80 to 58 -> c676ee80 6324:ohci_hcd :00:0b.0: td free c676ee80 6325:ohci_hcd :00:0b.0: (58 1) c6662e9c -> (null) 6343:ohci_hcd :00:0b.0: td alloc for 2 ep85: c676ee80 6376:ohci_hcd :00:0b.0: hash c676ee80 to 58 -> c6662e80 6416:ohci_hcd :00:0b.0: td free c6662e80 6417:ohci_hcd :00:0b.0: (58 1) c676ee9c -> c676ee80 6492:ohci_hcd :00:0b.0: td alloc for 2 ep85: c6662e80 6525:ohci_hcd :00:0b.0: hash c6662e80 to 58 -> c676ee80 6686:ohci_hcd :00:0b.0: td free c676ee80 6687:ohci_hcd :00:0b.0: (58 1) c6662e9c -> c676ee80 Ignore the first few lines as being irrelevant. Starting with line 5185 you can see that this shows two TDs being allocated, hashed, freed, and unlinked over and over again, at addresses c6662e80 and c676ee80. When each one is hashed into the list, its td_hash member is made to point to the other. When each is removed from the hash list, the other's td_hash member is set to NULL. It's all very regular and repetitious until line 6417. At that line, the td_hash member of c676ee80 (which is at offset 1c from the start of the structure, hence at c676ee9c) is made to point to its own structure! Thus later at line 6687, when c676ee80 is freed, the td_hash member of c6662e80 is set to point at the freed structure. This is what leads to poisoned pointer values later on. So what went wrong at line 6417? There's no way to know exactly. My guess is that the write at line 6325, where c6662e9c was supposed to be set to NULL, never got recorded properly in the computer's memory. This would mean that c6662e9c still contained the c676ee80 value assigned at line 6121, and hence the incorrect value was copied at line 6417. In other words, I'm guessing that you're suffering from hardware memory errors. A possible way to test this is to modify the patch. In td_free() where it adds the line: + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev); instead add this code: + barrier(); + ohci_dbg(hc, "(%d %d) %p -> %p [%p]\n", hash, n, + prev, *prev, td->td_hash); If we find that the value of *prev differs from the value of td->td_hash then we'll know for certain. (Or maybe the presence of the barrier() will cause the object code to change in a way that prevents the error from occurring.) Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: Got it. It's not good enough; the initial error occurred before the start of your extract. Here's yet another version of the patch; this one will stop printing the debug messages when an error is first detected so maybe it won't overrun your log buffer. Alan Stern Alan, Thanks again. I was able to get the full dmesg output this time. I ran capture-example three times and each time removing the webcam before capture-example finished. On the third time I got the poisoned hash message and I captured the output to a file. Attached is the dmesg output. Sean dmesg2.log.tar.gz Description: Unix tar archive
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Tue, 29 Dec 2009, Sean wrote: > Alan, > > Thanks for the debug patch. I'll send you the dmesg.log output in > another email. It is 2MB. Got it. It's not good enough; the initial error occurred before the start of your extract. Here's yet another version of the patch; this one will stop printing the debug messages when an error is first detected so maybe it won't overrun your log buffer. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-hcd.c === --- usb-2.6.orig/drivers/usb/host/ohci-hcd.c +++ usb-2.6/drivers/usb/host/ohci-hcd.c @@ -201,7 +201,7 @@ static int ohci_urb_enqueue ( /* allocate the TDs (deferring hash chain updates) */ for (i = 0; i < size; i++) { - urb_priv->td [i] = td_alloc (ohci, mem_flags); + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep); if (!urb_priv->td [i]) { urb_priv->length = i; urb_free_priv (ohci, urb_priv); @@ -580,6 +580,7 @@ static int ohci_run (struct ohci_hcd *oh struct usb_hcd *hcd = ohci_to_hcd(ohci); disable (ohci); + ohci->testing = 1; /* boot firmware should have set this up (5.1.1.3.1) */ if (first) { Index: usb-2.6/drivers/usb/host/ohci-mem.c === --- usb-2.6.orig/drivers/usb/host/ohci-mem.c +++ usb-2.6/drivers/usb/host/ohci-mem.c @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr /* TDs ... */ static struct td * -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, + struct usb_host_endpoint *ep) { dma_addr_t dma; struct td *td; @@ -94,6 +95,9 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem td->hwNextTD = cpu_to_hc32 (hc, dma); td->td_dma = dma; /* hashed in td_fill */ + if (hc->testing) + ohci_dbg(hc, "td alloc for %s ep%x: %p\n", + udev->devpath, ep->desc.bEndpointAddress, td); } return td; } @@ -101,14 +105,32 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem static void td_free (struct ohci_hcd *hc, struct td *td) { - struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - - while (*prev && *prev != td) + int hash = TD_HASH_FUNC(td->td_dma); + struct td **prev = &hc->td_hash[hash]; + int n = 0; + + if (hc->testing) + ohci_dbg(hc, "td free %p\n", td); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_dbg(hc, "poisoned hash at %p (%d %d) %p\n", prev, + hash, n, hc->td_hash[hash]); + return; + } prev = &(*prev)->td_hash; - if (*prev) + ++n; + } + if (*prev) { *prev = td->td_hash; - else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) - ohci_dbg (hc, "no hash for td %p\n", td); + if (hc->testing) { + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev); + if (td->td_hash == td) + hc->testing = 0; + } + } else { + ohci_dbg(hc, "no hash for td %p: %d %p\n", td, + hash, hc->td_hash[hash]); + } dma_pool_free (hc->td_cache, td, td->td_dma); } Index: usb-2.6/drivers/usb/host/ohci-q.c === --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -406,7 +406,7 @@ static struct ed *ed_get ( } /* dummy td; end of td list for ed */ - td = td_alloc (ohci, GFP_ATOMIC); + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); if (!td) { /* out of memory */ ed_free (ohci, ed); @@ -560,6 +560,11 @@ td_fill (struct ohci_hcd *ohci, u32 info hash = TD_HASH_FUNC (td->td_dma); td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; + if (ohci->testing) { + ohci_dbg(ohci, "hash %p to %d -> %p\n", td, hash, td->td_hash); + if (td->td_hash == td) + ohci->testing = 0; + } /* HC might read the TD (or cachelines) right away ... */ wmb (); Index: usb-2.6/drivers/usb/host/ohci.h === --- usb-2.6.orig/drivers/usb/host/ohci.h +++ usb-2.6/drivers/usb/host/ohci.h @@ -346,6 +346,7 @@ typedef struct urb_priv { struct ohci_hcd { spinlock_t lock; + int
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: You did not get the full output. That's why I tell people to use dmesg instead of klogd or the kernel log files. If necessary you can always increase the size of the dmesg log buffer by changing CONFIG_LOG_BUF_SHIFT. I tend to set it to 18; you might want to go even higher. That's because you didn't get the full output. Here's an enhanced version of the patch. It will provide more information and perhaps a smoking gun. Alan Stern Alan, Thanks for the debug patch. I'll send you the dmesg.log output in another email. It is 2MB. Sean -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Tue, 29 Dec 2009, Sean wrote: > > I changed the debug statements to ohci_dbg and I was able to capture > > the full output with klogd. It is attached. You did not get the full output. That's why I tell people to use dmesg instead of klogd or the kernel log files. If necessary you can always increase the size of the dmesg log buffer by changing CONFIG_LOG_BUF_SHIFT. I tend to set it to 18; you might want to go even higher. > In looking at the log file it seems that there is a mismatch of td_alloc > calls and td_free calls. That's because you didn't get the full output. Here's an enhanced version of the patch. It will provide more information and perhaps a smoking gun. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-hcd.c === --- usb-2.6.orig/drivers/usb/host/ohci-hcd.c +++ usb-2.6/drivers/usb/host/ohci-hcd.c @@ -201,7 +201,7 @@ static int ohci_urb_enqueue ( /* allocate the TDs (deferring hash chain updates) */ for (i = 0; i < size; i++) { - urb_priv->td [i] = td_alloc (ohci, mem_flags); + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep); if (!urb_priv->td [i]) { urb_priv->length = i; urb_free_priv (ohci, urb_priv); Index: usb-2.6/drivers/usb/host/ohci-mem.c === --- usb-2.6.orig/drivers/usb/host/ohci-mem.c +++ usb-2.6/drivers/usb/host/ohci-mem.c @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr /* TDs ... */ static struct td * -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, + struct usb_host_endpoint *ep) { dma_addr_t dma; struct td *td; @@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem td->hwNextTD = cpu_to_hc32 (hc, dma); td->td_dma = dma; /* hashed in td_fill */ + ohci_dbg(hc, "td alloc for %s ep%x: %p\n", + udev->devpath, ep->desc.bEndpointAddress, td); } return td; } @@ -101,14 +104,27 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem static void td_free (struct ohci_hcd *hc, struct td *td) { - struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - - while (*prev && *prev != td) + int hash = TD_HASH_FUNC(td->td_dma); + struct td **prev = &hc->td_hash[hash]; + int n = 0; + + ohci_dbg(hc, "td free %p\n", td); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_dbg(hc, "poisoned hash at %p (%d %d) %p\n", prev, + hash, n, hc->td_hash[hash]); + return; + } prev = &(*prev)->td_hash; - if (*prev) + ++n; + } + if (*prev) { *prev = td->td_hash; - else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) - ohci_dbg (hc, "no hash for td %p\n", td); + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev); + } else { + ohci_dbg(hc, "no hash for td %p: %d %p\n", td, + hash, hc->td_hash[hash]); + } dma_pool_free (hc->td_cache, td, td->td_dma); } Index: usb-2.6/drivers/usb/host/ohci-q.c === --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -406,7 +406,7 @@ static struct ed *ed_get ( } /* dummy td; end of td list for ed */ - td = td_alloc (ohci, GFP_ATOMIC); + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); if (!td) { /* out of memory */ ed_free (ohci, ed); @@ -560,6 +560,7 @@ td_fill (struct ohci_hcd *ohci, u32 info hash = TD_HASH_FUNC (td->td_dma); td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; + ohci_dbg(ohci, "hash %p to %d -> %p\n", td, hash, td->td_hash); /* HC might read the TD (or cachelines) right away ... */ wmb (); -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Alan Stern wrote: The patch doesn't fix anything. The point was to gather enough information to figure out what's going wrong. Without the debug messages, there's no information. Perhaps things will slow down less if you change the new ohci_info() calls in the patch to ohci_dbg(). Or perhaps you can increase the timeout values in capture-example.c. You should also apply this patch (be sure to enable CONFIG_USB_DEBUG): http://marc.info/?l=linux-usb&m=126056642931083&w=2 It probably won't make any difference, but including it anyway is worthwhile. Alan Stern The early return in td_free that is in the patch will trap the error. I changed the debug statements to ohci_dbg and I was able to capture the full output with klogd. It is attached. Sean Sean wrote: In looking at the log file it seems that there is a mismatch of td_alloc calls and td_free calls. Sean -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Wed, 16 Dec 2009, Sean wrote: > Thanks, that patch definitely traps the bug. Unfortunately there are so > many debug messages that the capture-example.c times out trying to > connect to the webcam. The debug messages slow down the process enough > to make that happen. But if I modify your patch and take out the extra > debug messages, it works well. The modified patch is below. The patch doesn't fix anything. The point was to gather enough information to figure out what's going wrong. Without the debug messages, there's no information. Perhaps things will slow down less if you change the new ohci_info() calls in the patch to ohci_dbg(). Or perhaps you can increase the timeout values in capture-example.c. You should also apply this patch (be sure to enable CONFIG_USB_DEBUG): http://marc.info/?l=linux-usb&m=126056642931083&w=2 It probably won't make any difference, but including it anyway is worthwhile. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Thanks, that patch definitely traps the bug. Unfortunately there are so many debug messages that the capture-example.c times out trying to connect to the webcam. The debug messages slow down the process enough to make that happen. But if I modify your patch and take out the extra debug messages, it works well. The modified patch is below. Reproducing the bug in four separate instances I got: ohci_hcd :00:0a.0: poisoned hash at c67a285c ohci_hcd :00:0a.0: poisoned hash at c67b875c ohci_hcd :00:0a.0: poisoned hash at c67a179c ohci_hcd :00:0a.0: poisoned hash at c679c79c Sean Lazar --- ohci-mem.c.orig2009-12-16 22:57:49.0 + +++ ohci-mem.c2009-12-16 22:49:37.0 + @@ -103,8 +103,13 @@ { struct td**prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; -while (*prev && *prev != td) +while (*prev && *prev != td) { +if ((unsigned long) *prev == 0xa7a7a7a7) { +ohci_info(hc, "poisoned hash at %p\n", prev); +return; +} prev = &(*prev)->td_hash; +} if (*prev) *prev = td->td_hash; else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) Alan Stern wrote: On Wed, 2 Dec 2009, Sean wrote: Is there anything I can do to help? This is a show stopping bug for me. Here's a patch you can try. It will add a _lot_ of debugging information to the system log. Maybe it will help pin down the source of the problem. Alan Stern Index: 2.6.31/drivers/usb/host/ohci-hcd.c === --- 2.6.31.orig/drivers/usb/host/ohci-hcd.c +++ 2.6.31/drivers/usb/host/ohci-hcd.c @@ -197,7 +197,7 @@ static int ohci_urb_enqueue ( /* allocate the TDs (deferring hash chain updates) */ for (i = 0; i < size; i++) { - urb_priv->td [i] = td_alloc (ohci, mem_flags); + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep); if (!urb_priv->td [i]) { urb_priv->length = i; urb_free_priv (ohci, urb_priv); Index: 2.6.31/drivers/usb/host/ohci-mem.c === --- 2.6.31.orig/drivers/usb/host/ohci-mem.c +++ 2.6.31/drivers/usb/host/ohci-mem.c @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr /* TDs ... */ static struct td * -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, + struct usb_host_endpoint *ep) { dma_addr_t dma; struct td *td; @@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem td->hwNextTD = cpu_to_hc32 (hc, dma); td->td_dma = dma; /* hashed in td_fill */ + ohci_info(hc, "td alloc for %s ep%x: %p\n", + udev->devpath, ep->desc.bEndpointAddress, td); } return td; } @@ -103,8 +106,14 @@ td_free (struct ohci_hcd *hc, struct td { struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - while (*prev && *prev != td) + ohci_info(hc, "td free %p\n", td); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_info(hc, "poisoned hash at %p\n", prev); + return; + } prev = &(*prev)->td_hash; + } if (*prev) *prev = td->td_hash; else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) Index: 2.6.31/drivers/usb/host/ohci-q.c === --- 2.6.31.orig/drivers/usb/host/ohci-q.c +++ 2.6.31/drivers/usb/host/ohci-q.c @@ -403,7 +403,7 @@ static struct ed *ed_get ( } /* dummy td; end of td list for ed */ - td = td_alloc (ohci, GFP_ATOMIC); + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); if (!td) { /* out of memory */ ed_free (ohci, ed); -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Wed, 2 Dec 2009, Sean wrote: > Is there anything I can do to help? This is a show stopping bug for me. Here's a patch you can try. It will add a _lot_ of debugging information to the system log. Maybe it will help pin down the source of the problem. Alan Stern Index: 2.6.31/drivers/usb/host/ohci-hcd.c === --- 2.6.31.orig/drivers/usb/host/ohci-hcd.c +++ 2.6.31/drivers/usb/host/ohci-hcd.c @@ -197,7 +197,7 @@ static int ohci_urb_enqueue ( /* allocate the TDs (deferring hash chain updates) */ for (i = 0; i < size; i++) { - urb_priv->td [i] = td_alloc (ohci, mem_flags); + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep); if (!urb_priv->td [i]) { urb_priv->length = i; urb_free_priv (ohci, urb_priv); Index: 2.6.31/drivers/usb/host/ohci-mem.c === --- 2.6.31.orig/drivers/usb/host/ohci-mem.c +++ 2.6.31/drivers/usb/host/ohci-mem.c @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr /* TDs ... */ static struct td * -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, + struct usb_host_endpoint *ep) { dma_addr_t dma; struct td *td; @@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem td->hwNextTD = cpu_to_hc32 (hc, dma); td->td_dma = dma; /* hashed in td_fill */ + ohci_info(hc, "td alloc for %s ep%x: %p\n", + udev->devpath, ep->desc.bEndpointAddress, td); } return td; } @@ -103,8 +106,14 @@ td_free (struct ohci_hcd *hc, struct td { struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - while (*prev && *prev != td) + ohci_info(hc, "td free %p\n", td); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_info(hc, "poisoned hash at %p\n", prev); + return; + } prev = &(*prev)->td_hash; + } if (*prev) *prev = td->td_hash; else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) Index: 2.6.31/drivers/usb/host/ohci-q.c === --- 2.6.31.orig/drivers/usb/host/ohci-q.c +++ 2.6.31/drivers/usb/host/ohci-q.c @@ -403,7 +403,7 @@ static struct ed *ed_get ( } /* dummy td; end of td list for ed */ - td = td_alloc (ohci, GFP_ATOMIC); + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); if (!td) { /* out of memory */ ed_free (ohci, ed); -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Is there anything I can do to help? This is a show stopping bug for me. Thanks, Sean Lazar Alan Stern wrote: On Wed, 11 Nov 2009, Andrew Morton wrote: http://bugzilla.kernel.org/show_bug.cgi?id=14564 Summary: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c We oopsed in td_free() (see below). But as part of that oops processing the kernel entered do_page_fault() and emitted a might_sleep() warning because we took a pagefault with local interrupts disabled. This is undesirable behaviour from the low-level x86 fault code and I don't think it normally happens. Did we break something in x86 land, or is this oops sufficiently weird and whacky to bypass existing checks for this false positive? No, what happened was a structure containing a linked-list entry got freed while it was still on the list. Then when the driver walked through the list, it attempted to dereference a list pointer that had been poisoned. More or less by coincidence, the poison value represented a paged-out address rather than an invalid address, so a page fault occurred. That's what caused the oops. BUG: unable to handle kernel paging request at a7a7a7c3 IP: [] td_free+0x23/0x75 [] ? tty_ldisc_deref+0x8/0xa [] ? tty_write+0x1b1/0x1c2 [] ? n_tty_write+0x0/0x2e6 [] ? tty_write+0x0/0x1c2 [] ? vfs_write+0xe3/0xfa [] ? restore_all_notrace+0x0/0x18 [] ? sys_ioctl+0x2c/0x45 [] ? syscall_call+0x7/0xb Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 c1 ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 0b 39 EIP: [] td_free+0x23/0x75 SS:ESP 0068:c6785cb8 CR2: a7a7a7c3 And here's the real oops. drivers/usb/host/ohci-mem.c:td_free() dereferenced a7a7a7c3. Which looks like /** drivers/base/dmapool.c **/ #define POOL_POISON_FREED 0xa7/* !inuse */ #define POOL_POISON_ALLOCATED 0xa9/* !initted */ If I'm reading this correctly, the bad dereference occurred in the second source line: prev = &(*prev)->td_hash; if (*prev) The original value in *prev was 0xa7a7a7a7 and the offset of td_hash is 0x1c, causing the offending address to be 0xa7a7a7c3. I have no idea why a struct td would have been freed while it was still in use. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
On Wed, 11 Nov 2009, Andrew Morton wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=14564 > > > >Summary: capture-example sleeping function called from invalid > > context at arch/x86/mm/fault.c > We oopsed in td_free() (see below). But as part of that oops > processing the kernel entered do_page_fault() and emitted a > might_sleep() warning because we took a pagefault with local interrupts > disabled. > > This is undesirable behaviour from the low-level x86 fault code and I > don't think it normally happens. > > Did we break something in x86 land, or is this oops sufficiently weird > and whacky to bypass existing checks for this false positive? No, what happened was a structure containing a linked-list entry got freed while it was still on the list. Then when the driver walked through the list, it attempted to dereference a list pointer that had been poisoned. More or less by coincidence, the poison value represented a paged-out address rather than an invalid address, so a page fault occurred. That's what caused the oops. > > BUG: unable to handle kernel paging request at a7a7a7c3 > > > > IP: [] td_free+0x23/0x75 > > > > [] ? tty_ldisc_deref+0x8/0xa > > > > [] ? tty_write+0x1b1/0x1c2 > > > > [] ? n_tty_write+0x0/0x2e6 > > > > [] ? tty_write+0x0/0x1c2 > > > > [] ? vfs_write+0xe3/0xfa > > > > [] ? restore_all_notrace+0x0/0x18 > > > > [] ? sys_ioctl+0x2c/0x45 > > > > [] ? syscall_call+0x7/0xb > > > > Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 > > c1 > > ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 > > 0b 39 > > EIP: [] td_free+0x23/0x75 SS:ESP 0068:c6785cb8 > > > > CR2: a7a7a7c3 > > > > And here's the real oops. drivers/usb/host/ohci-mem.c:td_free() > dereferenced a7a7a7c3. Which looks like > > /** drivers/base/dmapool.c **/ > #define POOL_POISON_FREED 0xa7/* !inuse */ > #define POOL_POISON_ALLOCATED 0xa9/* !initted */ If I'm reading this correctly, the bad dereference occurred in the second source line: prev = &(*prev)->td_hash; if (*prev) The original value in *prev was 0xa7a7a7a7 and the offset of td_hash is 0x1c, causing the offending address to be 0xa7a7a7c3. I have no idea why a struct td would have been freed while it was still in use. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 14564] New: capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). (lotsa cc's added) On Mon, 9 Nov 2009 08:59:05 GMT bugzilla-dae...@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14564 > >Summary: capture-example sleeping function called from invalid > context at arch/x86/mm/fault.c >Product: Memory Management >Version: 2.5 > Kernel Version: 2.6.31.5 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: high > Priority: P1 > Component: Slab Allocator > AssignedTo: a...@linux-foundation.org > ReportedBy: kn...@toaster.net > Regression: No > Thhis is odd. > On a DM&P ebox2300sx, 300Mhz Vortex86 cpu, I have a vanilla 2.6.31.5 kernel > with a pac207 webcam. I run capture-example from the v4l-dvb sample > applications and it crashes 1 out of 5 times. Let me know if I need to collect > more data or try anything. > > [r...@x-linux]:~ # capture-example > > ..BUG: > sleeping function called from invalid context at arch/x86/mm/fault.c:1069 > > in_atomic(): 0, irqs_disabled(): 1, pid: 1178, name: capture-example > > 4 locks held by capture-example/1178: > > #0: (&gspca_dev->queue_lock){+.+.+.}, at: [] > vidioc_streamoff+0x3b/0xb4 [gspca_main] > > #1: (&gspca_dev->usb_lock){+.+.+.}, at: [] > vidioc_streamoff+0x4e/0xb4 [gspca_main] > > #2: (&ohci->lock){-.-...}, at: [] > ohci_endpoint_disable+0x31/0x192 > #3: (&mm->mmap_sem){++}, at: [] do_page_fault+0xc1/0x1fe > > irq event stamp: 11656 > > hardirqs last enabled at (11655): [] _spin_unlock_irq+0x22/0x26 > > hardirqs last disabled at (11656): [] _spin_lock_irqsave+0x10/0x5a > > softirqs last enabled at (11610): [] __do_softirq+0x145/0x14d > > softirqs last disabled at (11605): [] do_softirq+0x2a/0x42 > > Pid: 1178, comm: capture-example Not tainted 2.6.31.5 #2 > > Call Trace: > > [] __might_sleep+0xcb/0xd0 > > [] do_page_fault+0x106/0x1fe > > [] ? do_page_fault+0x0/0x1fe > > [] error_code+0x63/0x70 > > [] ? do_page_fault+0x0/0x1fe > > [] ? td_free+0x23/0x75 > > [] ohci_endpoint_disable+0x113/0x192 > > [] usb_hcd_disable_endpoint+0x2e/0x32 > > [] usb_disable_endpoint+0x6d/0x72 > > [] usb_disable_interface+0x30/0x3f > > [] usb_set_interface+0x11b/0x1a0 > > [] gspca_set_alt0+0x23/0x46 [gspca_main] > > [] gspca_stream_off+0x35/0x5f [gspca_main] > > [] vidioc_streamoff+0x59/0xb4 [gspca_main] > > [] __video_do_ioctl+0x17af/0x3920 [videodev] > > [] ? __lock_acquire+0x6ef/0x755 > > [] ? lock_release_holdtime+0x81/0x86 > > [] ? lock_release_non_nested+0xab/0x1cf > > [] ? might_fault+0x3d/0x79 > > [] ? might_fault+0x3d/0x79 > > [] ? copy_from_user+0x31/0x54 > > [] video_ioctl2+0x303/0x3ea [videodev] > > [] ? lock_release_holdtime+0x81/0x86 > > [] ? _spin_unlock_irqrestore+0x36/0x3c > > [] ? trace_hardirqs_on_caller+0x104/0x12b > > [] ? trace_hardirqs_on+0xb/0xd > > [] ? video_ioctl2+0x0/0x3ea [videodev] > > [] v4l2_unlocked_ioctl+0x2e/0x32 [videodev] > > [] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev] > > [] vfs_ioctl+0x19/0x50 > > [] do_vfs_ioctl+0x458/0x4a3 > > [] ? tty_ldisc_deref+0x8/0xa > > [] ? tty_write+0x1b1/0x1c2 > > [] ? n_tty_writ