Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()
On Monday 11 May 2015 13:14:52 Alan Stern wrote: On Sun, 10 May 2015, Heiko Przybyl wrote: Hello again. The issue still bites me with 4.0.2. Though I was able to narrow down the issue, it's still hard to deterministically trigger it. Anyway. The setup: - OHCI controller with hooked up keyboard - xhddled (program that toggles scrolllock on disk activity) - disk activity I'm very certain that it happens, when usbhid issues an urb for scrolllock led toggling, the ohci driver adding a duplicate to its eds_in_use list. That's why I hit it way more often with disk activity. I also tried to alter the program to just periodically toggle the scrolllock, but that didn't seem to help for reproducibility. As long as you have a way to trigger the problem, you're in a good position to track it down. On Friday 23 January 2015 15:21:11 Alan Stern wrote: As suggested, checking the eds_in_use list was a good idea. Now, I'm even more convinced duplicates are being added to the eds_in_use list. Your tests gave me a clue as to where the problem might be. It looks like finish_unlinks() changes ed-state to ED_IDLE too soon. The state should remain set to ED_UNLINK while finish_urb() is called. See if the patch below eliminates your problem. Yes it indeed does eliminate the problem. Took some time to check if the problem is really gone, though. Without the patch I get lockups up to and including release 4.1; and not a single lockup with the patch applied. So for the patch I'd say Tested-By: Heiko Przybyl lil_...@web.de Thanks for looking into and fixing this issue, Heiko Alan Stern Index: usb-4.0/drivers/usb/host/ohci-q.c === --- usb-4.0.orig/drivers/usb/host/ohci-q.c +++ usb-4.0/drivers/usb/host/ohci-q.c @@ -980,10 +980,6 @@ rescan_all: int completed, modified; __hc32 *prev; - /* Is this ED already invisible to the hardware? */ - if (ed-state == ED_IDLE) - goto ed_idle; - /* only take off EDs that the HC isn't using, accounting for * frame counter wraps and EDs with partially retired TDs */ @@ -1011,12 +1007,10 @@ skip_ed: } /* ED's now officially unlinked, hc doesn't see */ - ed-state = ED_IDLE; ed-hwHeadP = ~cpu_to_hc32(ohci, ED_H); ed-hwNextED = 0; wmb(); ed-hwINFO = ~cpu_to_hc32(ohci, ED_SKIP | ED_DEQUEUE); -ed_idle: /* reentrancy: if we drop the schedule lock, someone might * have modified this list. normally it's just prepending @@ -1087,6 +1081,7 @@ rescan_this: if (list_empty(ed-td_list)) { *last = ed-ed_next; ed-ed_next = NULL; + ed-state = ED_IDLE; list_del(ed-in_use_list); } else if (ohci-rh_state == OHCI_RH_RUNNING) { *last = ed-ed_next; -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()
On Tuesday 20 January 2015 10:49:29 Alan Stern wrote: On Mon, 19 Jan 2015, Heiko Przybyl wrote: On Monday 19 January 2015 11:17:59 Alan Stern wrote: That's easy enough to test. All you have to do is change the spin_lock/unlock statements to their irq_save/restore variants. Well, thought about that as well, but I'm not sure when to take it as fixed and when to take it as issue-just-didn't-happen-yet, because of the not-so- deterministic occurrence of the error. But I can try it out anyway, just wanted to have some feedback before trying. By the way, failing to disable interrupts when acquiring a spinlock generally does not lead to data corruption -- it leads to deadlocks. So I doubt this is the cause of your problem. If you really want to, you could add a WARN_ON(!irqs_disabled()); line to ohci_irq(). You're right. After thinking/reading a bit more about the topic, not disabled IRQs would cause deadlocks, because there's already one thread in the critical section. No idea. It might be a good idea for you to try something a little more invasive. How about writing a routine to check the entire ohci-eds_in_use list for validity (each forward pointer is matched by the corresponding backward pointer), and calling this routine at each place where the list gets modified, before the modification happens? You could also make sure that an entry being added to the list isn't on the list already, and whenever an entry is deleted from the list either it really is on the list or else its list pointers point to themselves. I'm not 100% sure, but then it's probably a race between urb enqueuing (duplicates?) and the watchdog orphan cleanup. The crash log already shows the double list add in ohci_urb_enqueue ohci-hcd.c:238: list_add(ed-in_use_list, ohci-eds_in_use); This is probably due to the ed returned by ed_get() being reused before the watchdog ran, thus the same in_use_list re-added to ohci.eds_in_use. Entries seem to get removed in finish_unlinks() ohci-q.c:1090: list_del(ed-in_use_list); with list_del() poisoning the next/prev pointers of the removed entry. Now with the watchdog starting cleanup it iterates over the ohci.eds_in_use list that still has the second very same entry of in_use_list we double-added (but now with 0xdead... pointers) and we fault on ohci-hcd.c:761: if (ed-pending_td) { I hope that makes any sense? I'll hook up the list checking tomorrow. Though I haven't hit the (double-add) problem again, since the bug report. Seems pretty specific the whole thing. Alan Stern Heiko -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()
Hi all, been redirected here from bug #91511 [1]. I'm getting sporadic crashes in io_watchdog_func() in drivers/usb/host/ohci- hcd.c:761: list_for_each_entry(ed, ohci-eds_in_use, in_use_list) { if (ed-pending_td) { with the in_use list getting corrupted: from ohci_urb_enqueue(): [43656.918055] list_add double add: new=8800cbaa8040, prev=8800cb8aa5b8, next=8800cbaa8040. from ohci_work.part(): [43656.920980] list_del corruption. next-prev should be 8800cbaa8040, but was 8800cb8aa5b8 One or both set the pointer to 0xdead00100100 and 0xdead001000c0, where io_watchdog_func() chokes on [2]. It seems to be related to keyboard input (at least it happens when using the keyboard), without relation to system load. Can happen within a day after boot or after several days of hibernated uptime. Unfortunately, I haven't found a way to reliably reproduce the issue, yet. The box is a Gigabyte GA-78LMT-USB3 with AMD FX(tm)-6300 Six-Core Processor and [AMD/ATI] SB7x0 USB OHCI1 Controller. For more info including crash trace, just have a look at the bug report [1] My (pretty wild) guess is, that the corruption happens through a race in the interrupt handler ohci_irq(), which calls ohci_work(), which calls finish_urb(), which states: * PRECONDITION: ohci lock held, irqs blocked But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save| restore](). All other functions that call ohci_work() do at least spin_[un]lock_irq. So irqs could still be enabled and possibly the event triggered twice, thus the double list add? If you need more information just let me know (per the bug [1] or CC, since I'm not subscribed to this list). Kind regards, Heiko [1] https://bugzilla.kernel.org/show_bug.cgi?id=91511 [2] https://bugzilla.kernel.org/attachment.cgi?id=163721 -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()
On Monday 19 January 2015 11:17:59 Alan Stern wrote: On Mon, 19 Jan 2015, Heiko Przybyl wrote: It seems to be related to keyboard input (at least it happens when using the keyboard), without relation to system load. Can happen within a day after boot or after several days of hibernated uptime. Unfortunately, I haven't found a way to reliably reproduce the issue, yet. [..] My (pretty wild) guess is, that the corruption happens through a race in the interrupt handler ohci_irq(), which calls ohci_work(), which calls finish_urb(), which states: * PRECONDITION: ohci lock held, irqs blocked But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save| restore](). All other functions that call ohci_work() do at least spin_[un]lock_irq. So irqs could still be enabled and possibly the event triggered twice, thus the double list add? That's easy enough to test. All you have to do is change the spin_lock/unlock statements to their irq_save/restore variants. Well, thought about that as well, but I'm not sure when to take it as fixed and when to take it as issue-just-didn't-happen-yet, because of the not-so- deterministic occurrence of the error. But I can try it out anyway, just wanted to have some feedback before trying. ohci_irq() is an interrupt handler. In the absence of threaded IRQs, he kernel should always call interrupt handlers with interrupts disabled. Do you specify threadirqs on your boot command line? Never used threadirqs. # cat /proc/cmdline BOOT_IMAGE=/boot/gentoo root=/dev/sda2 ro rootfstype=ext4 resume=/dev/sda3 init=/usr/lib/systemd/systemd quiet libahci.ignore_sss=1 i8042.nopnp crashkernel=64M If that's not the explanation then we'll have to dig deeper. I can still work on a saved vmcore dump of a crash. Btw. using crash(1) and its command `bt -E`shows two kernel mode exceptions. Though, I can't figure out where the first one originates from CPU 3 IRQ STACK: KERNEL-MODE EXCEPTION FRAME AT: 88022ecc3638 [exception RIP: _raw_spin_unlock_irqrestore+9] RIP: 814774b9 RSP: 88022ecc36e8 RFLAGS: 0202 RAX: 88022ecc36a8 RBX: 88022ecc36b0 RCX: 81290279 RDX: 2dff RSI: RDI: 88022ecc3788 RBP: 88022ecc36e8 R8: 0080 R9: 0023 R10: 813e6407 R11: ea000863ad80 R12: 88022ecc3658 R13: 81478b2a R14: 88022ecc36e8 R15: 0001 ORIG_RAX: 81471cfd CS: 0010 SS: 0018 0x814774b9 +9: decl %gs:0xa860 CPU 5 IRQ STACK: KERNEL-MODE EXCEPTION FRAME AT: 88022ed43d98 [exception RIP: io_watchdog_func+112] RIP: 81394b80 RSP: 88022ed43e48 RFLAGS: 00010006 RAX: 8800cb8aa598 RBX: 0296 RCX: 8800cbaa8030 RDX: dead00100100 RSI: cbaa91e0 RDI: 8800cbaa8030 RBP: 88022ed43e88 R8: 8800cbaa7fe8 R9: 0205 R10: 8800cbaa8030 R11: 8800cb8aa5a0 R12: dead001000c0 R13: 8800cb8aa248 R14: 8800cb8aa5b8 R15: ORIG_RAX: CS: 0010 SS: Alan Stern Kind regards, Heiko -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html