Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-06-28 Thread Heiko Przybyl
On Monday 11 May 2015 13:14:52 Alan Stern wrote:
 On Sun, 10 May 2015, Heiko Przybyl wrote:
  Hello again.
  
  The issue still bites me with 4.0.2.
  
  Though I was able to narrow down the issue, it's still hard to
  deterministically trigger it.
  
  Anyway. The setup:
  - OHCI controller with hooked up keyboard
  - xhddled (program that toggles scrolllock on disk activity)
  - disk activity
  
  I'm very certain that it happens, when usbhid issues an urb for scrolllock
  led toggling, the ohci driver adding a duplicate to its eds_in_use list.
  That's why I hit it way more often with disk activity. I also tried to
  alter the program to just periodically toggle the scrolllock, but that
  didn't seem to help for reproducibility.
 
 As long as you have a way to trigger the problem, you're in a good
 position to track it down.
 
  On Friday 23 January 2015 15:21:11 Alan Stern wrote:
  
  As suggested, checking the eds_in_use list was a good idea. Now, I'm even
  more convinced duplicates are being added to the eds_in_use list.
 
 Your tests gave me a clue as to where the problem might be.  It looks
 like finish_unlinks() changes ed-state to ED_IDLE too soon.  The state
 should remain set to ED_UNLINK while finish_urb() is called.
 
 See if the patch below eliminates your problem.

Yes it indeed does eliminate the problem. Took some time to check if the 
problem is really gone, though. Without the patch I get lockups up to and 
including release 4.1; and not a single lockup with the patch applied.

So for the patch I'd say

Tested-By: Heiko Przybyl lil_...@web.de


Thanks for looking into and fixing this issue,

   Heiko

 
 Alan Stern
 
 
 
 Index: usb-4.0/drivers/usb/host/ohci-q.c
 ===
 --- usb-4.0.orig/drivers/usb/host/ohci-q.c
 +++ usb-4.0/drivers/usb/host/ohci-q.c
 @@ -980,10 +980,6 @@ rescan_all:
   int completed, modified;
   __hc32  *prev;
 
 - /* Is this ED already invisible to the hardware? */
 - if (ed-state == ED_IDLE)
 - goto ed_idle;
 -
   /* only take off EDs that the HC isn't using, accounting for
* frame counter wraps and EDs with partially retired TDs
*/
 @@ -1011,12 +1007,10 @@ skip_ed:
   }
 
   /* ED's now officially unlinked, hc doesn't see */
 - ed-state = ED_IDLE;
   ed-hwHeadP = ~cpu_to_hc32(ohci, ED_H);
   ed-hwNextED = 0;
   wmb();
   ed-hwINFO = ~cpu_to_hc32(ohci, ED_SKIP | ED_DEQUEUE);
 -ed_idle:
 
   /* reentrancy:  if we drop the schedule lock, someone might
* have modified this list.  normally it's just prepending
 @@ -1087,6 +1081,7 @@ rescan_this:
   if (list_empty(ed-td_list)) {
   *last = ed-ed_next;
   ed-ed_next = NULL;
 + ed-state = ED_IDLE;
   list_del(ed-in_use_list);
   } else if (ohci-rh_state == OHCI_RH_RUNNING) {
   *last = ed-ed_next;

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-20 Thread Heiko Przybyl
On Tuesday 20 January 2015 10:49:29 Alan Stern wrote:
 On Mon, 19 Jan 2015, Heiko Przybyl wrote:
  On Monday 19 January 2015 11:17:59 Alan Stern wrote:
   
   That's easy enough to test.  All you have to do is change the
   spin_lock/unlock statements to their irq_save/restore variants.
  
  Well, thought about that as well, but I'm not sure when to take it as
  fixed and when to take it as issue-just-didn't-happen-yet, because of the
  not-so- deterministic occurrence of the error. But I can try it out
  anyway, just wanted to have some feedback before trying.
 
 By the way, failing to disable interrupts when acquiring a spinlock
 generally does not lead to data corruption -- it leads to deadlocks.
 So I doubt this is the cause of your problem.  If you really want to,
 you could add a
 
   WARN_ON(!irqs_disabled());
 
 line to ohci_irq().

You're right. After thinking/reading a bit more about the topic, not disabled 
IRQs would cause deadlocks, because there's already one thread in the critical 
section.

 
 No idea.
 
 It might be a good idea for you to try something a little more
 invasive.  How about writing a routine to check the entire
 ohci-eds_in_use list for validity (each forward pointer is matched by
 the corresponding backward pointer), and calling this routine at each
 place where the list gets modified, before the modification happens?
 
 You could also make sure that an entry being added to the list isn't on
 the list already, and whenever an entry is deleted from the list
 either it really is on the list or else its list pointers point to
 themselves.
 

I'm not 100% sure, but then it's probably a race between urb 
enqueuing (duplicates?) and the watchdog orphan cleanup.

The crash log already shows the double list add in ohci_urb_enqueue

ohci-hcd.c:238: list_add(ed-in_use_list, ohci-eds_in_use);

This is probably due to the ed returned by ed_get() being reused before the
watchdog ran, thus the same in_use_list re-added to ohci.eds_in_use.

Entries seem to get removed in finish_unlinks()

ohci-q.c:1090: list_del(ed-in_use_list);

with list_del() poisoning the next/prev pointers of the removed entry.

Now with the watchdog starting cleanup it iterates over the ohci.eds_in_use 
list
that still has the second very same entry of in_use_list we double-added (but 
now with 0xdead... pointers) and we fault on 

ohci-hcd.c:761: if (ed-pending_td) {


I hope that makes any sense? I'll hook up the list checking tomorrow. Though I 
haven't hit the (double-add) problem again, since the bug report. Seems pretty 
specific the whole thing.


 Alan Stern

Heiko
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-19 Thread Heiko Przybyl
Hi all,

been redirected here from bug #91511 [1].

I'm getting sporadic crashes in io_watchdog_func() in drivers/usb/host/ohci-
hcd.c:761:

list_for_each_entry(ed, ohci-eds_in_use, in_use_list) {
if (ed-pending_td) {

with the in_use list getting corrupted:

from ohci_urb_enqueue():
[43656.918055] list_add double add: new=8800cbaa8040, 
prev=8800cb8aa5b8, 
next=8800cbaa8040.
from ohci_work.part():
[43656.920980] list_del corruption. next-prev should be 8800cbaa8040, but 
was 8800cb8aa5b8

One or both set the pointer to 0xdead00100100 and 0xdead001000c0, 
where io_watchdog_func() chokes on [2].

It seems to be related to keyboard input (at least it happens when using the 
keyboard), without relation to system load. Can happen within a day after boot 
or after several days of hibernated uptime. Unfortunately, I haven't found a 
way to reliably reproduce the issue, yet.

The box is a Gigabyte GA-78LMT-USB3 with AMD FX(tm)-6300 Six-Core 
Processor and [AMD/ATI] SB7x0 USB OHCI1 Controller.

For more info including crash trace, just have a look at the bug report [1]


My (pretty wild) guess is, that the corruption happens through a race in the 
interrupt handler ohci_irq(), which calls ohci_work(), which calls 
finish_urb(), which states:
 * PRECONDITION:  ohci lock held, irqs blocked

But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
restore](). All other functions that call ohci_work() do at least 
spin_[un]lock_irq. So irqs could still be enabled and possibly the event 
triggered twice, thus the double list add?


If you need more information just let me know (per the bug [1] or CC, since 
I'm not subscribed to this list).


Kind regards,

Heiko


[1] https://bugzilla.kernel.org/show_bug.cgi?id=91511
[2] https://bugzilla.kernel.org/attachment.cgi?id=163721

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-19 Thread Heiko Przybyl
On Monday 19 January 2015 11:17:59 Alan Stern wrote:
 On Mon, 19 Jan 2015, Heiko Przybyl wrote:
  It seems to be related to keyboard input (at least it happens when using
  the keyboard), without relation to system load. Can happen within a day
  after boot or after several days of hibernated uptime. Unfortunately, I
  haven't found a way to reliably reproduce the issue, yet.
  
  [..]
  
  My (pretty wild) guess is, that the corruption happens through a race in
  the interrupt handler ohci_irq(), which calls ohci_work(), which calls
  finish_urb(), which states:
   * PRECONDITION:  ohci lock held, irqs blocked
  
  But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
  restore](). All other functions that call ohci_work() do at least
  spin_[un]lock_irq. So irqs could still be enabled and possibly the event
  triggered twice, thus the double list add?
 
 That's easy enough to test.  All you have to do is change the
 spin_lock/unlock statements to their irq_save/restore variants.

Well, thought about that as well, but I'm not sure when to take it as fixed and 
when to take it as issue-just-didn't-happen-yet, because of the not-so-
deterministic occurrence of the error. But I can try it out anyway, just 
wanted to have some feedback before trying.

 
 ohci_irq() is an interrupt handler.  In the absence of threaded IRQs,
 he kernel should always call interrupt handlers with interrupts
 disabled.  Do you specify threadirqs on your boot command line?
 

Never used threadirqs.

# cat /proc/cmdline 
BOOT_IMAGE=/boot/gentoo root=/dev/sda2 ro rootfstype=ext4 resume=/dev/sda3 
init=/usr/lib/systemd/systemd quiet libahci.ignore_sss=1 i8042.nopnp 
crashkernel=64M

 If that's not the explanation then we'll have to dig deeper.

I can still work on a saved vmcore dump of a crash. Btw. using crash(1) and 
its command `bt -E`shows two kernel mode exceptions. Though, I can't figure out 
where the first one originates from

CPU 3 IRQ STACK:
  KERNEL-MODE EXCEPTION FRAME AT: 88022ecc3638
[exception RIP: _raw_spin_unlock_irqrestore+9]
RIP: 814774b9  RSP: 88022ecc36e8  RFLAGS: 0202
RAX: 88022ecc36a8  RBX: 88022ecc36b0  RCX: 81290279
RDX: 2dff  RSI:   RDI: 88022ecc3788
RBP: 88022ecc36e8   R8: 0080   R9: 0023
R10: 813e6407  R11: ea000863ad80  R12: 88022ecc3658
R13: 81478b2a  R14: 88022ecc36e8  R15: 0001
ORIG_RAX: 81471cfd  CS: 0010  SS: 0018

0x814774b9 +9: decl   %gs:0xa860

CPU 5 IRQ STACK:
  KERNEL-MODE EXCEPTION FRAME AT: 88022ed43d98
[exception RIP: io_watchdog_func+112]
RIP: 81394b80  RSP: 88022ed43e48  RFLAGS: 00010006
RAX: 8800cb8aa598  RBX: 0296  RCX: 8800cbaa8030
RDX: dead00100100  RSI: cbaa91e0  RDI: 8800cbaa8030
RBP: 88022ed43e88   R8: 8800cbaa7fe8   R9: 0205
R10: 8800cbaa8030  R11: 8800cb8aa5a0  R12: dead001000c0
R13: 8800cb8aa248  R14: 8800cb8aa5b8  R15: 
ORIG_RAX:   CS: 0010  SS: 


 Alan Stern

Kind regards,

   Heiko
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html