Re: Help troubleshooting ehci_idone hang.

2013-09-12 Thread Martin Pieuchot
On 11/09/13(Wed) 11:03, RD Thrush wrote:
 On 09/10/13 07:56, Martin Pieuchot wrote:
  On 10/09/13(Tue) 07:15, RD Thrush wrote:
  On 09/10/13 04:42, Martin Pieuchot wrote:
  [...]
 
  Thanks for this detailed bug report.
 
  You're saying that you have 2 amd64 systems with the same problem but
  I see only the dmesg for one machine, does the other has the same ehci
  controller?
 
  Apparently one is ATI and the other Intel.  
  http://arp.thrush.com/openbsd/ehci_idone/v1/ has two console captures, 
  v1.1 and v1.2, for the other machine after an ehci_idone hang (I 
  hadn't made the panic patch yet).  I was able to generate a ddb interrupt 
  to stop the spew and gather some additional ddb info.  The forementioned 
  directory also has acpidump, pcidump, biosdecode, and dmidecode previously 
  collected from the same kernel.
 
  If you want/need further info about the 'v1' machine, let me know and I'll 
  boot OpenBSD and get the info.
  
  It would be nice if you could reproduce the manipulation you did with
  the other machine and set ehcidebug to 5 before switching your kvm.
 
 With ehcidebug=5 on the v1 machine, switching the kvm resulted in a continual
 ddb loop,  I wasn't able to generate a ddb interrupt via the serial console;
 however, the pc keyboard was able to drop into ddb where I collected some
 additional info.
 
 'boot sync' resulted in the panic I'd patched (earlier in thread) to stop the
 initial hang.  I had to do a hard reset to regain control.
 
 http://arp.thrush.com/openbsd/ehci_idone/v1/v1-2 has the capture of the 
 serial
 console for that session.

Could you try the diff below on the v1 machine and tell me if it helps?

 WRT to the other machine, x4, I installed the patch and have not yet had a
 problem.  However, with ehcidebug=5, the following 2 line message is issued
 about once per second:
 ehci_intrlist_timeout
 ehci_check_intr: ex=0x80238c00

That's expected, thanks for confirming the problem cannot be reproduced
with this diff with an ATI controller.

Index: usbdi.c
===
RCS file: /home/ncvs/src/sys/dev/usb/usbdi.c,v
retrieving revision 1.58
diff -u -p -r1.58 usbdi.c
--- usbdi.c 6 Sep 2013 08:29:58 -   1.58
+++ usbdi.c 12 Sep 2013 08:47:09 -
@@ -810,6 +810,8 @@ usb_transfer_complete(struct usbd_xfer *
if (!repeat) {
/* XXX should we stop the queue on all errors? */
if ((xfer-status == USBD_CANCELLED ||
+xfer-status == USBD_IOERROR ||
+xfer-status == USBD_STALLED ||
 xfer-status == USBD_TIMEOUT) 
pipe-iface != NULL)/* not control pipe */
pipe-running = 0;



Re: Help troubleshooting ehci_idone hang.

2013-09-12 Thread RD Thrush
On 09/12/13 05:15, Martin Pieuchot wrote:
 On 11/09/13(Wed) 11:03, RD Thrush wrote:
 On 09/10/13 07:56, Martin Pieuchot wrote:
 On 10/09/13(Tue) 07:15, RD Thrush wrote:
 On 09/10/13 04:42, Martin Pieuchot wrote:
 [...]

 Thanks for this detailed bug report.

 You're saying that you have 2 amd64 systems with the same problem but
 I see only the dmesg for one machine, does the other has the same ehci
 controller?

 Apparently one is ATI and the other Intel.  
 http://arp.thrush.com/openbsd/ehci_idone/v1/ has two console captures, 
 v1.1 and v1.2, for the other machine after an ehci_idone hang (I 
 hadn't made the panic patch yet).  I was able to generate a ddb interrupt 
 to stop the spew and gather some additional ddb info.  The forementioned 
 directory also has acpidump, pcidump, biosdecode, and dmidecode previously 
 collected from the same kernel.

 If you want/need further info about the 'v1' machine, let me know and I'll 
 boot OpenBSD and get the info.

 It would be nice if you could reproduce the manipulation you did with
 the other machine and set ehcidebug to 5 before switching your kvm.

 With ehcidebug=5 on the v1 machine, switching the kvm resulted in a continual
 ddb loop,  I wasn't able to generate a ddb interrupt via the serial console;
 however, the pc keyboard was able to drop into ddb where I collected some
 additional info.

 'boot sync' resulted in the panic I'd patched (earlier in thread) to stop the
 initial hang.  I had to do a hard reset to regain control.

 http://arp.thrush.com/openbsd/ehci_idone/v1/v1-2 has the capture of the 
 serial
 console for that session.
 
 Could you try the diff below on the v1 machine and tell me if it helps?

Thanks, I don't think it helped...

After booting the new kernel, upon first use of the kvm switch, the serial
console began filling with diagnostics which (to my untrained eye) looked
similar to the pre-patch session. v1-2.  Since I was unable to remotely ssh, I
interrupted ddb and set ehcidebug=1 but was unable to remotely ssh.  ehcidebug=0
allowed remote ssh and normal operations resumed.

http://arp.thrush.com/openbsd/ehci_idone/v1/v1-3 has the associated serial
console.

Let me know if I can provide further info.

 WRT to the other machine, x4, I installed the patch and have not yet had a
 problem.  However, with ehcidebug=5, the following 2 line message is issued
 about once per second:
 ehci_intrlist_timeout
 ehci_check_intr: ex=0x80238c00
 
 That's expected, thanks for confirming the problem cannot be reproduced
 with this diff with an ATI controller.

Since the problem has been intermittent since Nov 2012, I'm not sure enough
time/kvm usage has occurred for certainty about that fix.

I will run a custom MP kernel with your pci/ehci_pci.c patch on the x4 machine
whenever I install a new snapshot and let you know if I notice the ehci_idone 
issue.


 Index: usbdi.c
 ===
 RCS file: /home/ncvs/src/sys/dev/usb/usbdi.c,v
 retrieving revision 1.58
 diff -u -p -r1.58 usbdi.c
 --- usbdi.c   6 Sep 2013 08:29:58 -   1.58
 +++ usbdi.c   12 Sep 2013 08:47:09 -
 @@ -810,6 +810,8 @@ usb_transfer_complete(struct usbd_xfer *
   if (!repeat) {
   /* XXX should we stop the queue on all errors? */
   if ((xfer-status == USBD_CANCELLED ||
 +  xfer-status == USBD_IOERROR ||
 +  xfer-status == USBD_STALLED ||
xfer-status == USBD_TIMEOUT) 
   pipe-iface != NULL)/* not control pipe */
   pipe-running = 0;



Re: Help troubleshooting ehci_idone hang.

2013-09-12 Thread Martin Pieuchot
On 12/09/13(Thu) 08:16, RD Thrush wrote:
 On 09/12/13 05:15, Martin Pieuchot wrote:
  [...]
  
  Could you try the diff below on the v1 machine and tell me if it helps?
 
 Thanks, I don't think it helped...

By looking at your new log, I believe it did ;)

 After booting the new kernel, upon first use of the kvm switch, the serial
 console began filling with diagnostics which (to my untrained eye) looked
 similar to the pre-patch session. v1-2.  Since I was unable to remotely ssh, I
 interrupted ddb and set ehcidebug=1 but was unable to remotely ssh.  
 ehcidebug=0
 allowed remote ssh and normal operations resumed.
 
 http://arp.thrush.com/openbsd/ehci_idone/v1/v1-3 has the associated serial
 console.
 
 Let me know if I can provide further info.

Could you reproduce once again this manipulation with the diff below
applied on top of the previous one?

  WRT to the other machine, x4, I installed the patch and have not yet had a
  problem.  However, with ehcidebug=5, the following 2 line message is issued
  about once per second:
  ehci_intrlist_timeout
  ehci_check_intr: ex=0x80238c00
  
  That's expected, thanks for confirming the problem cannot be reproduced
  with this diff with an ATI controller.
 
 Since the problem has been intermittent since Nov 2012, I'm not sure enough
 time/kvm usage has occurred for certainty about that fix.

Since this diff is a backport of what has been done in FreeBSD and I
already had a similar report from another developer with the same
controller, I'm quite confident.

 I will run a custom MP kernel with your pci/ehci_pci.c patch on the x4 machine
 whenever I install a new snapshot and let you know if I notice the ehci_idone 
 issue.

Please let me know.

Index: uhidev.c
===
RCS file: /home/ncvs/src/sys/dev/usb/uhidev.c,v
retrieving revision 1.44
diff -u -p -r1.44 uhidev.c
--- uhidev.c7 May 2013 08:44:38 -   1.44
+++ uhidev.c12 Sep 2013 14:25:52 -
@@ -393,7 +393,7 @@ uhidev_intr(struct usbd_xfer *xfer, void
}
 #endif
 
-   if (status == USBD_CANCELLED)
+   if (status == USBD_CANCELLED || status == USBD_IOERROR)
return;
 
if (status != USBD_NORMAL_COMPLETION) {



Re: Help troubleshooting ehci_idone hang.

2013-09-12 Thread RD Thrush
On 09/12/13 10:34, Martin Pieuchot wrote:
 On 12/09/13(Thu) 08:16, RD Thrush wrote:
 On 09/12/13 05:15, Martin Pieuchot wrote:
 [...]

 Could you try the diff below on the v1 machine and tell me if it helps?

 Thanks, I don't think it helped...
 
 By looking at your new log, I believe it did ;)
 
 After booting the new kernel, upon first use of the kvm switch, the serial
 console began filling with diagnostics which (to my untrained eye) looked
 similar to the pre-patch session. v1-2.  Since I was unable to remotely ssh, 
 I
 interrupted ddb and set ehcidebug=1 but was unable to remotely ssh.  
 ehcidebug=0
 allowed remote ssh and normal operations resumed.

 http://arp.thrush.com/openbsd/ehci_idone/v1/v1-3 has the associated serial
 console.

 Let me know if I can provide further info.
 
 Could you reproduce once again this manipulation with the diff below
 applied on top of the previous one?

http://arp.thrush.com/openbsd/ehci_idone/v1/v1-4 has the results after adding
your diff.

I'm assuming the continual diagnostic output is expected with ehcidebug=5.
Interrupting ddb, setting ehcidebug=1 and continuing ddb still hits my panic
hack; however, I'll assume that's due to the crudeness of my hack.  I did
subsequently reboot with the same custom kernel w/ ehcidebug=0 and didn't hit
the panic hack.

What's the next step?


 WRT to the other machine, x4, I installed the patch and have not yet had a
 problem.  However, with ehcidebug=5, the following 2 line message is issued
 about once per second:
 ehci_intrlist_timeout
 ehci_check_intr: ex=0x80238c00

 That's expected, thanks for confirming the problem cannot be reproduced
 with this diff with an ATI controller.

 Since the problem has been intermittent since Nov 2012, I'm not sure enough
 time/kvm usage has occurred for certainty about that fix.
 
 Since this diff is a backport of what has been done in FreeBSD and I
 already had a similar report from another developer with the same
 controller, I'm quite confident.
 
 I will run a custom MP kernel with your pci/ehci_pci.c patch on the x4 
 machine
 whenever I install a new snapshot and let you know if I notice the 
 ehci_idone issue.
 
 Please let me know.
 
 Index: uhidev.c
 ===
 RCS file: /home/ncvs/src/sys/dev/usb/uhidev.c,v
 retrieving revision 1.44
 diff -u -p -r1.44 uhidev.c
 --- uhidev.c  7 May 2013 08:44:38 -   1.44
 +++ uhidev.c  12 Sep 2013 14:25:52 -
 @@ -393,7 +393,7 @@ uhidev_intr(struct usbd_xfer *xfer, void
   }
  #endif
  
 - if (status == USBD_CANCELLED)
 + if (status == USBD_CANCELLED || status == USBD_IOERROR)
   return;
  
   if (status != USBD_NORMAL_COMPLETION) {



Re: Help troubleshooting ehci_idone hang.

2013-09-11 Thread RD Thrush
On 09/10/13 07:56, Martin Pieuchot wrote:
 On 10/09/13(Tue) 07:15, RD Thrush wrote:
 On 09/10/13 04:42, Martin Pieuchot wrote:
 [...]

 Thanks for this detailed bug report.

 You're saying that you have 2 amd64 systems with the same problem but
 I see only the dmesg for one machine, does the other has the same ehci
 controller?

 Apparently one is ATI and the other Intel.  
 http://arp.thrush.com/openbsd/ehci_idone/v1/ has two console captures, 
 v1.1 and v1.2, for the other machine after an ehci_idone hang (I hadn't 
 made the panic patch yet).  I was able to generate a ddb interrupt to stop 
 the spew and gather some additional ddb info.  The forementioned directory 
 also has acpidump, pcidump, biosdecode, and dmidecode previously collected 
 from the same kernel.

 If you want/need further info about the 'v1' machine, let me know and I'll 
 boot OpenBSD and get the info.
 
 It would be nice if you could reproduce the manipulation you did with
 the other machine and set ehcidebug to 5 before switching your kvm.

With ehcidebug=5 on the v1 machine, switching the kvm resulted in a continual
ddb loop,  I wasn't able to generate a ddb interrupt via the serial console;
however, the pc keyboard was able to drop into ddb where I collected some
additional info.

'boot sync' resulted in the panic I'd patched (earlier in thread) to stop the
initial hang.  I had to do a hard reset to regain control.

http://arp.thrush.com/openbsd/ehci_idone/v1/v1-2 has the capture of the serial
console for that session.


WRT to the other machine, x4, I installed the patch and have not yet had a
problem.  However, with ehcidebug=5, the following 2 line message is issued
about once per second:
ehci_intrlist_timeout
ehci_check_intr: ex=0x80238c00

That periodic message did not occur with the v1 machine.

Sorry for the delay in reporting...



Re: Help troubleshooting ehci_idone hang.

2013-09-10 Thread Martin Pieuchot
Hi Bob,

On 07/09/13(Sat) 08:14, RD Thrush wrote:
 Since appx. November, 2012, I've had 2 amd64 systems hang while
 spewing ehci_idone: ex=0x80.. is done! messages to the
 serial console.  The hangs are intermittent.  The system is
 unresponsive to the keyboard and doesn't respond to network ping.  A
 hardware reset is necessary to regain control.
 
 In order to help troubleshoot, I patched /usr/src/sys/dev/usb/ehci.c
 to panic when the forementioned message had occurred 9 times and then
 built a custom kernel with EHCI_DEBUG defined.  In the past day, the
 new panic has occurred on the same machine with both an mp and sp
 kernel and I have collected basic ddb information as well as crash
 dumps.
 
 Will the ddb results from my patch [below] help troubleshoot the hang?
 If so, the largish console logs, usbdevs, pcidump and acpidump are
 located at http://arp.thrush.com/openbsd/ehci_idone/x4/.
 
 NB: ehcidebug=0 in the sp session, while ehcidebug=3 or 2 in the mp session.
 Setting ehcidebug=3 seemed to hang but I was able to interrupt ddb, set
 ehcidebug=2 and continue the ddb session.
 
 I appreciate any help diagnosing this problem.

Thanks for this detailed bug report.

You're saying that you have 2 amd64 systems with the same problem but
I see only the dmesg for one machine, does the other has the same ehci
controller?

The problem you are seeing is related to the way ehci transfers are
aborted.  The abortion process is subtly broken.

For the archives what happens in your case is that the timeout for
one of the transfers fires and enqueue an abort task (ehci_timeout
in your log).  This abort task get scheduled tries to deactivate
the qTDs, asks for an Interrupt on Async Advance Doorbell and goes
to sleep (ehci_sync_hc in your log).
Then the interrupt happens (ehci_intr1: door bell), wakeups the
task and goes into the softinterrupt path to process the finished
transfers.  Here the driver discovers that the transfer that timed
out is finished (whoa!) and tries to handles it.  But since this 
transfer has been marked as TIMEOUT (ehci_idone: aborted in your
log), it does nothing and bails.  

Apparently the abort task never get rescheduled and your transfer
is never removed from the list, certainly because the hardware 
keeps interrupting your systems, so you're livelock ;)

But all of that happens because a timeout fires for one of your
transfers, apparently some ATI controllers needs one more quirk,
as your problem looks like a dropped interrupt.  Does the diff
below helps? 

Index: pci/ehci_pci.c
===
RCS file: /home/ncvs/src/sys/dev/pci/ehci_pci.c,v
retrieving revision 1.26
diff -u -p -r1.26 ehci_pci.c
--- pci/ehci_pci.c  15 Apr 2013 09:23:01 -  1.26
+++ pci/ehci_pci.c  10 Sep 2013 08:19:28 -
@@ -204,10 +204,16 @@ ehci_pci_attach(struct device *parent, s
else
snprintf(sc-sc.sc_vendor, sizeof(sc-sc.sc_vendor),
vendor 0x%04x, PCI_VENDOR(pa-pa_id));
-   
+
/* Enable workaround for dropped interrupts as required */
-   if (sc-sc.sc_id_vendor == PCI_VENDOR_VIATECH)
+   switch (sc-sc.sc_id_vendor) {
+   case PCI_VENDOR_ATI:
+   case PCI_VENDOR_VIATECH:
sc-sc.sc_flags |= EHCIF_DROPPED_INTR_WORKAROUND;
+   break;
+   default:
+   break;
+   }
 
ehci_pci_takecontroller(sc, 0);
r = ehci_init(sc-sc);
Index: usb/ehci.c
===
RCS file: /home/ncvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.134
diff -u -p -r1.134 ehci.c
--- usb/ehci.c  12 Jun 2013 11:42:01 -  1.134
+++ usb/ehci.c  10 Sep 2013 08:19:28 -
@@ -2943,11 +2943,11 @@ ehci_timeout_task(void *addr)
 }
 
 /*
- * Some EHCI chips from VIA seem to trigger interrupts before writing back the
- * qTD status, or miss signalling occasionally under heavy load.  If the host
- * machine is too fast, we we can miss transaction completion - when we scan
- * the active list the transaction still seems to be active.  This generally
- * exhibits itself as a umass stall that never recovers.
+ * Some EHCI chips from VIA / ATI seem to trigger interrupts before writing
+ * back the qTD status, or miss signalling occasionally under heavy load.
+ * If the host machine is too fast, we can miss transaction completion - when
+ * we scan the active list the transaction still seems to be active. This
+ * generally exhibits itself as a umass stall that never recovers.
  *
  * We work around this behaviour by setting up this callback after any softintr
  * that completes with transactions still pending, giving us another chance to



Re: Help troubleshooting ehci_idone hang.

2013-09-10 Thread RD Thrush
On 09/10/13 04:42, Martin Pieuchot wrote:
 Hi Bob,
 
 On 07/09/13(Sat) 08:14, RD Thrush wrote:
 Since appx. November, 2012, I've had 2 amd64 systems hang while
 spewing ehci_idone: ex=0x80.. is done! messages to the
 serial console.  The hangs are intermittent.  The system is
 unresponsive to the keyboard and doesn't respond to network ping.  A
 hardware reset is necessary to regain control.

 In order to help troubleshoot, I patched /usr/src/sys/dev/usb/ehci.c
 to panic when the forementioned message had occurred 9 times and then
 built a custom kernel with EHCI_DEBUG defined.  In the past day, the
 new panic has occurred on the same machine with both an mp and sp
 kernel and I have collected basic ddb information as well as crash
 dumps.

 Will the ddb results from my patch [below] help troubleshoot the hang?
 If so, the largish console logs, usbdevs, pcidump and acpidump are
 located at http://arp.thrush.com/openbsd/ehci_idone/x4/.

 NB: ehcidebug=0 in the sp session, while ehcidebug=3 or 2 in the mp session.
 Setting ehcidebug=3 seemed to hang but I was able to interrupt ddb, set
 ehcidebug=2 and continue the ddb session.

 I appreciate any help diagnosing this problem.
 
 Thanks for this detailed bug report.
 
 You're saying that you have 2 amd64 systems with the same problem but
 I see only the dmesg for one machine, does the other has the same ehci
 controller?

Apparently one is ATI and the other Intel.  
http://arp.thrush.com/openbsd/ehci_idone/v1/ has two console captures, v1.1 
and v1.2, for the other machine after an ehci_idone hang (I hadn't made the 
panic patch yet).  I was able to generate a ddb interrupt to stop the spew and 
gather some additional ddb info.  The forementioned directory also has 
acpidump, pcidump, biosdecode, and dmidecode previously collected from the same 
kernel.

If you want/need further info about the 'v1' machine, let me know and I'll boot 
OpenBSD and get the info.


x4:
x2:x4/ehci_idone 1481egrep 'ehci[0-9]' console.mp | grep -w at
ehci0 at pci0 dev 18 function 2 ATI SB700 USB2 rev 0x00ehci0: offs=32
usb0 at ehci0: USB revision 2.0
ehci1 at pci0 dev 19 function 2 ATI SB700 USB2 rev 0x00ehci1: offs=32
usb1 at ehci1: USB revision 2.0

v1:
x2:v1/ehci_idone 1479egrep 'ehci[0-9]' v1.1 | grep -w at
ehci0 at pci0 dev 26 function 0 Intel 7 Series USB rev 0x04: apic 2 int 23
usb0 at ehci0: USB revision 2.0
ehci1 at pci0 dev 29 function 0 Intel 7 Series USB rev 0x04: apic 2 int 23
usb1 at ehci1: USB revision 2.0


 The problem you are seeing is related to the way ehci transfers are
 aborted.  The abortion process is subtly broken.
 
 For the archives what happens in your case is that the timeout for
 one of the transfers fires and enqueue an abort task (ehci_timeout
 in your log).  This abort task get scheduled tries to deactivate
 the qTDs, asks for an Interrupt on Async Advance Doorbell and goes
 to sleep (ehci_sync_hc in your log).
 Then the interrupt happens (ehci_intr1: door bell), wakeups the
 task and goes into the softinterrupt path to process the finished
 transfers.  Here the driver discovers that the transfer that timed
 out is finished (whoa!) and tries to handles it.  But since this 
 transfer has been marked as TIMEOUT (ehci_idone: aborted in your
 log), it does nothing and bails.  
 
 Apparently the abort task never get rescheduled and your transfer
 is never removed from the list, certainly because the hardware 
 keeps interrupting your systems, so you're livelock ;)
 
 But all of that happens because a timeout fires for one of your
 transfers, apparently some ATI controllers needs one more quirk,
 as your problem looks like a dropped interrupt.  Does the diff
 below helps?

Thank you very much for the detailed analysis and patch.  I'll build a -current 
kernel and try it.

Would there be a complementary patch for the (above) Intel ehci controller?


 Index: pci/ehci_pci.c
 ===
 RCS file: /home/ncvs/src/sys/dev/pci/ehci_pci.c,v
 retrieving revision 1.26
 diff -u -p -r1.26 ehci_pci.c
 --- pci/ehci_pci.c15 Apr 2013 09:23:01 -  1.26
 +++ pci/ehci_pci.c10 Sep 2013 08:19:28 -
 @@ -204,10 +204,16 @@ ehci_pci_attach(struct device *parent, s
   else
   snprintf(sc-sc.sc_vendor, sizeof(sc-sc.sc_vendor),
   vendor 0x%04x, PCI_VENDOR(pa-pa_id));
 - 
 +
   /* Enable workaround for dropped interrupts as required */
 - if (sc-sc.sc_id_vendor == PCI_VENDOR_VIATECH)
 + switch (sc-sc.sc_id_vendor) {
 + case PCI_VENDOR_ATI:
 + case PCI_VENDOR_VIATECH:
   sc-sc.sc_flags |= EHCIF_DROPPED_INTR_WORKAROUND;
 + break;
 + default:
 + break;
 + }
  
   ehci_pci_takecontroller(sc, 0);
   r = ehci_init(sc-sc);
 Index: usb/ehci.c
 ===
 RCS file: /home/ncvs/src/sys/dev/usb/ehci.c,v
 retrieving revision 1.134
 diff -u -p 

Re: Help troubleshooting ehci_idone hang.

2013-09-10 Thread Martin Pieuchot
On 10/09/13(Tue) 07:15, RD Thrush wrote:
 On 09/10/13 04:42, Martin Pieuchot wrote:
  [...]
 
  Thanks for this detailed bug report.
  
  You're saying that you have 2 amd64 systems with the same problem but
  I see only the dmesg for one machine, does the other has the same ehci
  controller?
 
 Apparently one is ATI and the other Intel.  
 http://arp.thrush.com/openbsd/ehci_idone/v1/ has two console captures, 
 v1.1 and v1.2, for the other machine after an ehci_idone hang (I hadn't 
 made the panic patch yet).  I was able to generate a ddb interrupt to stop 
 the spew and gather some additional ddb info.  The forementioned directory 
 also has acpidump, pcidump, biosdecode, and dmidecode previously collected 
 from the same kernel.
 
 If you want/need further info about the 'v1' machine, let me know and I'll 
 boot OpenBSD and get the info.

It would be nice if you could reproduce the manipulation you did with
the other machine and set ehcidebug to 5 before switching your kvm.

  The problem you are seeing is related to the way ehci transfers are
  aborted.  The abortion process is subtly broken.
  
  For the archives what happens in your case is that the timeout for
  one of the transfers fires and enqueue an abort task (ehci_timeout
  in your log).  This abort task get scheduled tries to deactivate
  the qTDs, asks for an Interrupt on Async Advance Doorbell and goes
  to sleep (ehci_sync_hc in your log).
  Then the interrupt happens (ehci_intr1: door bell), wakeups the
  task and goes into the softinterrupt path to process the finished
  transfers.  Here the driver discovers that the transfer that timed
  out is finished (whoa!) and tries to handles it.  But since this 
  transfer has been marked as TIMEOUT (ehci_idone: aborted in your
  log), it does nothing and bails.  
  
  Apparently the abort task never get rescheduled and your transfer
  is never removed from the list, certainly because the hardware 
  keeps interrupting your systems, so you're livelock ;)
  
  But all of that happens because a timeout fires for one of your
  transfers, apparently some ATI controllers needs one more quirk,
  as your problem looks like a dropped interrupt.  Does the diff
  below helps?
 
 Thank you very much for the detailed analysis and patch.  I'll build a 
 -current kernel and try it.
 
 Would there be a complementary patch for the (above) Intel ehci controller?

I'm not even sure this will avoid your problem, a proper fix would be to
stop trying to deactivate the transfer descriptors, as it obviously
doesn't work, and just remove them from the list.  Does anybody want to
take the time to do that? :)

Otherwise you can just buy a non crappy kvm ;)

M.



Re: Help troubleshooting ehci_idone hang.

2013-09-10 Thread RD Thrush
On 09/10/13 07:56, Martin Pieuchot wrote:
 On 10/09/13(Tue) 07:15, RD Thrush wrote:
 On 09/10/13 04:42, Martin Pieuchot wrote:
 [...]

 Thanks for this detailed bug report.

 You're saying that you have 2 amd64 systems with the same problem but
 I see only the dmesg for one machine, does the other has the same ehci
 controller?

 Apparently one is ATI and the other Intel.  
 http://arp.thrush.com/openbsd/ehci_idone/v1/ has two console captures, 
 v1.1 and v1.2, for the other machine after an ehci_idone hang (I hadn't 
 made the panic patch yet).  I was able to generate a ddb interrupt to stop 
 the spew and gather some additional ddb info.  The forementioned directory 
 also has acpidump, pcidump, biosdecode, and dmidecode previously collected 
 from the same kernel.

 If you want/need further info about the 'v1' machine, let me know and I'll 
 boot OpenBSD and get the info.
 
 It would be nice if you could reproduce the manipulation you did with
 the other machine and set ehcidebug to 5 before switching your kvm.
 
 The problem you are seeing is related to the way ehci transfers are
 aborted.  The abortion process is subtly broken.

 For the archives what happens in your case is that the timeout for
 one of the transfers fires and enqueue an abort task (ehci_timeout
 in your log).  This abort task get scheduled tries to deactivate
 the qTDs, asks for an Interrupt on Async Advance Doorbell and goes
 to sleep (ehci_sync_hc in your log).
 Then the interrupt happens (ehci_intr1: door bell), wakeups the
 task and goes into the softinterrupt path to process the finished
 transfers.  Here the driver discovers that the transfer that timed
 out is finished (whoa!) and tries to handles it.  But since this 
 transfer has been marked as TIMEOUT (ehci_idone: aborted in your
 log), it does nothing and bails.  

 Apparently the abort task never get rescheduled and your transfer
 is never removed from the list, certainly because the hardware 
 keeps interrupting your systems, so you're livelock ;)

 But all of that happens because a timeout fires for one of your
 transfers, apparently some ATI controllers needs one more quirk,
 as your problem looks like a dropped interrupt.  Does the diff
 below helps?

 Thank you very much for the detailed analysis and patch.  I'll build a 
 -current kernel and try it.

 Would there be a complementary patch for the (above) Intel ehci controller?
 
 I'm not even sure this will avoid your problem, a proper fix would be to
 stop trying to deactivate the transfer descriptors, as it obviously
 doesn't work, and just remove them from the list.  Does anybody want to
 take the time to do that? :)
 
 Otherwise you can just buy a non crappy kvm ;)

Does anyone have suggestions for a non-crappy kvm?

For the record, this crappy kvm is a Trendnet TK-407.



Help troubleshooting ehci_idone hang.

2013-09-07 Thread RD Thrush
Since appx. November, 2012, I've had 2 amd64 systems hang while
spewing ehci_idone: ex=0x80.. is done! messages to the
serial console.  The hangs are intermittent.  The system is
unresponsive to the keyboard and doesn't respond to network ping.  A
hardware reset is necessary to regain control.

In order to help troubleshoot, I patched /usr/src/sys/dev/usb/ehci.c
to panic when the forementioned message had occurred 9 times and then
built a custom kernel with EHCI_DEBUG defined.  In the past day, the
new panic has occurred on the same machine with both an mp and sp
kernel and I have collected basic ddb information as well as crash
dumps.

Will the ddb results from my patch [below] help troubleshoot the hang?
If so, the largish console logs, usbdevs, pcidump and acpidump are
located at http://arp.thrush.com/openbsd/ehci_idone/x4/.

NB: ehcidebug=0 in the sp session, while ehcidebug=3 or 2 in the mp session.
Setting ehcidebug=3 seemed to hang but I was able to interrupt ddb, set
ehcidebug=2 and continue the ddb session.

I appreciate any help diagnosing this problem.

Thanks, Bob


Index: dev/usb/ehci.c
===
RCS file: /pub2/cvsroot/OpenBSD/src/sys/dev/usb/ehci.c,v
retrieving revision 1.134
diff -u -p -w -b -u -r1.134 ehci.c
--- dev/usb/ehci.c  12 Jun 2013 11:42:01 -  1.134
+++ dev/usb/ehci.c  12 Jun 2013 12:47:18 -
@@ -81,6 +81,8 @@ struct cfdriver ehci_cd = {
 #define DPRINTF(x) do { if (ehcidebug) printf x; } while(0)
 #define DPRINTFN(n,x)  do { if (ehcidebug(n)) printf x; } while (0)
 int ehcidebug = 0;
+int ehcicount = 0;
+int ehcicount_max = 10; /* panic - use ddb to gather more info before 
restarting */
 #define bitmask_snprintf(q,f,b,l) snprintf((b), (l), %b, (q), (f))
 #else
 #define DPRINTF(x)
@@ -808,12 +810,15 @@ ehci_idone(struct ehci_xfer *ex)
{
int s = splhigh();
if (ex-isdone) {
+   if ( ++ehcicount = ehcicount_max ) {
+   panic(ehci_idone: ex is done!\n);
+   }
splx(s);
 #ifdef EHCI_DEBUG
-   printf(ehci_idone: ex is done!\n   );
+   printf(ehci_idone: ex is done!ehcicount=%d\n   , 
ehcicount);
ehci_dump_exfer(ex);
 #else
-   printf(ehci_idone: ex=%p is done!\n, ex);
+   printf(ehci_idone: ex=%p is done!ehcicount=%d\n, ex, 
ehcicount);
 #endif
return;
}