Re: [linux-usb-devel] EHCI Regression in 2.6.23-rc2
Daniel Exner wrote: Jiri Kosina wrote: On Fri, 10 Aug 2007, Daniel Exner wrote: Please CC me, as I'm currently not subscribed to this list, thx. Please also don't forget to CC relevant people/lists when reporting bugs, thanks. Guess its ok, now? Thanks anyway :) After some serious hangs with 2.6.23-rc2 I did some bisects and this was the result: 196705c9bbc03540429b0f7cf9ee35c2f928a534 is first bad commit commit 196705c9bbc03540429b0f7cf9ee35c2f928a534 Author: [EMAIL PROTECTED] [EMAIL PROTECTED] Date: Thu May 3 08:58:49 2007 -0700 I guess that the patch attached to bug 8535 in kernel.org bugzilla -- http://bugzilla.kernel.org/attachment.cgi?id=12228action=view -- solves your issues, right? Nope, this does _not_ fix my issue. Anything else I could try, or some files you need? I tried finding some clue in my logs, but without any results so far. Greetings Daniel Exner It appears that the VIA controllers just ignore the inactivate bit completely. Normally, when I set the inactivate bit in the QH and then watch the QH overlay, I eventually see the controller clear the active bit in the overlay token, and, of course, it doesn't do the transaction. With the VIA controller I have, after I set the inactivate bit, I eventually see the controller set bit 1 in the overlay token (SplitXstate), indicating that it's running the transaction, and, a couple microframes later, it clears that bit again. The transaction is not inactivated. The problem occurs if a transaction completes when the inactivate bit is set... qh_completions will ignore the transaction until the inactivate bit is cleared, and then, when the transaction should be re-activated, my patch will set the active bit back to 1 in the overlay qtd token, even though the transaction was already completed by the controller... To work around this, I'd have to re-write my patch so that it didn't depend on the inactivate bit at all... I suppose it could possibly be done just by directly manipulating the active bit in the overlay token, since already the code doesn't mess with the overlay if there's any chance that the transaction is alrady cached or in progress, but that would be tricky. Perhaps for now the best thing would just be to bypass the EHCI CPU frequency notifier code (i.e., my patch) for VIA EHCI controllers, since they are broken. Would a hard-coded blacklist (just an if (manufacturer==VIA)... type thing) be OK? I've also acquired a card with an NEC EHCI controller on it, which I'm going to look at while I'm into it... Thanks Stuart - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/ ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] EHCI Regression in 2.6.23-rc2
Jiri Kosina wrote: On Fri, 10 Aug 2007, Daniel Exner wrote: After some serious hangs with 2.6.23-rc2 I did some bisects and this was the result: 196705c9bbc03540429b0f7cf9ee35c2f928a534 is first bad commit commit 196705c9bbc03540429b0f7cf9ee35c2f928a534 Author: [EMAIL PROTECTED] [EMAIL PROTECTED] Date: Thu May 3 08:58:49 2007 -0700 I guess that the patch attached to bug 8535 in kernel.org bugzilla -- http://bugzilla.kernel.org/attachment.cgi?id=12228action=view -- solves your issues, right? Stuart, did you submit this fix for upstream already please? Yes... http://marc.info/?l=linux-usb-develm=118598561010046w=2 However, I have not tested this with a VIA EHCI controller (though it's been tested with Intel, Broadcom, and nVidia). This patch uses the inactivate bit in the QH, which wasn't previously used by the linux kernel, and I found that the different vendors of EHCI controllers (Intel, Broadcom, nVidia) all handle this a little differently. There's probably something about the way VIA controllers respond to seeing this bit set that is breaking things. I'll try to get my hands on a VIA EHCI controller so I can look at this... if you happen to know of an add-in card that has one of these, please let me know! It would be a lot easier for me to debug this myself here than to try to get someone else to run test kernels for me... - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/ ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] [PATCH] bug with EHCI cpufreq patch on nVidia controllers
Greg KH wrote: On Tue, Jul 31, 2007 at 04:00:15PM -0700, Pete Zaitcev wrote: On Tue, 31 Jul 2007 17:00:04 -0500, [EMAIL PROTECTED] wrote: + list_for_each_safe (entry, tmp, qh-qtd_list) { + qtd = list_entry (entry, struct ehci_qtd, qtd_list); + if (cpu_to_le32 (qtd-qtd_dma) == qh-hw_current) + return qtd; + } Why use list_for_each_safe when mere list_for_each would do? In fact, some might ask for list_for_each_entry. Yes, I think Rusty has an automated email that checks for every new list_for_each that doesn't use list_for_each_entry, you don't want to hit that :) So could you change it to use this? I don't think that the safe version is needed here either, do you? thanks, greg k-h Yes, sorry, I'm not sure what I was thinking. Here's the changed patch along with the description (again). Thanks! This patch fixes a bug with the cpu frequency change notifier and nVidia EHCI controllers. The nVidia controllers write the transfer overlay back to the qtd when they see the inactivate bit set in the qh, which clears the active bit in qtd-hw_token. When the qh was reactivated, the active bit in the overlay's hw_token was turned back on, but not the one in the qtd. This caused qh_completions to think that qtd was finished even though it wasn't. This patch (against 2.6.23-rc1) fixes this bug. Signed-off-by: Stuart Hayes [EMAIL PROTECTED] ehci_cpufreq_nvidia_30july_c.patch Description: ehci_cpufreq_nvidia_30july_c.patch - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] USB-related instability on 2.6.23-rc1
Greg KH wrote: On Tue, Jul 31, 2007 at 03:06:07PM -0600, Grant Likely wrote: I've observed instability on my Macbook c2d with the Linus' latest tree. Symptom is that USB devices start behaving badly and the kernel seems to be registering incorrect HID events; (ie. moving the mouse causes odd keyboard events). If I continue to attempt to use the mouse, it results in a system freeze (and I don't see any kind of backtrace in /var/log/kern.log). To reproduce the problem, I just need to be using the keyboard/mouse. If I leave the system idle, the problem does not seem to occur. If I actively use the keyboard/mouse, the problem can occur at any time; usually within 5-10 minutes. In my particular case, I've got both a USB key/mouse pair on the laptop and an external key/mouse pair plugged in via a hub. Typically, it's the external mouse that exhibits badness first. USB HID devices are definitely affected. I have not tested USB block devices. Isochronous transfers to a USB headset appear to be unaffected, but that's under minimal testing. I've bisected down to commit 196705c9bbc03540429b0f7cf9ee35c2f928a534, USB: EHCI cpufreq fix Reverting the patch at the head of the tree makes the kernel stable again. I've attached my kernel config file and a dmesg output Can you enable CONFIG_USB_DEBUG and send the linux-usb-devel list and us, the output when you have these problems? thanks, greg k-h I'm actually in the process right now of attaching a patch for this problem (well, I hope it's a patch for the problem you're seeing) right now on the bugzilla (bug 8535 at bugzilla.kernel.org). If you'd like to test it, that would be great. Thanks, Stuart - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/ ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] [PATCH] bug with EHCI cpufreq patch on nVidia controllers
This patch fixes a bug with the cpu frequency change notifier and nVidia EHCI controllers. The nVidia controllers write the transfer overlay back to the qtd when they see the inactivate bit set in the qh, which clears the active bit in qtd-hw_token. When the qh was reactivated, the active bit in the overlay's hw_token was turned back on, but not the one in the qtd. This caused qh_completions to think that qtd was finished even though it wasn't. This patch (against 2.6.23-rc1) fixes this bug. Signed-off-by: Stuart Hayes [EMAIL PROTECTED] ehci_cpufreq_nvidia_30july_b.patch Description: ehci_cpufreq_nvidia_30july_b.patch - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] Bug in EHCI split-interrupt handling
I don't see how my patch could cause a transfer to return an actual_length of -4. If it is my patch causing this problem, I suspect it would be because the nVidia EHCI controller handles the inactivate bit in an unexpected (and probably out of spec) way--I was able to test with Broadcom Intel, but I didn't have an nVidia EHCI controller to test on. I guess it's possible that, when the inactivate bit is set, the nVidia EHCI controller sets the status to make it look like that transaction is finished, and maybe the inactivate bit was set early enough that actual_length never got initialized to anything and the -4 was just leftover in that memory space...? I suggest this without looking at the code--I don't know if that's actually possible. I'll try to locate a box with an nVidia EHCI controller today and see if I can reproduce this. Stuart -Original Message- From: Alan Stern [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 24, 2007 5:07 PM To: Pete Zaitcev Cc: Hayes, Stuart; USB development list Subject: Re: Bug in EHCI split-interrupt handling On Tue, 24 Jul 2007, Pete Zaitcev wrote: On Tue, 24 Jul 2007 15:44:23 -0400 (EDT), Alan Stern [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=8535 contains logs suggestive of problems with EHCI split-interrupt handling. See in particular comment #33; the usbmon log in comment #32 contains a line in which a low-speed interrupt URB returns with status equal to 0 and actual_length set to -4. Nicolas sent me a trace out of band which had a similar entry: 8100057f6100 1467276128 C Ii:2:004:1 0:8 -4 Obviously this should never happen. Can anybody offer tips on where to go searching through the driver code for a solution? I'm going to double-check that usbmon is not deceiving you. It sure looks totally bogus. Maybe I mixed up fields in a union somewhere. Maybe. Stuart Hayes has just added comment #42 in which he suggests this problem may be caused by the cpufreq adjustment patch he wrote. Stuart, could your patch cause a 4-byte low-speed interrupt transfer to return an actual_length of -4? Alan Stern - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/ ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] Bug in EHCI split-interrupt handling
OK, I'm seeing an issue that I'm pretty sure is the same thing... the keyboard is all kind of goofy (loses keys, repeats keys), then quits working, then the system locks up, only when my patch is enabled and I'm getting (faked) cpu frequency transitions. It definitely appears to be some incompatibilty between the nVidia EHCI controller and my patch. I'm debugging now. (And struggling to remember how this all works! I don't work with the USB code every day.) Thanks Stuart -Original Message- From: Hayes, Stuart Sent: Wednesday, July 25, 2007 1:50 PM To: 'Pete Zaitcev' Cc: [EMAIL PROTECTED]; linux-usb-devel@lists.sourceforge.net Subject: RE: Bug in EHCI split-interrupt handling I'm working on it, Pete. I've got a system with an nVidia EHCI controller (unfortunately it's an Intel box, not AMD, since the failing systems are AMD), and I'm working to reproduce the issue. I acknowledge that this issue is probably caused by this patch. I suspect that what's going on is that the nVidia EHCI controller isn't responding to the inactivate bit correctly. I think I will find the answer more quickly (and I'll be more certain that I've got the right answer) if I can work on a system that exhibits the problem. Thanks Stuart -Original Message- From: Pete Zaitcev [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 25, 2007 1:45 PM To: Hayes, Stuart Cc: [EMAIL PROTECTED]; linux-usb-devel@lists.sourceforge.net; [EMAIL PROTECTED] Subject: Re: Bug in EHCI split-interrupt handling On Wed, 25 Jul 2007 08:13:45 -0500, [EMAIL PROTECTED] wrote: [...] and maybe the inactivate bit was set early enough that actual_length never got initialized to anything and the -4 was just leftover in that memory space...? I suggest this without looking at the code--I don't know if that's actually possible. No, Stuart, this won't do. I need you to look at the code, because: a) I have explored the obvious avenues already, and b) We know that unsetting CONFIG_CPU_FREQ clears the issue. The initialization of actual_length is done unconditionally in usb_submit_urb, it was the first thing I checked! I have two vague hypotheses (-sii?) currently: #1 Somehow your patch conflicts with the insertion code which moves dummy qTD around. #2 The length in QH gets desynched from length in QTD, and we have a pice of code which takes the qh-hw_token and uses it for length calculation against qtd-length. -- Pete - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/ ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] Bug in EHCI split-interrupt handling
I'm working on it, Pete. I've got a system with an nVidia EHCI controller (unfortunately it's an Intel box, not AMD, since the failing systems are AMD), and I'm working to reproduce the issue. I acknowledge that this issue is probably caused by this patch. I suspect that what's going on is that the nVidia EHCI controller isn't responding to the inactivate bit correctly. I think I will find the answer more quickly (and I'll be more certain that I've got the right answer) if I can work on a system that exhibits the problem. Thanks Stuart -Original Message- From: Pete Zaitcev [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 25, 2007 1:45 PM To: Hayes, Stuart Cc: [EMAIL PROTECTED]; linux-usb-devel@lists.sourceforge.net; [EMAIL PROTECTED] Subject: Re: Bug in EHCI split-interrupt handling On Wed, 25 Jul 2007 08:13:45 -0500, [EMAIL PROTECTED] wrote: [...] and maybe the inactivate bit was set early enough that actual_length never got initialized to anything and the -4 was just leftover in that memory space...? I suggest this without looking at the code--I don't know if that's actually possible. No, Stuart, this won't do. I need you to look at the code, because: a) I have explored the obvious avenues already, and b) We know that unsetting CONFIG_CPU_FREQ clears the issue. The initialization of actual_length is done unconditionally in usb_submit_urb, it was the first thing I checked! I have two vague hypotheses (-sii?) currently: #1 Somehow your patch conflicts with the insertion code which moves dummy qTD around. #2 The length in QH gets desynched from length in QTD, and we have a pice of code which takes the qh-hw_token and uses it for length calculation against qtd-length. -- Pete - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/ ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] [PATCH] bug removing ehci-hcd
I wasn't actually able to reproduce the bug myself, but I guess it is pretty obvious that I shouldn't have called cpufreq_unregister_notifier with a spinlock held. I haven't been doing this long enough to know exactly which kernel this patch should be against, so let me know if this ins't good. Thanks! This patch (for the 2.6.21.3 kernel plus previously sent cpufreq notifier patch) fixes a bug caused by calling cpufreq_unregister_notifier (which can sleep) while holding a spinlock. Signed-off-by: Stuart Hayes [EMAIL PROTECTED] ehci_unregister_cpufreq_notifier.patch Description: ehci_unregister_cpufreq_notifier.patch - This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] [PATCH] ehci errors during cpu frequency changes
Hayes, Stuart wrote: EHCI controllers that don't cache enough microframes can get MMF errors when CPU frequency changes occur between the start and completion of split interrupt transactions, due to delays in reading main memory (caused by CPU cache snoop delays). This patch adds a cpufreq notifier to the EHCI driver that will inactivate split interrupt transactions during frequency transitions. It was tested on Intel ICH7 and Serverworks/Broadcom HT1000 EHCI controllers. Signed-off-by: Stuart Hayes [EMAIL PROTECTED] Does the lack of response mean that this patch won't be accepted? Is there anything I could do to make it more acceptable? There's definitely an issue that this addresses... if this patch isn't good for some reason I'll be happy to try another approach. The only other thing I can think of--other than setting the I bit--is to unlink all of the split interrupt QHs during CPU frequency transitions. Or... I could add a kernel parameter so that this patch wouldn't be enabled by default, so that only users who are seeing the problem could turn it on... that way this could get some wider testing without risk of everybody's USB subsystems failing mysteriously... Thanks Stuart - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] [PATCH] ehci errors during cpu frequency changes
EHCI controllers that don't cache enough microframes can get MMF errors when CPU frequency changes occur between the start and completion of split interrupt transactions, due to delays in reading main memory (caused by CPU cache snoop delays). This patch adds a cpufreq notifier to the EHCI driver that will inactivate split interrupt transactions during frequency transitions. It was tested on Intel ICH7 and Serverworks/Broadcom HT1000 EHCI controllers. Signed-off-by: Stuart Hayes [EMAIL PROTECTED] take9_i_bit_b.patch Description: take9_i_bit_b.patch - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] [PATCH] EHCI interrupt transactions failing during CPU frequency changes
Sorry for the delayed response on this... I've been working on it and other stuff at the same time. I found a couple bugs with my last patch, and have tested quite a bit on both Intel ServerWorks systems. Here's the new patch (very similar to the previous one I sent), and responses to your questions below. David Brownell wrote: On Wednesday 07 March 2007 1:10 pm, [EMAIL PROTECTED] wrote: OK, I've made an attempt to solve this issue without using such a large hammer. Instead of shutting off the entire periodic schedule during cpufreq changes, this patch will just inactivate (using the I bit in qh-hw_info1, currently unused in the EHCI driver) any queue heads qh-that are to full/low speed interrupt endpoints, during the cpufreq changes. Makes sense ... although I don't trust your safe_to_inactivate() logic, that's inherently racey. You could check to see if the transaction has started, and decide it didn't (and thus set the I bit) just when the controller starts that transaction. The race seems unavoidable safe_to_inactivate() shouldn't be racy, because I wait one full microframe after the transaction should have been started before checking QTD_STS_STS... the controller should write this bit during the microframe that the transaction was started. I did actually see problems (rarely) when I checked QTD_STS_STS in the microframe immediately after the transaction should have started, since I guess QTD_STS_STS sometimes gets updated right at the end of the microframe. I had way too much ... fun isn't quite the word ... chasing issues like that with fault handling in the async ring. Such issues might be less common with the periodic schedule; maybe. When the driver loses that race, this comes up as an MMF error right? Yes, that's what happened when I was checking QTD_STS_STS the uframe immedately after the transaction should have started. I have found that the ICH7 works without this patch, because it caches 7 uframes of the periodic schedule, so it never has to read main memory in order to complete a split transaction in time. The Broadcom/Serverworks HT1000 only caches 1 uframe, so it fails without this patch. I've always wondered how that caching is supposed to work. I suspect that you could make the ICH7 fail too, if the periodic schedule were busy enough ... it's not like they'll allocate enough silicon to cache every possible QH in SRAM inside that controller! That could mean a LOT of memory for caching. Right? Not to poke too much at all your good work here, but how would ICH7 fail in that case, given the logic in this patch? Good question... maybe the caching indicates how many microframes in advance the controller COULD cache the transactions--but it won't necessarily do so if there are too many of them. So the thing to do is to assume that the controller might have the QH cached like it claims (i.e., don't mess with the QH if we're within ehci-i_thresh of starting the transaction), but don't assume it will necessarily cache everything (i.e., register the cpufreq notifier even if i_thresh appears to be big enough that you wouldn't need it), just in case. I have tested this code on both the HT1000 and ICH7 (I tested the patch on the ICH7 with the cpufreq notifier registered, too, even though the patch won't even register it, just to make sure it would work correctly). How heavily were your periodic schedules loaded? The /sys/class/usb_host debug file labeled periodic should provide a textual display. It would be good to test both hardware configs with a lot of USB mice hooked up. Not that the current EHCI periodic scheduler handles very many, but ISTR that three full/low speed devices per TT was realistic. And high speed hubs with multiple TTs are easy enough to find, given that Cypress chip. At Dell, I expect you have one or two extra USB mice and keyboards lying around that could be used to try abusing your patch. :) I guess this isn't all that relevant if I enable the notifier even for the controllers that say that they'll cache 7 uframes, but I ran for 4 days with this load with no problems (I added line feeds, and deleted repeated QHs high-speed QHs, in the text below). (I ran this on both an AMD/ServerWorks HT1000 system, and an Intel ICH7 system.) 7: qh8-0e01/810001665480 (l5 ep1in [1/2] q1 p8) qh8-1c02/810001665780 (l7 ep1in [1/2] q1 p8) qh8-0e01/810001665d80 (l12 ep1in [1/2] q1 p8) qh8-0e01/810001665e40 (l12 ep2in [1/2] q1 p8) qh8-7008/810037fee6c0 (l5 ep2in [1/2] q1 p8) qh8-0e01/810037fee780 (l13 ep1in [1/2] q1 p4) 15: qh16-3804/810001665900 (f8 ep1in [1/2] q1 p8) qh16-3804/810001665b40 (l10 ep1in [1/2] q1 p8) qh16-7008/810037fee0c0 (f14 ep1in [1/2] q1 p8) 31: qh32-7008/810037fee180 (f14 ep2in [1/2] q1 p4) 127: qh128-0e01/810001665600 (f6 ep1in [1/2] q1 p1)
Re: [linux-usb-devel] [PATCH] EHCI interrupt transactions failing during CPU frequency changes
David Brownell wrote: On Wednesday 07 March 2007 1:10 pm, [EMAIL PROTECTED] wrote: OK, I've made an attempt to solve this issue without using such a large hammer. Instead of shutting off the entire periodic schedule during cpufreq changes, this patch will just inactivate (using the I bit in qh-hw_info1, currently unused in the EHCI driver) any queue heads qh-that are to full/low speed interrupt endpoints, during the cpufreq changes. Makes sense ... although I don't trust your safe_to_inactivate() logic, that's inherently racey. You could check to see if the transaction has started, and decide it didn't (and thus set the I bit) just when the controller starts that transaction. The race seems unavoidable I had way too much ... fun isn't quite the word ... chasing issues like that with fault handling in the async ring. Such issues might be less common with the periodic schedule; maybe. When the driver loses that race, this comes up as an MMF error right? I think I am avoiding a race with the controller here, by not checking QTD_STS_STS until the second uframe after the controller should have started the split transaction. If it HAD started the split transaction, the controller should have written QTD_STS_STS back at the end of the uframe in which it started it. Originally I was checking QTD_STS_STS the uframe immediately after it should have started the transaction, but there was a tiny race there, though it usually took 12 hours running on my system to show up. I'm also not setting the I bit in any uframes that might already be cached, or the uframe before that (i_thresh is set to the number of cached uframes + 2). So as long as the code can set the I bit within one uframe after deciding that it's safe to do so, I don't think there should be a race. And, since interrupts are masked, and the CPU isn't changing frequency, I would expect any system with an EHCI controller to be able to do some minor math and set a bit within 125us. I have found that the ICH7 works without this patch, because it caches 7 uframes of the periodic schedule, so it never has to read main memory in order to complete a split transaction in time. The Broadcom/Serverworks HT1000 only caches 1 uframe, so it fails without this patch. I've always wondered how that caching is supposed to work. I suspect that you could make the ICH7 fail too, if the periodic schedule were busy enough ... it's not like they'll allocate enough silicon to cache every possible QH in SRAM inside that controller! That could mean a LOT of memory for caching. Right? Not to poke too much at all your good work here, but how would ICH7 fail in that case, given the logic in this patch? Well, right now it would fail just like 2.6.20, because the patch won't even register the cpufreq notifier if the EHCI controller says it caches 7 or more uframes! :) That is, it would get an MMF error and return -EPROTO. But, if we did register the cpufreq notifier even for the ICH7, I would expect it not to fail regardless of whether it was caching or not, since any split interrupt transactions would be inactivated before any CPU frequency changes. When the controller is caching 7 uframes, there's a smaller window during which safe_to_inactivate() will return 1, but there is still a window, unless you have a device with a polling period of 1 frame (in which case the code will just leave the transaction active, and we'll probably still get an MMF error). I have tested this code on both the HT1000 and ICH7 (I tested the patch on the ICH7 with the cpufreq notifier registered, too, even though the patch won't even register it, just to make sure it would work correctly). How heavily were your periodic schedules loaded? The /sys/class/usb_host debug file labeled periodic should provide a textual display. It would be good to test both hardware configs with a lot of USB mice hooked up. Not that the current EHCI periodic scheduler handles very many, but ISTR that three full/low speed devices per TT was realistic. And high speed hubs with multiple TTs are easy enough to find, given that Cypress chip. At Dell, I expect you have one or two extra USB mice and keyboards lying around that could be used to try abusing your patch. :) Yes, I'll load these systems up with lots of USB keyboards and mice, and re-run. I was able to see failures pretty often with no patch, or with a patch that wasn't quite working, though, with the configs I have--I've been testing with 3-4 keyboards/mice behind hub(s). This patch adds more code, but it avoids a couple of the issues that were brought up with the previous patch: 1--This patch doesn't mess with isochronous transfers (or high speed interrupt transfers) at all. That seems to be a good resolution in both cases. ISO transfers are best effort in any case, and you hadn't identified issues except with full/low speed interrupt
Re: [linux-usb-devel] [PATCH] EHCI interrupt transactions failing during CPU frequency changes
David Brownell wrote: On Tuesday 20 February 2007 12:27 pm, [EMAIL PROTECTED] wrote: This patch turns off periodic list processing in the EHCI controller for the duration of processor frequency changes. Did you test this with any kind of isochronous device active, like a set of USB speakers playing? ISO transfers work differently than the interrupt transfers you tested (with a HID device). Also ... what kind of hardware did you test this with? I wonder how much of the (potential) 1.125 msec it was spinning while waiting for the periodic schedule to actually turn off. That seems like it should be modifying the latency metrics used by cpufreq... is there even a feedback mechanism whereby the system can say that _right now_ it would take this much more time? Last time I remember looking at cpufreq, it had static metrics basically relating only to the CPUs; so costs related to components like EHCI were hidden. (Much less costs associated with re-clocking other peripherals, which is a big issue with embedded SOC chips!) The patch looks mostly OK, but the issue with isochronous transfers is a difference in how transfer completion is handled in the hardware. Rather than scanning a queue head node in the schedule tree, which will be re-scanned after the periodic list is re-activated. Instead, isochronous transfers use an entirely different hardware mechanism. Their ITDs go before that schedule tree, and won't get re-scanned after the relevant frame passes. I'd not be sure that the current iso scanning logic would behave correctly with this kind of on/off mechanism. In fact I'd kind of expect it to break. That's in addition to the issue that it might not be a Good Thing to let cpufreq create audio (or video etc) dropouts ... Maybe the best solution to this issue would be to reject the cpufreq change if ISO transfers are active on EHCI. - Dave OK, I've made an attempt to solve this issue without using such a large hammer. Instead of shutting off the entire periodic schedule during cpufreq changes, this patch will just inactivate (using the I bit in qh-hw_info1, currently unused in the EHCI driver) any queue heads that are to full/low speed interrupt endpoints, during the cpufreq changes. I have found that the ICH7 works without this patch, because it caches 7 uframes of the periodic schedule, so it never has to read main memory in order to complete a split transaction in time. The Broadcom/Serverworks HT1000 only caches 1 uframe, so it fails without this patch. I have tested this code on both the HT1000 and ICH7 (I tested the patch on the ICH7 with the cpufreq notifier registered, too, even though the patch won't even register it, just to make sure it would work correctly). This patch adds more code, but it avoids a couple of the issues that were brought up with the previous patch: 1--This patch doesn't mess with isochronous transfers (or high speed interrupt transfers) at all. 2--The notifier will take very little time unless the cpufreq change happens while a split transaction is in progress (or is already cached in the controller). I can repost it with a more concise summarization of the problem/fix and a signed-off-by if it looks ok. Thanks! Stuart take7_i_bit.patch Description: take7_i_bit.patch - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] [PATCH] EHCI interrupt transactions failing during CPU frequency changes
David Brownell wrote: On Tuesday 20 February 2007 12:27 pm, [EMAIL PROTECTED] wrote: This patch turns off periodic list processing in the EHCI controller for the duration of processor frequency changes. Did you test this with any kind of isochronous device active, like a set of USB speakers playing? ISO transfers work differently than the interrupt transfers you tested (with a HID device). Also ... what kind of hardware did you test this with? I wonder how much of the (potential) 1.125 msec it was spinning while waiting for the periodic schedule to actually turn off. That seems like it should be modifying the latency metrics used by cpufreq... is there even a feedback mechanism whereby the system can say that _right now_ it would take this much more time? Last time I remember looking at cpufreq, it had static metrics basically relating only to the CPUs; so costs related to components like EHCI were hidden. (Much less costs associated with re-clocking other peripherals, which is a big issue with embedded SOC chips!) No--I'll try that (with without the patch). I also only tried this on a Serverworks EHCI controller with AMD processors... I'll test an Intel system, too, and see if I see the MMF errors on it without this patch. The patch looks mostly OK, but the issue with isochronous transfers is a difference in how transfer completion is handled in the hardware. Rather than scanning a queue head node in the schedule tree, which will be re-scanned after the periodic list is re-activated. Instead, isochronous transfers use an entirely different hardware mechanism. Their ITDs go before that schedule tree, and won't get re-scanned after the relevant frame passes. I'd not be sure that the current iso scanning logic would behave correctly with this kind of on/off mechanism. In fact I'd kind of expect it to break. That's in addition to the issue that it might not be a Good Thing to let cpufreq create audio (or video etc) dropouts ... Maybe the best solution to this issue would be to reject the cpufreq change if ISO transfers are active on EHCI. - Dave You might get audio/video dropouts during cpu frequency changes without or with this patch. But... if cpufreq changes are rejected when ISO transfers are active, wouldn't that have the potential to lock the processors into a low-frquency state, if an isochronous device starts up while the CPUs are otherwise idle? How would the user know that his performance is crap because of his isochronous device(s)? - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] [PATCH] EHCI interrupt transactions failing during CPU frequency changes
Many interrupt transactions to low/full speed devices are getting QTD_STS_MMF from the EHCI controller because the split transaction isn't completed in time. This happens when the EHCI controller tries to read main memory during a CPU frequency change, and it can't get the data back in time, presumably due to the processor not being able to complete a cache snoop in a timely manner when changing frequency. This results in ehci_hcd returning -EPROTO to drivers like hid-core, and results in lost keystrokes, as well as messages like usb 3-3.1: reset low speed USB device using ehci_hcd and address 7. This could presumably also affect non-HID devices. This patch turns off periodic list processing in the EHCI controller for the duration of processor frequency changes. Signed-off-by: Stuart Hayes [EMAIL PROTECTED] ehci_cpufreq_2.patch Description: ehci_cpufreq_2.patch - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] errors with split transactions powernow
Greg KH wrote: On Wed, Feb 14, 2007 at 04:49:11PM -0600, [EMAIL PROTECTED] wrote: Hello... While running minor stress on a system, I'm seeing messages like this quite frequently: What kernel version are you using? 2.6.20 It seems to me that ehci-hcd itself should perhaps retry a transaction immediately--at least a couple of times--if it gets QTD_STS_MMF. If that is wrong for some reason, it might be nice if there was some way for hid-core (et al?) to know what's going on so it could opt to resubmit the URB right away--or at least not reset the device so hastily--if this is the cause of the URB failure... -EPROTO covers several other errors, too. Hm, care to make up a patch that would test this out? Now that I've thought about it, I don't think that it makes sense for ehci-hcd to automatically reactivate the failed TD... because, when we get QTD_STS_MMF, it means that the transaction was already completed to the device on the low/full speed bus. If the broken split transaction was reading/writing data from a hard drive, and we just re-issue the read/write TD, I think we'll end up with data corruption... we'd need the usb-storage or scsi driver to realize that something went wrong and redo the whole CDB. But, even worse than that, other types of devices might lose data irretrievably. For example, if the broken split transaction had just read a keypress from a keyboard, and the TT threw away the keypress because we didn't complete the transaction in time, we couldn't get it back at all. (I've actually verified that my typing gets corrupted if I type for a few minutes during a kernel compile, though I'm sure most of the broken split transactions got a NAK from the keyboard.) So I think ideally linux needs some way of coordinating ehci-hcd and powernow, so that the EHCI controller can be stopped temporarily while the processors are changing speed... I'm going to try to hack something to do that, just to verify that it will fix the problem, but I'm not sure what the right way to implement that would be. (Or maybe it isn't worth implementing that, if everything non-HID could recover from this by re-doing stuff... I'm not familiar with all the USB device types.) The controller is getting quite a few QTD_STS_MMF errors, though--several per minute while compiling a kernel on my system with 4 HID devices behind USB 2.0 hubs. (I put in a printk so I could see every time qtd_copy_status() sees QTD_STS_MMF.) Stuart - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] errors with split transactions powernow
Hello... While running minor stress on a system, I'm seeing messages like this quite frequently: usb 3-1.1: reset full speed USB device using ehci_hcd and address 5 These messages are occurring for all 4 of my low/full speed HID devices that are connected to a USB2.0 hub (so all are using split transactions on an interrupt pipe). This is occurring on an AMD system, and the problem only occurs when the processors are changing speed (powernow/cpufreq must be enabled). I added a few printks, and the failures are all because the EHCI controller is setting QTD_STS_MMF (missed micro-frame), which means that the EHCI controller started a split transaction but wasn't able to complete it in time. I think this is happening because the EHCI controller is getting delayed trying reading main memory while the processors are changing speed (this is where the cpufreq/powernow stuff comes in). If this happens between the start complete parts of a split transaction, the EHCI controller can't complete the split transaction in time, and it sets QTD_STS_MMF, causing ehci-hcd to return -EPROTO to the hid-core driver. The function hid_io_error() in hid-core will in theory retry this transaction every so often for up to a second, but, in reality, the transaction isn't getting retried at all before the second is up, because the system is loaded fairly heavily and the code that retries the transaction is called by a timer and not run immediately in the interrupt handler. Could anyone recommend a good way to fix this? It seems to me that ehci-hcd itself should perhaps retry a transaction immediately--at least a couple of times--if it gets QTD_STS_MMF. If that is wrong for some reason, it might be nice if there was some way for hid-core (et al?) to know what's going on so it could opt to resubmit the URB right away--or at least not reset the device so hastily--if this is the cause of the URB failure... -EPROTO covers several other errors, too. Thanks in advance for any advice! Stuart - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] ehci HCRESET
I notice that ehci-hcd.c only waits up to 25ms after setting CMD_RESET in the EHCI command register before timing out and giving up. Is that 25ms based on observation, or is it actually specified somewhere? I couldn't find anything in the EHCI 1.0 spec that says how long a controller can take to finish a reset, and I've got a controller that seems to be taking a little bit longer than that. Thanks! Stuart --- SF email is sponsored by - The IT Product Guide Read honest candid reviews on hundreds of IT Products from real users. Discover which products truly live up to the hype. Start reading now. http://ads.osdn.com/?ad_ide95alloc_id396op=click ___ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] URB,USB requests and CBW/CSW
I'm not the world's top authority on the subject, but my understanding is that requests on endpoint 0 will either be standard device requests (such as reading basic information about the device), or device specific requests (things such as SET_IDLE for keyboard/mouse devices, which can be used to them to not to send any reports unless something has changed, or GET_MAX_LUN for usb bulk only transport storage devices). CBWs and CSWs are part of the USB mass storage bulk only transport device spec, and are transferred on the bulk pipe (not endpoint 0), and are used to actually transfer data to and from the mass storage device, rather than to configure the device. Hope that helps (and is correct), Stuart -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hou Xiang ZHU Sent: Wednesday, November 10, 2004 11:00 PM To: [EMAIL PROTECTED] Subject: [linux-usb-devel] URB,USB requests and CBW/CSW hello, A simple question related to linux usb concept here: I am reading the USB Spec2.0 , Mass Storage Spec. as well as linux usb code. I am little confused by some concept concerning URB,USB requests and CBW/CSW. In my understanding, each time host want to send request or data to usb device, it has to use usb_submit_urb(), The USB request has standard device request and Class-specific request, for standard request, it is used to get general USB device info. but when it is going to use class-specific request? what is the relationship between CBW/CSW and Class-specific request? so a URB contains either USB standard request or class-specific request or CBW? how about actual data(for example when writing a file to USB stick)? Thanks, Houxiang Zhu --- This SF.Net email is sponsored by: Sybase ASE Linux Express Edition - download now for FREE LinuxWorld Reader's Choice Award Winner for best database on Linux. http://ads.osdn.com/?ad_idU88alloc_id065op=ick ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel --- This SF.Net email is sponsored by: Sybase ASE Linux Express Edition - download now for FREE LinuxWorld Reader's Choice Award Winner for best database on Linux. http://ads.osdn.com/?ad_idU88alloc_id065op=click ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
Re: [linux-usb-devel] ep0 timeout followed by device removal, ep0
is in uninterruptible sleep state. I can reproduce the failure by cat'ing /proc/bus/usb/devices while performing a mount on the usbdrive. Below is a This problem sounds a lot like one that I encountered a few months ago with a Teac USB CD-ROM drive--the drive couldn't handle being accessed through ep0 (like it is when you cat /proc/bus/usb/devices) at the same time that it is getting storage requests through the other endpoints--especially if one of the storage commands required clearing a stalled pipe, which also has to be done on ep0. I submitted a patch (in two parts) that uses an extra semaphore to keep these accesses from overlapping, and Pete incorporated at least some of these patches into the Red Hat RHEL3 kernel. However, I don't think this was accepted into the mainstream kernel. That might be worth trying. I don't think it was very difficult to retool this patch to work with the 2.6 kernel (I'm sorry, I wasn't following this issue from the start, and I don't know which kernel you are using). Here are the patches (again, these are for the 2.4 kernel). --Stuart diff -BurN linux-2.4.21-11.EL/drivers/usb/devio.c linux-2.4.21-11-usbfix2-temp.EL/drivers/usb/devio.c --- linux-2.4.21-11.EL/drivers/usb/devio.c 2004-03-08 22:15:09.0 -0600 +++ linux-2.4.21-11-usbfix2-temp.EL/drivers/usb/devio.c 2004-05-03 09:57:22.0 -0500 @@ -1171,6 +1171,12 @@ up_read(ps-devsem); return -ENODEV; } + /* +* grab device's exclusive_access mutex to prevent driver from +* using this device while it is being accessed by us +*/ + down ((ps-dev-exclusive_access)); + switch (cmd) { case USBDEVFS_CONTROL: ret = proc_control(ps, (void *)arg); @@ -1250,6 +1256,7 @@ ret = proc_ioctl(ps, (void *) arg); break; } + up ((ps-dev-exclusive_access)); up_read(ps-devsem); if (ret = 0) inode-i_atime = CURRENT_TIME; diff -BurN linux-2.4.21-11.EL/drivers/usb/storage/transport.c linux-2.4.21-11-usbfix2-temp.EL/drivers/usb/storage/transport.c --- linux-2.4.21-11.EL/drivers/usb/storage/transport.c 2004-03-08 22:15:15.0 -0600 +++ linux-2.4.21-11-usbfix2-temp.EL/drivers/usb/storage/transport.c 2004-05-03 09:57:23.0 -0500 @@ -627,8 +627,17 @@ int need_auto_sense; int result; + /* +* Grab device's exclusive_access mutex to prevent /usbfs from +* sending out a command in the middle of ours (if libusb sends a +* get_descriptor or something on pipe 0 after our CBW and before +* our CSW, and then we get a stall, we have trouble) +*/ + down ((us-pusb_dev-exclusive_access)); + /* send the command to the transport layer */ result = us-transport(srb, us); + up ((us-pusb_dev-exclusive_access)); /* if the command gets aborted by the higher layers, we need to * short-circuit all other processing @@ -748,7 +757,9 @@ srb-use_sg = 0; /* issue the auto-sense command */ + down ((us-pusb_dev-exclusive_access)); temp_result = us-transport(us-srb, us); + up ((us-pusb_dev-exclusive_access)); /* let's clean up right away */ srb-request_buffer = old_request_buffer; diff -BurN linux-2.4.21-11.EL/drivers/usb/usb.c linux-2.4.21-11-usbfix2-temp.EL/drivers/usb/usb.c --- linux-2.4.21-11.EL/drivers/usb/usb.c2004-03-08 22:15:09.0 -0600 +++ linux-2.4.21-11-usbfix2-temp.EL/drivers/usb/usb.c 2004-05-03 09:57:23.0 -0500 @@ -997,6 +997,7 @@ INIT_LIST_HEAD(dev-filelist); init_MUTEX(dev-serialize); + init_MUTEX(dev-exclusive_access); dev-bus-op-allocate(dev); diff -BurN linux-2.4.21-11.EL/include/linux/usb.h linux-2.4.21-11-usbfix2-temp.EL/include/linux/usb.h --- linux-2.4.21-11.EL/include/linux/usb.h 2004-03-08 22:15:40.0 -0600 +++ linux-2.4.21-11-usbfix2-temp.EL/include/linux/usb.h 2004-05-03 09:57:40.0 -0500 @@ -828,6 +828,8 @@ atomic_t refcnt;/* Reference count */ struct semaphore serialize; + struct semaphore exclusive_access; /* prevent driver proc accesses + from overlapping cmds on bus */ unsigned int toggle[2]; /* one bit for each endpoint ([0] = IN, [1] = OUT) */ unsigned int halted[2]; /* endpoint halts; one bit per endpoint # direction; */ diff -urN linux-2.4.21-15.18.EL/drivers/usb/devices.c linux-2.4.21-15.18.ELusb/drivers/usb/devices.c --- linux-2.4.21-15.18.EL/drivers/usb/devices.c 2002-11-28 17:53:14.0 -0600 +++ linux-2.4.21-15.18.ELusb/drivers/usb/devices.c 2004-07-16 10:08:24.0 -0500 @@ -387,22 +387,32 @@ if (start end) return start; - + + /* +
RE: [linux-usb-devel] PATCH: (as357) Set QH bit in UHCI framelist entries
Pete Zaitcev wrote: On Wed, 11 Aug 2004 11:42:48 -0400 (EDT) Alan Stern [EMAIL PROTECTED] wrote: This patch fixes the error in the UHCI driver found by Stuart Hayes. It adds the UHCI_PTR_QH bit into the initial entries stored in the hardware framelist. It's not entirely clear how the driver ever managed to work with these bits not set; apparently by coincidence the QH entries resembled TD entries sufficiently closely to fool the hardware. /* Only place we don't use the frame list routines */ -uhci-fl-frame[i] = cpu_to_le32(uhci-skelqh[irq]-dma_handle); +uhci-fl-frame[i] = UHCI_PTR_QH | + cpu_to_le32(uhci-skelqh[irq]-dma_handle); What about uhci_insert_td_frame_list() and uhci_remove_td() ? -- Pete My two cents worth (which is really not worth much more than that): This patch only affects the initial entries in the framelist, which are all QHs. uhci_insert_td_frame_list() and uhci_remove_td() don't need to add the UHCI_PTR_QH bit since they are putting a link to a TD (not a QH) into the frame list. These routines don't delete this bit in the original QH entry. --- SF.Net email is sponsored by Shop4tech.com-Lowest price on Blank Media 100pk Sonic DVD-R 4x for only $29 -100pk Sonic DVD+R for only $33 Save 50% off Retail on Ink Toner - Free Shipping and Free Gift. http://www.shop4tech.com/z/Inkjet_Cartridges/9_108_r285 ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] uhci host controller process error
I have several systems that are failing with the messages below when the uhci-hcd module is loaded. This only seems to happen when the QHs and/or TDs (I haven't narrowed this down yet) are at a physical address above roughly 2.5GB. If I limit the memory with a mem=2500m command line, or if I modify the driver to ensure that all of the QHs and TDs are mapped below that address, uhci-hcd loads fine. I am seeing this problem with a Red Hat kernel based on a 2.6.5 kernel, but I assume this would happen on a non-Red Hat kernel as well. Since the host controller process error is supposed to indicate that the host controller thinks there's something wrong with the TD, I'm guessing that maybe the HC can't access memory at certain regions, but I've checked the chipset registers, and I don't see anything. Also, I'm seeing this on two completely different systems--one with an ICH3, and one with an ICH5. The error is seen on the first IRQ that comes back after the host controller is started, and there are no USB devices plugged in. Any idea what might be causing this? Is anyone else seeing this? -- May 25 11:54:20 discovery kernel: uhci_hcd :00:1d.0: new USB bus registered, assigned bus number 1 May 25 11:54:20 discovery kernel: uhci_hcd :00:1d.0: host controller process error, something bad happened! May 25 11:54:20 discovery: uhci_hcd :00:1d.0: host controller halted, very bad! -- Thanks Stuart --- SF.Net email is sponsored by Shop4tech.com-Lowest price on Blank Media 100pk Sonic DVD-R 4x for only $29 -100pk Sonic DVD+R for only $33 Save 50% off Retail on Ink Toner - Free Shipping and Free Gift. http://www.shop4tech.com/z/Inkjet_Cartridges/9_108_r285 ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] RE: uhci host controller process error
Any idea what might be causing this? Is anyone else seeing this? -- May 25 11:54:20 discovery kernel: uhci_hcd :00:1d.0: new USB bus registered, assigned bus number 1 May 25 11:54:20 discovery kernel: uhci_hcd :00:1d.0: host controller process error, something bad happened! May 25 11:54:20 discovery: uhci_hcd :00:1d.0: host controller halted, very bad! -- Never mind, I figured it out. It looks like the uhci-hcd driver doesn't add a | UHCI_PTR_QH to the pointers that it puts in the frame list. This causes the ICH to think that the frame list is pointing to a bunch of TDs instead of QHs for purposes of checking for TD errors. I can only assume that the ICH is actually treating the frame list entries as QH pointers in spite of that bit not being set when it is actually executing the schedule, or else I don't think it would work generally. I guess the high addresses were just making the QH look like an invalid TD instead of a valid TD... not sure exactly what the ICH is checking for! --- SF.Net email is sponsored by Shop4tech.com-Lowest price on Blank Media 100pk Sonic DVD-R 4x for only $29 -100pk Sonic DVD+R for only $33 Save 50% off Retail on Ink Toner - Free Shipping and Free Gift. http://www.shop4tech.com/z/Inkjet_Cartridges/9_108_r285 ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] [PATCH] proper bios handoff in ehci-hcd
Will Beers wrote: though maybe 500 msec is too short a period to wait. See if 5000 msec helps. I went all the way up to 2 msec and it still didn't help. I'm sure it's a bad idea, but removing that whole if-block below it makes it work (which is effectively what switching the and/or did). I don't know enough about it to judge whether it's correct, but what exactly is it checking for there? -Will Without the patch, Linux would just ignore the BIOS handoff--Linux was writing 0 to the bit that it was supposed to wait for the BIOS to clear, so it never waited for the BIOS to let go of the controller. I bet you have a bad BIOS that won't hand off, but I would try the other thing David suggested--change the write to a byte write. It seems unlikely, but, since Linux is writing a 1 to the BIOS owns the controller bit right now, you might be hitting something like this, if the system is breaking up the write into multiple smaller writes: the OS wants the controller bit is getting written to 1 (first part of the Linux write, which the system broke into pieces) the system BIOS (SMI handler) sees that bit set to 1, and clears the BIOS owns bit the BIOS owns bit is getting written back to a 1 (the second part of the Linux write) Linux waits in vain for BIOS to clear the BIOS owns bit\ Again, seems unlikely, but worth a try if you're recompiling and testing. --- This SF.Net email sponsored by Black Hat Briefings Training. Attend Black Hat Briefings Training, Las Vegas July 24-29 - digital self defense, top technical experts, no vendor pitches, unmatched networking opportunities. Visit www.blackhat.com ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] Re: [usb-storage] Re: Clear_feature doesn't happen when get_descriptor on pipe
David Brownell wrote: Wouldn't a plain old atomic BUSY bitflag work better? If it's set, usbcore would reject urbs with -EAGAIN ... from all contexts, including the many that can't acquire semaphores! I thought (and I might be confused with other OSes here) that there was a way to attempt to acquire a semaphore from an 'unfriendly' context -- either you got the semaphore or you got an indication that you can't get it. As in down_trylock()? Thing is, one can't necessarily know whether the current context is friendly or not. So one API design strategy thus starts out assuming none of them are, and then sees how well that applies system-wide. Hence a bit flag. But wouldn't a bit flag require continuous retries on the part of whomever gets the -EAGAIN when they try to submit an URB? And, if so, wouldn't that leave open the possibility of them missing the window when the BUSY bit isn't set (unless they retry in a very tight loop, in which case they could waste a lot of processor time)? The problem is in figuring out which devices need this treatment. Since the only way to discover this is by crashing the device, it's not very friendly. But at that point you'll know, and can add it to the list. It's also not friendly to re-define USB so that it won't work well with devices that don't have this rare bug ... :) - Dave I don't understand why this would make USB not work well with devices that don't have this problem. I would think grabbing a semaphore would be extremely quick unless someone else already has it, which should be a very rare occurrance. --- This SF.Net email is sponsored by Sleepycat Software Learn developer strategies Cisco, Motorola, Ericsson Lucent use to deliver higher performing products faster, at low TCO. http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3 ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] Re: [usb-storage] Re: Clear_feature doesn't happen when get_descriptor on pipe
David Brownell wrote: Wouldn't a plain old atomic BUSY bitflag work better? If it's set, usbcore would reject urbs with -EAGAIN ... from all contexts, including the many that can't acquire semaphores! Going back a bit--what context were you thinking of that wouldn't be able to grab the semaphore--are you talking about a userspace program that talks to a USB device URB-by-URB, instead of using a driver like usb-storage? --- This SF.Net email is sponsored by Sleepycat Software Learn developer strategies Cisco, Motorola, Ericsson Lucent use to deliver higher performing products faster, at low TCO. http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3 ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] RE: [usb-storage] Re: Clear_feature doesn't happen when get_descriptor on pipe
Matthew Dharm wrote: On Fri, Apr 30, 2004 at 06:32:14PM -0700, Pete Zaitcev wrote: On Fri, 30 Apr 2004 16:34:48 -0500 [EMAIL PROTECTED] wrote: Here's the patch (for 2.4) if you're curious. Do you want me to include all relevant transport types, or you want to do it yourself? Or the lock could be moved to where the transport is invoked, instead of putting it in every transport. Matt I guess that would be easier, assuming the only path to the transport (usb_stor_Bulk_transport(), etc.) is through usb_stor_invoke_transport(), which appears to be the case. I'll try it. Thanks Stuart --- This SF.Net email is sponsored by: Oracle 10g Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE. http://ads.osdn.com/?ad_id149alloc_id66op=click ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] Clear_feature doesn't happen when get_descriptor on pipe
My apologies--my bugzilla account doesn't have the right permissions. It's not that I'm hiding... I'm the one who made submitted the patch, and I'm the one who posted the message about Robert having found an easy way to reproduce the issue. --Stuart Hayes [EMAIL PROTECTED] -Original Message- From: Pete Zaitcev [mailto:[EMAIL PROTECTED] Sent: Thursday, April 29, 2004 11:59 PM To: Johannes Erdfelt Cc: [EMAIL PROTECTED]; Hayes, Stuart; [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [linux-usb-devel] Clear_feature doesn't happen when get_descriptor on pipe On Thu, 29 Apr 2004 21:24:27 -0700 Johannes Erdfelt [EMAIL PROTECTED] wrote: Although, I'm curious what other thread is sending a GET_DESCRIPTOR query that usb-storage is conflicting with. Was this a rare occurance or is this repeatable? Dell people (who for some unimaginable reason prefer to hide behind a corporate alias, so I cannot tell if it's Gary, Stuart, or other) describe it this way: Oh, I forgot to mention--Robert figured out how to reproduce this easily: do a while true; do updfstab; done on one console, which results in continuous accesses of the USB CD-ROM through the SCSI layer. Then, on a second console, do a lsusb -v, which will result in a lot of GET_DESCRIPTOR commands being sent to pipe 0 through the /proc filesystem. You should immediately see the failure. -- Pete --- This SF.Net email is sponsored by: Oracle 10g Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE. http://ads.osdn.com/?ad_id149alloc_id66op=click ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] RE: Clear_feature doesn't happen when get_descriptor on pipe
Pete Zaitcev wrote: On Wed, 28 Apr 2004 15:09:39 -0400 (EDT) Alan Stern [EMAIL PROTECTED] wrote: Perhaps I could keep the proc_ioctls from running until the device isn't being used by usb-storage, That sounds feasible. It is feasible, but it runs into difficulties with locking order. I tried to fiddle with the patch Stuart dropped into RH Bugzilla. I do not have a confidence in it, and unfortunately, the 2.4 development is closed. I do not want to push it to Marcelo and collect failure reports at this stage of development. or maybe just make the transport layer of usb-storage retry the clear_feature command for a while if the pipe already has an URB going, instead of simply failing and moving on to the next step without clearing the stall...? You could do that. Check for -ENXIO error codes and retry in a short loop with usb_stor_control_msg(). Stuart, why don't we do something like the attached? Please back out your patch with semaphores and give it a try, tell me how it went. Busy waits are bad, I know. HOWEVER, this particular thing is clearly local in its scope. THEREFORE, if it solves your problem, I can ship it with Red Hat and even push to Marcelo. Critique and testing are appreciated. -- Pete diff -urN -X dontdiff linux-2.4.26/drivers/usb/storage/transport.c linux-2.4.26-nip/drivers/usb/storage/transport.c --- linux-2.4.26/drivers/usb/storage/transport.c 2004-02-26 14:09:59.0 -0800 +++ linux-2.4.26-nip/drivers/usb/storage/transport.c 2004-04-30 11:07:18.0 -0700 @@ -471,6 +471,7 @@ int result; int endp = usb_pipeendpoint(pipe) | (usb_pipein(pipe) 7); +retry_devio: result = usb_stor_control_msg(us, usb_sndctrlpipe(us-pusb_dev, 0), USB_REQ_CLEAR_FEATURE, USB_RECIP_ENDPOINT, 0, @@ -478,8 +479,21 @@ US_DEBUGP(usb_stor_clear_halt: result=%d\n, result); /* this is a failure case */ - if (result 0) + if (result 0) { + if (result == -ENXIO) { + /* + * usb-uhci cannot queue. If lsusb ran on this device + * (by a startup script or something), we lose. + * But we cannot afford for halts not to clear + * for a reason so silly, so we persist rudely. + */ + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(2); + set_current_state(TASK_RUNNING); + goto retry_devio; + } return result; + } /* reset the toggles and endpoint flags */ usb_endpoint_running(us-pusb_dev, usb_pipeendpoint(pipe), I haven't had any trouble with the patch I submitted. I rebooted the system repeatedly overnight with 2 empty CD-ROMs (lots of stalls--this is the config that would lock up almost every boot before), and I also ran updfstab lsusb simultaneously for a while. I tried your patch (I actually had tried my own version of this a day or two ago), and it does not work. It does make it work a little bit longer, but it eventually fails. I'm not sure exactly why it fails. What I see on the USB bus when the failure occurs is this: READ_CAPACITY CBW on endpoint 2 GET_DESCRIPTOR issued on endpoint 0 READ_CAPACITY GET_DESCRIPTOR attempt data reads on endpoints 0 1 (NAKed) endpoint 1 stalls GET_DESCRIPTOR continues to attempt data reads on endpoint 0 until it times out (100-ish ms later) CLEAR_FEATURE is issued on endpoint 0 0 length data is read on endpoint 0 to ACK the CLEAR_FEATURE GET_DESCRIPTOR is re-issued on endpoint 0 (I believe lsusb retries this when it fails, not the driver) the CSW from the GET_DESCRIPTOR is read from endpoint 1 (it failed of course) the data for the GET_DESCRIPTOR is read from endpoint 0 0 length data is written to endpoint 0 to ACK the GET_DESCRIPTOR data a new GET_DESCRIPTOR is issued on endpoint 0 a REQUEST_SENSE CBW goes out on endpoint 2 the data for the GET_DESCRIPTOR is read from endpoint 0 0 length data is written to endpoint 0 to ACK the GET_DESCRIPTOR data the sense data (the data for the REQUEST_SENSE) is read from endpoint 1 a new GET_DESCRIPTOR is issued on endpoint 0 the data for the GET_DESCRIPTOR is read from endpoint 0 0 length data is written to endpoint 0 to ACK the GET_DESCRIPTOR data the CSW for the REQUEST_SENSE command is read from endpoint 1 -- *it has status=04, which is reserved* a BULK_ONLY_MASS_STORAGE_RESET is issued on endpoint 0 (because the status was invalid) 0 length data is read from endpoint 0 to ACK the reset a new GET_DESCRIPTOR is issued on endpoint 0, which is NAKed until it times out 100ms later (this happens many times in a row) (Many NAKs are omitted for brevity. I can send you the USB trace if you want it.) So, it looks like several things aren't working right: First, I would *expect* the GET_DESCRIPTOR command to
[linux-usb-devel] RE: Clear_feature doesn't happen when get_descriptor on pipe
Alan Stern wrote: On Fri, 30 Apr 2004 [EMAIL PROTECTED] wrote: So, it looks like several things aren't working right: First, I would *expect* the GET_DESCRIPTOR command to finish executing on endpoint 0 even after the bulk in pipe stalls, but this doesn't appear to be the case. It looks like a stall on endpoint 1 also stalls endpoint 0. I can't find anything in the USB spec that explicity says how this should work. That should never happen; endpoints are supposed to be independent. Also, endpoint 0 isn't supposed to stall unless the device is given a request that it can't carry out. Second, the Teac drive is returning in invalid status of 4, which I'm assuming is related to all the overlapping commands, since it doesn't happen with the semaphore in place. And, finally, the drive doesn't respond to GET_DESCRIPTOR commands at all after the bulk reset... but this may be caused by the same thing that caused it to return a status of 4. Granted, it is possible that these problems *may* be caused by the Teac CD-ROM not being as robust as it should be. However, if the Teac has this problem, I'm betting other devices do, too. It looks like the semaphore patch may fix another problem we're seeing here with a USB key--Gary's still checking it out. I think you're right that the drive itself is not robust. I've heard others mention devices that get mixed up when they receive control requests while executing a SCSI command. We have this problem even under 2.6. Ultimately the solution will have to be to make the device single-access only. It's not clear how to do that, and usbfs doesn't enforce such limitations at the moment. Alan Stern With Pete's patch, I think that is all that's going wrong. My proposal to fix the issue (I didn't post the patch on here) was to add a semaphore to struct usb_device, and make the usbdev_ioctl() (in devio.c) grab that semaphore before doing the ioctl, and also make transport.c's usb_stor_Bulk_transfer() grab the semaphore before sending out the CBW (and release it only after getting back the CSW). Same thing for the other mass storage transports (though I hadn't actually included that in the patch I sent Pete). As far as I can tell, usbdev_ioctl is the only way that a control message would go out to thet device at the same time that usb-storage is sending out a SCSI/ATAPI command. Stuart --- This SF.Net email is sponsored by: Oracle 10g Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE. http://ads.osdn.com/?ad_id149alloc_id66op=click ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] RE: Clear_feature doesn't happen when get_descriptor on pipe
Here's the patch (for 2.4) if you're curious. 2cdromfix2.patch Description: 2cdromfix2.patch
[linux-usb-devel] Clear_feature doesn't happen when get_descriptor on pipe
Hello... I'm seeing a problem where I have an empty CD-ROM drive (or two) connected to the system, and, every now and then, I get hangs during boot (or issues with commands not completing once the system is up). I've tracked down the issue to a sequence of events similar to the following: --- Issue a READ_TOC CBW to bulk out pipe Issue a GET_DESCRIPTOR to pipe 0 (a different thread does this, presumably) Try to read data for READ_TOC from bulk in pipe -- pipe stalls because there's no CD in the drive and therefore no TOC data Try to issue a CLEAR_FEATURE to clear the stall to pipe 0 -- this fails because there's already an URB in progress to pipe 0 USB driver then tries to read the CSW (never mind that the stall wasn't cleared)--this also fails because pipe is stalled Try to issue a CLEAR_FEATURE again, with similar results USB bus then gets continuous READ_TOC commands on the USB bus, which all get NAKed (pipe still stalled) --- How should this get handled? I would assume that the CLEAR_FEATURE should get retried after the GET_DESCRIPTOR happens, but this certainly isn't what's happening. I'm working with RedHat kernel 2.4.21-11, but I've tried 2.4.26, which also fails at similar places doing similar things, though I saw kernel panics instead of hangs, so I'm not sure exactly what's happening in 2.4.26. Thanks! Stuart --- This SF.Net email is sponsored by: Oracle 10g Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE. http://ads.osdn.com/?ad_id149alloc_id66op=click ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] Clear_feature doesn't happen when get_descriptor on pipe
Alan Stern wrote: On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote: Hello... I'm seeing a problem where I have an empty CD-ROM drive (or two) connected to the system, and, every now and then, I get hangs during boot (or issues with commands not completing once the system is up). I've tracked down the issue to a sequence of events similar to the following: --- Issue a READ_TOC CBW to bulk out pipe Issue a GET_DESCRIPTOR to pipe 0 (a different thread does this, presumably) Try to read data for READ_TOC from bulk in pipe -- pipe stalls because there's no CD in the drive and therefore no TOC data Try to issue a CLEAR_FEATURE to clear the stall to pipe 0 -- this fails because there's already an URB in progress to pipe 0 USB driver then tries to read the CSW (never mind that the stall wasn't cleared)--this also fails because pipe is stalled Try to issue a CLEAR_FEATURE again, with similar results USB bus then gets continuous READ_TOC commands on the USB bus, which all get NAKed (pipe still stalled) --- How should this get handled? Under Linux 2.4, one way to handle this is by preventing the other thread from issuing the GET_DESCRIPTOR to pipe 0 while the CDROM drive is busy. You could find out what that other thread is and suspend or kill it. Under Linux 2.6 there shouldn't be any problem. I would assume that the CLEAR_FEATURE should get retried after the GET_DESCRIPTOR happens, but this certainly isn't what's happening. The Host Controller drivers in 2.4 -- the UHCI drivers at least -- can't handle multiple simultaneous requests to the same control endpoint. That's not going to change; if you need that capability then you have to use 2.6. I'm working with RedHat kernel 2.4.21-11, but I've tried 2.4.26, which also fails at similar places doing similar things, though I saw kernel panics instead of hangs, so I'm not sure exactly what's happening in 2.4.26. Thanks! Stuart Alan Stern I think one of the threads is updfstab, lsusb, or kudzu, which access the device's pipe 0 through the /proc directory, while the other thread is just cdrom.c trying to do some stuff with the drive. I'm not sure how one could keep these things from running simultaneously. But, once this problem happens, the USB driver can't even be unloaded--one must reboot the system to get the USB subsystem working again! I'm not looking to re-architect the drivers--I just want to do something to keep the system from failing when this happens. Perhaps I could keep the proc_ioctls from running until the device isn't being used by usb-storage, or maybe just make the transport layer of usb-storage retry the clear_feature command for a while if the pipe already has an URB going, instead of simply failing and moving on to the next step without clearing the stall...? Stuart --- This SF.Net email is sponsored by: Oracle 10g Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE. http://ads.osdn.com/?ad_id149alloc_id66op=click ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs are n't started right in E HCI driver (2.4 2.6)
That's what I was missing: the queue advance is aborted if the QTD's active bit isn't set. Thanks again! Stuart -Original Message- From: David Brownell [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 27, 2004 5:56 PM To: Hayes, Stuart Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs are n't started right in E HCI driver (2.4 2.6) [EMAIL PROTECTED] wrote: One question on that, though. With the fix that was put into 2.4.22, the qh is now initialized as live by qh_make (i.e., the halt bit isn't set). Then, in intr_submit, the qh is scheduled with no QTDs, and the QTDs are added to Or as the comment says, scheduling errors are forced (in the rare case the QH isn't yet scheduled, AND there's no space); periodic qTDs are always added to a QH that's live (which can't fail). the live qh. When the QTDs are inserted by qh_append_tds, it sets the QTD token to have the halt bit set, moves the QTD into position, and, finally, puts the correct token into the QTD. That is, the old dummy is overwritten by the first QTD in the list, and the original first QTD becomes the new dummy. Why? Because the dummy is already in position (but not ACTIVE), and trying to substitute another QTD there gets racey. It seems like there would be a race condition in qh_append_tds between the software and host controller--if the host controller looks at this qh after the QTD is added to the qh, but before the real token is copied into the QTD (which is pretty much the last thing that qh_append_tds does), the host controller will copy the incomplete QTD into the overlay area, see that the active bit isn't set (because it copied the token that just has the halt bit set), and go on to the next qh. The next time around, the host controller will look at the overlay, see the halt bit is set, and move on to the next qh, and the complete QTD will never be seen. See section 4.10.2 of the EHCI spec, which says the controller won't overlay any QTD unless its ACTIVE bit is set. On the other hand ... I believe some EHCI 0.95 controllers don't always do that. I'm pretty sure I've seen both NEC and Philips 0.95 controllers overlay the (inactive) dummy; see comments at the end of qh_completions(). That type of failure was pretty uncommon, at least in the system I was using at that time and using the tests I had then, and I think it's got a complete-enough workaround. Granted, this isn't very likely to happen, but it seems possible. It seems like there would be two ways to avoid this possibility: One way would be to leave the halt bit set in the overlay until all of the QTDs are added to the qh (this is what I had proposed). The second would be to change And that's in fact similar to what the original driver version was doing; as bugs got fixed, and deep queues got to behaving, the code evolved to the current simpler approach. qh_append_tds to put a 0 (instead of a halt) in the QTD's token until it is done adding the QTD--in this way, if the HC *did* copy the incomplete QTD into the overlay area, it would see the that the active bit isn't set and move on to the next QH, but the next time around it would see that neither halt nor active were set in the overlay, and it would go fetch the QTD again. Does this seem reasonable, or am I missing something? I remember thinking leaving HALT in qh_append_tds() should pretty much guarantee that qh_completions() would _always_ patch up after up that EHCI 0.95 bug at the next IRQ (or watchdog). Also, think of the other race: the incomplete QTD would have hw_next pointing to some rather irrelevant location in memory ... but that's what the HC would fetch and try to use! So it'd die after a DMA fault, in at least some cases. - Dave Thanks Stuart --- The SF.Net email is sponsored by EclipseCon 2004 Premiere Conference on Open Tools Development and Integration See the breadth of Eclipse activity. February 3-5 in Anaheim, CA. http://www.eclipsecon.org/osdn ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
[linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs aren't started right in E HCI driver (2.4 2.6)
David et al-- I've been working on a problem where devices plugged into a USB 2.0 hub (which is plugged into an EHCI controller) aren't seen until there is some other activity in the USB subsystem. To reproduce this, I booted up a system with a USB2.0 hub connected to my EHCI controller, and, after the system was up, I tried plugging a device into the hub. Linux did not see the device. If I then plugged a device directly into the EHCI controller, Linux would see both of the devices that I plugged in. I've found, in the EHCI driver, that queue heads set up for an interrupt pipe (such as the one set up to check for connection changes on the USB2.0 hub) are set up correctly, but the halt (QTD_STS_HALT) bit in the hw_token in the overlay area is never cleared after the QH and QTDs are set up and scheduled. With this halt bit set, the HC will ignore the QH, so it doesn't get executed. If the hw_token in the overlay is cleared to 0, the HC will advance the queue (copy the first QTD into the overlay area) and do the transaction. The bit eventually gets cleared if there is other activity on the bus--scan_periodic will see the halt bit set, and, among other things, clear the bit. However, if there is no other activity in the EHCI controller, scan_periodic never gets called. (For async transactions, this bit is cleared in qh_link_async after the QH is set up.) Attached is a patch. It just modifies intr_submit (in ehci-sched.c) to clear qh-hw_token to 0 after all of the TDs are added to the QH. I believe this patch is applicable to both the 2.4 and 2.6 kernels. Thanks Stuart Hayes [EMAIL PROTECTED] --- ehci-sched.c.orig 2004-01-26 16:10:14.0 -0500 +++ ehci-sched.c2004-01-26 14:02:45.0 -0500 @@ -483,6 +483,10 @@ /* ... update usbfs periodic stats */ hcd_to_bus (ehci-hcd)-bandwidth_int_reqs++; + /* clear token in overlay so HC will advance queue and */ + /* execute QH (QTD_STS_HALT is set when QH is created) */ + qh-hw_token = 0; + done: spin_unlock_irqrestore (ehci-lock, flags); if (status) --- The SF.Net email is sponsored by EclipseCon 2004 Premiere Conference on Open Tools Development and Integration See the breadth of Eclipse activity. February 3-5 in Anaheim, CA. http://www.eclipsecon.org/osdn ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs are n't started right in E HCI driver (2.4 2.6)
Right, thanks. Sorry I didn't see that--it was fixed it in a different way than I was looking for. I didn't actually test the latest kernels, but just looked at the source. Stuart -Original Message- From: David Brownell [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 27, 2004 2:27 PM To: Hayes, Stuart Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs aren't started right in E HCI driver (2.4 2.6) [EMAIL PROTECTED] wrote: I've found, in the EHCI driver, that queue heads set up for an interrupt pipe (such as the one set up to check for connection changes on the USB2.0 hub) are set up correctly, but the halt (QTD_STS_HALT) bit in the hw_token in the overlay area is never cleared after the QH and QTDs are set up and On 2.4, that doesn't happen since about 20-June-2003; and on 2.6, just a few days earlier (BK describes it as a micro-patch). scheduled. With this halt bit set, the HC will ignore the QH, so it doesn't get executed. If the hw_token in the overlay is cleared to 0, the HC will advance the queue (copy the first QTD into the overlay area) and do the transaction. Sounds to me like your base kernel is missing EHCI patches I'd call essential. More current kernels should be fine. - Dave --- The SF.Net email is sponsored by EclipseCon 2004 Premiere Conference on Open Tools Development and Integration See the breadth of Eclipse activity. February 3-5 in Anaheim, CA. http://www.eclipsecon.org/osdn ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
RE: [linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs are n't started right in E HCI driver (2.4 2.6)
One question on that, though. With the fix that was put into 2.4.22, the qh is now initialized as live by qh_make (i.e., the halt bit isn't set). Then, in intr_submit, the qh is scheduled with no QTDs, and the QTDs are added to the live qh. When the QTDs are inserted by qh_append_tds, it sets the QTD token to have the halt bit set, moves the QTD into position, and, finally, puts the correct token into the QTD. It seems like there would be a race condition in qh_append_tds between the software and host controller--if the host controller looks at this qh after the QTD is added to the qh, but before the real token is copied into the QTD (which is pretty much the last thing that qh_append_tds does), the host controller will copy the incomplete QTD into the overlay area, see that the active bit isn't set (because it copied the token that just has the halt bit set), and go on to the next qh. The next time around, the host controller will look at the overlay, see the halt bit is set, and move on to the next qh, and the complete QTD will never be seen. Granted, this isn't very likely to happen, but it seems possible. It seems like there would be two ways to avoid this possibility: One way would be to leave the halt bit set in the overlay until all of the QTDs are added to the qh (this is what I had proposed). The second would be to change qh_append_tds to put a 0 (instead of a halt) in the QTD's token until it is done adding the QTD--in this way, if the HC *did* copy the incomplete QTD into the overlay area, it would see the that the active bit isn't set and move on to the next QH, but the next time around it would see that neither halt nor active were set in the overlay, and it would go fetch the QTD again. Does this seem reasonable, or am I missing something? Thanks Stuart -Original Message- From: Hayes, Stuart Sent: Tuesday, January 27, 2004 3:05 PM To: 'David Brownell' Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: RE: [linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs aren't started right in E HCI driver (2.4 2.6) Right, thanks. Sorry I didn't see that--it was fixed it in a different way than I was looking for. I didn't actually test the latest kernels, but just looked at the source. Stuart -Original Message- From: David Brownell [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 27, 2004 2:27 PM To: Hayes, Stuart Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [linux-usb-devel] PATCH for ehci-sched.c -- interrupt QHs aren't started right in E HCI driver (2.4 2.6) [EMAIL PROTECTED] wrote: I've found, in the EHCI driver, that queue heads set up for an interrupt pipe (such as the one set up to check for connection changes on the USB2.0 hub) are set up correctly, but the halt (QTD_STS_HALT) bit in the hw_token in the overlay area is never cleared after the QH and QTDs are set up and On 2.4, that doesn't happen since about 20-June-2003; and on 2.6, just a few days earlier (BK describes it as a micro-patch). scheduled. With this halt bit set, the HC will ignore the QH, so it doesn't get executed. If the hw_token in the overlay is cleared to 0, the HC will advance the queue (copy the first QTD into the overlay area) and do the transaction. Sounds to me like your base kernel is missing EHCI patches I'd call essential. More current kernels should be fine. - Dave --- The SF.Net email is sponsored by EclipseCon 2004 Premiere Conference on Open Tools Development and Integration See the breadth of Eclipse activity. February 3-5 in Anaheim, CA. http://www.eclipsecon.org/osdn ___ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel