Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
Hello, Alan.

On Tue, Jan 15, 2013 at 11:01:15PM -0500, Alan Stern wrote:
> > The current domain implementation is somewhere in between. It's not a completely simplistic system and at the same time not developed enough to do properly stacked flushing.
>
> I like your idea of chronological synchronization: Insist that anybody who wants to flush async jobs must get a cookie, and then only allow them to wait for async jobs started after the cookie was issued.
>
> I don't know if this is possible with the current implementation. It would require changing every call to async_synchronize_*(), and in a nontrivial way. But it might provide a proper solution to all these problems.

The problem here is that "flush everything which comes before me" is used to order async jobs. e.g. after async jobs probe the hardware, they order themselves by flushing before registering them, so unless we build accurate flushing dependencies, those dependencies will reach beyond the time window we're interested in and bring in deadlocks.

And, as Linus pointed out, tracking dependency through request_module() is tricky no matter what we do. I think it can be done by matching the ones calling request_module() and the ones actually loading modules, but it's gonna be nasty.

There aren't too many which use async anyway, so changing stuff shouldn't be too difficult, but I think the simplicity or dumbness is one of the major attractions of async, so it'd be nice to keep things that way, and the PF_USED_ASYNC hack seems to be able to hold things together for now.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Wed, 16 Jan 2013, Tejun Heo wrote:

> Hello, Alan.
>
> On Tue, Jan 15, 2013 at 11:01:15PM -0500, Alan Stern wrote:
> > > The current domain implementation is somewhere in between. It's not a completely simplistic system and at the same time not developed enough to do properly stacked flushing.
> >
> > I like your idea of chronological synchronization: Insist that anybody who wants to flush async jobs must get a cookie, and then only allow them to wait for async jobs started after the cookie was issued.
> >
> > I don't know if this is possible with the current implementation. It would require changing every call to async_synchronize_*(), and in a nontrivial way. But it might provide a proper solution to all these problems.
>
> The problem here is that "flush everything which comes before me" is used to order async jobs. e.g. after async jobs probe the hardware, they order themselves by flushing before registering them, so unless

I don't fully understand this example. What is the point -- to make sure that asynchronously probed devices are registered in the order of their discovery?

If so, here's how to do it safely: Start up the async jobs in reverse order of discovery. Have each job acquire a cookie when it starts. Then each job needs to wait only for tasks that started after its cookie was issued.

> we build accurate flushing dependencies, those dependencies will reach beyond the time window we're interested in and bring in deadlocks.

The flushing-dependency principle can be very simple: No async task should ever have to wait for another async task that started before it. The cookie approach satisfies this requirement (unless an earlier task passes its cookie to a later task or subverts the mechanism in another way).

> And, as Linus pointed out, tracking dependency through request_module() is tricky no matter what we do. I think it can be done by matching the ones calling request_module() and the ones actually loading modules, but it's gonna be nasty.

This shouldn't matter. Dependencies don't need to be tracked explicitly, because we know that any async work done by request_module() must start _after_ request_module() is called. Thus, if async task A calls request_module(), which starts up async task B, then we know that A can safely wait for B and B cannot safely wait for A.

> There aren't too many which use async anyway, so changing stuff shouldn't be too difficult, but I think the simplicity or dumbness is one of the major attractions of async, so it'd be nice to keep things that way, and the PF_USED_ASYNC hack seems to be able to hold things together for now.

Nesting won't matter for the chronological approach. I really think you should consider it more fully. It's not a hack, and it doesn't need to be complicated.

Alan Stern
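[Editor's sketch] The chronological rule described in this message can be modeled in a few lines of userspace C. This is only an illustration under stated assumptions: `cookie_t`, `take_cookie()` and `may_wait_for()` are hypothetical names, not kernel interfaces, and a plain counter stands in for the real async cookie space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
typedef uint64_t cookie_t;

static cookie_t next_cookie = 1;

/* Hand out a strictly increasing cookie, as a job would do when it starts. */
static cookie_t take_cookie(void)
{
    return next_cookie++;
}

/*
 * The chronological rule: a waiter may only wait for jobs that started
 * after the waiter's own cookie was issued.
 */
static bool may_wait_for(cookie_t waiter, cookie_t job)
{
    return job > waiter;
}
```

If task A takes its cookie before calling request_module() and task B is started as a result, `may_wait_for(a, b)` holds while `may_wait_for(b, a)` does not, which is the asymmetry the message relies on.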
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
Hello, Alan.

On Wed, Jan 16, 2013 at 12:01:53PM -0500, Alan Stern wrote:
> > The problem here is that "flush everything which comes before me" is used to order async jobs. e.g. after async jobs probe the hardware, they order themselves by flushing before registering them, so unless
>
> I don't fully understand this example. What is the point -- to make sure that asynchronously probed devices are registered in the order of their discovery?

People still want devices to be numbered according to their physical ports and so on, so we keep the registration order the same as the natural (whatever that means) hardware order.

> If so, here's how to do it safely: Start up the async jobs in reverse order of discovery. Have each job acquire a cookie when it starts. Then each job needs to wait only for tasks that started after its cookie was issued.

It's a bit clumsy but yeah I guess it could work.

> > There aren't too many which use async anyway, so changing stuff shouldn't be too difficult, but I think the simplicity or dumbness is one of the major attractions of async, so it'd be nice to keep things that way, and the PF_USED_ASYNC hack seems to be able to hold things together for now.
>
> Nesting won't matter for the chronological approach. I really think you should consider it more fully. It's not a hack, and it doesn't need to be complicated.

There is benefit to the current dumb implementation in that drivers can use it without thinking too much, but yeah it could be that the flushing range limit isn't too much of a restriction on top. I don't know.

At this point, I'd prefer to remove request_module() from the elevator init path for the problem at hand. If we need something more involved, changing cookie usage rules definitely seems like an option.

Thanks.

--
tejun
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Wed, 16 Jan 2013, Tejun Heo wrote:

> Hello, Alan.
>
> On Wed, Jan 16, 2013 at 12:01:53PM -0500, Alan Stern wrote:
> > > The problem here is that "flush everything which comes before me" is used to order async jobs. e.g. after async jobs probe the hardware, they order themselves by flushing before registering them, so unless
> >
> > I don't fully understand this example. What is the point -- to make sure that asynchronously probed devices are registered in the order of their discovery?
>
> People still want devices to be numbered according to their physical ports and so on, so we keep the registration order the same as the natural (whatever that means) hardware order.
>
> > If so, here's how to do it safely: Start up the async jobs in reverse order of discovery. Have each job acquire a cookie when it starts. Then each job needs to wait only for tasks that started after its cookie was issued.
>
> It's a bit clumsy but yeah I guess it could work.
>
> > > There aren't too many which use async anyway, so changing stuff shouldn't be too difficult, but I think the simplicity or dumbness is one of the major attractions of async, so it'd be nice to keep things that way, and the PF_USED_ASYNC hack seems to be able to hold things together for now.
> >
> > Nesting won't matter for the chronological approach. I really think you should consider it more fully. It's not a hack, and it doesn't need to be complicated.
>
> There is benefit to the current dumb implementation in that drivers can use it without thinking too much, but yeah it could be that the flushing range limit isn't too much of a restriction on top. I don't know.
>
> At this point, I'd prefer to remove request_module() from the elevator init path for the problem at hand. If we need something more involved, changing cookie usage rules definitely seems like an option.

A simpler approach might be to leave the existing synchronization mechanisms as they are, and use the chronological approach only for the case of loading a module (or wherever else someone wants to use it).

Alan Stern
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
[ Added Tejun to the discussion, since he's the async go-to-guy ]

On Mon, Jan 14, 2013 at 10:23 PM, Ming Lei ming@canonical.com wrote:
>
> But I have another idea to address the problem, and let module code call async_synchronize_full() only if the module requires that explicitly, so how about the below draft patch?

No way. This kind of "let's just let drivers tell us when they used async helpers" is basically *asking* for buggy code. In fact, just to prove how bad it is, YOU SCREWED IT UP YOURSELF.

Because it's not just sd.c that uses async_schedule(), and would need the async synchronize. It's floppy.c, it's generic scsi scanning (so scsi tapes etc), and it's libata-core.c.

This kind of "let's randomly encourage people to write subtly buggy code that has magical timing dependencies, so that the developer won't likely even see it because he has fast disks etc" code is totally unacceptable. And this code was *designed* to be that kind of buggy.

No, if we set a flag like this, then it needs to be set *automatically*, so that a module cannot screw this up by mistake. It could be as simple as having a per-thread flag that gets set by the __async_schedule() function, and gets cleared by fork. Then the module code could do something like

    /* before calling the module ->init function */
    current->used_async = 0;
    ...
    if (current->used_async)
        async_synchronize_full();

or whatever.

Tejun, comments? You can see the whole thread on lkml, but the basic problem is that the module loading doing the unconditional async_synchronize_full() has caused problems, because we have

 - load module A

 - module A does per-controller async discovery of its devices (eg scsi or ata probing)

 - in the async thread, it initializes something that needs another module B (in this case the default IO scheduler module)

 - modprobe for B loads the IO scheduler module successfully

   at the end of the module load, it does async_synchronize_full() to make sure load_module won't return before the module is ready

   *DEADLOCK*, because the async_synchronize_full() thing actually waits for not the module B async code (it didn't have any), but for the module *A* async code, which is waiting for module B to finish.

Now, I'll happily argue that we shouldn't have this kind of "load modules from random context" behavior in the kernel, and I think the block layer is to blame for doing the IO scheduler load at an insane time. So "don't do that then" would be the best solution. Sadly, we don't even have a good way to notice that we're doing it, so a hacky workaround that at least doesn't require driver authors to care is likely the second-best workaround.

But the hacky workaround absolutely needs to be *automatic*. Because the "driver writers need to get this subtle untestable thing right" is *not* acceptable. That's the patch that Ming Lei did, and I refuse to have that kind of fragile crap in the kernel.

          Linus
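[Editor's sketch] The automatic per-thread flag proposed above can be modeled in userspace C roughly as follows. Every name here is an assumption made for illustration: the `model_*` functions are hypothetical stand-ins for `__async_schedule()`, `async_synchronize_full()` and the module load path, and a thread-local variable plays the role of `current->used_async`. It is not the patch that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the per-thread flag; not kernel code. */
static _Thread_local bool used_async;   /* stands in for current->used_async */
static int full_syncs;                  /* counts calls to the sync stub */

/* Stand-in for __async_schedule(): queuing work marks the calling thread. */
static void model_async_schedule(void (*fn)(void))
{
    used_async = true;
    (void)fn;   /* a real implementation would queue fn; the model does not run it */
}

/* Stand-in for async_synchronize_full(). */
static void model_async_synchronize_full(void)
{
    full_syncs++;
}

/* Module load path: only synchronize if ->init itself scheduled async work. */
static void model_do_init_module(void (*init)(void))
{
    used_async = false;
    init();
    if (used_async)
        model_async_synchronize_full();
}

static void noop_job(void) { }
static void init_without_async(void) { }
static void init_with_async(void) { model_async_schedule(noop_job); }
```

The point of the design is that the flag is set by the scheduling function itself, so a module such as an IO scheduler that schedules nothing never triggers the full synchronization, and no driver author has to remember to opt in.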
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, Jan 15, 2013 at 9:36 AM, Linus Torvalds torva...@linux-foundation.org wrote:
>
> This kind of "let's randomly encourage people to write subtly buggy code that has magical timing dependencies, so that the developer won't likely even see it because he has fast disks etc" code is totally unacceptable. And this code was *designed* to be that kind of buggy.

Btw, we could *possibly* do this the other way around. Wait for all async work by default, but then have a really hacky way to turn that off for modules that explicitly don't want it, because they know they can be loaded in async context, and they don't do any async work themselves.

Then we could make the IO schedulers set that flag ("I know I'm loaded from async space, and I know I'm not myself doing any async init")

Quite frankly, I'd still much rather prefer the automated approach - or even better, just avoiding the "load modules in async context" entirely. But at least the "I can put a huge comment about why I don't want to be waited on" would be much more acceptable than the "I need to explicitly tell the world that it needs to wait on me".

So Ming Lei's patch was "easily subtly buggy by mistake" (showing that by the fact that it was indeed buggy), while the opposite model where you have to explicitly ask people not to wait for you could still be very buggy, but at least now it needs to explicitly do extra work in order to be buggy.

So if an interface is fragile, it should aim to be fragile in the right way - making the fragility explicit, so that people can grep for it, and people can add comments to the particular code that marks it fragile. The default behavior should be the robust one.

And it would be lovely to add a warning to the "people loaded a module from async context" case, so that we'd *see* this. Tejun, is there a good way for code to see "I'm running in async context"? Then we could do something like

    WARN_ON_ONCE(wait && system_state == SYSTEM_RUNNING && in_async_thread());

in kernel/kmod.c (__request_module()). That should at least warn about this whole issue happening.

         Linus
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, 15 Jan 2013, Linus Torvalds wrote:

> Tejun, comments? You can see the whole thread on lkml, but the basic problem is that the module loading doing the unconditional async_synchronize_full() has caused problems, because we have
>
>  - load module A
>
>  - module A does per-controller async discovery of its devices (eg scsi or ata probing)
>
>  - in the async thread, it initializes something that needs another module B (in this case the default IO scheduler module)
>
>  - modprobe for B loads the IO scheduler module successfully
>
>    at the end of the module load, it does async_synchronize_full() to make sure load_module won't return before the module is ready
>
>    *DEADLOCK*, because the async_synchronize_full() thing actually waits for not the module B async code (it didn't have any), but for the module *A* async code, which is waiting for module B to finish.
>
> Now, I'll happily argue that we shouldn't have this kind of "load modules from random context" behavior in the kernel, and I think the block layer is to blame for doing the IO scheduler load at an insane time. So "don't do that then" would be the best solution.

It may not be so easy. When the SCSI async thread probes the new disk, it has to do I/O. So it needs to use a scheduler.

But maybe it could use a built-in trivial scheduler until the proper one is loaded. Then the loading could be asynchronous.

Alan Stern
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
Hello, Linus.

On Tue, Jan 15, 2013 at 09:36:57AM -0800, Linus Torvalds wrote:
> Tejun, comments? You can see the whole thread on lkml, but the basic problem is that the module loading doing the unconditional async_synchronize_full() has caused problems, because we have
>
>  - load module A
>
>  - module A does per-controller async discovery of its devices (eg scsi or ata probing)
>
>  - in the async thread, it initializes something that needs another module B (in this case the default IO scheduler module)
>
>  - modprobe for B loads the IO scheduler module successfully
>
>    at the end of the module load, it does async_synchronize_full() to make sure load_module won't return before the module is ready
>
>    *DEADLOCK*, because the async_synchronize_full() thing actually waits for not the module B async code (it didn't have any), but for the module *A* async code, which is waiting for module B to finish.

I think the root problem here, apart from request_module() from block - which is a bit nasty but making that part completely async would too be quite nasty albeit in a different way - is that async_synchronize_full() is way too indiscriminate. It's something only suitable for things like the end of system init.

I'm wondering whether what we need is a rudimentary nesting like the following.

    finished_loading()
    {
        blah blah;
        cookie = async_current_cookie();
        do init calls;
        async_synchronize_upto(cookie);
        blah blah;
    }

The nesting here would be an approximation as the dependency recorded here is chronological. I *suspect* this should be safe unless the module is doing something weird. Need to think more about it.

One way or the other, I think what we need is some form of scoping for flushing async ops.

BTW, the current synchronization is broken - cookie isn't transferred to running->domain in queueing order but __lowest_in_progress() assumes that. I think I broke that while converting it to workqueue. Anyways, working on it.

Thanks.

--
tejun
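[Editor's sketch] The scoped flush outlined in this message (take the current cookie, run the init calls, then wait only for what was queued in between) can be illustrated with a small userspace model. The names `model_job`, `model_current_cookie()`, `model_schedule()` and `model_pending_since()` are hypothetical, the job table is a fixed array chosen for brevity, and the model uses a single cookie space, so it does not address the cross-domain concern raised later in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each async job records the cookie it was queued with. */
struct model_job {
    unsigned long cookie;
    bool done;
};

#define MAX_JOBS 16
static struct model_job jobs[MAX_JOBS];
static size_t nr_jobs;
static unsigned long next_cookie = 1;

/* Cookie the next queued job would receive (the scope's starting point). */
static unsigned long model_current_cookie(void)
{
    return next_cookie;
}

/* Queue a job and return its cookie, or 0 if the table is full. */
static unsigned long model_schedule(void)
{
    if (nr_jobs == MAX_JOBS)
        return 0;
    jobs[nr_jobs].cookie = next_cookie;
    jobs[nr_jobs].done = false;
    nr_jobs++;
    return next_cookie++;
}

/* Count unfinished jobs queued at or after 'start': what a scoped flush waits for. */
static size_t model_pending_since(unsigned long start)
{
    size_t i, n = 0;

    for (i = 0; i < nr_jobs; i++)
        if (!jobs[i].done && jobs[i].cookie >= start)
            n++;
    return n;
}
```

In the deadlock scenario, module A's probe job is queued before module B's load takes its starting cookie, so the scoped count for B's load excludes it, whereas a flush from cookie 0 (the analogue of a full synchronize) would include it.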
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
Hello, Alan.

On Tue, Jan 15, 2013 at 01:20:58PM -0500, Alan Stern wrote:
> It may not be so easy. When the SCSI async thread probes the new disk, it has to do I/O. So it needs to use a scheduler.
>
> But maybe it could use a built-in trivial scheduler until the proper one is loaded. Then the loading could be asynchronous.

It can be done. Noop is always built-in and block IO can do IOs with noop. The problem here is that request_module() is done synchronously during elevator_init(). We can punt that to a work item so that the elevator is switched on load completion.

There is some nastiness involved tho - if module probing returns before the elevator switch happens, userland can observe the elevator being switched after some indeterminate short period of time, which can, for example, break scripts adjusting elevator knobs etc...

I *think* it'll be best to allow scoped synchronization of async ops. Looking into it.

Thanks.

--
tejun
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, Jan 15, 2013 at 10:32 AM, Tejun Heo t...@kernel.org wrote:
>
> I think the root problem here, apart from request_module() from block - which is a bit nasty but making that part completely async would too be quite nasty albeit in a different way - is that async_synchronize_full() is way too indiscriminate. It's something only suitable for things like the end of system init.
>
> I'm wondering whether what we need is a rudimentary nesting like the following.

I think that is a good solution if it works, but look out: we need to synchronize across *all* domains, not just the default one.

The sd.c code, for example, uses its own scsi_sd_probe_domain, and we *do* want to synchronize with it.

Can you do that with your suggested interface (ie it would have to be a *global* sequence number)?

         Linus
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
Hello, Linus.

Will continue on another reply but this one is relevant so...

On Tue, Jan 15, 2013 at 10:18:45AM -0800, Linus Torvalds wrote:
> Tejun, is there a good way for code to see "I'm running in async context"? Then we could do something like

Almost. With a bit of modification we can ask whether current is a kworker, reach struct worker_struct via kthread_data() if so and then test worker->current_func against the async workfn.

Thanks.

--
tejun
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
cc'ing Arjan. Arjan, the original thread can be read from

  http://thread.gmane.org/gmane.linux.kernel/1420814

Hello, again.

On Tue, Jan 15, 2013 at 12:18:01PM -0800, Linus Torvalds wrote:
> I think that is a good solution if it works, but look out: we need to synchronize across *all* domains, not just the default one.
>
> The sd.c code, for example, uses its own scsi_sd_probe_domain for example, and we *do* want to synchronize with it.
>
> Can you do that with your suggested interface (ie it would have to be a *global* sequence number).

So, I've been thinking about it for a while now and it looks like async is cutting too many corners to implement any sane stackable flushing scheme on top. There simply isn't much information to determine who should wait for what.

I've thought of two workarounds. Both suck.

A. Try to detect deadlock conditions from synchronize(). If a deadlock condition involving other async jobs is detected, whine about it and then skip. Ignore deadlock conditions on self (should solve this particular case).

   Detecting deadlock conditions isn't difficult if there are only global synchronizations; unfortunately, fragmented dependencies via domain-local synchronization make this non-trivial. We can still do the ignore-self thing mostly trivially tho. This will at least work around the problem at hand.

B. The ranged synchronization I first suggested. The problem with this is that it's a common practice for a given async job to try to flush anything which comes before it. This can introduce spurious synchronization dependencies which can then lead to deadlocks.

   These conditions can be detected and ignored, at least only considering global synchronizations. The problem here is that those deadlock conditions will occur under normal usage and thus should be ignored silently, which basically makes synchronization silently ignore and finish successfully even if there are legitimate deadlocks which should be investigated.

For now, I'm gonna implement simple "I'm not gonna wait for myself" self-deadlock avoidance. If this needs any more sophistication, I think we better reimplement it so that we can explicitly match up and track who's gonna wait for what instead of throwing everything into a single cookie space and then try to work back from there.

Thanks.

--
tejun
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
> For now, I'm gonna implement simple "I'm not gonna wait for myself" self-deadlock avoidance. If this needs any more sophistication, I think we better reimplement it so that we can explicitly match up and track who's gonna wait for what instead of throwing everything into a single cookie space and then try to work back from there.

async fundamentally had the concept of a monotonic increasing number, and that you could always wait for "everyone before me". then people (like me) wanted exceptions to what "everyone" means ;-(
I'm ok with going back to a single space and simplify the world.

the case with (usb) module loading is fun... people expect the device to be there (since frankly, it's hard to do otherwise)..
... but it's also really hard due to the nature of USB.. USB is async in nature, even independent of the kernel async stuff.
Example: Load ehci.ko ... the actual use devices don't show up for some time.

the module wait case is tricky, and I wonder if there's deadlocks lurking even without async.

(btw there is a similar situation at the end of the normal kernel boot versus things like asynchronous driver initializing... but we skip that in the case of an initrd is used to bypass a very similar deadlock. this is even without async in use.. typical hard case is the PS/2 mouse probing)

at some point in the past we had the concept of "request a module but don't wait for it", and I wonder if that is what should have been used here.

Doing a range wait, with the start of the range being taken at the start of module loading is a bit of a hack, but it'll work for the userspace expected semantics of "all async stuff of the *loaded module* be done, independent of all other modules/async stuff". It's not as deadlocky as one might think, but it's not going to be efficient to implement.

not self-deadlocking likely solves most practical cases though
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
Hello, Arjan.

On Tue, Jan 15, 2013 at 04:25:54PM -0800, Arjan van de Ven wrote:
> async fundamentally had the concept of a monotonic increasing number, and that you could always wait for "everyone before me". then people (like me) wanted exceptions to what "everyone" means ;-(
> I'm ok with going back to a single space and simplify the world.

If we want (or need) finer grained operation, we'll probably have to head the other direction, so that we can definitively tell that an async operation belongs to domains system, module load A and B, so that each waiter knows what to wait for.

The current domain implementation is somewhere in between. It's not a completely simplistic system and at the same time not developed enough to do properly stacked flushing.

> the module wait case is tricky, and I wonder if there's deadlocks lurking even without async.

I don't think so. It's really an async job waiting for itself. Working around just this case is mostly trivial (working on patches now) but it really is putting kludges on top of a shaky foundation. Maybe this is the extent of complexity that we need to go to given the rather limited use cases of async. Let's hope so. I think we'll have to reimplement the synchronization scheme if we have to go further.

> at some point in the past we had the concept of "request a module but don't wait for it", and I wonder if that is what should have been used here.

We actually want to wait for it as it creates a userland visible behavior difference otherwise. It's just that async's way of waiting is too ham-fisted to be used properly in more complex scenarios.

Thanks.

--
tejun
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, Jan 15, 2013 at 3:50 PM, Tejun Heo t...@kernel.org wrote:
>
> For now, I'm gonna implement simple "I'm not gonna wait for myself" self-deadlock avoidance.

You can't really do that. Or rather, it won't *help*.

The thing is, the module loading in particular is not necessarily happening in the same context as what *started* the module loading. A module loader will request the module from user space, and then later user space - through possibly a totally unrelated process - will finish it.

So there is no "myself". There's not even necessarily any relationship that the kernel even knows about, because the module loading request can have gone from usermode_helper over something like dbus to systemd.

See?

There's a reason I asked for a warning for this. Or the "let's flag the current thread if it ever started anything asynchronous". Because it's complicated.

          Linus
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, Jan 15, 2013 at 4:36 PM, Linus Torvalds torva...@linux-foundation.org wrote:
>
> There's a reason I asked for a warning for this. Or the "let's flag the current thread if it ever started anything asynchronous". Because it's complicated.

Btw, the sequence counter (that is *not* taking anything else into account) is good enough in practice, exactly because the common case for module loading is actually that nothing in the module init sequence is done asynchronously.

Yes, device discovery (particularly for block devices) is often asynchronous. But the modules it then asks to load usually wouldn't be.

So if we just have the flag "did this thread ever even start async work" over the module init sequence, we can just avoid the async serialization entirely for that case, and it breaks the deadlock chain nicely in practice.

Only if a block device does async work and then wants to load another module that does more async work in its init routine would it then break. But at that point, I'll happily just put my foot down and tell people they are crazy, and "Let's not do that kind of crap".

         Linus
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, Jan 15, 2013 at 04:36:34PM -0800, Linus Torvalds wrote:
> The thing is, the module loading in particular is not necessarily happening in the same context as what *started* the module loading. A module loader will request the module from user space, and then later user space - through possibly a totally unrelated process - will finish it.
>
> So there is no "myself". There's not even necessarily any relationship that the kernel even knows about, because the module loading request can have gone from usermode_helper over something like dbus to systemd.
>
> See?

Right. Gees, there's even no way to link them.

--
tejun
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Wed, Jan 16, 2013 at 1:36 AM, Linus Torvalds torva...@linux-foundation.org wrote:
>
> Because it's not just sd.c that uses async_schedule(), and would need the async synchronize. It's floppy.c, it's generic scsi scanning (so scsi tapes etc), and it's libata-core.c.

As discussed previously, only a module which populates a device node for user space inside its async func may require the synchronization, so that the script below can't be broken:

    modprobe A
    mount /dev/XXX /mnt

That should be the original bug report:

    https://bugzilla.kernel.org/attachment.cgi?id=20937

For other modules, it looks like the synchronization isn't needed; at least there are lots of other async things (work, kthread, ...) which are scheduled in driver probe(), and no synchronization is added after the module init() completes inside module loading. Do we need to add that sync for all async things inside module loading?

So it looks like only sd.c and floppy.c need to be synchronized, supposing some sync interfaces are introduced, doesn't it?

Thanks
--
Ming Lei
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, 15 Jan 2013, Tejun Heo wrote:
> Hello, Arjan.
>
> On Tue, Jan 15, 2013 at 04:25:54PM -0800, Arjan van de Ven wrote:
>> async fundamentally had the concept of a monotonic increasing number,
>> and that you could always wait for "everyone before me".
>> then people (like me) wanted exceptions to what "everyone" means ;-(
>> I'm ok with going back to a single space and simplify the world.
>
> If we want (or need) finer grained operation, we'll probably have to
> head the other direction, so that we can definitively tell that an
> async operation belongs to domains system, module load A and B, so
> that each waiter knows what to wait for.
>
> The current domain implementation is somewhere in between. It's not
> a completely simplistic system and at the same time not developed
> enough to do properly stacked flushing.

I like your idea of chronological synchronization: Insist that anybody
who wants to flush async jobs must get a cookie, and then only allow
them to wait for async jobs started after the cookie was issued.

I don't know if this is possible with the current implementation. It
would require changing every call to async_synchronize_*(), and in a
nontrivial way. But it might provide a proper solution to all these
problems.

Can you think of any reasons why it wouldn't work in principle? It
would prevent code from doing "wait until all currently-running async
jobs have finished" -- but arguably, nobody should be allowed to do
that anyway.

Alan Stern
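The cookie rule proposed above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the job table and the names get_cookie(), start_job(), finish_job() and pending_after() are hypothetical and are not the kernel's async API. It shows a waiter that counts only jobs started after its own cookie was issued.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_JOBS 16

/* Hypothetical bookkeeping; none of these names exist in the kernel. */
struct job {
	unsigned long cookie;	/* issue order, strictly increasing */
	bool done;
};

static struct job jobs[MAX_JOBS];
static size_t nr_jobs;
static unsigned long next_cookie = 1;

/* Hand out the next cookie; a waiter calls this before it may wait. */
static unsigned long get_cookie(void)
{
	return next_cookie++;
}

/* Start a job and return its index, or -1 if the table is full. */
static int start_job(void)
{
	if (nr_jobs == MAX_JOBS)
		return -1;
	jobs[nr_jobs].cookie = get_cookie();
	jobs[nr_jobs].done = false;
	return (int)nr_jobs++;
}

/* Mark the job at index idx as finished. */
static void finish_job(int idx)
{
	jobs[idx].done = true;
}

/*
 * Count unfinished jobs that were started after the given cookie was
 * issued. A real waiter would sleep until this count reaches zero;
 * jobs that started earlier are never waited for.
 */
static size_t pending_after(unsigned long cookie)
{
	size_t i, n = 0;

	for (i = 0; i < nr_jobs; i++)
		if (!jobs[i].done && jobs[i].cookie > cookie)
			n++;
	return n;
}
```

In this model a job that started before the waiter took its cookie can stay unfinished indefinitely without blocking the waiter, which is the property the proposal relies on to avoid waiting backwards in time.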
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Monday 14 January 2013 11:47:57 Ming Lei wrote:
> [ 181.175323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 181.183624] modprobe D c04f1920 0 2462 2461 0x
> [ 181.183685] [c04f1920] (__schedule+0x5fc/0x6d4) from [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168)
> [ 181.183715] [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168) from [c005ed04] (async_synchronize_full+0x3c/0x60)
> [ 181.183776] [c005ed04] (async_synchronize_full+0x3c/0x60) from [c0085610] (load_module+0x1aac/0x1cdc)
> [ 181.183807] [c0085610] (load_module+0x1aac/0x1cdc) from [c0085944] (sys_init_module+0x104/0x110)
> [ 181.183837] [c0085944] (sys_init_module+0x104/0x110) from [c000dfe0] (ret_fast_syscall+0x0/0x48)
> [ 271.175506] INFO: task modprobe:2462 blocked for more than 90 seconds.
> [ 271.182373] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 271.190826] modprobe D c04f1920 0 2462 2461 0x
> [ 271.190887] [c04f1920] (__schedule+0x5fc/0x6d4) from [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168)
> [ 271.190917] [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168) from [c005ed04] (async_synchronize_full+0x3c/0x60)
> [ 271.190948] [c005ed04] (async_synchronize_full+0x3c/0x60) from [c0085610] (load_module+0x1aac/0x1cdc)
> [ 271.190948] [c0085610] (load_module+0x1aac/0x1cdc) from [c0085944] (sys_init_module+0x104/0x110)
> [ 271.190979] [c0085944] (sys_init_module+0x104/0x110) from [c000dfe0] (ret_fast_syscall+0x0/0x48)

OK, your trace is totally different. If your hangs are related, as is
likely, my explanation goes out of the window.

	Regards
		Oliver
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Mon, Jan 14, 2013 at 4:22 PM, Oliver Neukum oli...@neukum.org wrote:
> OK, your trace is totally different. If your hangs are related, as is
> likely, my explanation goes out of the window.

If I run 'shutdown' after unplugging the usb storage device, another
hang with the same trace as Alex's can be triggered too, so it should
be one and the same problem.

Thanks,
--
Ming Lei
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Mon, Jan 14, 2013 at 3:39 AM, Alan Stern st...@rowland.harvard.edu wrote:
> On Sun, 13 Jan 2013, Oliver Neukum wrote:
>> This is not a USB problem. You need to involve the SCSI people.
>> khubd just stops working because disconnects are processed in its
>> context and the removal deadlocks.
>
> Then why would building the deadline elevator as a module make any
> difference? Or does it make a difference?

Building the elevator as a module does make a difference: the system
is broken.

> Alex, if the elevator is made static instead, do you still see the
> same behavior when the USB drive is removed?

How can I make the elevator static? Or did you mean built-in? Or did
you mean to ask if khubd hangs if the deadline is built in? In that
case - no. The behavior is normal. Nothing hangs.

> Also, are there any mounted filesystems on the drive when you unplug
> it?

No, no auto-mount. The whole of userspace init is attached, and I'm
reasonably sure nothing of it mounts anything automatically. Nothing
of udev, too.

[Attachment: linuxrc-t (binary data)]
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sun, Jan 13, 2013 at 11:15 PM, Ming Lei ming@canonical.com wrote:
> The deadlock problem is caused by calling request_module() inside
> async function of do_scan_async(), and it was introduced by Linus's
> below commit:
>
>	commit d6de2c80e9d758d2e36c21699117db6178c0f517
>	Author: Linus Torvalds torva...@linux-foundation.org
>	Date: Fri Apr 10 12:17:41 2009 -0700
>
>	    async: Fix module loading async-work regression
>
> IMO, maybe the commit isn't a proper fix, considered the below fact:
>
> - it isn't good to allow async function to be marked as __init

Immaterial. For modules, __init is a non-issue. For non-modules, the
synchronization elsewhere is sufficient.

> - any user mode shouldn't expect that the device is ready just
>   after completing of 'insmod'

Bullshit. That expectation is just a fact. People insmod a device
driver, and mount the device immediately in scripts.

We do not say "user mode shouldn't". Seriously. EVER. User mode
*does*, and we deal with it. Learn it now, and stop ever saying that
again.

This is really starting to annoy me. Kernel developers who say "user
mode should be fixed to not do that" should go somewhere else. The
whole and *only* point of a kernel is to hide these kinds of issues
from user mode, and make things "just work" in user mode. User mode
should not ever worry about "oh, doing X can trigger a module load, so
now the device might not be available immediately, so I should delay
and re-try until it is".

That's just f*cking crazy talk.

We have a very simple rule in the kernel: we don't break user space. EVER.

Learn that rule. I don't ever want to hear "any user mode shouldn't
expect" again. User mode *does* expect. End of discussion.

> - from view of driver, introducing async_synchronize_full() after
>   do_one_initcall() inside do_init_module() is like a sync probe
>   for drivers built as module, and cause this kind of deadlock easily.
>
> So could we revert the commit and fix the previous problems just
> case by case? or other better fix?

There's no way in hell we take a "fix things one by one" approach.
It's not going to work. And your suggestion seems to not do async
discovery of devices in general, which is a *much* worse fix than
anything else. It's just crazy.

But there are other approaches we might take. We might move the call
to "async_synchronize_full();" to other places. For example, maybe
we're better off doing it at block/char device open instead?

	Linus
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Mon, 14 Jan 2013, Linus Torvalds wrote:
>> - from view of driver, introducing async_synchronize_full() after
>>   do_one_initcall() inside do_init_module() is like a sync probe
>>   for drivers built as module, and cause this kind of deadlock easily.
>>
>> So could we revert the commit and fix the previous problems just
>> case by case? or other better fix?
>
> There's no way in hell we take a "fix things one by one" approach.
> It's not going to work. And your suggestion seems to not do async
> discovery of devices in general, which is a *much* worse fix than
> anything else. It's just crazy.
>
> But there are other approaches we might take. We might move the call
> to "async_synchronize_full();" to other places. For example, maybe
> we're better off doing it at block/char device open instead?

How about skipping that call if the current thread is one of the async
helpers? Is it possible to detect when that happens?

Or maybe such a check should go inside async_synchronize_full() itself.

Alan Stern
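The "skip the call if the current thread is one of the async helpers" suggestion could look roughly like the following user-space model. It is a sketch under assumptions: struct task, the TASK_ASYNC_WORKER flag and synchronize_full_checked() are hypothetical names chosen for illustration, and the counter merely stands in for sleeping until async jobs finish.

```c
#include <stdbool.h>

#define TASK_ASYNC_WORKER 0x1u	/* hypothetical flag, not a kernel PF_* bit */

/* Minimal stand-in for a task descriptor. */
struct task {
	unsigned int flags;
};

static int outstanding;	/* unfinished async jobs (toy counter) */

/*
 * Flush all async work unless the caller is itself an async worker.
 * Returns true if a flush was performed, false if it was skipped
 * because the caller would otherwise wait for its own job.
 */
static bool synchronize_full_checked(const struct task *current_task)
{
	if (current_task->flags & TASK_ASYNC_WORKER)
		return false;
	outstanding = 0;	/* stand-in for sleeping until jobs finish */
	return true;
}
```

A check of this kind only sees the calling task. As noted elsewhere in the thread, a module load may be completed by an unrelated user-space process, so the task doing the flush is not necessarily the async worker that triggered the load.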
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Mon, Jan 14, 2013 at 10:04 AM, Alan Stern st...@rowland.harvard.edu wrote:
> How about skipping that call if the current thread is one of the async
> helpers? Is it possible to detect when that happens?
>
> Or maybe such a check should go inside async_synchronize_full() itself.

Do we have some idea of exactly what is waiting for what? Which async
context is causing the module load to happen in the first place? I
think *that* is what we should avoid - it sounds like the block layer
is loading the IO scheduler at the wrong point.

I realize that people like (for testing purposes) to change the IO
scheduler at random, but if that means that any IO can basically
result in a request_module(), then that sounds like a problem.

It seems to be elevator_get(), and I presume the chain is something
like "load block driver async, the block driver does
blk_init_allocated_queue, that does request_module() to find the
elevator, the request_module() succeeds, but ends up waiting for async
work, which is the block driver load, which is waiting for the
request_module to finish".

And my gut feel is that blk_init_allocated_queue() probably shouldn't
do that request_module() at all. We might want to do it when we *open*
the device, but not while loading the module for the device.

So my _feeling_ is that this is just a bug in the block layer, and
that it shouldn't set up block device drivers for this kind of crazy
"need to load the elevator module while in the middle of scanning
devices". I think *that* is what we should aim to change. Hmm?

That said, I think it might indeed be a good idea to make this problem
much easier to see, and that "detect when it happens" would be a good
thing (and then we should WARN_ON_ONCE() on people trying to do
request_module() calls from async context).

	Linus
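The presumed chain above is a cycle in a wait-for graph. A minimal way to see it, assuming each participant waits for at most one other, is to follow the chain and check whether it returns to its start. The node names and the function waits_on_itself() are hypothetical; this models the described chain and is not code from the block layer or the async core.

```c
#include <stdbool.h>

/* Participants in the presumed chain (labels are illustrative). */
enum { SCAN_JOB, REQUEST_MODULE, SYNC_FULL, NR_NODES };

#define NOBODY (-1)	/* this participant is not waiting on anyone */

/*
 * Follow the wait-for chain from start. Returns true if the chain
 * leads back to start, i.e. start is transitively waiting on itself.
 * The walk is bounded by n steps, so it also terminates on cycles
 * that do not include start.
 */
static bool waits_on_itself(const int waits_for[], int n, int start)
{
	int cur = waits_for[start];
	int steps;

	for (steps = 0; steps < n && cur != NOBODY; steps++) {
		if (cur == start)
			return true;
		cur = waits_for[cur];
	}
	return false;
}
```

With the elevator built as a module, the table is { REQUEST_MODULE, SYNC_FULL, SCAN_JOB }: the scan job waits for request_module(), the module load waits in async_synchronize_full(), and that flush waits for the scan job. With the elevator built in, the scan job waits on NOBODY and the chain never closes.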
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, Jan 15, 2013 at 1:30 AM, Linus Torvalds torva...@linux-foundation.org wrote:
> On Sun, Jan 13, 2013 at 11:15 PM, Ming Lei ming@canonical.com wrote:
>> The deadlock problem is caused by calling request_module() inside
>> async function of do_scan_async(), and it was introduced by Linus's
>> below commit:
>>
>>	commit d6de2c80e9d758d2e36c21699117db6178c0f517
>>	Author: Linus Torvalds torva...@linux-foundation.org
>>	Date: Fri Apr 10 12:17:41 2009 -0700
>>
>>	    async: Fix module loading async-work regression
>>
>> IMO, maybe the commit isn't a proper fix, considered the below fact:
>>
>> - it isn't good to allow async function to be marked as __init
>
> Immaterial. For modules, __init is a non-issue. For non-modules, the
> synchronization elsewhere is sufficient.

It looks like 5d38258ec026921a7b266f4047ebeaa75db358e5 (ACPI battery:
fix async boot oops) addresses the issue of __init for modules.

>> - any user mode shouldn't expect that the device is ready just
>>   after completing of 'insmod'
>
> Bullshit. That expectation is just a fact. People insmod a device
> driver, and mount the device immediately in scripts.

I mean we can let the device node be populated in probe() first, but
let open() wait for completion of the async probe().

Maybe my wording was not accurate. Here, 'device isn't ready' just
means that the async probe() hasn't completed; it doesn't mean that
the device node doesn't appear.

> We do not say "user mode shouldn't". Seriously. EVER. User mode
> *does*, and we deal with it. Learn it now, and stop ever saying that
> again.
>
> This is really starting to annoy me. Kernel developers who say "user
> mode should be fixed to not do that" should go somewhere else. The
> whole and *only* point of a kernel is to hide these kinds of issues
> from user mode, and make things "just work" in user mode. User mode
> should not ever worry about "oh, doing X can trigger a module load, so
> now the device might not be available immediately, so I should delay
> and re-try until it is".
>
> That's just f*cking crazy talk.
>
> We have a very simple rule in the kernel: we don't break user space. EVER.

No, I don't mean we should break user space, see above.

> Learn that rule. I don't ever want to hear "any user mode shouldn't
> expect" again. User mode *does* expect. End of discussion.
>
>> - from view of driver, introducing async_synchronize_full() after
>>   do_one_initcall() inside do_init_module() is like a sync probe
>>   for drivers built as module, and cause this kind of deadlock easily.
>>
>> So could we revert the commit and fix the previous problems just
>> case by case? or other better fix?
>
> There's no way in hell we take a "fix things one by one" approach.
> It's not going to work. And your suggestion seems to not do async
> discovery of devices in general, which is a *much* worse fix than
> anything else. It's just crazy.

I will try to work out a patch that addresses the scsi block async
probe issue first, and see if it can fix the problem by moving
add_disk() into sd_probe() and calling
async_synchronize_full_domain(&scsi_sd_probe_domain) at the entry of
sd_open().

> But there are other approaches we might take. We might move the call
> to "async_synchronize_full();" to other places. For example, maybe
> we're better off doing it at block/char device open instead?

That looks similar to the idea above, but we would have to remove the
async_synchronize_full() in do_init_module() as well.

Thanks,
--
Ming Lei
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Tue, Jan 15, 2013 at 9:53 AM, Ming Lei ming@canonical.com wrote:
> I will try to figure out one patch to address the scsi block async
> probe issue first, and see if it can fix the problem by moving
> add_disk() into sd_probe() and calling
> async_synchronize_full_domain(&scsi_sd_probe_domain) in the entry
> of sd_open().

It looks like that isn't doable, because the block partition device
can only be created inside the async code.

But I have another idea to address the problem: let the module code
call async_synchronize_full() only if the module explicitly requires
it. How about the draft patch below?

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7992635..c5106a0 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3143,6 +3143,8 @@ static int __init init_sd(void)
 	if (err)
 		goto err_out_driver;
 
+	mod_init_async_wait(THIS_MODULE);
+
 	return 0;
 
 err_out_driver:
diff --git a/include/linux/module.h b/include/linux/module.h
index 7760c6d..09bd4c5 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -300,6 +300,12 @@ struct module
 
 	unsigned int taints;	/* same bits as kernel:tainted */
 
+	/*
+	 * set if the module wants to call async_synchronize_full
+	 * after its init() is completed.
+	 */
+	unsigned int init_async_wait:1;
+
 #ifdef CONFIG_GENERIC_BUG
 	/* Support for BUG */
 	unsigned num_bugs;
@@ -656,4 +662,16 @@ static inline void module_bug_finalize(const Elf_Ehdr *hdr,
 static inline void module_bug_cleanup(struct module *mod) {}
 #endif	/* CONFIG_GENERIC_BUG */
 
+/*
+ * If one module wants to complete its all async code after
+ * its init() executed, the module can call this function in
+ * the entry of its init(), but the module's async function
+ * can't call request_module, otherwise deadlock will be caused.
+ */
+static inline void mod_init_async_wait(struct module *mod)
+{
+	if (mod)
+		mod->init_async_wait = 1;
+}
+
 #endif /* _LINUX_MODULE_H */
diff --git a/kernel/module.c b/kernel/module.c
index 250092c..dc5d011 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3058,8 +3058,9 @@ static int do_init_module(struct module *mod)
 	blocking_notifier_call_chain(&module_notify_list,
 				     MODULE_STATE_LIVE, mod);
 
-	/* We need to finish all async code before the module init sequence is done */
-	async_synchronize_full();
+	/* Only complete all async code if the module requires that */
+	if (mod->init_async_wait)
+		async_synchronize_full();
 
 	mutex_lock(&module_mutex);
 	/* Drop initial reference. */

Thanks,
--
Ming Lei
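The control flow of the draft patch can be tried out in user space with a toy model. This is a sketch under assumptions: struct toy_module and the toy_* functions are hypothetical stand-ins that mirror only the opt-in flag from the draft, and the counter replaces the real async_synchronize_full().

```c
/* Toy module descriptor; mirrors only the opt-in bit from the draft. */
struct toy_module {
	unsigned int init_async_wait:1;
	int (*init)(struct toy_module *self);
};

static int sync_calls;	/* counts stand-in flushes */

/* Stand-in for async_synchronize_full(). */
static void toy_synchronize_full(void)
{
	sync_calls++;
}

/* Same shape as mod_init_async_wait() in the draft patch. */
static void toy_mod_init_async_wait(struct toy_module *mod)
{
	if (mod)
		mod->init_async_wait = 1;
}

/* Run init(), then flush only if the module asked for it. */
static int toy_do_init_module(struct toy_module *mod)
{
	int ret = mod->init ? mod->init(mod) : 0;

	if (ret)
		return ret;
	if (mod->init_async_wait)
		toy_synchronize_full();
	return 0;
}

/* A module that creates device nodes asynchronously opts in. */
static int disk_like_init(struct toy_module *self)
{
	toy_mod_init_async_wait(self);
	return 0;
}

/* Any other module leaves the flag clear and is never flushed. */
static int other_init(struct toy_module *self)
{
	(void)self;
	return 0;
}
```

Under this model only modules that opt in pay for the flush, so a module loaded from inside an async job (such as an elevator) does not wait on the job that requested it. The restriction stated in the patch comment still applies to any module that sets the flag.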
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sat, Jan 12, 2013 at 11:52 PM, Alan Stern st...@rowland.harvard.edu wrote:
> On Sat, 12 Jan 2013, Alex Riesen wrote:
>> Now, who would be interested to handle this kind of
>> misconfiguration ...
>
> So the whole thing was a false alarm?

Yes, almost. What about khubd hanging when machine is shutdown?

> Maybe you should report to the block-layer maintainers that it's
> possible to mess up the system by building an elevator as a module.
> That sounds like the sort of thing they'd be interested to hear.

Hi Jens,

may I point you at this problem report:

	http://thread.gmane.org/gmane.linux.kernel/1420814

It is surely a misconfiguration on my part (the used io scheduler
configured as a module), but the behavior is somewhat problematic
anyway: at least in this case USB storage is essentially locked up.

Regards,
Alex
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sun, 13 Jan 2013, Alex Riesen wrote:
> On Sat, Jan 12, 2013 at 11:52 PM, Alan Stern st...@rowland.harvard.edu wrote:
>> On Sat, 12 Jan 2013, Alex Riesen wrote:
>>> Now, who would be interested to handle this kind of
>>> misconfiguration ...
>>
>> So the whole thing was a false alarm?
>
> Yes, almost. What about khubd hanging when machine is shutdown?

What about it? I have trouble understanding all the descriptions you
have provided so far, because you talk about several different things
and change your mind a lot.

Can you provide a single, simple scenario that illustrates this
problem?

Alan Stern
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern st...@rowland.harvard.edu wrote:
> On Sun, 13 Jan 2013, Alex Riesen wrote:
>> Yes, almost. What about khubd hanging when machine is shutdown?
>
> What about it? I have trouble understanding all the descriptions you
> have provided so far, because you talk about several different things
> and change your mind a lot.
>
> Can you provide a single, simple scenario that illustrates this
> problem?

1. Compile a kernel with deadline elevator as module
2. Boot into it, make sure the elevator is selected (I used
   elevator=deadline in the kernel command line)
3. Insert a FAT formatted mass storage device in a USB2 port.
   Observe "io scheduler deadline registered"
4. Pull the stick out, wait a moment, and either shutdown or just
   press alt-sysrq-W:

[ 158.170585] usb 1-1.2: USB disconnect, device number 3
[ 158.170590] usb 1-1.2: unregistering device
[ 158.170595] usb 1-1.2: unregistering interface 1-1.2:1.0
[ 166.959398] SysRq : Show Blocked State
[ 166.959410] task PC stack pid father
[ 166.959432] khubd D 880213a68000 0 513 2 0x
[ 166.959440] 880213affa18 0046 8802006b 0
[ 166.959448] 880213a68000 880213afffd8 880213afffd8 00013
[ 166.959454] 81a14400 880213a68000 880213aff9a8 0
[ 166.959461] Call Trace:
[ 166.959475] [8104d763] ? flush_work+0x6d/0x1fe
[ 166.959485] [8133defb] ? scsi_remove_host+0x24/0x10e
[ 166.959490] [8104d6fb] ? flush_work+0x5/0x1fe
[ 166.959499] [815e1dd6] schedule+0x65/0x67
[ 166.959506] [815e201e] schedule_preempt_disabled+0x18/0x24
[ 166.959513] [815e07e4] mutex_lock_nested+0x181/0x2c1
[ 166.959518] [8133defb] ? scsi_remove_host+0x24/0x10e
[ 166.959524] [8133defb] scsi_remove_host+0x24/0x10e
[ 166.959531] [813910d5] usb_stor_disconnect+0x77/0xbc
[ 166.959539] [81377ca3] usb_unbind_interface+0x6c/0x14d
[ 166.959548] [813266fc] __device_release_driver+0x88/0xdb
[ 166.959554] [81326774] device_release_driver+0x25/0x32
[ 166.959561] [8132616f] bus_remove_device+0xf5/0x10a
[ 166.959567] [8132413f] device_del+0x12e/0x189
[ 166.959574] [81375d3a] usb_disable_device+0xb1/0x20e
[ 166.959582] [8136ed8b] usb_disconnect+0xab/0x113
[ 166.959589] [81370218] hub_port_connect_change+0x1b0/0x879
[ 166.959597] [81370e3a] hub_events+0x559/0x69d
[ 166.959604] [81370fb6] hub_thread+0x38/0x19b
[ 166.959612] [81052587] ? wake_up_bit+0x2a/0x2a
[ 166.959618] [81370f7e] ? hub_events+0x69d/0x69d
[ 166.959625] [81051f2a] kthread+0xd5/0xdd
[ 166.959632] [8105d5f6] ? finish_task_switch+0x3f/0xf7
[ 166.959641] [81051e55] ? __init_kthread_worker+0x5a/0x5a
[ 166.959648] [815e965c] ret_from_fork+0x7c/0xb0
[ 166.959655] [81051e55] ? __init_kthread_worker+0x5a/0x5a

This trace is from alt-sysrq-W. I can attach an image from the
shutdown case; the traces from that case are hard to save because the
main storage is usually already stopped. I believe it was the same,
though.
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sunday 13 January 2013 18:42:49 Alex Riesen wrote:
> On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern st...@rowland.harvard.edu wrote:
>> On Sun, 13 Jan 2013, Alex Riesen wrote:
>>> Yes, almost. What about khubd hanging when machine is shutdown?
>>
>> What about it? I have trouble understanding all the descriptions you
>> have provided so far, because you talk about several different things
>> and change your mind a lot.
>>
>> Can you provide a single, simple scenario that illustrates this
>> problem?
>
> 1. Compile a kernel with deadline elevator as module
> 2. Boot into it, make sure the elevator is selected (I used
>    elevator=deadline in the kernel command line)
> 3. Insert a FAT formatted mass storage device in an USB2 port
>    Observe "io scheduler deadline registered"
> 4. Pull the stick out, wait a moment, and either shutdown or just
>    press alt-sysrq-W:

That makes it clear. The elevator probably has scheduled work which
cannot finish waiting on a lock and scsi_remove_host() wants to flush
work.

This is not a USB problem. You need to involve the SCSI people.
khubd just stops working because disconnects are processed in its
context and the removal deadlocks.

	Regards
		Oliver
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sun, 13 Jan 2013, Oliver Neukum wrote:
> On Sunday 13 January 2013 18:42:49 Alex Riesen wrote:
>> On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern st...@rowland.harvard.edu wrote:
>>> On Sun, 13 Jan 2013, Alex Riesen wrote:
>>>> Yes, almost. What about khubd hanging when machine is shutdown?
>>>
>>> What about it? I have trouble understanding all the descriptions
>>> you have provided so far, because you talk about several different
>>> things and change your mind a lot.
>>>
>>> Can you provide a single, simple scenario that illustrates this
>>> problem?
>>
>> 1. Compile a kernel with deadline elevator as module
>> 2. Boot into it, make sure the elevator is selected (I used
>>    elevator=deadline in the kernel command line)
>> 3. Insert a FAT formatted mass storage device in an USB2 port
>>    Observe "io scheduler deadline registered"
>> 4. Pull the stick out, wait a moment, and either shutdown or just
>>    press alt-sysrq-W:

Indeed. I just tried booting into a kernel that has the deadline
elevator built-in, not a module. Even then, when I specified
elevator=deadline on the boot command line, the system hung up partway
through booting. Hard to tell exactly where, because it occurred
shortly after the switching from VGA to the framebuffer driver, so the
screen was completely blank. When I get a chance, I'll try it on
another machine where I can use a serial console.

> That makes it clear. The elevator probably has scheduled work which
> cannot finish waiting on a lock and scsi_remove_host() wants to flush
> work.

What is the work and why can't it finish? Or rather, how can we
figure these things out? According to what Alex wrote, the blocked
task doesn't show up in the Alt-SysRq-W listing.

And don't forget that the listing shows scsi_remove_host() blocks
waiting to acquire the host's scan_mutex. Not waiting for work to be
flushed. This casts doubt on your explanation.

> This is not a USB problem. You need to involve the SCSI people.
> khubd just stops working because disconnects are processed in its
> context and the removal deadlocks.

Then why would building the deadline elevator as a module make any
difference? Or does it make a difference?

Alex, if the elevator is made static instead, do you still see the
same behavior when the USB drive is removed?

Also, are there any mounted filesystems on the drive when you unplug
it?

Alan Stern
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Mon, Jan 14, 2013 at 1:42 AM, Alex Riesen raa.l...@gmail.com wrote:
> 1. Compile a kernel with deadline elevator as module
> 2. Boot into it, make sure the elevator is selected (I used
>    elevator=deadline in the kernel command line)
> 3. Insert a FAT formatted mass storage device in an USB2 port
>    Observe "io scheduler deadline registered"
> 4. Pull the stick out, wait a moment, and either shutdown or just
>    press alt-sysrq-W:

I can reproduce the problem too on one ehci-only system (Pandaboard)
with the deadline elevator built as a module, and there is no such
problem in the built-in case. This is still on 3.8-rc3.

The dmesg log follows:

[ 85.665679] usb 1-1.2.2: new high-speed USB device number 5 using ehci-omap
[ 85.784423] usb 1-1.2.2: default language 0x0409
[ 85.790008] usb 1-1.2.2: udev 5, busnum 1, minor = 4
[ 85.790039] usb 1-1.2.2: New USB device found, idVendor=0951, idProduct=1624
[ 85.790039] usb 1-1.2.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 85.790069] usb 1-1.2.2: Product: DataTraveler G2
[ 85.790069] usb 1-1.2.2: Manufacturer: Kingston
[ 85.790100] usb 1-1.2.2: SerialNumber: 0019E06C5346EA41D071
[ 85.790100] device: '1-1.2.2': device_add
[ 85.790344] bus: 'usb': add device 1-1.2.2
[ 85.790405] PM: Adding info for usb:1-1.2.2
[ 85.790740] bus: 'usb': driver_probe_device: matched device 1-1.2.2 with driver usb
[ 85.790771] bus: 'usb': really_probe: probing driver usb with device 1-1.2.2
[ 85.790802] usb 1-1.2.2: usb_probe_device
[ 85.790832] usb 1-1.2.2: configuration #1 chosen from 1 choice
[ 85.791076] usb 1-1.2.2: adding 1-1.2.2:1.0 (config #1, interface 0)
[ 85.791076] device: '1-1.2.2:1.0': device_add
[ 85.791137] bus: 'usb': add device 1-1.2.2:1.0
[ 85.791168] PM: Adding info for usb:1-1.2.2:1.0
[ 85.791442] device: 'ep_81': device_add
[ 85.791564] PM: Adding info for No Bus:ep_81
[ 85.791564] device: 'ep_02': device_add
[ 85.791687] PM: Adding info for No Bus:ep_02
[ 85.791687] driver: '1-1.2.2': driver_bound: bound to device 'usb'
[ 85.791717] bus: 'usb': really_probe: bound device 1-1.2.2 to driver usb
[ 85.791748] PM: Moving platform:musb-hdrc.0.auto to end of list
[ 85.791748] device: 'ep_00': device_add
[ 85.791778] platform musb-hdrc.0.auto: Retrying from deferred list
[ 85.791839] PM: Adding info for No Bus:ep_00
[ 85.791839] bus: 'platform': driver_probe_device: matched device musb-hdrc.0.auto with driver musb-hdrc
[ 85.791839] bus: 'platform': really_probe: probing driver musb-hdrc with device musb-hdrc.0.auto
[ 85.791870] hub 1-1.2:1.0: state 7 ports 4 chg evt 0004
[ 85.791900] unable to find transceiver of type USB2 PHY
[ 85.797454] HS USB OTG: no transceiver configured
[ 85.802703] musb-hdrc musb-hdrc.0.auto: musb_init_controller failed with status -517
[ 85.811157] platform musb-hdrc.0.auto: Driver musb-hdrc requests probe deferral
[ 85.811187] platform musb-hdrc.0.auto: Added to deferred list
[ 85.811218] PM: Moving platform:twl6030_usb to end of list
[ 85.811218] platform twl6030_usb: Retrying from deferred list
[ 85.811279] bus: 'platform': driver_probe_device: matched device twl6030_usb with driver twl6030_usb
[ 85.811279] bus: 'platform': really_probe: probing driver twl6030_usb with device twl6030_usb
[ 85.811309] twl6030_usb twl6030_usb: phy not ready, deferring probe
[ 85.811462] platform twl6030_usb: Driver twl6030_usb requests probe deferral
[ 85.811462] platform twl6030_usb: Added to deferred list
[ 85.883331] Initializing USB Mass Storage driver...
[ 85.883361] bus: 'usb': add driver usb-storage
[ 85.883453] bus: 'usb': driver_probe_device: matched device 1-1.2.2:1.0 with driver usb-storage
[ 85.883483] bus: 'usb': really_probe: probing driver usb-storage with device 1-1.2.2:1.0
[ 85.883514] usb-storage 1-1.2.2:1.0: usb_probe_interface
[ 85.883544] usb-storage 1-1.2.2:1.0: usb_probe_interface - got id
[ 85.884094] scsi0 : usb-storage 1-1.2.2:1.0
[ 85.884155] device: 'host0': device_add
[ 85.884185] bus: 'scsi': add device host0
[ 85.884246] PM: Adding info for scsi:host0
[ 85.884552] device: 'host0': device_add
[ 85.884674] PM: Adding info for No Bus:host0
[ 85.884948] driver: '1-1.2.2:1.0': driver_bound: bound to device 'usb-storage'
[ 85.884979] bus: 'usb': really_probe: bound device 1-1.2.2:1.0 to driver usb-storage
[ 85.884979] PM: Moving platform:musb-hdrc.0.auto to end of list
[ 85.885009] platform musb-hdrc.0.auto: Retrying from deferred list
[ 85.885070] bus: 'platform': driver_probe_device: matched device musb-hdrc.0.auto with driver musb-hdrc
[ 85.885070] bus: 'platform': really_probe: probing driver musb-hdrc with device musb-hdrc.0.auto
[ 85.885131] unable to find transceiver of type USB2 PHY
[ 85.886230] usbcore: registered new interface driver usb-storage
[ 85.886230] USB Mass Storage support registered.
[ 85.890655] HS USB OTG: no transceiver configured
[ 85.895660] musb-hdrc musb-hdrc.0.auto:
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On 12 January 2013 15:48:59, Alex Riesen wrote:
> On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:
>> Hi,
>>
>> the USB stick (a Cruzer Titanium 2GB) was not recognized at any of
>> the USB ports of this system (a System76 lemu4 laptop, XHCI device)
>> after it was removed. If I attempt to insert it again in any of the
>> ports (one of the two USB3, or the USB2) the led on the stick lights
>> up shortly and is off again. There are no media detection messages
>> in the dmesg output, only that from the first time:
>>
>> usb 1-1.2: new high-speed USB device number 3 using ehci-pci
>> usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
>> usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
>> usb 1-1.2: Product: U3 Titanium
>> usb 1-1.2: Manufacturer: SanDisk Corporation
>> usb 1-1.2: SerialNumber: 187A3A60F1E9
>> scsi6 : usb-storage 1-1.2:1.0
>> io scheduler deadline registered (default)
>> usb 1-1.2: USB disconnect, device number 3
>>
>> The kernel is v3.8-rc3. I never had this problem in 3.7.
>>
>> I could almost reproduce the problem later in a simplified setup
>> (init=/bin/bash) on USB3 ports by inserting and removing the stick
>> quickly. Almost - because the USB3 ports recovered after some time,
>> while the USB2 port never experienced the problem.
>
> One more detail: I usually use the noop elevator. That time it was
> the deadline. And I just reproduced it easily with deadline.

Can you provide the output of dmesg with CONFIG_USB_DEBUG? This will
be helpful.

--
Best regards
Tianyu Lan
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sat, 12 Jan 2013, Alex Riesen wrote:
> On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:
> > Hi, the USB stick (a Cruzer Titanium 2GB) was not recognized at any of the USB ports of this system (a System76 lemu4 laptop, XHCI device) after it was removed. If I attempt to insert it again in any of the ports (one of the two USB3, or the USB2) the LED on the stick lights up briefly and is off again. There are no media detection messages in the dmesg output, only those from the first time:

To make testing simpler, use only the USB-2 ports. The xHCI driver is not as mature as the EHCI driver.

> > usb 1-1.2: new high-speed USB device number 3 using ehci-pci
> > usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
> > usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> > usb 1-1.2: Product: U3 Titanium
> > usb 1-1.2: Manufacturer: SanDisk Corporation
> > usb 1-1.2: SerialNumber: 187A3A60F1E9
> > scsi6 : usb-storage 1-1.2:1.0
> > io scheduler deadline registered (default)
> > usb 1-1.2: USB disconnect, device number 3
> >
> > The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost reproduce the problem later in a simplified setup (init=/bin/bash) on USB3 ports by inserting and removing the stick quickly. Almost - because the USB3 ports recovered after some time, while the USB2 port never experienced the problem.

For testing, use a kernel with CONFIG_USB_DEBUG and CONFIG_PRINTK_TIME enabled. Do the following: After a normal boot, run dmesg -C to clear the log buffer. Then plug in the stick. After a couple of seconds, type Alt-SysRq-W. Then unplug the stick. After a couple of seconds, type Alt-SysRq-W again. Then collect the output from dmesg and post it.

> One more detail: I usually use the noop elevator. That time it was the deadline. And I just reproduced it easily with deadline.

I doubt the elevator has anything to do with this.

Alan Stern
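The capture procedure described above can also be scripted. What follows is only a minimal sketch in Python, assuming root privileges and a dmesg that supports -C; writing "w" to /proc/sysrq-trigger requests the same blocked-task dump as Alt-SysRq-W. The function names and the output file name are made up for illustration and are not part of the thread.

```python
import subprocess
import time


def clear_log():
    """Clear the kernel log buffer, like `dmesg -C` (needs root)."""
    subprocess.run(["dmesg", "-C"], check=True)


def dump_blocked_tasks(trigger="/proc/sysrq-trigger"):
    """Ask the kernel to log blocked tasks, like Alt-SysRq-W (needs root)."""
    with open(trigger, "w") as f:
        f.write("w")


def read_log():
    """Return the current kernel log buffer as text."""
    result = subprocess.run(["dmesg"], check=True, capture_output=True, text=True)
    return result.stdout


def capture(outfile="dmesg-usb.txt", settle=2.0):
    """Plug, SysRq-W, unplug, SysRq-W, then save the log (hypothetical helper)."""
    clear_log()
    input("Plug in the stick, then press Enter: ")
    time.sleep(settle)  # "after a couple of seconds"
    dump_blocked_tasks()
    input("Unplug the stick, then press Enter: ")
    time.sleep(settle)
    dump_blocked_tasks()
    with open(outfile, "w") as f:
        f.write(read_log())
```

Calling capture() as root walks through the steps interactively and leaves the log in dmesg-usb.txt, ready to compress and attach.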
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
> On Sat, 12 Jan 2013, Alex Riesen wrote:
> > One more detail: I usually use the noop elevator. That time it was the deadline. And I just reproduced it easily with deadline.
>
> I doubt the elevator has anything to do with this.

But it looks like it does: just using the deadline elevator is a sure way to reproduce the bug. The system always recovers (sometimes after a while) with noop.
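When comparing elevators like this, it helps to confirm which one is actually active on the stick. A small sketch, assuming the usual sysfs layout in which /sys/block/<dev>/queue/scheduler lists the available elevators with the active one in square brackets; the device name sdb is only a placeholder.

```python
def active_scheduler(text):
    """Return the bracketed (active) entry of a sysfs scheduler line, or None."""
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None


def read_active_scheduler(dev="sdb"):
    """Read the active elevator for a block device; `dev` is a placeholder."""
    with open(f"/sys/block/{dev}/queue/scheduler") as f:
        return active_scheduler(f.read())


print(active_scheduler("noop [deadline] cfq\n"))  # prints: deadline
```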
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
> On Sat, 12 Jan 2013, Alex Riesen wrote:
> > On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:
> > > the USB stick (a Cruzer Titanium 2GB) was not recognized at any of the USB ports of this system (a System76 lemu4 laptop, XHCI device) after it was removed. [...]
>
> To make testing simpler, use only the USB-2 ports. The xHCI driver is not as mature as the EHCI driver.

I used the USB2 port, but enabled debugging for xHCI too, because, as you say, it is less mature and it sits in the same machine. And there are some traces from it, even though I didn't touch the USB3 ports. Might be unrelated, but just in case...

> > > The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost

For the record, I just retested: the problem persists with 3.7.1.

> > > reproduce the problem later in a simplified setup (init=/bin/bash) on USB3 ports by inserting and removing the stick quickly. Almost - because the USB3 ports recovered after some time, while the USB2 port never experienced the problem.
>
> For testing, use a kernel with CONFIG_USB_DEBUG and CONFIG_PRINTK_TIME enabled. Do the following: After a normal boot, run dmesg -C to clear the log buffer. Then plug in the stick. After a couple of seconds, type Alt-SysRq-W. Then unplug the stick. After a couple of seconds, type Alt-SysRq-W again. Then collect the output from dmesg and post it.

Attached. The remount in the middle is me remounting a SATA device to save the dmesg output in case the system crashes hard.

dmesg2.bz2
Description: BZip2 compressed data
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sat, Jan 12, 2013 at 8:39 PM, Alex Riesen raa.l...@gmail.com wrote:
> On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
> > On Sat, 12 Jan 2013, Alex Riesen wrote:
> > > One more detail: I usually use the noop elevator. That time it was the deadline. And I just reproduced it easily with deadline.
> >
> > I doubt the elevator has anything to do with this.
>
> But it looks like it does: just using the deadline elevator is a sure way to reproduce the bug. The system always recovers (sometimes after a while) with noop.

And no, it does not. Not by itself, but the fact that the deadline elevator was compiled as a module certainly helped! This explains the hanging modprobe in the dmesg output (the part after device connect).

I still wonder why it didn't freeze at boot while mounting the SATA devices (the root, /var, and /home are on an SSD connected by SATA). And why was khubd hanging at reboot?

Anyway, building the elevator into the kernel avoids the problem. Sorry for not spotting this earlier. Now, who would be interested in handling this kind of misconfiguration ...
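Whether an elevator was built as a module can be read off the kernel configuration. The following is a rough sketch in Python that scans config text for CONFIG_IOSCHED_* symbols set to m; where the config lives (commonly /boot/config-<release> or /proc/config.gz) depends on the distribution and is an assumption here, and the sample fragment is illustrative, not taken from the reporter's system.

```python
import re


def iosched_build_modes(config_text):
    """Map each CONFIG_IOSCHED_* symbol to 'y' (built in) or 'm' (module)."""
    pattern = re.compile(r"^(CONFIG_IOSCHED_\w+)=([ym])$", re.MULTILINE)
    return dict(pattern.findall(config_text))


def modular_elevators(config_text):
    """Return the I/O scheduler symbols that are built as modules."""
    modes = iosched_build_modes(config_text)
    return sorted(name for name, mode in modes.items() if mode == "m")


# Illustrative config fragment (hypothetical values).
sample = """\
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_IOSCHED="cfq"
"""

print(modular_elevators(sample))  # prints: ['CONFIG_IOSCHED_DEADLINE']
```

A non-empty result for the elevator in use is the situation described above, where switching to it requires loading a module.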
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Sat, 12 Jan 2013, Alex Riesen wrote:
> On Sat, Jan 12, 2013 at 8:39 PM, Alex Riesen raa.l...@gmail.com wrote:
> > On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
> > > On Sat, 12 Jan 2013, Alex Riesen wrote:
> > > > One more detail: I usually use the noop elevator. That time it was the deadline. And I just reproduced it easily with deadline.
> > >
> > > I doubt the elevator has anything to do with this.
> >
> > But it looks like it does: just using the deadline elevator is a sure way to reproduce the bug. The system always recovers (sometimes after a while) with noop.
>
> And no, it does not. Not by itself, but the fact that the deadline elevator was compiled as a module certainly helped! This explains the hanging modprobe in the dmesg output (the part after device connect).
>
> I still wonder why it didn't freeze at boot while mounting the SATA devices (the root, /var, and /home are on an SSD connected by SATA). And why was khubd hanging at reboot?
>
> Anyway, building the elevator into the kernel avoids the problem. Sorry for not spotting this earlier. Now, who would be interested in handling this kind of misconfiguration ...

So the whole thing was a false alarm?

Maybe you should report to the block-layer maintainers that it's possible to mess up the system by building an elevator as a module. That sounds like the sort of thing they'd be interested to hear.

Alan Stern
Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds
On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:
> Hi, the USB stick (a Cruzer Titanium 2GB) was not recognized at any of the USB ports of this system (a System76 lemu4 laptop, XHCI device) after it was removed. If I attempt to insert it again in any of the ports (one of the two USB3, or the USB2) the LED on the stick lights up briefly and is off again. There are no media detection messages in the dmesg output, only those from the first time:
>
> usb 1-1.2: new high-speed USB device number 3 using ehci-pci
> usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
> usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> usb 1-1.2: Product: U3 Titanium
> usb 1-1.2: Manufacturer: SanDisk Corporation
> usb 1-1.2: SerialNumber: 187A3A60F1E9
> scsi6 : usb-storage 1-1.2:1.0
> io scheduler deadline registered (default)
> usb 1-1.2: USB disconnect, device number 3
>
> The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost reproduce the problem later in a simplified setup (init=/bin/bash) on USB3 ports by inserting and removing the stick quickly. Almost - because the USB3 ports recovered after some time, while the USB2 port never experienced the problem.

One more detail: I usually use the noop elevator. That time it was the deadline. And I just reproduced it easily with deadline.