Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-16 Thread Tejun Heo
Hello, Alan.

On Tue, Jan 15, 2013 at 11:01:15PM -0500, Alan Stern wrote:
  The current domain implementation is somewhere in between.  It's not
  a completely simplistic system and at the same time not developed enough
  to do properly stacked flushing.
 
 I like your idea of chronological synchronization: Insist that anybody
 who wants to flush async jobs must get a cookie, and then only allow
 them to wait for async jobs started after the cookie was issued.
 
 I don't know if this is possible with the current implementation.  It 
 would require changing every call to async_synchronize_*(), and in a 
 nontrivial way.  But it might provide a proper solution to all these 
 problems.

The problem here is that "flush everything which comes before me" is
used to order async jobs.  E.g. after async jobs probe the hardware,
they order themselves by flushing before registering the devices, so
unless we build accurate flushing dependencies, those dependencies will
reach beyond the time window we're interested in and bring in deadlocks.

And, as Linus pointed out, tracking dependencies through
request_module() is tricky no matter what we do.  I think it can be
done by matching the ones calling request_module() and the ones
actually loading modules, but it's gonna be nasty.

There aren't too many users of async anyway, so changing stuff
shouldn't be too difficult, but I think the simplicity, or dumbness, is
one of the major attractions of async, so it'd be nice to keep things
that way, and the PF_USED_ASYNC hack seems to be able to hold things
together for now.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-16 Thread Alan Stern
On Wed, 16 Jan 2013, Tejun Heo wrote:

 Hello, Alan.
 
 On Tue, Jan 15, 2013 at 11:01:15PM -0500, Alan Stern wrote:
   The current domain implementation is somewhere in between.  It's not
   a completely simplistic system and at the same time not developed enough
   to do properly stacked flushing.
  
  I like your idea of chronological synchronization: Insist that anybody
  who wants to flush async jobs must get a cookie, and then only allow
  them to wait for async jobs started after the cookie was issued.
  
  I don't know if this is possible with the current implementation.  It 
  would require changing every call to async_synchronize_*(), and in a 
  nontrivial way.  But it might provide a proper solution to all these 
  problems.
 
 The problem here is that "flush everything which comes before me" is
 used to order async jobs.  e.g. after async jobs probe the hardware
 they order themselves by flushing before registering them, so unless

I don't fully understand this example.  What is the point -- to make 
sure that asynchronously probed devices are registered in the order of 
their discovery?

If so, here's how to do it safely: Start up the async jobs in reverse
order of discovery.  Have each job acquire a cookie when it starts.  
Then each job needs to wait only for tasks that started after its
cookie was issued.
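
A toy userspace model of this scheme, not kernel code, may make the
ordering easier to see.  It assumes that cookies increase in start order
and that "waiting" can be modelled as a job refusing to register until
every job with a larger cookie has registered.  struct probe_job,
start_jobs_reversed(), may_register() and run() are hypothetical names
used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: one probe job per discovered device (index = discovery order). */
struct probe_job {
	unsigned cookie;	/* issued when the job starts; later start = larger */
	int probed;		/* hardware probe has finished */
	int registered;		/* device has been registered */
};

static unsigned next_cookie = 1;

/* Start jobs in reverse discovery order: the last device gets the first cookie. */
static void start_jobs_reversed(struct probe_job *jobs, size_t n)
{
	assert(jobs != NULL);
	for (size_t i = n; i-- > 0; ) {
		jobs[i].cookie = next_cookie++;
		jobs[i].probed = 0;
		jobs[i].registered = 0;
	}
}

/* A job may register once every job that started after it has registered. */
static int may_register(const struct probe_job *jobs, size_t n, size_t self)
{
	if (!jobs[self].probed || jobs[self].registered)
		return 0;
	for (size_t i = 0; i < n; i++)
		if (jobs[i].cookie > jobs[self].cookie && !jobs[i].registered)
			return 0;
	return 1;
}

/*
 * Let the probes finish in finish_order[], registering whatever becomes
 * eligible after each completion.  Device indices are written to out[]
 * in registration order; the number of registered devices is returned.
 */
static size_t run(struct probe_job *jobs, size_t n,
		  const size_t *finish_order, size_t *out)
{
	size_t done = 0;

	for (size_t step = 0; step < n; step++) {
		int progress = 1;

		jobs[finish_order[step]].probed = 1;
		while (progress) {
			progress = 0;
			for (size_t i = 0; i < n; i++) {
				if (may_register(jobs, n, i)) {
					jobs[i].registered = 1;
					out[done++] = i;
					progress = 1;
				}
			}
		}
	}
	return done;
}
```

With three devices whose probes finish in the order 2, 0, 1, run()
fills out[] with 0, 1, 2: registration follows discovery order, and no
job ever waits for one that started before it.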

 we build accurate flushing dependencies, those dependencies will reach
 beyond the time window we're interested in and bring in deadlocks.

The flushing-dependency principle can be very simple: No async task
should ever have to wait for another async task that started before it.  
The cookie approach satisfies this requirement (unless an earlier 
task passes its cookie to a later task or subverts the mechanism in 
another way).

 And, as Linus pointed out, tracking dependency through
 request_module() is tricky no matter what we do.  I think it can be
 done by matching the ones calling request_module() and the ones
 actually loading modules but it's gonna be nasty.

This shouldn't matter.  Dependencies don't need to be tracked
explicitly, because we know that any async work done by
request_module() must start _after_ request_module() is called.  Thus,
if async task A calls request_module(), which starts up async task B,
then we know that A can safely wait for B and B cannot safely wait for
A.

 There aren't too many which use async anyway so changing stuff
 shouldn't be too difficult but I think the simplicity or dumbness is
 one of major attractions of async, so it'd be nice to keep things that
 way and the PF_USED_ASYNC hack seems to be able to hold things
 together for now.

Nesting won't matter for the chronological approach.  I really think 
you should consider it more fully.  It's not a hack, and it doesn't 
need to be complicated.

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-16 Thread Tejun Heo
Hello, Alan.

On Wed, Jan 16, 2013 at 12:01:53PM -0500, Alan Stern wrote:
  The problem here is that "flush everything which comes before me" is
  used to order async jobs.  e.g. after async jobs probe the hardware
  they order themselves by flushing before registering them, so unless
 
 I don't fully understand this example.  What is the point -- to make 
 sure that asynchronously probed devices are registered in the order of 
 their discovery?

People still want devices to be numbered according to their physical
ports and so on, so we keep the registration order the same as the
natural (whatever that means) hardware order.

 If so, here's how to do it safely: Start up the async jobs in reverse
 order of discovery.  Have each job acquire a cookie when it starts.  
 Then each job needs to wait only for tasks that started after its
 cookie was issued.

It's a bit clumsy but yeah I guess it could work.

  There aren't too many which use async anyway so changing stuff
  shouldn't be too difficult but I think the simplicity or dumbness is
  one of major attractions of async, so it'd be nice to keep things that
  way and the PF_USED_ASYNC hack seems to be able to hold things
  together for now.
 
 Nesting won't matter for the chronological approach.  I really think 
 you should consider it more fully.  It's not a hack, and it doesn't 
 need to be complicated.

There is a benefit to the current dumb implementation in that drivers
can use it without thinking too much, but yeah, it could be that the
flushing range limit isn't too much of a restriction on top.  I don't
know.  At this point, I'd prefer to remove request_module() from the
elevator init path for the problem at hand.  If we need something more
involved, changing the cookie usage rules definitely seems like an option.

Thanks.

-- 
tejun


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-16 Thread Alan Stern
On Wed, 16 Jan 2013, Tejun Heo wrote:

 Hello, Alan.
 
 On Wed, Jan 16, 2013 at 12:01:53PM -0500, Alan Stern wrote:
   The problem here is that "flush everything which comes before me" is
   used to order async jobs.  e.g. after async jobs probe the hardware
   they order themselves by flushing before registering them, so unless
  
  I don't fully understand this example.  What is the point -- to make 
  sure that asynchronously probed devices are registered in the order of 
  their discovery?
 
 People still want devices to be numbered to their physical ports and
 so on, so we keep the registration order the same as natural
 (whatever that means) hardware order.
 
  If so, here's how to do it safely: Start up the async jobs in reverse
  order of discovery.  Have each job acquire a cookie when it starts.  
  Then each job needs to wait only for tasks that started after its
  cookie was issued.
 
 It's a bit clumsy but yeah I guess it could work.
 
   There aren't too many which use async anyway so changing stuff
   shouldn't be too difficult but I think the simplicity or dumbness is
   one of major attractions of async, so it'd be nice to keep things that
   way and the PF_USED_ASYNC hack seems to be able to hold things
   together for now.
  
  Nesting won't matter for the chronological approach.  I really think 
  you should consider it more fully.  It's not a hack, and it doesn't 
  need to be complicated.
 
 There is benefit to the current dumb implementation in that drivers
 can use it without thinking too much, but yeah it could be that the
 flushing range limit isn't too much of restriction on top.  I don't
 know.  At this point, I'd prefer to remove request_module() from
 elevator init path for the problem at hand.  If we need something more
 involved, changing cookie usage rules definitely seems like an option.

A simpler approach might be to leave the existing synchronization 
mechanisms as they are, and use the chronological approach only for the 
case of loading a module (or wherever else someone wants to use it).

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Linus Torvalds
[ Added Tejun to the discussion, since he's the async go-to-guy ]

On Mon, Jan 14, 2013 at 10:23 PM, Ming Lei ming@canonical.com wrote:

 But I have another idea to address the problem, and let module code call
 async_synchronize_full() only if the module requires that explicitly, so how
 about the below draft patch?

No way.

This kind of "let's just let drivers tell us when they used async
helpers" is basically *asking* for buggy code. In fact, just to prove
how bad it is, YOU SCREWED IT UP YOURSELF.

Because it's not just sd.c that uses async_schedule(), and would need
the async synchronize. It's floppy.c, it's generic scsi scanning (so
scsi tapes etc), and it's libata-core.c.

This kind of "let's randomly encourage people to write subtly buggy
code that has magical timing dependencies, so that the developer won't
likely even see it because he has fast disks etc" code is totally
unacceptable. And this code was *designed* to be that kind of buggy.

No, if we set a flag like this, then it needs to be set
*automatically*, so that a module cannot screw this up by mistake.

It could be as simple as having a per-thread flag that gets set by the
__async_schedule() function, and gets cleared by fork. Then the module
code could do something like

   /* before calling the module ->init function */
   current->used_async = 0;
   ...
   if (current->used_async)
           async_synchronize_full();

or whatever.
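
Spelled out as a small userspace model (an illustration only: struct
task, pending_async, the *_model() functions and the two init functions
are hypothetical stand-ins, and the real kernel structures differ), the
automatic flag could behave like this:

```c
#include <assert.h>

/* Toy stand-in for per-thread state; this is not the kernel's task_struct. */
struct task {
	int used_async;		/* set automatically when async work is scheduled */
};

static int pending_async;	/* async jobs scheduled and not yet finished */
static int full_syncs;		/* how often a full synchronize was performed */

/* Scheduling async work marks the calling thread; the driver does nothing. */
static void async_schedule_model(struct task *cur)
{
	assert(cur != 0);
	cur->used_async = 1;
	pending_async++;
}

/* Stand-in for async_synchronize_full(): here all pending work just completes. */
static void async_synchronize_full_model(void)
{
	pending_async = 0;
	full_syncs++;
}

typedef void (*init_fn)(struct task *cur);

/* Module load path: synchronize only if the init function itself used async. */
static void load_module_model(struct task *cur, init_fn init)
{
	cur->used_async = 0;
	init(cur);
	if (cur->used_async)
		async_synchronize_full_model();
}

/* An init function that schedules nothing, e.g. an IO scheduler module. */
static void init_plain(struct task *cur)
{
	(void)cur;
}

/* An init function that kicks off async probing, e.g. a disk driver. */
static void init_with_async(struct task *cur)
{
	async_schedule_model(cur);
}
```

Loading a module whose init function is init_plain() never reaches the
synchronize call, while init_with_async() triggers exactly one.  The
point of the model is that neither init function has to announce
anything itself.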

Tejun, comments? You can see the whole thread on lkml, but the basic
problem is that the module loading doing the unconditional
async_synchronize_full() has caused problems, because we have

 - load module A
   - module A does per-controller async discovery of its devices (eg
     scsi or ata probing)
   - in the async thread, it initializes something that needs another
     module B (in this case the default IO scheduler module)
     - modprobe for B loads the IO scheduler module successfully
     - at the end of the module load, it does
       async_synchronize_full() to make sure load_module won't return
       before the module is ready
     - *DEADLOCK*, because the async_synchronize_full() thing
       actually waits for not the module B async code (it didn't have any),
       but for the module *A* async code, which is waiting for module B to
       finish.
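
The circular wait in that sequence can be checked mechanically.  The
following is only a sketch under the assumption that each actor waits
on at most one other actor; the enum names and waits_on_itself() are
made up for illustration and do not exist in the kernel.

```c
#include <assert.h>

/* The two actors from the scenario above. */
enum { ASYNC_JOB_A, MODPROBE_B, NR_ACTORS };

/*
 * waits_on[x] is the actor that x is blocked on, or -1 if x can make
 * progress.  Returns 1 if following the chain from 'start' leads back
 * to 'start', i.e. the actor is transitively waiting for itself.
 */
static int waits_on_itself(const int waits_on[NR_ACTORS], int start)
{
	int cur = start;

	assert(start >= 0 && start < NR_ACTORS);
	for (int steps = 0; steps < NR_ACTORS; steps++) {
		cur = waits_on[cur];
		if (cur < 0)
			return 0;	/* chain ends: someone can run */
		if (cur == start)
			return 1;	/* came back to where we began */
	}
	return 0;
}
```

With the unconditional async_synchronize_full(), job A waits on the
modprobe for B and that modprobe waits on job A, so the check reports a
cycle.  If the load of B does not wait for unrelated async work, the
chain ends at B and the cycle disappears.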

Now, I'll happily argue that we shouldn't have this kind of "load
modules from random context" behavior in the kernel, and I think the
block layer is to blame for doing the IO scheduler load at an insane
time. So "don't do that then" would be the best solution. Sadly, we
don't even have a good way to notice that we're doing it, so a hacky
workaround that at least doesn't require driver authors to care is
likely the second-best workaround.

But the hacky workaround absolutely needs to be *automatic*. Because
"the driver writers need to get this subtle untestable thing right" is
*not* acceptable. That's the patch that Ming Lei did, and I refuse to
have that kind of fragile crap in the kernel.

  Linus


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Linus Torvalds
On Tue, Jan 15, 2013 at 9:36 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 This kind of "let's randomly encourage people to write subtly buggy
 code that has magical timing dependencies, so that the developer won't
 likely even see it because he has fast disks etc" code is totally
 unacceptable. And this code was *designed* to be that kind of buggy.

Btw, we could *possibly* do this the other way around. Wait for all
async work by default, but then have a really hacky way to turn that
off for modules that explicitly don't want it, because they know they
can be loaded in async context, and they don't do any async work
themselves. Then we could make the IO schedulers set that flag ("I
know I'm loaded from async space, and I know I'm not myself doing any
async init").

Quite frankly, I'd still much rather prefer the automated approach -
or even better, just avoiding the "load modules in async context"
entirely. But at least the "I can put a huge comment about why I don't
want to be waited on" would be much more acceptable than the "I need
to explicitly tell the world that it needs to wait on me".

So Ming Lei's patch was easily subtly buggy by mistake (showing that
by the fact that it was indeed buggy), while the opposite model where
you have to explicitly ask people not to wait for you could still be
very buggy, but at least now it needs to explicitly do extra work in
order to be buggy.

So if an interface is fragile, it should aim to be fragile in the
right way - making the fragility explicit, so that people can grep for
it, and people can add comments to the particular code that marks it
fragile. The default behavior should be the robust one.

And it would be lovely to add a warning to the "people loaded a module
from async context" case, so that we'd *see* this.

Tejun, is there a good way for code to see "I'm running in async
context"? Then we could do something like

    WARN_ON_ONCE(wait && system_state == SYSTEM_RUNNING && in_async_thread());

in kernel/kmod.c (__request_module()). That should at least warn about
this whole issue happening.

Linus


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Alan Stern
On Tue, 15 Jan 2013, Linus Torvalds wrote:

 Tejun, comments? You can see the whole thread on lkml, but the basic
 problem is that the module loading doing the unconditional
 async_synchronize_full() has caused problems, because we have
 
  - load module A
    - module A does per-controller async discovery of its devices (eg
      scsi or ata probing)
    - in the async thread, it initializes something that needs another
      module B (in this case the default IO scheduler module)
      - modprobe for B loads the IO scheduler module successfully
      - at the end of the module load, it does
        async_synchronize_full() to make sure load_module won't return
        before the module is ready
      - *DEADLOCK*, because the async_synchronize_full() thing
        actually waits for not the module B async code (it didn't have any),
        but for the module *A* async code, which is waiting for module B to
        finish.
 
 Now, I'll happily argue that we shouldn't have this kind of "load
 modules from random context" behavior in the kernel, and I think the
 block layer is to blame for doing the IO scheduler load at an insane
 time. So "don't do that then" would be the best solution.

It may not be so easy.  When the SCSI async thread probes the new disk, 
it has to do I/O.  So it needs to use a scheduler.

But maybe it could use a built-in trivial scheduler until the proper 
one is loaded.  Then the loading could be asynchronous.

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Tejun Heo
Hello, Linus.

On Tue, Jan 15, 2013 at 09:36:57AM -0800, Linus Torvalds wrote:
 Tejun, comments? You can see the whole thread on lkml, but the basic
 problem is that the module loading doing the unconditional
 async_synchronize_full() has caused problems, because we have
 
  - load module A
    - module A does per-controller async discovery of its devices (eg
      scsi or ata probing)
    - in the async thread, it initializes something that needs another
      module B (in this case the default IO scheduler module)
      - modprobe for B loads the IO scheduler module successfully
      - at the end of the module load, it does
        async_synchronize_full() to make sure load_module won't return
        before the module is ready
      - *DEADLOCK*, because the async_synchronize_full() thing
        actually waits for not the module B async code (it didn't have any),
        but for the module *A* async code, which is waiting for module B to
        finish.

I think the root problem here, apart from request_module() from block
- which is a bit nasty, but making that part completely async would
also be quite nasty, albeit in a different way - is that
async_synchronize_full() is way too indiscriminate.  It's something
only suitable for things like the end of system init.

I'm wondering whether what we need is a rudimentary nesting like the
following.

	finished_loading()
	{
		blah blah;

		cookie = async_current_cookie();

		do init calls;

		async_synchronize_upto(cookie);

		blah blah;
	}

The nesting here would be an approximation as the dependency recorded
here is chronological.  I *suspect* this should be safe unless the
module is doing something weird.  Need to think more about it.  One
way or the other, I think what we need is some form of scoping for
flushing async ops.
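
To make the chronological scoping concrete, here is a minimal userspace
sketch, not the kernel implementation.  It assumes a single strictly
increasing cookie counter and reads the range as "jobs issued at or
after the cookie taken before the init calls"; struct async_entry,
current_cookie(), schedule_job(), waits_full() and waits_since() are
names invented for this model.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the async pending list; names are illustrative only. */
struct async_entry {
	unsigned long cookie;	/* issue number, strictly increasing */
	int done;		/* nonzero once the job has finished */
};

static unsigned long next_cookie = 1;

/* Cookie the next scheduled job will get (models async_current_cookie()). */
static unsigned long current_cookie(void)
{
	return next_cookie;
}

/* Models scheduling a job: stamp it with the next cookie. */
static void schedule_job(struct async_entry *e)
{
	assert(e != NULL);
	e->cookie = next_cookie++;
	e->done = 0;
}

/* Jobs an unconditional full flush would have to wait for. */
static size_t waits_full(const struct async_entry *e, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (!e[i].done)
			count++;
	return count;
}

/* Jobs a ranged flush waits for: unfinished and issued at or after 'since'. */
static size_t waits_since(const struct async_entry *e, size_t n,
			  unsigned long since)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (!e[i].done && e[i].cookie >= since)
			count++;
	return count;
}
```

If module A's probe job is already pending when the load of module B
takes its cookie, a full flush still counts that job, whereas the
ranged flush counts nothing until B's own init schedules something.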

BTW, the current synchronization is broken - the cookie isn't transferred
to running->domain in queueing order, but __lowest_in_progress()
assumes that it is.  I think I broke that while converting it to workqueue.

Anyways, working on it.

Thanks.

-- 
tejun


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Tejun Heo
Hello, Alan.

On Tue, Jan 15, 2013 at 01:20:58PM -0500, Alan Stern wrote:
 It may not be so easy.  When the SCSI async thread probes the new disk, 
 it has to do I/O.  So it needs to use a scheduler.
 
 But maybe it could use a built-in trivial scheduler until the proper 
 one is loaded.  Then the loading could be asynchronous.

It can be done.  Noop is always built-in and block IO can do IOs with
noop.  The problem here is that request_module() is done synchronously
during elevator_init().  We can punt that to a work item so that the
elevator is switched on load completion.  There is some nastiness
involved though - if module probing returns before the elevator switch
happens, userland can observe the elevator being switched after some
indeterminate short period of time, which can, for example, break
scripts adjusting elevator knobs, etc.

I *think* it'll be best to allow scoped synchronization of async ops.
Looking into it.

Thanks.

-- 
tejun


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Linus Torvalds
On Tue, Jan 15, 2013 at 10:32 AM, Tejun Heo t...@kernel.org wrote:

 I think the root problem here, apart from request_module() from block
 - which is a bit nasty but making that part completely async would too
 be quite nasty albeit in a different way - is that
 async_synchronize_full() is way too indiscriminate.  It's something
 only suitable for things like the end of system init.

 I'm wondering whether what we need is a rudimentary nesting like the
 following.

I think that is a good solution if it works, but look out: we need to
synchronize across *all* domains, not just the default one.  The sd.c
code, for example, uses its own scsi_sd_probe_domain,
and we *do* want to synchronize with it.

Can you do that with your suggested interface (ie it would have to be
a *global* sequence number)?
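
As a rough illustration of why the sequence number has to be global,
consider this userspace sketch.  It is an assumption-laden model rather
than the kernel's async code: struct domain, global_seq, schedule_in()
and pending_since() are hypothetical, and every domain simply records
the global sequence numbers of its unfinished jobs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Toy async domain: sequence numbers of its unfinished jobs. */
struct domain {
	unsigned long pending[MAX_PENDING];
	size_t nr;
};

/* One counter shared by every domain, so numbers are comparable everywhere. */
static unsigned long global_seq = 1;

/* Schedule a job in domain d and return its global sequence number. */
static unsigned long schedule_in(struct domain *d)
{
	assert(d != NULL && d->nr < MAX_PENDING);
	d->pending[d->nr] = global_seq;
	d->nr++;
	return global_seq++;
}

/* Count unfinished jobs issued at or after 'since' across the given domains. */
static size_t pending_since(struct domain *const *domains, size_t nr_domains,
			    unsigned long since)
{
	size_t count = 0;

	for (size_t i = 0; i < nr_domains; i++)
		for (size_t j = 0; j < domains[i]->nr; j++)
			if (domains[i]->pending[j] >= since)
				count++;
	return count;
}
```

A job scheduled in a private domain after the mark is only seen when
that domain is included in the scan; looking at the default domain
alone would miss it, which is the case the sd.c probe domain raises.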

   Linus


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Tejun Heo
Hello, Linus

Will continue on another reply but this one is relevant so...

On Tue, Jan 15, 2013 at 10:18:45AM -0800, Linus Torvalds wrote:
 Tejun, is there a good way for code to see "I'm running in async
 context"? Then we could do something like

Almost.  With a bit of modification we can ask whether current is a
kworker, reach struct worker_struct via kthread_data() if so, and then
test worker->current_func against the async workfn.

Thanks.

-- 
tejun


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Tejun Heo
cc'ing Arjan.  Arjan, the original thread can be read from

  http://thread.gmane.org/gmane.linux.kernel/1420814

Hello, again.

On Tue, Jan 15, 2013 at 12:18:01PM -0800, Linus Torvalds wrote:
 I think that is a good solution if it works, but look out: we need to
 synchronize across *all* domains, not just the default one.  The sd.c
 code, for example, uses its own scsi_sd_probe_domain,
 and we *do* want to synchronize with it.
 
 Can you do that with your suggested interface (ie it would have to be
 a *global* sequence number)?

So, I've been thinking about it for a while now and it looks like
async is cutting too many corners to implement any sane stackable
flushing scheme on top.  There simply isn't much information to
determine who should wait for what.

I've thought of two workarounds.  Both suck.

A. Try to detect deadlock conditions from synchronize().  If a deadlock
   condition involving other async jobs is detected, whine about it
   and then skip.  Ignore a deadlock condition on self (should solve
   this particular case).

   Detecting a deadlock condition isn't difficult if there are only
   global synchronizations; unfortunately, fragmented dependencies via
   domain-local synchronization make this non-trivial.

   We can still do the ignore-self thing mostly trivially though.  This
   will at least work around the problem at hand.

B. The ranged synchronization I first suggested.  The problem with
   this is that it's a common practice for a given async job to try to
   flush anything which comes before it.  This can introduce spurious
   synchronization dependencies which can then lead to deadlocks.

   These conditions can be detected and ignored, at least only
   considering global synchronizations.  The problem here is that
   those deadlock conditions will occur under normal usage and thus
   should be ignored silently, which basically makes synchronization
   silently ignore and finish successfully even if there are
   legitimate deadlocks which should be investigated.

For now, I'm gonna implement simple "I'm not gonna wait for myself"
self-deadlock avoidance.  If this needs any more sophistication, I
think we'd better reimplement it so that we can explicitly match up and
track who's gonna wait for what instead of throwing everything into a
single cookie space and then trying to work back from there.

Thanks.

-- 
tejun


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Arjan van de Ven



 For now, I'm gonna implement simple "I'm not gonna wait for myself"
 self-deadlock avoidance.  If this needs any more sophistication, I
 think we better reimplement it so that we can explicitly match up and
 track who's gonna wait for what instead of throwing everything into a
 single cookie space and then try to work back from there.


async fundamentally had the concept of a monotonically increasing number,
and that you could always wait for "everyone before me".
then people (like me) wanted exceptions to what "everyone" means ;-(
I'm ok with going back to a single space and simplifying the world.

the case with (usb) module loading is fun...
people expect the device to be there (since frankly, it's hard to do
otherwise)...
... but it's also really hard due to the nature of USB.  USB is async in
nature, even independent of the kernel async stuff.
Example: load ehci.ko ... the actual USB devices don't show up for some time.

the module wait case is tricky, and I wonder if there are deadlocks lurking
even without async.
(btw there is a similar situation at the end of the normal kernel boot versus
things like asynchronous driver initialization... but we skip that when an
initrd is used, to bypass a very similar deadlock.
this is even without async in use.. the typical hard case is PS/2 mouse
probing)

at some point in the past we had the concept of "request a module but don't
wait for it", and I wonder if that is what should have been used here.

Doing a range wait, with the start of the range being taken at the start of
module loading, is a bit of a hack, but it'll work for the userspace expected
semantics of "all async stuff of the *loaded module* is done", independent of
all other modules/async stuff.
It's not as deadlocky as one might think, but it's not going to be efficient
to implement.

not self-deadlocking likely solves most practical cases though






Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Tejun Heo
Hello, Arjan.

On Tue, Jan 15, 2013 at 04:25:54PM -0800, Arjan van de Ven wrote:
 async fundamentally had the concept of a monotonic increasing number,
 and that you could always wait for "everyone before me".
 then people (like me) wanted exceptions to what "everyone" means ;-(
 I'm ok with going back to a single space and simplify the world.

If we want (or need) finer grained operation, we'll probably have to
head the other direction, so that we can definitively tell that an
async operation belongs to domains system, module load A and B, so
that each waiter knows what to wait for.

The current domain implementation is somewhere in between.  It's not
a completely simplistic system and at the same time not developed enough
to do properly stacked flushing.

 the module wait case is tricky, and I wonder if there's deadlocks
 lurking even without async.

I don't think so.  It's really an async job waiting for itself.
Working around just this case is mostly trivial (working on patches
now) but it really is putting kludges on top of a shaky foundation.
Maybe this is the extent of complexity that we need to go to, given the
rather limited use cases of async.  Let's hope so.  I think we'll have
to reimplement the synchronization scheme if we have to go further.

 at some point in the past we had the concept of "request a module
 but don't wait for it", and I wonder if that is what should have
 been used here.

We actually want to wait for it as it creates a userland visible
behavior difference otherwise.  It's just that async's way of waiting
is too ham-fisted to be used properly in more complex scenarios.

Thanks.

-- 
tejun


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Linus Torvalds
On Tue, Jan 15, 2013 at 3:50 PM, Tejun Heo t...@kernel.org wrote:

 For now, I'm gonna implement simple "I'm not gonna wait for myself"
 self-deadlock avoidance.

You can't really do that. Or rather, it won't *help*.

The thing is, the module loading in particular is not necessarily
happening in the same context as what *started* the module loading. A
module loader will request the module from user space, and then later
user space - through possibly a totally unrelated process - will
finish it. So there is no "myself". There's not even necessarily any
relationship that the kernel even knows about, because the module
loading request can have gone from usermode_helper over something like
dbus to systemd.

See?

There's a reason I asked for a warning for this. Or the "let's flag
the current thread if it ever started anything asynchronous". Because
it's complicated.

  Linus


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Linus Torvalds
On Tue, Jan 15, 2013 at 4:36 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 There's a reason I asked for a warning for this. Or the "let's flag
 the current thread if it ever started anything asynchronous". Because
 it's complicated.

Btw, the sequence counter (that is *not* taking anything else into
account) is good enough in practice, exactly because the common case
for module loading is actually that nothing in the module init
sequence is done asynchronously.

Yes, device discovery (particularly for block devices) is often
asynchronous. But the modules it then asks to load usually wouldn't
be. So if we just have the flag "did this thread ever even start async
work" over the module init sequence, we can just avoid the async
serialization entirely for that case, and it breaks the deadlock chain
nicely in practice.

Only if a block device does async work and then wants to load another
module that does more async work in its init routine would it then
break. But at that point, I'll happily just put my foot down and tell
people they are crazy, and "Let's not do that kind of crap".

Linus


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Tejun Heo
On Tue, Jan 15, 2013 at 04:36:34PM -0800, Linus Torvalds wrote:
 The thing is, the module loading in particular is not necessarily
 happening in the same context as what *started* the module loading. A
 module loader will request the module from user space, and then later
 user space - through possibly a totally unrelated process - will
 finish it. So there is no "myself". There's not even necessarily any
 relationship that the kernel even knows about, because the module
 loading request can have gone from usermode_helper over something like
 dbus to systemd.
 
 See?

Right.  Gees, there's even no way to link them.

-- 
tejun


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Ming Lei
On Wed, Jan 16, 2013 at 1:36 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 Because it's not just sd.c that uses async_schedule(), and would need
 the async synchronize. It's floppy.c, it's generic scsi scanning (so
 scsi tapes etc), and it's libata-core.c.

As discussed previously, only the module which will populate device
node for user space inside async func may require the synchronization,
so that the below

modprobe A
mount /dev/XXX /mnt

script can't be broken, and that should be the original bug report:

   https://bugzilla.kernel.org/attachment.cgi?id=20937

For other modules, it looks like the synchronization isn't needed: plenty
of other async mechanisms (work items, kthreads, ...) are scheduled from
a driver's probe(), and nothing synchronizes them after the module's
init() completes during module loading. Do we need to add that sync for
every async mechanism used while a module loads?

So, supposing some sync interfaces are introduced, it looks like only
sd.c and floppy.c need to be synchronized, doesn't it?


Thanks
--
Ming Lei


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-15 Thread Alan Stern
On Tue, 15 Jan 2013, Tejun Heo wrote:

 Hello, Arjan.
 
 On Tue, Jan 15, 2013 at 04:25:54PM -0800, Arjan van de Ven wrote:
  async fundamentally had the concept of a monotonic increasing number,
  and that you could always wait for everyone before me.
  then people (like me) wanted exceptions to what everyone means ;-(
  I'm ok with going back to a single space and simplify the world.
 
 If we want (or need) finer grained operation, we'll probably have to
 head the other direction, so that we can definitively tell that an
 async operation belongs to domains system, module load A and B, so
 that each waiter knows what to wait for.
 
 The current domain implementation is somewhere inbetween.  It's not
 completely simplistic system and at the same time not developed enough
 to do properly stacked flushing.

I like your idea of chronological synchronization: Insist that anybody
who wants to flush async jobs must get a cookie, and then only allow
them to wait for async jobs started after the cookie was issued.

I don't know if this is possible with the current implementation.  It 
would require changing every call to async_synchronize_*(), and in a 
nontrivial way.  But it might provide a proper solution to all these 
problems.

Can you think of any reasons why it wouldn't work in principle?  It 
would prevent code from doing "wait until all currently-running async 
jobs have finished" -- but arguably, nobody should be allowed to do 
that anyway.
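
A small userspace sketch of the chronological idea, under stated assumptions: cookies are plain increasing integers, jobs are tracked in a fixed array, and every name here (model_get_cookie(), model_schedule(), model_must_wait()) is hypothetical rather than part of the kernel's async API. A waiter takes a cookie first and afterwards only considers jobs started at or after that cookie.

```c
#include <stdbool.h>

#define MAX_JOBS 16

typedef unsigned long cookie_t;

static cookie_t next_cookie = 1;        /* monotonically increasing */
static bool job_done[MAX_JOBS + 1];     /* indexed by the job's cookie */

/* A waiter records "now": the cookie the next job will receive. */
static cookie_t model_get_cookie(void)
{
    return next_cookie;
}

/* Start a job; returns its cookie, or 0 if the model's table is full. */
static cookie_t model_schedule(void)
{
    if (next_cookie > MAX_JOBS)
        return 0;
    job_done[next_cookie] = false;
    return next_cookie++;
}

/* Mark a previously scheduled job as finished. */
static void model_complete(cookie_t c)
{
    if (c >= 1 && c < next_cookie)
        job_done[c] = true;
}

/* True if any job started at or after 'since' is still running. */
static bool model_must_wait(cookie_t since)
{
    for (cookie_t c = since; c < next_cookie; c++)
        if (!job_done[c])
            return true;
    return false;
}
```

A job that was already running when the cookie was issued never makes the waiter block, which is the property that would keep older, unrelated jobs out of the dependency chain.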

Alan Stern




Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Oliver Neukum
On Monday 14 January 2013 11:47:57 Ming Lei wrote:
 [  181.175323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 [  181.183624] modprobeD c04f1920 0  2462   2461 0x
 [  181.183685] [c04f1920] (__schedule+0x5fc/0x6d4) from [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168)
 [  181.183715] [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168) from [c005ed04] (async_synchronize_full+0x3c/0x60)
 [  181.183776] [c005ed04] (async_synchronize_full+0x3c/0x60) from [c0085610] (load_module+0x1aac/0x1cdc)
 [  181.183807] [c0085610] (load_module+0x1aac/0x1cdc) from [c0085944] (sys_init_module+0x104/0x110)
 [  181.183837] [c0085944] (sys_init_module+0x104/0x110) from [c000dfe0] (ret_fast_syscall+0x0/0x48)
 [  271.175506] INFO: task modprobe:2462 blocked for more than 90 seconds.
 [  271.182373] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 [  271.190826] modprobeD c04f1920 0  2462   2461 0x
 [  271.190887] [c04f1920] (__schedule+0x5fc/0x6d4) from [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168)
 [  271.190917] [c005eba4] (async_synchronize_cookie_domain+0xdc/0x168) from [c005ed04] (async_synchronize_full+0x3c/0x60)
 [  271.190948] [c005ed04] (async_synchronize_full+0x3c/0x60) from [c0085610] (load_module+0x1aac/0x1cdc)
 [  271.190948] [c0085610] (load_module+0x1aac/0x1cdc) from [c0085944] (sys_init_module+0x104/0x110)
 [  271.190979] [c0085944] (sys_init_module+0x104/0x110) from [c000dfe0] (ret_fast_syscall+0x0/0x48)

OK, your trace is totally different. If your hangs are related, as is likely,
my explanation goes out of the window.

Regards
Oliver



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Ming Lei
On Mon, Jan 14, 2013 at 4:22 PM, Oliver Neukum oli...@neukum.org wrote:

 OK, your trace is totally different. If your hangs are related, as is likely,
 my explanation goes out of the window.

If I run 'shutdown' after unplugging the USB storage device, I can also
trigger a hang with the same trace as Alex's, so it should be the same problem.

Thanks,
--
Ming Lei


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Alex Riesen
On Mon, Jan 14, 2013 at 3:39 AM, Alan Stern st...@rowland.harvard.edu wrote:
 On Sun, 13 Jan 2013, Oliver Neukum wrote:
 This is not a USB problem. You need to involve the SCSI people.
 khubd just stops working because disconnects are processed
 in its context and the removal deadlocks.

 The why whould building the deadline elevator as a module make any
 difference?  Or does it make a difference?

Building elevator as module does make a difference: the system is broken.

 Alex, if the elevator is made static instead, do you still see the same
 behavior when the USB drive is removed?

How can I make the elevator static? Or did you mean built-in?
Or did you mean to ask if khubd hangs if the deadline is built in?
In that case - no. The behavior is normal. Nothing hangs.

 Also, are there any mounted filesystems on the drive when you unplug
 it?

No, no auto-mount. The whole of userspace init is attached, and I'm reasonably
sure nothing of it mounts anything automatically. Nothing of udev, too.


linuxrc-t
Description: Binary data


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Linus Torvalds
On Sun, Jan 13, 2013 at 11:15 PM, Ming Lei ming@canonical.com wrote:

 The deadlock problem is caused by calling request_module() inside
 async function of do_scan_async(), and it was introduced by Linus's
 below commit:

 commit d6de2c80e9d758d2e36c21699117db6178c0f517
 Author: Linus Torvalds torva...@linux-foundation.org
 Date:   Fri Apr 10 12:17:41 2009 -0700

 async: Fix module loading async-work regression

 IMO, maybe the commit isn't a proper fix, considered the
 below fact:

 - it isn't good to allow async function to be marked as __init

Immaterial. For modules, __init is a non-issue. For non-modules, the
synchronization elsewhere is sufficient.

 - any user mode shouldn't expect that the device is ready just
 after completing of 'insmod'

Bullshit. That expectation is just a fact. People insmod a device
driver, and mount the device immediately in scripts.

We do not say "user mode shouldn't". Seriously. EVER. User mode
*does*, and we deal with it. Learn it now, and stop ever saying that
again.

This is really starting to annoy me. Kernel developers who say "user
mode should be fixed to not do that" should go somewhere else. The
whole and *only* point of a kernel is to hide these kinds of issues
from user mode, and make things just work in user mode. User mode
should not ever worry about "oh, doing X can trigger a module load, so
now the device might not be available immediately, so I should delay
and re-try until it is".

That's just f*cking crazy talk.

We have a very simple rule in the kernel: we don't break user space. EVER.

Learn that rule. I don't ever want to hear any "user mode shouldn't
expect" again. User mode *does* expect. End of discussion.

 - from view of driver, introducing async_synchronize_full() after
 do_one_initcall() inside do_init_module() is like a sync probe
 for drivers built as module, and cause this kind of deadlock easily.

 So could we revert the commit and fix the previous problems just
 case by case? or other better fix?

There's no way in hell we take a "fix things one by one" approach.
It's not going to work. And your suggestion seems to "not do async
discovery of devices in general", which is a *much* worse fix than
anything else. It's just crazy.

But there are other approaches we might take. We might move the call to

async_synchronize_full();

to other places. For example, maybe we're better off doing it at
block/char device open instead?

  Linus


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Alan Stern
On Mon, 14 Jan 2013, Linus Torvalds wrote:

  - from view of driver, introducing async_synchronize_full() after
  do_one_initcall() inside do_init_module() is like a sync probe
  for drivers built as module, and cause this kind of deadlock easily.
 
  So could we revert the commit and fix the previous problems just
  case by case? or other better fix?
 
 There's no way in hell we take a fix things one by one approach.
 It's not going to work. And your suggestion seems to not do async
 discovery of devices in general, which is a *much* worse fix than
 anything else. It's just crazy.
 
 But there are other approaches we might take. We might move the call to
 
 async_synchronize_full();
 
 to other places. For example, maybe we're better off doing it at
 block/char device open instead?

How about skipping that call if the current thread is one of the async 
helpers?  Is it possible to detect when that happens?

Or maybe such a check should go inside async_synchronize_full() itself.

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Linus Torvalds
On Mon, Jan 14, 2013 at 10:04 AM, Alan Stern st...@rowland.harvard.edu wrote:

 How about skipping that call if the current thread is one of the async
 helpers?  Is it possible to detect when that happens?

 Or maybe such a check should go inside async_synchronize_full() itself.

Do we have some idea of exactly what is waiting for what? Which async
context is causing the module load to happen in the first place?

I think *that* is what we should avoid - it sounds like the block
layer is loading the IO scheduler at the wrong point. I realize that
people like (for testing purposes) to change the IO scheduler at
random, but if that means that any IO can basically result in a
request_module(), then that sounds like a problem.

It seems to be elevator_get(), and I presume the chain is something
like "load block driver async, the block driver does
blk_init_allocated_queue, that does request_module() to find the
elevator, the request_module() succeeds, but ends up waiting for async
work, which is the block driver load, which is waiting for the
request_module to finish".
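
The presumed chain can be written down as a tiny wait-for graph. This is only an illustrative userspace sketch: the two actors and the helper on_wait_cycle() are hypothetical names, and each actor is assumed to wait on at most one other actor.

```c
#include <stdbool.h>

/* The two parties in the presumed chain. */
enum actor { SCAN_JOB, MODPROBE, N_ACTORS };

#define NOT_WAITING (-1)

/*
 * waits_for[a] names the actor that 'a' is blocked on, or NOT_WAITING.
 * Follow the chain from 'start'; if it leads back to 'start', that actor
 * is part of a cycle and can never make progress.
 */
static bool on_wait_cycle(const int waits_for[], int n, int start)
{
    int cur = start;

    for (int steps = 0; steps < n; steps++) {
        cur = waits_for[cur];
        if (cur == NOT_WAITING)
            return false;   /* chain ends at a runnable actor */
        if (cur == start)
            return true;    /* came back around: deadlock */
    }
    return false;           /* any cycle does not include 'start' */
}
```

In the deadlocking case the async scan job waits for the module load it requested, while the module load waits in its end-of-init flush for all async work, including that scan job. Removing either edge breaks the cycle.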

And my gut feel is that blk_init_allocated_queue() probably shouldn't
do that request_module() at all. We might want to do it when we *open*
the device, but not while loading the module for the device.

So my _feeling_ is that this is just a bug in the block layer, and
that it shouldn't set up block device drivers for this kind of crazy
need to load the elevator module while in the middle of scanning
devices. I think *that* is what we should aim to change.

Hmm?

That said, I think it might indeed be a good idea to make this problem
much easier to see, and that "detect when it happens" would be a good
thing (and then we should WARN_ON_ONCE() on people trying to do
request_module() calls from async context).
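
A rough userspace sketch of such a warn-once check. Everything here is a hypothetical model: in_async_context stands in for whatever per-thread marker the kernel would use, warn_once() imitates the effect of WARN_ON_ONCE() with a static flag, and model_request_module() does not load anything.

```c
#include <stdbool.h>
#include <stdio.h>

static bool in_async_context;   /* hypothetical per-thread marker */
static int warnings_emitted;    /* lets callers observe the warning */

/* Print the message the first time only, like a warn-once macro would. */
static void warn_once(const char *msg)
{
    static bool warned;

    if (!warned) {
        warned = true;
        warnings_emitted++;
        fprintf(stderr, "WARNING: %s\n", msg);
    }
}

/* Model of a module request: complain when issued by an async job. */
static int model_request_module(const char *name)
{
    (void)name;
    if (in_async_context)
        warn_once("request_module() called from async context");
    return 0;
}

/* Model of the async runner: mark the context around the job. */
static void model_run_async(void (*fn)(void))
{
    in_async_context = true;
    fn();
    in_async_context = false;
}

/* Example job that asks for a module (the name is illustrative). */
static void scan_job(void)
{
    model_request_module("example-elevator");
}
```

Calls made outside an async job stay silent, and repeated offenders produce a single warning rather than flooding the log.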

   Linus


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Ming Lei
On Tue, Jan 15, 2013 at 1:30 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Sun, Jan 13, 2013 at 11:15 PM, Ming Lei ming@canonical.com wrote:

 The deadlock problem is caused by calling request_module() inside
 async function of do_scan_async(), and it was introduced by Linus's
 below commit:

 commit d6de2c80e9d758d2e36c21699117db6178c0f517
 Author: Linus Torvalds torva...@linux-foundation.org
 Date:   Fri Apr 10 12:17:41 2009 -0700

 async: Fix module loading async-work regression

 IMO, maybe the commit isn't a proper fix, considered the
 below fact:

 - it isn't good to allow async function to be marked as __init

 Immaterial. For modules, __init is a non-issue. For non-modules, the
 synchronization elsewhere is sufficient.

It looks like commit 5d38258ec026921a7b266f4047ebeaa75db358e5 ("ACPI battery:
fix async boot oops") addresses the issue of __init for modules.


 - any user mode shouldn't expect that the device is ready just
 after completing of 'insmod'

 Bullshit. That expectation is just a fact. People insmod a device
 driver, and mount the device immediately in scripts.

I mean we can populate the device node in probe() first, but make
open() wait for completion of the async probe(). Maybe I didn't express
it accurately: here 'device isn't ready' just means that the async
probe() hasn't completed, not that the device node is missing.


 We do not say user mode shouldn't. Seriously. EVER. User mode
 *does*, and we deal with it. Learn it now, and stop ever saying that
 again.

 This is really starting to annoy me. Kernel developers who say user
 mode should be fixes to not do that should go somewhere else. The
 whole and *only* point of a kernel is to hide these kinds of issues
 from user mode, and make things just work in user mode. User mode
 should not ever worry about oh, doing X can trigger a module load, so
 now the device might not be available immediately, so I should delay
 and re-try until it is.

 That's just f*cking crazy talk.

 We have a very simple rule in the kernel: we don't break user space. EVER.

No, I don't mean we should break user space, see above.


 Learn that rule. I don't ever want to hear any user mode shouldn't
 expect again. User mode *does* expect. End of discussion.

 - from view of driver, introducing async_synchronize_full() after
 do_one_initcall() inside do_init_module() is like a sync probe
 for drivers built as module, and cause this kind of deadlock easily.

 So could we revert the commit and fix the previous problems just
 case by case? or other better fix?

 There's no way in hell we take a fix things one by one approach.
 It's not going to work. And your suggestion seems to not do async
 discovery of devices in general, which is a *much* worse fix than
 anything else. It's just crazy.

I will try to put together a patch that addresses the SCSI block async
probe issue first, and see whether the problem can be fixed by moving
add_disk() into sd_probe() and calling
async_synchronize_full_domain(scsi_sd_probe_domain) at the entry of
sd_open().


 But there are other approaches we might take. We might move the call to

 async_synchronize_full();

 to other places. For example, maybe we're better off doing it at
 block/char device open instead?

That looks similar to the idea above, but we would also have to remove
the async_synchronize_full() call in do_init_module().

Thanks,
--
Ming Lei


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-14 Thread Ming Lei
On Tue, Jan 15, 2013 at 9:53 AM, Ming Lei ming@canonical.com wrote:

 I will try to figure out one patch to address the scsi block async probe
 issue first, and see if it can fix the problem by moving add_disk()
 into sd_probe()
 and calling async_synchronize_full_domain(scsi_sd_probe_domain)
 in the entry of sd_open().

It looks like that isn't doable, because the block partition devices can
only be created inside the async function.

But I have another idea to address the problem: have the module code call
async_synchronize_full() only if the module explicitly requires it. How
about the draft patch below?

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7992635..c5106a0 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3143,6 +3143,8 @@ static int __init init_sd(void)
if (err)
goto err_out_driver;

+   mod_init_async_wait(THIS_MODULE);
+
return 0;

 err_out_driver:
diff --git a/include/linux/module.h b/include/linux/module.h
index 7760c6d..09bd4c5 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -300,6 +300,12 @@ struct module

unsigned int taints;/* same bits as kernel:tainted */

+   /*
+    * set if the module wants to call async_synchronize_full
+    * after its init() is completed.
+    */
+   unsigned int init_async_wait:1;
+
 #ifdef CONFIG_GENERIC_BUG
/* Support for BUG */
unsigned num_bugs;
@@ -656,4 +662,16 @@ static inline void module_bug_finalize(const Elf_Ehdr *hdr,
 static inline void module_bug_cleanup(struct module *mod) {}
 #endif /* CONFIG_GENERIC_BUG */

+/*
+ * If one module wants to complete its all async code after
+ * its init() executed, the module can call this function in
+ * the entry of its init(), but the module's async function
+ * can't call request_module, otherwise deadlock will be caused.
+ */
+static inline void mod_init_async_wait(struct module *mod)
+{
+   if (mod)
+       mod->init_async_wait = 1;
+}
+
 #endif /* _LINUX_MODULE_H */
diff --git a/kernel/module.c b/kernel/module.c
index 250092c..dc5d011 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3058,8 +3058,9 @@ static int do_init_module(struct module *mod)
    blocking_notifier_call_chain(&module_notify_list,
                                 MODULE_STATE_LIVE, mod);

-   /* We need to finish all async code before the module init sequence is done */
-   async_synchronize_full();
+   /* Only complete all async code if the module requires that */
+   if (mod->init_async_wait)
+       async_synchronize_full();

    mutex_lock(&module_mutex);
    /* Drop initial reference. */


Thanks,
--
Ming Lei


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-13 Thread Alex Riesen
On Sat, Jan 12, 2013 at 11:52 PM, Alan Stern st...@rowland.harvard.edu wrote:
 On Sat, 12 Jan 2013, Alex Riesen wrote:
 Now, who would be interested to handle this kind of misconfiguration ...

 So the whole thing was a false alarm?

Yes, almost. What about khubd hanging when machine is shutdown?

 Maybe you should report to the block-layer maintainers that it's
 possible to mess up the system by building an elevator as a module.
 That sounds like the sort of thing they'd be interested to hear.

Hi Jens,

may I point you at this problem report:

http://thread.gmane.org/gmane.linux.kernel/1420814

It is surely a misconfiguration on my part (the used io scheduler
configured as a module), but the behavior is somewhat problematic
anyway: at least in this case USB storage is essentially locked up.

Regards,
Alex


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-13 Thread Alan Stern
On Sun, 13 Jan 2013, Alex Riesen wrote:

 On Sat, Jan 12, 2013 at 11:52 PM, Alan Stern st...@rowland.harvard.edu 
 wrote:
  On Sat, 12 Jan 2013, Alex Riesen wrote:
  Now, who would be interested to handle this kind of misconfiguration ...
 
  So the whole thing was a false alarm?
 
 Yes, almost. What about khubd hanging when machine is shutdown?

What about it?  I have trouble understanding all the descriptions you
have provided so far, because you talk about several different things
and change your mind a lot.  Can you provide a single, simple scenario
that illustrates this problem?

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-13 Thread Alex Riesen
On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern st...@rowland.harvard.edu wrote:
 On Sun, 13 Jan 2013, Alex Riesen wrote:

 Yes, almost. What about khubd hanging when machine is shutdown?

 What about it?  I have trouble understanding all the descriptions you
 have provided so far, because you talk about several different things
 and change your mind a lot.  Can you provide a single, simple scenario
 that illustrates this problem?

1. Compile a kernel with deadline elevator as module
2. Boot into it, make sure the elevator is selected
  (I used elevator=deadline in the kernel command line)
3. Insert a FAT formatted mass storage device in an USB2 port
   Observe io scheduler deadline registered
4. Pull the stick out, wait a moment, and either shut down or just
   press alt-sysrq-W:

[  158.170585] usb 1-1.2: USB disconnect, device number 3
[  158.170590] usb 1-1.2: unregistering device
[  158.170595] usb 1-1.2: unregistering interface 1-1.2:1.0
[  166.959398] SysRq : Show Blocked State
[  166.959410]   taskPC stack   pid father
[  166.959432] khubd   D 880213a68000 0   513  2 0x
[  166.959440]  880213affa18 0046 8802006b 0
[  166.959448]  880213a68000 880213afffd8 880213afffd8 00013
[  166.959454]  81a14400 880213a68000 880213aff9a8 0
[  166.959461] Call Trace:
[  166.959475]  [8104d763] ? flush_work+0x6d/0x1fe
[  166.959485]  [8133defb] ? scsi_remove_host+0x24/0x10e
[  166.959490]  [8104d6fb] ? flush_work+0x5/0x1fe
[  166.959499]  [815e1dd6] schedule+0x65/0x67
[  166.959506]  [815e201e] schedule_preempt_disabled+0x18/0x24
[  166.959513]  [815e07e4] mutex_lock_nested+0x181/0x2c1
[  166.959518]  [8133defb] ? scsi_remove_host+0x24/0x10e
[  166.959524]  [8133defb] scsi_remove_host+0x24/0x10e
[  166.959531]  [813910d5] usb_stor_disconnect+0x77/0xbc
[  166.959539]  [81377ca3] usb_unbind_interface+0x6c/0x14d
[  166.959548]  [813266fc] __device_release_driver+0x88/0xdb
[  166.959554]  [81326774] device_release_driver+0x25/0x32
[  166.959561]  [8132616f] bus_remove_device+0xf5/0x10a
[  166.959567]  [8132413f] device_del+0x12e/0x189
[  166.959574]  [81375d3a] usb_disable_device+0xb1/0x20e
[  166.959582]  [8136ed8b] usb_disconnect+0xab/0x113
[  166.959589]  [81370218] hub_port_connect_change+0x1b0/0x879
[  166.959597]  [81370e3a] hub_events+0x559/0x69d
[  166.959604]  [81370fb6] hub_thread+0x38/0x19b
[  166.959612]  [81052587] ? wake_up_bit+0x2a/0x2a
[  166.959618]  [81370f7e] ? hub_events+0x69d/0x69d
[  166.959625]  [81051f2a] kthread+0xd5/0xdd
[  166.959632]  [8105d5f6] ? finish_task_switch+0x3f/0xf7
[  166.959641]  [81051e55] ? __init_kthread_worker+0x5a/0x5a
[  166.959648]  [815e965c] ret_from_fork+0x7c/0xb0
[  166.959655]  [81051e55] ? __init_kthread_worker+0x5a/0x5a

This trace is from alt-sysrq-W. I can attach an image from the shutdown case;
the traces from that case are hard to save because the main storage is usually
already stopped. I believe it was the same, though.


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-13 Thread Oliver Neukum
On Sunday 13 January 2013 18:42:49 Alex Riesen wrote:
 On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern st...@rowland.harvard.edu wrote:
  On Sun, 13 Jan 2013, Alex Riesen wrote:
 
  Yes, almost. What about khubd hanging when machine is shutdown?
 
  What about it?  I have trouble understanding all the descriptions you
  have provided so far, because you talk about several different things
  and change your mind a lot.  Can you provide a single, simple scenario
  that illustrates this problem?
 
 1. Compile a kernel with deadline elevator as module
 2. Boot into it, make sure the elevator is selected
   (I used elevator=deadline in the kernel command line)
 3. Insert a FAT formatted mass storage device in an USB2 port
Observe io scheduler deadline registered
 4. Pull the stick out, wait a moment, and either shutdown or just
and press alt-sysrq-W:

That makes it clear. The elevator probably has scheduled work
which cannot finish waiting on a lock and scsi_remove_host()
wants to flush work.

This is not a USB problem. You need to involve the SCSI people.
khubd just stops working because disconnects are processed
in its context and the removal deadlocks.

Regards
Oliver




Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-13 Thread Alan Stern
On Sun, 13 Jan 2013, Oliver Neukum wrote:

 On Sunday 13 January 2013 18:42:49 Alex Riesen wrote:
  On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern st...@rowland.harvard.edu 
  wrote:
   On Sun, 13 Jan 2013, Alex Riesen wrote:
  
   Yes, almost. What about khubd hanging when machine is shutdown?
  
   What about it?  I have trouble understanding all the descriptions you
   have provided so far, because you talk about several different things
   and change your mind a lot.  Can you provide a single, simple scenario
   that illustrates this problem?
  
  1. Compile a kernel with deadline elevator as module
  2. Boot into it, make sure the elevator is selected
(I used elevator=deadline in the kernel command line)
  3. Insert a FAT formatted mass storage device in an USB2 port
 Observe io scheduler deadline registered
  4. Pull the stick out, wait a moment, and either shutdown or just
 and press alt-sysrq-W:

Indeed.  I just tried booting into a kernel that has the deadline
elevator built-in, not a module.  Even then, when I specified
elevator=deadline on the boot command line, the system hung up
partway through booting.  Hard to tell exactly where, because it
occurred shortly after the switching from VGA to the framebuffer
driver, so the screen was completely blank.

When I get a chance, I'll try it on another machine where I can use a 
serial console.

 That makes it clear. The elevator probably has scheduled work
 which cannot finish waiting on a lock and scsi_remove_host()
 wants to flush work.

What is the work and why can't it finish?  Or rather, how can we 
figure these things out?  According to what Alex wrote, the blocked 
task doesn't show up in the Alt-SysRq-W listing.

And don't forget that the listing shows scsi_remove_host() blocks
waiting to acquire the host's scan_mutex.  Not waiting for work to be
flushed.  This casts doubt on your explanation.

 This is not a USB problem. You need to involve the SCSI people.
 khubd just stops working because disconnects are processed
 in its context and the removal deadlocks.

Then why would building the deadline elevator as a module make any
difference?  Or does it make a difference?

Alex, if the elevator is made static instead, do you still see the same 
behavior when the USB drive is removed?

Also, are there any mounted filesystems on the drive when you unplug
it?

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-13 Thread Ming Lei
On Mon, Jan 14, 2013 at 1:42 AM, Alex Riesen raa.l...@gmail.com wrote:

 1. Compile a kernel with deadline elevator as module
 2. Boot into it, make sure the elevator is selected
   (I used elevator=deadline in the kernel command line)
 3. Insert a FAT formatted mass storage device in an USB2 port
Observe io scheduler deadline registered
 4. Pull the stick out, wait a moment, and either shutdown or just
and press alt-sysrq-W:

I can reproduce the problem too on an EHCI-only system (Pandaboard)
when the deadline elevator is built as a module; there is no such problem
in the built-in case. This is still the case on 3.8-rc3.

The dmesg log follows:

[   85.665679] usb 1-1.2.2: new high-speed USB device number 5 using ehci-omap
[   85.784423] usb 1-1.2.2: default language 0x0409
[   85.790008] usb 1-1.2.2: udev 5, busnum 1, minor = 4
[   85.790039] usb 1-1.2.2: New USB device found, idVendor=0951, idProduct=1624
[   85.790039] usb 1-1.2.2: New USB device strings: Mfr=1, Product=2,
SerialNumber=3
[   85.790069] usb 1-1.2.2: Product: DataTraveler G2
[   85.790069] usb 1-1.2.2: Manufacturer: Kingston
[   85.790100] usb 1-1.2.2: SerialNumber: 0019E06C5346EA41D071
[   85.790100] device: '1-1.2.2': device_add
[   85.790344] bus: 'usb': add device 1-1.2.2
[   85.790405] PM: Adding info for usb:1-1.2.2
[   85.790740] bus: 'usb': driver_probe_device: matched device 1-1.2.2
with driver usb
[   85.790771] bus: 'usb': really_probe: probing driver usb with device 1-1.2.2
[   85.790802] usb 1-1.2.2: usb_probe_device
[   85.790832] usb 1-1.2.2: configuration #1 chosen from 1 choice
[   85.791076] usb 1-1.2.2: adding 1-1.2.2:1.0 (config #1, interface 0)
[   85.791076] device: '1-1.2.2:1.0': device_add
[   85.791137] bus: 'usb': add device 1-1.2.2:1.0
[   85.791168] PM: Adding info for usb:1-1.2.2:1.0
[   85.791442] device: 'ep_81': device_add
[   85.791564] PM: Adding info for No Bus:ep_81
[   85.791564] device: 'ep_02': device_add
[   85.791687] PM: Adding info for No Bus:ep_02
[   85.791687] driver: '1-1.2.2': driver_bound: bound to device 'usb'
[   85.791717] bus: 'usb': really_probe: bound device 1-1.2.2 to driver usb
[   85.791748] PM: Moving platform:musb-hdrc.0.auto to end of list
[   85.791748] device: 'ep_00': device_add
[   85.791778] platform musb-hdrc.0.auto: Retrying from deferred list
[   85.791839] PM: Adding info for No Bus:ep_00
[   85.791839] bus: 'platform': driver_probe_device: matched device musb-hdrc.0.auto with driver musb-hdrc
[   85.791839] bus: 'platform': really_probe: probing driver musb-hdrc with device musb-hdrc.0.auto
[   85.791870] hub 1-1.2:1.0: state 7 ports 4 chg  evt 0004
[   85.791900] unable to find transceiver of type USB2 PHY
[   85.797454] HS USB OTG: no transceiver configured
[   85.802703] musb-hdrc musb-hdrc.0.auto: musb_init_controller failed with status -517
[   85.811157] platform musb-hdrc.0.auto: Driver musb-hdrc requests probe deferral
[   85.811187] platform musb-hdrc.0.auto: Added to deferred list
[   85.811218] PM: Moving platform:twl6030_usb to end of list
[   85.811218] platform twl6030_usb: Retrying from deferred list
[   85.811279] bus: 'platform': driver_probe_device: matched device twl6030_usb with driver twl6030_usb
[   85.811279] bus: 'platform': really_probe: probing driver twl6030_usb with device twl6030_usb
[   85.811309] twl6030_usb twl6030_usb: phy not ready, deferring probe
[   85.811462] platform twl6030_usb: Driver twl6030_usb requests probe deferral
[   85.811462] platform twl6030_usb: Added to deferred list
[   85.883331] Initializing USB Mass Storage driver...
[   85.883361] bus: 'usb': add driver usb-storage
[   85.883453] bus: 'usb': driver_probe_device: matched device 1-1.2.2:1.0 with driver usb-storage
[   85.883483] bus: 'usb': really_probe: probing driver usb-storage with device 1-1.2.2:1.0
[   85.883514] usb-storage 1-1.2.2:1.0: usb_probe_interface
[   85.883544] usb-storage 1-1.2.2:1.0: usb_probe_interface - got id
[   85.884094] scsi0 : usb-storage 1-1.2.2:1.0
[   85.884155] device: 'host0': device_add
[   85.884185] bus: 'scsi': add device host0
[   85.884246] PM: Adding info for scsi:host0
[   85.884552] device: 'host0': device_add
[   85.884674] PM: Adding info for No Bus:host0
[   85.884948] driver: '1-1.2.2:1.0': driver_bound: bound to device 'usb-storage'
[   85.884979] bus: 'usb': really_probe: bound device 1-1.2.2:1.0 to driver usb-storage
[   85.884979] PM: Moving platform:musb-hdrc.0.auto to end of list
[   85.885009] platform musb-hdrc.0.auto: Retrying from deferred list
[   85.885070] bus: 'platform': driver_probe_device: matched device musb-hdrc.0.auto with driver musb-hdrc
[   85.885070] bus: 'platform': really_probe: probing driver musb-hdrc with device musb-hdrc.0.auto
[   85.885131] unable to find transceiver of type USB2 PHY
[   85.886230] usbcore: registered new interface driver usb-storage
[   85.886230] USB Mass Storage support registered.
[   85.890655] HS USB OTG: no transceiver configured
[   85.895660] musb-hdrc musb-hdrc.0.auto: 

Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-12 Thread Lan Tianyu

On January 12, 2013, 15:48:59, Alex Riesen wrote:

On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:

Hi,

the USB stick (a Cruzer Titanium 2GB) was not recognized at any of
the USB ports of this system (a System76 lemu4 laptop, XHCI device)
after it was removed. If I attempt to insert it again in any of the
ports (one of the two USB3, or the USB2) the LED on the stick lights
up briefly and goes off again. There are no media detection messages in
the dmesg output, only those from the first time:

 usb 1-1.2: new high-speed USB device number 3 using ehci-pci
 usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
 usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
 usb 1-1.2: Product: U3 Titanium
 usb 1-1.2: Manufacturer: SanDisk Corporation
 usb 1-1.2: SerialNumber: 187A3A60F1E9
 scsi6 : usb-storage 1-1.2:1.0
 io scheduler deadline registered (default)
 usb 1-1.2: USB disconnect, device number 3

The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost
reproduce the problem later in a simplified setup (init=/bin/bash) on
USB3 ports by inserting and removing the stick quickly. Almost - because
the USB3 ports recovered after some time, while the USB2 port never
experienced the problem.


One more detail: I usually use the noop elevator. That time it was
the deadline. And I just reproduced it easily with deadline.

Can you provide the dmesg output from a kernel built with
CONFIG_USB_DEBUG enabled? That would be helpful.


--
Best regards
Tianyu Lan


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-12 Thread Alan Stern
On Sat, 12 Jan 2013, Alex Riesen wrote:

 On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:
  Hi,
 
  the USB stick (a Cruzer Titanium 2GB) was not recognized at any of
  the USB ports of this system (a System76 lemu4 laptop, XHCI device)
  after it was removed. If I attempt to insert it again in any of the
  ports (one of the two USB3, or the USB2) the LED on the stick lights
  up briefly and goes off again. There are no media detection messages in
  the dmesg output, only those from the first time:

To make testing simpler, use only the USB-2 ports.  The xHCI driver is 
not as mature as the EHCI driver.

   usb 1-1.2: new high-speed USB device number 3 using ehci-pci
   usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
   usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
   usb 1-1.2: Product: U3 Titanium
   usb 1-1.2: Manufacturer: SanDisk Corporation
   usb 1-1.2: SerialNumber: 187A3A60F1E9
   scsi6 : usb-storage 1-1.2:1.0
   io scheduler deadline registered (default)
   usb 1-1.2: USB disconnect, device number 3
 
  The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost
  reproduce the problem later in a simplified setup (init=/bin/bash) on
  USB3 ports by inserting and removing the stick quickly. Almost - because
  the USB3 ports recovered after some time, while the USB2 port never
  experienced the problem.

For testing, use a kernel with CONFIG_USB_DEBUG and CONFIG_PRINTK_TIME 
enabled.  Do the following:

After a normal boot, run dmesg -C to clear the log buffer.

Then plug in the stick.  After a couple of seconds, type Alt-SysRq-W.

Then unplug the stick.  After a couple of seconds, type Alt-SysRq-W 
again.

Then collect the output from dmesg and post it.
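
Once the log has been saved, the hung-task warnings can be picked out of it mechanically. The following is only a minimal sketch in Python: the helper name `blocked_tasks` and the file name in the usage note are made up for illustration, and the pattern assumes the usual "INFO: task NAME:PID blocked for more than N seconds" wording of the kernel's hung-task detector.

```python
import re

# Assumed wording of the kernel's hung-task warning as seen in dmesg output.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def blocked_tasks(dmesg_text):
    """Return (task name, pid, seconds) for each hung-task warning found.

    dmesg_text is the saved dmesg output as one string. Lines that do not
    contain a hung-task warning are ignored.
    """
    found = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            found.append(
                (match.group("comm"),
                 int(match.group("pid")),
                 int(match.group("secs")))
            )
    return found
```

For example, `blocked_tasks(open("dmesg.txt").read())` would list the blocked tasks recorded in a saved log, where `dmesg.txt` is a hypothetical file name.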

 One more detail: I usually use the noop elevator. That time it was
 the deadline. And I just reproduced it easily with deadline.

I doubt the elevator has anything to do with this.

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-12 Thread Alex Riesen
On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
 On Sat, 12 Jan 2013, Alex Riesen wrote:
 One more detail: I usually use the noop elevator. That time it was
 the deadline. And I just reproduced it easily with deadline.

 I doubt the elevator has anything to do with this.

But it looks like it does: just using the deadline elevator is a sure way
to reproduce the bug. The system always recovers (sometimes after a while)
with noop.


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-12 Thread Alex Riesen
On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
 On Sat, 12 Jan 2013, Alex Riesen wrote:

 On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:
 
  the USB stick (a Cruzer Titanium 2GB) was not recognized at any of
  the USB ports of this system (a System76 lemu4 laptop, XHCI device)
  after it was removed. [...]

 To make testing simpler, use only the USB-2 ports.  The xHCI driver is
 not as mature as the EHCI driver.

I used the USB2 port, but enabled debugging for xHCI too, since it is in
the same machine and, as you say, not as mature. There are some traces
from it, even though I didn't touch the USB3 ports.
Might be unrelated, but just in case...

  The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost

For the record, I just retested: the problem persists with 3.7.1.

  reproduce the problem later in a simplified setup (init=/bin/bash) on
  USB3 ports by inserting and removing the stick quickly. Almost - because
  the USB3 ports recovered after some time, while the USB2 port never
  experienced the problem.

 For testing, use a kernel with CONFIG_USB_DEBUG and CONFIG_PRINTK_TIME
 enabled.  Do the following:

 After a normal boot, run dmesg -C to clear the log buffer.

 Then plug in the stick.  After a couple of seconds, type Alt-SysRq-W.

 Then unplug the stick.  After a couple of seconds, type Alt-SysRq-W
 again.

 Then collect the output from dmesg and post it.

Attached. The remount in the middle is me remounting a SATA device to
save the dmesg output in case the system crashes hard.


dmesg2.bz2
Description: BZip2 compressed data


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-12 Thread Alex Riesen
On Sat, Jan 12, 2013 at 8:39 PM, Alex Riesen raa.l...@gmail.com wrote:
 On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
 On Sat, 12 Jan 2013, Alex Riesen wrote:
 One more detail: I usually use the noop elevator. That time it was
 the deadline. And I just reproduced it easily with deadline.

 I doubt the elevator has anything to do with this.

 But it looks like it does: just using the deadline elevator is a sure way
 to reproduce the bug. The system always recovers (sometimes after a while)
 with noop.

And no, it does not. Not by itself, but the fact that the deadline elevator
was compiled as a module certainly helped!

This explains the hanging modprobe in the dmesg output (the part after device
connect). I still wonder: why didn't it freeze at boot, when mounting the SATA
devices (the root, /var, and /home are on an SSD connected by SATA)? And why
did khubd hang at reboot?

Anyway, building the elevator in the kernel avoids the problem. Sorry for
not spotting this earlier.
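
A quick way to tell the two configurations apart is to look at the kernel config. The following is only a rough sketch in Python: the helper name `elevator_build_mode` is made up, it assumes the config text is available (for example from /boot/config-<version>, or from /proc/config.gz if the kernel provides it), and it assumes the option is named CONFIG_IOSCHED_DEADLINE.

```python
def elevator_build_mode(config_text, symbol="CONFIG_IOSCHED_DEADLINE"):
    """Return 'builtin', 'module', or 'disabled' for a kernel config option.

    config_text is the kernel .config content as one string. The default
    symbol name is an assumption about how the deadline elevator option is
    named in this kernel.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line == symbol + "=y":
            return "builtin"
        if line == symbol + "=m":
            return "module"
    # "# CONFIG_FOO is not set" or a missing line both mean the option is off.
    return "disabled"
```

A result of 'module' would correspond to the setup that hangs here, and 'builtin' to the one that does not.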

Now, who would be interested in handling this kind of misconfiguration ...


Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-12 Thread Alan Stern
On Sat, 12 Jan 2013, Alex Riesen wrote:

 On Sat, Jan 12, 2013 at 8:39 PM, Alex Riesen raa.l...@gmail.com wrote:
  On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern st...@rowland.harvard.edu wrote:
  On Sat, 12 Jan 2013, Alex Riesen wrote:
  One more detail: I usually use the noop elevator. That time it was
  the deadline. And I just reproduced it easily with deadline.
 
  I doubt the elevator has anything to do with this.
 
  But it looks like it does: just using the deadline elevator is a sure way
  to reproduce the bug. The system always recovers (sometimes after a while)
  with noop.
 
 And no, it does not. Not by itself, but the fact that the deadline elevator
 was compiled as a module certainly helped!
 
 This explains the hanging modprobe in the dmesg output (the part after device
 connect). I still wonder: why didn't it freeze at boot, when mounting the SATA
 devices (the root, /var, and /home are on an SSD connected by SATA)? And why
 did khubd hang at reboot?
 
 Anyway, building the elevator in the kernel avoids the problem. Sorry for
 not spotting this earlier.
 
 Now, who would be interested in handling this kind of misconfiguration ...

So the whole thing was a false alarm?

Maybe you should report to the block-layer maintainers that it's 
possible to mess up the system by building an elevator as a module.  
That sounds like the sort of thing they'd be interested to hear.

Alan Stern



Re: USB device cannot be reconnected and khubd blocked for more than 120 seconds

2013-01-11 Thread Alex Riesen
On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen raa.l...@gmail.com wrote:
 Hi,

 the USB stick (a Cruzer Titanium 2GB) was not recognized at any of
 the USB ports of this system (a System76 lemu4 laptop, XHCI device)
 after it was removed. If I attempt to insert it again in any of the
 ports (one of the two USB3, or the USB2) the LED on the stick lights
 up briefly and goes off again. There are no media detection messages in
 the dmesg output, only those from the first time:

  usb 1-1.2: new high-speed USB device number 3 using ehci-pci
  usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
  usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
  usb 1-1.2: Product: U3 Titanium
  usb 1-1.2: Manufacturer: SanDisk Corporation
  usb 1-1.2: SerialNumber: 187A3A60F1E9
  scsi6 : usb-storage 1-1.2:1.0
  io scheduler deadline registered (default)
  usb 1-1.2: USB disconnect, device number 3

 The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost
 reproduce the problem later in a simplified setup (init=/bin/bash) on
 USB3 ports by inserting and removing the stick quickly. Almost - because
 the USB3 ports recovered after some time, while the USB2 port never
 experienced the problem.

One more detail: I usually use the noop elevator. That time it was
the deadline. And I just reproduced it easily with deadline.