Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-17 Thread Tejun Heo
Hello,

On Thu, Apr 17, 2014 at 11:05:53AM +0800, Li Zhong wrote:
> It seems to me cpu_add_remove_lock is always taken after
> device_hotplug_lock.
> 
> So if cpu_add_remove_lock has been acquired by device removing process,
> then it means the other online/offline process couldn't successfully try
> lock device_hotplug_lock, and will release s_active with a restart
> syscall error;
> 
> if cpu_add_remove_lock has been acquired by online/offline process, then
> it should already hold device_hotlug_lock, and keeps the device removing
> process waiting at device_hotplug_lock. So online/offline process could
> release the lock, and finally release s_active soon. 

I see.  That's kinda nasty tho and lockdep of course doesn't know
about it and generates spurious warnings.

> But after some further thinking, I seem to understand your point.
> s_active has lock order problem with the other series of hotplug related
> locks, so it's better to take s_active out of the dependency chain,
> rather than the first of the other series of locks? like you suggested
> below.

Yeah, I think that'd be the right thing to do and we can revert the
previous convolution.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-17 Thread Tejun Heo
Hello,

On Thu, Apr 17, 2014 at 11:05:53AM +0800, Li Zhong wrote:
 It seems to me cpu_add_remove_lock is always taken after
 device_hotplug_lock.
 
 So if cpu_add_remove_lock has been acquired by device removing process,
 then it means the other online/offline process couldn't successfully try
 lock device_hotplug_lock, and will release s_active with a restart
 syscall error;
 
 if cpu_add_remove_lock has been acquired by online/offline process, then
 it should already hold device_hotlug_lock, and keeps the device removing
 process waiting at device_hotplug_lock. So online/offline process could
 release the lock, and finally release s_active soon. 

I see.  That's kinda nasty tho and lockdep of course doesn't know
about it and generates spurious warnings.

 But after some further thinking, I seem to understand your point.
 s_active has lock order problem with the other series of hotplug related
 locks, so it's better to take s_active out of the dependency chain,
 rather than the first of the other series of locks? like you suggested
 below.

Yeah, I think that'd be the right thing to do and we can revert the
previous convolution.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-16 Thread Li Zhong
On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote:
> > > If so, that is
> > > an actually possible deadlock, no?
> > 
> > Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
> > lock_device_hotplug_sysfs() to return a restart syscall error if not
> > able to try lock the device_hotplug_lock. That also requires the device
> > removing code path to take the device_hotplug_lock. 
> 
> But that patch only takes out device_hotplug_lock out of the
> dependency graph and does nothing for cpu_add_remove_lock.  It seems
> to be that there still is a deadlock condition involving s_active and
> cpu_add_remove_lock.  Am I missing something here?

It seems to me cpu_add_remove_lock is always taken after
device_hotplug_lock.

So if cpu_add_remove_lock has been acquired by device removing process,
then it means the other online/offline process couldn't successfully try
lock device_hotplug_lock, and will release s_active with a restart
syscall error;

if cpu_add_remove_lock has been acquired by online/offline process, then
it should already hold device_hotlug_lock, and keeps the device removing
process waiting at device_hotplug_lock. So online/offline process could
release the lock, and finally release s_active soon. 

But after some further thinking, I seem to understand your point.
s_active has lock order problem with the other series of hotplug related
locks, so it's better to take s_active out of the dependency chain,
rather than the first of the other series of locks? like you suggested
below.

> 
> Now that kernfs has a proper mechanism to deal with it, wouldn't it
> make more sense to replace 5e33bc41 with prper s_active protection
> breaking?

I'll try this way and send you the code for review.

Thanks,
Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-16 Thread Tejun Heo
Hello,

On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote:
> > If so, that is
> > an actually possible deadlock, no?
> 
> Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
> lock_device_hotplug_sysfs() to return a restart syscall error if not
> able to try lock the device_hotplug_lock. That also requires the device
> removing code path to take the device_hotplug_lock. 

But that patch only takes out device_hotplug_lock out of the
dependency graph and does nothing for cpu_add_remove_lock.  It seems
to be that there still is a deadlock condition involving s_active and
cpu_add_remove_lock.  Am I missing something here?

Now that kernfs has a proper mechanism to deal with it, wouldn't it
make more sense to replace 5e33bc41 with prper s_active protection
breaking?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-16 Thread Tejun Heo
Hello,

On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote:
  If so, that is
  an actually possible deadlock, no?
 
 Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
 lock_device_hotplug_sysfs() to return a restart syscall error if not
 able to try lock the device_hotplug_lock. That also requires the device
 removing code path to take the device_hotplug_lock. 

But that patch only takes out device_hotplug_lock out of the
dependency graph and does nothing for cpu_add_remove_lock.  It seems
to be that there still is a deadlock condition involving s_active and
cpu_add_remove_lock.  Am I missing something here?

Now that kernfs has a proper mechanism to deal with it, wouldn't it
make more sense to replace 5e33bc41 with prper s_active protection
breaking?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-16 Thread Li Zhong
On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote:
 Hello,
 
 On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote:
   If so, that is
   an actually possible deadlock, no?
  
  Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
  lock_device_hotplug_sysfs() to return a restart syscall error if not
  able to try lock the device_hotplug_lock. That also requires the device
  removing code path to take the device_hotplug_lock. 
 
 But that patch only takes out device_hotplug_lock out of the
 dependency graph and does nothing for cpu_add_remove_lock.  It seems
 to be that there still is a deadlock condition involving s_active and
 cpu_add_remove_lock.  Am I missing something here?

It seems to me cpu_add_remove_lock is always taken after
device_hotplug_lock.

So if cpu_add_remove_lock has been acquired by device removing process,
then it means the other online/offline process couldn't successfully try
lock device_hotplug_lock, and will release s_active with a restart
syscall error;

if cpu_add_remove_lock has been acquired by online/offline process, then
it should already hold device_hotlug_lock, and keeps the device removing
process waiting at device_hotplug_lock. So online/offline process could
release the lock, and finally release s_active soon. 

But after some further thinking, I seem to understand your point.
s_active has lock order problem with the other series of hotplug related
locks, so it's better to take s_active out of the dependency chain,
rather than the first of the other series of locks? like you suggested
below.

 
 Now that kernfs has a proper mechanism to deal with it, wouldn't it
 make more sense to replace 5e33bc41 with prper s_active protection
 breaking?

I'll try this way and send you the code for review.

Thanks,
Zhong

 
 Thanks.
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-15 Thread Li Zhong
On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote:
> > / *
> >   * This process might deadlock with another process trying to 
> >   * remove this device:
> >   * This process holding the s_active of "online" attribute, and tries 
> >   * to online/offline the device with some locks protecting hotplug.
> >   * Device removing process holding some locks protecting hotplug, and 
> >   * tries to remove the "online" attribute, waiting for the s_active to
> >   * be released. 
> >   *
> >   * The deadlock described above should be solved with
> >   * lock_device_hotplug_sysfs(). We temporarily drop the active 
> >   * protection here to avoid some lockdep warnings. 
> >   *
> >   * If device_hotplug_lock is forgotten to be used when removing
> >   * device(possibly some very simple device even don't need this lock?),
> >   * @dev could go away any time after dropping the active protection. 
> >   * So increase its ref count before dropping active protection. 
> >   * Though invoking device_{on|off}line() on a removed device seems
> >   * unreasonable, it should be less disastrous than playing with freed
> >   * @dev. Also, we might be able to have some mechanism abort 
> >   * device_{on|off}line() if @dev already removed.
> >   */
> 
> Hmmm... I'm not sure I fully understand the problem.  Does the code
> ever try to remove "online" while holding cpu_add_remove_lock and,
> when written 0, online knob grabs cpu_add_remove_lock?  

Yes.

In acpi_processor_remove(), cpu_maps_update_begin() is called to hold
cpu_add_remove_lock, and then arch_unregister_cpu calls
unregister_cpu(), which will try to remove dir cpu1 including "online".

while written 0 to online, cpu_down() will also try to grab
cpu_add_remove_lock with cpu_maps_update_begin().

> If so, that is
> an actually possible deadlock, no?

Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
lock_device_hotplug_sysfs() to return a restart syscall error if not
able to try lock the device_hotplug_lock. That also requires the device
removing code path to take the device_hotplug_lock. 

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-15 Thread Tejun Heo
Hello,

On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote:
> / *
>   * This process might deadlock with another process trying to 
>   * remove this device:
>   * This process holding the s_active of "online" attribute, and tries 
>   * to online/offline the device with some locks protecting hotplug.
>   * Device removing process holding some locks protecting hotplug, and 
>   * tries to remove the "online" attribute, waiting for the s_active to
>   * be released. 
>   *
>   * The deadlock described above should be solved with
>   * lock_device_hotplug_sysfs(). We temporarily drop the active 
>   * protection here to avoid some lockdep warnings. 
>   *
>   * If device_hotplug_lock is forgotten to be used when removing
>   * device(possibly some very simple device even don't need this lock?),
>   * @dev could go away any time after dropping the active protection. 
>   * So increase its ref count before dropping active protection. 
>   * Though invoking device_{on|off}line() on a removed device seems
>   * unreasonable, it should be less disastrous than playing with freed
>   * @dev. Also, we might be able to have some mechanism abort 
>   * device_{on|off}line() if @dev already removed.
>   */

Hmmm... I'm not sure I fully understand the problem.  Does the code
ever try to remove "online" while holding cpu_add_remove_lock and,
when written 0, online knob grabs cpu_add_remove_lock?  If so, that is
an actually possible deadlock, no?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-15 Thread Tejun Heo
Hello,

On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote:
 / *
   * This process might deadlock with another process trying to 
   * remove this device:
   * This process holding the s_active of online attribute, and tries 
   * to online/offline the device with some locks protecting hotplug.
   * Device removing process holding some locks protecting hotplug, and 
   * tries to remove the online attribute, waiting for the s_active to
   * be released. 
   *
   * The deadlock described above should be solved with
   * lock_device_hotplug_sysfs(). We temporarily drop the active 
   * protection here to avoid some lockdep warnings. 
   *
   * If device_hotplug_lock is forgotten to be used when removing
   * device(possibly some very simple device even don't need this lock?),
   * @dev could go away any time after dropping the active protection. 
   * So increase its ref count before dropping active protection. 
   * Though invoking device_{on|off}line() on a removed device seems
   * unreasonable, it should be less disastrous than playing with freed
   * @dev. Also, we might be able to have some mechanism abort 
   * device_{on|off}line() if @dev already removed.
   */

Hmmm... I'm not sure I fully understand the problem.  Does the code
ever try to remove online while holding cpu_add_remove_lock and,
when written 0, online knob grabs cpu_add_remove_lock?  If so, that is
an actually possible deadlock, no?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-15 Thread Li Zhong
On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote:
 Hello,
 
 On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote:
  / *
* This process might deadlock with another process trying to 
* remove this device:
* This process holding the s_active of online attribute, and tries 
* to online/offline the device with some locks protecting hotplug.
* Device removing process holding some locks protecting hotplug, and 
* tries to remove the online attribute, waiting for the s_active to
* be released. 
*
* The deadlock described above should be solved with
* lock_device_hotplug_sysfs(). We temporarily drop the active 
* protection here to avoid some lockdep warnings. 
*
* If device_hotplug_lock is forgotten to be used when removing
* device(possibly some very simple device even don't need this lock?),
* @dev could go away any time after dropping the active protection. 
* So increase its ref count before dropping active protection. 
* Though invoking device_{on|off}line() on a removed device seems
* unreasonable, it should be less disastrous than playing with freed
* @dev. Also, we might be able to have some mechanism abort 
* device_{on|off}line() if @dev already removed.
*/
 
 Hmmm... I'm not sure I fully understand the problem.  Does the code
 ever try to remove online while holding cpu_add_remove_lock and,
 when written 0, online knob grabs cpu_add_remove_lock?  

Yes.

In acpi_processor_remove(), cpu_maps_update_begin() is called to hold
cpu_add_remove_lock, and then arch_unregister_cpu calls
unregister_cpu(), which will try to remove dir cpu1 including online.

while written 0 to online, cpu_down() will also try to grab
cpu_add_remove_lock with cpu_maps_update_begin().

 If so, that is
 an actually possible deadlock, no?

Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
lock_device_hotplug_sysfs() to return a restart syscall error if not
able to try lock the device_hotplug_lock. That also requires the device
removing code path to take the device_hotplug_lock. 

Thanks, Zhong

 
 Thanks.
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Li Zhong
On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote:
> > @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
> > device_attribute *attr,
> >  {
> > bool val;
> > int ret;
> > +   struct kernfs_node *kn;
> >  
> > ret = strtobool(buf, );
> > if (ret < 0)
> > @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct 
> > device_attribute *attr,
> > if (ret)
> > return ret;
> >  
> > +   kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
> > +   if (WARN_ON_ONCE(!kn))
> > +   goto out;
> > +
> > +   get_device(dev);
> > +   kernfs_break_active_protection(kn);
> > ret = val ? device_online(dev) : device_offline(dev);
> > +   kernfs_unbreak_active_protection(kn);
> > +   put_device(dev);
> > +
> > +   kernfs_put(kn);
> > +
> > +out:
> > unlock_device_hotplug();
> > return ret < 0 ? ret : count;
> >  }
> 
> Can you please add comment explainin why this is being down?  As it
> currently stands, it's quite a bit of complexity without any
> indication what's going on why.  Also, if device_hotplug is locked, is
> it really necessary to get @dev?  Can it go away inbetween?  The code
> snippet looks weird because getting @dev indicates that the device
> might go away without doing it but then it proceeds to invoke
> device_{on|off}line() which probably isn't safe to invoke on a removed
> device.

Hi Tejun, 

I tried to write some draft words here... (I'm not good at writing
it...) Could you please help to have a review and comment? thanks.

/ *
  * This process might deadlock with another process trying to 
  * remove this device:
  * This process holding the s_active of "online" attribute, and tries 
  * to online/offline the device with some locks protecting hotplug.
  * Device removing process holding some locks protecting hotplug, and 
  * tries to remove the "online" attribute, waiting for the s_active to
  * be released. 
  *
  * The deadlock described above should be solved with
  * lock_device_hotplug_sysfs(). We temporarily drop the active 
  * protection here to avoid some lockdep warnings. 
  *
  * If device_hotplug_lock is forgotten to be used when removing
  * device(possibly some very simple device even don't need this lock?),
  * @dev could go away any time after dropping the active protection. 
  * So increase its ref count before dropping active protection. 
  * Though invoking device_{on|off}line() on a removed device seems
  * unreasonable, it should be less disastrous than playing with freed
  * @dev. Also, we might be able to have some mechanism abort 
  * device_{on|off}line() if @dev already removed.
  */

Thanks, Zhong
> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Tejun Heo
Hello,

On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote:
> @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
> device_attribute *attr,
>  {
>   bool val;
>   int ret;
> + struct kernfs_node *kn;
>  
>   ret = strtobool(buf, );
>   if (ret < 0)
> @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct 
> device_attribute *attr,
>   if (ret)
>   return ret;
>  
> + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
> + if (WARN_ON_ONCE(!kn))
> + goto out;
> +
> + get_device(dev);
> + kernfs_break_active_protection(kn);
>   ret = val ? device_online(dev) : device_offline(dev);
> + kernfs_unbreak_active_protection(kn);
> + put_device(dev);
> +
> + kernfs_put(kn);
> +
> +out:
>   unlock_device_hotplug();
>   return ret < 0 ? ret : count;
>  }

Can you please add comment explainin why this is being down?  As it
currently stands, it's quite a bit of complexity without any
indication what's going on why.  Also, if device_hotplug is locked, is
it really necessary to get @dev?  Can it go away inbetween?  The code
snippet looks weird because getting @dev indicates that the device
might go away without doing it but then it proceeds to invoke
device_{on|off}line() which probably isn't safe to invoke on a removed
device.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Li Zhong
I noticed following lockdep warning when trying acpi hot-remove cpus:

[84154.204080] ==
[84154.204080] [ INFO: possible circular locking dependency detected ]
[84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW
[84154.204080] ---
[84154.204080] bash/777 is trying to acquire lock:
[84154.204080]  (cpu_add_remove_lock){+.+.+.}, at: []
cpu_maps_update_begin+0x17/0x20
[84154.213203] 
[84154.213203] but task is already holding lock:
[84154.213203]  (s_active#79){.+}, at: []
kernfs_fop_write+0xe4/0x190
[84154.213203] 
[84154.213203] which lock already depends on the new lock.
[84154.213203] 
[84154.213203] 
[84154.213203] the existing dependency chain (in reverse order) is:
[84154.213203] 
-> #3 (s_active#79){.+}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] __kernfs_remove+0x250/0x310
[84154.213203][] kernfs_remove_by_name_ns
+0x50/0xc0
[84154.213203][] sysfs_remove_file_ns
+0x15/0x20
[84154.213203][] device_remove_file+0x19/0x20
[84154.213203][] device_remove_attrs+0x33/0x80
[84154.213203][] device_del+0x127/0x1c0
[84154.213203][] device_unregister+0x22/0x60
[84154.213203][] unregister_cpu+0x1e/0x40
[84154.213203][] arch_unregister_cpu+0x23/0x30
[84154.213203][] acpi_processor_remove
+0x8d/0xb2
[84154.213203][] acpi_bus_trim+0x5a/0x8d
[84154.213203][] acpi_device_hotplug
+0x1a8/0x3ec
[84154.213203][] acpi_hotplug_work_fn
+0x1f/0x2b
[84154.213203][] process_one_work+0x1eb/0x6b0
[84154.213203][] worker_thread+0x11b/0x370
[84154.213203][] kthread+0xe4/0x100
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #2 (cpu_hotplug.lock#2){+.+.+.}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_hotplug_begin+0x4f/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable
+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #1 (cpu_hotplug.lock){++}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] cpu_hotplug_begin+0x41/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable
+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #0 (cpu_add_remove_lock){+.+.+.}:
[84154.213203][] __lock_acquire+0x1f2a/0x1f60
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_maps_update_begin
+0x17/0x20
[84154.213203][] cpu_down+0x1d/0x50
[84154.213203][] cpu_subsys_offline+0x14/0x20
[84154.213203][] device_offline+0xad/0xd0
[84154.213203][] online_store+0x42/0x80
[84154.213203][] dev_attr_store+0x18/0x30
[84154.213203][] sysfs_kf_write+0x49/0x60
[84154.213203][] kernfs_fop_write+0x109/0x190
[84154.213203][] vfs_write+0xbe/0x1c0
[84154.213203][] SyS_write+0x52/0xb0
[84154.213203][] tracesys+0xd0/0xd5
[84154.213203] 
[84154.213203] other info that might help us debug this:
[84154.213203] 
[84154.213203] Chain exists of:
  cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79

[84154.213203]  Possible unsafe locking scenario:
[84154.213203] 
[84154.213203]CPU0CPU1
[84154.213203]
[84154.213203]   lock(s_active#79);
[84154.213203]lock(cpu_hotplug.lock#2);
[84154.213203]lock(s_active#79);
[84154.213203]   lock(cpu_add_remove_lock);
[84154.213203] 
[84154.213203]  *** DEADLOCK ***
.

The deadlock itself seems already fixed in commit 5e33bc41. 

As Tejun suggested, to avoid this lockdep warning,
kernfs_break_active_protection() is used before online/offline callbacks
to take the s_active lock out of the dependency chain during
online/offline operations. 

Signed-off-by: Li Zhong 
---
 drivers/base/core.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0dd6528..b313ad7 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
 {
bool val;
int ret;
+   struct kernfs_node *kn;
 
ret = strtobool(buf, );
if (ret < 0)
@@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
if (ret)
return 

Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Li Zhong
On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote:
 Hello,
 
 On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote:
  @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
  device_attribute *attr,
   {
  bool val;
  int ret;
  +   struct kernfs_node *kn;
   
  ret = strtobool(buf, val);
  if (ret  0)
  @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct 
  device_attribute *attr,
  if (ret)
  return ret;
   
  +   kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name);
  +   if (WARN_ON_ONCE(!kn))
  +   goto out;
  +
  +   get_device(dev);
  +   kernfs_break_active_protection(kn);
  ret = val ? device_online(dev) : device_offline(dev);
  +   kernfs_unbreak_active_protection(kn);
  +   put_device(dev);
  +
  +   kernfs_put(kn);
  +
  +out:
  unlock_device_hotplug();
  return ret  0 ? ret : count;
   }
 
 Can you please add comment explainin why this is being down?  As it
 currently stands, it's quite a bit of complexity without any
 indication what's going on why.  Also, if device_hotplug is locked, is
 it really necessary to get @dev?  Can it go away inbetween?  The code
 snippet looks weird because getting @dev indicates that the device
 might go away without doing it but then it proceeds to invoke
 device_{on|off}line() which probably isn't safe to invoke on a removed
 device.

Hi Tejun, 

I tried to write some draft words here... (I'm not good at writing
it...) Could you please help to have a review and comment? thanks.

/ *
  * This process might deadlock with another process trying to 
  * remove this device:
  * This process holding the s_active of online attribute, and tries 
  * to online/offline the device with some locks protecting hotplug.
  * Device removing process holding some locks protecting hotplug, and 
  * tries to remove the online attribute, waiting for the s_active to
  * be released. 
  *
  * The deadlock described above should be solved with
  * lock_device_hotplug_sysfs(). We temporarily drop the active 
  * protection here to avoid some lockdep warnings. 
  *
  * If device_hotplug_lock is forgotten to be used when removing
  * device(possibly some very simple device even don't need this lock?),
  * @dev could go away any time after dropping the active protection. 
  * So increase its ref count before dropping active protection. 
  * Though invoking device_{on|off}line() on a removed device seems
  * unreasonable, it should be less disastrous than playing with freed
  * @dev. Also, we might be able to have some mechanism abort 
  * device_{on|off}line() if @dev already removed.
  */

Thanks, Zhong
 
 Thanks.
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Li Zhong
I noticed following lockdep warning when trying acpi hot-remove cpus:

[84154.204080] ==
[84154.204080] [ INFO: possible circular locking dependency detected ]
[84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW
[84154.204080] ---
[84154.204080] bash/777 is trying to acquire lock:
[84154.204080]  (cpu_add_remove_lock){+.+.+.}, at: [810664a7]
cpu_maps_update_begin+0x17/0x20
[84154.213203] 
[84154.213203] but task is already holding lock:
[84154.213203]  (s_active#79){.+}, at: [81256e14]
kernfs_fop_write+0xe4/0x190
[84154.213203] 
[84154.213203] which lock already depends on the new lock.
[84154.213203] 
[84154.213203] 
[84154.213203] the existing dependency chain (in reverse order) is:
[84154.213203] 
- #3 (s_active#79){.+}:
[84154.213203][810c408b] lock_acquire+0x9b/0x1d0
[84154.213203][812552e0] __kernfs_remove+0x250/0x310
[84154.213203][81256470] kernfs_remove_by_name_ns
+0x50/0xc0
[84154.213203][81257af5] sysfs_remove_file_ns
+0x15/0x20
[84154.213203][813df339] device_remove_file+0x19/0x20
[84154.213203][813dff33] device_remove_attrs+0x33/0x80
[84154.213203][813e00a7] device_del+0x127/0x1c0
[84154.213203][813e0162] device_unregister+0x22/0x60
[84154.213203][813e66de] unregister_cpu+0x1e/0x40
[84154.213203][81009a33] arch_unregister_cpu+0x23/0x30
[84154.213203][8136f619] acpi_processor_remove
+0x8d/0xb2
[84154.213203][8136cfff] acpi_bus_trim+0x5a/0x8d
[84154.213203][8136ec3b] acpi_device_hotplug
+0x1a8/0x3ec
[84154.213203][81369002] acpi_hotplug_work_fn
+0x1f/0x2b
[84154.213203][8108754b] process_one_work+0x1eb/0x6b0
[84154.213203][81087e9b] worker_thread+0x11b/0x370
[84154.213203][81090324] kthread+0xe4/0x100
[84154.213203][815d2f2c] ret_from_fork+0x7c/0xb0
[84154.213203] 
- #2 (cpu_hotplug.lock#2){+.+.+.}:
[84154.213203][810c408b] lock_acquire+0x9b/0x1d0
[84154.213203][815c7700] mutex_lock_nested+0x50/0x3c0
[84154.213203][810665bf] cpu_hotplug_begin+0x4f/0x80
[84154.213203][810f] _cpu_up+0x3f/0x160
[84154.213203][810667f9] cpu_up+0x69/0x80
[84154.213203][81b18f14] smp_init+0x60/0x8c
[84154.213203][81b00fd8] kernel_init_freeable
+0x126/0x23b
[84154.213203][815b4a3e] kernel_init+0xe/0xf0
[84154.213203][815d2f2c] ret_from_fork+0x7c/0xb0
[84154.213203] 
- #1 (cpu_hotplug.lock){++}:
[84154.213203][810c408b] lock_acquire+0x9b/0x1d0
[84154.213203][810665b1] cpu_hotplug_begin+0x41/0x80
[84154.213203][810f] _cpu_up+0x3f/0x160
[84154.213203][810667f9] cpu_up+0x69/0x80
[84154.213203][81b18f14] smp_init+0x60/0x8c
[84154.213203][81b00fd8] kernel_init_freeable
+0x126/0x23b
[84154.213203][815b4a3e] kernel_init+0xe/0xf0
[84154.213203][815d2f2c] ret_from_fork+0x7c/0xb0
[84154.213203] 
- #0 (cpu_add_remove_lock){+.+.+.}:
[84154.213203][810c397a] __lock_acquire+0x1f2a/0x1f60
[84154.213203][810c408b] lock_acquire+0x9b/0x1d0
[84154.213203][815c7700] mutex_lock_nested+0x50/0x3c0
[84154.213203][810664a7] cpu_maps_update_begin
+0x17/0x20
[84154.213203][815b582d] cpu_down+0x1d/0x50
[84154.213203][813e63b4] cpu_subsys_offline+0x14/0x20
[84154.213203][813e145d] device_offline+0xad/0xd0
[84154.213203][813e1562] online_store+0x42/0x80
[84154.213203][813deab8] dev_attr_store+0x18/0x30
[84154.213203][81257bb9] sysfs_kf_write+0x49/0x60
[84154.213203][81256e39] kernfs_fop_write+0x109/0x190
[84154.213203][811d15be] vfs_write+0xbe/0x1c0
[84154.213203][811d1a52] SyS_write+0x52/0xb0
[84154.213203][815d3162] tracesys+0xd0/0xd5
[84154.213203] 
[84154.213203] other info that might help us debug this:
[84154.213203] 
[84154.213203] Chain exists of:
  cpu_add_remove_lock -- cpu_hotplug.lock#2 -- s_active#79

[84154.213203]  Possible unsafe locking scenario:
[84154.213203] 
[84154.213203]CPU0CPU1
[84154.213203]
[84154.213203]   lock(s_active#79);
[84154.213203]lock(cpu_hotplug.lock#2);
[84154.213203]lock(s_active#79);
[84154.213203]   lock(cpu_add_remove_lock);
[84154.213203] 
[84154.213203]  *** DEADLOCK ***
.

The deadlock itself seems already fixed in commit 5e33bc41. 

As Tejun 

Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Tejun Heo
Hello,

On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote:
 @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
 device_attribute *attr,
  {
   bool val;
   int ret;
 + struct kernfs_node *kn;
  
   ret = strtobool(buf, val);
   if (ret  0)
 @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct 
 device_attribute *attr,
   if (ret)
   return ret;
  
 + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name);
 + if (WARN_ON_ONCE(!kn))
 + goto out;
 +
 + get_device(dev);
 + kernfs_break_active_protection(kn);
   ret = val ? device_online(dev) : device_offline(dev);
 + kernfs_unbreak_active_protection(kn);
 + put_device(dev);
 +
 + kernfs_put(kn);
 +
 +out:
   unlock_device_hotplug();
   return ret  0 ? ret : count;
  }

Can you please add comment explainin why this is being down?  As it
currently stands, it's quite a bit of complexity without any
indication what's going on why.  Also, if device_hotplug is locked, is
it really necessary to get @dev?  Can it go away inbetween?  The code
snippet looks weird because getting @dev indicates that the device
might go away without doing it but then it proceeds to invoke
device_{on|off}line() which probably isn't safe to invoke on a removed
device.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/