Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Thu, Apr 17, 2014 at 11:05:53AM +0800, Li Zhong wrote: > It seems to me cpu_add_remove_lock is always taken after > device_hotplug_lock. > > So if cpu_add_remove_lock has been acquired by device removing process, > then it means the other online/offline process couldn't successfully try > lock device_hotplug_lock, and will release s_active with a restart > syscall error; > > if cpu_add_remove_lock has been acquired by online/offline process, then > it should already hold device_hotlug_lock, and keeps the device removing > process waiting at device_hotplug_lock. So online/offline process could > release the lock, and finally release s_active soon. I see. That's kinda nasty tho and lockdep of course doesn't know about it and generates spurious warnings. > But after some further thinking, I seem to understand your point. > s_active has lock order problem with the other series of hotplug related > locks, so it's better to take s_active out of the dependency chain, > rather than the first of the other series of locks? like you suggested > below. Yeah, I think that'd be the right thing to do and we can revert the previous convolution. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Thu, Apr 17, 2014 at 11:05:53AM +0800, Li Zhong wrote: It seems to me cpu_add_remove_lock is always taken after device_hotplug_lock. So if cpu_add_remove_lock has been acquired by device removing process, then it means the other online/offline process couldn't successfully try lock device_hotplug_lock, and will release s_active with a restart syscall error; if cpu_add_remove_lock has been acquired by online/offline process, then it should already hold device_hotlug_lock, and keeps the device removing process waiting at device_hotplug_lock. So online/offline process could release the lock, and finally release s_active soon. I see. That's kinda nasty tho and lockdep of course doesn't know about it and generates spurious warnings. But after some further thinking, I seem to understand your point. s_active has lock order problem with the other series of hotplug related locks, so it's better to take s_active out of the dependency chain, rather than the first of the other series of locks? like you suggested below. Yeah, I think that'd be the right thing to do and we can revert the previous convolution. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote: > Hello, > > On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote: > > > If so, that is > > > an actually possible deadlock, no? > > > > Yes, but it seems to me that it is solved in commit 5e33bc41, which uses > > lock_device_hotplug_sysfs() to return a restart syscall error if not > > able to try lock the device_hotplug_lock. That also requires the device > > removing code path to take the device_hotplug_lock. > > But that patch only takes out device_hotplug_lock out of the > dependency graph and does nothing for cpu_add_remove_lock. It seems > to be that there still is a deadlock condition involving s_active and > cpu_add_remove_lock. Am I missing something here? It seems to me cpu_add_remove_lock is always taken after device_hotplug_lock. So if cpu_add_remove_lock has been acquired by device removing process, then it means the other online/offline process couldn't successfully try lock device_hotplug_lock, and will release s_active with a restart syscall error; if cpu_add_remove_lock has been acquired by online/offline process, then it should already hold device_hotlug_lock, and keeps the device removing process waiting at device_hotplug_lock. So online/offline process could release the lock, and finally release s_active soon. But after some further thinking, I seem to understand your point. s_active has lock order problem with the other series of hotplug related locks, so it's better to take s_active out of the dependency chain, rather than the first of the other series of locks? like you suggested below. > > Now that kernfs has a proper mechanism to deal with it, wouldn't it > make more sense to replace 5e33bc41 with prper s_active protection > breaking? I'll try this way and send you the code for review. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote: > > If so, that is > > an actually possible deadlock, no? > > Yes, but it seems to me that it is solved in commit 5e33bc41, which uses > lock_device_hotplug_sysfs() to return a restart syscall error if not > able to try lock the device_hotplug_lock. That also requires the device > removing code path to take the device_hotplug_lock. But that patch only takes out device_hotplug_lock out of the dependency graph and does nothing for cpu_add_remove_lock. It seems to be that there still is a deadlock condition involving s_active and cpu_add_remove_lock. Am I missing something here? Now that kernfs has a proper mechanism to deal with it, wouldn't it make more sense to replace 5e33bc41 with prper s_active protection breaking? Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote: If so, that is an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. But that patch only takes out device_hotplug_lock out of the dependency graph and does nothing for cpu_add_remove_lock. It seems to be that there still is a deadlock condition involving s_active and cpu_add_remove_lock. Am I missing something here? Now that kernfs has a proper mechanism to deal with it, wouldn't it make more sense to replace 5e33bc41 with prper s_active protection breaking? Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote: Hello, On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote: If so, that is an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. But that patch only takes out device_hotplug_lock out of the dependency graph and does nothing for cpu_add_remove_lock. It seems to be that there still is a deadlock condition involving s_active and cpu_add_remove_lock. Am I missing something here? It seems to me cpu_add_remove_lock is always taken after device_hotplug_lock. So if cpu_add_remove_lock has been acquired by device removing process, then it means the other online/offline process couldn't successfully try lock device_hotplug_lock, and will release s_active with a restart syscall error; if cpu_add_remove_lock has been acquired by online/offline process, then it should already hold device_hotlug_lock, and keeps the device removing process waiting at device_hotplug_lock. So online/offline process could release the lock, and finally release s_active soon. But after some further thinking, I seem to understand your point. s_active has lock order problem with the other series of hotplug related locks, so it's better to take s_active out of the dependency chain, rather than the first of the other series of locks? like you suggested below. Now that kernfs has a proper mechanism to deal with it, wouldn't it make more sense to replace 5e33bc41 with prper s_active protection breaking? I'll try this way and send you the code for review. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote: > Hello, > > On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote: > > / * > > * This process might deadlock with another process trying to > > * remove this device: > > * This process holding the s_active of "online" attribute, and tries > > * to online/offline the device with some locks protecting hotplug. > > * Device removing process holding some locks protecting hotplug, and > > * tries to remove the "online" attribute, waiting for the s_active to > > * be released. > > * > > * The deadlock described above should be solved with > > * lock_device_hotplug_sysfs(). We temporarily drop the active > > * protection here to avoid some lockdep warnings. > > * > > * If device_hotplug_lock is forgotten to be used when removing > > * device(possibly some very simple device even don't need this lock?), > > * @dev could go away any time after dropping the active protection. > > * So increase its ref count before dropping active protection. > > * Though invoking device_{on|off}line() on a removed device seems > > * unreasonable, it should be less disastrous than playing with freed > > * @dev. Also, we might be able to have some mechanism abort > > * device_{on|off}line() if @dev already removed. > > */ > > Hmmm... I'm not sure I fully understand the problem. Does the code > ever try to remove "online" while holding cpu_add_remove_lock and, > when written 0, online knob grabs cpu_add_remove_lock? Yes. In acpi_processor_remove(), cpu_maps_update_begin() is called to hold cpu_add_remove_lock, and then arch_unregister_cpu calls unregister_cpu(), which will try to remove dir cpu1 including "online". while written 0 to online, cpu_down() will also try to grab cpu_add_remove_lock with cpu_maps_update_begin(). > If so, that is > an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote: > / * > * This process might deadlock with another process trying to > * remove this device: > * This process holding the s_active of "online" attribute, and tries > * to online/offline the device with some locks protecting hotplug. > * Device removing process holding some locks protecting hotplug, and > * tries to remove the "online" attribute, waiting for the s_active to > * be released. > * > * The deadlock described above should be solved with > * lock_device_hotplug_sysfs(). We temporarily drop the active > * protection here to avoid some lockdep warnings. > * > * If device_hotplug_lock is forgotten to be used when removing > * device(possibly some very simple device even don't need this lock?), > * @dev could go away any time after dropping the active protection. > * So increase its ref count before dropping active protection. > * Though invoking device_{on|off}line() on a removed device seems > * unreasonable, it should be less disastrous than playing with freed > * @dev. Also, we might be able to have some mechanism abort > * device_{on|off}line() if @dev already removed. > */ Hmmm... I'm not sure I fully understand the problem. Does the code ever try to remove "online" while holding cpu_add_remove_lock and, when written 0, online knob grabs cpu_add_remove_lock? If so, that is an actually possible deadlock, no? Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote: / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of online attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the online attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Hmmm... I'm not sure I fully understand the problem. Does the code ever try to remove online while holding cpu_add_remove_lock and, when written 0, online knob grabs cpu_add_remove_lock? If so, that is an actually possible deadlock, no? Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote: Hello, On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote: / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of online attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the online attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Hmmm... I'm not sure I fully understand the problem. Does the code ever try to remove online while holding cpu_add_remove_lock and, when written 0, online knob grabs cpu_add_remove_lock? Yes. In acpi_processor_remove(), cpu_maps_update_begin() is called to hold cpu_add_remove_lock, and then arch_unregister_cpu calls unregister_cpu(), which will try to remove dir cpu1 including online. while written 0 to online, cpu_down() will also try to grab cpu_add_remove_lock with cpu_maps_update_begin(). If so, that is an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote: > Hello, > > On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote: > > @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct > > device_attribute *attr, > > { > > bool val; > > int ret; > > + struct kernfs_node *kn; > > > > ret = strtobool(buf, ); > > if (ret < 0) > > @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct > > device_attribute *attr, > > if (ret) > > return ret; > > > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > > + if (WARN_ON_ONCE(!kn)) > > + goto out; > > + > > + get_device(dev); > > + kernfs_break_active_protection(kn); > > ret = val ? device_online(dev) : device_offline(dev); > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + > > + kernfs_put(kn); > > + > > +out: > > unlock_device_hotplug(); > > return ret < 0 ? ret : count; > > } > > Can you please add comment explainin why this is being down? As it > currently stands, it's quite a bit of complexity without any > indication what's going on why. Also, if device_hotplug is locked, is > it really necessary to get @dev? Can it go away inbetween? The code > snippet looks weird because getting @dev indicates that the device > might go away without doing it but then it proceeds to invoke > device_{on|off}line() which probably isn't safe to invoke on a removed > device. Hi Tejun, I tried to write some draft words here... (I'm not good at writing it...) Could you please help to have a review and comment? thanks. / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of "online" attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the "online" attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote: > @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct > device_attribute *attr, > { > bool val; > int ret; > + struct kernfs_node *kn; > > ret = strtobool(buf, ); > if (ret < 0) > @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct > device_attribute *attr, > if (ret) > return ret; > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > + if (WARN_ON_ONCE(!kn)) > + goto out; > + > + get_device(dev); > + kernfs_break_active_protection(kn); > ret = val ? device_online(dev) : device_offline(dev); > + kernfs_unbreak_active_protection(kn); > + put_device(dev); > + > + kernfs_put(kn); > + > +out: > unlock_device_hotplug(); > return ret < 0 ? ret : count; > } Can you please add comment explainin why this is being down? As it currently stands, it's quite a bit of complexity without any indication what's going on why. Also, if device_hotplug is locked, is it really necessary to get @dev? Can it go away inbetween? The code snippet looks weird because getting @dev indicates that the device might go away without doing it but then it proceeds to invoke device_{on|off}line() which probably isn't safe to invoke on a removed device. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] -> #3 (s_active#79){.+}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] __kernfs_remove+0x250/0x310 [84154.213203][] kernfs_remove_by_name_ns +0x50/0xc0 [84154.213203][] sysfs_remove_file_ns +0x15/0x20 [84154.213203][] device_remove_file+0x19/0x20 [84154.213203][] device_remove_attrs+0x33/0x80 [84154.213203][] device_del+0x127/0x1c0 [84154.213203][] device_unregister+0x22/0x60 [84154.213203][] unregister_cpu+0x1e/0x40 [84154.213203][] arch_unregister_cpu+0x23/0x30 [84154.213203][] acpi_processor_remove +0x8d/0xb2 [84154.213203][] acpi_bus_trim+0x5a/0x8d [84154.213203][] acpi_device_hotplug +0x1a8/0x3ec [84154.213203][] acpi_hotplug_work_fn +0x1f/0x2b [84154.213203][] process_one_work+0x1eb/0x6b0 [84154.213203][] worker_thread+0x11b/0x370 [84154.213203][] kthread+0xe4/0x100 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_hotplug_begin+0x4f/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #1 (cpu_hotplug.lock){++}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] cpu_hotplug_begin+0x41/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][] __lock_acquire+0x1f2a/0x1f60 [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_maps_update_begin +0x17/0x20 [84154.213203][] cpu_down+0x1d/0x50 [84154.213203][] cpu_subsys_offline+0x14/0x20 [84154.213203][] device_offline+0xad/0xd0 [84154.213203][] online_store+0x42/0x80 [84154.213203][] dev_attr_store+0x18/0x30 [84154.213203][] sysfs_kf_write+0x49/0x60 [84154.213203][] kernfs_fop_write+0x109/0x190 [84154.213203][] vfs_write+0xbe/0x1c0 [84154.213203][] SyS_write+0x52/0xb0 [84154.213203][] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. As Tejun suggested, to avoid this lockdep warning, kernfs_break_active_protection() is used before online/offline callbacks to take the s_active lock out of the dependency chain during online/offline operations. Signed-off-by: Li Zhong --- drivers/base/core.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..b313ad7 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, ); if (ret < 0) @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret) return
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote: Hello, On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote: @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, val); if (ret 0) @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret) return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + goto out; + + get_device(dev); + kernfs_break_active_protection(kn); ret = val ? device_online(dev) : device_offline(dev); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + +out: unlock_device_hotplug(); return ret 0 ? ret : count; } Can you please add comment explainin why this is being down? As it currently stands, it's quite a bit of complexity without any indication what's going on why. Also, if device_hotplug is locked, is it really necessary to get @dev? Can it go away inbetween? The code snippet looks weird because getting @dev indicates that the device might go away without doing it but then it proceeds to invoke device_{on|off}line() which probably isn't safe to invoke on a removed device. Hi Tejun, I tried to write some draft words here... (I'm not good at writing it...) Could you please help to have a review and comment? thanks. / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of online attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the online attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [810664a7] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [81256e14] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] - #3 (s_active#79){.+}: [84154.213203][810c408b] lock_acquire+0x9b/0x1d0 [84154.213203][812552e0] __kernfs_remove+0x250/0x310 [84154.213203][81256470] kernfs_remove_by_name_ns +0x50/0xc0 [84154.213203][81257af5] sysfs_remove_file_ns +0x15/0x20 [84154.213203][813df339] device_remove_file+0x19/0x20 [84154.213203][813dff33] device_remove_attrs+0x33/0x80 [84154.213203][813e00a7] device_del+0x127/0x1c0 [84154.213203][813e0162] device_unregister+0x22/0x60 [84154.213203][813e66de] unregister_cpu+0x1e/0x40 [84154.213203][81009a33] arch_unregister_cpu+0x23/0x30 [84154.213203][8136f619] acpi_processor_remove +0x8d/0xb2 [84154.213203][8136cfff] acpi_bus_trim+0x5a/0x8d [84154.213203][8136ec3b] acpi_device_hotplug +0x1a8/0x3ec [84154.213203][81369002] acpi_hotplug_work_fn +0x1f/0x2b [84154.213203][8108754b] process_one_work+0x1eb/0x6b0 [84154.213203][81087e9b] worker_thread+0x11b/0x370 [84154.213203][81090324] kthread+0xe4/0x100 [84154.213203][815d2f2c] ret_from_fork+0x7c/0xb0 [84154.213203] - #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][810c408b] lock_acquire+0x9b/0x1d0 [84154.213203][815c7700] mutex_lock_nested+0x50/0x3c0 [84154.213203][810665bf] cpu_hotplug_begin+0x4f/0x80 [84154.213203][810f] _cpu_up+0x3f/0x160 [84154.213203][810667f9] cpu_up+0x69/0x80 [84154.213203][81b18f14] smp_init+0x60/0x8c [84154.213203][81b00fd8] kernel_init_freeable +0x126/0x23b [84154.213203][815b4a3e] kernel_init+0xe/0xf0 [84154.213203][815d2f2c] ret_from_fork+0x7c/0xb0 [84154.213203] - #1 (cpu_hotplug.lock){++}: [84154.213203][810c408b] lock_acquire+0x9b/0x1d0 [84154.213203][810665b1] cpu_hotplug_begin+0x41/0x80 [84154.213203][810f] _cpu_up+0x3f/0x160 [84154.213203][810667f9] cpu_up+0x69/0x80 [84154.213203][81b18f14] smp_init+0x60/0x8c [84154.213203][81b00fd8] kernel_init_freeable +0x126/0x23b [84154.213203][815b4a3e] kernel_init+0xe/0xf0 [84154.213203][815d2f2c] ret_from_fork+0x7c/0xb0 [84154.213203] - #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][810c397a] __lock_acquire+0x1f2a/0x1f60 [84154.213203][810c408b] lock_acquire+0x9b/0x1d0 [84154.213203][815c7700] mutex_lock_nested+0x50/0x3c0 [84154.213203][810664a7] cpu_maps_update_begin +0x17/0x20 [84154.213203][815b582d] cpu_down+0x1d/0x50 [84154.213203][813e63b4] cpu_subsys_offline+0x14/0x20 [84154.213203][813e145d] device_offline+0xad/0xd0 [84154.213203][813e1562] online_store+0x42/0x80 [84154.213203][813deab8] dev_attr_store+0x18/0x30 [84154.213203][81257bb9] sysfs_kf_write+0x49/0x60 [84154.213203][81256e39] kernfs_fop_write+0x109/0x190 [84154.213203][811d15be] vfs_write+0xbe/0x1c0 [84154.213203][811d1a52] SyS_write+0x52/0xb0 [84154.213203][815d3162] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock -- cpu_hotplug.lock#2 -- s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. As Tejun
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
Hello, On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote: @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, val); if (ret 0) @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret) return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + goto out; + + get_device(dev); + kernfs_break_active_protection(kn); ret = val ? device_online(dev) : device_offline(dev); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + +out: unlock_device_hotplug(); return ret 0 ? ret : count; } Can you please add comment explainin why this is being down? As it currently stands, it's quite a bit of complexity without any indication what's going on why. Also, if device_hotplug is locked, is it really necessary to get @dev? Can it go away inbetween? The code snippet looks weird because getting @dev indicates that the device might go away without doing it but then it proceeds to invoke device_{on|off}line() which probably isn't safe to invoke on a removed device. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/