[RFC PATCH v6 2/2] Implement lock_device_hotplug_sysfs() by breaking active protection
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The active protection is there to keep the file alive by blocking deletion while operations are on-going in the file. This blocking creates a dependency loop when an operation running off a sysfs knob ends up grabbing a lock which may be held while removing the said sysfs knob. So the problem here is the order of s_active, and the series of hotplug related locks. commit 5e33bc41 solves it by taking out the first of the series of hoplug related locks, device_hotplug_lock, with a try lock. And if that try lock fails, it returns a restart syscall error, and drops s_active temporarily to allow the other process to remove the sysfs knob. This doesn't help with lockdep warnings reported against s_active and other hotplug related locks in the series. This patch instead tries to take s_active out of the lock dependency graph using the active protection breaking mechanism. lock_device_hotplug_sysfs() function name is kept here, three more arguments are added, dev, attr, to help find the kernfs_node corresponding to the sysfs knob (which should always be able to be found, WARN if not); kn_out is used to record the found kernfs_node for use in unlock. unlock_device_hotplug_sysfs() is created to help cleanup the operations(get reference, break active, etc) done in lock_device_hotplug_sysfs(), as well as unlock the lock. As we break the active protection here, the device removing process might remove the sysfs entries after that point. So after we grab the device_hotplug_lock, we can check whether that really happens. If so, we should abort and not invoke the online/offline callbacks anymore. lockdep assertion is added in try_offline_node() to make sure device_hotplug_lock is grabbed while removing cpu or memory. As devcie_hotplug_lock is used to protect hotplug operations with multiple subsystems. So there might be devices that won't be removed together with devices of other subsystems, and thus device_hotplug_lock is not needed. In this case, their specific online/offline callbacks needs to check whether the device has been removed. Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Greg Kroah-Hartman Cc: Toshi Kani Signed-off-by: Li Zhong --- drivers/base/core.c| 93 +- drivers/base/memory.c | 5 +-- include/linux/device.h | 7 +++- mm/memory_hotplug.c| 2 ++ 4 files changed, 95 insertions(+), 12 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 20da3ad..329f3b4 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -61,14 +61,87 @@ void unlock_device_hotplug(void) mutex_unlock(_hotplug_lock); } -int lock_device_hotplug_sysfs(void) +void device_hotplug_assert_held(void) { - if (mutex_trylock(_hotplug_lock)) - return 0; + lockdep_assert_held(_hotplug_lock); +} + +/* + * device_hotplug_lock is used to protect hotplugs which may involve multiple + * subsystems. This _sysfs() version is created to solve the deadlock with + * s_active of the device attribute file, which could be used to online/offline + * the device. + * + * Returns 0 on success. -ENODEV if the @dev is removed after we break the + * active protection. (We should always be able to find the kernfs_node + * related to the attribute, WARN if not, and also return -ENODEV). + * + * When success(return 0), the found kernfs_node is passed out in @kn_out, + * else pass NULL to @kn_out. + */ +int lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr, + struct kernfs_node **kn_out) +{ + struct kernfs_node *kn; + + *kn_out = kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* "online" attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. +*/ + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* For devices which need device_hoplug_lock to protect the +* online/offline operation, i.e. it might be removed together with +* devices in some other subsystems, we can check here whether the +* device has been removed, so we don't call device_{on|off}line +* against removed device. lockdep assertion is added in +* try_offline_node() to make sure the lock is held to remove cpu +* or memory. +* +* If some devices won't be removed together with devices in other +
[RFC PATCH v6 1/2 ] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
While auditing the usage of lock_device_hotplug_sysfs() for implementing it in another way in following patch, it seems to me that the code here is to add/remove device, and the files probe/release themselves won't be removed. lock_device_hotplug_sysfs() is used to solve the deadlocks of s_active of the "online" file and the hotplug related locks. So if one process trying to remove the device, the sysfs files for this device, including "online" will be removed after device_hotplug_lock is grabbed. So for another process which might concurrently write to the "online" file, after it grabs s_active of the "online", it couldn't directly lock device_hotplug_lock, instead, in *_sysfs() version, try lock is used, and if it fails, the whole syscall is restarted. But here the s_active of "probe/release" are not involved in the above process, as they are not to be removed. So we could use lock_device_hotplug() here. Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Greg Kroah-Hartman Cc: Toshi Kani Signed-off-by: Li Zhong --- drivers/base/cpu.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 006b1bc..9483225 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_probe(buf, count); @@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_release(buf, count); -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
viOn Tue, 2014-04-22 at 11:34 +0800, Li Zhong wrote: > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > > Hello, > > > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > > > Proper /** function comment would be nice. > > Ok, will try to write some in next version. > > > > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > > +struct device_attribute *attr) > > > > I can see why you did this but let's please not require the user of > > this function to see how the thing is working internally. Let's > > return int and keep track of (or look up again) the kernfs_node > > internally. > > Ok, it also makes the prototype of lock and unlock look more consistent > and comfortable. When trying to do an new version of the patch, I find that if the device is really removed, then we couldn't look up using the parent, and attribute name again in unlock. So I guess maybe I could add one more argument, e.g. kn_out,:q to track this kernfs node. Code will be posted soon for your review. Thanks, Zhong > > > > > > { > > ... > > > + /* > > > + * We assume device_hotplug_lock must be acquired before removing > > > > Is this assumption true? If so, can we add lockdep assertions in > > places to verify and enforce this? If not, aren't we just feeling > > good when the reality is broken? > > It seems not true ... I think there are devices that don't have the > online/offline concept, we just need to add it, remove it, like ethernet > cards. > > Maybe we could change the comments above, like: > /* We assume device_hotplug_lock must be acquired before >* removing devices, which have online/offline sysfs knob, >* and some locks are needed to serialize the online/offline >* callbacks and device removing. ... > ? > > And we could add lockdep assertions in cpu and memory related code? e.g. > remove_memory(), unregister_cpu() > > Currently, remove_memory() has comments for the function: > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > * and online/offline operations before this call, as required by > * try_offline_node(). > */ > > maybe it could be removed with the lockdep assertion. > > > ... > > > > Function comment please. > > OK. > > Thanks, Zhong > > > > +void unlock_device_hotplug_sysfs(struct device *dev, > > > + struct kernfs_node *kn) > > > +{ > > > + unlock_device_hotplug(); > > > + kernfs_unbreak_active_protection(kn); > > > + put_device(dev); > > > + kernfs_put(kn); > > > } > > > > Thanks. > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
viOn Tue, 2014-04-22 at 11:34 +0800, Li Zhong wrote: On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: Hello, On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: Proper /** function comment would be nice. Ok, will try to write some in next version. +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, +struct device_attribute *attr) I can see why you did this but let's please not require the user of this function to see how the thing is working internally. Let's return int and keep track of (or look up again) the kernfs_node internally. Ok, it also makes the prototype of lock and unlock look more consistent and comfortable. When trying to do an new version of the patch, I find that if the device is really removed, then we couldn't look up using the parent, and attribute name again in unlock. So I guess maybe I could add one more argument, e.g. kn_out,:q to track this kernfs node. Code will be posted soon for your review. Thanks, Zhong { ... + /* + * We assume device_hotplug_lock must be acquired before removing Is this assumption true? If so, can we add lockdep assertions in places to verify and enforce this? If not, aren't we just feeling good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? And we could add lockdep assertions in cpu and memory related code? e.g. remove_memory(), unregister_cpu() Currently, remove_memory() has comments for the function: * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ maybe it could be removed with the lockdep assertion. ... Function comment please. OK. Thanks, Zhong +void unlock_device_hotplug_sysfs(struct device *dev, + struct kernfs_node *kn) +{ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); } Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v6 1/2 ] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
While auditing the usage of lock_device_hotplug_sysfs() for implementing it in another way in following patch, it seems to me that the code here is to add/remove device, and the files probe/release themselves won't be removed. lock_device_hotplug_sysfs() is used to solve the deadlocks of s_active of the online file and the hotplug related locks. So if one process trying to remove the device, the sysfs files for this device, including online will be removed after device_hotplug_lock is grabbed. So for another process which might concurrently write to the online file, after it grabs s_active of the online, it couldn't directly lock device_hotplug_lock, instead, in *_sysfs() version, try lock is used, and if it fails, the whole syscall is restarted. But here the s_active of probe/release are not involved in the above process, as they are not to be removed. So we could use lock_device_hotplug() here. Cc: Tejun Heo t...@kernel.org Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com Cc: Greg Kroah-Hartman gre...@linuxfoundation.org Cc: Toshi Kani toshi.k...@hp.com Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/cpu.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 006b1bc..9483225 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_probe(buf, count); @@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_release(buf, count); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v6 2/2] Implement lock_device_hotplug_sysfs() by breaking active protection
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The active protection is there to keep the file alive by blocking deletion while operations are on-going in the file. This blocking creates a dependency loop when an operation running off a sysfs knob ends up grabbing a lock which may be held while removing the said sysfs knob. So the problem here is the order of s_active, and the series of hotplug related locks. commit 5e33bc41 solves it by taking out the first of the series of hoplug related locks, device_hotplug_lock, with a try lock. And if that try lock fails, it returns a restart syscall error, and drops s_active temporarily to allow the other process to remove the sysfs knob. This doesn't help with lockdep warnings reported against s_active and other hotplug related locks in the series. This patch instead tries to take s_active out of the lock dependency graph using the active protection breaking mechanism. lock_device_hotplug_sysfs() function name is kept here, three more arguments are added, dev, attr, to help find the kernfs_node corresponding to the sysfs knob (which should always be able to be found, WARN if not); kn_out is used to record the found kernfs_node for use in unlock. unlock_device_hotplug_sysfs() is created to help cleanup the operations(get reference, break active, etc) done in lock_device_hotplug_sysfs(), as well as unlock the lock. As we break the active protection here, the device removing process might remove the sysfs entries after that point. So after we grab the device_hotplug_lock, we can check whether that really happens. If so, we should abort and not invoke the online/offline callbacks anymore. lockdep assertion is added in try_offline_node() to make sure device_hotplug_lock is grabbed while removing cpu or memory. As devcie_hotplug_lock is used to protect hotplug operations with multiple subsystems. So there might be devices that won't be removed together with devices of other subsystems, and thus device_hotplug_lock is not needed. In this case, their specific online/offline callbacks needs to check whether the device has been removed. Cc: Tejun Heo t...@kernel.org Cc: Rafael J. Wysocki rafael.j.wyso...@intel.com Cc: Greg Kroah-Hartman gre...@linuxfoundation.org Cc: Toshi Kani toshi.k...@hp.com Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/core.c| 93 +- drivers/base/memory.c | 5 +-- include/linux/device.h | 7 +++- mm/memory_hotplug.c| 2 ++ 4 files changed, 95 insertions(+), 12 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 20da3ad..329f3b4 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -61,14 +61,87 @@ void unlock_device_hotplug(void) mutex_unlock(device_hotplug_lock); } -int lock_device_hotplug_sysfs(void) +void device_hotplug_assert_held(void) { - if (mutex_trylock(device_hotplug_lock)) - return 0; + lockdep_assert_held(device_hotplug_lock); +} + +/* + * device_hotplug_lock is used to protect hotplugs which may involve multiple + * subsystems. This _sysfs() version is created to solve the deadlock with + * s_active of the device attribute file, which could be used to online/offline + * the device. + * + * Returns 0 on success. -ENODEV if the @dev is removed after we break the + * active protection. (We should always be able to find the kernfs_node + * related to the attribute, WARN if not, and also return -ENODEV). + * + * When success(return 0), the found kernfs_node is passed out in @kn_out, + * else pass NULL to @kn_out. + */ +int lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr, + struct kernfs_node **kn_out) +{ + struct kernfs_node *kn; + + *kn_out = kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* online attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. +*/ + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* For devices which need device_hoplug_lock to protect the +* online/offline operation, i.e. it might be removed together with +* devices in some other subsystems, we can check here whether the +* device has been removed, so we don't call device_{on|off}line +* against removed device. lockdep assertion is added in +* try_offline_node() to make sure the lock is held to remove cpu
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 09:59 -0400, Alan Stern wrote: > On Fri, 25 Apr 2014, Li Zhong wrote: > > > > No, this isn't self removal. The driver-attribute (not device-attribute) > > > store operation simply grabs a lock that is also held while the driver > > > is being deregistered at module unload. Taking a reference to the module > > > in this case will prevent deregistration while store is running. > > > > > > But it seems like this can be solved for usb-serial by simply not > > > holding the lock while deregistering. > > > > I didn't look carefully about this lock. > > > > But I'm not sure whether there are such requirements for driver > > attributes: > > > > some lock needs be grabbed in the driver attributes store callbacks, and > > the same lock also needs to be grabbed during driver unregister. > > In this case, the lock does _not_ need to be grabbed during driver > unregister. The driver grabs the lock, but it doesn't need to. OK. > > > If we have such requirements currently or in the future, I think they > > could all be solved by breaking active protection after get the module > > reference. > > No! That would be very bad. > > Unloading modules is quite different from unbinding drivers. After the > driver is unbound, its attribute callback routines can continue to run. > But after a driver module has been unloaded, its attribute callback > routines _cannot_ run because they aren't present in memory any more. > > If we allowed a module to be unloaded while one of its callbacks was > running (because active protection was broken), imagine what would > happen... I don't think the module could be unloaded after we increased the module reference counter. Thanks, Zhong > > Alan Stern > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 09:54 -0400, Alan Stern wrote: > On Fri, 25 Apr 2014, Li Zhong wrote: > > > > I don't get why try_module_get() matters here. We can't call into > > > ->store if the object at hand is already destroyed and the underlying > > > module can't go away if the target device is still alive. > > > try_module_get() doesn't actually protect the object. Why does that > > > matter? This is self removal, right? Can you please take a look at > > > kernfs_remove_self()? > > > > This is about one process writing something to driver attributes, and > > one process trying to unload this driver. > > > > I think try_module_get() could detect whether the driver is being > > unloaded, and if not, prevent it from being unloaded, so it could > > protect the object here by not allow the driver to be unloaded. > > That isn't how try_module_get() works. If the module is being > unloaded, try_module_get() simply fails. It does not prevent the > module from being unloaded -- that's why its name begins with "try". Yes, I know that. What I said above is for the case when try_module_get() detects the driver is NOT being unloaded, and increases the reference counter. Thanks, Zhong > > Alan Stern > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Fri, 2014-04-25 at 14:47 +0200, Rafael J. Wysocki wrote: > On 4/25/2014 3:46 AM, Li Zhong wrote: > > On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote: > >> On 4/24/2014 10:59 AM, Li Zhong wrote: > >>> On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: > >>>> On 4/23/2014 4:23 PM, Tejun Heo wrote: > >>>>> Hello, Rafael. > >>>> Hi, > >>>> > >>>>> On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: > >>>>>> Can you please elaborate a bit? > >>>>> Because it can get involved in larger locking dependency issues by > >>>>> joining dependency graphs of two otherwise largely disjoint > >>>>> subsystems. It has potential to create possible deadlocks which don't > >>>>> need to exist. > >>>> Well, I do my best not to add unnecessary locks if that's what you mean. > >>>> > >>>>>> It is there to protect hotplug operations involving multiple devices > >>>>>> (in different subsystems) from racing with each other. Why exactly > >>>>>> is it bad? > >>>>> But why would different subsystems, say cpu and memory, use the same > >>>>> lock? Wouldn't those subsystems already have proper locking inside > >>>>> their own subsystems? > >>>> That locking is not sufficient. > >>>> > >>>>> Why add this additional global lock across multiple subsystems? > >>>> That basically is because of how eject works when it is triggered via > >>>> ACPI. > >>>> > >>>> It is signaled for a device at the top of a subtree. It may be a > >>>> container of some sort and the eject involves everything below that > >>>> device in the ACPI namespace. That may involve multiple subsystem > >>>> (CPUs, memory, PCI host bridge, etc.). > >>>> > >>>> We do that in two steps, offline (which can fail) and eject proper > >>>> (which can't fail and makes all of the involved devices go away). All > >>>> that has to be done in one go with respect to the sysfs-triggered > >>>> offline/online and that's why the lock is there. > >>> Thank you for the education. It's more clear to me now why we need this > >>> lock. > >>> > >>> I still have some small questions about when this lock is needed: > >>> > >>> I could understand sysfs-triggered online is not acceptable when > >>> removing devices in multiple subsystems. But maybe concurrent offline > >>> and remove(with proper per subsystem locks) seems not harmful? > >>> > >>> And if we just want to remove some devices in a specific subsystem, e.g. > >>> like writing /cpu/release, if it just wants to offline and remove some > >>> cpus, then maybe we don't require the device_hotplug_lock to be taken? > >> No and no. > >> > >> If the offline phase fails for any device in the subtree, we roll back > >> the operation > >> and online devices that have already been offlined by it. Also the ACPI > >> hot-addition > >> needs to acquire device_hotplug_lock so that it doesn't race with ejects > >> and so > >> that lock needs to be taken by sysfs-triggered offline too. > > I can understand that hot-addition needs the device_hotplug lock, but > > still not very clear about the offline. > > > > I guess your are describing following scenario: > > > > user A: (trying remove cpu 1 and memory 1-10) > > > > lock_device_hotplug > > offline cpu with cpu locks -- successful > > offline memories with memory locks -- failed, e.g. for memory8 > > online cpu and memory with their locks > > unlock_device_hotplug > > What about if all is successful and CPU1 is gone before > device_hotplug_lock is released? You mean user B will try to offline an already removed cpu1? But I think the cpu subsys locks should be able to handle such situation? > > > user B: (trying offline cpu 1) > > > > offline cpu with cpu locks > > > > But I don't see any problem for user B not taking the device_hotplug > > lock. The result may be different for user B to take or not take the > > lock. But I think it could be seen as concurrent online/offline for cpu1 > > under cpu hotplug locks, which just depends on which is executed last? > > > > Or did I miss something here? > > Yes, yo
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Fri, 2014-04-25 at 08:28 -0400, Tejun Heo wrote: > On Fri, Apr 25, 2014 at 09:56:10AM +0800, Li Zhong wrote: > > > A nests cpu subsys mutex under s_active of the online node. B nests > > > s_active of the online node under the cpu subsys mutex. What am I > > > missing? > > > > From the above two chain, I think the problem is cpu_subsys_mutex and > > s_active(online), which is the deadlock we are trying to solve in patch > > Yeap, that one was what I was thinking about. > > > #2. I seems to me s_active(release) here doesn't have lock issues? > > And your patch doesn't change the situation for online. I'm a little confused here... To clarify for this patch(#1): Currently(before patch#2), the s_active(online) and cpu subsys mutex deadlock is solved by lock_device_hotplug_sysfs(). (lockdep warnings are still there, so we want to change the implementation of lock_device_hotplug_sysfs() in patch #2 to take out of s_active_online, instead of device_hotplug_lock, which is grabbed before cpu subsys mutex). But here, for the probe/release, I think we really don't need to use this _sysfs() version. So it is replaced with lock_device_hotplug() directly to avoid two unnecessary conversions of lock_device_hotplug_sysfs(), that otherwise need to be done in patch #2. Thanks, Zhong > Alright, > thanks a lot for the explanation. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 12:15 +0200, Johan Hovold wrote: > On Fri, Apr 25, 2014 at 10:16:57AM +0800, Li Zhong wrote: > > On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote: > > > > No, this isn't self removal. The driver-attribute (not device-attribute) > > > store operation simply grabs a lock that is also held while the driver > > > is being deregistered at module unload. Taking a reference to the module > > > in this case will prevent deregistration while store is running. > > > > > > But it seems like this can be solved for usb-serial by simply not > > > holding the lock while deregistering. > > > > I didn't look carefully about this lock. > > > > But I'm not sure whether there are such requirements for driver > > attributes: > > > > some lock needs be grabbed in the driver attributes store callbacks, and > > the same lock also needs to be grabbed during driver unregister. > > > > If we have such requirements currently or in the future, I think they > > could all be solved by breaking active protection after get the module > > reference. > > Yes, if we find any such cases, but I don't think it should be done > generally (as in one of your earlier posts). OK, maybe we only need to reconsider this only if we see some more similar lockdep warnings in the future. > > Active protection is usually exactly what prevents the driver from being > deregistered, and would only need to be broken to avoid any ABBA > deadlocks from being reported by lockdep (the module reference would be > enough to prevent the actual deadlock). Yes, maybe try to get the module reference is not bad before writing to driver attributes, as it doesn't make much sense to really call the callbacks for the driver attribute if the driver is being unload. And after we get the reference, it is safe for us to break the active. But if we don't have such real cases(lockdep warnings), we actually don't need to break it. Thanks, Zhong > > Thanks, > Johan > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 12:15 +0200, Johan Hovold wrote: On Fri, Apr 25, 2014 at 10:16:57AM +0800, Li Zhong wrote: On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote: No, this isn't self removal. The driver-attribute (not device-attribute) store operation simply grabs a lock that is also held while the driver is being deregistered at module unload. Taking a reference to the module in this case will prevent deregistration while store is running. But it seems like this can be solved for usb-serial by simply not holding the lock while deregistering. I didn't look carefully about this lock. But I'm not sure whether there are such requirements for driver attributes: some lock needs be grabbed in the driver attributes store callbacks, and the same lock also needs to be grabbed during driver unregister. If we have such requirements currently or in the future, I think they could all be solved by breaking active protection after get the module reference. Yes, if we find any such cases, but I don't think it should be done generally (as in one of your earlier posts). OK, maybe we only need to reconsider this only if we see some more similar lockdep warnings in the future. Active protection is usually exactly what prevents the driver from being deregistered, and would only need to be broken to avoid any ABBA deadlocks from being reported by lockdep (the module reference would be enough to prevent the actual deadlock). Yes, maybe try to get the module reference is not bad before writing to driver attributes, as it doesn't make much sense to really call the callbacks for the driver attribute if the driver is being unload. And after we get the reference, it is safe for us to break the active. But if we don't have such real cases(lockdep warnings), we actually don't need to break it. Thanks, Zhong Thanks, Johan -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Fri, 2014-04-25 at 08:28 -0400, Tejun Heo wrote: On Fri, Apr 25, 2014 at 09:56:10AM +0800, Li Zhong wrote: A nests cpu subsys mutex under s_active of the online node. B nests s_active of the online node under the cpu subsys mutex. What am I missing? From the above two chain, I think the problem is cpu_subsys_mutex and s_active(online), which is the deadlock we are trying to solve in patch Yeap, that one was what I was thinking about. #2. I seems to me s_active(release) here doesn't have lock issues? And your patch doesn't change the situation for online. I'm a little confused here... To clarify for this patch(#1): Currently(before patch#2), the s_active(online) and cpu subsys mutex deadlock is solved by lock_device_hotplug_sysfs(). (lockdep warnings are still there, so we want to change the implementation of lock_device_hotplug_sysfs() in patch #2 to take out of s_active_online, instead of device_hotplug_lock, which is grabbed before cpu subsys mutex). But here, for the probe/release, I think we really don't need to use this _sysfs() version. So it is replaced with lock_device_hotplug() directly to avoid two unnecessary conversions of lock_device_hotplug_sysfs(), that otherwise need to be done in patch #2. Thanks, Zhong Alright, thanks a lot for the explanation. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Fri, 2014-04-25 at 14:47 +0200, Rafael J. Wysocki wrote: On 4/25/2014 3:46 AM, Li Zhong wrote: On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote: On 4/24/2014 10:59 AM, Li Zhong wrote: On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: On 4/23/2014 4:23 PM, Tejun Heo wrote: Hello, Rafael. Hi, On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: Can you please elaborate a bit? Because it can get involved in larger locking dependency issues by joining dependency graphs of two otherwise largely disjoint subsystems. It has potential to create possible deadlocks which don't need to exist. Well, I do my best not to add unnecessary locks if that's what you mean. It is there to protect hotplug operations involving multiple devices (in different subsystems) from racing with each other. Why exactly is it bad? But why would different subsystems, say cpu and memory, use the same lock? Wouldn't those subsystems already have proper locking inside their own subsystems? That locking is not sufficient. Why add this additional global lock across multiple subsystems? That basically is because of how eject works when it is triggered via ACPI. It is signaled for a device at the top of a subtree. It may be a container of some sort and the eject involves everything below that device in the ACPI namespace. That may involve multiple subsystem (CPUs, memory, PCI host bridge, etc.). We do that in two steps, offline (which can fail) and eject proper (which can't fail and makes all of the involved devices go away). All that has to be done in one go with respect to the sysfs-triggered offline/online and that's why the lock is there. Thank you for the education. It's more clear to me now why we need this lock. I still have some small questions about when this lock is needed: I could understand sysfs-triggered online is not acceptable when removing devices in multiple subsystems. But maybe concurrent offline and remove(with proper per subsystem locks) seems not harmful? And if we just want to remove some devices in a specific subsystem, e.g. like writing /cpu/release, if it just wants to offline and remove some cpus, then maybe we don't require the device_hotplug_lock to be taken? No and no. If the offline phase fails for any device in the subtree, we roll back the operation and online devices that have already been offlined by it. Also the ACPI hot-addition needs to acquire device_hotplug_lock so that it doesn't race with ejects and so that lock needs to be taken by sysfs-triggered offline too. I can understand that hot-addition needs the device_hotplug lock, but still not very clear about the offline. I guess your are describing following scenario: user A: (trying remove cpu 1 and memory 1-10) lock_device_hotplug offline cpu with cpu locks -- successful offline memories with memory locks -- failed, e.g. for memory8 online cpu and memory with their locks unlock_device_hotplug What about if all is successful and CPU1 is gone before device_hotplug_lock is released? You mean user B will try to offline an already removed cpu1? But I think the cpu subsys locks should be able to handle such situation? user B: (trying offline cpu 1) offline cpu with cpu locks But I don't see any problem for user B not taking the device_hotplug lock. The result may be different for user B to take or not take the lock. But I think it could be seen as concurrent online/offline for cpu1 under cpu hotplug locks, which just depends on which is executed last? Or did I miss something here? Yes, you could do offline in parallel with user A without taking device_hotplug_lock, but the result may be surprising to user B then. With device _hotplug_lock user B will always see CPU1 off line (or gone) after his offline in this scenario, while without taking the lock user B may sometimes see CPU1 on line after his offline. I don't think that's a good thing. That seems complicated after some more thinking. I think I missed something when describing the steps for A. I think the initial online/offline state needs to be recorded by offline operations in A, so the rollback could be done based on the initial state. If adding the above, then 1) B offline cpu 1 before A offline cpu 1 A could see the initial state of cpu1 as offline, and the rollback would not put cpu1 online again. In the code, I think the check is done at if (!cpu_online(cpu)) return -EINVAL; So the pn-put_online is kept as false. So the result is cpu1 offline. 2) B offline cpu 1 after A offline cpu1 then the rollback would online cpu1 2.1) B offline cpu1 after A rollback The result is cpu1 offline, good. 2.2) B offline cpu1 before A rollback B would see a -EINVAL error, and the result is cpu1 online. I guess this is the case you mentioned
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 09:54 -0400, Alan Stern wrote: On Fri, 25 Apr 2014, Li Zhong wrote: I don't get why try_module_get() matters here. We can't call into -store if the object at hand is already destroyed and the underlying module can't go away if the target device is still alive. try_module_get() doesn't actually protect the object. Why does that matter? This is self removal, right? Can you please take a look at kernfs_remove_self()? This is about one process writing something to driver attributes, and one process trying to unload this driver. I think try_module_get() could detect whether the driver is being unloaded, and if not, prevent it from being unloaded, so it could protect the object here by not allow the driver to be unloaded. That isn't how try_module_get() works. If the module is being unloaded, try_module_get() simply fails. It does not prevent the module from being unloaded -- that's why its name begins with try. Yes, I know that. What I said above is for the case when try_module_get() detects the driver is NOT being unloaded, and increases the reference counter. Thanks, Zhong Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 09:59 -0400, Alan Stern wrote: On Fri, 25 Apr 2014, Li Zhong wrote: No, this isn't self removal. The driver-attribute (not device-attribute) store operation simply grabs a lock that is also held while the driver is being deregistered at module unload. Taking a reference to the module in this case will prevent deregistration while store is running. But it seems like this can be solved for usb-serial by simply not holding the lock while deregistering. I didn't look carefully about this lock. But I'm not sure whether there are such requirements for driver attributes: some lock needs be grabbed in the driver attributes store callbacks, and the same lock also needs to be grabbed during driver unregister. In this case, the lock does _not_ need to be grabbed during driver unregister. The driver grabs the lock, but it doesn't need to. OK. If we have such requirements currently or in the future, I think they could all be solved by breaking active protection after get the module reference. No! That would be very bad. Unloading modules is quite different from unbinding drivers. After the driver is unbound, its attribute callback routines can continue to run. But after a driver module has been unloaded, its attribute callback routines _cannot_ run because they aren't present in memory any more. If we allowed a module to be unloaded while one of its callbacks was running (because active protection was broken), imagine what would happen... I don't think the module could be unloaded after we increased the module reference counter. Thanks, Zhong Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote: > On Thu, Apr 24, 2014 at 10:35:17AM -0400, Tejun Heo wrote: > > On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote: > > > On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: > > > > cc'ing Li Zhong who's working on a simliar issue in the following > > > > thread and quoting whole body. > > > > > > > > http://thread.gmane.org/gmane.linux.kernel/1680706 > > > > > > > > Li, this is another variation of the same problem. Maybe this can be > > > > covered by your work too? > > > > > > It seems to me that it is about write something to driver attribute, and > > > driver unloading. If so, maybe it's not easy to reuse the help functions > > > created for device attribute, and device removing. > > > > > > But I guess the idea to break the active protection could still be > > > applied here: > > > > > > Maybe we could try_module_get() here (like the other option suggested by > > > Johan?), and break active protection if we could get the module, > > > something like below? > > > > I don't get why try_module_get() matters here. We can't call into > > ->store if the object at hand is already destroyed and the underlying > > module can't go away if the target device is still alive. > > try_module_get() doesn't actually protect the object. Why does that > > matter? This is self removal, right? Can you please take a look at > > kernfs_remove_self()? > > No, this isn't self removal. The driver-attribute (not device-attribute) > store operation simply grabs a lock that is also held while the driver > is being deregistered at module unload. Taking a reference to the module > in this case will prevent deregistration while store is running. > > But it seems like this can be solved for usb-serial by simply not > holding the lock while deregistering. I didn't look carefully about this lock. But I'm not sure whether there are such requirements for driver attributes: some lock needs be grabbed in the driver attributes store callbacks, and the same lock also needs to be grabbed during driver unregister. If we have such requirements currently or in the future, I think they could all be solved by breaking active protection after get the module reference. Thanks, Zhong > > Johan > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Thu, 2014-04-24 at 10:35 -0400, Tejun Heo wrote: > On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote: > > On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: > > > cc'ing Li Zhong who's working on a simliar issue in the following > > > thread and quoting whole body. > > > > > > http://thread.gmane.org/gmane.linux.kernel/1680706 > > > > > > Li, this is another variation of the same problem. Maybe this can be > > > covered by your work too? > > > > It seems to me that it is about write something to driver attribute, and > > driver unloading. If so, maybe it's not easy to reuse the help functions > > created for device attribute, and device removing. > > > > But I guess the idea to break the active protection could still be > > applied here: > > > > Maybe we could try_module_get() here (like the other option suggested by > > Johan?), and break active protection if we could get the module, > > something like below? > > I don't get why try_module_get() matters here. We can't call into > ->store if the object at hand is already destroyed and the underlying > module can't go away if the target device is still alive. > try_module_get() doesn't actually protect the object. Why does that > matter? This is self removal, right? Can you please take a look at > kernfs_remove_self()? This is about one process writing something to driver attributes, and one process trying to unload this driver. I think try_module_get() could detect whether the driver is being unloaded, and if not, prevent it from being unloaded, so it could protect the object here by not allow the driver to be unloaded. And if the driver is being unloaded, we could abort the write operation directly. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Thu, 2014-04-24 at 10:32 -0400, Tejun Heo wrote: > Hello, > > On Thu, Apr 24, 2014 at 04:37:23PM +0800, Li Zhong wrote: > > On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote: > > After thinking it harder, I still couldn't see ABBA here ... > > > > the active protection taken here is for "probe/release" which will not > > be waited for removing something like "online" under cpu#? Or my > > assumption that s_active for different files are different locks are > > completely wrong? Or I missed something else? > > I'm probably confused about the locking. I was thinking a scenario > like the following. > > A. CPU on/offline > >grabs s_active protection of online node >grabs cpu subsys mutex >perform on/offline >releases cpu subsys mutex >releases s_active protection of online node so the chain here is s_active(online) -> cpu_subsys_mutex > > B. CPU release > >grabs s_active protection of release node >grabs cpu subsys mutex >performs removal of the CPU >removes the online node >releases cpu subsys mutex >releases s_active protection of release node and the chain here is s_active(release) -> cpu_subsys_mutex -> s_active(online) > > A nests cpu subsys mutex under s_active of the online node. B nests > s_active of the online node under the cpu subsys mutex. What am I > missing? >From the above two chain, I think the problem is cpu_subsys_mutex and s_active(online), which is the deadlock we are trying to solve in patch #2. I seems to me s_active(release) here doesn't have lock issues? Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote: > On 4/24/2014 10:59 AM, Li Zhong wrote: > > On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: > >> On 4/23/2014 4:23 PM, Tejun Heo wrote: > >>> Hello, Rafael. > >> Hi, > >> > >>> On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: > >>>> Can you please elaborate a bit? > >>> Because it can get involved in larger locking dependency issues by > >>> joining dependency graphs of two otherwise largely disjoint > >>> subsystems. It has potential to create possible deadlocks which don't > >>> need to exist. > >> Well, I do my best not to add unnecessary locks if that's what you mean. > >> > >>>> It is there to protect hotplug operations involving multiple devices > >>>> (in different subsystems) from racing with each other. Why exactly > >>>> is it bad? > >>> But why would different subsystems, say cpu and memory, use the same > >>> lock? Wouldn't those subsystems already have proper locking inside > >>> their own subsystems? > >> That locking is not sufficient. > >> > >>> Why add this additional global lock across multiple subsystems? > >> That basically is because of how eject works when it is triggered via ACPI. > >> > >> It is signaled for a device at the top of a subtree. It may be a > >> container of some sort and the eject involves everything below that > >> device in the ACPI namespace. That may involve multiple subsystem > >> (CPUs, memory, PCI host bridge, etc.). > >> > >> We do that in two steps, offline (which can fail) and eject proper > >> (which can't fail and makes all of the involved devices go away). All > >> that has to be done in one go with respect to the sysfs-triggered > >> offline/online and that's why the lock is there. > > Thank you for the education. It's more clear to me now why we need this > > lock. > > > > I still have some small questions about when this lock is needed: > > > > I could understand sysfs-triggered online is not acceptable when > > removing devices in multiple subsystems. But maybe concurrent offline > > and remove(with proper per subsystem locks) seems not harmful? > > > > And if we just want to remove some devices in a specific subsystem, e.g. > > like writing /cpu/release, if it just wants to offline and remove some > > cpus, then maybe we don't require the device_hotplug_lock to be taken? > > No and no. > > If the offline phase fails for any device in the subtree, we roll back > the operation > and online devices that have already been offlined by it. Also the ACPI > hot-addition > needs to acquire device_hotplug_lock so that it doesn't race with ejects > and so > that lock needs to be taken by sysfs-triggered offline too. I can understand that hot-addition needs the device_hotplug lock, but still not very clear about the offline. I guess your are describing following scenario: user A: (trying remove cpu 1 and memory 1-10) lock_device_hotplug offline cpu with cpu locks -- successful offline memories with memory locks -- failed, e.g. for memory8 online cpu and memory with their locks unlock_device_hotplug user B: (trying offline cpu 1) offline cpu with cpu locks But I don't see any problem for user B not taking the device_hotplug lock. The result may be different for user B to take or not take the lock. But I think it could be seen as concurrent online/offline for cpu1 under cpu hotplug locks, which just depends on which is executed last? Or did I miss something here? Thanks, Zhong > Thanks, > Rafael > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: > On 4/23/2014 4:23 PM, Tejun Heo wrote: > > Hello, Rafael. > > Hi, > > > On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: > >> Can you please elaborate a bit? > > Because it can get involved in larger locking dependency issues by > > joining dependency graphs of two otherwise largely disjoint > > subsystems. It has potential to create possible deadlocks which don't > > need to exist. > > Well, I do my best not to add unnecessary locks if that's what you mean. > > >> It is there to protect hotplug operations involving multiple devices > >> (in different subsystems) from racing with each other. Why exactly > >> is it bad? > > But why would different subsystems, say cpu and memory, use the same > > lock? Wouldn't those subsystems already have proper locking inside > > their own subsystems? > > That locking is not sufficient. > > > Why add this additional global lock across multiple subsystems? > > That basically is because of how eject works when it is triggered via ACPI. > > It is signaled for a device at the top of a subtree. It may be a > container of some sort and the eject involves everything below that > device in the ACPI namespace. That may involve multiple subsystem > (CPUs, memory, PCI host bridge, etc.). > > We do that in two steps, offline (which can fail) and eject proper > (which can't fail and makes all of the involved devices go away). All > that has to be done in one go with respect to the sysfs-triggered > offline/online and that's why the lock is there. Thank you for the education. It's more clear to me now why we need this lock. I still have some small questions about when this lock is needed: I could understand sysfs-triggered online is not acceptable when removing devices in multiple subsystems. But maybe concurrent offline and remove(with proper per subsystem locks) seems not harmful? And if we just want to remove some devices in a specific subsystem, e.g. like writing /cpu/release, if it just wants to offline and remove some cpus, then maybe we don't require the device_hotplug_lock to be taken? Thanks, Zhong > > Thanks, > Rafael > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote: > On Wed, Apr 23, 2014 at 10:00:26AM +0800, Li Zhong wrote: > > If you remove cpu0, then the cpu0 directory will be removed, together > > with the "online" file in the directory, while some other process might > > be writing 0 or 1 to it. > > > > But here, for the probe/release, take "release" for example, if user > > writes something that stands for cpu0 to it, the cpu0 will be removed, > > and the cpu0 directory and the files under it will be removed. But > > "release" itself is not removed. > > > > They are attributes of cpu_subsys, not of some specific cpus. > > OIC, so the file itself which triggers removal doesn't get removed. > Hmmm... but can't you still fall into deadlock? If on/offline takes > the same lock under active protection which is also taken while > removing the cpu files, it doesn't really matter whether the release > file itself is removed in the process or not. You can still have ABBA > deadlock. What am I missing here? After thinking it harder, I still couldn't see ABBA here ... the active protection taken here is for "probe/release" which will not be waited for removing something like "online" under cpu#? Or my assumption that s_active for different files are different locks are completely wrong? Or I missed something else? Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: > cc'ing Li Zhong who's working on a simliar issue in the following > thread and quoting whole body. > > http://thread.gmane.org/gmane.linux.kernel/1680706 > > Li, this is another variation of the same problem. Maybe this can be > covered by your work too? It seems to me that it is about write something to driver attribute, and driver unloading. If so, maybe it's not easy to reuse the help functions created for device attribute, and device removing. But I guess the idea to break the active protection could still be applied here: Maybe we could try_module_get() here (like the other option suggested by Johan?), and break active protection if we could get the module, something like below? diff --git a/drivers/base/bus.c b/drivers/base/bus.c index 83e910a..6ce27e0 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -69,9 +69,24 @@ static ssize_t drv_attr_store(struct kobject *kobj, struct attribute *attr, struct driver_attribute *drv_attr = to_drv_attr(attr); struct driver_private *drv_priv = to_driver(kobj); ssize_t ret = -EIO; + struct kernfs_node *kn; + + if (!try_module_get(drv_priv->driver->owner)) + return ret; + + kn = kernfs_find_and_get(kobj->sd, attr->name); + if (WARN_ON_ONCE(!kn)) + return ret; + + kernfs_break_active_protection(kn); if (drv_attr->store) ret = drv_attr->store(drv_priv->driver, buf, count); + + kernfs_unbreak_active_protection(kn); + kernfs_put(kn); + + module_put(drv_priv->driver->owner); return ret; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: cc'ing Li Zhong who's working on a simliar issue in the following thread and quoting whole body. http://thread.gmane.org/gmane.linux.kernel/1680706 Li, this is another variation of the same problem. Maybe this can be covered by your work too? It seems to me that it is about write something to driver attribute, and driver unloading. If so, maybe it's not easy to reuse the help functions created for device attribute, and device removing. But I guess the idea to break the active protection could still be applied here: Maybe we could try_module_get() here (like the other option suggested by Johan?), and break active protection if we could get the module, something like below? diff --git a/drivers/base/bus.c b/drivers/base/bus.c index 83e910a..6ce27e0 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -69,9 +69,24 @@ static ssize_t drv_attr_store(struct kobject *kobj, struct attribute *attr, struct driver_attribute *drv_attr = to_drv_attr(attr); struct driver_private *drv_priv = to_driver(kobj); ssize_t ret = -EIO; + struct kernfs_node *kn; + + if (!try_module_get(drv_priv-driver-owner)) + return ret; + + kn = kernfs_find_and_get(kobj-sd, attr-name); + if (WARN_ON_ONCE(!kn)) + return ret; + + kernfs_break_active_protection(kn); if (drv_attr-store) ret = drv_attr-store(drv_priv-driver, buf, count); + + kernfs_unbreak_active_protection(kn); + kernfs_put(kn); + + module_put(drv_priv-driver-owner); return ret; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote: On Wed, Apr 23, 2014 at 10:00:26AM +0800, Li Zhong wrote: If you remove cpu0, then the cpu0 directory will be removed, together with the online file in the directory, while some other process might be writing 0 or 1 to it. But here, for the probe/release, take release for example, if user writes something that stands for cpu0 to it, the cpu0 will be removed, and the cpu0 directory and the files under it will be removed. But release itself is not removed. They are attributes of cpu_subsys, not of some specific cpus. OIC, so the file itself which triggers removal doesn't get removed. Hmmm... but can't you still fall into deadlock? If on/offline takes the same lock under active protection which is also taken while removing the cpu files, it doesn't really matter whether the release file itself is removed in the process or not. You can still have ABBA deadlock. What am I missing here? After thinking it harder, I still couldn't see ABBA here ... the active protection taken here is for probe/release which will not be waited for removing something like online under cpu#? Or my assumption that s_active for different files are different locks are completely wrong? Or I missed something else? Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: On 4/23/2014 4:23 PM, Tejun Heo wrote: Hello, Rafael. Hi, On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: Can you please elaborate a bit? Because it can get involved in larger locking dependency issues by joining dependency graphs of two otherwise largely disjoint subsystems. It has potential to create possible deadlocks which don't need to exist. Well, I do my best not to add unnecessary locks if that's what you mean. It is there to protect hotplug operations involving multiple devices (in different subsystems) from racing with each other. Why exactly is it bad? But why would different subsystems, say cpu and memory, use the same lock? Wouldn't those subsystems already have proper locking inside their own subsystems? That locking is not sufficient. Why add this additional global lock across multiple subsystems? That basically is because of how eject works when it is triggered via ACPI. It is signaled for a device at the top of a subtree. It may be a container of some sort and the eject involves everything below that device in the ACPI namespace. That may involve multiple subsystem (CPUs, memory, PCI host bridge, etc.). We do that in two steps, offline (which can fail) and eject proper (which can't fail and makes all of the involved devices go away). All that has to be done in one go with respect to the sysfs-triggered offline/online and that's why the lock is there. Thank you for the education. It's more clear to me now why we need this lock. I still have some small questions about when this lock is needed: I could understand sysfs-triggered online is not acceptable when removing devices in multiple subsystems. But maybe concurrent offline and remove(with proper per subsystem locks) seems not harmful? And if we just want to remove some devices in a specific subsystem, e.g. like writing /cpu/release, if it just wants to offline and remove some cpus, then maybe we don't require the device_hotplug_lock to be taken? Thanks, Zhong Thanks, Rafael -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote: On 4/24/2014 10:59 AM, Li Zhong wrote: On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: On 4/23/2014 4:23 PM, Tejun Heo wrote: Hello, Rafael. Hi, On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: Can you please elaborate a bit? Because it can get involved in larger locking dependency issues by joining dependency graphs of two otherwise largely disjoint subsystems. It has potential to create possible deadlocks which don't need to exist. Well, I do my best not to add unnecessary locks if that's what you mean. It is there to protect hotplug operations involving multiple devices (in different subsystems) from racing with each other. Why exactly is it bad? But why would different subsystems, say cpu and memory, use the same lock? Wouldn't those subsystems already have proper locking inside their own subsystems? That locking is not sufficient. Why add this additional global lock across multiple subsystems? That basically is because of how eject works when it is triggered via ACPI. It is signaled for a device at the top of a subtree. It may be a container of some sort and the eject involves everything below that device in the ACPI namespace. That may involve multiple subsystem (CPUs, memory, PCI host bridge, etc.). We do that in two steps, offline (which can fail) and eject proper (which can't fail and makes all of the involved devices go away). All that has to be done in one go with respect to the sysfs-triggered offline/online and that's why the lock is there. Thank you for the education. It's more clear to me now why we need this lock. I still have some small questions about when this lock is needed: I could understand sysfs-triggered online is not acceptable when removing devices in multiple subsystems. But maybe concurrent offline and remove(with proper per subsystem locks) seems not harmful? And if we just want to remove some devices in a specific subsystem, e.g. like writing /cpu/release, if it just wants to offline and remove some cpus, then maybe we don't require the device_hotplug_lock to be taken? No and no. If the offline phase fails for any device in the subtree, we roll back the operation and online devices that have already been offlined by it. Also the ACPI hot-addition needs to acquire device_hotplug_lock so that it doesn't race with ejects and so that lock needs to be taken by sysfs-triggered offline too. I can understand that hot-addition needs the device_hotplug lock, but still not very clear about the offline. I guess your are describing following scenario: user A: (trying remove cpu 1 and memory 1-10) lock_device_hotplug offline cpu with cpu locks -- successful offline memories with memory locks -- failed, e.g. for memory8 online cpu and memory with their locks unlock_device_hotplug user B: (trying offline cpu 1) offline cpu with cpu locks But I don't see any problem for user B not taking the device_hotplug lock. The result may be different for user B to take or not take the lock. But I think it could be seen as concurrent online/offline for cpu1 under cpu hotplug locks, which just depends on which is executed last? Or did I miss something here? Thanks, Zhong Thanks, Rafael -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Thu, 2014-04-24 at 10:32 -0400, Tejun Heo wrote: Hello, On Thu, Apr 24, 2014 at 04:37:23PM +0800, Li Zhong wrote: On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote: After thinking it harder, I still couldn't see ABBA here ... the active protection taken here is for probe/release which will not be waited for removing something like online under cpu#? Or my assumption that s_active for different files are different locks are completely wrong? Or I missed something else? I'm probably confused about the locking. I was thinking a scenario like the following. A. CPU on/offline grabs s_active protection of online node grabs cpu subsys mutex perform on/offline releases cpu subsys mutex releases s_active protection of online node so the chain here is s_active(online) - cpu_subsys_mutex B. CPU release grabs s_active protection of release node grabs cpu subsys mutex performs removal of the CPU removes the online node releases cpu subsys mutex releases s_active protection of release node and the chain here is s_active(release) - cpu_subsys_mutex - s_active(online) A nests cpu subsys mutex under s_active of the online node. B nests s_active of the online node under the cpu subsys mutex. What am I missing? From the above two chain, I think the problem is cpu_subsys_mutex and s_active(online), which is the deadlock we are trying to solve in patch #2. I seems to me s_active(release) here doesn't have lock issues? Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Thu, 2014-04-24 at 10:35 -0400, Tejun Heo wrote: On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote: On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: cc'ing Li Zhong who's working on a simliar issue in the following thread and quoting whole body. http://thread.gmane.org/gmane.linux.kernel/1680706 Li, this is another variation of the same problem. Maybe this can be covered by your work too? It seems to me that it is about write something to driver attribute, and driver unloading. If so, maybe it's not easy to reuse the help functions created for device attribute, and device removing. But I guess the idea to break the active protection could still be applied here: Maybe we could try_module_get() here (like the other option suggested by Johan?), and break active protection if we could get the module, something like below? I don't get why try_module_get() matters here. We can't call into -store if the object at hand is already destroyed and the underlying module can't go away if the target device is still alive. try_module_get() doesn't actually protect the object. Why does that matter? This is self removal, right? Can you please take a look at kernfs_remove_self()? This is about one process writing something to driver attributes, and one process trying to unload this driver. I think try_module_get() could detect whether the driver is being unloaded, and if not, prevent it from being unloaded, so it could protect the object here by not allow the driver to be unloaded. And if the driver is being unloaded, we could abort the write operation directly. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote: On Thu, Apr 24, 2014 at 10:35:17AM -0400, Tejun Heo wrote: On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote: On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: cc'ing Li Zhong who's working on a simliar issue in the following thread and quoting whole body. http://thread.gmane.org/gmane.linux.kernel/1680706 Li, this is another variation of the same problem. Maybe this can be covered by your work too? It seems to me that it is about write something to driver attribute, and driver unloading. If so, maybe it's not easy to reuse the help functions created for device attribute, and device removing. But I guess the idea to break the active protection could still be applied here: Maybe we could try_module_get() here (like the other option suggested by Johan?), and break active protection if we could get the module, something like below? I don't get why try_module_get() matters here. We can't call into -store if the object at hand is already destroyed and the underlying module can't go away if the target device is still alive. try_module_get() doesn't actually protect the object. Why does that matter? This is self removal, right? Can you please take a look at kernfs_remove_self()? No, this isn't self removal. The driver-attribute (not device-attribute) store operation simply grabs a lock that is also held while the driver is being deregistered at module unload. Taking a reference to the module in this case will prevent deregistration while store is running. But it seems like this can be solved for usb-serial by simply not holding the lock while deregistering. I didn't look carefully about this lock. But I'm not sure whether there are such requirements for driver attributes: some lock needs be grabbed in the driver attributes store callbacks, and the same lock also needs to be grabbed during driver unregister. If we have such requirements currently or in the future, I think they could all be solved by breaking active protection after get the module reference. Thanks, Zhong Johan -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 12:58 +0200, Rafael J. Wysocki wrote: > On Wednesday, April 23, 2014 01:03:42 PM Li Zhong wrote: > > On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote: > > > Hello, > > > > > > On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote: > > > > > Is this assumption true? If so, can we add lockdep assertions in > > > > > places to verify and enforce this? If not, aren't we just feeling > > > > > good when the reality is broken? > > > > > > > > It seems not true ... I think there are devices that don't have the > > > > online/offline concept, we just need to add it, remove it, like ethernet > > > > cards. > > > > > > > > Maybe we could change the comments above, like: > > > > /* We assume device_hotplug_lock must be acquired before > > > > * removing devices, which have online/offline sysfs knob, > > > > * and some locks are needed to serialize the online/offline > > > > * callbacks and device removing. ... > > > > ? > > > > > > > > And we could add lockdep assertions in cpu and memory related code? e.g. > > > > remove_memory(), unregister_cpu() > > > > > > > > Currently, remove_memory() has comments for the function: > > > > > > > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > > > > * and online/offline operations before this call, as required by > > > > * try_offline_node(). > > > > */ > > > > > > > > maybe it could be removed with the lockdep assertion. > > > > > > I'm confused about the overall locking scheme. What's the role of > > > device_hotplug_lock? Is that solely to prevent the sysfs deadlock > > > issue? Or does it serve other synchronization purposes depending on > > > the specific subsystem? If the former, the lock no longer needs to > > > exist. The only thing necessary would be synchronization between > > > device_del() deleting the sysfs file and the unbreak helper invoking > > > device-specific callback. If the latter, we probably should change > > > that. Sharing hotplug lock across multiple subsystems through driver > > > core sounds like a pretty bad idea. > > > > I think it's the latter. > > Actually, no, this is not the case if I understand you correctly. Oh, Sorry, I didn't read carefully. Yes, it's not specific subsystem. After seeing your reply, I understand it is for protecting device hot remove involving multiple subsystems. > > > I think device_{on|off}line is better to be > > done in some sort of lock which prevents the device from being removed, > > including some preparation work that needs be done before device_del(). > > Quite frankly, you should be confident that you understand the code you're > trying to modify or please don't touch it. > > I'll have a deeper look at this issue later today or tomorrow and will get > back to you then. Ok, thank you, Zhong > > Thanks! > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 12:54 +0200, Rafael J. Wysocki wrote: > On Wednesday, April 23, 2014 09:50:32 AM Li Zhong wrote: > > On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote: > > > On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote: > > > > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > > > > > Hello, > > > > > > > > > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > > > > > > > > > Proper /** function comment would be nice. > > > > > > > > Ok, will try to write some in next version. > > > > > > > > > > > > > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > > > > > + struct device_attribute > > > > > > *attr) > > > > > > > > > > I can see why you did this but let's please not require the user of > > > > > this function to see how the thing is working internally. Let's > > > > > return int and keep track of (or look up again) the kernfs_node > > > > > internally. > > > > > > > > Ok, it also makes the prototype of lock and unlock look more consistent > > > > and comfortable. > > > > > > > > > > > > > > > { > > > > > ... > > > > > > + /* > > > > > > +* We assume device_hotplug_lock must be acquired before > > > > > > removing > > > > > > > > > > Is this assumption true? If so, can we add lockdep assertions in > > > > > places to verify and enforce this? If not, aren't we just feeling > > > > > good when the reality is broken? > > > > > > > > It seems not true ... I think there are devices that don't have the > > > > online/offline concept, we just need to add it, remove it, like ethernet > > > > cards. > > > > > > Well, I haven't been following this closely (I was travelling, sorry), but > > > there certainly are devices without online/offline. That currently is > > > only > > > present for CPUs, memory blocks and ACPI containers (if I remember > > > correctly). > > > > > > > > > > > Maybe we could change the comments above, like: > > > > /* We assume device_hotplug_lock must be acquired before > > > > * removing devices, which have online/offline sysfs knob, > > > > * and some locks are needed to serialize the online/offline > > > > * callbacks and device removing. ... > > > > ? > > > > > > Lockdep assertions would be better than this in my opinion. > > > > This is talking about the lock required in the other process, the device > > removing process, e.g. that in remove_memory() below. So I guess no > > lockdep assertions needed here. Or I misunderstand your point? > > I mean if you assume certain lock to be held somewhere, it is better to use > lockdep annotations to express that assumption, because that will cause users > to *see* the problem when it happens. OK, I see, I think you were suggesting the same thing as Tejun, just I misunderstood it. Thanks, Zhong > > Thanks! > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 12:54 +0200, Rafael J. Wysocki wrote: On Wednesday, April 23, 2014 09:50:32 AM Li Zhong wrote: On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote: On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote: On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: Hello, On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: Proper /** function comment would be nice. Ok, will try to write some in next version. +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr) I can see why you did this but let's please not require the user of this function to see how the thing is working internally. Let's return int and keep track of (or look up again) the kernfs_node internally. Ok, it also makes the prototype of lock and unlock look more consistent and comfortable. { ... + /* +* We assume device_hotplug_lock must be acquired before removing Is this assumption true? If so, can we add lockdep assertions in places to verify and enforce this? If not, aren't we just feeling good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Well, I haven't been following this closely (I was travelling, sorry), but there certainly are devices without online/offline. That currently is only present for CPUs, memory blocks and ACPI containers (if I remember correctly). Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? Lockdep assertions would be better than this in my opinion. This is talking about the lock required in the other process, the device removing process, e.g. that in remove_memory() below. So I guess no lockdep assertions needed here. Or I misunderstand your point? I mean if you assume certain lock to be held somewhere, it is better to use lockdep annotations to express that assumption, because that will cause users to *see* the problem when it happens. OK, I see, I think you were suggesting the same thing as Tejun, just I misunderstood it. Thanks, Zhong Thanks! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 12:58 +0200, Rafael J. Wysocki wrote: On Wednesday, April 23, 2014 01:03:42 PM Li Zhong wrote: On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote: Hello, On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote: Is this assumption true? If so, can we add lockdep assertions in places to verify and enforce this? If not, aren't we just feeling good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? And we could add lockdep assertions in cpu and memory related code? e.g. remove_memory(), unregister_cpu() Currently, remove_memory() has comments for the function: * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ maybe it could be removed with the lockdep assertion. I'm confused about the overall locking scheme. What's the role of device_hotplug_lock? Is that solely to prevent the sysfs deadlock issue? Or does it serve other synchronization purposes depending on the specific subsystem? If the former, the lock no longer needs to exist. The only thing necessary would be synchronization between device_del() deleting the sysfs file and the unbreak helper invoking device-specific callback. If the latter, we probably should change that. Sharing hotplug lock across multiple subsystems through driver core sounds like a pretty bad idea. I think it's the latter. Actually, no, this is not the case if I understand you correctly. Oh, Sorry, I didn't read carefully. Yes, it's not specific subsystem. After seeing your reply, I understand it is for protecting device hot remove involving multiple subsystems. I think device_{on|off}line is better to be done in some sort of lock which prevents the device from being removed, including some preparation work that needs be done before device_del(). Quite frankly, you should be confident that you understand the code you're trying to modify or please don't touch it. I'll have a deeper look at this issue later today or tomorrow and will get back to you then. Ok, thank you, Zhong Thanks! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote: > Hello, > > On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote: > > > Is this assumption true? If so, can we add lockdep assertions in > > > places to verify and enforce this? If not, aren't we just feeling > > > good when the reality is broken? > > > > It seems not true ... I think there are devices that don't have the > > online/offline concept, we just need to add it, remove it, like ethernet > > cards. > > > > Maybe we could change the comments above, like: > > /* We assume device_hotplug_lock must be acquired before > > * removing devices, which have online/offline sysfs knob, > > * and some locks are needed to serialize the online/offline > > * callbacks and device removing. ... > > ? > > > > And we could add lockdep assertions in cpu and memory related code? e.g. > > remove_memory(), unregister_cpu() > > > > Currently, remove_memory() has comments for the function: > > > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > > * and online/offline operations before this call, as required by > > * try_offline_node(). > > */ > > > > maybe it could be removed with the lockdep assertion. > > I'm confused about the overall locking scheme. What's the role of > device_hotplug_lock? Is that solely to prevent the sysfs deadlock > issue? Or does it serve other synchronization purposes depending on > the specific subsystem? If the former, the lock no longer needs to > exist. The only thing necessary would be synchronization between > device_del() deleting the sysfs file and the unbreak helper invoking > device-specific callback. If the latter, we probably should change > that. Sharing hotplug lock across multiple subsystems through driver > core sounds like a pretty bad idea. I think it's the latter. I think device_{on|off}line is better to be done in some sort of lock which prevents the device from being removed, including some preparation work that needs be done before device_del(). Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Tue, 2014-04-22 at 16:40 -0400, Tejun Heo wrote: > Hello, > > On Tue, Apr 22, 2014 at 10:29:37AM +0800, Li Zhong wrote: > > The probe/release files are attribute files for cpu subsys, looks like > > following in sysfs dirs > > > > $ cd /sys/devices/system/cpu/ > > $ ls -l > > total 0 > > drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0 > > drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1 > > drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10 > > .. > > drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq > > drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle > > -rw---. 1 root root 65536 Apr 21 00:28 dscr_default > > -r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max > > -r--r--r--. 1 root root 65536 Apr 21 00:28 offline > > -r--r--r--. 1 root root 65536 Sep 4 2014 online > > -r--r--r--. 1 root root 65536 Apr 21 00:28 possible > > drwxr-xr-x. 2 root root 0 Apr 20 08:00 power > > -r--r--r--. 1 root root 65536 Apr 17 20:46 present > > --w---. 1 root root 65536 Apr 21 00:28 probe<- > > --w---. 1 root root 65536 Apr 21 00:28 release <- > > -rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core > > -rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent > > > > From the code, it seems cpu subsys won't be unregistered, and it doesn't > > make sense to remove all the cpus in the system. > > I don't think I'm following you. Are you saying that no files which > are protected under the hotplug lock that cpu subsys uses are removed > during offlining? Sorry, Maybe I didn't say it clearly. There are files under cpu#, e.g. $ cd /sys/devices/system/cpu/cpu0 $ ls cache dscr node1pmc1 pmc5 smt_snooze_delay uevent cpuidle mmcr0 online pmc2 pmc6 spurr crash_notes mmcr1 physical_id pmc3 power subsystem crash_notes_size mmcra pir pmc4 purr topology If you remove cpu0, then the cpu0 directory will be removed, together with the "online" file in the directory, while some other process might be writing 0 or 1 to it. But here, for the probe/release, take "release" for example, if user writes something that stands for cpu0 to it, the cpu0 will be removed, and the cpu0 directory and the files under it will be removed. But "release" itself is not removed. They are attributes of cpu_subsys, not of some specific cpus. Hopes the above makes things a bit clearer. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote: > On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote: > > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > > > Hello, > > > > > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > > > > > Proper /** function comment would be nice. > > > > Ok, will try to write some in next version. > > > > > > > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > > > + struct device_attribute > > > > *attr) > > > > > > I can see why you did this but let's please not require the user of > > > this function to see how the thing is working internally. Let's > > > return int and keep track of (or look up again) the kernfs_node > > > internally. > > > > Ok, it also makes the prototype of lock and unlock look more consistent > > and comfortable. > > > > > > > > > { > > > ... > > > > + /* > > > > +* We assume device_hotplug_lock must be acquired before > > > > removing > > > > > > Is this assumption true? If so, can we add lockdep assertions in > > > places to verify and enforce this? If not, aren't we just feeling > > > good when the reality is broken? > > > > It seems not true ... I think there are devices that don't have the > > online/offline concept, we just need to add it, remove it, like ethernet > > cards. > > Well, I haven't been following this closely (I was travelling, sorry), but > there certainly are devices without online/offline. That currently is only > present for CPUs, memory blocks and ACPI containers (if I remember correctly). > > > > > Maybe we could change the comments above, like: > > /* We assume device_hotplug_lock must be acquired before > > * removing devices, which have online/offline sysfs knob, > > * and some locks are needed to serialize the online/offline > > * callbacks and device removing. ... > > ? > > Lockdep assertions would be better than this in my opinion. This is talking about the lock required in the other process, the device removing process, e.g. that in remove_memory() below. So I guess no lockdep assertions needed here. Or I misunderstand your point? > > > > And we could add lockdep assertions in cpu and memory related code? e.g. > > remove_memory(), unregister_cpu() > > > > Currently, remove_memory() has comments for the function: > > > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > > * and online/offline operations before this call, as required by > > * try_offline_node(). > > */ > > > > maybe it could be removed with the lockdep assertion. > > No, please don't remove it. It is there to explain where the locking > requirement > comes from. OK, I see. I think I'll just add lockdep assertions, and keep the comments there. Thanks, Zhong > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote: On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote: On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: Hello, On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: Proper /** function comment would be nice. Ok, will try to write some in next version. +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr) I can see why you did this but let's please not require the user of this function to see how the thing is working internally. Let's return int and keep track of (or look up again) the kernfs_node internally. Ok, it also makes the prototype of lock and unlock look more consistent and comfortable. { ... + /* +* We assume device_hotplug_lock must be acquired before removing Is this assumption true? If so, can we add lockdep assertions in places to verify and enforce this? If not, aren't we just feeling good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Well, I haven't been following this closely (I was travelling, sorry), but there certainly are devices without online/offline. That currently is only present for CPUs, memory blocks and ACPI containers (if I remember correctly). Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? Lockdep assertions would be better than this in my opinion. This is talking about the lock required in the other process, the device removing process, e.g. that in remove_memory() below. So I guess no lockdep assertions needed here. Or I misunderstand your point? And we could add lockdep assertions in cpu and memory related code? e.g. remove_memory(), unregister_cpu() Currently, remove_memory() has comments for the function: * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ maybe it could be removed with the lockdep assertion. No, please don't remove it. It is there to explain where the locking requirement comes from. OK, I see. I think I'll just add lockdep assertions, and keep the comments there. Thanks, Zhong -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Tue, 2014-04-22 at 16:40 -0400, Tejun Heo wrote: Hello, On Tue, Apr 22, 2014 at 10:29:37AM +0800, Li Zhong wrote: The probe/release files are attribute files for cpu subsys, looks like following in sysfs dirs $ cd /sys/devices/system/cpu/ $ ls -l total 0 drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10 .. drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle -rw---. 1 root root 65536 Apr 21 00:28 dscr_default -r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max -r--r--r--. 1 root root 65536 Apr 21 00:28 offline -r--r--r--. 1 root root 65536 Sep 4 2014 online -r--r--r--. 1 root root 65536 Apr 21 00:28 possible drwxr-xr-x. 2 root root 0 Apr 20 08:00 power -r--r--r--. 1 root root 65536 Apr 17 20:46 present --w---. 1 root root 65536 Apr 21 00:28 probe- --w---. 1 root root 65536 Apr 21 00:28 release - -rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core -rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent From the code, it seems cpu subsys won't be unregistered, and it doesn't make sense to remove all the cpus in the system. I don't think I'm following you. Are you saying that no files which are protected under the hotplug lock that cpu subsys uses are removed during offlining? Sorry, Maybe I didn't say it clearly. There are files under cpu#, e.g. $ cd /sys/devices/system/cpu/cpu0 $ ls cache dscr node1pmc1 pmc5 smt_snooze_delay uevent cpuidle mmcr0 online pmc2 pmc6 spurr crash_notes mmcr1 physical_id pmc3 power subsystem crash_notes_size mmcra pir pmc4 purr topology If you remove cpu0, then the cpu0 directory will be removed, together with the online file in the directory, while some other process might be writing 0 or 1 to it. But here, for the probe/release, take release for example, if user writes something that stands for cpu0 to it, the cpu0 will be removed, and the cpu0 directory and the files under it will be removed. But release itself is not removed. They are attributes of cpu_subsys, not of some specific cpus. Hopes the above makes things a bit clearer. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote: Hello, On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote: Is this assumption true? If so, can we add lockdep assertions in places to verify and enforce this? If not, aren't we just feeling good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? And we could add lockdep assertions in cpu and memory related code? e.g. remove_memory(), unregister_cpu() Currently, remove_memory() has comments for the function: * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ maybe it could be removed with the lockdep assertion. I'm confused about the overall locking scheme. What's the role of device_hotplug_lock? Is that solely to prevent the sysfs deadlock issue? Or does it serve other synchronization purposes depending on the specific subsystem? If the former, the lock no longer needs to exist. The only thing necessary would be synchronization between device_del() deleting the sysfs file and the unbreak helper invoking device-specific callback. If the latter, we probably should change that. Sharing hotplug lock across multiple subsystems through driver core sounds like a pretty bad idea. I think it's the latter. I think device_{on|off}line is better to be done in some sort of lock which prevents the device from being removed, including some preparation work that needs be done before device_del(). Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > Hello, > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > Proper /** function comment would be nice. Ok, will try to write some in next version. > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > + struct device_attribute *attr) > > I can see why you did this but let's please not require the user of > this function to see how the thing is working internally. Let's > return int and keep track of (or look up again) the kernfs_node > internally. Ok, it also makes the prototype of lock and unlock look more consistent and comfortable. > > > { > ... > > + /* > > +* We assume device_hotplug_lock must be acquired before removing > > Is this assumption true? If so, can we add lockdep assertions in > places to verify and enforce this? If not, aren't we just feeling > good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? And we could add lockdep assertions in cpu and memory related code? e.g. remove_memory(), unregister_cpu() Currently, remove_memory() has comments for the function: * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ maybe it could be removed with the lockdep assertion. > ... > > Function comment please. OK. Thanks, Zhong > > +void unlock_device_hotplug_sysfs(struct device *dev, > > +struct kernfs_node *kn) > > +{ > > + unlock_device_hotplug(); > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + kernfs_put(kn); > > } > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Mon, 2014-04-21 at 18:38 -0400, Tejun Heo wrote: > Hello, > > On Mon, Apr 21, 2014 at 05:20:59PM +0800, Li Zhong wrote: > > While auditing the usage of lock_device_hotplug_sysfs() for implementing > > it in another way in following patch, it seems to me that the code here > > is to add/remove device, and the files probe/release for cpu bus > > themselves won't be removed. > > > > So it seems to me there is no s_active related deadlock here, and we > > could just use lock_device_hotplug(). > > It may still cause issue if offlining ends up removing sysfs files or > gets involved with the same lock used during cpu hot[un]plug > operations (e.g. offlining racing against cpu hotunplug) and offlining > a cpu does remove files. Has this change been tested? The probe/release files are attribute files for cpu subsys, looks like following in sysfs dirs $ cd /sys/devices/system/cpu/ $ ls -l total 0 drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10 .. drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle -rw---. 1 root root 65536 Apr 21 00:28 dscr_default -r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max -r--r--r--. 1 root root 65536 Apr 21 00:28 offline -r--r--r--. 1 root root 65536 Sep 4 2014 online -r--r--r--. 1 root root 65536 Apr 21 00:28 possible drwxr-xr-x. 2 root root 0 Apr 20 08:00 power -r--r--r--. 1 root root 65536 Apr 17 20:46 present --w---. 1 root root 65536 Apr 21 00:28 probe<- --w---. 1 root root 65536 Apr 21 00:28 release <- -rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core -rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent >From the code, it seems cpu subsys won't be unregistered, and it doesn't make sense to remove all the cpus in the system. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The active protection is there to keep the file alive by blocking deletion while operations are on-going in the file. This blocking creates a dependency loop when an operation running off a sysfs knob ends up grabbing a lock which may be held while removing the said sysfs knob. So the problem here is the order of s_active, and the series of hotplug related locks. commit 5e33bc41 solves it by taking out the first of the series of hoplug related locks, device_hotplug_lock, with a try lock. And if that try lock fails, it returns a restart syscall error, and drops s_active temporarily to allow the other process to remove the sysfs knob. This doesn't help with lockdep warnings reported against s_active and other hotplug related locks in the series. This patch instead tries to take s_active out of the lock dependency graph using the active protection breaking mechanism. lock_device_hotplug_sysfs() function name is kept here, two more arguments are added, dev, attr, to help find the kernfs_node corresponding to the sysfs knob (which should always be able to be found, WARN if not). The found kernfs_node is recorded as the return value, to be released in unlock_device_hotplug_sysfs(). As we break the active protection here, the device removing process might remove the sysfs entries after that point. So after we grab the device_hotplug_lock, we need check whether that really happens. If so, we should abort and not invoke the online/offline callbacks anymore. In this case, NULL is returned to the caller, so it could return with ENODEV. Signed-off-by: Li Zhong --- drivers/base/core.c| 63 ++ drivers/base/memory.c | 9 include/linux/device.h | 5 +++- 3 files changed, 62 insertions(+), 15 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..4d37a2b 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -61,14 +61,55 @@ void unlock_device_hotplug(void) mutex_unlock(_hotplug_lock); } -int lock_device_hotplug_sysfs(void) +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr) { - if (mutex_trylock(_hotplug_lock)) - return 0; + struct kernfs_node *kn; + + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + + if (WARN_ON_ONCE(!kn)) + return NULL; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* "online" attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. +*/ + + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* We assume device_hotplug_lock must be acquired before removing +* device, we can check here whether the device has been removed, so +* we don't call device_{on|off}line against removed device. +*/ - /* Avoid busy looping (5 ms of sleep should do). */ - msleep(5); - return restart_syscall(); + if (!dev->kobj.sd) { + /* device_del() already called on @dev, we should also abort */ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); + return NULL; + } + + return kn; +} + +void unlock_device_hotplug_sysfs(struct device *dev, +struct kernfs_node *kn) +{ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); } #ifdef CONFIG_BLOCK @@ -439,17 +480,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, ); if (ret < 0) return ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = lock_device_hotplug_sysfs(dev, attr); + if (!kn) + return -ENODEV; ret = val ? device_online(dev) : device_offline(dev); - unlock_device_hotplug(); + unlock_device_hotplug_sysfs(dev, kn); + return ret < 0 ? ret : count; } static DEVICE_ATTR_RW(online); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..c2b66d4 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -320,10 +320,11 @@ store_mem_state(struct device *dev, { struct memory_block *mem = to_memory_bl
[RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
While auditing the usage of lock_device_hotplug_sysfs() for implementing it in another way in following patch, it seems to me that the code here is to add/remove device, and the files probe/release for cpu bus themselves won't be removed. So it seems to me there is no s_active related deadlock here, and we could just use lock_device_hotplug(). Signed-off-by: Li Zhong --- drivers/base/cpu.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 006b1bc..9483225 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_probe(buf, count); @@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_release(buf, count); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
While auditing the usage of lock_device_hotplug_sysfs() for implementing it in another way in following patch, it seems to me that the code here is to add/remove device, and the files probe/release for cpu bus themselves won't be removed. So it seems to me there is no s_active related deadlock here, and we could just use lock_device_hotplug(). Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/cpu.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 006b1bc..9483225 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_probe(buf, count); @@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_release(buf, count); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The active protection is there to keep the file alive by blocking deletion while operations are on-going in the file. This blocking creates a dependency loop when an operation running off a sysfs knob ends up grabbing a lock which may be held while removing the said sysfs knob. So the problem here is the order of s_active, and the series of hotplug related locks. commit 5e33bc41 solves it by taking out the first of the series of hoplug related locks, device_hotplug_lock, with a try lock. And if that try lock fails, it returns a restart syscall error, and drops s_active temporarily to allow the other process to remove the sysfs knob. This doesn't help with lockdep warnings reported against s_active and other hotplug related locks in the series. This patch instead tries to take s_active out of the lock dependency graph using the active protection breaking mechanism. lock_device_hotplug_sysfs() function name is kept here, two more arguments are added, dev, attr, to help find the kernfs_node corresponding to the sysfs knob (which should always be able to be found, WARN if not). The found kernfs_node is recorded as the return value, to be released in unlock_device_hotplug_sysfs(). As we break the active protection here, the device removing process might remove the sysfs entries after that point. So after we grab the device_hotplug_lock, we need check whether that really happens. If so, we should abort and not invoke the online/offline callbacks anymore. In this case, NULL is returned to the caller, so it could return with ENODEV. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/core.c| 63 ++ drivers/base/memory.c | 9 include/linux/device.h | 5 +++- 3 files changed, 62 insertions(+), 15 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..4d37a2b 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -61,14 +61,55 @@ void unlock_device_hotplug(void) mutex_unlock(device_hotplug_lock); } -int lock_device_hotplug_sysfs(void) +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr) { - if (mutex_trylock(device_hotplug_lock)) - return 0; + struct kernfs_node *kn; + + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + + if (WARN_ON_ONCE(!kn)) + return NULL; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* online attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. +*/ + + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* We assume device_hotplug_lock must be acquired before removing +* device, we can check here whether the device has been removed, so +* we don't call device_{on|off}line against removed device. +*/ - /* Avoid busy looping (5 ms of sleep should do). */ - msleep(5); - return restart_syscall(); + if (!dev-kobj.sd) { + /* device_del() already called on @dev, we should also abort */ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); + return NULL; + } + + return kn; +} + +void unlock_device_hotplug_sysfs(struct device *dev, +struct kernfs_node *kn) +{ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); } #ifdef CONFIG_BLOCK @@ -439,17 +480,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, val); if (ret 0) return ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = lock_device_hotplug_sysfs(dev, attr); + if (!kn) + return -ENODEV; ret = val ? device_online(dev) : device_offline(dev); - unlock_device_hotplug(); + unlock_device_hotplug_sysfs(dev, kn); + return ret 0 ? ret : count; } static DEVICE_ATTR_RW(online); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..c2b66d4 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -320,10 +320,11 @@ store_mem_state(struct device *dev, { struct memory_block *mem
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Mon, 2014-04-21 at 18:38 -0400, Tejun Heo wrote: Hello, On Mon, Apr 21, 2014 at 05:20:59PM +0800, Li Zhong wrote: While auditing the usage of lock_device_hotplug_sysfs() for implementing it in another way in following patch, it seems to me that the code here is to add/remove device, and the files probe/release for cpu bus themselves won't be removed. So it seems to me there is no s_active related deadlock here, and we could just use lock_device_hotplug(). It may still cause issue if offlining ends up removing sysfs files or gets involved with the same lock used during cpu hot[un]plug operations (e.g. offlining racing against cpu hotunplug) and offlining a cpu does remove files. Has this change been tested? The probe/release files are attribute files for cpu subsys, looks like following in sysfs dirs $ cd /sys/devices/system/cpu/ $ ls -l total 0 drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10 .. drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle -rw---. 1 root root 65536 Apr 21 00:28 dscr_default -r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max -r--r--r--. 1 root root 65536 Apr 21 00:28 offline -r--r--r--. 1 root root 65536 Sep 4 2014 online -r--r--r--. 1 root root 65536 Apr 21 00:28 possible drwxr-xr-x. 2 root root 0 Apr 20 08:00 power -r--r--r--. 1 root root 65536 Apr 17 20:46 present --w---. 1 root root 65536 Apr 21 00:28 probe- --w---. 1 root root 65536 Apr 21 00:28 release - -rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core -rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent From the code, it seems cpu subsys won't be unregistered, and it doesn't make sense to remove all the cpus in the system. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: Hello, On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: Proper /** function comment would be nice. Ok, will try to write some in next version. +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr) I can see why you did this but let's please not require the user of this function to see how the thing is working internally. Let's return int and keep track of (or look up again) the kernfs_node internally. Ok, it also makes the prototype of lock and unlock look more consistent and comfortable. { ... + /* +* We assume device_hotplug_lock must be acquired before removing Is this assumption true? If so, can we add lockdep assertions in places to verify and enforce this? If not, aren't we just feeling good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? And we could add lockdep assertions in cpu and memory related code? e.g. remove_memory(), unregister_cpu() Currently, remove_memory() has comments for the function: * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ maybe it could be removed with the lockdep assertion. ... Function comment please. OK. Thanks, Zhong +void unlock_device_hotplug_sysfs(struct device *dev, +struct kernfs_node *kn) +{ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); } Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks
On Thu, 2014-04-17 at 11:17 -0400, Tejun Heo wrote: > Hello, > > On Thu, Apr 17, 2014 at 02:50:44PM +0800, Li Zhong wrote: > > This patch tries to solve the device hot remove locking issues in a > > different way from commit 5e33bc41, as kernfs already has a mechanism > > to break active protection. > > > > The problem here is the order of s_active, and series of hotplug related > > lock. > > It prolly deservse more detailed explanation of the deadlock along > with how 5e33bc41 ("$SUBJ") tried to solve it. The active protetion > is there to keep the file alive by blocking deletion while operations > are on-going in the file. This blocking creates a dependency loop > when an operation running off a sysfs knob ends up grabbing a lock > which may be held while removing the said sysfs knob. OK, I'll try to add these and something more in next version. > > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > > + if (WARN_ON_ONCE(!kn)) > > + return -ENODEV; > > + > > + /* > > +* Break active protection here to avoid deadlocks with device > > +* removing process, which tries to remove sysfs entries including this > > +* "online" attribute while holding some hotplug related locks. > > +* > > +* @dev needs to be protected here, or it could go away any time after > > +* dropping active protection. But it is still unreasonable/unsafe to > > +* online/offline a device after it being removed. Fortunately, there > > I think this is something driver layer proper should provide > synchronization for. It shouldn't be difficult to synchronize this > function against device_del(), right? And, please note that @dev is > guaranteed to have not been removed (at least hasn't gone through attr > removal) upto this point. Ok, I think what we need here is the check below, after getting device_hotplug_lock, and abort this function if device already removed. We should allow device_del() to remove the device in the other process, which is why we are breaking the active protection. > > > +* are some checks in online/offline knobs. Like cpu, it checks cpu > > +* present/online mask before doing the real work. > > +*/ > > + > > + get_device(dev); > > + kernfs_break_active_protection(kn); > > + > > + lock_device_hotplug(); > > + > > + /* > > +* If we assume device_hotplug_lock must be acquired before removing > > +* device, we may try to find a way to check whether the device has > > +* been removed here, so we don't call device_{on|off}line against > > +* removed device. > > +*/ > > Yeah, let's please fix this. OK, I guess we can check whether dev->kobj.sd becomes NULL. If so, it means the device has already been deleted by device_del(). > > > ret = val ? device_online(dev) : device_offline(dev); > > unlock_device_hotplug(); > > + > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + > > + kernfs_put(kn); > > + > > return ret < 0 ? ret : count; > > } > > static DEVICE_ATTR_RW(online); > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > > index bece691..0d2f3a5 100644 > > --- a/drivers/base/memory.c > > +++ b/drivers/base/memory.c > > @@ -320,10 +320,17 @@ store_mem_state(struct device *dev, > > { > > struct memory_block *mem = to_memory_block(dev); > > int ret, online_type; > > + struct kernfs_node *kn; > > > > - ret = lock_device_hotplug_sysfs(); > > - if (ret) > > - return ret; > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > > + if (WARN_ON_ONCE(!kn)) > > + return -ENODEV; > > + > > + /* refer to comments in online_store() for more information */ > > + get_device(dev); > > + kernfs_break_active_protection(kn); > > + > > + lock_device_hotplug(); > > > > if (!strncmp(buf, "online_kernel", min_t(int, count, 13))) > > online_type = ONLINE_KERNEL; > > @@ -362,6 +369,11 @@ store_mem_state(struct device *dev, > > err: > > unlock_device_hotplug(); > > > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + > > + kernfs_put(kn); > > There are other users of lock_device_hotplug_sysfs(). We probably > want to audit them and convert them too, preferably with helper > routines so that they don't end up duplicating the complexity? I see, I guess I could keep lock_device_hotplug_sysfs(), just replace it with the implementation here; and provide a new unlock_device_hotplug_sysfs(), which will do the unlock, unbreak, and puts. I'll try to get the code ready sometime next week for your review. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks
On Thu, 2014-04-17 at 11:17 -0400, Tejun Heo wrote: Hello, On Thu, Apr 17, 2014 at 02:50:44PM +0800, Li Zhong wrote: This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The problem here is the order of s_active, and series of hotplug related lock. It prolly deservse more detailed explanation of the deadlock along with how 5e33bc41 ($SUBJ) tried to solve it. The active protetion is there to keep the file alive by blocking deletion while operations are on-going in the file. This blocking creates a dependency loop when an operation running off a sysfs knob ends up grabbing a lock which may be held while removing the said sysfs knob. OK, I'll try to add these and something more in next version. + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* online attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. But it is still unreasonable/unsafe to +* online/offline a device after it being removed. Fortunately, there I think this is something driver layer proper should provide synchronization for. It shouldn't be difficult to synchronize this function against device_del(), right? And, please note that @dev is guaranteed to have not been removed (at least hasn't gone through attr removal) upto this point. Ok, I think what we need here is the check below, after getting device_hotplug_lock, and abort this function if device already removed. We should allow device_del() to remove the device in the other process, which is why we are breaking the active protection. +* are some checks in online/offline knobs. Like cpu, it checks cpu +* present/online mask before doing the real work. +*/ + + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* If we assume device_hotplug_lock must be acquired before removing +* device, we may try to find a way to check whether the device has +* been removed here, so we don't call device_{on|off}line against +* removed device. +*/ Yeah, let's please fix this. OK, I guess we can check whether dev-kobj.sd becomes NULL. If so, it means the device has already been deleted by device_del(). ret = val ? device_online(dev) : device_offline(dev); unlock_device_hotplug(); + + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + return ret 0 ? ret : count; } static DEVICE_ATTR_RW(online); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..0d2f3a5 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -320,10 +320,17 @@ store_mem_state(struct device *dev, { struct memory_block *mem = to_memory_block(dev); int ret, online_type; + struct kernfs_node *kn; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* refer to comments in online_store() for more information */ + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); if (!strncmp(buf, online_kernel, min_t(int, count, 13))) online_type = ONLINE_KERNEL; @@ -362,6 +369,11 @@ store_mem_state(struct device *dev, err: unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); There are other users of lock_device_hotplug_sysfs(). We probably want to audit them and convert them too, preferably with helper routines so that they don't end up duplicating the complexity? I see, I guess I could keep lock_device_hotplug_sysfs(), just replace it with the implementation here; and provide a new unlock_device_hotplug_sysfs(), which will do the unlock, unbreak, and puts. I'll try to get the code ready sometime next week for your review. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The problem here is the order of s_active, and series of hotplug related lock. This patch takes s_active out of the lock dependency graph, so it won't dead lock with any hotplug realted locks. Signed-off-by: Li Zhong --- drivers/base/core.c | 37 ++--- drivers/base/memory.c | 18 +++--- 2 files changed, 49 insertions(+), 6 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..1b96cb9 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,17 +439,48 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, ); if (ret < 0) return ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* "online" attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. But it is still unreasonable/unsafe to +* online/offline a device after it being removed. Fortunately, there +* are some checks in online/offline knobs. Like cpu, it checks cpu +* present/online mask before doing the real work. +*/ + + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* If we assume device_hotplug_lock must be acquired before removing +* device, we may try to find a way to check whether the device has +* been removed here, so we don't call device_{on|off}line against +* removed device. +*/ ret = val ? device_online(dev) : device_offline(dev); unlock_device_hotplug(); + + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + return ret < 0 ? ret : count; } static DEVICE_ATTR_RW(online); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..0d2f3a5 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -320,10 +320,17 @@ store_mem_state(struct device *dev, { struct memory_block *mem = to_memory_block(dev); int ret, online_type; + struct kernfs_node *kn; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* refer to comments in online_store() for more information */ + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); if (!strncmp(buf, "online_kernel", min_t(int, count, 13))) online_type = ONLINE_KERNEL; @@ -362,6 +369,11 @@ store_mem_state(struct device *dev, err: unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + if (ret) return ret; return count; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The problem here is the order of s_active, and series of hotplug related lock. This patch takes s_active out of the lock dependency graph, so it won't dead lock with any hotplug realted locks. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/core.c | 37 ++--- drivers/base/memory.c | 18 +++--- 2 files changed, 49 insertions(+), 6 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..1b96cb9 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,17 +439,48 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, val); if (ret 0) return ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* online attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. But it is still unreasonable/unsafe to +* online/offline a device after it being removed. Fortunately, there +* are some checks in online/offline knobs. Like cpu, it checks cpu +* present/online mask before doing the real work. +*/ + + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* If we assume device_hotplug_lock must be acquired before removing +* device, we may try to find a way to check whether the device has +* been removed here, so we don't call device_{on|off}line against +* removed device. +*/ ret = val ? device_online(dev) : device_offline(dev); unlock_device_hotplug(); + + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + return ret 0 ? ret : count; } static DEVICE_ATTR_RW(online); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..0d2f3a5 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -320,10 +320,17 @@ store_mem_state(struct device *dev, { struct memory_block *mem = to_memory_block(dev); int ret, online_type; + struct kernfs_node *kn; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* refer to comments in online_store() for more information */ + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); if (!strncmp(buf, online_kernel, min_t(int, count, 13))) online_type = ONLINE_KERNEL; @@ -362,6 +369,11 @@ store_mem_state(struct device *dev, err: unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + if (ret) return ret; return count; -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote: > Hello, > > On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote: > > > If so, that is > > > an actually possible deadlock, no? > > > > Yes, but it seems to me that it is solved in commit 5e33bc41, which uses > > lock_device_hotplug_sysfs() to return a restart syscall error if not > > able to try lock the device_hotplug_lock. That also requires the device > > removing code path to take the device_hotplug_lock. > > But that patch only takes out device_hotplug_lock out of the > dependency graph and does nothing for cpu_add_remove_lock. It seems > to be that there still is a deadlock condition involving s_active and > cpu_add_remove_lock. Am I missing something here? It seems to me cpu_add_remove_lock is always taken after device_hotplug_lock. So if cpu_add_remove_lock has been acquired by device removing process, then it means the other online/offline process couldn't successfully try lock device_hotplug_lock, and will release s_active with a restart syscall error; if cpu_add_remove_lock has been acquired by online/offline process, then it should already hold device_hotlug_lock, and keeps the device removing process waiting at device_hotplug_lock. So online/offline process could release the lock, and finally release s_active soon. But after some further thinking, I seem to understand your point. s_active has lock order problem with the other series of hotplug related locks, so it's better to take s_active out of the dependency chain, rather than the first of the other series of locks? like you suggested below. > > Now that kernfs has a proper mechanism to deal with it, wouldn't it > make more sense to replace 5e33bc41 with prper s_active protection > breaking? I'll try this way and send you the code for review. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote: Hello, On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote: If so, that is an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. But that patch only takes out device_hotplug_lock out of the dependency graph and does nothing for cpu_add_remove_lock. It seems to be that there still is a deadlock condition involving s_active and cpu_add_remove_lock. Am I missing something here? It seems to me cpu_add_remove_lock is always taken after device_hotplug_lock. So if cpu_add_remove_lock has been acquired by device removing process, then it means the other online/offline process couldn't successfully try lock device_hotplug_lock, and will release s_active with a restart syscall error; if cpu_add_remove_lock has been acquired by online/offline process, then it should already hold device_hotlug_lock, and keeps the device removing process waiting at device_hotplug_lock. So online/offline process could release the lock, and finally release s_active soon. But after some further thinking, I seem to understand your point. s_active has lock order problem with the other series of hotplug related locks, so it's better to take s_active out of the dependency chain, rather than the first of the other series of locks? like you suggested below. Now that kernfs has a proper mechanism to deal with it, wouldn't it make more sense to replace 5e33bc41 with prper s_active protection breaking? I'll try this way and send you the code for review. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong Reviewed-by: Zhang Yanfei --- v3: memort->memory, pointed out by Yanfei Documentation/memory-hotplug.txt | 125 +++--- drivers/base/memory.c| 12 2 files changed, 61 insertions(+), 76 deletions(-) diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 58340d5..1aa239f 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a "section". The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is "one section". (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called "sections". The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. -To determine the size of sections, please read this file: +Memory sections are combined into chunks referred to as "memory blocks". The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) + +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,42 +128,35 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as +All memory blocks have their device information in sysfs. Each memory block +is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is the section id.) +(XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory block. -For example, assume 1GiB section size. A device for a memory starting at +For example, assume 1GiB memory block size. A device for a memory starting at 0x1 is /sys/device/system/memory/memory4 (0x1 / 1Gib = 4) This device covers address range [0x1 ... 0x14000) -Under each section, you can see 4 or 5 files, the end_phys_index file being -a recent addition and not present on older kernels. +Under each memory block, you can see 4 files: -/sys/devices/system/memory/memoryXXX/start_phys_index -/sys/devices/system/memory/memoryXXX/end_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify "online_kernel", @@ -185,6 +183,7 @@ For example: A backlink will also be created: /sys/de
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote: > Hello, > > On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote: > > / * > > * This process might deadlock with another process trying to > > * remove this device: > > * This process holding the s_active of "online" attribute, and tries > > * to online/offline the device with some locks protecting hotplug. > > * Device removing process holding some locks protecting hotplug, and > > * tries to remove the "online" attribute, waiting for the s_active to > > * be released. > > * > > * The deadlock described above should be solved with > > * lock_device_hotplug_sysfs(). We temporarily drop the active > > * protection here to avoid some lockdep warnings. > > * > > * If device_hotplug_lock is forgotten to be used when removing > > * device(possibly some very simple device even don't need this lock?), > > * @dev could go away any time after dropping the active protection. > > * So increase its ref count before dropping active protection. > > * Though invoking device_{on|off}line() on a removed device seems > > * unreasonable, it should be less disastrous than playing with freed > > * @dev. Also, we might be able to have some mechanism abort > > * device_{on|off}line() if @dev already removed. > > */ > > Hmmm... I'm not sure I fully understand the problem. Does the code > ever try to remove "online" while holding cpu_add_remove_lock and, > when written 0, online knob grabs cpu_add_remove_lock? Yes. In acpi_processor_remove(), cpu_maps_update_begin() is called to hold cpu_add_remove_lock, and then arch_unregister_cpu calls unregister_cpu(), which will try to remove dir cpu1 including "online". while written 0 to online, cpu_down() will also try to grab cpu_add_remove_lock with cpu_maps_update_begin(). > If so, that is > an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote: Hello, On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote: / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of online attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the online attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Hmmm... I'm not sure I fully understand the problem. Does the code ever try to remove online while holding cpu_add_remove_lock and, when written 0, online knob grabs cpu_add_remove_lock? Yes. In acpi_processor_remove(), cpu_maps_update_begin() is called to hold cpu_add_remove_lock, and then arch_unregister_cpu calls unregister_cpu(), which will try to remove dir cpu1 including online. while written 0 to online, cpu_down() will also try to grab cpu_add_remove_lock with cpu_maps_update_begin(). If so, that is an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com --- v3: memort-memory, pointed out by Yanfei Documentation/memory-hotplug.txt | 125 +++--- drivers/base/memory.c| 12 2 files changed, 61 insertions(+), 76 deletions(-) diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 58340d5..1aa239f 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a section. The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is one section. (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called sections. The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. -To determine the size of sections, please read this file: +Memory sections are combined into chunks referred to as memory blocks. The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) + +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,42 +128,35 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as +All memory blocks have their device information in sysfs. Each memory block +is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is the section id.) +(XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory block. -For example, assume 1GiB section size. A device for a memory starting at +For example, assume 1GiB memory block size. A device for a memory starting at 0x1 is /sys/device/system/memory/memory4 (0x1 / 1Gib = 4) This device covers address range [0x1 ... 0x14000) -Under each section, you can see 4 or 5 files, the end_phys_index file being -a recent addition and not present on older kernels. +Under each memory block, you can see 4 files: -/sys/devices/system/memory/memoryXXX/start_phys_index -/sys/devices/system/memory/memoryXXX/end_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify online_kernel, @@ -185,6 +183,7 @@ For example: A backlink will also be created: /sys/devices/system
Re: [RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
On Mon, 2014-04-14 at 17:13 +0800, Zhang Yanfei wrote: > On 04/14/2014 04:43 PM, Li Zhong wrote: > > Seems we all agree that information about SECTION, e.g. section size, > > sections per memory block should be kept as kernel internals, and not > > exposed to userspace. > > > > This patch updates Documentation/memory-hotplug.txt to refer to memory > > blocks instead of memory sections where appropriate and added a > > paragraph to explain that memory blocks are made of memory sections. > > The documentation update is mostly provided by Nathan. > > > > Also, as end_phys_index in code is actually not the end section id, but > > the end memory block id, which should always be the same as phys_index. > > So it is removed here. > > > > Signed-off-by: Li Zhong > > Reviewed-by: Zhang Yanfei > > Still the nitpick there. Ao.. Will fix it in next version. Thanks, Zhong > > > --- > > Documentation/memory-hotplug.txt | 125 > > +++--- > > drivers/base/memory.c| 12 > > 2 files changed, 61 insertions(+), 76 deletions(-) > > > > diff --git a/Documentation/memory-hotplug.txt > > b/Documentation/memory-hotplug.txt > > index 58340d5..1aa239f 100644 > > --- a/Documentation/memory-hotplug.txt > > +++ b/Documentation/memory-hotplug.txt > > @@ -88,16 +88,21 @@ phase by hand. > > > > 1.3. Unit of Memory online/offline operation > > > > -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole > > memory > > -into chunks of the same size. The chunk is called a "section". The size of > > -a section is architecture dependent. For example, power uses 16MiB, ia64 > > uses > > -1GiB. The unit of online/offline operation is "one section". (see Section > > 3.) > > +Memory hotplug uses SPARSEMEM memory model which allows memory to be > > divided > > +into chunks of the same size. These chunks are called "sections". The size > > of > > +a memory section is architecture dependent. For example, power uses 16MiB, > > ia64 > > +uses 1GiB. > > > > -To determine the size of sections, please read this file: > > +Memory sections are combined into chunks referred to as "memory blocks". > > The > > +size of a memory block is architecture dependent and represents the logical > > +unit upon which memory online/offline operations are to be performed. The > > +default size of a memory block is the same as memory section size unless an > > +architecture specifies otherwise. (see Section 3.) > > + > > +To determine the size (in bytes) of a memory block please read this file: > > > > /sys/devices/system/memory/block_size_bytes > > > > -This file shows the size of sections in byte. > > > > --- > > 2. Kernel Configuration > > @@ -123,42 +128,35 @@ config options. > > (CONFIG_ACPI_CONTAINER). > > This option can be kernel module too. > > > > + > > > > -4 sysfs files for memory hotplug > > +3 sysfs files for memory hotplug > > > > -All sections have their device information in sysfs. Each section is part > > of > > -a memory block under /sys/devices/system/memory as > > +All memory blocks have their device information in sysfs. Each memory > > block > > +is described under /sys/devices/system/memory as > > > > /sys/devices/system/memory/memoryXXX > > -(XXX is the section id.) > > +(XXX is the memory block id.) > > > > -Now, XXX is defined as (start_address_of_section / section_size) of the > > first > > -section contained in the memory block. The files 'phys_index' and > > -'end_phys_index' under each directory report the beginning and end section > > id's > > -for the memory block covered by the sysfs directory. It is expected that > > all > > +For the memory block covered by the sysfs directory. It is expected that > > all > > memory sections in this range are present and no memory holes exist in the > > range. Currently there is no way to determine if there is a memory hole, > > but > > the existence of one should not affect the hotplug capabilities of the > > memory > > block. > > > > -For example, assume 1GiB section size. A device for a memory starting at > > +For example, assume 1GiB memory block size. A device for a memory starting > > at > > 0x1
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote: > Hello, > > On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote: > > @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct > > device_attribute *attr, > > { > > bool val; > > int ret; > > + struct kernfs_node *kn; > > > > ret = strtobool(buf, ); > > if (ret < 0) > > @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct > > device_attribute *attr, > > if (ret) > > return ret; > > > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > > + if (WARN_ON_ONCE(!kn)) > > + goto out; > > + > > + get_device(dev); > > + kernfs_break_active_protection(kn); > > ret = val ? device_online(dev) : device_offline(dev); > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + > > + kernfs_put(kn); > > + > > +out: > > unlock_device_hotplug(); > > return ret < 0 ? ret : count; > > } > > Can you please add comment explainin why this is being down? As it > currently stands, it's quite a bit of complexity without any > indication what's going on why. Also, if device_hotplug is locked, is > it really necessary to get @dev? Can it go away inbetween? The code > snippet looks weird because getting @dev indicates that the device > might go away without doing it but then it proceeds to invoke > device_{on|off}line() which probably isn't safe to invoke on a removed > device. Hi Tejun, I tried to write some draft words here... (I'm not good at writing it...) Could you please help to have a review and comment? thanks. / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of "online" attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the "online" attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong --- Documentation/memory-hotplug.txt | 125 +++--- drivers/base/memory.c| 12 2 files changed, 61 insertions(+), 76 deletions(-) diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 58340d5..1aa239f 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a "section". The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is "one section". (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called "sections". The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. -To determine the size of sections, please read this file: +Memory sections are combined into chunks referred to as "memory blocks". The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) + +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,42 +128,35 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as +All memory blocks have their device information in sysfs. Each memory block +is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is the section id.) +(XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory block. -For example, assume 1GiB section size. A device for a memory starting at +For example, assume 1GiB memory block size. A device for a memory starting at 0x1 is /sys/device/system/memory/memory4 (0x1 / 1Gib = 4) This device covers address range [0x1 ... 0x14000) -Under each section, you can see 4 or 5 files, the end_phys_index file being -a recent addition and not present on older kernels. +Under each memory block, you can see 4 files: -/sys/devices/system/memory/memoryXXX/start_phys_index -/sys/devices/system/memory/memoryXXX/end_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify "online_kernel", @@ -185,6 +183,7 @@ For example: A backlink will also be created: /sys/devices/system/memory/memory9/node0 -> ../../node/node0 + -
[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] -> #3 (s_active#79){.+}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] __kernfs_remove+0x250/0x310 [84154.213203][] kernfs_remove_by_name_ns +0x50/0xc0 [84154.213203][] sysfs_remove_file_ns +0x15/0x20 [84154.213203][] device_remove_file+0x19/0x20 [84154.213203][] device_remove_attrs+0x33/0x80 [84154.213203][] device_del+0x127/0x1c0 [84154.213203][] device_unregister+0x22/0x60 [84154.213203][] unregister_cpu+0x1e/0x40 [84154.213203][] arch_unregister_cpu+0x23/0x30 [84154.213203][] acpi_processor_remove +0x8d/0xb2 [84154.213203][] acpi_bus_trim+0x5a/0x8d [84154.213203][] acpi_device_hotplug +0x1a8/0x3ec [84154.213203][] acpi_hotplug_work_fn +0x1f/0x2b [84154.213203][] process_one_work+0x1eb/0x6b0 [84154.213203][] worker_thread+0x11b/0x370 [84154.213203][] kthread+0xe4/0x100 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_hotplug_begin+0x4f/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #1 (cpu_hotplug.lock){++}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] cpu_hotplug_begin+0x41/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][] __lock_acquire+0x1f2a/0x1f60 [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_maps_update_begin +0x17/0x20 [84154.213203][] cpu_down+0x1d/0x50 [84154.213203][] cpu_subsys_offline+0x14/0x20 [84154.213203][] device_offline+0xad/0xd0 [84154.213203][] online_store+0x42/0x80 [84154.213203][] dev_attr_store+0x18/0x30 [84154.213203][] sysfs_kf_write+0x49/0x60 [84154.213203][] kernfs_fop_write+0x109/0x190 [84154.213203][] vfs_write+0xbe/0x1c0 [84154.213203][] SyS_write+0x52/0xb0 [84154.213203][] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. As Tejun suggested, to avoid this lockdep warning, kernfs_break_active_protection() is used before online/offline callbacks to take the s_active lock out of the dependency chain during online/offline operations. Signed-off-by: Li Zhong --- drivers/base/core.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..b313ad7 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, ); if (ret < 0) @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret)
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote: Hello, On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote: @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, val); if (ret 0) @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret) return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + goto out; + + get_device(dev); + kernfs_break_active_protection(kn); ret = val ? device_online(dev) : device_offline(dev); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + +out: unlock_device_hotplug(); return ret 0 ? ret : count; } Can you please add comment explainin why this is being down? As it currently stands, it's quite a bit of complexity without any indication what's going on why. Also, if device_hotplug is locked, is it really necessary to get @dev? Can it go away inbetween? The code snippet looks weird because getting @dev indicates that the device might go away without doing it but then it proceeds to invoke device_{on|off}line() which probably isn't safe to invoke on a removed device. Hi Tejun, I tried to write some draft words here... (I'm not good at writing it...) Could you please help to have a review and comment? thanks. / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of online attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the online attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Thanks, Zhong Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
On Mon, 2014-04-14 at 17:13 +0800, Zhang Yanfei wrote: On 04/14/2014 04:43 PM, Li Zhong wrote: Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com Still the nitpick there. Ao.. Will fix it in next version. Thanks, Zhong --- Documentation/memory-hotplug.txt | 125 +++--- drivers/base/memory.c| 12 2 files changed, 61 insertions(+), 76 deletions(-) diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 58340d5..1aa239f 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a section. The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is one section. (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called sections. The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. -To determine the size of sections, please read this file: +Memory sections are combined into chunks referred to as memory blocks. The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) + +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,42 +128,35 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as +All memory blocks have their device information in sysfs. Each memory block +is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is the section id.) +(XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory block. -For example, assume 1GiB section size. A device for a memory starting at +For example, assume 1GiB memory block size. A device for a memory starting at 0x1 is /sys/device/system/memory/memory4 (0x1 / 1Gib = 4) This device covers address range [0x1 ... 0x14000) -Under each section, you can see 4 or 5 files, the end_phys_index file being -a recent addition and not present on older kernels. +Under each memory block, you can see 4 files: -/sys/devices/system/memory/memoryXXX/start_phys_index -/sys/devices/system/memory/memoryXXX/end_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id
[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
suggested, to avoid this lockdep warning, kernfs_break_active_protection() is used before online/offline callbacks to take the s_active lock out of the dependency chain during online/offline operations. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/core.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..b313ad7 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, val); if (ret 0) @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret) return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + goto out; + + get_device(dev); + kernfs_break_active_protection(kn); ret = val ? device_online(dev) : device_offline(dev); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + +out: unlock_device_hotplug(); return ret 0 ? ret : count; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- Documentation/memory-hotplug.txt | 125 +++--- drivers/base/memory.c| 12 2 files changed, 61 insertions(+), 76 deletions(-) diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 58340d5..1aa239f 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a section. The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is one section. (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called sections. The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. -To determine the size of sections, please read this file: +Memory sections are combined into chunks referred to as memory blocks. The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) + +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,42 +128,35 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as +All memory blocks have their device information in sysfs. Each memory block +is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is the section id.) +(XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory block. -For example, assume 1GiB section size. A device for a memory starting at +For example, assume 1GiB memory block size. A device for a memory starting at 0x1 is /sys/device/system/memory/memory4 (0x1 / 1Gib = 4) This device covers address range [0x1 ... 0x14000) -Under each section, you can see 4 or 5 files, the end_phys_index file being -a recent addition and not present on older kernels. +Under each memory block, you can see 4 files: -/sys/devices/system/memory/memoryXXX/start_phys_index -/sys/devices/system/memory/memoryXXX/end_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify online_kernel, @@ -185,6 +183,7 @@ For example: A backlink will also be created: /sys/devices/system/memory/memory9/node0 - ../../node/node0 + 4. Physical memory
[RFC PATCH v2] Use kernfs_break_active_protection() for device online store callbacks
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] -> #3 (s_active#79){.+}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] __kernfs_remove+0x250/0x310 [84154.213203][] kernfs_remove_by_name_ns +0x50/0xc0 [84154.213203][] sysfs_remove_file_ns +0x15/0x20 [84154.213203][] device_remove_file+0x19/0x20 [84154.213203][] device_remove_attrs+0x33/0x80 [84154.213203][] device_del+0x127/0x1c0 [84154.213203][] device_unregister+0x22/0x60 [84154.213203][] unregister_cpu+0x1e/0x40 [84154.213203][] arch_unregister_cpu+0x23/0x30 [84154.213203][] acpi_processor_remove +0x8d/0xb2 [84154.213203][] acpi_bus_trim+0x5a/0x8d [84154.213203][] acpi_device_hotplug +0x1a8/0x3ec [84154.213203][] acpi_hotplug_work_fn +0x1f/0x2b [84154.213203][] process_one_work+0x1eb/0x6b0 [84154.213203][] worker_thread+0x11b/0x370 [84154.213203][] kthread+0xe4/0x100 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_hotplug_begin+0x4f/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #1 (cpu_hotplug.lock){++}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] cpu_hotplug_begin+0x41/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][] __lock_acquire+0x1f2a/0x1f60 [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_maps_update_begin +0x17/0x20 [84154.213203][] cpu_down+0x1d/0x50 [84154.213203][] cpu_subsys_offline+0x14/0x20 [84154.213203][] device_offline+0xad/0xd0 [84154.213203][] online_store+0x42/0x80 [84154.213203][] dev_attr_store+0x18/0x30 [84154.213203][] sysfs_kf_write+0x49/0x60 [84154.213203][] kernfs_fop_write+0x109/0x190 [84154.213203][] vfs_write+0xbe/0x1c0 [84154.213203][] SyS_write+0x52/0xb0 [84154.213203][] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. As Tejun suggested, to avoid this lockdep warning, kernfs_break_active_protection() is used before online/offline callbacks to take the s_active lock out of the dependency chain during online/offline operations. Signed-off-by: Li Zhong --- drivers/base/core.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..2b9f68e 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, ); if (ret < 0) @@ -448,7 +449,15 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret)
[RFC PATCH] Suppress a device hot remove related lockdep warning
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] -> #3 (s_active#79){.+}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] __kernfs_remove+0x250/0x310 [84154.213203][] kernfs_remove_by_name_ns+0x50/0xc0 [84154.213203][] sysfs_remove_file_ns+0x15/0x20 [84154.213203][] device_remove_file+0x19/0x20 [84154.213203][] device_remove_attrs+0x33/0x80 [84154.213203][] device_del+0x127/0x1c0 [84154.213203][] device_unregister+0x22/0x60 [84154.213203][] unregister_cpu+0x1e/0x40 [84154.213203][] arch_unregister_cpu+0x23/0x30 [84154.213203][] acpi_processor_remove+0x8d/0xb2 [84154.213203][] acpi_bus_trim+0x5a/0x8d [84154.213203][] acpi_device_hotplug+0x1a8/0x3ec [84154.213203][] acpi_hotplug_work_fn+0x1f/0x2b [84154.213203][] process_one_work+0x1eb/0x6b0 [84154.213203][] worker_thread+0x11b/0x370 [84154.213203][] kthread+0xe4/0x100 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_hotplug_begin+0x4f/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable+0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #1 (cpu_hotplug.lock){++}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] cpu_hotplug_begin+0x41/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable+0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][] __lock_acquire+0x1f2a/0x1f60 [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_maps_update_begin+0x17/0x20 [84154.213203][] cpu_down+0x1d/0x50 [84154.213203][] cpu_subsys_offline+0x14/0x20 [84154.213203][] device_offline+0xad/0xd0 [84154.213203][] online_store+0x42/0x80 [84154.213203][] dev_attr_store+0x18/0x30 [84154.213203][] sysfs_kf_write+0x49/0x60 [84154.213203][] kernfs_fop_write+0x109/0x190 [84154.213203][] vfs_write+0xbe/0x1c0 [84154.213203][] SyS_write+0x52/0xb0 [84154.213203][] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. This patch uses DEVICE_ATTR_IGNORE_LOCKDEP for "online" attr to suppress this lockdep warning. But I'm a little afraid it might also hide (future) potential real dead lock scenarios? Signed-off-by: Li Zhong --- drivers/base/core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..98e3e09 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -452,7 +452,9 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, unlock_device_hotplug(); return ret < 0 ? ret : count; } -static DEVICE_ATTR_RW(online); + +static DEVICE_ATTR_IGNORE_LOCKDEP(online, (S_IWUSR | S_IRUGO), + online_show, online_store); int device_add_groups(s
[RFC PATCH] Suppress a device hot remove related lockdep warning
DEVICE_ATTR_IGNORE_LOCKDEP for online attr to suppress this lockdep warning. But I'm a little afraid it might also hide (future) potential real dead lock scenarios? Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..98e3e09 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -452,7 +452,9 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, unlock_device_hotplug(); return ret 0 ? ret : count; } -static DEVICE_ATTR_RW(online); + +static DEVICE_ATTR_IGNORE_LOCKDEP(online, (S_IWUSR | S_IRUGO), + online_show, online_store); int device_add_groups(struct device *dev, const struct attribute_group **groups) { -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2] Use kernfs_break_active_protection() for device online store callbacks
suggested, to avoid this lockdep warning, kernfs_break_active_protection() is used before online/offline callbacks to take the s_active lock out of the dependency chain during online/offline operations. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/core.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..2b9f68e 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, val); if (ret 0) @@ -448,7 +449,15 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret) return ret; + kn = kernfs_find_and_get(dev-kobj.sd, attr-attr.name); + if (WARN_ON_ONCE(!kn)) + goto out; + + kernfs_break_active_protection(kn); + ret = val ? device_online(dev) : device_offline(dev); + kernfs_unbreak_active_protection(kn); +out: unlock_device_hotplug(); return ret 0 ? ret : count; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-09 at 12:39 -0500, Nathan Fontenot wrote: > On 04/08/2014 02:47 PM, Dave Hansen wrote: > > > > That document really needs to be updated to stop referring to sections > > (at least in the descriptions of the user interface). We can not change > > the units of phys_index/end_phys_index without also changing > > block_size_bytes. > > > > Here is a first pass at updating the documentation. > > I have tried to update the documentation to refer to memory blocks instead > of memory sections where appropriate and added a paragraph to explain > that memory blocks are mode of memory sections. > > Thoughts? If we all agree to hide the information about sections, then I think we also need to update the section id's used for phys_index/end_phys_index, something like following on top of yours? -- diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 92d15e2..9fbb025 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -138,10 +138,7 @@ is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX (XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory @@ -155,16 +152,14 @@ This device covers address range [0x1 ... 0x14000) Under each memory block, you can see 4 or 5 files, the end_phys_index file being a recent addition and not present on older kernels. -/sys/devices/system/memory/memoryXXX/start_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/end_phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. +'end_phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify "online_kernel", -- Not sure whether it is proper to remove end_phys_index, too? Thanks, Zhong > > -Nathan > --- > Documentation/memory-hotplug.txt | 113 > --- > 1 file changed, 59 insertions(+), 54 deletions(-) > > Index: linux/Documentation/memory-hotplug.txt > === > --- linux.orig/Documentation/memory-hotplug.txt > +++ linux/Documentation/memory-hotplug.txt > @@ -88,16 +88,21 @@ phase by hand. > > 1.3. Unit of Memory online/offline operation > > -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole > memory > -into chunks of the same size. The chunk is called a "section". The size of > -a section is architecture dependent. For example, power uses 16MiB, ia64 uses > -1GiB. The unit of online/offline operation is "one section". (see Section 3.) > +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided > +into chunks of the same size. These chunks are called "sections". The size of > +a memory section is architecture dependent. For example, power uses 16MiB, > ia64 > +uses 1GiB. > + > +Memory sections are combined into chunks referred to as "memory blocks". The > +size of a memory block is architecture dependent and represents the logical > +unit upon which memory online/offline operations are to be performed. The > +default size of a memory block is the same as memory section size unless an > +architecture specifies otherwise. (see Section 3.) > > -To determine the size of sections, please read this file: > +To determine the size (in bytes) of a memory block please read this file: > > /sys/devices/system/memory/block_size_bytes > > -This file shows the size of sections in byte. > > --- > 2. Kernel Configuration > @@ -123,14 +128,15 @@ config options. > (CONFIG_ACPI_CONTAINER). > This option can be kernel module too. > > + > > -4 sysfs files for memory hotplug > +3 sysfs files for memory hotplug > > -All sections have their device information in
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-09 at 08:49 -0700, Dave Hansen wrote: > On 04/09/2014 02:20 AM, Li Zhong wrote: > > Or do you mean we don't need to expose any information related to > > SECTION to userspace? > > Right, we don't need to expose sections themselves to userspace. Do we? > OK, I agree with that. Yanfei, I recall you once expressed your preference for section numbers? Thanks, Zhong -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Tue, 2014-04-08 at 12:47 -0700, Dave Hansen wrote: > On 04/08/2014 11:23 AM, Nathan Fontenot wrote: > > On 04/08/2014 11:13 AM, Dave Hansen wrote: > >> On 04/08/2014 01:27 AM, Li Zhong wrote: > >>> If Dave and others don't have further objections, it seems this small > >>> userspace incompatibility could be accepted by most of us, and I don't > >>> need to make a version 2. > >> > >> Let me ask another question then. What are the units of > >> phys_index/end_phys_index? How do we expose those units to userspace? > >> > > > > The documentation for these files just states that the files contain > > the first and last section id of memory in the memory block for > > phys_index and end_phys_index respectively. > > > > I'm not sure the values have ever been units of anything, at least not > > that I remember. > > > > There are two units. SECTION_SIZE, which is completely internal to the > kernel, and block_size_bytes which used to be the same as SECTION_SIZE, > but is not now. Which one of those two is phys_index/end_phys_index in, > and if it is in terms of SECTION_SIZE like this patch proposes, how do > we tell userspace how large SECTION_SIZE is? With this patch, I think we could still tell how large SECTION_SIZE is. block_size_bytes tells us the size of the block, and end_phys_index-phys_index+1, tells us the numbers of sections in the block, and then we could know the SECTION_SIZE by a division. Not that obvious, but if needed, we could easily add a file telling us section_size or sections_per_block. > > block_size_bytes is supposed to tell you how large the sections are. In > the case where we lumped a bunch of sections together, we also bumped up > block_size_bytes. That's why we currently divide the *ACTUAL* section > number in phys_index/end_phys_index by block_size_bytes. > > That document really needs to be updated to stop referring to sections > (at least in the descriptions of the user interface). We can not change > the units of phys_index/end_phys_index without also changing > block_size_bytes. How about adding a new section_size_bytes together with this patch? Or do you mean we don't need to expose any information related to SECTION to userspace? Thanks, Zhong -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Tue, 2014-04-08 at 12:47 -0700, Dave Hansen wrote: On 04/08/2014 11:23 AM, Nathan Fontenot wrote: On 04/08/2014 11:13 AM, Dave Hansen wrote: On 04/08/2014 01:27 AM, Li Zhong wrote: If Dave and others don't have further objections, it seems this small userspace incompatibility could be accepted by most of us, and I don't need to make a version 2. Let me ask another question then. What are the units of phys_index/end_phys_index? How do we expose those units to userspace? The documentation for these files just states that the files contain the first and last section id of memory in the memory block for phys_index and end_phys_index respectively. I'm not sure the values have ever been units of anything, at least not that I remember. sigh There are two units. SECTION_SIZE, which is completely internal to the kernel, and block_size_bytes which used to be the same as SECTION_SIZE, but is not now. Which one of those two is phys_index/end_phys_index in, and if it is in terms of SECTION_SIZE like this patch proposes, how do we tell userspace how large SECTION_SIZE is? With this patch, I think we could still tell how large SECTION_SIZE is. block_size_bytes tells us the size of the block, and end_phys_index-phys_index+1, tells us the numbers of sections in the block, and then we could know the SECTION_SIZE by a division. Not that obvious, but if needed, we could easily add a file telling us section_size or sections_per_block. block_size_bytes is supposed to tell you how large the sections are. In the case where we lumped a bunch of sections together, we also bumped up block_size_bytes. That's why we currently divide the *ACTUAL* section number in phys_index/end_phys_index by block_size_bytes. That document really needs to be updated to stop referring to sections (at least in the descriptions of the user interface). We can not change the units of phys_index/end_phys_index without also changing block_size_bytes. How about adding a new section_size_bytes together with this patch? Or do you mean we don't need to expose any information related to SECTION to userspace? Thanks, Zhong -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-09 at 08:49 -0700, Dave Hansen wrote: On 04/09/2014 02:20 AM, Li Zhong wrote: Or do you mean we don't need to expose any information related to SECTION to userspace? Right, we don't need to expose sections themselves to userspace. Do we? OK, I agree with that. Yanfei, I recall you once expressed your preference for section numbers? Thanks, Zhong -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-09 at 12:39 -0500, Nathan Fontenot wrote: On 04/08/2014 02:47 PM, Dave Hansen wrote: That document really needs to be updated to stop referring to sections (at least in the descriptions of the user interface). We can not change the units of phys_index/end_phys_index without also changing block_size_bytes. Here is a first pass at updating the documentation. I have tried to update the documentation to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are mode of memory sections. Thoughts? If we all agree to hide the information about sections, then I think we also need to update the section id's used for phys_index/end_phys_index, something like following on top of yours? -- diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 92d15e2..9fbb025 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -138,10 +138,7 @@ is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX (XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory @@ -155,16 +152,14 @@ This device covers address range [0x1 ... 0x14000) Under each memory block, you can see 4 or 5 files, the end_phys_index file being a recent addition and not present on older kernels. -/sys/devices/system/memory/memoryXXX/start_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/end_phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. +'end_phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify online_kernel, -- Not sure whether it is proper to remove end_phys_index, too? Thanks, Zhong -Nathan --- Documentation/memory-hotplug.txt | 113 --- 1 file changed, 59 insertions(+), 54 deletions(-) Index: linux/Documentation/memory-hotplug.txt === --- linux.orig/Documentation/memory-hotplug.txt +++ linux/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a section. The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is one section. (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called sections. The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. + +Memory sections are combined into chunks referred to as memory blocks. The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) -To determine the size of sections, please read this file: +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,14 +128,15 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Fri, 2014-04-04 at 10:29 +0900, Yasuaki Ishimatsu wrote: > (2014/04/02 17:56), Li Zhong wrote: > > I noticed the phys_index and end_phys_index under > > /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > > (for the test machine, one memory block has 8 sections, that is > > sections_per_block equals 8) > > > > # cd /sys/devices/system/memory/memory100/ > > # cat phys_index end_phys_index > > 0064 > > 0064 > > > > Seems they should reflect the start/end section number respectively, which > > also matches what is said in Documentation/memory-hotplug.txt > > > > This patch tries to modify that so the two files could show the start/end > > section number of the memory block. > > > > After this change, output of the above example looks like: > > > > # cat phys_index end_phys_index > > 0320 > > 0327 > > > > Signed-off-by: Li Zhong > > --- > > Good catch! This is a bug. So I think the bug should be fixed. > > Reviewed-by: Yasuaki Ishimatsu Thank you all for the review. If Dave and others don't have further objections, it seems this small userspace incompatibility could be accepted by most of us, and I don't need to make a version 2. Thanks, Zhong > Thanks, > Yasuaki Ishimatsu > > > drivers/base/memory.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > > index bece691..b10f2fa 100644 > > --- a/drivers/base/memory.c > > +++ b/drivers/base/memory.c > > @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->start_section_nr / sections_per_block; > > + phys_index = mem->start_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->end_section_nr / sections_per_block; > > + phys_index = mem->end_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Fri, 2014-04-04 at 10:29 +0900, Yasuaki Ishimatsu wrote: (2014/04/02 17:56), Li Zhong wrote: I noticed the phys_index and end_phys_index under /sys/devices/system/memory/memoryXXX/ have the same value, e.g. (for the test machine, one memory block has 8 sections, that is sections_per_block equals 8) # cd /sys/devices/system/memory/memory100/ # cat phys_index end_phys_index 0064 0064 Seems they should reflect the start/end section number respectively, which also matches what is said in Documentation/memory-hotplug.txt This patch tries to modify that so the two files could show the start/end section number of the memory block. After this change, output of the above example looks like: # cat phys_index end_phys_index 0320 0327 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- Good catch! This is a bug. So I think the bug should be fixed. Reviewed-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com Thank you all for the review. If Dave and others don't have further objections, it seems this small userspace incompatibility could be accepted by most of us, and I don't need to make a version 2. Thanks, Zhong Thanks, Yasuaki Ishimatsu drivers/base/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..b10f2fa 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-start_section_nr / sections_per_block; + phys_index = mem-start_section_nr; return sprintf(buf, %08lx\n, phys_index); } @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-end_section_nr / sections_per_block; + phys_index = mem-end_section_nr; return sprintf(buf, %08lx\n, phys_index); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Thu, 2014-04-03 at 11:06 +0800, Zhang Yanfei wrote: > On 04/03/2014 10:37 AM, Li Zhong wrote: > > On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote: > >> Add ccing > >> > >> On 04/02/2014 04:56 PM, Li Zhong wrote: > >>> I noticed the phys_index and end_phys_index under > >>> /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > >>> (for the test machine, one memory block has 8 sections, that is > >>> sections_per_block equals 8) > >>> > >>> # cd /sys/devices/system/memory/memory100/ > >>> # cat phys_index end_phys_index > >>> 0064 > >>> 0064 > >>> > >>> Seems they should reflect the start/end section number respectively, > >>> which > >>> also matches what is said in Documentation/memory-hotplug.txt > >> > > Hi Yanfei, > > > > Thanks for the review. > > > >> Indeed. I've noticed this before. The value in 'end_phys_index' doesn't > >> match what it really means. But, the name itself is vague, it looks like > >> it is the index of some page frame. (we keep this name for compatibility?) > > > > I guess so, Dave just reminded me that the RFC would also break > > userspace.. > > > > And now my plan is: > > leave the code unchanged > > update the document, state the end_phys_index/phys_index are the same, > > and means the memory block index > > Ah. I doubt whether there is userspace tool which is using the two sysfiles? > for example, the memory100 directory itself can tell us which block it is. > So why there is the two files under it give the same meaning. > > If there is userspace tool using the two files, does it use 'end_phys_index' > in the correct way? That said, if a userspace tool knows what the > 'end_phys_index' > really mean, does it still need it since we have 'phys_index' with the same > value? For the end_phys_index, I totally agree with you. If somebody tries to use it, say as the end section number, he should finally be able to find that the value is not what he expects. But who knows ... For phys_index, I guess it is also reasonable for some userspace tool to just loop for each memoryXXX directory under /sys/devices/system/memory, and use the phys_index as the memory block index for the directory, instead of extracting the XXX from the directory name memoryXXX Don't know whether there are some generic rules for handling such kind of compatibility issues... > > [optional] create two new files start_sec_nr, end_sec_nr if needed > > These two are the really meaningful sysfiles for userspace, IMO. OK, I see. > > > > > Do you have any other suggestions? > > No. I think we should wait for other guys to comment more. OK, let's wait. Thanks, Zhong > > Thanks. > > > > > Thanks, Zhong > > > >> > >> The corresponding member in struct memory_block is: > >> > >> struct memory_block { > >> unsigned long start_section_nr; > >> unsigned long end_section_nr; > >> ... > >> > >> The two members seem to have the right name, and have the right value in > >> kernel. > >> > >> > >>> > >>> This patch tries to modify that so the two files could show the start/end > >>> section number of the memory block. > >>> > >>> After this change, output of the above example looks like: > >>> > >>> # cat phys_index end_phys_index > >>> 0320 > >>> 0327 > >>> > >>> Signed-off-by: Li Zhong > >>> --- > >>> drivers/base/memory.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c > >>> index bece691..b10f2fa 100644 > >>> --- a/drivers/base/memory.c > >>> +++ b/drivers/base/memory.c > >>> @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct > >>> device *dev, > >>> struct memory_block *mem = to_memory_block(dev); > >>> unsigned long phys_index; > >>> > >>> - phys_index = mem->start_section_nr / sections_per_block; > >>> + phys_index = mem->start_section_nr; > >>> return sprintf(buf, "%08lx\n", phys_index); > >>> } > >>> > >>> @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device > >>> *dev, > >>> struct memory_block *mem = to_memory_block(dev); > >>> unsigned long phys_index; > >>> > >>> - phys_index = mem->end_section_nr / sections_per_block; > >>> + phys_index = mem->end_section_nr; > >>> return sprintf(buf, "%08lx\n", phys_index); > >>> } > >>> > >>> > >>> > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >>> the body of a message to majord...@vger.kernel.org > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>> Please read the FAQ at http://www.tux.org/lkml/ > >>> > >> > >> > > > > > > . > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Thu, 2014-04-03 at 11:06 +0800, Zhang Yanfei wrote: On 04/03/2014 10:37 AM, Li Zhong wrote: On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote: Add ccing On 04/02/2014 04:56 PM, Li Zhong wrote: I noticed the phys_index and end_phys_index under /sys/devices/system/memory/memoryXXX/ have the same value, e.g. (for the test machine, one memory block has 8 sections, that is sections_per_block equals 8) # cd /sys/devices/system/memory/memory100/ # cat phys_index end_phys_index 0064 0064 Seems they should reflect the start/end section number respectively, which also matches what is said in Documentation/memory-hotplug.txt Hi Yanfei, Thanks for the review. Indeed. I've noticed this before. The value in 'end_phys_index' doesn't match what it really means. But, the name itself is vague, it looks like it is the index of some page frame. (we keep this name for compatibility?) I guess so, Dave just reminded me that the RFC would also break userspace.. And now my plan is: leave the code unchanged update the document, state the end_phys_index/phys_index are the same, and means the memory block index Ah. I doubt whether there is userspace tool which is using the two sysfiles? for example, the memory100 directory itself can tell us which block it is. So why there is the two files under it give the same meaning. If there is userspace tool using the two files, does it use 'end_phys_index' in the correct way? That said, if a userspace tool knows what the 'end_phys_index' really mean, does it still need it since we have 'phys_index' with the same value? For the end_phys_index, I totally agree with you. If somebody tries to use it, say as the end section number, he should finally be able to find that the value is not what he expects. But who knows ... For phys_index, I guess it is also reasonable for some userspace tool to just loop for each memoryXXX directory under /sys/devices/system/memory, and use the phys_index as the memory block index for the directory, instead of extracting the XXX from the directory name memoryXXX Don't know whether there are some generic rules for handling such kind of compatibility issues... [optional] create two new files start_sec_nr, end_sec_nr if needed These two are the really meaningful sysfiles for userspace, IMO. OK, I see. Do you have any other suggestions? No. I think we should wait for other guys to comment more. OK, let's wait. Thanks, Zhong Thanks. Thanks, Zhong The corresponding member in struct memory_block is: struct memory_block { unsigned long start_section_nr; unsigned long end_section_nr; ... The two members seem to have the right name, and have the right value in kernel. This patch tries to modify that so the two files could show the start/end section number of the memory block. After this change, output of the above example looks like: # cat phys_index end_phys_index 0320 0327 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..b10f2fa 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-start_section_nr / sections_per_block; + phys_index = mem-start_section_nr; return sprintf(buf, %08lx\n, phys_index); } @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-end_section_nr / sections_per_block; + phys_index = mem-end_section_nr; return sprintf(buf, %08lx\n, phys_index); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ . -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote: > Add ccing > > On 04/02/2014 04:56 PM, Li Zhong wrote: > > I noticed the phys_index and end_phys_index under > > /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > > (for the test machine, one memory block has 8 sections, that is > > sections_per_block equals 8) > > > > # cd /sys/devices/system/memory/memory100/ > > # cat phys_index end_phys_index > > 0064 > > 0064 > > > > Seems they should reflect the start/end section number respectively, which > > also matches what is said in Documentation/memory-hotplug.txt > Hi Yanfei, Thanks for the review. > Indeed. I've noticed this before. The value in 'end_phys_index' doesn't > match what it really means. But, the name itself is vague, it looks like > it is the index of some page frame. (we keep this name for compatibility?) I guess so, Dave just reminded me that the RFC would also break userspace.. And now my plan is: leave the code unchanged update the document, state the end_phys_index/phys_index are the same, and means the memory block index [optional] create two new files start_sec_nr, end_sec_nr if needed Do you have any other suggestions? Thanks, Zhong > > The corresponding member in struct memory_block is: > > struct memory_block { > unsigned long start_section_nr; > unsigned long end_section_nr; > ... > > The two members seem to have the right name, and have the right value in > kernel. > > > > > > This patch tries to modify that so the two files could show the start/end > > section number of the memory block. > > > > After this change, output of the above example looks like: > > > > # cat phys_index end_phys_index > > 0320 > > 0327 > > > > Signed-off-by: Li Zhong > > --- > > drivers/base/memory.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > > index bece691..b10f2fa 100644 > > --- a/drivers/base/memory.c > > +++ b/drivers/base/memory.c > > @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->start_section_nr / sections_per_block; > > + phys_index = mem->start_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->end_section_nr / sections_per_block; > > + phys_index = mem->end_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-02 at 09:09 -0700, Dave Hansen wrote: > On 04/02/2014 01:56 AM, Li Zhong wrote: > > I noticed the phys_index and end_phys_index under > > /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > > (for the test machine, one memory block has 8 sections, that is > > sections_per_block equals 8) > > > > # cd /sys/devices/system/memory/memory100/ > > # cat phys_index end_phys_index > > 0064 > > 0064 > > > > Seems they should reflect the start/end section number respectively, which > > also matches what is said in Documentation/memory-hotplug.txt > > This changes a user-visible interface. Won't this break userspace? Hi Dave, Hmm, I think so... Thank you for the reminder Do you have some better ideas to fix this? Maybe we should leave the code unchanged, just change the corresponding document (just it seems that the end_phys_index will always be the same as phys_index)? And if the section numbers are really needed, then we could create two new files later, e.g. start_section_nr, end_section_nr? Thanks, Zhong > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
I noticed the phys_index and end_phys_index under /sys/devices/system/memory/memoryXXX/ have the same value, e.g. (for the test machine, one memory block has 8 sections, that is sections_per_block equals 8) # cd /sys/devices/system/memory/memory100/ # cat phys_index end_phys_index 0064 0064 Seems they should reflect the start/end section number respectively, which also matches what is said in Documentation/memory-hotplug.txt This patch tries to modify that so the two files could show the start/end section number of the memory block. After this change, output of the above example looks like: # cat phys_index end_phys_index 0320 0327 Signed-off-by: Li Zhong --- drivers/base/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..b10f2fa 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem->start_section_nr / sections_per_block; + phys_index = mem->start_section_nr; return sprintf(buf, "%08lx\n", phys_index); } @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem->end_section_nr / sections_per_block; + phys_index = mem->end_section_nr; return sprintf(buf, "%08lx\n", phys_index); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
I noticed the phys_index and end_phys_index under /sys/devices/system/memory/memoryXXX/ have the same value, e.g. (for the test machine, one memory block has 8 sections, that is sections_per_block equals 8) # cd /sys/devices/system/memory/memory100/ # cat phys_index end_phys_index 0064 0064 Seems they should reflect the start/end section number respectively, which also matches what is said in Documentation/memory-hotplug.txt This patch tries to modify that so the two files could show the start/end section number of the memory block. After this change, output of the above example looks like: # cat phys_index end_phys_index 0320 0327 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..b10f2fa 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-start_section_nr / sections_per_block; + phys_index = mem-start_section_nr; return sprintf(buf, %08lx\n, phys_index); } @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-end_section_nr / sections_per_block; + phys_index = mem-end_section_nr; return sprintf(buf, %08lx\n, phys_index); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-02 at 09:09 -0700, Dave Hansen wrote: On 04/02/2014 01:56 AM, Li Zhong wrote: I noticed the phys_index and end_phys_index under /sys/devices/system/memory/memoryXXX/ have the same value, e.g. (for the test machine, one memory block has 8 sections, that is sections_per_block equals 8) # cd /sys/devices/system/memory/memory100/ # cat phys_index end_phys_index 0064 0064 Seems they should reflect the start/end section number respectively, which also matches what is said in Documentation/memory-hotplug.txt This changes a user-visible interface. Won't this break userspace? Hi Dave, Hmm, I think so... Thank you for the reminder Do you have some better ideas to fix this? Maybe we should leave the code unchanged, just change the corresponding document (just it seems that the end_phys_index will always be the same as phys_index)? And if the section numbers are really needed, then we could create two new files later, e.g. start_section_nr, end_section_nr? Thanks, Zhong -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote: Add ccing On 04/02/2014 04:56 PM, Li Zhong wrote: I noticed the phys_index and end_phys_index under /sys/devices/system/memory/memoryXXX/ have the same value, e.g. (for the test machine, one memory block has 8 sections, that is sections_per_block equals 8) # cd /sys/devices/system/memory/memory100/ # cat phys_index end_phys_index 0064 0064 Seems they should reflect the start/end section number respectively, which also matches what is said in Documentation/memory-hotplug.txt Hi Yanfei, Thanks for the review. Indeed. I've noticed this before. The value in 'end_phys_index' doesn't match what it really means. But, the name itself is vague, it looks like it is the index of some page frame. (we keep this name for compatibility?) I guess so, Dave just reminded me that the RFC would also break userspace.. And now my plan is: leave the code unchanged update the document, state the end_phys_index/phys_index are the same, and means the memory block index [optional] create two new files start_sec_nr, end_sec_nr if needed Do you have any other suggestions? Thanks, Zhong The corresponding member in struct memory_block is: struct memory_block { unsigned long start_section_nr; unsigned long end_section_nr; ... The two members seem to have the right name, and have the right value in kernel. This patch tries to modify that so the two files could show the start/end section number of the memory block. After this change, output of the above example looks like: # cat phys_index end_phys_index 0320 0327 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..b10f2fa 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-start_section_nr / sections_per_block; + phys_index = mem-start_section_nr; return sprintf(buf, %08lx\n, phys_index); } @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem-end_section_nr / sections_per_block; + phys_index = mem-end_section_nr; return sprintf(buf, %08lx\n, phys_index); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH ] workqueue: add args to workqueue lockdep name
Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in process_one_work(). Maybe #fmt plus #args could be used as the lock_name to give some more information for some fmt string like the above. __builtin_constant_p() check is removed (as there seems no good way to check all the variables in args list). However, by removing the check, it only adds two additional "s for those constants. Some lockdep name examples printed out after the change: lockdep namewq->name "events_long" events_long "%s"("khelper") khelper "xfs-data/%s"mp->m_fsname xfs-data/dm-3 Signed-off-by: Li Zhong --- include/linux/workqueue.h |5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 594521b..704f4f6 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active, static struct lock_class_key __key; \ const char *__lock_name;\ \ - if (__builtin_constant_p(fmt)) \ - __lock_name = (fmt);\ - else\ - __lock_name = #fmt; \ + __lock_name = #fmt#args;\ \ __alloc_workqueue_key((fmt), (flags), (max_active), \ &__key, __lock_name, ##args); \ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH ] workqueue: add args to workqueue lockdep name
Tommi noticed a 'funny' lock class name: %s#5 from a lock acquired in process_one_work(). Maybe #fmt plus #args could be used as the lock_name to give some more information for some fmt string like the above. __builtin_constant_p() check is removed (as there seems no good way to check all the variables in args list). However, by removing the check, it only adds two additional s for those constants. Some lockdep name examples printed out after the change: lockdep namewq-name events_long events_long %s(khelper) khelper xfs-data/%smp-m_fsname xfs-data/dm-3 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- include/linux/workqueue.h |5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 594521b..704f4f6 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active, static struct lock_class_key __key; \ const char *__lock_name;\ \ - if (__builtin_constant_p(fmt)) \ - __lock_name = (fmt);\ - else\ - __lock_name = #fmt; \ + __lock_name = #fmt#args;\ \ __alloc_workqueue_key((fmt), (flags), (max_active), \ __key, __lock_name, ##args); \ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: lockdep: strange %s#5 lock name
On Tue, 2014-02-11 at 10:27 -0500, Tejun Heo wrote: > On Tue, Feb 11, 2014 at 12:00:36PM +0100, Peter Zijlstra wrote: > > > Looks good to me. Can you please post the patch with SOB? > > > > --- > > Subject: workqueue: Fix workqueue lockdep name > > > > Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in > > process_one_work(). It turns out that commit b196be89cdc14 forgot to > > change the lockdep_init_map() when it changed the @lock_name argument > > from a string to a format. > > > > Fixes: b196be89cdc14 ("workqueue: make alloc_workqueue() take printf fmt > > and args for name") > > Reported-by: Tommi Rantala > > Signed-off-by: Peter Zijlstra > > Applied to wq/for-3.14-fixes. Hi Tejun, Peter, I found following lockdep warning caused by this commit in next-tree: [5.251993] [ cut here ] [5.252019] WARNING: CPU: 0 PID: 221 at kernel/locking/lockdep.c:710 __lock_acquire+0x1761/0x1f60() [5.252019] Modules linked in: e1000 [5.252019] CPU: 0 PID: 221 Comm: lvm Not tainted 3.14.0-rc2-next-20140212 #1 [5.252019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [5.252019] 0009 880118e91938 8155fe12 880118e91978 [5.252019] 8105c195 880118e91958 81eb33d0 0002 [5.252019] 880118dd2318 880118e91988 [5.252019] Call Trace: [5.252019] [] dump_stack+0x19/0x1b [5.252019] [] warn_slowpath_common+0x85/0xb0 [5.252019] [] warn_slowpath_null+0x1a/0x20 [5.252019] [] __lock_acquire+0x1761/0x1f60 [5.252019] [] ? mark_held_locks+0xae/0x120 [5.252019] [] ? debug_check_no_locks_freed+0x8e/0x160 [5.252019] [] ? lockdep_init_map+0xac/0x600 [5.252019] [] lock_acquire+0x9a/0x120 [5.252019] [] ? flush_workqueue+0x5/0x750 [5.252019] [] flush_workqueue+0x109/0x750 [5.252019] [] ? flush_workqueue+0x5/0x750 [5.252019] [] ? _raw_spin_unlock_irq+0x30/0x40 [5.252019] [] ? srcu_reschedule+0xe0/0xf0 [5.252019] [] dm_suspend+0xe9/0x1e0 [5.252019] [] dev_suspend+0x1e3/0x270 [5.252019] [] ? table_load+0x350/0x350 [5.252019] [] ctl_ioctl+0x26c/0x510 [5.252019] [] ? __lock_acquire+0x41c/0x1f60 [5.252019] [] ? vtime_account_user+0x98/0xb0 [5.252019] [] dm_ctl_ioctl+0x13/0x20 [5.252019] [] do_vfs_ioctl+0x88/0x570 [5.252019] [] ? __fget_light+0x129/0x150 [5.252019] [] SyS_ioctl+0x91/0xb0 [5.252019] [] tracesys+0xcf/0xd4 [5.252019] ---[ end trace ff1fa506f34be3bc ]--- It seems to me that when the second time alloc_workqueue() is called from the same code path, it would have two locks with the same key, but not the same >name, which doesn't meet lockdep's assumption. Maybe we could use the #fmt plus #args as the lock_name to avoid the above issue? It could also give some more information for some string like %s#5. Draft code attached below. I removed __builtin_constant_p() check (I don't know how to check all the args). However, by removing the check, it only adds two additional two "s for those constants. Some lockdep name examples I printed out after the change: lockdep namewq->name "events_long" events_long "%s"("khelper") khelper "xfs-data/%s"mp->m_fsname xfs-data/dm-3 ... A little bit ugly, not sure whether we could generate some better names here? Thanks, Zhong --- diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 594521b..704f4f6 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active, static struct lock_class_key __key; \ const char *__lock_name;\ \ - if (__builtin_constant_p(fmt)) \ - __lock_name = (fmt);\ - else\ - __lock_name = #fmt; \ + __lock_name = #fmt#args;\ \ __alloc_workqueue_key((fmt), (flags), (max_active), \ &__key, __lock_name, ##args); \ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 861d8dd..82ef9f3 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4202,7 +4202,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt, INIT_LIST_HEAD(>flusher_overflow); INIT_LIST_HEAD(>maydays); - lockdep_init_map(>lockdep_map, wq->name, key, 0); + lockdep_init_map(>lockdep_map, lock_name, key, 0); INIT_LIST_HEAD(>list);
Re: lockdep: strange %s#5 lock name
On Tue, 2014-02-11 at 10:27 -0500, Tejun Heo wrote: On Tue, Feb 11, 2014 at 12:00:36PM +0100, Peter Zijlstra wrote: Looks good to me. Can you please post the patch with SOB? --- Subject: workqueue: Fix workqueue lockdep name Tommi noticed a 'funny' lock class name: %s#5 from a lock acquired in process_one_work(). It turns out that commit b196be89cdc14 forgot to change the lockdep_init_map() when it changed the @lock_name argument from a string to a format. Fixes: b196be89cdc14 (workqueue: make alloc_workqueue() take printf fmt and args for name) Reported-by: Tommi Rantala tt.rant...@gmail.com Signed-off-by: Peter Zijlstra pet...@infradead.org Applied to wq/for-3.14-fixes. Hi Tejun, Peter, I found following lockdep warning caused by this commit in next-tree: [5.251993] [ cut here ] [5.252019] WARNING: CPU: 0 PID: 221 at kernel/locking/lockdep.c:710 __lock_acquire+0x1761/0x1f60() [5.252019] Modules linked in: e1000 [5.252019] CPU: 0 PID: 221 Comm: lvm Not tainted 3.14.0-rc2-next-20140212 #1 [5.252019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [5.252019] 0009 880118e91938 8155fe12 880118e91978 [5.252019] 8105c195 880118e91958 81eb33d0 0002 [5.252019] 880118dd2318 880118e91988 [5.252019] Call Trace: [5.252019] [8155fe12] dump_stack+0x19/0x1b [5.252019] [8105c195] warn_slowpath_common+0x85/0xb0 [5.252019] [8105c1da] warn_slowpath_null+0x1a/0x20 [5.252019] [810a1721] __lock_acquire+0x1761/0x1f60 [5.252019] [8109ec2e] ? mark_held_locks+0xae/0x120 [5.252019] [8109ef4e] ? debug_check_no_locks_freed+0x8e/0x160 [5.252019] [810a264c] ? lockdep_init_map+0xac/0x600 [5.252019] [810a251a] lock_acquire+0x9a/0x120 [5.252019] [810793f5] ? flush_workqueue+0x5/0x750 [5.252019] [810794f9] flush_workqueue+0x109/0x750 [5.252019] [810793f5] ? flush_workqueue+0x5/0x750 [5.252019] [81566890] ? _raw_spin_unlock_irq+0x30/0x40 [5.252019] [810b7720] ? srcu_reschedule+0xe0/0xf0 [5.252019] [81405889] dm_suspend+0xe9/0x1e0 [5.252019] [8140a853] dev_suspend+0x1e3/0x270 [5.252019] [8140a670] ? table_load+0x350/0x350 [5.252019] [8140b40c] ctl_ioctl+0x26c/0x510 [5.252019] [810a03dc] ? __lock_acquire+0x41c/0x1f60 [5.252019] [810923d8] ? vtime_account_user+0x98/0xb0 [5.252019] [8140b6c3] dm_ctl_ioctl+0x13/0x20 [5.252019] [811986c8] do_vfs_ioctl+0x88/0x570 [5.252019] [811a5579] ? __fget_light+0x129/0x150 [5.252019] [81198c41] SyS_ioctl+0x91/0xb0 [5.252019] [8157049d] tracesys+0xcf/0xd4 [5.252019] ---[ end trace ff1fa506f34be3bc ]--- It seems to me that when the second time alloc_workqueue() is called from the same code path, it would have two locks with the same key, but not the same wq-name, which doesn't meet lockdep's assumption. Maybe we could use the #fmt plus #args as the lock_name to avoid the above issue? It could also give some more information for some string like %s#5. Draft code attached below. I removed __builtin_constant_p() check (I don't know how to check all the args). However, by removing the check, it only adds two additional two s for those constants. Some lockdep name examples I printed out after the change: lockdep namewq-name events_long events_long %s(khelper) khelper xfs-data/%smp-m_fsname xfs-data/dm-3 ... A little bit ugly, not sure whether we could generate some better names here? Thanks, Zhong --- diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 594521b..704f4f6 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active, static struct lock_class_key __key; \ const char *__lock_name;\ \ - if (__builtin_constant_p(fmt)) \ - __lock_name = (fmt);\ - else\ - __lock_name = #fmt; \ + __lock_name = #fmt#args;\ \ __alloc_workqueue_key((fmt), (flags), (max_active), \ __key, __lock_name, ##args); \ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 861d8dd..82ef9f3
Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
On Thu, 2013-08-22 at 16:30 +0930, Rusty Russell wrote: > Greg KH writes: > > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: > > > struct kobj_type module_ktype = { > > > + .release =module_kobj_release, > > > .sysfs_ops =_sysfs_ops, > > > }; > > > > Wait, as there is no release function here for the kobject (a different > > problem), why is the deferred release function causing any problems? > > There is no release function to call, so what is causing the oops? > > Because DEBUG_KOBJECT_RELEASE does the kobject_put() sometime later, > which is what causes the oops. > > Since kobjects don't have an owner field, AFAICT someone *could* grab > one in a module which is unloading, then put it after unload. So this > fixes a real bug, albeit not one seen in the real world. > > Applied, Oh, thank you, Rusty. I just sent out another version... which fix it in another way as Greg suggested, could you please also help to take a look at it? Thanks, Zhong > Rusty. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 next]module: Fix mod->mkobj.kobj potentially freed too early
DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod->mkobj.kobj(a01600d0 below) is freed together with mod itself by module_free(mod, mod->module_core) in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child('holders' below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. v2: Greg suggested to make the kobject as a pointer. But it seems a little hard to make kobj(kobject) embedded in mkobj(module_kobject) a pointer, as that seems to cause getting the mkobj from the kobj impossible -- to_module_kobject() is used in several places to derive mkobj from its member kobj. So in this version, I made the whole mkobj(module_kobject) as a pointer in mod(module). The mkobj is then allocated in mod_sysfs_init(), and freed in its member kobj's release function -- it seems that there is no mkobj usage after its kobj is released? [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent a01600d0 (delayed) [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent a01600d0 (delayed) [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent a01600d0 [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at a01601d0 [ 1845.185026] IP: [] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 88007ca5c000 [ 1845.185026] RIP: 0010:[] [] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638 [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: a01600d0 [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040 [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: [ 1845.185026] FS: () GS:81a23000() knlGS: [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0 [ 1845.185026] Stack: [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 812cdb7e [ 1845.185026] 88007c1f1640 88007ca5dd68 812cdbfe [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 [ 1845.185026] Call Trace: [ 1845.185026] [] kobject_del+0x2e/0x40 [ 1845.185026] [] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [] process_one_work+0x1e5/0x670 [ 1845.185026] [] ? process_one_work+0x183/0x670 [ 1845.185026] [] worker_thread+0x113/0x370 [ 1845.185026] [] ? rescuer_thread+0x290/0x290 [ 1845.185026] [] kthread+0xda/0xe0 [ 1845.185026] [] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [] ret_from_fork+0x7a/0xb0 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [] kobject_put+0x11/0x60 [ 1845.185026] RSP [ 1845.185026] CR2: a01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong --- drivers/base/core.c| 2 +- drivers/base/module.c | 4 ++-- include/linux/module.h | 2 +- kernel/module.c| 37 + kernel/params.c| 19 +-- 5 files changed, 38 insertions(+), 26 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 09a99d6..b8a1fc6 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1629,7 +1629,7 @@ struct device *__root_device_register(const char *name, struct module *owner) #ifdef CONFIG_MODULES /* gotta find a "cleaner" way to do this */ if (owner) { - struct module_kobject *mk = >mkobj; + struct module_kobject *mk = owner->mkobj; err = sysfs_create_link(>dev.kobj, >kobj, "module"); if (err) { diff --git a/drivers/base/module.c b/drivers/base/module.c index db930d3..7b7bd5a 100644 --- a/dr
[RFC PATCH v2 next]module: Fix mod-mkobj.kobj potentially freed too early
DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod-mkobj.kobj(a01600d0 below) is freed together with mod itself by module_free(mod, mod-module_core) in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child('holders' below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. v2: Greg suggested to make the kobject as a pointer. But it seems a little hard to make kobj(kobject) embedded in mkobj(module_kobject) a pointer, as that seems to cause getting the mkobj from the kobj impossible -- to_module_kobject() is used in several places to derive mkobj from its member kobj. So in this version, I made the whole mkobj(module_kobject) as a pointer in mod(module). The mkobj is then allocated in mod_sysfs_init(), and freed in its member kobj's release function -- it seems that there is no mkobj usage after its kobj is released? [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent a01600d0 (delayed) [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent a01600d0 (delayed) [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent a01600d0 [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at a01601d0 [ 1845.185026] IP: [812cda81] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 88007ca5c000 [ 1845.185026] RIP: 0010:[812cda81] [812cda81] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638 [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: a01600d0 [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040 [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: [ 1845.185026] FS: () GS:81a23000() knlGS: [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0 [ 1845.185026] Stack: [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 812cdb7e [ 1845.185026] 88007c1f1640 88007ca5dd68 812cdbfe [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 [ 1845.185026] Call Trace: [ 1845.185026] [812cdb7e] kobject_del+0x2e/0x40 [ 1845.185026] [812cdbfe] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [81063a45] process_one_work+0x1e5/0x670 [ 1845.185026] [810639e3] ? process_one_work+0x183/0x670 [ 1845.185026] [810642b3] worker_thread+0x113/0x370 [ 1845.185026] [810641a0] ? rescuer_thread+0x290/0x290 [ 1845.185026] [8106bfba] kthread+0xda/0xe0 [ 1845.185026] [814ff0f0] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [8106bee0] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [8150751a] ret_from_fork+0x7a/0xb0 [ 1845.185026] [8106bee0] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d f6 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [812cda81] kobject_put+0x11/0x60 [ 1845.185026] RSP 88007ca5dd08 [ 1845.185026] CR2: a01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- drivers/base/core.c| 2 +- drivers/base/module.c | 4 ++-- include/linux/module.h | 2 +- kernel/module.c| 37 + kernel/params.c| 19 +-- 5 files changed, 38 insertions(+), 26 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 09a99d6..b8a1fc6 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1629,7 +1629,7 @@ struct device *__root_device_register(const char *name, struct module *owner) #ifdef CONFIG_MODULES /* gotta find a cleaner way to do this */ if (owner) { - struct module_kobject *mk = owner-mkobj
Re: [RFC PATCH next]module: Fix mod-mkobj.kobj potentially freed too early
On Thu, 2013-08-22 at 16:30 +0930, Rusty Russell wrote: Greg KH gre...@linuxfoundation.org writes: On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: struct kobj_type module_ktype = { + .release =module_kobj_release, .sysfs_ops =module_sysfs_ops, }; Wait, as there is no release function here for the kobject (a different problem), why is the deferred release function causing any problems? There is no release function to call, so what is causing the oops? Because DEBUG_KOBJECT_RELEASE does the kobject_put() sometime later, which is what causes the oops. Since kobjects don't have an owner field, AFAICT someone *could* grab one in a module which is unloading, then put it after unload. So this fixes a real bug, albeit not one seen in the real world. Applied, Oh, thank you, Rusty. I just sent out another version... which fix it in another way as Greg suggested, could you please also help to take a look at it? Thanks, Zhong Rusty. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
On Wed, 2013-08-21 at 21:03 -0700, Greg KH wrote: > On Thu, Aug 22, 2013 at 10:34:06AM +0800, Li Zhong wrote: > > On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote: > > > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: > > > > DEBUG_KOBJECT_RELEASE helps to find the issue attached below. > > > > > > > > After some investigation, it seems the reason is: > > > > The mod->mkobj.kobj(a01600d0 below) is freed together with mod > > > > itself in free_module(). However, its children still hold references to > > > > it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the > > > > child(holders below) tries to decrease the reference count to its parent > > > > in kobject_del(), BUG happens as it tries to access already freed > > > > memory. > > > > > > Ah, thanks for tracking this down. I had seen this in my local testing, > > > but wasn't able to figure out the offending code. > > > > > > > This patch tries to fix it by waiting for the mod->mkobj.kobj to be > > > > really released in the module removing process (and some error code > > > > paths). > > > > > > Nasty, we should just be freeing the structure in the release function, > > > why doesn't that work? > > > > Hi Greg, > > > > It seems I didn't describe it clearly in the previous mail. I'm trying > > to do it better below: > > > > mod->mkobj.kobj is embedded in module_kobject(not a pointer): > > struct module_kobject { > > struct kobject kobj; > > ... > > > > and allocated with the module memory. So we could see the parent below > > a01600d0 is between MODULES_VADDR (a000) and > > MODULES_END(ff00). > > > > It seem to me that the mkobj.kobj is freed by module_free(mod, > > mod->module_core). > > Ick, you are right. If a kobject is being embedded in an object, it > should control the lifespan of the object, not somewhere else like is > happening here. > > The best solution for this is to make the kobject a pointer, not > embedded in the structure, that will fix this issue, right? Yes, I think so. I'll try to write a fix using this way, thanks for your suggestion. Thanks, Zhong > > thanks, > > greg k-h > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote: > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: > > DEBUG_KOBJECT_RELEASE helps to find the issue attached below. > > > > After some investigation, it seems the reason is: > > The mod->mkobj.kobj(a01600d0 below) is freed together with mod > > itself in free_module(). However, its children still hold references to > > it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the > > child(holders below) tries to decrease the reference count to its parent > > in kobject_del(), BUG happens as it tries to access already freed memory. > > Ah, thanks for tracking this down. I had seen this in my local testing, > but wasn't able to figure out the offending code. > > > This patch tries to fix it by waiting for the mod->mkobj.kobj to be > > really released in the module removing process (and some error code > > paths). > > Nasty, we should just be freeing the structure in the release function, > why doesn't that work? Hi Greg, It seems I didn't describe it clearly in the previous mail. I'm trying to do it better below: mod->mkobj.kobj is embedded in module_kobject(not a pointer): struct module_kobject { struct kobject kobj; ... and allocated with the module memory. So we could see the parent below a01600d0 is between MODULES_VADDR (a000) and MODULES_END(ff00). It seem to me that the mkobj.kobj is freed by module_free(mod, mod->module_core). So in free_module(), we need delay the call of module_free(), until mkobj.kobj finishes its cleanup. > > [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, > > parent a01600d0 (delayed) > > [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent > > a01600d0 (delayed) > > [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, > > parent a01600d0 > > [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup > > kobject_del > > [ 1845.184120] BUG: unable to handle kernel paging request at > > a01601d0 > > [ 1845.185026] IP: [] kobject_put+0x11/0x60 > > [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 > > [ 1845.185026] Oops: [#1] PREEMPT > > [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: > > kprobe_example] > > [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O > > 3.11.0-rc6-next-20130819+ #1 > > [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 > > [ 1845.185026] Workqueue: events kobject_delayed_cleanup > > [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: > > 88007ca5c000 > > [ 1845.185026] RIP: 0010:[] [] > > kobject_put+0x11/0x60 > > [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 > > [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: > > 8177d638 > > [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: > > a01600d0 > > [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: > > > > [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: > > 81a95040 > > [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: > > > > [ 1845.185026] FS: () GS:81a23000() > > knlGS: > > [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b > > [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: > > 06b0 > > [ 1845.185026] Stack: > > [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 > > 812cdb7e > > [ 1845.185026] 88007c1f1640 88007ca5dd68 > > 812cdbfe > > [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 > > > > [ 1845.185026] Call Trace: > > [ 1845.185026] [] kobject_del+0x2e/0x40 > > [ 1845.185026] [] kobject_delayed_cleanup+0x6e/0x1d0 > > [ 1845.185026] [] process_one_work+0x1e5/0x670 > > [ 1845.185026] [] ? process_one_work+0x183/0x670 > > [ 1845.185026] [] worker_thread+0x113/0x370 > > [ 1845.185026] [] ? rescuer_thread+0x290/0x290 > > [ 1845.185026] [] kthread+0xda/0xe0 > > [ 1845.185026] [] ? _raw_spin_unlock_irq+0x30/0x60 > > [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 > > [ 1845.185026] [] ret_from_fork+0x7a/0xb0 > > [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 > > [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff > > ff ff 66 0f 1f 44 00 00 55 48
[RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod->mkobj.kobj(a01600d0 below) is freed together with mod itself in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child(holders below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. This patch tries to fix it by waiting for the mod->mkobj.kobj to be really released in the module removing process (and some error code paths). [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent a01600d0 (delayed) [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent a01600d0 (delayed) [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent a01600d0 [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at a01601d0 [ 1845.185026] IP: [] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 88007ca5c000 [ 1845.185026] RIP: 0010:[] [] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638 [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: a01600d0 [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040 [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: [ 1845.185026] FS: () GS:81a23000() knlGS: [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0 [ 1845.185026] Stack: [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 812cdb7e [ 1845.185026] 88007c1f1640 88007ca5dd68 812cdbfe [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 [ 1845.185026] Call Trace: [ 1845.185026] [] kobject_del+0x2e/0x40 [ 1845.185026] [] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [] process_one_work+0x1e5/0x670 [ 1845.185026] [] ? process_one_work+0x183/0x670 [ 1845.185026] [] worker_thread+0x113/0x370 [ 1845.185026] [] ? rescuer_thread+0x290/0x290 [ 1845.185026] [] kthread+0xda/0xe0 [ 1845.185026] [] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [] ret_from_fork+0x7a/0xb0 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [] kobject_put+0x11/0x60 [ 1845.185026] RSP [ 1845.185026] CR2: a01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong --- include/linux/module.h | 1 + kernel/module.c| 14 +++--- kernel/params.c| 7 +++ 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index 504035f..05f2447 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -42,6 +42,7 @@ struct module_kobject { struct module *mod; struct kobject *drivers_dir; struct module_param_attrs *mp; + struct completion *kobj_completion; }; struct module_attribute { diff --git a/kernel/module.c b/kernel/module.c index 2069158..b5e2baa 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1611,6 +1611,14 @@ static void module_remove_modinfo_attrs(struct module *mod) kfree(mod->modinfo_attrs); } +static void mod_kobject_put(struct module *mod) +{ + DECLARE_COMPLETION_ONSTACK(c); + mod->mkobj.kobj_completion = + kobject_put(>mkobj.kobj); + wait_for_completion(); +} + static int mod_sysfs_init(struct module *mod) { int err; @@ -1638,7 +1646,7 @@ static int mod_sysfs_init(struct module *mod) err = kobject_init_and_add(>mkobj.kobj, _ktype, NULL, "%s", mod->name); if (err) - kobject_put(>mkobj.kobj); + mod_kobject_put(mod); /* delay uevent until full sy
[RFC PATCH next]module: Fix mod-mkobj.kobj potentially freed too early
DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod-mkobj.kobj(a01600d0 below) is freed together with mod itself in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child(holders below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. This patch tries to fix it by waiting for the mod-mkobj.kobj to be really released in the module removing process (and some error code paths). [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent a01600d0 (delayed) [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent a01600d0 (delayed) [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent a01600d0 [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at a01601d0 [ 1845.185026] IP: [812cda81] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 88007ca5c000 [ 1845.185026] RIP: 0010:[812cda81] [812cda81] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638 [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: a01600d0 [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040 [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: [ 1845.185026] FS: () GS:81a23000() knlGS: [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0 [ 1845.185026] Stack: [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 812cdb7e [ 1845.185026] 88007c1f1640 88007ca5dd68 812cdbfe [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 [ 1845.185026] Call Trace: [ 1845.185026] [812cdb7e] kobject_del+0x2e/0x40 [ 1845.185026] [812cdbfe] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [81063a45] process_one_work+0x1e5/0x670 [ 1845.185026] [810639e3] ? process_one_work+0x183/0x670 [ 1845.185026] [810642b3] worker_thread+0x113/0x370 [ 1845.185026] [810641a0] ? rescuer_thread+0x290/0x290 [ 1845.185026] [8106bfba] kthread+0xda/0xe0 [ 1845.185026] [814ff0f0] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [8106bee0] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [8150751a] ret_from_fork+0x7a/0xb0 [ 1845.185026] [8106bee0] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d f6 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [812cda81] kobject_put+0x11/0x60 [ 1845.185026] RSP 88007ca5dd08 [ 1845.185026] CR2: a01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- include/linux/module.h | 1 + kernel/module.c| 14 +++--- kernel/params.c| 7 +++ 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index 504035f..05f2447 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -42,6 +42,7 @@ struct module_kobject { struct module *mod; struct kobject *drivers_dir; struct module_param_attrs *mp; + struct completion *kobj_completion; }; struct module_attribute { diff --git a/kernel/module.c b/kernel/module.c index 2069158..b5e2baa 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1611,6 +1611,14 @@ static void module_remove_modinfo_attrs(struct module *mod) kfree(mod-modinfo_attrs); } +static void mod_kobject_put(struct module *mod) +{ + DECLARE_COMPLETION_ONSTACK(c); + mod-mkobj.kobj_completion = c; + kobject_put(mod-mkobj.kobj); + wait_for_completion(c); +} + static int mod_sysfs_init(struct module *mod) { int err; @@ -1638,7 +1646,7 @@ static int mod_sysfs_init(struct module *mod
Re: [RFC PATCH next]module: Fix mod-mkobj.kobj potentially freed too early
On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote: On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod-mkobj.kobj(a01600d0 below) is freed together with mod itself in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child(holders below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. Ah, thanks for tracking this down. I had seen this in my local testing, but wasn't able to figure out the offending code. This patch tries to fix it by waiting for the mod-mkobj.kobj to be really released in the module removing process (and some error code paths). Nasty, we should just be freeing the structure in the release function, why doesn't that work? Hi Greg, It seems I didn't describe it clearly in the previous mail. I'm trying to do it better below: mod-mkobj.kobj is embedded in module_kobject(not a pointer): struct module_kobject { struct kobject kobj; ... and allocated with the module memory. So we could see the parent below a01600d0 is between MODULES_VADDR (a000) and MODULES_END(ff00). It seem to me that the mkobj.kobj is freed by module_free(mod, mod-module_core). So in free_module(), we need delay the call of module_free(), until mkobj.kobj finishes its cleanup. [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent a01600d0 (delayed) [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent a01600d0 (delayed) [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent a01600d0 [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at a01601d0 [ 1845.185026] IP: [812cda81] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 88007ca5c000 [ 1845.185026] RIP: 0010:[812cda81] [812cda81] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638 [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: a01600d0 [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040 [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: [ 1845.185026] FS: () GS:81a23000() knlGS: [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0 [ 1845.185026] Stack: [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 812cdb7e [ 1845.185026] 88007c1f1640 88007ca5dd68 812cdbfe [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 [ 1845.185026] Call Trace: [ 1845.185026] [812cdb7e] kobject_del+0x2e/0x40 [ 1845.185026] [812cdbfe] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [81063a45] process_one_work+0x1e5/0x670 [ 1845.185026] [810639e3] ? process_one_work+0x183/0x670 [ 1845.185026] [810642b3] worker_thread+0x113/0x370 [ 1845.185026] [810641a0] ? rescuer_thread+0x290/0x290 [ 1845.185026] [8106bfba] kthread+0xda/0xe0 [ 1845.185026] [814ff0f0] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [8106bee0] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [8150751a] ret_from_fork+0x7a/0xb0 [ 1845.185026] [8106bee0] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d f6 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [812cda81] kobject_put+0x11/0x60 [ 1845.185026] RSP 88007ca5dd08 [ 1845.185026] CR2: a01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com --- include/linux/module.h
Re: [RFC PATCH next]module: Fix mod-mkobj.kobj potentially freed too early
On Wed, 2013-08-21 at 21:03 -0700, Greg KH wrote: On Thu, Aug 22, 2013 at 10:34:06AM +0800, Li Zhong wrote: On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote: On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod-mkobj.kobj(a01600d0 below) is freed together with mod itself in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child(holders below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. Ah, thanks for tracking this down. I had seen this in my local testing, but wasn't able to figure out the offending code. This patch tries to fix it by waiting for the mod-mkobj.kobj to be really released in the module removing process (and some error code paths). Nasty, we should just be freeing the structure in the release function, why doesn't that work? Hi Greg, It seems I didn't describe it clearly in the previous mail. I'm trying to do it better below: mod-mkobj.kobj is embedded in module_kobject(not a pointer): struct module_kobject { struct kobject kobj; ... and allocated with the module memory. So we could see the parent below a01600d0 is between MODULES_VADDR (a000) and MODULES_END(ff00). It seem to me that the mkobj.kobj is freed by module_free(mod, mod-module_core). Ick, you are right. If a kobject is being embedded in an object, it should control the lifespan of the object, not somewhere else like is happening here. The best solution for this is to make the kobject a pointer, not embedded in the structure, that will fix this issue, right? Yes, I think so. I'll try to write a fix using this way, thanks for your suggestion. Thanks, Zhong thanks, greg k-h -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/