Re: svn commit: r277199 - in head/sys: fs/devfs kern
Hi Konstantin, On 01/18/15 12:31, Konstantin Belousov wrote: On Sun, Jan 18, 2015 at 12:08:26PM +0100, Hans Petter Selasky wrote: Hi Konstantin, On 01/18/15 11:54, Konstantin Belousov wrote: On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote: Hi Konstantin, What about events for devd that a new character devices is present in the system or being destroyed for user-space applications? It is up to the driver to decide. If wanted, it could post the pair of destroy/create itself (e.g., if there is some state which require the userspace to reset), or it could decide not to. The later is reasonable if the destroy/create is believed to be caused by a glitch rather than the actual replug. I would really prefer if that path is chosen, that kern_conf.c exposes an API for this. Consider the following case: You have a daemon that is started and stopped based upon devd DEVFS create/destroy events. The daemon is started when the create event is received. Then the kernel side calls destroy_dev(), because the application still keeps the file handle opened, the destroy_dev() never sends the destroy event, and the application then never receives the destroy event --- another example of a totally invisible deadlock. I find it very strange that you need sleepless operation which can _both_ destroy and create cdev. At least one operation of creation or My simple argument against sleepable operation inside devfs functions which are used to register and unregister a character devices is simply that it leads to the fact we cannot make an atomic shutdown of a kernel application which involves devfs and other kernel subsystems. Let me explain. It is not enough that devfs works alone, it needs to work with together with other kernel APIs aswell. Let's make up an example. A kernel module is using three different APIs, lets say USB, the callout subsystem and devfs. Devfs is calling USB to start and stop USB transfers. The callout subsystem is calling selwakeup() and waking up devfs. The USB subsystem is calling selwakeup() and waking up devfs aswell. The USB subsystem is also calling destroy_dev(). This is a very high-level architectural thought and I hope you get it: How can you tear down a kernel application using three different subsystems in the kernel _atomically_ without having lingering callbacks? To make the shutdown sequence atomic, we might want to use a mutex. Now iff destroy_dev() is sleepable we need to drop that mutex when destroying the character device, and then woops - the shutdown sequence is no longer atomic. That's the root of the problem, and please think beyond devfs alone. The problem happens when devfs is used together with other APIs and applications in the kernel. destruction must sleep, at least in the current devfs design. It could be made sleepless, by making VFS visible part even less connected to the cdev part, but this is not how it (can) currently work. I think it would be good practice that all resources needed for creating a character device can be allocated prior to registration. That means we first should allocate all resources needed, but not register as a single function. For example: Once make_dev() has returned, we can get an d_open callback. But si_drv1/2 is always set by drivers _after_ that make_dev() has returned! This causes a race we could simply avoid by splitting the make_dev() like I suggest. Instead of putting more logic and code inside the drivers to handle the race! I wanted to tunnel the si_drv values through the make_dev(). I.e. there could appear yet another variant of make_dev() which takes the initialization parameters for the si_* driver vars. I just did not have time/motivation to do the pass. I might be interested in giving you a hand there, but not right now. This would also close a window where /dev node is non-existent or operate erronously. I'm not saying I plan so, but I think cdevs at some point need to understand mutexes and locks. That means like with other API's in the kernel we can associate a lock with the cdev, and this lock is then used to ensure an atomic shutdown of the system in a non-blocking fashion. In my past experience multithreaded APIs should be high level implemented like this: NON-BLOCKING methods: lock(); ** xxx_start(); xxx_stop(); unlock(); BLOCKING methods: setup(); // init unsetup(); // drain Any callbacks should always be called locked ** In devfs there was no non-blocking stop before I added the delist_dev() function. You do not understand my point. I object against imposing one additional global reference on all cdevs just to cope with the delist hack. See the patch at the end of the message. It's fine by me. WRT destroy_dev_sched_cb() calling delist_dev(), even after calling delist_dev(), the node still exists in the /dev. It is only removed after populate loop is run sometime later. dev_sched() KPI is inheritly racy, drivers must handle the
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote: Hi Konstantin, On 01/16/15 09:03, Konstantin Belousov wrote: My opinion is that you should have tried to handle the issue at the driver level, instead of making this devfs issue. I.e., if you already have cdev node with the correct name, driver should have replaced the private data to point to new device. I think this way cannot be implemented in a clean way, because of locking order reversal. And if you try to avoid the LOR you end up with the race. Chess mate sort of ;-) Let me explain: Thread 1: usb_sx_lock(); cdev = Look in freelist for existing device(); else cdev = make_dev(); usb_sx_unlock(); Thread 2: usb_sx_lock(); put cdev on freelist usb_sx_unlock(); Thread 3: usb_sx_lock(); cdev = remove first entry in freelist usb_sx_unlock(); /* * XXX because USB needs to call destroy_dev() unlocked we * are racing with Thread 1 again */ destroy_dev(cdev); I do not understand the 'freelist' part. You have some usb (or whatever other kind) device, which is represented by some data structure. This data structure references cdev. In reverse, cdev points to this data structure by si_drv1 or similar driver-owned member of struct cdev. The structures naming usb devices have some mapping to/from usb bus addresses. When device is destroyed or created due to external plug event, there is some mechanism that ensures mapping consistence. It could be lock, it could be single-threaded process which discover the bus, or something else, I do not know. While the structure notes that device with some address goes out and comes in, it can look up the corresponding cdev and just change the si_drv1 pointer to point to the new device. I find it very strange that you need sleepless operation which can _both_ destroy and create cdev. At least one operation of creation or destruction must sleep, at least in the current devfs design. It could be made sleepless, by making VFS visible part even less connected to the cdev part, but this is not how it (can) currently work. This would also close a window where /dev node is non-existent or operate erronously. I'm not saying I plan so, but I think cdevs at some point need to understand mutexes and locks. That means like with other API's in the kernel we can associate a lock with the cdev, and this lock is then used to ensure an atomic shutdown of the system in a non-blocking fashion. In my past experience multithreaded APIs should be high level implemented like this: NON-BLOCKING methods: lock(); ** xxx_start(); xxx_stop(); unlock(); BLOCKING methods: setup(); // init unsetup(); // drain Any callbacks should always be called locked ** In devfs there was no non-blocking stop before I added the delist_dev() function. You do not understand my point. I object against imposing one additional global reference on all cdevs just to cope with the delist hack. See the patch at the end of the message. It's fine by me. WRT destroy_dev_sched_cb() calling delist_dev(), even after calling delist_dev(), the node still exists in the /dev. It is only removed after populate loop is run sometime later. dev_sched() KPI is inheritly racy, drivers must handle the races for other reasons. The populate loop is all running under the dev_lock() from what I can see and make_dev() is also keeping the same lock when inserting and removing new cdevs. The populate loop should always give a consistent No, populate loop is not run under dev_mtx. It would not be able to allocate memory, even in M_NOWAIT fashion. The dev_mtx is after the vm and vnode locks. Only iteration over the cdevp_list is protected by dev_mtx, which is dropped right after something which require an action, is found on the list. Populate() needs some way to avoid reading inconsistent data from the cdev layer, but there is not guarantee that we are always up to date. view of the character devices available, and I don't see how cdev structures without the CDP_ACTIVE flag can appear with recently created ones, even if the name is the same. It cannot simply because it is not synchronized with the device creation/destruction. Entry can be random, since after the populate loop is ran, your code in other thread could start and create duplicate entry. There is a window in the lookup where both directory vnode lock and mount point sx locks are dropped. So running the populate does not guarantee anything. If there is such a race, it is already there! My patch changes nothing in that area: Thread1: Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes waiting for some refs for xxx milliseconds. Thread2: Tries to create create a new character device having the same name like the one in thread1. Device name duplication check is missed because CDP_ACTIVE is cleared. Still thread1 is waiting.
Re: svn commit: r277199 - in head/sys: fs/devfs kern
Hi Konstantin, On 01/18/15 11:54, Konstantin Belousov wrote: On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote: Hi Konstantin, On 01/16/15 09:03, Konstantin Belousov wrote: My opinion is that you should have tried to handle the issue at the driver level, instead of making this devfs issue. I.e., if you already have cdev node with the correct name, driver should have replaced the private data to point to new device. I think this way cannot be implemented in a clean way, because of locking order reversal. And if you try to avoid the LOR you end up with the race. Chess mate sort of ;-) Let me explain: Thread 1: usb_sx_lock(); cdev = Look in freelist for existing device(); else cdev = make_dev(); usb_sx_unlock(); Thread 2: usb_sx_lock(); put cdev on freelist usb_sx_unlock(); Thread 3: usb_sx_lock(); cdev = remove first entry in freelist usb_sx_unlock(); /* * XXX because USB needs to call destroy_dev() unlocked we * are racing with Thread 1 again */ destroy_dev(cdev); I do not understand the 'freelist' part. You have some usb (or whatever other kind) device, which is represented by some data structure. This data structure references cdev. In reverse, cdev points to this data structure by si_drv1 or similar driver-owned member of struct cdev. The structures naming usb devices have some mapping to/from usb bus addresses. When device is destroyed or created due to external plug event, there is some mechanism that ensures mapping consistence. It could be lock, it could be single-threaded process which discover the bus, or something else, I do not know. While the structure notes that device with some address goes out and comes in, it can look up the corresponding cdev and just change the si_drv1 pointer to point to the new device. What about events for devd that a new character devices is present in the system or being destroyed for user-space applications? I find it very strange that you need sleepless operation which can _both_ destroy and create cdev. At least one operation of creation or destruction must sleep, at least in the current devfs design. It could be made sleepless, by making VFS visible part even less connected to the cdev part, but this is not how it (can) currently work. I think it would be good practice that all resources needed for creating a character device can be allocated prior to registration. That means we first should allocate all resources needed, but not register as a single function. For example: Once make_dev() has returned, we can get an d_open callback. But si_drv1/2 is always set by drivers _after_ that make_dev() has returned! This causes a race we could simply avoid by splitting the make_dev() like I suggest. Instead of putting more logic and code inside the drivers to handle the race! This would also close a window where /dev node is non-existent or operate erronously. I'm not saying I plan so, but I think cdevs at some point need to understand mutexes and locks. That means like with other API's in the kernel we can associate a lock with the cdev, and this lock is then used to ensure an atomic shutdown of the system in a non-blocking fashion. In my past experience multithreaded APIs should be high level implemented like this: NON-BLOCKING methods: lock(); ** xxx_start(); xxx_stop(); unlock(); BLOCKING methods: setup(); // init unsetup(); // drain Any callbacks should always be called locked ** In devfs there was no non-blocking stop before I added the delist_dev() function. You do not understand my point. I object against imposing one additional global reference on all cdevs just to cope with the delist hack. See the patch at the end of the message. It's fine by me. WRT destroy_dev_sched_cb() calling delist_dev(), even after calling delist_dev(), the node still exists in the /dev. It is only removed after populate loop is run sometime later. dev_sched() KPI is inheritly racy, drivers must handle the races for other reasons. The populate loop is all running under the dev_lock() from what I can see and make_dev() is also keeping the same lock when inserting and removing new cdevs. The populate loop should always give a consistent No, populate loop is not run under dev_mtx. Are you sure: static int devfs_populate_loop(struct devfs_mount *dm, int cleanup) { struct cdev_priv *cdp; struct devfs_dirent *de; struct devfs_dirent *dd, *dt; struct cdev *pdev; int de_flags, depth, j; char *q, *s; sx_assert(dm-dm_lock, SX_XLOCKED); dev_lock(); what is this ? TAILQ_FOREACH(cdp, cdevp_list, cdp_list) { It would not be able to allocate memory, even in M_NOWAIT fashion. The dev_mtx is after the vm and vnode locks. Only iteration over the cdevp_list is protected by dev_mtx, which is dropped right after something which require an action, is found on the list. Populate() needs some way to avoid reading
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On Sun, Jan 18, 2015 at 12:08:26PM +0100, Hans Petter Selasky wrote: Hi Konstantin, On 01/18/15 11:54, Konstantin Belousov wrote: On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote: Hi Konstantin, On 01/16/15 09:03, Konstantin Belousov wrote: My opinion is that you should have tried to handle the issue at the driver level, instead of making this devfs issue. I.e., if you already have cdev node with the correct name, driver should have replaced the private data to point to new device. I think this way cannot be implemented in a clean way, because of locking order reversal. And if you try to avoid the LOR you end up with the race. Chess mate sort of ;-) Let me explain: Thread 1: usb_sx_lock(); cdev = Look in freelist for existing device(); else cdev = make_dev(); usb_sx_unlock(); Thread 2: usb_sx_lock(); put cdev on freelist usb_sx_unlock(); Thread 3: usb_sx_lock(); cdev = remove first entry in freelist usb_sx_unlock(); /* * XXX because USB needs to call destroy_dev() unlocked we * are racing with Thread 1 again */ destroy_dev(cdev); I do not understand the 'freelist' part. You have some usb (or whatever other kind) device, which is represented by some data structure. This data structure references cdev. In reverse, cdev points to this data structure by si_drv1 or similar driver-owned member of struct cdev. The structures naming usb devices have some mapping to/from usb bus addresses. When device is destroyed or created due to external plug event, there is some mechanism that ensures mapping consistence. It could be lock, it could be single-threaded process which discover the bus, or something else, I do not know. While the structure notes that device with some address goes out and comes in, it can look up the corresponding cdev and just change the si_drv1 pointer to point to the new device. What about events for devd that a new character devices is present in the system or being destroyed for user-space applications? It is up to the driver to decide. If wanted, it could post the pair of destroy/create itself (e.g., if there is some state which require the userspace to reset), or it could decide not to. The later is reasonable if the destroy/create is believed to be caused by a glitch rather than the actual replug. I find it very strange that you need sleepless operation which can _both_ destroy and create cdev. At least one operation of creation or destruction must sleep, at least in the current devfs design. It could be made sleepless, by making VFS visible part even less connected to the cdev part, but this is not how it (can) currently work. I think it would be good practice that all resources needed for creating a character device can be allocated prior to registration. That means we first should allocate all resources needed, but not register as a single function. For example: Once make_dev() has returned, we can get an d_open callback. But si_drv1/2 is always set by drivers _after_ that make_dev() has returned! This causes a race we could simply avoid by splitting the make_dev() like I suggest. Instead of putting more logic and code inside the drivers to handle the race! I wanted to tunnel the si_drv values through the make_dev(). I.e. there could appear yet another variant of make_dev() which takes the initialization parameters for the si_* driver vars. I just did not have time/motivation to do the pass. This would also close a window where /dev node is non-existent or operate erronously. I'm not saying I plan so, but I think cdevs at some point need to understand mutexes and locks. That means like with other API's in the kernel we can associate a lock with the cdev, and this lock is then used to ensure an atomic shutdown of the system in a non-blocking fashion. In my past experience multithreaded APIs should be high level implemented like this: NON-BLOCKING methods: lock(); ** xxx_start(); xxx_stop(); unlock(); BLOCKING methods: setup(); // init unsetup(); // drain Any callbacks should always be called locked ** In devfs there was no non-blocking stop before I added the delist_dev() function. You do not understand my point. I object against imposing one additional global reference on all cdevs just to cope with the delist hack. See the patch at the end of the message. It's fine by me. WRT destroy_dev_sched_cb() calling delist_dev(), even after calling delist_dev(), the node still exists in the /dev. It is only removed after populate loop is run sometime later. dev_sched() KPI is inheritly racy, drivers must handle the races for other reasons. The populate loop is all running under the dev_lock() from what I can see and make_dev() is also keeping the same lock when inserting and removing new cdevs. The populate loop
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On Thu, Jan 15, 2015 at 01:14:39PM +0100, Hans Petter Selasky wrote: On 01/15/15 12:51, Konstantin Belousov wrote: On Thu, Jan 15, 2015 at 11:49:09AM +0100, Hans Petter Selasky wrote: I see no leakage in that case either! Because these cases come through destroy_dev(). Are there more cases which I don't see? You are breaking existig devfs KPI by your hack. You introduce yet another reference on the device, which is not supposed to be there. Hi Konstantin, I need a non-sleeping way to say a character device is no longer supposed to be used and be able to re-use the device name right away creating a new device. I guess you got that. Yes, I got it. The devfs design is not suitable for this, and your hack is the good witness of the fact. My opinion is that you should have tried to handle the issue at the driver level, instead of making this devfs issue. I.e., if you already have cdev node with the correct name, driver should have replaced the private data to point to new device. This would also close a window where /dev node is non-existent or operate erronously. If some code calls delist_dev(), it could be said that it is a contract of the new function that destroy_dev() must be called eventually on the cdev. Then, the reference could be said to be shared-owned by delist_dev() and destroy_dev(). But, for arbitrary devfs user this new reference is unacceptable and breaks interface. delist_dev() changes no references. It can be called multiple times even, also inside destroy_devl(). Also I think that the destroy_dev_sched_cbl() function should call delist_dev() first so that we don't have a time from when the destroy_dev_sched_cbl() function is called where the device entry still exists in devfs mounts until the final destroy_devl() is done by a taskqueue. You do not understand my point. I object against imposing one additional global reference on all cdevs just to cope with the delist hack. See the patch at the end of the message. WRT destroy_dev_sched_cb() calling delist_dev(), even after calling delist_dev(), the node still exists in the /dev. It is only removed after populate loop is run sometime later. dev_sched() KPI is inheritly racy, drivers must handle the races for other reasons. In the case of direct free through #1, the reference count is ignored and it doesn't matter if it is one or zero. Only in the case of destruction through destroy_dev() it matters. Like the comment says in destroy_devl(): /* Avoid race with dev_rel() */ The problem is that the cdev-si_refcount is zero when the initial devfs_create() is called. Then one ref is made. When we clear the CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running process to destroy all the FS related structures and the reference count goes back to zero when the cdp is removed from the cdevp_list. Then the cdev is freed too early. This happens because destroy_devl() is dropping the dev_lock() to sleep waiting for pending references. Basically, this is very good explanation why your delist hack is wrong, for one of the reason. Another reason is explained below. You are trying to cover it with additional reference, but this is wrong as well. Do you see something else? I think that what you are trying to do with the CDP_ACTIVE hack is doomed anyway, because you are allowing for devfs directory to have two entries with the same name, until the populate loop cleans up the inactive one. In the meantime, any access to the directory operates on random entry. The entry will not be random, because upon an open() call to a character device, I believe the devfs_lookup() function will be called, which always populate the devfs tree at first by calls to devfs_populate_xxx(). Any delisted devices which don't have the CDP_ACTIVE bit set, will never be seen by any open. Entry can be random, since after the populate loop is ran, your code in other thread could start and create duplicate entry. There is a window in the lookup where both directory vnode lock and mount point sx locks are dropped. So running the populate does not guarantee anything. If there is such a race, it is already there! My patch changes nothing in that area: Thread1: Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes waiting for some refs for xxx milliseconds. Thread2: Tries to create create a new character device having the same name like the one in thread1. Device name duplication check is missed because CDP_ACTIVE is cleared. Still thread1 is waiting. Thread3: Tries to open character device created by thread2 while thread1 is still waiting for some ref held by a userspace app to go away. This can happen already before my patches! What do you think? Possibly. Regarding leftover filedescriptors which still access the old cdev this is not a problem, and these will be closed when the
Re: svn commit: r277199 - in head/sys: fs/devfs kern
Hi Konstantin, On 01/16/15 09:03, Konstantin Belousov wrote: On Thu, Jan 15, 2015 at 01:14:39PM +0100, Hans Petter Selasky wrote: On 01/15/15 12:51, Konstantin Belousov wrote: On Thu, Jan 15, 2015 at 11:49:09AM +0100, Hans Petter Selasky wrote: I see no leakage in that case either! Because these cases come through destroy_dev(). Are there more cases which I don't see? You are breaking existig devfs KPI by your hack. You introduce yet another reference on the device, which is not supposed to be there. Hi Konstantin, I need a non-sleeping way to say a character device is no longer supposed to be used and be able to re-use the device name right away creating a new device. I guess you got that. Yes, I got it. The devfs design is not suitable for this, and your hack is the good witness of the fact. My opinion is that you should have tried to handle the issue at the driver level, instead of making this devfs issue. I.e., if you already have cdev node with the correct name, driver should have replaced the private data to point to new device. I think this way cannot be implemented in a clean way, because of locking order reversal. And if you try to avoid the LOR you end up with the race. Chess mate sort of ;-) Let me explain: Thread 1: usb_sx_lock(); cdev = Look in freelist for existing device(); else cdev = make_dev(); usb_sx_unlock(); Thread 2: usb_sx_lock(); put cdev on freelist usb_sx_unlock(); Thread 3: usb_sx_lock(); cdev = remove first entry in freelist usb_sx_unlock(); /* * XXX because USB needs to call destroy_dev() unlocked we * are racing with Thread 1 again */ destroy_dev(cdev); This would also close a window where /dev node is non-existent or operate erronously. I'm not saying I plan so, but I think cdevs at some point need to understand mutexes and locks. That means like with other API's in the kernel we can associate a lock with the cdev, and this lock is then used to ensure an atomic shutdown of the system in a non-blocking fashion. In my past experience multithreaded APIs should be high level implemented like this: NON-BLOCKING methods: lock(); ** xxx_start(); xxx_stop(); unlock(); BLOCKING methods: setup(); // init unsetup(); // drain Any callbacks should always be called locked ** In devfs there was no non-blocking stop before I added the delist_dev() function. You do not understand my point. I object against imposing one additional global reference on all cdevs just to cope with the delist hack. See the patch at the end of the message. It's fine by me. WRT destroy_dev_sched_cb() calling delist_dev(), even after calling delist_dev(), the node still exists in the /dev. It is only removed after populate loop is run sometime later. dev_sched() KPI is inheritly racy, drivers must handle the races for other reasons. The populate loop is all running under the dev_lock() from what I can see and make_dev() is also keeping the same lock when inserting and removing new cdevs. The populate loop should always give a consistent view of the character devices available, and I don't see how cdev structures without the CDP_ACTIVE flag can appear with recently created ones, even if the name is the same. Entry can be random, since after the populate loop is ran, your code in other thread could start and create duplicate entry. There is a window in the lookup where both directory vnode lock and mount point sx locks are dropped. So running the populate does not guarantee anything. If there is such a race, it is already there! My patch changes nothing in that area: Thread1: Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes waiting for some refs for xxx milliseconds. Thread2: Tries to create create a new character device having the same name like the one in thread1. Device name duplication check is missed because CDP_ACTIVE is cleared. Still thread1 is waiting. Thread3: Tries to open character device created by thread2 while thread1 is still waiting for some ref held by a userspace app to go away. This can happen already before my patches! What do you think? Possibly. At what level do you mean duplicate names, I don't get this fully? At the directory level (DE nodes)? Or inside the list of character devices (LIST_XXX)? It does not matter, dup at either one directory level, or dup of full names in the global list are equivalent (bad) things. Like I write above I don't see where the problem is. At the cdev level, we are protecting the cdev's LIST with dev_lock() and only one entry will exist having CDP_ACTIVE bit set per unique cdev name and path. Else we will hit a panic in make_dev() and friends. In the directory entry level the populate loop will also ensure a consistent view, and hence the cdev's LIST is consistent, the view presented to userspace will also be consistent. That system functions can still call into the dangling read/write/ioctl functions is another story, and that is why I tell,
Re: svn commit: r277199 - in head/sys: fs/devfs kern
Hi, On 01/15/15 04:31, Konstantin Belousov wrote: Please do not commit (to devfs) without seeking for the review first. I was a bit stuck this time either having the choice to back out a USB patch fixing no problems or to fix the kernel devfs to support what I needed. Actually devfs is not listed in src/MAINTAINERS. I understand a review can be good in such critical areas in advance as a general practice and I'm currently doing that for another kern/ patch which is much bigger than this one. --HPS ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On 01/15/15 10:38, Konstantin Belousov wrote: On Thu, Jan 15, 2015 at 08:41:31AM +0100, Hans Petter Selasky wrote: On 01/15/15 04:31, Konstantin Belousov wrote: On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote: Author: hselasky Date: Wed Jan 14 22:07:13 2015 New Revision: 277199 URL: https://svnweb.freebsd.org/changeset/base/277199 Log: Avoid race with dev_rel() when using the recently added delist_dev() function. Make sure the character device structure doesn't go away until the end of the destroy_dev() function due to concurrently running cleanup code inside devfs_populate(). MFC after: 1 week Reported by:dchagin@ Modified: head/sys/fs/devfs/devfs_devs.c head/sys/kern/kern_conf.c Modified: head/sys/fs/devfs/devfs_devs.c == --- head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:05:57 2015 (r277198) +++ head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:07:13 2015 (r277199) @@ -137,6 +137,12 @@ devfs_alloc(int flags) vfs_timestamp(ts); cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts; cdev-si_cred = NULL; + /* +* Avoid race with dev_rel() by setting the initial +* reference count to 1. This last reference is taken +* by the destroy_dev() function. +*/ + cdev-si_refcount = 1; This is wrong. Not all devices are destroyed with destroy_dev(). dev_rel() must be allowed to clean up allocated device. That said, I do not understand what race you are trying to solve. Freeing of the accessible cdev memory cannot happen in parallel while dev_mtx is owned. Please do not commit (to devfs) without seeking for the review first. Hi Konstantin, From my analysis there are basically three ways for a cdev to die: 1) Through dev_free_devlocked() 2) Through destroy_devl() which then later calls dev_free_devlocked() 3) Through destroy_dev_sched() which really is a wrapper around destroy_devl(). You only look from the consumers PoV. Devfs cdev can be dereferenced because e.g. clone handler decides that cdev is not valid/needed, and now the memory is never freed due to extra reference. Do not assume that all cdevs go through destroy_dev(). Hi, All cdevs go through either case #2 or case #1 eventually from what I can see, including clone devices, which call destroy_devl() in the end aswell. See the clone_destroy() function! I did a simple test with /dev/dspX.Y which use clone devices. I did: vmstat -m | grep -i devfs1 1) Before plugging USB audio device: DEVFS1 15779K - 189 512 2) Plug USB audio device: DEVFS1 16482K - 196 512 3) Play something (env AUDIODEV=/dev/dsp2.4 play track01.wav) DEVFS1 16583K - 197 512 4) Stop playing (clone device still exits): DEVFS1 16583K - 197 512 5) Detach USB audio device: DEVFS1 15779K - 197 512 I see no leakage in that case! Other case: 1) After kldload if_tap DEVFS1 15879K - 201 512 2) After creating TAP device (cat /dev/tap99) DEVFS1 15980K - 204 512 3) After creating TAP device (cat /dev/tap101) DEVFS1 16080K - 207 512 5) After kldunload if_tap: DEVFS1 15879K - 207 512 6) After kldload if_tap again: DEVFS1 15879K - 207 512 I see no leakage in that case either! Are there more cases which I don't see? In the case of direct free through #1, the reference count is ignored and it doesn't matter if it is one or zero. Only in the case of destruction through destroy_dev() it matters. Like the comment says in destroy_devl(): /* Avoid race with dev_rel() */ The problem is that the cdev-si_refcount is zero when the initial devfs_create() is called. Then one ref is made. When we clear the CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running process to destroy all the FS related structures and the reference count goes back to zero when the cdp is removed from the cdevp_list. Then the cdev is freed too early. This happens because destroy_devl() is dropping the dev_lock() to sleep waiting for pending references. Basically, this is very good explanation why your delist hack is wrong, for one of the reason. Another reason is explained below. You are trying to cover it with additional reference, but this is wrong as well. Do you see something else? I think that what you are trying to do with the CDP_ACTIVE hack is doomed anyway, because you are allowing for devfs directory to have two entries with the same name, until the populate loop cleans up the inactive one. In the meantime, any access to the directory operates on random entry. The entry will not be random, because upon an open() call to a character device, I believe the devfs_lookup() function will be called,
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On Thu, Jan 15, 2015 at 08:41:31AM +0100, Hans Petter Selasky wrote: On 01/15/15 04:31, Konstantin Belousov wrote: On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote: Author: hselasky Date: Wed Jan 14 22:07:13 2015 New Revision: 277199 URL: https://svnweb.freebsd.org/changeset/base/277199 Log: Avoid race with dev_rel() when using the recently added delist_dev() function. Make sure the character device structure doesn't go away until the end of the destroy_dev() function due to concurrently running cleanup code inside devfs_populate(). MFC after: 1 week Reported by:dchagin@ Modified: head/sys/fs/devfs/devfs_devs.c head/sys/kern/kern_conf.c Modified: head/sys/fs/devfs/devfs_devs.c == --- head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:05:57 2015 (r277198) +++ head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:07:13 2015 (r277199) @@ -137,6 +137,12 @@ devfs_alloc(int flags) vfs_timestamp(ts); cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts; cdev-si_cred = NULL; + /* + * Avoid race with dev_rel() by setting the initial + * reference count to 1. This last reference is taken + * by the destroy_dev() function. + */ + cdev-si_refcount = 1; This is wrong. Not all devices are destroyed with destroy_dev(). dev_rel() must be allowed to clean up allocated device. That said, I do not understand what race you are trying to solve. Freeing of the accessible cdev memory cannot happen in parallel while dev_mtx is owned. Please do not commit (to devfs) without seeking for the review first. Hi Konstantin, From my analysis there are basically three ways for a cdev to die: 1) Through dev_free_devlocked() 2) Through destroy_devl() which then later calls dev_free_devlocked() 3) Through destroy_dev_sched() which really is a wrapper around destroy_devl(). You only look from the consumers PoV. Devfs cdev can be dereferenced because e.g. clone handler decides that cdev is not valid/needed, and now the memory is never freed due to extra reference. Do not assume that all cdevs go through destroy_dev(). In the case of direct free through #1, the reference count is ignored and it doesn't matter if it is one or zero. Only in the case of destruction through destroy_dev() it matters. Like the comment says in destroy_devl(): /* Avoid race with dev_rel() */ The problem is that the cdev-si_refcount is zero when the initial devfs_create() is called. Then one ref is made. When we clear the CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running process to destroy all the FS related structures and the reference count goes back to zero when the cdp is removed from the cdevp_list. Then the cdev is freed too early. This happens because destroy_devl() is dropping the dev_lock() to sleep waiting for pending references. Basically, this is very good explanation why your delist hack is wrong, for one of the reason. Another reason is explained below. You are trying to cover it with additional reference, but this is wrong as well. Do you see something else? I think that what you are trying to do with the CDP_ACTIVE hack is doomed anyway, because you are allowing for devfs directory to have two entries with the same name, until the populate loop cleans up the inactive one. In the meantime, any access to the directory operates on random entry. The checks for existent names in make_dev() are performed for the reason, and you makes the rounds to effectively ignore it. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On Thu, Jan 15, 2015 at 11:49:09AM +0100, Hans Petter Selasky wrote: On 01/15/15 10:38, Konstantin Belousov wrote: On Thu, Jan 15, 2015 at 08:41:31AM +0100, Hans Petter Selasky wrote: On 01/15/15 04:31, Konstantin Belousov wrote: On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote: Author: hselasky Date: Wed Jan 14 22:07:13 2015 New Revision: 277199 URL: https://svnweb.freebsd.org/changeset/base/277199 Log: Avoid race with dev_rel() when using the recently added delist_dev() function. Make sure the character device structure doesn't go away until the end of the destroy_dev() function due to concurrently running cleanup code inside devfs_populate(). MFC after: 1 week Reported by: dchagin@ Modified: head/sys/fs/devfs/devfs_devs.c head/sys/kern/kern_conf.c Modified: head/sys/fs/devfs/devfs_devs.c == --- head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:05:57 2015 (r277198) +++ head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:07:13 2015 (r277199) @@ -137,6 +137,12 @@ devfs_alloc(int flags) vfs_timestamp(ts); cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts; cdev-si_cred = NULL; +/* + * Avoid race with dev_rel() by setting the initial + * reference count to 1. This last reference is taken + * by the destroy_dev() function. + */ +cdev-si_refcount = 1; This is wrong. Not all devices are destroyed with destroy_dev(). dev_rel() must be allowed to clean up allocated device. That said, I do not understand what race you are trying to solve. Freeing of the accessible cdev memory cannot happen in parallel while dev_mtx is owned. Please do not commit (to devfs) without seeking for the review first. Hi Konstantin, From my analysis there are basically three ways for a cdev to die: 1) Through dev_free_devlocked() 2) Through destroy_devl() which then later calls dev_free_devlocked() 3) Through destroy_dev_sched() which really is a wrapper around destroy_devl(). You only look from the consumers PoV. Devfs cdev can be dereferenced because e.g. clone handler decides that cdev is not valid/needed, and now the memory is never freed due to extra reference. Do not assume that all cdevs go through destroy_dev(). Hi, All cdevs go through either case #2 or case #1 eventually from what I can see, including clone devices, which call destroy_devl() in the end aswell. See the clone_destroy() function! I did a simple test with /dev/dspX.Y which use clone devices. I did: vmstat -m | grep -i devfs1 1) Before plugging USB audio device: DEVFS1 15779K - 189 512 2) Plug USB audio device: DEVFS1 16482K - 196 512 3) Play something (env AUDIODEV=/dev/dsp2.4 play track01.wav) DEVFS1 16583K - 197 512 4) Stop playing (clone device still exits): DEVFS1 16583K - 197 512 5) Detach USB audio device: DEVFS1 15779K - 197 512 I see no leakage in that case! Other case: 1) After kldload if_tap DEVFS1 15879K - 201 512 2) After creating TAP device (cat /dev/tap99) DEVFS1 15980K - 204 512 3) After creating TAP device (cat /dev/tap101) DEVFS1 16080K - 207 512 5) After kldunload if_tap: DEVFS1 15879K - 207 512 6) After kldload if_tap again: DEVFS1 15879K - 207 512 I see no leakage in that case either! Because these cases come through destroy_dev(). Are there more cases which I don't see? You are breaking existig devfs KPI by your hack. You introduce yet another reference on the device, which is not supposed to be there. If some code calls delist_dev(), it could be said that it is a contract of the new function that destroy_dev() must be called eventually on the cdev. Then, the reference could be said to be shared-owned by delist_dev() and destroy_dev(). But, for arbitrary devfs user this new reference is unacceptable and breaks interface. In the case of direct free through #1, the reference count is ignored and it doesn't matter if it is one or zero. Only in the case of destruction through destroy_dev() it matters. Like the comment says in destroy_devl(): /* Avoid race with dev_rel() */ The problem is that the cdev-si_refcount is zero when the initial devfs_create() is called. Then one ref is made. When we clear the CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running process to destroy all the FS related structures and the reference count goes back to zero when the cdp is removed from the cdevp_list. Then the cdev is freed too
svn commit: r277199 - in head/sys: fs/devfs kern
Author: hselasky Date: Wed Jan 14 22:07:13 2015 New Revision: 277199 URL: https://svnweb.freebsd.org/changeset/base/277199 Log: Avoid race with dev_rel() when using the recently added delist_dev() function. Make sure the character device structure doesn't go away until the end of the destroy_dev() function due to concurrently running cleanup code inside devfs_populate(). MFC after:1 week Reported by: dchagin@ Modified: head/sys/fs/devfs/devfs_devs.c head/sys/kern/kern_conf.c Modified: head/sys/fs/devfs/devfs_devs.c == --- head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:05:57 2015 (r277198) +++ head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:07:13 2015 (r277199) @@ -137,6 +137,12 @@ devfs_alloc(int flags) vfs_timestamp(ts); cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts; cdev-si_cred = NULL; + /* +* Avoid race with dev_rel() by setting the initial +* reference count to 1. This last reference is taken +* by the destroy_dev() function. +*/ + cdev-si_refcount = 1; return (cdev); } Modified: head/sys/kern/kern_conf.c == --- head/sys/kern/kern_conf.c Wed Jan 14 22:05:57 2015(r277198) +++ head/sys/kern/kern_conf.c Wed Jan 14 22:07:13 2015(r277199) @@ -1048,8 +1048,6 @@ destroy_devl(struct cdev *dev) /* Remove name marking */ dev-si_flags = ~SI_NAMED; - dev-si_refcount++; /* Avoid race with dev_rel() */ - /* If we are a child, remove us from the parents list */ if (dev-si_flags SI_CHILD) { LIST_REMOVE(dev, si_siblings); ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote: Author: hselasky Date: Wed Jan 14 22:07:13 2015 New Revision: 277199 URL: https://svnweb.freebsd.org/changeset/base/277199 Log: Avoid race with dev_rel() when using the recently added delist_dev() function. Make sure the character device structure doesn't go away until the end of the destroy_dev() function due to concurrently running cleanup code inside devfs_populate(). MFC after: 1 week Reported by:dchagin@ Modified: head/sys/fs/devfs/devfs_devs.c head/sys/kern/kern_conf.c Modified: head/sys/fs/devfs/devfs_devs.c == --- head/sys/fs/devfs/devfs_devs.cWed Jan 14 22:05:57 2015 (r277198) +++ head/sys/fs/devfs/devfs_devs.cWed Jan 14 22:07:13 2015 (r277199) @@ -137,6 +137,12 @@ devfs_alloc(int flags) vfs_timestamp(ts); cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts; cdev-si_cred = NULL; + /* + * Avoid race with dev_rel() by setting the initial + * reference count to 1. This last reference is taken + * by the destroy_dev() function. + */ + cdev-si_refcount = 1; This is wrong. Not all devices are destroyed with destroy_dev(). dev_rel() must be allowed to clean up allocated device. That said, I do not understand what race you are trying to solve. Freeing of the accessible cdev memory cannot happen in parallel while dev_mtx is owned. Please do not commit (to devfs) without seeking for the review first. return (cdev); } Modified: head/sys/kern/kern_conf.c == --- head/sys/kern/kern_conf.c Wed Jan 14 22:05:57 2015(r277198) +++ head/sys/kern/kern_conf.c Wed Jan 14 22:07:13 2015(r277199) @@ -1048,8 +1048,6 @@ destroy_devl(struct cdev *dev) /* Remove name marking */ dev-si_flags = ~SI_NAMED; - dev-si_refcount++; /* Avoid race with dev_rel() */ - /* If we are a child, remove us from the parents list */ if (dev-si_flags SI_CHILD) { LIST_REMOVE(dev, si_siblings); ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r277199 - in head/sys: fs/devfs kern
On 01/15/15 04:31, Konstantin Belousov wrote: On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote: Author: hselasky Date: Wed Jan 14 22:07:13 2015 New Revision: 277199 URL: https://svnweb.freebsd.org/changeset/base/277199 Log: Avoid race with dev_rel() when using the recently added delist_dev() function. Make sure the character device structure doesn't go away until the end of the destroy_dev() function due to concurrently running cleanup code inside devfs_populate(). MFC after: 1 week Reported by: dchagin@ Modified: head/sys/fs/devfs/devfs_devs.c head/sys/kern/kern_conf.c Modified: head/sys/fs/devfs/devfs_devs.c == --- head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:05:57 2015 (r277198) +++ head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:07:13 2015 (r277199) @@ -137,6 +137,12 @@ devfs_alloc(int flags) vfs_timestamp(ts); cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts; cdev-si_cred = NULL; + /* +* Avoid race with dev_rel() by setting the initial +* reference count to 1. This last reference is taken +* by the destroy_dev() function. +*/ + cdev-si_refcount = 1; This is wrong. Not all devices are destroyed with destroy_dev(). dev_rel() must be allowed to clean up allocated device. That said, I do not understand what race you are trying to solve. Freeing of the accessible cdev memory cannot happen in parallel while dev_mtx is owned. Please do not commit (to devfs) without seeking for the review first. Hi Konstantin, From my analysis there are basically three ways for a cdev to die: 1) Through dev_free_devlocked() 2) Through destroy_devl() which then later calls dev_free_devlocked() 3) Through destroy_dev_sched() which really is a wrapper around destroy_devl(). In the case of direct free through #1, the reference count is ignored and it doesn't matter if it is one or zero. Only in the case of destruction through destroy_dev() it matters. Like the comment says in destroy_devl(): /* Avoid race with dev_rel() */ The problem is that the cdev-si_refcount is zero when the initial devfs_create() is called. Then one ref is made. When we clear the CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running process to destroy all the FS related structures and the reference count goes back to zero when the cdp is removed from the cdevp_list. Then the cdev is freed too early. This happens because destroy_devl() is dropping the dev_lock() to sleep waiting for pending references. Do you see something else? --HPS ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org