Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-18 Thread Hans Petter Selasky

Hi Konstantin,

On 01/18/15 12:31, Konstantin Belousov wrote:

On Sun, Jan 18, 2015 at 12:08:26PM +0100, Hans Petter Selasky wrote:

Hi Konstantin,

On 01/18/15 11:54, Konstantin Belousov wrote:

On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote:

Hi Konstantin,




What about events for devd that a new character devices is present in
the system or being destroyed for user-space applications?

It is up to the driver to decide.  If wanted, it could post the pair of
destroy/create itself (e.g., if there is some state which require
the userspace to reset), or it could decide not to.  The later is
reasonable if the destroy/create is believed to be caused by a glitch
rather than the actual replug.


I would really prefer if that path is chosen, that kern_conf.c exposes 
an API for this. Consider the following case:


You have a daemon that is started and stopped based upon devd DEVFS 
create/destroy events. The daemon is started when the create event is 
received. Then the kernel side calls destroy_dev(), because the 
application still keeps the file handle opened, the destroy_dev() never 
sends the destroy event, and the application then never receives the 
destroy event --- another example of a totally invisible deadlock.




I find it very strange that you need sleepless operation which can
_both_ destroy and create cdev. At least one operation of creation or


My simple argument against sleepable operation inside devfs functions 
which are used to register and unregister a character devices is simply 
that it leads to the fact we cannot make an atomic shutdown of a kernel 
application which involves devfs and other kernel subsystems. Let me 
explain.


It is not enough that devfs works alone, it needs to work with together 
with other kernel APIs aswell. Let's make up an example. A kernel module 
is using three different APIs, lets say USB, the callout subsystem and 
devfs.


Devfs is calling USB to start and stop USB transfers.
The callout subsystem is calling selwakeup() and waking up devfs.
The USB subsystem is calling selwakeup() and waking up devfs aswell.
The USB subsystem is also calling destroy_dev().

This is a very high-level architectural thought and I hope you get it:

How can you tear down a kernel application using three different 
subsystems in the kernel _atomically_ without having lingering callbacks?


To make the shutdown sequence atomic, we might want to use a mutex. Now 
iff destroy_dev() is sleepable we need to drop that mutex when 
destroying the character device, and then woops - the shutdown sequence 
is no longer atomic. That's the root of the problem, and please think 
beyond devfs alone. The problem happens when devfs is used together with 
other APIs and applications in the kernel.



destruction must sleep, at least in the current devfs design. It could
be made sleepless, by making VFS visible part even less connected to the
cdev part, but this is not how it (can) currently work.


I think it would be good practice that all resources needed for creating
a character device can be allocated prior to registration. That means we
first should allocate all resources needed, but not register as a single
function.

For example:

Once make_dev() has returned, we can get an d_open callback. But
si_drv1/2 is always set by drivers _after_ that make_dev() has
returned! This causes a race we could simply avoid by splitting the
make_dev() like I suggest. Instead of putting more logic and code inside
the drivers to handle the race!

I wanted to tunnel the si_drv values through the make_dev().
I.e. there could appear yet another variant of make_dev() which takes
the initialization parameters for the si_* driver vars.

I just did not have time/motivation to do the pass.


I might be interested in giving you a hand there, but not right now.







This would also close a window where /dev node is non-existent or operate
erronously.


I'm not saying I plan so, but I think cdevs at some point need to
understand mutexes and locks. That means like with other API's in the
kernel we can associate a lock with the cdev, and this lock is then
used to ensure an atomic shutdown of the system in a non-blocking
fashion. In my past experience multithreaded APIs should be high level
implemented like this:

NON-BLOCKING methods:
lock(); **
xxx_start();
xxx_stop();
unlock();

BLOCKING methods:
setup(); // init
unsetup(); // drain

Any callbacks should always be called locked **

In devfs there was no non-blocking stop before I added the delist_dev()
function.



You do not understand my point.

I object against imposing one additional global reference on all cdevs
just to cope with the delist hack.  See the patch at the end of the message.


It's fine by me.


WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
delist_dev(), the node still exists in the /dev. It is only removed
after populate loop is run sometime later. dev_sched() KPI is inheritly
racy, drivers must handle the 

Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-18 Thread Konstantin Belousov
On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote:
 Hi Konstantin,
 
 On 01/16/15 09:03, Konstantin Belousov wrote:
  My opinion is that you should have tried to handle the issue at the driver
  level, instead of making this devfs issue.  I.e., if you already have
  cdev node with the correct name, driver should have replaced the private
  data to point to new device.
 
 I think this way cannot be implemented in a clean way, because of 
 locking order reversal. And if you try to avoid the LOR you end up with 
 the race. Chess mate sort of ;-)  Let me explain:
 
 Thread 1:
 usb_sx_lock();
cdev = Look in freelist for existing device();
 else
cdev = make_dev();
 usb_sx_unlock();
 
 Thread 2:
 usb_sx_lock();
 put cdev on freelist
 usb_sx_unlock();
 
 Thread 3:
 usb_sx_lock();
 cdev = remove first entry in freelist
 usb_sx_unlock();
 
 /*
   * XXX because USB needs to call destroy_dev() unlocked we
   * are racing with Thread 1 again
   */
 destroy_dev(cdev);
I do not understand the 'freelist' part.

You have some usb (or whatever other kind) device, which is represented by
some data structure.  This data structure references cdev.  In reverse,
cdev points to this data structure by si_drv1 or similar driver-owned
member of struct cdev.

The structures naming usb devices have some mapping to/from usb bus addresses.
When device is destroyed or created due to external plug event, there is
some mechanism that ensures mapping consistence.  It could be lock, it
could be single-threaded process which discover the bus, or something
else, I do not know.

While the structure notes that device with some address goes out and comes
in, it can look up the corresponding cdev and just change the si_drv1 pointer
to point to the new device.

I find it very strange that you need sleepless operation which can
_both_ destroy and create cdev. At least one operation of creation or
destruction must sleep, at least in the current devfs design. It could
be made sleepless, by making VFS visible part even less connected to the
cdev part, but this is not how it (can) currently work.

 
 
  This would also close a window where /dev node is non-existent or operate
  erronously.
 
 I'm not saying I plan so, but I think cdevs at some point need to 
 understand mutexes and locks. That means like with other API's in the 
 kernel we can associate a lock with the cdev, and this lock is then 
 used to ensure an atomic shutdown of the system in a non-blocking 
 fashion. In my past experience multithreaded APIs should be high level 
 implemented like this:
 
 NON-BLOCKING methods:
 lock(); **
 xxx_start();
 xxx_stop();
 unlock();
 
 BLOCKING methods:
 setup(); // init
 unsetup(); // drain
 
 Any callbacks should always be called locked **
 
 In devfs there was no non-blocking stop before I added the delist_dev() 
 function.
 
 
  You do not understand my point.
 
  I object against imposing one additional global reference on all cdevs
  just to cope with the delist hack.  See the patch at the end of the message.
 
 It's fine by me.
 
  WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
  delist_dev(), the node still exists in the /dev. It is only removed
  after populate loop is run sometime later. dev_sched() KPI is inheritly
  racy, drivers must handle the races for other reasons.
 
 The populate loop is all running under the dev_lock() from what I can 
 see and make_dev() is also keeping the same lock when inserting and 
 removing new cdevs. The populate loop should always give a consistent 
No, populate loop is not run under dev_mtx.  It would not be able to
allocate memory, even in M_NOWAIT fashion.  The dev_mtx is after the
vm and vnode locks.  Only iteration over the cdevp_list is protected by
dev_mtx, which is dropped right after something which require an
action, is found on the list. Populate() needs some way to avoid
reading inconsistent data from the cdev layer, but there is not
guarantee that we are always up to date.

 view of the character devices available, and I don't see how cdev 
 structures without the CDP_ACTIVE flag can appear with recently created 
 ones, even if the name is the same.
It cannot simply because it is not synchronized with the device
creation/destruction.

 
 
 
  Entry can be random, since after the populate loop is ran, your code in
  other thread could start and create duplicate entry. There is a window
  in the lookup where both directory vnode lock and mount point sx locks
  are dropped. So running the populate does not guarantee anything.
 
  If there is such a race, it is already there! My patch changes nothing
  in that area:
 
  Thread1:
  Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes
  waiting for some refs for xxx milliseconds.
  Thread2:
  Tries to create create a new character device having the same name like
  the one in thread1. Device name duplication check is missed because
  CDP_ACTIVE is cleared. Still thread1 is waiting.
  

Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-18 Thread Hans Petter Selasky

Hi Konstantin,

On 01/18/15 11:54, Konstantin Belousov wrote:

On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote:

Hi Konstantin,

On 01/16/15 09:03, Konstantin Belousov wrote:

My opinion is that you should have tried to handle the issue at the driver
level, instead of making this devfs issue.  I.e., if you already have
cdev node with the correct name, driver should have replaced the private
data to point to new device.


I think this way cannot be implemented in a clean way, because of
locking order reversal. And if you try to avoid the LOR you end up with
the race. Chess mate sort of ;-)  Let me explain:

Thread 1:
usb_sx_lock();
cdev = Look in freelist for existing device();
else
cdev = make_dev();
usb_sx_unlock();

Thread 2:
usb_sx_lock();
put cdev on freelist
usb_sx_unlock();

Thread 3:
usb_sx_lock();
cdev = remove first entry in freelist
usb_sx_unlock();

/*
   * XXX because USB needs to call destroy_dev() unlocked we
   * are racing with Thread 1 again
   */
destroy_dev(cdev);

I do not understand the 'freelist' part.

You have some usb (or whatever other kind) device, which is represented by
some data structure.  This data structure references cdev.  In reverse,
cdev points to this data structure by si_drv1 or similar driver-owned
member of struct cdev.

The structures naming usb devices have some mapping to/from usb bus addresses.
When device is destroyed or created due to external plug event, there is
some mechanism that ensures mapping consistence.  It could be lock, it
could be single-threaded process which discover the bus, or something
else, I do not know.

While the structure notes that device with some address goes out and comes
in, it can look up the corresponding cdev and just change the si_drv1 pointer
to point to the new device.


What about events for devd that a new character devices is present in 
the system or being destroyed for user-space applications?




I find it very strange that you need sleepless operation which can
_both_ destroy and create cdev. At least one operation of creation or
destruction must sleep, at least in the current devfs design. It could
be made sleepless, by making VFS visible part even less connected to the
cdev part, but this is not how it (can) currently work.


I think it would be good practice that all resources needed for creating 
a character device can be allocated prior to registration. That means we 
first should allocate all resources needed, but not register as a single 
function.


For example:

Once make_dev() has returned, we can get an d_open callback. But 
si_drv1/2 is always set by drivers _after_ that make_dev() has 
returned! This causes a race we could simply avoid by splitting the 
make_dev() like I suggest. Instead of putting more logic and code inside 
the drivers to handle the race!






This would also close a window where /dev node is non-existent or operate
erronously.


I'm not saying I plan so, but I think cdevs at some point need to
understand mutexes and locks. That means like with other API's in the
kernel we can associate a lock with the cdev, and this lock is then
used to ensure an atomic shutdown of the system in a non-blocking
fashion. In my past experience multithreaded APIs should be high level
implemented like this:

NON-BLOCKING methods:
lock(); **
xxx_start();
xxx_stop();
unlock();

BLOCKING methods:
setup(); // init
unsetup(); // drain

Any callbacks should always be called locked **

In devfs there was no non-blocking stop before I added the delist_dev()
function.



You do not understand my point.

I object against imposing one additional global reference on all cdevs
just to cope with the delist hack.  See the patch at the end of the message.


It's fine by me.


WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
delist_dev(), the node still exists in the /dev. It is only removed
after populate loop is run sometime later. dev_sched() KPI is inheritly
racy, drivers must handle the races for other reasons.


The populate loop is all running under the dev_lock() from what I can
see and make_dev() is also keeping the same lock when inserting and
removing new cdevs. The populate loop should always give a consistent

No, populate loop is not run under dev_mtx.


Are you sure:


static int
devfs_populate_loop(struct devfs_mount *dm, int cleanup)
{
struct cdev_priv *cdp;
struct devfs_dirent *de;
struct devfs_dirent *dd, *dt;
struct cdev *pdev;
int de_flags, depth, j;
char *q, *s;

sx_assert(dm-dm_lock, SX_XLOCKED);
dev_lock();

   what is this ?

TAILQ_FOREACH(cdp, cdevp_list, cdp_list) {



It would not be able to
allocate memory, even in M_NOWAIT fashion.  The dev_mtx is after the
vm and vnode locks.  Only iteration over the cdevp_list is protected by
dev_mtx, which is dropped right after something which require an
action, is found on the list. Populate() needs some way to avoid
reading 

Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-18 Thread Konstantin Belousov
On Sun, Jan 18, 2015 at 12:08:26PM +0100, Hans Petter Selasky wrote:
 Hi Konstantin,
 
 On 01/18/15 11:54, Konstantin Belousov wrote:
  On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote:
  Hi Konstantin,
 
  On 01/16/15 09:03, Konstantin Belousov wrote:
  My opinion is that you should have tried to handle the issue at the driver
  level, instead of making this devfs issue.  I.e., if you already have
  cdev node with the correct name, driver should have replaced the private
  data to point to new device.
 
  I think this way cannot be implemented in a clean way, because of
  locking order reversal. And if you try to avoid the LOR you end up with
  the race. Chess mate sort of ;-)  Let me explain:
 
  Thread 1:
  usb_sx_lock();
  cdev = Look in freelist for existing device();
  else
  cdev = make_dev();
  usb_sx_unlock();
 
  Thread 2:
  usb_sx_lock();
  put cdev on freelist
  usb_sx_unlock();
 
  Thread 3:
  usb_sx_lock();
  cdev = remove first entry in freelist
  usb_sx_unlock();
 
  /*
 * XXX because USB needs to call destroy_dev() unlocked we
 * are racing with Thread 1 again
 */
  destroy_dev(cdev);
  I do not understand the 'freelist' part.
 
  You have some usb (or whatever other kind) device, which is represented by
  some data structure.  This data structure references cdev.  In reverse,
  cdev points to this data structure by si_drv1 or similar driver-owned
  member of struct cdev.
 
  The structures naming usb devices have some mapping to/from usb bus 
  addresses.
  When device is destroyed or created due to external plug event, there is
  some mechanism that ensures mapping consistence.  It could be lock, it
  could be single-threaded process which discover the bus, or something
  else, I do not know.
 
  While the structure notes that device with some address goes out and comes
  in, it can look up the corresponding cdev and just change the si_drv1 
  pointer
  to point to the new device.
 
 What about events for devd that a new character devices is present in 
 the system or being destroyed for user-space applications?
It is up to the driver to decide.  If wanted, it could post the pair of
destroy/create itself (e.g., if there is some state which require
the userspace to reset), or it could decide not to.  The later is
reasonable if the destroy/create is believed to be caused by a glitch
rather than the actual replug.

 
 
  I find it very strange that you need sleepless operation which can
  _both_ destroy and create cdev. At least one operation of creation or
  destruction must sleep, at least in the current devfs design. It could
  be made sleepless, by making VFS visible part even less connected to the
  cdev part, but this is not how it (can) currently work.
 
 I think it would be good practice that all resources needed for creating 
 a character device can be allocated prior to registration. That means we 
 first should allocate all resources needed, but not register as a single 
 function.
 
 For example:
 
 Once make_dev() has returned, we can get an d_open callback. But 
 si_drv1/2 is always set by drivers _after_ that make_dev() has 
 returned! This causes a race we could simply avoid by splitting the 
 make_dev() like I suggest. Instead of putting more logic and code inside 
 the drivers to handle the race!
I wanted to tunnel the si_drv values through the make_dev().
I.e. there could appear yet another variant of make_dev() which takes
the initialization parameters for the si_* driver vars.

I just did not have time/motivation to do the pass.
 
 
 
  This would also close a window where /dev node is non-existent or operate
  erronously.
 
  I'm not saying I plan so, but I think cdevs at some point need to
  understand mutexes and locks. That means like with other API's in the
  kernel we can associate a lock with the cdev, and this lock is then
  used to ensure an atomic shutdown of the system in a non-blocking
  fashion. In my past experience multithreaded APIs should be high level
  implemented like this:
 
  NON-BLOCKING methods:
  lock(); **
  xxx_start();
  xxx_stop();
  unlock();
 
  BLOCKING methods:
  setup(); // init
  unsetup(); // drain
 
  Any callbacks should always be called locked **
 
  In devfs there was no non-blocking stop before I added the delist_dev()
  function.
 
 
  You do not understand my point.
 
  I object against imposing one additional global reference on all cdevs
  just to cope with the delist hack.  See the patch at the end of the 
  message.
 
  It's fine by me.
 
  WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
  delist_dev(), the node still exists in the /dev. It is only removed
  after populate loop is run sometime later. dev_sched() KPI is inheritly
  racy, drivers must handle the races for other reasons.
 
  The populate loop is all running under the dev_lock() from what I can
  see and make_dev() is also keeping the same lock when inserting and
  removing new cdevs. The populate loop 

Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-16 Thread Konstantin Belousov
On Thu, Jan 15, 2015 at 01:14:39PM +0100, Hans Petter Selasky wrote:
 On 01/15/15 12:51, Konstantin Belousov wrote:
  On Thu, Jan 15, 2015 at 11:49:09AM +0100, Hans Petter Selasky wrote:
 
  I see no leakage in that case either!
  Because these cases come through destroy_dev().
 
 
  Are there more cases which I don't see?
  You are breaking existig devfs KPI by your hack.  You introduce yet another
  reference on the device, which is not supposed to be there.
 
 Hi Konstantin,
 
 I need a non-sleeping way to say a character device is no longer 
 supposed to be used and be able to re-use the device name right away 
 creating a new device. I guess you got that.
Yes, I got it.  The devfs design is not suitable for this, and your
hack is the good witness of the fact.

My opinion is that you should have tried to handle the issue at the driver
level, instead of making this devfs issue.  I.e., if you already have
cdev node with the correct name, driver should have replaced the private
data to point to new device.

This would also close a window where /dev node is non-existent or operate
erronously.

 
 
  If some code calls delist_dev(), it could be said that it is a contract
  of the new function that destroy_dev() must be called eventually on
  the cdev. Then, the reference could be said to be shared-owned by
  delist_dev() and destroy_dev(). But, for arbitrary devfs user this new
  reference is unacceptable and breaks interface.
 
 delist_dev() changes no references. It can be called multiple times 
 even, also inside destroy_devl(). Also I think that the 
 destroy_dev_sched_cbl() function should call delist_dev() first so 
 that we don't have a time from when the destroy_dev_sched_cbl() 
 function is called where the device entry still exists in devfs mounts 
 until the final destroy_devl() is done by a taskqueue.
You do not understand my point.

I object against imposing one additional global reference on all cdevs
just to cope with the delist hack.  See the patch at the end of the message.

WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
delist_dev(), the node still exists in the /dev. It is only removed
after populate loop is run sometime later. dev_sched() KPI is inheritly
racy, drivers must handle the races for other reasons.

 
  In the case of direct free through #1, the reference count is ignored
  and it doesn't matter if it is one or zero. Only in the case of
  destruction through destroy_dev() it matters.
 
  Like the comment says in destroy_devl():
 
  /* Avoid race with dev_rel() */
 
  The problem is that the cdev-si_refcount is zero when the initial
  devfs_create() is called. Then one ref is made. When we clear the
  CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running
  process to destroy all the FS related structures and the reference count
  goes back to zero when the cdp is removed from the cdevp_list. Then
  the cdev is freed too early. This happens because destroy_devl() is
  dropping the dev_lock() to sleep waiting for pending references.
  Basically, this is very good explanation why your delist hack is wrong,
  for one of the reason.  Another reason is explained below.
  You are trying to cover it with additional reference, but this is wrong
  as well.
 
 
  Do you see something else?
 
  I think that what you are trying to do with the CDP_ACTIVE hack is doomed
  anyway, because you are allowing for devfs directory to have two entries
  with the same name, until the populate loop cleans up the inactive one.
  In the meantime, any access to the directory operates on random entry.
 
  The entry will not be random, because upon an open() call to a character
  device, I believe the devfs_lookup() function will be called, which
  always populate the devfs tree at first by calls to
  devfs_populate_xxx(). Any delisted devices which don't have the
  CDP_ACTIVE bit set, will never be seen by any open.
  Entry can be random, since after the populate loop is ran, your code in
  other thread could start and create duplicate entry. There is a window
  in the lookup where both directory vnode lock and mount point sx locks
  are dropped. So running the populate does not guarantee anything.
 
 If there is such a race, it is already there! My patch changes nothing 
 in that area:
 
 Thread1:
 Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes 
 waiting for some refs for xxx milliseconds.
 Thread2:
 Tries to create create a new character device having the same name like 
 the one in thread1. Device name duplication check is missed because 
 CDP_ACTIVE is cleared. Still thread1 is waiting.
 Thread3:
 Tries to open character device created by thread2 while thread1 is still 
 waiting for some ref held by a userspace app to go away.
 
 This can happen already before my patches! What do you think?
Possibly.

 
 
 
  Regarding leftover filedescriptors which still access the old cdev
  this is not a problem, and these will be closed when the 

Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-16 Thread Hans Petter Selasky

Hi Konstantin,

On 01/16/15 09:03, Konstantin Belousov wrote:

On Thu, Jan 15, 2015 at 01:14:39PM +0100, Hans Petter Selasky wrote:

On 01/15/15 12:51, Konstantin Belousov wrote:

On Thu, Jan 15, 2015 at 11:49:09AM +0100, Hans Petter Selasky wrote:


I see no leakage in that case either!

Because these cases come through destroy_dev().



Are there more cases which I don't see?

You are breaking existig devfs KPI by your hack.  You introduce yet another
reference on the device, which is not supposed to be there.


Hi Konstantin,

I need a non-sleeping way to say a character device is no longer
supposed to be used and be able to re-use the device name right away
creating a new device. I guess you got that.

Yes, I got it.  The devfs design is not suitable for this, and your
hack is the good witness of the fact.

My opinion is that you should have tried to handle the issue at the driver
level, instead of making this devfs issue.  I.e., if you already have
cdev node with the correct name, driver should have replaced the private
data to point to new device.


I think this way cannot be implemented in a clean way, because of 
locking order reversal. And if you try to avoid the LOR you end up with 
the race. Chess mate sort of ;-)  Let me explain:


Thread 1:
usb_sx_lock();
  cdev = Look in freelist for existing device();
else
  cdev = make_dev();
usb_sx_unlock();

Thread 2:
usb_sx_lock();
put cdev on freelist
usb_sx_unlock();

Thread 3:
usb_sx_lock();
cdev = remove first entry in freelist
usb_sx_unlock();

/*
 * XXX because USB needs to call destroy_dev() unlocked we
 * are racing with Thread 1 again
 */
destroy_dev(cdev);



This would also close a window where /dev node is non-existent or operate
erronously.


I'm not saying I plan so, but I think cdevs at some point need to 
understand mutexes and locks. That means like with other API's in the 
kernel we can associate a lock with the cdev, and this lock is then 
used to ensure an atomic shutdown of the system in a non-blocking 
fashion. In my past experience multithreaded APIs should be high level 
implemented like this:


NON-BLOCKING methods:
lock(); **
xxx_start();
xxx_stop();
unlock();

BLOCKING methods:
setup(); // init
unsetup(); // drain

Any callbacks should always be called locked **

In devfs there was no non-blocking stop before I added the delist_dev() 
function.




You do not understand my point.

I object against imposing one additional global reference on all cdevs
just to cope with the delist hack.  See the patch at the end of the message.


It's fine by me.


WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
delist_dev(), the node still exists in the /dev. It is only removed
after populate loop is run sometime later. dev_sched() KPI is inheritly
racy, drivers must handle the races for other reasons.


The populate loop is all running under the dev_lock() from what I can 
see and make_dev() is also keeping the same lock when inserting and 
removing new cdevs. The populate loop should always give a consistent 
view of the character devices available, and I don't see how cdev 
structures without the CDP_ACTIVE flag can appear with recently created 
ones, even if the name is the same.







Entry can be random, since after the populate loop is ran, your code in
other thread could start and create duplicate entry. There is a window
in the lookup where both directory vnode lock and mount point sx locks
are dropped. So running the populate does not guarantee anything.


If there is such a race, it is already there! My patch changes nothing
in that area:

Thread1:
Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes
waiting for some refs for xxx milliseconds.
Thread2:
Tries to create create a new character device having the same name like
the one in thread1. Device name duplication check is missed because
CDP_ACTIVE is cleared. Still thread1 is waiting.
Thread3:
Tries to open character device created by thread2 while thread1 is still
waiting for some ref held by a userspace app to go away.

This can happen already before my patches! What do you think?

Possibly.





At what level do you mean duplicate names, I don't get this fully? At
the directory level (DE nodes)? Or inside the list of character devices
(LIST_XXX)?

It does not matter, dup at either one directory level, or dup of full
names in the global list are equivalent (bad) things.


Like I write above I don't see where the problem is. At the cdev level, 
we are protecting the cdev's LIST with dev_lock() and only one entry 
will exist having CDP_ACTIVE bit set per unique cdev name and path. Else 
we will hit a panic in make_dev() and friends.


In the directory entry level the populate loop will also ensure a 
consistent view, and hence the cdev's LIST is consistent, the view 
presented to userspace will also be consistent.


That system functions can still call into the dangling read/write/ioctl 
functions is another story, and that is why I tell, 

Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-15 Thread Hans Petter Selasky

Hi,

On 01/15/15 04:31, Konstantin Belousov wrote:

Please do not commit (to devfs) without seeking for the review first.


I was a bit stuck this time either having the choice to back out a USB 
patch fixing no problems or to fix the kernel devfs to support what I 
needed.


Actually devfs is not listed in src/MAINTAINERS. I understand a review 
can be good in such critical areas in advance as a general practice and 
I'm currently doing that for another kern/ patch which is much bigger 
than this one.


--HPS
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org


Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-15 Thread Hans Petter Selasky

On 01/15/15 10:38, Konstantin Belousov wrote:

On Thu, Jan 15, 2015 at 08:41:31AM +0100, Hans Petter Selasky wrote:

On 01/15/15 04:31, Konstantin Belousov wrote:

On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote:

Author: hselasky
Date: Wed Jan 14 22:07:13 2015
New Revision: 277199
URL: https://svnweb.freebsd.org/changeset/base/277199

Log:
Avoid race with dev_rel() when using the recently added
delist_dev() function. Make sure the character device structure
doesn't go away until the end of the destroy_dev() function due to
concurrently running cleanup code inside devfs_populate().

MFC after:  1 week
Reported by:dchagin@

Modified:
head/sys/fs/devfs/devfs_devs.c
head/sys/kern/kern_conf.c

Modified: head/sys/fs/devfs/devfs_devs.c
==
--- head/sys/fs/devfs/devfs_devs.c  Wed Jan 14 22:05:57 2015
(r277198)
+++ head/sys/fs/devfs/devfs_devs.c  Wed Jan 14 22:07:13 2015
(r277199)
@@ -137,6 +137,12 @@ devfs_alloc(int flags)
vfs_timestamp(ts);
cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts;
cdev-si_cred = NULL;
+   /*
+* Avoid race with dev_rel() by setting the initial
+* reference count to 1. This last reference is taken
+* by the destroy_dev() function.
+*/
+   cdev-si_refcount = 1;

This is wrong.  Not all devices are destroyed with destroy_dev().
dev_rel() must be allowed to clean up allocated device.

That said, I do not understand what race you are trying to solve.
Freeing of the accessible cdev memory cannot happen in parallel while
dev_mtx is owned.

Please do not commit (to devfs) without seeking for the review first.


Hi Konstantin,

  From my analysis there are basically three ways for a cdev to die:

1) Through dev_free_devlocked()
2) Through destroy_devl() which then later calls dev_free_devlocked()
3) Through destroy_dev_sched() which really is a wrapper around
destroy_devl().

You only look from the consumers PoV.  Devfs cdev can be dereferenced
because e.g. clone handler decides that cdev is not valid/needed,
and now the memory is never freed due to extra reference.

Do not assume that all cdevs go through destroy_dev().


Hi,

All cdevs go through either case #2 or case #1 eventually from what I 
can see, including clone devices, which call destroy_devl() in the end 
aswell. See the clone_destroy() function!


I did a simple test with /dev/dspX.Y which use clone devices. I did:

vmstat -m | grep -i devfs1

1) Before plugging USB audio device:

   DEVFS1   15779K   -  189  512

2) Plug USB audio device:

   DEVFS1   16482K   -  196  512

3) Play something (env AUDIODEV=/dev/dsp2.4 play track01.wav)

   DEVFS1   16583K   -  197  512

4) Stop playing (clone device still exits):

   DEVFS1   16583K   -  197  512

5) Detach USB audio device:

   DEVFS1   15779K   -  197  512

I see no leakage in that case!

Other case:

1) After kldload if_tap

  DEVFS1   15879K   -  201  512

2) After creating TAP device (cat /dev/tap99)

   DEVFS1   15980K   -  204  512

3) After creating TAP device (cat /dev/tap101)

   DEVFS1   16080K   -  207  512

5) After kldunload if_tap:

   DEVFS1   15879K   -  207  512

6) After kldload if_tap again:

   DEVFS1   15879K   -  207  512

I see no leakage in that case either!

Are there more cases which I don't see?





In the case of direct free through #1, the reference count is ignored
and it doesn't matter if it is one or zero. Only in the case of
destruction through destroy_dev() it matters.

Like the comment says in destroy_devl():

/* Avoid race with dev_rel() */

The problem is that the cdev-si_refcount is zero when the initial
devfs_create() is called. Then one ref is made. When we clear the
CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running
process to destroy all the FS related structures and the reference count
goes back to zero when the cdp is removed from the cdevp_list. Then
the cdev is freed too early. This happens because destroy_devl() is
dropping the dev_lock() to sleep waiting for pending references.

Basically, this is very good explanation why your delist hack is wrong,
for one of the reason.  Another reason is explained below.
You are trying to cover it with additional reference, but this is wrong
as well.



Do you see something else?


I think that what you are trying to do with the CDP_ACTIVE hack is doomed
anyway, because you are allowing for devfs directory to have two entries
with the same name, until the populate loop cleans up the inactive one.
In the meantime, any access to the directory operates on random entry.


The entry will not be random, because upon an open() call to a character 
device, I believe the devfs_lookup() function will be called, 

Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-15 Thread Konstantin Belousov
On Thu, Jan 15, 2015 at 08:41:31AM +0100, Hans Petter Selasky wrote:
 On 01/15/15 04:31, Konstantin Belousov wrote:
  On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote:
  Author: hselasky
  Date: Wed Jan 14 22:07:13 2015
  New Revision: 277199
  URL: https://svnweb.freebsd.org/changeset/base/277199
 
  Log:
 Avoid race with dev_rel() when using the recently added
 delist_dev() function. Make sure the character device structure
 doesn't go away until the end of the destroy_dev() function due to
 concurrently running cleanup code inside devfs_populate().
 
 MFC after:  1 week
 Reported by:dchagin@
 
  Modified:
 head/sys/fs/devfs/devfs_devs.c
 head/sys/kern/kern_conf.c
 
  Modified: head/sys/fs/devfs/devfs_devs.c
  ==
  --- head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:05:57 2015
  (r277198)
  +++ head/sys/fs/devfs/devfs_devs.c Wed Jan 14 22:07:13 2015
  (r277199)
  @@ -137,6 +137,12 @@ devfs_alloc(int flags)
 vfs_timestamp(ts);
 cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts;
 cdev-si_cred = NULL;
  +  /*
  +   * Avoid race with dev_rel() by setting the initial
  +   * reference count to 1. This last reference is taken
  +   * by the destroy_dev() function.
  +   */
  +  cdev-si_refcount = 1;
  This is wrong.  Not all devices are destroyed with destroy_dev().
  dev_rel() must be allowed to clean up allocated device.
 
  That said, I do not understand what race you are trying to solve.
  Freeing of the accessible cdev memory cannot happen in parallel while
  dev_mtx is owned.
 
  Please do not commit (to devfs) without seeking for the review first.
 
 Hi Konstantin,
 
  From my analysis there are basically three ways for a cdev to die:
 
 1) Through dev_free_devlocked()
 2) Through destroy_devl() which then later calls dev_free_devlocked()
 3) Through destroy_dev_sched() which really is a wrapper around 
 destroy_devl().
You only look from the consumers PoV.  Devfs cdev can be dereferenced
because e.g. clone handler decides that cdev is not valid/needed,
and now the memory is never freed due to extra reference.

Do not assume that all cdevs go through destroy_dev().

 
 In the case of direct free through #1, the reference count is ignored 
 and it doesn't matter if it is one or zero. Only in the case of 
 destruction through destroy_dev() it matters.
 
 Like the comment says in destroy_devl():
 
 /* Avoid race with dev_rel() */
 
 The problem is that the cdev-si_refcount is zero when the initial
 devfs_create() is called. Then one ref is made. When we clear the 
 CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running 
 process to destroy all the FS related structures and the reference count 
 goes back to zero when the cdp is removed from the cdevp_list. Then 
 the cdev is freed too early. This happens because destroy_devl() is 
 dropping the dev_lock() to sleep waiting for pending references.
Basically, this is very good explanation why your delist hack is wrong,
for one of the reason.  Another reason is explained below.
You are trying to cover it with additional reference, but this is wrong
as well.

 
 Do you see something else?

I think that what you are trying to do with the CDP_ACTIVE hack is doomed
anyway, because you are allowing for devfs directory to have two entries
with the same name, until the populate loop cleans up the inactive one.
In the meantime, any access to the directory operates on random entry.

The checks for existent names in make_dev() are performed for the reason,
and you makes the rounds to effectively ignore it.
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org


Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-15 Thread Konstantin Belousov
On Thu, Jan 15, 2015 at 11:49:09AM +0100, Hans Petter Selasky wrote:
 On 01/15/15 10:38, Konstantin Belousov wrote:
  On Thu, Jan 15, 2015 at 08:41:31AM +0100, Hans Petter Selasky wrote:
  On 01/15/15 04:31, Konstantin Belousov wrote:
  On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote:
  Author: hselasky
  Date: Wed Jan 14 22:07:13 2015
  New Revision: 277199
  URL: https://svnweb.freebsd.org/changeset/base/277199
 
  Log:
  Avoid race with dev_rel() when using the recently added
  delist_dev() function. Make sure the character device structure
  doesn't go away until the end of the destroy_dev() function due to
  concurrently running cleanup code inside devfs_populate().
 
  MFC after:   1 week
  Reported by: dchagin@
 
  Modified:
  head/sys/fs/devfs/devfs_devs.c
  head/sys/kern/kern_conf.c
 
  Modified: head/sys/fs/devfs/devfs_devs.c
  ==
  --- head/sys/fs/devfs/devfs_devs.c   Wed Jan 14 22:05:57 2015
  (r277198)
  +++ head/sys/fs/devfs/devfs_devs.c   Wed Jan 14 22:07:13 2015
  (r277199)
  @@ -137,6 +137,12 @@ devfs_alloc(int flags)
   vfs_timestamp(ts);
   cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts;
   cdev-si_cred = NULL;
  +/*
  + * Avoid race with dev_rel() by setting the initial
  + * reference count to 1. This last reference is taken
  + * by the destroy_dev() function.
  + */
  +cdev-si_refcount = 1;
  This is wrong.  Not all devices are destroyed with destroy_dev().
  dev_rel() must be allowed to clean up allocated device.
 
  That said, I do not understand what race you are trying to solve.
  Freeing of the accessible cdev memory cannot happen in parallel while
  dev_mtx is owned.
 
  Please do not commit (to devfs) without seeking for the review first.
 
  Hi Konstantin,
 
From my analysis there are basically three ways for a cdev to die:
 
  1) Through dev_free_devlocked()
  2) Through destroy_devl() which then later calls dev_free_devlocked()
  3) Through destroy_dev_sched() which really is a wrapper around
  destroy_devl().
  You only look from the consumers PoV.  Devfs cdev can be dereferenced
  because e.g. clone handler decides that cdev is not valid/needed,
  and now the memory is never freed due to extra reference.
 
  Do not assume that all cdevs go through destroy_dev().
 
 Hi,
 
 All cdevs go through either case #2 or case #1 eventually from what I 
 can see, including clone devices, which call destroy_devl() in the end 
 aswell. See the clone_destroy() function!
 
 I did a simple test with /dev/dspX.Y which use clone devices. I did:
 
 vmstat -m | grep -i devfs1
 
 1) Before plugging USB audio device:
 
 DEVFS1   15779K   -  189  512
 
 2) Plug USB audio device:
 
 DEVFS1   16482K   -  196  512
 
 3) Play something (env AUDIODEV=/dev/dsp2.4 play track01.wav)
 
 DEVFS1   16583K   -  197  512
 
 4) Stop playing (clone device still exits):
 
 DEVFS1   16583K   -  197  512
 
 5) Detach USB audio device:
 
 DEVFS1   15779K   -  197  512
 
 I see no leakage in that case!
 
 Other case:
 
 1) After kldload if_tap
 
DEVFS1   15879K   -  201  512
 
 2) After creating TAP device (cat /dev/tap99)
 
 DEVFS1   15980K   -  204  512
 
 3) After creating TAP device (cat /dev/tap101)
 
 DEVFS1   16080K   -  207  512
 
 5) After kldunload if_tap:
 
 DEVFS1   15879K   -  207  512
 
 6) After kldload if_tap again:
 
 DEVFS1   15879K   -  207  512
 
 I see no leakage in that case either!
Because these cases come through destroy_dev().

 
 Are there more cases which I don't see?
You are breaking existig devfs KPI by your hack.  You introduce yet another
reference on the device, which is not supposed to be there.

If some code calls delist_dev(), it could be said that it is a contract
of the new function that destroy_dev() must be called eventually on
the cdev. Then, the reference could be said to be shared-owned by
delist_dev() and destroy_dev(). But, for arbitrary devfs user this new
reference is unacceptable and breaks interface.

 
 
 
  In the case of direct free through #1, the reference count is ignored
  and it doesn't matter if it is one or zero. Only in the case of
  destruction through destroy_dev() it matters.
 
  Like the comment says in destroy_devl():
 
  /* Avoid race with dev_rel() */
 
  The problem is that the cdev-si_refcount is zero when the initial
  devfs_create() is called. Then one ref is made. When we clear the
  CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running
  process to destroy all the FS related structures and the reference count
  goes back to zero when the cdp is removed from the cdevp_list. Then
  the cdev is freed too 

svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-14 Thread Hans Petter Selasky
Author: hselasky
Date: Wed Jan 14 22:07:13 2015
New Revision: 277199
URL: https://svnweb.freebsd.org/changeset/base/277199

Log:
  Avoid race with dev_rel() when using the recently added
  delist_dev() function. Make sure the character device structure
  doesn't go away until the end of the destroy_dev() function due to
  concurrently running cleanup code inside devfs_populate().
  
  MFC after:1 week
  Reported by:  dchagin@

Modified:
  head/sys/fs/devfs/devfs_devs.c
  head/sys/kern/kern_conf.c

Modified: head/sys/fs/devfs/devfs_devs.c
==
--- head/sys/fs/devfs/devfs_devs.c  Wed Jan 14 22:05:57 2015
(r277198)
+++ head/sys/fs/devfs/devfs_devs.c  Wed Jan 14 22:07:13 2015
(r277199)
@@ -137,6 +137,12 @@ devfs_alloc(int flags)
vfs_timestamp(ts);
cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts;
cdev-si_cred = NULL;
+   /*
+* Avoid race with dev_rel() by setting the initial
+* reference count to 1. This last reference is taken
+* by the destroy_dev() function.
+*/
+   cdev-si_refcount = 1;
 
return (cdev);
 }

Modified: head/sys/kern/kern_conf.c
==
--- head/sys/kern/kern_conf.c   Wed Jan 14 22:05:57 2015(r277198)
+++ head/sys/kern/kern_conf.c   Wed Jan 14 22:07:13 2015(r277199)
@@ -1048,8 +1048,6 @@ destroy_devl(struct cdev *dev)
/* Remove name marking */
dev-si_flags = ~SI_NAMED;
 
-   dev-si_refcount++; /* Avoid race with dev_rel() */
-
/* If we are a child, remove us from the parents list */
if (dev-si_flags  SI_CHILD) {
LIST_REMOVE(dev, si_siblings);
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org


Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-14 Thread Konstantin Belousov
On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote:
 Author: hselasky
 Date: Wed Jan 14 22:07:13 2015
 New Revision: 277199
 URL: https://svnweb.freebsd.org/changeset/base/277199
 
 Log:
   Avoid race with dev_rel() when using the recently added
   delist_dev() function. Make sure the character device structure
   doesn't go away until the end of the destroy_dev() function due to
   concurrently running cleanup code inside devfs_populate().
   
   MFC after:  1 week
   Reported by:dchagin@
 
 Modified:
   head/sys/fs/devfs/devfs_devs.c
   head/sys/kern/kern_conf.c
 
 Modified: head/sys/fs/devfs/devfs_devs.c
 ==
 --- head/sys/fs/devfs/devfs_devs.cWed Jan 14 22:05:57 2015
 (r277198)
 +++ head/sys/fs/devfs/devfs_devs.cWed Jan 14 22:07:13 2015
 (r277199)
 @@ -137,6 +137,12 @@ devfs_alloc(int flags)
   vfs_timestamp(ts);
   cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts;
   cdev-si_cred = NULL;
 + /*
 +  * Avoid race with dev_rel() by setting the initial
 +  * reference count to 1. This last reference is taken
 +  * by the destroy_dev() function.
 +  */
 + cdev-si_refcount = 1;
This is wrong.  Not all devices are destroyed with destroy_dev().
dev_rel() must be allowed to clean up allocated device.

That said, I do not understand what race you are trying to solve.
Freeing of the accessible cdev memory cannot happen in parallel while
dev_mtx is owned.

Please do not commit (to devfs) without seeking for the review first.
  
   return (cdev);
  }
 
 Modified: head/sys/kern/kern_conf.c
 ==
 --- head/sys/kern/kern_conf.c Wed Jan 14 22:05:57 2015(r277198)
 +++ head/sys/kern/kern_conf.c Wed Jan 14 22:07:13 2015(r277199)
 @@ -1048,8 +1048,6 @@ destroy_devl(struct cdev *dev)
   /* Remove name marking */
   dev-si_flags = ~SI_NAMED;
  
 - dev-si_refcount++; /* Avoid race with dev_rel() */
 -
   /* If we are a child, remove us from the parents list */
   if (dev-si_flags  SI_CHILD) {
   LIST_REMOVE(dev, si_siblings);
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org


Re: svn commit: r277199 - in head/sys: fs/devfs kern

2015-01-14 Thread Hans Petter Selasky

On 01/15/15 04:31, Konstantin Belousov wrote:

On Wed, Jan 14, 2015 at 10:07:13PM +, Hans Petter Selasky wrote:

Author: hselasky
Date: Wed Jan 14 22:07:13 2015
New Revision: 277199
URL: https://svnweb.freebsd.org/changeset/base/277199

Log:
   Avoid race with dev_rel() when using the recently added
   delist_dev() function. Make sure the character device structure
   doesn't go away until the end of the destroy_dev() function due to
   concurrently running cleanup code inside devfs_populate().

   MFC after:   1 week
   Reported by: dchagin@

Modified:
   head/sys/fs/devfs/devfs_devs.c
   head/sys/kern/kern_conf.c

Modified: head/sys/fs/devfs/devfs_devs.c
==
--- head/sys/fs/devfs/devfs_devs.c  Wed Jan 14 22:05:57 2015
(r277198)
+++ head/sys/fs/devfs/devfs_devs.c  Wed Jan 14 22:07:13 2015
(r277199)
@@ -137,6 +137,12 @@ devfs_alloc(int flags)
vfs_timestamp(ts);
cdev-si_atime = cdev-si_mtime = cdev-si_ctime = ts;
cdev-si_cred = NULL;
+   /*
+* Avoid race with dev_rel() by setting the initial
+* reference count to 1. This last reference is taken
+* by the destroy_dev() function.
+*/
+   cdev-si_refcount = 1;

This is wrong.  Not all devices are destroyed with destroy_dev().
dev_rel() must be allowed to clean up allocated device.

That said, I do not understand what race you are trying to solve.
Freeing of the accessible cdev memory cannot happen in parallel while
dev_mtx is owned.

Please do not commit (to devfs) without seeking for the review first.


Hi Konstantin,

From my analysis there are basically three ways for a cdev to die:

1) Through dev_free_devlocked()
2) Through destroy_devl() which then later calls dev_free_devlocked()
3) Through destroy_dev_sched() which really is a wrapper around 
destroy_devl().


In the case of direct free through #1, the reference count is ignored 
and it doesn't matter if it is one or zero. Only in the case of 
destruction through destroy_dev() it matters.


Like the comment says in destroy_devl():

/* Avoid race with dev_rel() */

The problem is that the cdev-si_refcount is zero when the initial
devfs_create() is called. Then one ref is made. When we clear the 
CDP_ACTIVE flag in devfs_destroy() it instructs a !parallel! running 
process to destroy all the FS related structures and the reference count 
goes back to zero when the cdp is removed from the cdevp_list. Then 
the cdev is freed too early. This happens because destroy_devl() is 
dropping the dev_lock() to sleep waiting for pending references.


Do you see something else?

--HPS
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org