Re: dev_lock() contention for fdesc syscalls -- possible fix

2014-11-10 Thread Konstantin Belousov
On Mon, Nov 10, 2014 at 10:44:12AM +0100, Luigi Rizzo wrote:
> On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote:
> > On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
> > > It was noticed that there is huge dev_lock() contention when multiple
> > > processes do a poll() even on independent file descriptors.
> > > 
> > > Turns out that not just poll but most syscalls on file descriptors
> > > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
> > > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
> > > as they use the following pattern
> > > 
> > > devfs_poll_f() {
> > > ...
> > > devfs_fp_check(fp, ...) -->
> > > kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
> > > dev_lock();
> > > dev = vp->v_rdev; // lock on vp ?
> > > ... check that dev != NULL
> > > atomic_add_long(&dev->si_threadcount, 1);
> > > dev_unlock();
> > > dsw->d_poll()
> > > dev_relthread() -->
> > > atomic_subtract_rel_long(&dev->si_threadcount, 1);
> > >}
> > > 
> > > 
> > > I believe that dev_lock() in devvn_refthread() is protecting
> > > dev = vp->v_rdev
> > > (the cdev entry 'dev' cannot go away for the reasons stated below).
> > > 
> > > However looking at places where vp->v_rdev is modified, turns out
> > > that it only happens when holding VI_LOCK(vp) -- the places are
> > > devfs_allocv() and devfs_reclaim().
> > > There is one place in zfs where the vnode is created and v_rdev
> > > is set without holding a lock, so nobody can dereference it.
> > > 
> > > As a consequence i believe that if in devfs_fp_check() we replace
> > > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
> > > we make sure that we can safely dereference vp->v_rdev, and the
> > > cdev cannot go away because the vnode has a reference to it.
> > > The counter uses an atomic instruction (so the release is lockless)
> > Vnode reference, as well as cdev reference, which is obtained by
> > dev_ref(), do not provide any protection there.  v_rdev is only
> > coincidentally protected by the vnode interlock.
> > 
> > If you look at larger part of the code, you would note that dev mutex
> > is held even after v_rdev is dereferenced.  The real protection it
> > provides is against race with destroy_dev(l)(), which could invalidate
> > dev->so_devsw at any moment when either device thread reference is
> > not held, or dev mutex is not held.  So your patch breaks the
> > device destruction.
> 
> I see. Thanks for the clarification.
> 
> Would it help to rewrite the part of devvn_refthread as follows:
> 
> devvn_refthread() {
>   // protect vp->v_rdev dereference and dev disappearing
>   VI_LOCK(vp);
>   dev = vi->v_rdev;
>   .. check that dev != NULL
>   // protect the race on dev->si_devsw
>   atomic_add_long(&dev->si_threadcount, 1);
>   VI_UNLOCK(vp);
>   ... appropriate memory barrier
>   csw = dev->si_devsw;
>   if (csw != NULL) {
>   // we won the race, even if destroy_devl() runs
>   // it will keep the cdevsw alive until si_threadcount goes to 0
>   *devp = dev;
>   *ref = 1;
>   } else {
>   // we lost
>   *ref = 0;
>   }
>   return csw;
> }
> 
> i.e.  tentatively increment si_threadcount before dereferencing si_devsw
> and then restoring it if we lose the race ?
> It might be necessary to add a barrier in destroy_devl() between clearing
> si_devsw and reading si_threadcount.
Sure it is neccessary to add a barrier in destroy_devl() then.

>From the first look, this might work, but how would you handle the
possibility that cdev memory is already freed when you do the increment ?
The vnode interlock does not protect against it; if you mean that
vnode reclamation already happen and vp->v_rdev must be cleared, this
might need some additional argumentation.

Note that the real patch is more involved, since both dev_refthread()
and devvn_refthread() should be handled. Also, there is some ugly hack
in sound/clone.c.

> 
> > > 
> > > This should be enough to remove the contention.
> > If you never calls destroy_dev() for the devfs node, you could use
> > MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
> > is no risk of race with destroy_dev() for the created node.
> > 
> > Usually, code which could be compiled in kernel statically but also
> > loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
> > module needed to call destroy_dev().
> 
> that I suppose would mean that the module cannot be safely unloaded ?

Well, you are not supposed to use MAKEDEV_ETERNAL for loadable modules
at all.  This is what MAKEDEV_ETERNAL_KLD ensures.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To un

Re: dev_lock() contention for fdesc syscalls -- possible fix

2014-11-10 Thread Luigi Rizzo
On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote:
> On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
> > It was noticed that there is huge dev_lock() contention when multiple
> > processes do a poll() even on independent file descriptors.
> > 
> > Turns out that not just poll but most syscalls on file descriptors
> > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
> > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
> > as they use the following pattern
> > 
> > devfs_poll_f() {
> > ...
> > devfs_fp_check(fp, ...) -->
> > kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
> > dev_lock();
> > dev = vp->v_rdev; // lock on vp ?
> > ... check that dev != NULL
> > atomic_add_long(&dev->si_threadcount, 1);
> > dev_unlock();
> > dsw->d_poll()
> > dev_relthread() -->
> > atomic_subtract_rel_long(&dev->si_threadcount, 1);
> >}
> > 
> > 
> > I believe that dev_lock() in devvn_refthread() is protecting
> > dev = vp->v_rdev
> > (the cdev entry 'dev' cannot go away for the reasons stated below).
> > 
> > However looking at places where vp->v_rdev is modified, turns out
> > that it only happens when holding VI_LOCK(vp) -- the places are
> > devfs_allocv() and devfs_reclaim().
> > There is one place in zfs where the vnode is created and v_rdev
> > is set without holding a lock, so nobody can dereference it.
> > 
> > As a consequence i believe that if in devfs_fp_check() we replace
> > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
> > we make sure that we can safely dereference vp->v_rdev, and the
> > cdev cannot go away because the vnode has a reference to it.
> > The counter uses an atomic instruction (so the release is lockless)
> Vnode reference, as well as cdev reference, which is obtained by
> dev_ref(), do not provide any protection there.  v_rdev is only
> coincidentally protected by the vnode interlock.
> 
> If you look at larger part of the code, you would note that dev mutex
> is held even after v_rdev is dereferenced.  The real protection it
> provides is against race with destroy_dev(l)(), which could invalidate
> dev->so_devsw at any moment when either device thread reference is
> not held, or dev mutex is not held.  So your patch breaks the
> device destruction.

I see. Thanks for the clarification.

Would it help to rewrite the part of devvn_refthread as follows:

devvn_refthread() {
// protect vp->v_rdev dereference and dev disappearing
VI_LOCK(vp);
dev = vi->v_rdev;
.. check that dev != NULL
// protect the race on dev->si_devsw
atomic_add_long(&dev->si_threadcount, 1);
VI_UNLOCK(vp);
... appropriate memory barrier
csw = dev->si_devsw;
if (csw != NULL) {
// we won the race, even if destroy_devl() runs
// it will keep the cdevsw alive until si_threadcount goes to 0
*devp = dev;
*ref = 1;
} else {
// we lost
*ref = 0;
}
return csw;
}

i.e.  tentatively increment si_threadcount before dereferencing si_devsw
and then restoring it if we lose the race ?
It might be necessary to add a barrier in destroy_devl() between clearing
si_devsw and reading si_threadcount.

> > 
> > This should be enough to remove the contention.
> If you never calls destroy_dev() for the devfs node, you could use
> MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
> is no risk of race with destroy_dev() for the created node.
> 
> Usually, code which could be compiled in kernel statically but also
> loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
> module needed to call destroy_dev().

that I suppose would mean that the module cannot be safely unloaded ?

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: dev_lock() contention for fdesc syscalls -- possible fix

2014-11-10 Thread Konstantin Belousov
On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
> It was noticed that there is huge dev_lock() contention when multiple
> processes do a poll() even on independent file descriptors.
> 
> Turns out that not just poll but most syscalls on file descriptors
> (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
> devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
> as they use the following pattern
> 
> devfs_poll_f() {
> ...
> devfs_fp_check(fp, ...) -->
> kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
> dev_lock();
> dev = vp->v_rdev; // lock on vp ?
> ... check that dev != NULL
> atomic_add_long(&dev->si_threadcount, 1);
> dev_unlock();
> dsw->d_poll()
> dev_relthread() -->
> atomic_subtract_rel_long(&dev->si_threadcount, 1);
>}
> 
> 
> I believe that dev_lock() in devvn_refthread() is protecting
> dev = vp->v_rdev
> (the cdev entry 'dev' cannot go away for the reasons stated below).
> 
> However looking at places where vp->v_rdev is modified, turns out
> that it only happens when holding VI_LOCK(vp) -- the places are
> devfs_allocv() and devfs_reclaim().
> There is one place in zfs where the vnode is created and v_rdev
> is set without holding a lock, so nobody can dereference it.
> 
> As a consequence i believe that if in devfs_fp_check() we replace
> dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
> we make sure that we can safely dereference vp->v_rdev, and the
> cdev cannot go away because the vnode has a reference to it.
> The counter uses an atomic instruction (so the release is lockless)
Vnode reference, as well as cdev reference, which is obtained by
dev_ref(), do not provide any protection there.  v_rdev is only
coincidentally protected by the vnode interlock.

If you look at larger part of the code, you would note that dev mutex
is held even after v_rdev is dereferenced.  The real protection it
provides is against race with destroy_dev(l)(), which could invalidate
dev->so_devsw at any moment when either device thread reference is
not held, or dev mutex is not held.  So your patch breaks the
device destruction.

> 
> This should be enough to remove the contention.
If you never calls destroy_dev() for the devfs node, you could use
MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
is no risk of race with destroy_dev() for the created node.

Usually, code which could be compiled in kernel statically but also
loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
module needed to call destroy_dev().


> 
> Anyone care to review/test the patch below ?
Yes, there are people who care.

> (dev_refthread() still has the single dev_lock() contention,
> i don't know how to address that yet)
> 
>   cheers
>   luigi
> 
> Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c
> ===
> --- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (revision 273452)
> +++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (working copy)
> @@ -224,10 +224,11 @@
>   }
>  
>   csw = NULL;
> - dev_lock();
> + ASSERT_VI_UNLOCKED(vp, __func__);
> + VI_LOCK(vp); // dev_lock();
>   dev = vp->v_rdev;
>   if (dev == NULL) {
> - dev_unlock();
> + VI_UNLOCK(vp); // dev_unlock();
>   return (NULL);
>   }
>   cdp = cdev2priv(dev);
> @@ -236,7 +237,7 @@
>   if (csw != NULL)
>   atomic_add_long(&dev->si_threadcount, 1);
>   }
> - dev_unlock();
> + VI_UNLOCK(vp); // dev_unlock();
>   if (csw != NULL) {
>   *devp = dev;
>   *ref = 1;
> ___
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


dev_lock() contention for fdesc syscalls -- possible fix

2014-11-09 Thread Luigi Rizzo
It was noticed that there is huge dev_lock() contention when multiple
processes do a poll() even on independent file descriptors.

Turns out that not just poll but most syscalls on file descriptors
(as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
as they use the following pattern

devfs_poll_f() {
...
devfs_fp_check(fp, ...) -->
kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
dev_lock();
dev = vp->v_rdev; // lock on vp ?
... check that dev != NULL
atomic_add_long(&dev->si_threadcount, 1);
dev_unlock();
dsw->d_poll()
dev_relthread() -->
atomic_subtract_rel_long(&dev->si_threadcount, 1);
   }


I believe that dev_lock() in devvn_refthread() is protecting
dev = vp->v_rdev
(the cdev entry 'dev' cannot go away for the reasons stated below).

However looking at places where vp->v_rdev is modified, turns out
that it only happens when holding VI_LOCK(vp) -- the places are
devfs_allocv() and devfs_reclaim().
There is one place in zfs where the vnode is created and v_rdev
is set without holding a lock, so nobody can dereference it.

As a consequence i believe that if in devfs_fp_check() we replace
dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
we make sure that we can safely dereference vp->v_rdev, and the
cdev cannot go away because the vnode has a reference to it.
The counter uses an atomic instruction (so the release is lockless)

This should be enough to remove the contention.

Anyone care to review/test the patch below ?
(dev_refthread() still has the single dev_lock() contention,
i don't know how to address that yet)

cheers
luigi

Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c
===
--- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c   (revision 273452)
+++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c   (working copy)
@@ -224,10 +224,11 @@
}
 
csw = NULL;
-   dev_lock();
+   ASSERT_VI_UNLOCKED(vp, __func__);
+   VI_LOCK(vp); // dev_lock();
dev = vp->v_rdev;
if (dev == NULL) {
-   dev_unlock();
+   VI_UNLOCK(vp); // dev_unlock();
return (NULL);
}
cdp = cdev2priv(dev);
@@ -236,7 +237,7 @@
if (csw != NULL)
atomic_add_long(&dev->si_threadcount, 1);
}
-   dev_unlock();
+   VI_UNLOCK(vp); // dev_unlock();
if (csw != NULL) {
*devp = dev;
*ref = 1;
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"